Flexera One IT Asset Management - NA - Intermittent UI Loading and Batch Processing Delays
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One IT Asset Management - NA - Intermittent UI Loading and Batch Processing Delays

Timeframe: October 29th, 2024, 5:50 AM to October 29th, 2024, 6:30 AM PDT

Incident Summary

On Tuesday, October 29th, 2024, at 5:50 AM PDT, the Flexera One IT Asset Management (ITAM) platform in the NA region began experiencing intermittent UI loading issues and batch processing delays. The platform remained accessible, but a subset of tenants may have experienced slower UI load times, temporary access disruptions, and delays in data updates and processing tasks.

The issue was triggered by a significant processing backlog, compounded by a large influx of inventory and usage data from multiple sources. This backlog led to database blocking events, which caused connectivity issues and slow UI performance. The database stabilized on its own once the large volume of data had been processed, but the batch scheduler had to be restarted to fully resolve the batch processing delays.

Following the restart, system functionality returned to normal, and extended monitoring confirmed platform stability by 6:30 AM PDT.

Root Cause

Primary Root Cause

A temporary database connection error in the NA region led to intermittent connectivity for batch processing and UI access, requiring a restart of the affected components.

Contributing Factors

• System-Level Constraints: An unusually high volume of data processing increased the demand on system resources, resulting in delays.
• Concurrency Management: A known limitation in tenant-specific concurrency controls allowed high-resource processes from certain tenants to occupy multiple connections simultaneously, leading to queue buildup and delays.
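
As an illustration of the concurrency limitation described above, the following is a minimal sketch of a per-tenant connection cap. It is not Flexera's implementation; the class name `TenantConnectionLimiter`, the `max_per_tenant` parameter, and the tenant identifiers are all hypothetical. The idea is that a tenant at its cap is refused (or queued by the caller) instead of taking further connections away from other tenants.

```python
import threading
from collections import defaultdict


class TenantConnectionLimiter:
    """Hypothetical cap on how many connections one tenant may hold at once."""

    def __init__(self, max_per_tenant: int):
        self._max = max_per_tenant
        self._in_use = defaultdict(int)  # tenant_id -> connections currently held
        self._lock = threading.Lock()

    def try_acquire(self, tenant_id: str) -> bool:
        """Reserve one slot for the tenant; return False if it is at its cap."""
        with self._lock:
            if self._in_use[tenant_id] >= self._max:
                return False
            self._in_use[tenant_id] += 1
            return True

    def release(self, tenant_id: str) -> None:
        """Return one slot previously reserved by try_acquire."""
        with self._lock:
            if self._in_use[tenant_id] > 0:
                self._in_use[tenant_id] -= 1

    def in_use(self, tenant_id: str) -> int:
        """Number of slots the tenant currently holds."""
        with self._lock:
            return self._in_use.get(tenant_id, 0)
```

With a cap of 2, a third concurrent request from the same tenant is refused while requests from other tenants still succeed, so one tenant's heavy workload cannot occupy every available connection.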

Remediation Actions

  1. Immediate Monitoring and Analysis: Technical teams began actively monitoring database activity and identified database blocking events related to a significant backlog of data processing tasks.
  2. Batch Scheduler Restart: The batch scheduler was restarted to restore batch processing stability. The restart allowed pending tasks to resume and cleared the batch processing delays.
  3. Extended Monitoring and Validation: Following the resolution, extended monitoring of the platform was conducted to ensure ongoing stability and verify that data processing and UI performance were back to normal.

Future Preventative Measures

  1. Enhanced Resource Management for Background Processes: We are working to optimize the execution and scheduling of resource-intensive background processes to prevent system resource overuse. This will ensure balanced resource allocation and maintain stable platform performance, even during peak operations.
  2. Infrastructure Upgrade: We are in the process of upgrading our system infrastructure, which will improve performance and resource handling. This upgrade will be gradually deployed across all regions over the coming months.
  3. Conflict Avoidance Mechanism: We are working to implement updates to prevent critical processes from competing for system resources, ensuring smooth operation during peak loads.
  4. Review of Concurrency Management: Teams reviewed concurrency management protocols to evaluate potential adjustments that would prevent a single set of processes from overloading system resources in future high-volume scenarios. This review aims to enhance handling of simultaneous data tasks across tenants.
  5. Proactive Monitoring: We will explore enhancing our monitoring systems to better track resource usage during background processes, so that issues can be detected and mitigated before they affect platform stability.
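
One way to picture the proactive monitoring item is a sustained-backlog check. The sketch below is an assumption for illustration only and does not describe Flexera's monitoring; the `BacklogAlert` class, its `threshold` and `sustained_samples` parameters, and the sample values are hypothetical. It raises an alert only when queue depth stays above the threshold for several consecutive samples, which avoids alerting on short spikes.

```python
from collections import deque


class BacklogAlert:
    """Hypothetical alert that fires when queue depth stays high for N samples."""

    def __init__(self, threshold: int, sustained_samples: int):
        self._threshold = threshold
        # Keep only the most recent `sustained_samples` observations.
        self._recent = deque(maxlen=sustained_samples)

    def observe(self, queue_depth: int) -> bool:
        """Record a sample; return True if every sample in the window exceeds the threshold."""
        self._recent.append(queue_depth)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(d > self._threshold for d in self._recent)


alert = BacklogAlert(threshold=100, sustained_samples=3)
for depth in [150, 160, 90, 120, 130, 140]:
    if alert.observe(depth):
        print(f"sustained backlog: depth {depth} above threshold for 3 samples")
```

In this run only the final sample triggers the alert, because the dip to 90 resets the run of consecutive high readings.
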
Posted Nov 07, 2024 - 13:50 PST

Resolved
Issue Description: Earlier today, we encountered intermittent issues affecting the IT Asset Management platform in the NA region. During this period, some customers may have experienced UI loading difficulties and delays in batch task processing.

Priority: P2

Restoration Activity (Resolved): Our preliminary investigation has linked these intermittent issues to a network-related database connection error. While services have stabilized, our teams are closely monitoring the platform and will further investigate the root cause to implement measures that prevent future occurrences. A full post-mortem report will be published in the coming days.
Posted Oct 29, 2024 - 08:04 PDT
This incident affected: Flexera One - IT Asset Management - North America (IT Asset Management - US Beacon Communication, IT Asset Management - US Inventory Upload, IT Asset Management - US Login Page, IT Asset Management - US Batch Processing System, IT Asset Management - US Business Reporting, IT Asset Management - US Restful APIs).