Flexera One - IT Asset Management - NA - Inventory file upload delay
Incident Report for Flexera System Status Dashboard
Postmortem

Description:  

Flexera One - IT Asset Management – NDI Inventory Data delayed

 Timeframe:  
May 2nd, 2024, at 5:50 PM PDT to May 6th, 2024, at 4:25 PM PDT

Incident Summary

On Thursday, May 2nd at 5:50 PM PDT, we encountered a recurrence of a known issue with delays in processing NDI inventory files uploaded to the IT Asset Management platform in the NA region. This issue could have potentially affected multiple organizations by delaying updates to their inventory records.

Upon identifying the issue, our technical teams quickly took action to restore normal operations. They pinpointed a traffic bottleneck caused by a significant amount of data from one tenant and re-routed this data by 6:14 PM PDT, which improved NDI processing speed. The teams continued to monitor the backlog closely and noticed significant improvement by early morning on May 4th. At 2:26 AM they observed a substantial reduction in the backlog, and by 5:47 AM it had been reduced by half. At 5:17 PM that day, the backlog was confirmed to be at half of its earlier size.

By May 5th at 1:21 PM PDT, the backlog was fully resolved, except for the high-traffic tenant, which had already been moved to separate nodes. With the backlog under control, the team decided to remove the traffic segregation and reintegrate the tenant, while extending monitoring to detect any potential degradation in processing and new backlog generation.

However, on May 6th at 8:13 AM PDT, the team noticed slowness in NDI file processing for a few tenants and began investigating. They temporarily increased the number of nodes to boost processing speed and kept the services under observation. Throughout the incident, the teams continued to make additional improvements, including disabling non-urgent tasks to prevent bottlenecks and making further adjustments to inventory resources. They also conducted manual interventions, such as further segregating traffic from customers in the following days, to clear backlog queues promptly.

By 4:25 PM PDT, the teams confirmed that backlog processing was at an acceptable level, and further improvement was noted by 5:14 PM. As part of the remediation actions, concurrency settings were adjusted to reduce the likelihood of server blockages by individual tenants. Following extended monitoring, successful health checks and confirmation from the technical teams, the incident was declared as resolved on May 6th at 5:45 PM PDT with a normal amount of NDI backlog pending processing.

Further analysis revealed that traffic congestion led to processing bottlenecks on the IT Asset Management platform in the NA region, causing delays in NDI inventory file processing. The majority of occurrences were traced back to a single tenant, with potential exacerbation by other tenants sharing the same database. The technical teams have devised an action plan for a permanent fix, and further actions are being tracked under a high-priority problem record.

Root Cause

Primary root cause:

 Traffic congestion leading to processing bottlenecks on the IT Asset Management platform in the NA region, resulting in delays in processing NDI inventory files.

 Contributing factors:

 Majority of occurrences traced back to a single tenant, potentially exacerbated by other tenants sharing the same database.

Remediation Actions

·        Traffic Segregation Initiative: Implementing traffic segregation to isolate inventory nodes and accelerate processing.

·        Inventory Resource Scaling: Scaling up inventory resources to expedite data processing.

·        Task Prioritization: Disabling non-urgent tasks to alleviate bottlenecks.

·        Manual Intervention: Conducting manual interventions, such as further segregating traffic from tenants to promptly clear backlog queues.

·        Continuous Monitoring and Adjustment: Continued monitoring and adjustments to ensure optimal processing efficiency.
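The traffic segregation described above can be illustrated with a minimal sketch. It is not the platform's actual routing logic: the function names, node names and tenant identifiers are hypothetical, and it assumes uploads can be routed by tenant ID to either a shared or a dedicated node pool.

```python
import hashlib


def pick_node(tenant_id: str, nodes: list[str]) -> str:
    """Pick a node from a pool using a stable hash of the tenant ID."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def route_upload(
    tenant_id: str,
    segregated_tenants: set[str],
    shared_nodes: list[str],
    dedicated_nodes: list[str],
) -> str:
    """Send segregated (high-traffic) tenants to dedicated nodes only.

    All other tenants stay on the shared pool, so a single tenant's
    volume cannot slow processing for everyone else.
    """
    if tenant_id in segregated_tenants:
        return pick_node(tenant_id, dedicated_nodes)
    return pick_node(tenant_id, shared_nodes)


# Hypothetical example: one high-traffic tenant is isolated.
print(route_upload("tenant-big", {"tenant-big"}, ["shared-1", "shared-2"], ["dedicated-1"]))
```

Reintegrating a tenant in this sketch amounts to removing it from the segregated set, which is why extended monitoring after that step matters.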

Future Preventative Measures

 Problem Record Establishment: A problem record has been opened to track the progress of both short and long-term measures, facilitating ongoing monitoring and improvement of system performance and reliability.

Short Term Measures:

·        Queue System Implementation: Deploying a queue system per tenant to prevent blocking between tenants and mitigate processing delays.

·        Concurrency Setting Adjustment: Modified concurrency settings to reduce the server's vulnerability to blockages caused by a single source, enhancing overall system stability.

·        Monitoring Enhancement: Strengthening monitoring capabilities to swiftly detect and address any emerging issues, ensuring proactive management of potential bottlenecks.
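As an illustration of how a per-tenant queue and a concurrency limit can work together, the sketch below serves tenants in round-robin order and caps the number of files any one tenant may have in processing at once. It is a simplified, single-process model under assumed names (`TenantFairScheduler`, `max_in_flight_per_tenant`), not the implementation deployed on the platform.

```python
from collections import OrderedDict, deque


class TenantFairScheduler:
    """Round-robin scheduler with one queue and one concurrency cap per tenant."""

    def __init__(self, max_in_flight_per_tenant: int = 2):
        self.max_in_flight = max_in_flight_per_tenant
        self.queues = OrderedDict()  # tenant_id -> deque of pending files
        self.in_flight = {}          # tenant_id -> files currently processing

    def submit(self, tenant_id: str, file_name: str) -> None:
        """Add an uploaded file to its tenant's own queue."""
        self.queues.setdefault(tenant_id, deque()).append(file_name)

    def next_job(self):
        """Return (tenant_id, file_name) for the next eligible file, or None."""
        for tenant_id in list(self.queues):
            queue = self.queues[tenant_id]
            # Rotate the visited tenant to the back so the next call
            # starts with a different tenant.
            self.queues.move_to_end(tenant_id)
            busy = self.in_flight.get(tenant_id, 0)
            if queue and busy < self.max_in_flight:
                self.in_flight[tenant_id] = busy + 1
                return tenant_id, queue.popleft()
        return None

    def complete(self, tenant_id: str) -> None:
        """Mark one of the tenant's files as finished, freeing a slot."""
        self.in_flight[tenant_id] -= 1


# Hypothetical example: a large tenant cannot starve a small one.
scheduler = TenantFairScheduler(max_in_flight_per_tenant=1)
for name in ["b1.ndi", "b2.ndi", "b3.ndi"]:
    scheduler.submit("tenant-big", name)
scheduler.submit("tenant-small", "s1.ndi")
print(scheduler.next_job())  # tenant-big's first file
print(scheduler.next_job())  # tenant-small's file, ahead of tenant-big's backlog
```

With a single shared queue, the small tenant's file would wait behind the large tenant's entire backlog; with per-tenant queues and a cap, it is picked up on the second call.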

 

Long Term Plan:

UDI-ITAM Tenant-Specific Inventory Ingestion Pipeline Implementation: Scheduled for Q3 of this year, this initiative involves the development and deployment of a dedicated inventory ingestion pipeline for each tenant, enhancing system efficiency and resilience to prevent future incidents.

Posted May 23, 2024 - 05:13 PDT

Resolved
This incident has been resolved.
Posted May 06, 2024 - 17:54 PDT
Monitoring
As of 1:21 PM PDT, the backlog has been cleared. However, as a precautionary measure, we will continue monitoring into Monday to ensure that services operate at optimal levels with weekday inventory.
Posted May 05, 2024 - 18:09 PDT
Update
The latest observations show a favourable trend reflecting a reduction in the backlog. Our technical teams will continue to monitor the situation closely and will provide updates as needed.
Posted May 05, 2024 - 00:23 PDT
Update
Based on the latest assessment, we have observed a positive trend toward a reduction in the backlog. Our technical teams will continue to monitor and provide updates to our external customers as necessary.
Posted May 04, 2024 - 09:20 PDT
Update
After a brief period of instability, processing has returned to normal rates. We'll be closely monitoring throughout the weekend and intervening manually if necessary. We'll keep you updated on any developments.
Posted May 03, 2024 - 18:10 PDT
Investigating
In our recent assessments, we have identified a notable decline in the processing rate following a brief period of accelerated processing. We are investigating it further.
Posted May 03, 2024 - 06:43 PDT
Update
We have implemented measures to reduce the traffic bottleneck, resulting in positive outcomes. We are continuing to monitor the environment for further progress.
Posted May 03, 2024 - 01:25 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 02, 2024 - 19:46 PDT
Investigating
Incident Description:
We are currently experiencing a delay in processing inventory files (NDI) on US production, impacting all customers. This might cause customers to observe delayed updates of their device inventory records.

Priority: P2

Restoration activity:
Our technical teams are actively involved and are evaluating the situation. Additionally, we are exploring potential solutions to rectify the issue as quickly as possible. We sincerely apologize for any inconvenience this may have caused.
Posted May 02, 2024 - 19:13 PDT
This incident affected: Flexera One - IT Asset Management - North America (IT Asset Management - US Inventory Upload).