Description:
Flexera One - IT Asset Management – NDI Inventory Data delayed
Timeframe:
May 2nd, 2024, at 5:50 PM PDT to May 6th, 2024, at 4:25 PM PDT
Incident Summary
On Thursday, May 2nd at 5:50 PM PDT, we encountered a recurrence of a known issue with delays in processing NDI inventory files uploaded to the IT Asset Management platform in the NA region. This issue could have affected multiple organizations by delaying updates to their inventory records.
Upon identifying the issue, our technical teams quickly took action to restore normal operations. They pinpointed a traffic bottleneck caused by a significant amount of data from one tenant and re-routed this data by 6:14 PM PDT, which improved NDI processing speed. The teams continued to monitor the backlog closely. At 2:26 AM on May 4th, they observed a substantial reduction in the backlog, and by 5:47 AM it had been reduced by half. By 5:17 PM that day, the backlog was confirmed to be half of what it had been.
By May 6th at 1:21 PM PDT, the backlog was fully resolved, except for the high-traffic tenant, which had already been moved to separate nodes. With the backlog under control, the team decided to remove the traffic segregation and reintegrate the tenant, while extending monitoring to detect any degradation in processing or new backlog buildup.
However, on May 6th at 8:13 AM PDT, the team noticed slowness in NDI file processing for a few tenants and began investigating. They temporarily increased the number of nodes to boost processing speed and kept the services under observation. Throughout the incident, the teams continued to make additional improvements, including disabling non-urgent tasks to prevent bottlenecks and making further adjustments to inventory resources. They also conducted manual interventions, such as further segregating traffic from customers in the following days, to clear backlog queues promptly.
By 4:25 PM PDT, the teams confirmed that backlog processing was at an acceptable level, and further improvement was noted by 5:14 PM. As part of the remediation actions, concurrency settings were adjusted to reduce the likelihood of server blockages by individual tenants. Following extended monitoring, successful health checks, and confirmation from the technical teams, the incident was declared resolved on May 6th at 5:45 PM PDT, with a normal amount of NDI backlog pending processing.
Further analysis revealed that traffic congestion led to processing bottlenecks on the IT Asset Management platform in the NA region, causing delays in NDI inventory file processing. The majority of occurrences were traced back to a single tenant, with potential exacerbation by other tenants sharing the same database. The technical teams have devised an action plan for a permanent fix, and further actions are being tracked under a high-priority problem record.
Root Cause
Primary root cause:
Traffic congestion leading to processing bottlenecks on the IT Asset Management platform in the NA region, resulting in delays in processing NDI inventory files.
Contributing factors:
The majority of occurrences were traced back to a single tenant, potentially exacerbated by other tenants sharing the same database.
Remediation Actions
· Traffic Segregation Initiative: Implementing traffic segregation to isolate inventory nodes and accelerate processing.
· Inventory Resource Scaling: Scaling up inventory resources to expedite data processing.
· Task Prioritization: Disabling non-urgent tasks to alleviate bottlenecks.
· Manual Intervention: Conducting manual interventions, such as further segregating traffic from tenants to promptly clear backlog queues.
· Continuous Monitoring and Adjustment: Continued monitoring and adjustments to ensure optimal processing efficiency.
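As an illustration only, the traffic segregation described above can be sketched as a routing table that sends one high-volume tenant to its own inventory nodes while everyone else stays on the shared pool. The class, tenant IDs, and node names below are hypothetical and do not reflect the platform's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InventoryRouter:
    """Hypothetical router mapping tenants to inventory node pools."""

    shared_nodes: list[str]
    # tenant_id -> dedicated nodes, for tenants that have been segregated
    dedicated: dict[str, list[str]] = field(default_factory=dict)

    def segregate(self, tenant_id: str, nodes: list[str]) -> None:
        """Move a tenant's traffic onto its own nodes."""
        self.dedicated[tenant_id] = list(nodes)

    def reintegrate(self, tenant_id: str) -> None:
        """Return a tenant to the shared pool (no-op if not segregated)."""
        self.dedicated.pop(tenant_id, None)

    def pool_for(self, tenant_id: str) -> list[str]:
        """Nodes that should process uploads from this tenant."""
        return self.dedicated.get(tenant_id, self.shared_nodes)


router = InventoryRouter(shared_nodes=["shared-1", "shared-2"])
router.segregate("tenant-heavy", ["dedicated-1"])
heavy_pool = router.pool_for("tenant-heavy")   # dedicated nodes only
other_pool = router.pool_for("tenant-a")       # shared pool, unaffected
```

Keeping segregation as an explicit, reversible mapping makes it straightforward to reintegrate a tenant once its backlog has cleared.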
Future Preventative Measures
Problem Record Establishment: A problem record has been opened to track the progress of both short- and long-term measures, facilitating ongoing monitoring and improvement of system performance and reliability.
Short Term Measures:
· Queue System Implementation: Deploying a queue system per tenant to prevent blocking between tenants and mitigate processing delays.
· Concurrency Setting Adjustment: Modified concurrency settings to reduce the server's vulnerability to blockages caused by a single source, enhancing overall system stability.
· Monitoring Enhancement: Strengthening monitoring capabilities to swiftly detect and address any emerging issues, ensuring proactive management of potential bottlenecks.
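To show how a per-tenant queue can keep one tenant from blocking others, here is a minimal sketch that serves tenants in round-robin order. It is an assumption about one possible design, not the planned implementation; the class name, tenant IDs, and file names are hypothetical.

```python
from collections import OrderedDict, deque


class TenantFairQueue:
    """Hypothetical per-tenant queue served in round-robin order."""

    def __init__(self):
        # tenant_id -> pending files; dict order is the service order
        self._queues = OrderedDict()

    def put(self, tenant_id, item):
        """Append an uploaded file to its tenant's own queue."""
        self._queues.setdefault(tenant_id, deque()).append(item)

    def get(self):
        """Return (tenant_id, item) for the next tenant in turn, or None."""
        if not self._queues:
            return None
        tenant_id, queue = self._queues.popitem(last=False)
        item = queue.popleft()
        if queue:
            # Tenant still has work: send it to the back of the rotation.
            self._queues[tenant_id] = queue
        return tenant_id, item


fair_queue = TenantFairQueue()
for name in ("h1.ndi", "h2.ndi", "h3.ndi"):
    fair_queue.put("tenant-heavy", name)
fair_queue.put("tenant-a", "a1.ndi")
fair_queue.put("tenant-b", "b1.ndi")

served = []
while (entry := fair_queue.get()) is not None:
    served.append(entry)
# served order: h1, a1, b1, h2, h3
```

With a single shared FIFO, the two smaller tenants would wait behind all three files from the heavy tenant; with per-tenant queues they wait behind only one.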
Long Term Plan:
UDI-ITAM Tenant-Specific Inventory Ingestion Pipeline Implementation: Scheduled for Q3 of this year, this initiative involves the development and deployment of a dedicated inventory ingestion pipeline for each tenant, enhancing system efficiency and resilience to prevent future incidents.