Description: Flexera One – IT Asset Management – NDI Inventory Data delayed
Timeframe:
March 13th, 2024, 7:08 PM to March 27th, 2024, 7:13 AM PDT
April 17th, 2024, 12:12 AM to April 21st, 2024, 4:12 PM PDT
Incident Summary
On Wednesday, April 17th, at 12:12 AM PDT, we experienced a recurrence of an issue causing delays in processing NDI inventory files uploaded to the IT Asset Management platform in the NA region. This may have impacted multiple organizations, resulting in delayed updates to their inventory records.
After identifying the issue, our technical teams implemented several measures to restore normal operations, including deploying additional inventory nodes to increase throughput; these nodes were operational by 11:25 AM PDT. Despite these efforts, monitoring over the subsequent hours did not show significant improvement, and the technical team continued to observe inventory drops within the 24-hour timeframe.
During the investigation, we observed an unusual increase in wait times on one of the databases, associated with a process that relocates computers to new organizational units, with the majority of occurrences traced back to a single tenant. The impact may have been exacerbated by other tenants sharing the same database.
To address this, at 1:09 PM PDT we implemented traffic segregation to isolate inventory nodes, aiming to accelerate processing. This produced an immediate improvement, with a significant increase in data processing. Later in the day, however, throughput declined again, indicating that traffic from a single source might not have been the sole bottleneck, and we reverted the change. We nevertheless scaled up inventory resources to expedite processing while continuing to explore additional solutions.
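As a rough sketch of what such segregation can look like, uploads from a flagged tenant could be routed to a dedicated pool of inventory nodes. The tenant identifiers, node names, and routing rule below are assumptions for illustration and do not reflect the platform's actual implementation.

```python
# Hypothetical sketch of tenant-based traffic segregation.
# Pool names, tenant IDs, and the routing rule are illustrative only.

ISOLATED_TENANTS = {"tenant-heavy-001"}   # tenants whose uploads are segregated

NODE_POOLS = {
    "default": ["inv-node-1", "inv-node-2", "inv-node-3"],
    "isolated": ["inv-node-4"],            # dedicated nodes added during the incident
}


def route_upload(tenant_id: str, file_name: str) -> str:
    """Pick an inventory node for an uploaded NDI file.

    Uploads from flagged tenants go to the isolated pool so a single
    heavy source cannot slow processing for everyone else.
    """
    pool = "isolated" if tenant_id in ISOLATED_TENANTS else "default"
    nodes = NODE_POOLS[pool]
    # Spread files across the pool's nodes.
    return nodes[hash((tenant_id, file_name)) % len(nodes)]


if __name__ == "__main__":
    print(route_upload("tenant-heavy-001", "upload-0001.ndi"))  # isolated pool
    print(route_upload("tenant-other-042", "upload-0002.ndi"))  # default pool
```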
Throughout the remainder of the day, our monitoring indicated a slight improvement, with a corresponding decrease in backlog. On Thursday, April 18th, at 7:28 AM PDT, processing rates increased once more, yet did not reach satisfactory levels. Following investigations and discussions, the technical teams decided to further augment inventory resources. However, given the existing backlog, it was anticipated that processing would extend into the weekend to fully catch up.
Throughout the rest of the incident, teams continued to implement improvements. This involved disabling non-urgent tasks to prevent bottlenecks and making further adjustments to inventory resources. Teams also conducted manual interventions over the following days, such as further segregating traffic from specific tenants, to promptly clear the backlog queues.
On Sunday, April 21st, at 9:51 AM PDT, our technical teams confirmed that backlog processing was nearly complete, with new inventory being processed at optimal levels. However, because incoming inventory rates over the weekend are typically lower than on weekdays, the decision was made to extend monitoring into the following weekdays. Later that same day, at 4:12 PM PDT, the backlog was fully cleared.
Following extended monitoring into Monday, we observed continued positive outcomes, with data processing maintaining its weekday normal rate. After successful health checks and confirmation from our technical teams, we concluded the incident as resolved on Monday, April 22nd.
Root Cause
Primary root cause:
Traffic congestion leading to processing bottlenecks on the IT Asset Management platform in the NA region, resulting in delays in processing NDI inventory files.
Contributing factors:
• Unusual increase in wait times on one of the databases associated with relocating computers to new organizational units.
• Majority of occurrences traced back to a single tenant, potentially exacerbated by other tenants sharing the same database (see the illustrative sketch after this list).
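To illustrate the blocking mechanism behind this contributing factor, the simulation below, which is not taken from the platform's code, shows how one tenant's long-running operation holding a shared database resource delays another tenant's writes; the tenant names, durations, and lock object are hypothetical.

```python
# Illustrative simulation of cross-tenant blocking on a shared database.
# Tenant names, durations, and the shared lock are assumptions for the example.
import threading
import time

shared_db_lock = threading.Lock()      # stands in for a contended resource on the shared database


def relocate_computers(tenant_id: str, seconds: float) -> None:
    """Simulates the organizational-unit relocation holding the shared resource."""
    with shared_db_lock:
        time.sleep(seconds)             # long-running work while holding the resource
    print(f"{tenant_id}: relocation finished after {seconds}s")


def process_inventory(tenant_id: str) -> None:
    """Simulates another tenant's inventory write waiting on the same resource."""
    start = time.monotonic()
    with shared_db_lock:
        waited = time.monotonic() - start
    print(f"{tenant_id}: inventory write waited {waited:.1f}s")


if __name__ == "__main__":
    heavy = threading.Thread(target=relocate_computers, args=("tenant-a", 2.0))
    other = threading.Thread(target=process_inventory, args=("tenant-b",))
    heavy.start()
    time.sleep(0.1)                     # ensure tenant-a acquires the resource first
    other.start()
    heavy.join()
    other.join()
```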
Remediation Actions
• Traffic Segregation Initiative: Implementing traffic segregation to isolate inventory nodes and accelerate processing.
• Inventory Resource Scaling: Scaling up inventory resources to expedite data processing.
• Task Prioritization: Disabling non-urgent tasks to alleviate bottlenecks.
• Manual Intervention: Conducting manual interventions, such as further segregating traffic from tenants to promptly clear backlog queues.
• Continuous Monitoring and Adjustment: Continued monitoring and adjustments to ensure optimal processing efficiency.
Future Preventative Measures
Problem Record Establishment: A problem record has been opened to track the progress of both short and long-term measures, facilitating ongoing monitoring and improvement of system performance and reliability.
Short Term Measures:
• Queue System Implementation: Deploying a queue system per tenant to prevent blocking between tenants and mitigate processing delays (a minimal sketch combining this with the concurrency adjustment below follows this list).
• Concurrency Setting Adjustment: Modifying concurrency settings to reduce the server's vulnerability to blockages caused by a single source, enhancing overall system stability.
• Monitoring Enhancement: Strengthening monitoring capabilities to swiftly detect and address any emerging issues, ensuring proactive management of potential bottlenecks.
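As a minimal sketch of the first two measures, the example below pairs a per-tenant queue with a fixed number of workers per tenant, so a backlog from one tenant cannot starve the others. The worker counts, queue structure, and processing stub are assumptions for illustration, not the platform's actual configuration.

```python
# Minimal sketch of per-tenant queues with a per-tenant concurrency cap.
# Worker counts, queue structure, and the process_file stub are illustrative.
import queue
import threading
from collections import defaultdict

WORKERS_PER_TENANT = 2     # concurrency cap: one tenant can never use more workers than this

tenant_queues = defaultdict(queue.Queue)    # one FIFO queue per tenant


def process_file(tenant_id: str, ndi_file: str) -> None:
    # Placeholder for the actual NDI inventory processing step.
    print(f"processed {ndi_file} for {tenant_id}")


def worker(tenant_id: str) -> None:
    """Drain one tenant's queue; a backlog here cannot block other tenants."""
    q = tenant_queues[tenant_id]
    while True:
        ndi_file = q.get()
        if ndi_file is None:               # sentinel: shut this worker down
            q.task_done()
            return
        try:
            process_file(tenant_id, ndi_file)
        finally:
            q.task_done()


def start_tenant(tenant_id: str) -> None:
    for _ in range(WORKERS_PER_TENANT):
        threading.Thread(target=worker, args=(tenant_id,), daemon=True).start()


if __name__ == "__main__":
    for tenant in ("tenant-a", "tenant-b"):
        start_tenant(tenant)
    # Even if tenant-a uploads a large backlog, tenant-b's workers stay free.
    for i in range(5):
        tenant_queues["tenant-a"].put(f"upload-{i:04}.ndi")
    tenant_queues["tenant-b"].put("upload-0000.ndi")
    for tenant in ("tenant-a", "tenant-b"):
        for _ in range(WORKERS_PER_TENANT):
            tenant_queues[tenant].put(None)
        tenant_queues[tenant].join()
```

In this arrangement, adjusting WORKERS_PER_TENANT is the concurrency setting: it caps how much of the shared capacity any single source can consume.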
Long Term Plan:
UDI-ITAM Tenant-Specific Inventory Ingestion Pipeline Implementation: Scheduled for Q3 2024, this initiative involves developing and deploying a dedicated inventory ingestion pipeline for each tenant, improving system efficiency and resilience and preventing recurrence of this type of incident.