Description: Flexera One – IT Asset Management (ITAM) – NA – Slow Load Times and Errors
Timeframe: September 4th, 2024, 11:00 AM to September 4th, 2024, 12:30 PM PDT
Incident Summary
On Wednesday, September 4th, 2024, at 11:00 AM PDT, we experienced an issue affecting the IT Asset Management (ITAM) platform in the NA region. Customers may have encountered slow load times and errors when accessing ITAM or performing functions within it. These issues were primarily reported by users in the NA region, with no impact observed in the EU or APAC regions.
As part of an automated process designed to optimize system performance during peak load times, a new instance was deployed in the NA region. However, this instance inadvertently contributed to traffic misrouting within the environment, leading to degraded performance and errors. Our technical team promptly removed the problematic instance, but this did not resolve the issue.
Further investigation revealed that a scheduled maintenance activity in another region had unintentionally affected the traffic configuration for the NA environment, which caused the continued slow load times and errors.
By 12:30 PM PDT, the traffic routing was corrected, restoring normal performance. Following additional validations and internal health checks, multiple customers confirmed that the ITAM platform was functioning as expected. The incident was officially declared resolved at 12:46 PM PDT.
Root Cause
Primary Root Cause
The disruption was caused by an internal process error during scheduled maintenance in another region, which inadvertently affected traffic routing in the NA region and resulted in slow load times and errors.
Contributing Factors
• Cross-Region Impact: The process targeting another environment unintentionally impacted the NA region.
• Limited Region-Specific Testing: Testing was conducted in a single region and therefore did not catch the issue before it impacted the NA environment.
Remediation Actions
Future Preventative Measures
In response to this incident, a thorough Root Cause Analysis (RCA) was conducted with all relevant teams to identify areas for improvement. The discussions emphasized the importance of strengthening region-specific testing and improving our operational processes to avoid similar issues in the future. Based on these findings, we have outlined the following preventative measures to enhance system performance, reduce cross-region impacts, and ensure more rigorous testing and validation before deploying changes.