Description: Flexera One - IT Visibility - APAC, EU & NAM - Data Inaccuracy Issue
Timeframe: September 20th, 2023, 5:05 AM PDT to September 24th, 2023, 6:23 PM PDT
Incident Summary
On Wednesday, September 20th, 2023, at 5:05 AM PDT, our technical team identified an issue with our IT Visibility platform that resulted in data inaccuracies. Customer data may have been inaccurate across several areas, including Flexera One Dashboards, Data Exports, API queries, and ServiceNow. Customers in all regions, including APAC, EU, and NAM, could have been affected by this issue.
Our technical teams were promptly mobilized to address the issue. The root cause was determined to be a change implemented on September 14th, 2023: an unintended modification was deployed to production that impaired the system's error handling, leading to data processing delays and the accumulation of a substantial backlog, which in turn directly affected the accuracy of customer data. On September 20th, at 6:30 AM PDT, an emergency hotfix was deployed to the PROD environment, and data reloading was initiated for the affected organizations.
By September 22nd, at 1:29 AM PDT, a substantial number of organizations had received updated data. However, many organizations, especially those with extensive datasets, were still awaiting updates. To expedite processing, the technical teams increased the environment's capacity for concurrent job processing.
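As a rough illustration of the concurrency increase described above, a bounded worker pool drains a backlog faster when its capacity is raised. This is a sketch only; the function and organization names are hypothetical, and the worker counts are illustrative, not the platform's actual configuration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def reload_org(org_id):
    # Hypothetical stand-in for a per-organization data reload job.
    return f"reloaded {org_id}"

def process_backlog(org_ids, max_workers=4):
    # Raising max_workers is the kind of capacity increase described
    # above: more reload jobs run in parallel, so the backlog of
    # waiting organizations drains faster.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(reload_org, org): org for org in org_ids}
        for future in as_completed(futures):
            results.append(future.result())
    return results

backlog = [f"org-{i}" for i in range(8)]
done = process_backlog(backlog, max_workers=8)
```

Doubling max_workers roughly halves wall-clock time only while the underlying environment has spare capacity, which is why the later resource-contention findings mattered.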
By around 7:06 AM PDT that day, substantial progress had been made in streaming data for most organizations. However, intermittent pod terminations were interrupting ongoing jobs, and larger organizations with long-running data processing tasks were notably affected by these interruptions.
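The effect of a pod termination on a long-running job can be sketched as a retry wrapper: when a job is cut short, all of its in-flight work is lost and the whole job must be re-run. This is an illustrative pattern, assuming a job that raises on interruption; it is not the team's actual recovery mechanism.

```python
def run_with_retries(job, max_attempts=3):
    # Re-run a job that may be interrupted partway through
    # (for example, by its pod being terminated).
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except InterruptedError as exc:
            last_exc = exc  # job was cut short; start it over
    raise last_exc

class FlakyJob:
    # Simulates a long-running reload that is interrupted twice
    # before finally completing.
    def __init__(self, failures=2):
        self.remaining_failures = failures

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise InterruptedError("pod terminated mid-job")
        return "complete"

result = run_with_retries(FlakyJob(failures=2), max_attempts=3)
```

Because each retry restarts from the beginning, larger organizations with longer jobs lose the most work per interruption, which matches the disproportionate impact noted above.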
The team implemented additional measures to improve system performance, but the system still struggled to handle the increased data processing load. On September 24th, at 10:40 AM PDT, the technical team's ongoing investigation identified resource contention as the major factor behind the persistent performance issues, significantly hindering backlog processing.
On September 24th, at 11:11 AM PDT, the technical team implemented multiple measures to address the performance issues, including adjustments to resource allocations and scaling configurations aimed at improving overall system performance.
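The kind of resource-allocation tuning described above can be sketched as capping concurrency at whatever the environment's CPU or memory can actually back, so jobs are not oversubscribed against each other. All figures and per-job requirements here are purely illustrative assumptions, not the production values.

```python
def max_concurrent_jobs(total_cpu_cores, total_memory_gb,
                        cpu_per_job=2, memory_gb_per_job=8):
    # Resource contention arises when more jobs are scheduled than
    # the environment can back with CPU and memory at once. Capping
    # concurrency at the binding (scarcest) resource avoids that.
    by_cpu = total_cpu_cores // cpu_per_job
    by_memory = total_memory_gb // memory_gb_per_job
    return min(by_cpu, by_memory)

# With 32 cores and 64 GB, CPU would allow 16 jobs but memory
# only 8, so memory is the binding resource.
cap = max_concurrent_jobs(32, 64)
```

Scaling the environment (raising the totals) or right-sizing per-job allocations both raise this cap, which is the intent of the adjustments described above.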
These changes expedited data processing, and the remaining jobs completed successfully by 6:23 PM PDT on September 24th, resolving the backlog and restoring real-time data processing.
Root Cause
Primary Root Cause:
The root cause of the incident was an unintended modification deployed during a change implemented on September 14th, 2023. This modification impaired the system's error handling capabilities, resulting in data processing delays and a significant backlog, which ultimately affected the accuracy of customer data.
Contributing Cause:
Resource contention, identified on September 24th, 2023, significantly hindered backlog processing and further contributed to the performance issues during the incident.
Remediation Actions
Future Preventative Measures