Flexera One - IT Visibility - APAC, EU & NAM - Data Inaccuracy Issue

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - IT Visibility - APAC, EU & NAM - Data Inaccuracy Issue

Timeframe: September 20th, 2023, 5:05 AM PDT to September 24th, 2023, 6:23 PM PDT

Incident Summary

On Wednesday, September 20th, 2023, at 5:05 AM PDT, our technical team identified an issue with our IT Visibility platform that resulted in data inaccuracies. Customer data may have been inaccurate across several areas, including Flexera One Dashboards, Data Exports, API queries, and ServiceNow. Customers in all regions (APAC, EU, and NA) could have been affected.

Our technical teams were promptly mobilized to address the issue. The root cause was determined to be a change implemented on September 14th, 2023. During that change, an unintended modification was deployed into production. It impaired the system’s error handling, which led to data processing delays and the accumulation of a substantial backlog, and this in turn directly affected the accuracy of customer data. On September 20th, at 6:30 AM PDT, an emergency hotfix was implemented in the PROD environment. Subsequently, data reloading was initiated for the affected organizations.

As of September 22nd, at 1:29 AM PDT, a substantial number of organizations had already received updates. However, numerous organizations, especially those with extensive datasets, were still awaiting theirs. To expedite processing, the technical teams increased the environment's capacity for concurrent job processing.

Around 7:06 AM PDT, substantial progress had been made in streaming data for most organizations. However, intermittent pod terminations were observed, causing interruptions to ongoing jobs. Larger organizations with extended data processing tasks were notably affected by these interruptions.

The team implemented additional measures to improve system performance, but the system still struggled to handle the increased data processing load. On September 24th, at 10:40 AM PDT, the technical team's ongoing investigation identified resource contention as the major factor behind the persistent performance issues, as it was significantly hindering backlog processing.

On September 24th at 11:11 AM PDT, multiple measures were implemented by the technical team to address the performance issues. These actions included adjustments to resource allocations and scaling configurations, with the goal of improving overall system performance.

These enhancements sped up data processing, and the remaining jobs completed by 6:23 PM PDT on September 24th, resolving the backlog and restoring real-time data processing.

Root Cause

Primary Root Cause:

The root cause of the incident was an unintended modification that occurred during a change implemented on September 14, 2023. This modification negatively impacted the error handling capabilities of the system, resulting in data processing delays and a significant backlog, which ultimately affected the accuracy of customer data.

Contributing Cause:

Resource contention issues, highlighted on September 24, 2023, significantly hindered backlog processing, further contributing to the performance issues during the incident.

Remediation Actions

  1. Urgent Hotfix: On September 20, 2023, at 6:30 AM PDT, an emergency hotfix was applied to the production environment to address the root cause.
  2. Data Reloading: Subsequently, data reloading was initiated for the affected organizations, starting with those that had already received updates.
  3. Enhanced Capacity: To expedite processing, particularly for organizations with extensive datasets, the technical teams decided to enhance the environment by increasing its capacity for concurrent job processing.
  4. Identifying Resource Contention: On September 24, 2023, at 10:40 AM PDT, the technical team identified resource contention as a significant contributor to the ongoing performance issues.
  5. Performance Remediation: Subsequently, at 11:11 AM PDT on September 24th, measures were taken to adjust resource allocations and scaling configurations to improve overall system performance.
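
The report does not describe how reload jobs were scheduled or retried, so the following Python sketch is only an illustration of steps 3 to 5: a bounded pool runs a fixed number of reload jobs concurrently and re-runs a job that is interrupted partway through. The names `reload_all`, `reload_org`, `JobInterrupted`, `max_workers`, and `max_attempts` are hypothetical and are not taken from the Flexera platform.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class JobInterrupted(Exception):
    """Hypothetical signal that a worker was terminated mid-job."""


def reload_all(org_ids, reload_org, max_workers=4, max_attempts=3):
    """Reload every organization with at most `max_workers` jobs in flight.

    `reload_org` is a caller-supplied function (an assumption of this
    sketch) that reloads one organization and raises JobInterrupted if
    the job is cut short. Returns (completed, failed) lists of org ids.
    """

    def run_with_retry(org_id):
        # Re-run an interrupted job from the start, up to max_attempts times.
        for _ in range(max_attempts):
            try:
                reload_org(org_id)
                return True
            except JobInterrupted:
                continue
        return False

    completed, failed = [], []
    # The pool size caps concurrency, which limits contention for shared resources.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_with_retry, org): org for org in org_ids}
        for future in as_completed(futures):
            target = completed if future.result() else failed
            target.append(futures[future])
    return sorted(completed), sorted(failed)
```

In a sketch like this, raising `max_workers` increases throughput only while the underlying resources keep up. Past that point the jobs compete with one another, which is the kind of contention the report identifies as a contributing cause.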

Future Preventative Measures

  1. Enhanced Monitoring: We have implemented a robust and continuous monitoring process that runs checks for multiple days following a production deployment. This process will focus on validating organization counts and critically assessing the performance of essential services and components to quickly identify and address issues.
  2. Streamlined Collaboration and Testing: We have established a unified platform to facilitate improved collaboration, testing, and validation during scheduled releases. This platform will enable teams to work seamlessly and ensure thorough testing and validation of all changes before they are deployed into production, reducing the risk of similar incidents.
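
As an illustration of the organization-count validation in item 1, the Python sketch below compares per-region counts taken after a deployment against a pre-deployment baseline and flags any region whose count drops by more than a threshold. The report does not say what is counted or which thresholds apply. The function name `check_org_counts`, the 5% default, and the idea of counting organizations with up-to-date data are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountCheck:
    region: str
    baseline: int
    current: int
    drop_pct: float
    ok: bool


def check_org_counts(baseline, current, max_drop_pct=5.0):
    """Compare per-region organization counts against a baseline.

    `baseline` and `current` map a region name to a count of organizations
    (assumed here to mean organizations with up-to-date data). A region
    fails the check when its count falls by more than `max_drop_pct` percent.
    """
    results = []
    for region, base in sorted(baseline.items()):
        now = current.get(region, 0)  # a missing region counts as zero
        drop = 0.0 if base == 0 else max(0.0, (base - now) / base * 100.0)
        results.append(
            CountCheck(region, base, now, round(drop, 2), drop <= max_drop_pct)
        )
    return results


# Example with made-up numbers: only EU exceeds the 5% threshold.
checks = check_org_counts(
    {"APAC": 100, "EU": 200, "NAM": 400},
    {"APAC": 100, "EU": 180, "NAM": 390},
)
flagged = [c.region for c in checks if not c.ok]
```

Running a check of this kind on a schedule for several days after each deployment would match the multi-day monitoring window the item describes.
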
Posted Oct 24, 2023 - 11:40 PDT

Resolved

The remaining data has been successfully processed, and this incident is now resolved. We apologize for any inconvenience this issue may have caused. We remain committed to ensuring the continued stability and performance of our services. To achieve this, we will conduct a thorough analysis with our teams and implement additional solutions to prevent future occurrences. We appreciate your understanding and patience throughout this incident.
Posted Sep 24, 2023 - 18:24 PDT

Update

The recent changes have had a significant impact. Additional EU orgs have successfully completed, and in NA, only one org remains pending. We sincerely appreciate your patience as we work diligently towards resolution.
Posted Sep 24, 2023 - 13:21 PDT

Update

Our technical team has been putting in significant effort over the weekend, and following the recent enhancements, we've achieved a substantial increase in throughput, making the streaming process significantly faster. While there's still some backlog in processing orgs, these changes are expected to have effectively addressed the bottleneck that was delaying the streaming process.
Posted Sep 24, 2023 - 12:08 PDT

Update

Currently, we have a limited number of customers remaining for the restreaming process in both the EU and NA regions. We are also diligently investigating and addressing any potential performance-related issues. Your patience during this process is greatly appreciated, and we will continue to provide updates accordingly.
Posted Sep 24, 2023 - 08:31 PDT

Update

Our technical team has made adjustments to enhance system stability, which is expected to improve processing for larger orgs. Additionally, they have conducted further fine-tuning of the system. We are monitoring the results and outcomes closely for any additional improvements.
Posted Sep 23, 2023 - 13:06 PDT

Update

Our technical team has been actively addressing streaming issues for several organizations in the EU and NAM regions. While hardware streaming has completed successfully, software streaming has seen occasional failures and slow performance. The team has investigated this problem and identified a possible root cause. They are currently gathering evidence and formulating a plan to address it.
Posted Sep 23, 2023 - 04:55 PDT

Update

We are making steady progress, but there is still a substantial amount of data to process. Our technical teams will remain vigilant, monitoring the situation throughout today and into tomorrow. We will keep you updated on any significant progress or developments as they occur. We sincerely apologize for any inconvenience caused.
Posted Sep 22, 2023 - 11:35 PDT

Update

We have successfully streamed a significant amount of data, and there are only a limited number of organizations left to process in North America and Europe. We're steadily making progress toward completion.
Posted Sep 22, 2023 - 09:25 PDT

Update

Our technical team is currently investigating the underlying cause of the streaming slowdown that occurred yesterday. While the exact cause is still under investigation, our team has implemented an alternative approach to expedite processing. This strategic move allows us to handle a larger volume of tasks concurrently, leading to significant progress in catching up.
Posted Sep 22, 2023 - 06:08 PDT

Update

Our technical teams are closely monitoring the environment to ensure smooth operation. While APAC organizations are now up to date, EU and NAM organizations are still being processed. We have experienced some slow streaming and have enlisted additional subject matter experts to address the problem and ensure seamless processing. We apologize for the inconvenience and appreciate your patience as we work to resolve this issue.
Posted Sep 21, 2023 - 13:45 PDT

Update

Our technical team continues to monitor progress as data for affected organizations is reloaded. This process will continue over the coming hours.
Posted Sep 21, 2023 - 10:14 PDT

Update

Our technical team is diligently overseeing the ongoing progress. However, the data reconciliation process is experiencing delays, possibly due to the volume of orgs and data that need to catch up. We will keep you updated on any significant developments and progress.
Posted Sep 20, 2023 - 11:17 PDT

Monitoring

We have implemented a fix across all environments, and our team is actively monitoring the backlog data processing.
Posted Sep 20, 2023 - 08:03 PDT

Update

Our investigation into the issue is ongoing, and our technical team is actively developing a solution to address it.
Posted Sep 20, 2023 - 07:17 PDT

Investigating

Incident Description: We are presently experiencing an issue with our IT Visibility platform, resulting in data inaccuracies. Specifically, there is a potential for inaccuracies in customer data across various areas, including Flexera One Dashboards, Data Exports, API queries, and ServiceNow. Customers across all regions, including APAC, EU, and NA, could be affected by this issue.

Priority: P2

Restoration Update: Our technical teams are actively addressing the problem and are diligently working towards a resolution. We apologize for any inconvenience this may have caused.
Posted Sep 20, 2023 - 05:25 PDT
This incident affected: Flexera One - IT Visibility - North America (IT Visibility US), Flexera One - IT Visibility - Europe (IT Visibility EU), and Flexera One - IT Visibility - APAC (IT Visibility - APAC).