Flexera One - IT Visibility - NA - Data Processing Delayed

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - IT Visibility - NA - Data Processing Was Delayed

Timeframe: November 29th, 4:30 PM to December 3rd, 2:03 PM PST

Incident Summary

On Wednesday, November 29th, at 4:30 PM PST, we observed an unexpected surge in activity within one of our IT Visibility services, leading to delays in writing to downstream services. As a result, data processing fell behind and customers in the NA region saw outdated information. The user interface remained fully functional throughout the incident.

Our investigation revealed an unexpected disruption during deployment, resulting in the loss of a critical caching component. At 4:34 PM PST, our team promptly addressed the issue by migrating to an alternative caching solution, ensuring continued stability and performance.

After the deployment, we ran health checks to verify that all services were operational. However, the newly introduced service did not initially expose statistics for measuring latency, so the team began working on a way to obtain these readings. In the meantime, we continued to monitor the environment into the following day.

On December 1st, at 1:45 PM PST, the team developed a solution to provide additional data points. However, we did not deploy it to production immediately because it required a service restart, which risked replaying the organizations' data.

To preserve stability, the team opted not to make any further changes and continued to conduct manual spot checks. Because of the substantial volume of data, the backlog took an extended period to process. On December 3rd at 2:03 PM PST, the team verified that the backlog had been successfully processed and that real-time data processing had resumed. The incident was then considered resolved.

Root Cause

The incident originated from an unforeseen disruption during deployment, exacerbated by the subsequent surge in activity.

Remediation Actions

  1. Caching Solution Migration: Implemented a prompt migration to an alternative caching solution to restore stability and performance after the loss of a critical caching component during deployment.
  2. Verification of Backlog Processing: Verified the successful processing of backlog on December 3rd at 2:03 PM PST, facilitating the resumption of real-time data processing and marking the resolution of the incident.

Future Preventative Measure

Enhanced Stability and Monitoring Post-Service Transition: Since transitioning to the new service, we have observed continued stability. Furthermore, enhancements implemented post-incident in the production environment provide additional data points for more robust monitoring and improved visibility into data currency.
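
As an illustration only, the kind of data currency check described above could be sketched as follows. This is not Flexera's implementation: the function names, the one-hour threshold, and the idea of comparing the newest processed record's timestamp against the current time are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: the report does not state what lag counts as stale.
STALENESS_THRESHOLD = timedelta(hours=1)


def data_currency_lag(last_processed_at: datetime, now: datetime) -> timedelta:
    """Return how far the newest processed record trails the current time.

    Negative values (clock skew, future-dated records) are clamped to zero.
    """
    return max(now - last_processed_at, timedelta(0))


def is_stale(
    last_processed_at: datetime,
    now: datetime,
    threshold: timedelta = STALENESS_THRESHOLD,
) -> bool:
    """Report whether the processing lag exceeds the allowed threshold."""
    return data_currency_lag(last_processed_at, now) > threshold


if __name__ == "__main__":
    # Illustrative timestamps only; they are not taken from the incident.
    now = datetime.now(timezone.utc)
    newest_record = now - timedelta(hours=5)
    print("lag:", data_currency_lag(newest_record, now))
    print("stale:", is_stale(newest_record, now))
```

A check of this shape reads only a timestamp, so it could run without restarting the processing service, and it would surface a growing backlog even while service health checks pass.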

Posted Dec 28, 2023 - 18:37 PST

Resolved

This incident has been resolved.
Posted Dec 03, 2023 - 14:25 PST

Update

To mitigate potential risks, our technical team has decided to refrain from making additional changes for now. We have observed significant progress in backlog processing, and the team will continue to closely monitor the situation.
Posted Dec 02, 2023 - 07:44 PST

Update

Our technical team has devised an enhancement, and we are currently conducting additional testing to thoroughly validate the change.

The team is diligently monitoring and conducting further testing of the proposed change. Given the complexity of the issue, we are taking careful steps to restore stability and ensure ongoing system reliability. We will provide updates as the situation develops.
Posted Dec 01, 2023 - 14:17 PST

Update

We encountered deployment challenges, prompting our team to swiftly transition to a more effective solution. Despite overall service health, we have identified data inconsistencies. Our ongoing efforts are dedicated to gaining additional insights for improved diagnosis and mitigation.
Posted Dec 01, 2023 - 06:58 PST

Update

Data processing is still in progress. All services are in good health, and the existing metrics align with the expected trends. Further updates will be provided as more developments unfold.
Posted Nov 30, 2023 - 09:17 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 29, 2023 - 18:56 PST

Investigating

Incident Description:
We are currently experiencing data processing delays in our IT Visibility Platform in the NA region. This issue does not impact access to ITV; however, some data may not be up to date.

Priority: 2

Restoration activity:
Technical teams have been engaged and are currently investigating.
Posted Nov 29, 2023 - 17:17 PST
This incident affected: Flexera One - IT Visibility - North America (IT Visibility US).