Description: Flexera One - IT Visibility - NAM – Data Processing was Delayed
Timeframe: November 29th, 4:30 PM to December 3rd, 2:03 PM PST
Incident Summary
On Wednesday, November 29th, at 4:30 PM PST, we observed an unexpected surge in activity within one of our IT Visibility services, which delayed writes to downstream services. As a result, data processing fell behind and customers in the NAM region saw outdated information. The user interface remained fully functional throughout the incident.
Our investigation revealed that an unexpected disruption during deployment had caused the loss of a critical caching component. At 4:34 PM PST, our team addressed the issue by migrating to an alternative caching solution to maintain stability and performance.
After the deployment, we ran health checks to verify the operational status of all services. However, the newly introduced service did not initially expose the statistics needed to measure latency, so the team began working on a way to obtain these readings. In the meantime, we continued to monitor the environment into the following day.
On December 1st, at 1:45 PM PST, the team finalized a solution to provide the additional data points. However, we did not deploy it to production immediately, because doing so required a service restart, which carried the risk of replaying the organizations' data.
To preserve stability, the team opted not to make any further changes and continued to conduct manual spot checks. Because of the substantial volume of data, the backlog took an extended period to process. On December 3rd at 2:03 PM PST, the team verified that the backlog had been fully processed and that real-time data processing had resumed, at which point the incident was considered resolved.
Root Cause
The incident originated from an unforeseen disruption during deployment, which caused the loss of a critical caching component, and was exacerbated by the subsequent surge in activity.
Remediation Actions
Future Preventative Measure
Enhanced Stability and Monitoring Post-Service Transition: Since transitioning to the new service, we have observed continued stability. In addition, enhancements deployed to the production environment after the incident provide additional data points for more robust monitoring and better visibility into data currency.