Description: Flexera One - IT Visibility & Common Spend Analytics Dashboards - EU - Data Processing Delayed
Timeframe: October 31, 2024, 1:31 AM PDT to November 1, 2024, 2:13 AM PDT
Incident Summary
On Thursday, October 31, 2024, at 1:31 AM PDT, a data processing delay was detected, impacting IT Visibility (ITV) and Common Spend Analytics (CSA) dashboards in the EU region. Although the dashboards remained accessible, they did not display the most recent data. Other ITV services were unaffected and operated normally.
We raised a support ticket with the service provider as soon as the issue was detected. The issue was traced to an unexpected hardware failure within the service provider's EU data center. Although the initial problem was mitigated, a secondary issue emerged that disrupted extract, transform, and load (ETL) processes and report computations for some workspaces in the EU region, and the service provider began restoring the impacted workspaces.
By 7:11 AM PDT, the service provider confirmed good progress in restoring the affected workspaces. To maintain stability during the restoration, we temporarily paused specific data loading functions and closely monitored the environment until full restoration was achieved, at which point all services were fully operational. A complete data load and end-to-end checks were then conducted to verify that processing was functioning correctly, and the incident was marked as resolved on November 1, 2024, at 2:13 AM PDT.
Root Cause
The root cause of this incident was an unexpected hardware failure within our service provider's data center in Europe. This failure affected critical infrastructure supporting ETL processes and report computations in the EU region. Although the initial issue was mitigated, residual effects from the hardware failure led to delays in data processing and updates for some workspaces.
Remediation Actions
· Service Provider Coordination: Collaborated with the service provider to address the infrastructure issue and ensure the complete restoration of affected data processing functions.
· Stability Measures: Temporarily paused specific data loading functions to stabilize the environment during the restoration process.
· Verification of Full Restoration: Conducted a full data reload and end-to-end checks to confirm all services were fully restored and functioning as expected (an illustrative check is sketched below).
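For illustration only, a post-restoration verification of this kind could be expressed as a small validation script such as the sketch below. It is not the actual tooling used during the incident; the workspace names, freshness window, and row-count comparison are assumptions, and a real check would read this metadata from the ITV/CSA processing pipeline itself.

    # Illustrative post-restoration verification (not the actual tooling used).
    # Confirms that each workspace has loaded data within an acceptable window
    # and that source and loaded row counts agree after the full reload.
    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    MAX_DATA_AGE = timedelta(hours=6)  # assumed freshness window for the example

    @dataclass
    class WorkspaceLoadStatus:
        workspace: str
        last_successful_load: datetime
        source_row_count: int
        loaded_row_count: int

    def verify_workspaces(statuses, now=None):
        """Return a list of problems found; an empty list means all checks passed."""
        now = now or datetime.now(timezone.utc)
        problems = []
        for s in statuses:
            age = now - s.last_successful_load
            if age > MAX_DATA_AGE:
                problems.append(f"{s.workspace}: data is stale ({age} old)")
            if s.loaded_row_count != s.source_row_count:
                problems.append(
                    f"{s.workspace}: row count mismatch "
                    f"(source={s.source_row_count}, loaded={s.loaded_row_count})"
                )
        return problems

    if __name__ == "__main__":
        # Hypothetical workspace statuses standing in for real pipeline metadata.
        sample = [
            WorkspaceLoadStatus("eu-workspace-01",
                                datetime.now(timezone.utc) - timedelta(hours=1),
                                120_000, 120_000),
            WorkspaceLoadStatus("eu-workspace-02",
                                datetime.now(timezone.utc) - timedelta(hours=9),
                                98_500, 97_000),
        ]
        for problem in verify_workspaces(sample):
            print("FAILED:", problem)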
Future Preventative Measures
· Enhanced Infrastructure Monitoring: Work with the service provider to implement additional monitoring for critical infrastructure components to detect and address potential hardware failures proactively (a minimal monitoring sketch follows this list).
· Regular Risk Assessment: Conduct regular risk assessments of the service provider’s data centers and recovery protocols, focusing on reducing dependencies that could lead to service delays.
· Request for Service Provider RCA: Requested a detailed root cause analysis (RCA) from the service provider, along with their planned future mitigation actions, to prevent recurrence of similar infrastructure issues.
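To illustrate the detection side of the measures above, the sketch below is a minimal, hypothetical monitor that flags workspaces whose data processing lag exceeds a threshold, which is the symptom customers saw during this incident. The threshold, workspace names, and the use of simple logging in place of a real alerting integration are assumptions for the example, not the service provider's or Flexera's actual monitoring.

    # Illustrative data-processing-delay monitor (assumed names and thresholds).
    # A real deployment would read lag from pipeline metadata and page on-call
    # through an alerting system; here the alert is simply logged.
    import logging
    from datetime import datetime, timedelta, timezone

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    ALERT_THRESHOLD = timedelta(hours=2)  # assumed maximum acceptable lag

    def check_processing_delay(last_processed_at):
        """Log a warning for every workspace whose newest processed data is too old."""
        now = datetime.now(timezone.utc)
        for workspace, processed_at in last_processed_at.items():
            lag = now - processed_at
            if lag > ALERT_THRESHOLD:
                logging.warning("Data processing delayed for %s: lag %s exceeds %s",
                                workspace, lag, ALERT_THRESHOLD)
            else:
                logging.info("%s healthy: lag %s", workspace, lag)

    if __name__ == "__main__":
        # Hypothetical last-processed timestamps, e.g. fetched from pipeline metadata.
        check_processing_delay({
            "eu-workspace-01": datetime.now(timezone.utc) - timedelta(minutes=30),
            "eu-workspace-02": datetime.now(timezone.utc) - timedelta(hours=5),
        })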