Description: Flexera One - IT Visibility - NAM - ITV Dashboards Data Processing Delayed
Timeframe: October 14, 2024, 1:00 AM PDT to October 15, 2024, 12:31 AM PDT
Incident Summary
On Monday, October 14th, 2024, at 1:00 AM PDT, the IT Visibility (ITV) dashboard in the North America region encountered a disruption in data processing. The issue originated on our service provider’s side and resulted in significant delays in updating dashboard data. Customers retained access to the dashboards and all other ITV services remained fully operational, but the latest data was unavailable.
The disruption was caused by a backlog of data following scheduled maintenance by the service provider, which paused their ETL pipeline, allowing input files to accumulate. After maintenance concluded, the backlog took longer than anticipated to process. Additionally, a bug in the service provider’s infrastructure caused the ETL process to stall and time out. A custom timeout setting prolonged the issue further, delaying recovery efforts.
The service provider deployed additional resources and adjusted log file management to address the issue. By 10:26 AM PDT, backlog processing was showing progress. Our technical teams engaged the service provider on a technical bridge, which concluded at 12:41 PM. By 12:31 AM PDT on October 15th, the backlog had been fully processed and normal data processing resumed. Our teams verified the resolution, and the incident was declared resolved.
Root Cause
The disruption in data processing for the IT Visibility (ITV) dashboard was caused by a backlog of data resulting from scheduled maintenance performed by our service provider. The maintenance paused the pipeline, leading to a buildup of input files. After maintenance was completed, the backlog took longer to process due to a combination of increased data volume and a bug in the service provider's infrastructure, which caused the process to stall and time out.
Remediation Actions
· Increased Resource Allocation: Additional resources, including memory and parallelism, were allocated by the service provider to handle the increased workload.
· Log File Management Adjustments: Log file management was adjusted to prevent processes from stalling.
· Enhanced Process Monitoring: The process was closely monitored to ensure the backlog was cleared and regular data processing resumed.
Future Preventative Measures
The service provider is taking the following actions to help avoid similar incidents in the future:
· Enhanced Monitoring: Enhanced monitoring will be implemented to detect potential delays and backlog accumulation early, allowing for quicker response times.
· Bug Fix Implementation: The service provider will investigate and address the bug in their infrastructure that caused the process to stall, to prevent similar issues from recurring.
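As an illustration of what early detection of backlog accumulation could look like, the following is a minimal sketch only. It does not describe the service provider's actual monitoring. The names `BacklogSample` and `backlog_alert`, and both threshold values, are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class BacklogSample:
    taken_at: datetime
    pending_files: int  # input files waiting for the ETL pipeline (assumed metric)


def backlog_alert(samples, max_pending=1000, max_growth_streak=3):
    """Return a reason string if the backlog looks unhealthy, else None.

    Both thresholds are illustrative assumptions, not values from the incident.
    """
    if not samples:
        return None

    latest = samples[-1]
    # Rule 1: the absolute backlog size is above the allowed limit.
    if latest.pending_files > max_pending:
        return f"backlog of {latest.pending_files} files exceeds {max_pending}"

    # Rule 2: the backlog grew in each of the most recent consecutive samples.
    streak = 0
    pairs = zip(reversed(samples[:-1]), reversed(samples[1:]))
    for previous, current in pairs:
        if current.pending_files > previous.pending_files:
            streak += 1
        else:
            break
    if streak >= max_growth_streak:
        return f"backlog grew for {streak} consecutive samples"

    return None
```

A check of this kind would run on a schedule against periodic backlog measurements. Combining an absolute limit with a growth streak is meant to catch both a sudden buildup, such as one following a paused pipeline, and a slow stall in which files accumulate gradually.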