Flexera One - IT Visibility - NAM - ITV Dashboards Data Processing Delayed
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One - IT Visibility - NAM - ITV Dashboards Data Processing Delayed

Timeframe: October 14, 2024, 1:00 AM PDT to October 15, 2024, 12:31 AM PDT

Incident Summary

On Monday, October 14th, 2024, at 1:00 AM PDT, the IT Visibility (ITV) dashboard in the North America region encountered a disruption in data processing. The issue originated with our service provider and resulted in significant delays in updating dashboard data. Customers retained access to the dashboards and all other ITV services remained fully operational, but the latest data was unavailable.

The disruption was caused by a backlog of data following scheduled maintenance by the service provider, which paused their ETL pipeline, allowing input files to accumulate. After maintenance concluded, the backlog took longer than anticipated to process. Additionally, a bug in the service provider’s infrastructure caused the ETL process to stall and time out. A custom timeout setting prolonged the issue further, delaying recovery efforts.

The service provider deployed additional resources and adjusted log file management to address the issue. By 10:26 AM PDT, progress was seen. Our technical teams engaged the service provider on a technical bridge, which concluded at 12:41 PM PDT. By 12:31 AM PDT on October 15th, the backlog had been fully processed, and normal data processing operations resumed. Our teams verified the resolution, and the incident was declared resolved.

Root Cause

The disruption in data processing for the IT Visibility (ITV) dashboard was caused by a backlog of data resulting from scheduled maintenance performed by our service provider. The maintenance paused the pipeline, leading to a buildup of input files. After maintenance was completed, the backlog took longer to process due to a combination of increased data volume and a bug in the service provider's infrastructure, which caused the process to stall and time out.

Remediation Actions

·        Increased Resource Allocation: Additional resources, including memory and parallelism, were allocated by the service provider to handle the increased workload.

·        Log File Management Adjustments: Log file management was adjusted to prevent processes from stalling.

·        Enhanced Process Monitoring: The process was closely monitored to ensure the backlog was cleared and regular data processing resumed.

Future Preventative Measures

The service provider is taking the following actions to help avoid similar incidents in the future:

·        Enhanced Monitoring: Enhanced monitoring is being implemented to detect potential delays and backlog accumulation early, allowing for quicker response times.

·        Bug Fix Implementation: The service provider will investigate and address the bug in their infrastructure that caused the process to stall, preventing the recurrence of similar issues.
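
As a rough illustration of the kind of backlog check that enhanced monitoring could include, the sketch below raises an alert when too many input files are pending or when the oldest pending file has waited too long. This is not the service provider's implementation, which this report does not describe. The names check_backlog and BacklogStatus and both threshold values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BacklogStatus:
    pending_files: int      # number of input files waiting to be processed
    oldest_age: timedelta   # how long the oldest pending file has waited
    alert: bool             # True when either threshold is exceeded


def check_backlog(pending_timestamps, now,
                  max_pending=100, max_age=timedelta(hours=2)):
    """Evaluate a queue of pending input files (hypothetical thresholds).

    pending_timestamps: arrival times (datetime) of files not yet processed.
    now: the current time, passed in so the check is easy to test.
    """
    if not pending_timestamps:
        return BacklogStatus(0, timedelta(0), False)

    oldest_age = now - min(pending_timestamps)
    # Alert on either signal: queue depth or age of the oldest file.
    alert = len(pending_timestamps) > max_pending or oldest_age > max_age
    return BacklogStatus(len(pending_timestamps), oldest_age, alert)
```

Checking the age of the oldest file as well as the queue depth means a stalled process is flagged even when only a few files are waiting.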

Posted Oct 23, 2024 - 02:03 PDT

Resolved
The issue is resolved and all data processing has completed.
Posted Oct 15, 2024 - 00:38 PDT
Update
Data processing is progressing as expected, with no issues detected. Our team is continuously monitoring the situation to ensure everything remains on track. We’ll keep you informed with further updates as necessary.
Posted Oct 14, 2024 - 15:58 PDT
Update
We have been in close collaboration with our service provider to address the current data processing issue. Measures such as increasing system resources and optimizing processes have been implemented to accelerate data handling. Additionally, we are working together to ensure that every step in the data processing flow is thoroughly monitored and adjusted as needed to maintain accuracy and efficiency.

We are closely tracking the situation and will continue working with the provider until the issue is fully resolved. Further updates will be shared as progress is made.
Posted Oct 14, 2024 - 14:06 PDT
Update
Our team has conducted a thorough internal review to isolate the issue and has not found any recent changes or anomalies that could have contributed to the problem.

We are in close collaboration with our service provider, and further actions have been taken to improve data processing. At this point, data processing is running successfully, and we are closely monitoring the situation. We will provide further updates as more information becomes available.
Posted Oct 14, 2024 - 10:43 PDT
Update
We are actively working with our service provider to address the issue impacting the timely updates of IT Visibility dashboards in the NAM region. Dashboards remain accessible, but the data may not be current.

Our teams are closely monitoring the situation, and we are making progress toward a resolution.

Further updates will be provided as soon as new information becomes available.
Posted Oct 14, 2024 - 09:24 PDT
Investigating
Incident Description: We have been alerted to a problem affecting data processing for the IT Visibility dashboards, powered by GoodData, in the NAM region. The issue has been identified on the service provider's end and is currently being addressed. As a result, customers may experience delays in the availability of updated dashboard data.

Priority: P2

Restoration Activity: A support request has been submitted to the service provider. The service provider has confirmed the issue on their side and further discussions are ongoing to ensure a timely resolution.

We will monitor the situation closely and provide updates as the investigation continues.
Posted Oct 14, 2024 - 01:11 PDT
This incident affected: Flexera One - IT Visibility - North America (IT Visibility US).