Flexera One - IT Asset Management - NA - Inventory Processing Disruption
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One - IT Asset Management - NA - Inventory Processing Disruption

Timeframe: July 19th, 5:30 PM to July 21st, 2:00 AM PDT

Incident Summary

On July 19th, 2024, at 5:30 PM PDT, we experienced a disruption in our IT Asset Management Platform, specifically affecting inventory processing in the NA region. Customers may have experienced delays in inventory processing and potential backlogs during this disruption. Additionally, affected client beacons may have encountered delays in data updates. Inventory processing resumed around 2:00 AM PDT on July 21st, following a failover event to one of the clusters.

Preliminary investigations indicated an issue with one of our data processing interfaces. Monitoring revealed a substantial increase in error responses and target response times coinciding with the disruption. Usage data, which bypasses this interface, continued to flow normally.

Our existing monitors did not alert us to the issue: because usage data bypasses the affected interface and continued to flow normally, the aggregate metrics we track remained just above the alerting threshold. This gap in monitoring coverage led to a backlog and delays in processing inventory files.
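For illustration only, the sketch below uses hypothetical metric names and thresholds (not our actual monitoring configuration) to show how an aggregate-throughput alert can stay silent when one data stream fails while another keeps the combined rate just above the threshold, and how a per-stream threshold would have fired:

```python
# Hypothetical sketch: an aggregate-throughput alert can mask the loss of a
# single stream when another stream keeps the combined rate above threshold.

ALERT_MIN_FILES_PER_HOUR = 900  # hypothetical aggregate alerting threshold

def aggregate_alert(usage_files: int, inventory_files: int) -> bool:
    """Fires only when the combined rate drops below the threshold."""
    return (usage_files + inventory_files) < ALERT_MIN_FILES_PER_HOUR

def per_stream_alert(usage_files: int, inventory_files: int,
                     usage_min: int = 400, inventory_min: int = 400) -> bool:
    """Fires when any individual stream drops below its own threshold."""
    return usage_files < usage_min or inventory_files < inventory_min

# During the disruption, usage data (which bypasses the affected interface)
# kept flowing while inventory processing stalled.
usage, inventory = 950, 0
print(aggregate_alert(usage, inventory))   # False -> no alert, backlog grows
print(per_stream_alert(usage, inventory))  # True  -> would have fired
```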

We continued to monitor the environment closely. On July 21st at 6:00 PM PDT, additional health checks confirmed that inventory processing and data handling were consistently at normal levels. After validation from all involved teams, the incident was resolved and transitioned to a problem management investigation.

Subsequent investigation revealed that the primary cause of the disruption was the failure of a critical server responsible for handling authentication requests, brought on by a storage issue. Additionally, the load balancer handling data redirection encountered significant errors, including numerous 502 errors and increased response times, which further contributed to the disruption.

Root Cause

Primary Root Cause

The primary cause of the disruption was the failure of a critical server responsible for handling authentication requests, which resulted from a storage issue (insufficient disk space) on that server.

Contributing Factors

• Load Balancer Errors: The load balancer handling data redirection encountered significant errors, including numerous 502 errors and increased response times, which further contributed to the disruption.
• Monitoring Thresholds: Alerts did not trigger because the aggregate metrics being monitored stayed just above the alerting threshold, sustained by the normal flow of usage data, which delayed identification of the issue.

Remediation Actions

  1. Cluster Failover: The system initiated a failover event to an alternate cluster, which restored inventory processing functionality.
  2. Continuous Monitoring: Monitored the environment closely to ensure that inventory processing and data handling were stable and consistently at normal levels.
  3. Health Checks: Conducted additional health checks to validate the stability and performance of the system after the failover; a minimal sketch of the recovery-confirmation logic appears after this list.
  4. Disk Space Management: Identified the storage issue on the critical server and created a task to increase the disk space to prevent similar failures in the future.
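The following is a minimal sketch, with hypothetical thresholds and sample readings, of the recovery-confirmation logic behind the post-failover health checks described above: recovery is only declared once several consecutive readings show an acceptable backlog and a normal upload rate.

```python
# Hypothetical sketch of post-failover recovery confirmation: require several
# consecutive healthy readings before declaring inventory processing recovered.

BACKLOG_LIMIT = 1_000     # hypothetical acceptable backlog depth (files)
MIN_UPLOAD_RATE = 100.0   # hypothetical uploads per minute considered normal

def reading_is_healthy(backlog_depth: int, upload_rate: float) -> bool:
    return backlog_depth < BACKLOG_LIMIT and upload_rate >= MIN_UPLOAD_RATE

def recovery_confirmed(readings: list[tuple[int, float]], required: int = 6) -> bool:
    """True once the most recent `required` readings are all healthy."""
    recent = readings[-required:]
    return len(recent) == required and all(
        reading_is_healthy(depth, rate) for depth, rate in recent
    )

# Example: one unhealthy reading during the disruption, then six healthy ones.
samples = [(5200, 12.0), (800, 140.0)] + [(400, 150.0)] * 5
print(recovery_confirmed(samples))  # True
```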

Future Preventative Measures

  1. Disk Space Management Enhancements: We have increased the disk size on the affected server to address the storage issue. This proactive measure ensures that sufficient storage capacity is available to support ongoing operations and reduces the likelihood of similar disruptions in the future.
  2. Enhanced Monitoring and Alerting: Our technical teams have been tasked with implementing comprehensive monitoring and alerting to detect high disk usage and other critical metrics across services, so that timely alerts reach the relevant teams and potential issues can be addressed before they impact operations. A minimal sketch of the kind of disk-usage check intended is shown after this list.
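As an illustration of item 2, the sketch below (hypothetical paths and thresholds, not our production tooling) shows the kind of disk-usage check that, run on a schedule and wired into alerting, would surface a storage issue before a server fails:

```python
# Hypothetical sketch of a scheduled disk-usage check that warns well before
# a volume fills up; results would be routed to the on-call alerting system.
import shutil

WARN_PCT = 80.0      # hypothetical warning threshold (% used)
CRITICAL_PCT = 90.0  # hypothetical critical threshold (% used)

def check_disk(path: str) -> str:
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100
    if used_pct >= CRITICAL_PCT:
        return f"CRITICAL: {path} at {used_pct:.1f}% used"
    if used_pct >= WARN_PCT:
        return f"WARNING: {path} at {used_pct:.1f}% used"
    return f"OK: {path} at {used_pct:.1f}% used"

if __name__ == "__main__":
    print(check_disk("/"))
```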
Posted Aug 08, 2024 - 12:49 PDT

Resolved
The backlog has decreased significantly to acceptable levels, and the upload rate has returned to normal. As a result, the incident has been resolved and will be transitioned to problem management for further investigation to determine the underlying cause and to improve monitoring capabilities for faster detection and resolution in the future.
Posted Jul 21, 2024 - 18:04 PDT
Investigating
Incident Description: We experienced a disruption with inventory processing in the NA environment on the IT Asset Management platform over the weekend. Although this issue has been resolved, there may be ongoing impacts, including delays in inventory processing and a backlog of client beacon data.

Priority: P2

Restoration Activity: We have observed some failures that likely contributed to the issue and have engaged additional SMEs to thoroughly investigate the underlying causes and full scope of the impact. Our team is actively monitoring the situation to ensure that all processing tasks are completed successfully and to address any residual effects. We will provide updates as we continue to work through this.
Posted Jul 21, 2024 - 09:47 PDT
This incident affected: Flexera One - IT Asset Management - North America (IT Asset Management - US Beacon Communication, IT Asset Management - US Inventory Upload).