Description: Flexera One - NAM - UI loading issues
Timeframe: May 3, 2025, 2:56 AM PST to May 3, 2025, 4:57 AM PST
Incident Summary
On Saturday, May 3, 2025, at 2:56 AM PST, our engineering teams detected a service disruption impacting the User Interface in the North America production environment. Customers were unable to access the platform, and some encountered HTTP 500 errors. The issue was confirmed to be isolated to production, as the staging environment remained unaffected.
The UI failures were traced to several unresponsive core APIs that were timing out due to database performance degradation. Investigation revealed that a recent configuration change had altered message queue behavior, causing messages to resurface and be reprocessed repeatedly. This led to a surge in duplicate messages, a growing backlog, and excessive load on downstream systems such as the scheduling and indexing services, all contributing to the database strain and API failures.
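As a generic illustration of why repeated redelivery multiplies downstream load, and of one common safeguard against it, the sketch below simulates messages that resurface several times and a consumer that skips any message ID it has already handled. This report does not name the queue technology or the setting that was changed, so every name here (IdempotentConsumer, simulate_redelivery, the message IDs and payloads) is hypothetical and does not describe the actual Flexera One implementation.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Handles each message ID at most once, even if the queue redelivers it."""
    # Assumption: an in-memory set stands in for a durable deduplication store.
    seen_ids: set = field(default_factory=set)
    processed: list = field(default_factory=list)
    duplicates_skipped: int = 0

    def handle(self, message_id: str, payload: str) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self.seen_ids:
            self.duplicates_skipped += 1
            return False
        self.seen_ids.add(message_id)
        self.processed.append(payload)  # stand-in for the real downstream work
        return True


def simulate_redelivery(messages, redeliveries):
    """Yield every message (1 + redeliveries) times, mimicking messages that resurface."""
    for _ in range(1 + redeliveries):
        yield from messages


# Hypothetical messages; the real payloads are not described in this report.
consumer = IdempotentConsumer()
messages = [("m1", "index asset 1"), ("m2", "schedule job 7")]
for message_id, payload in simulate_redelivery(messages, redeliveries=3):
    consumer.handle(message_id, payload)
```

Without the duplicate check, the two messages above would trigger eight units of downstream work instead of two, which is the kind of amplification that produced the backlog described in this summary.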
By 3:51 AM PST, the database service provider had performed a successful failover to stabilize the system. To prevent further impact, the affected service was temporarily paused, the queue configuration was corrected to stop repeated message processing, and redundant entries generated during the incident were removed from the database. After completing system checks and validating the environment, services were restored and the ITAM UI returned to full functionality by 4:57 AM PST.
Root Cause
The root cause of the incident was database performance degradation triggered by a message queue misconfiguration introduced in a recent change.
Remediation Actions
Future Preventative Measures