Flexera One - UI - NAM - UI loading issues

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - NAM - UI loading issues

Timeframe: May 3, 2025, 2:56 AM PST to May 3, 2025, 4:57 AM PST

Incident Summary

On Saturday, May 3, 2025, at 2:56 AM PST, our engineering teams detected a service disruption impacting the User Interface in the North America production environment. Customers were unable to access the platform, and some encountered HTTP 500 errors. The issue was confirmed to be isolated to production, as the staging environment remained unaffected.

The UI failures were traced to several unresponsive core APIs that were timing out due to database performance degradation. Investigation revealed that a recent configuration change had altered message queue behavior, causing messages to resurface and be reprocessed repeatedly. This led to a surge in duplicate messages, a growing backlog, and excessive load on downstream systems such as the scheduling and indexing services, all contributing to the database strain and API failures.

By 3:51 AM PST, the database service provider had performed a successful failover to stabilize the system. To prevent further impact, the affected service was temporarily paused, the queue configuration was corrected to stop repeated message processing, and the redundant entries generated during the incident were removed from the database. After completing system checks and validating the environment, services were restored and the ITAM UI returned to full functionality by 4:57 AM PST.

Root Cause

The root cause of the incident was database performance degradation triggered by a message queue misconfiguration following a recent change:

  • Messages that were not acknowledged within the timeout period were reprocessed repeatedly.
  • This resulted in message duplication, processing overload, and a buildup of redundant data in the database.
  • The database became overwhelmed, causing core APIs to time out, which led to the UI being unavailable to users.
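
The redelivery behavior described above can be illustrated with a minimal model. This is a sketch under stated assumptions, not a description of the actual system: the report does not name the queue technology, and the function name, parameters, and timing values below are hypothetical. The model assumes a message becomes visible again every `visibility_timeout` seconds until a consumer acknowledges it, and that the first acknowledgement removes the message.

```python
def count_deliveries(processing_time: float,
                     visibility_timeout: float,
                     horizon: float) -> int:
    """Count how often one message is delivered within `horizon` seconds.

    Hypothetical model: the message is first delivered at t=0 and is
    acknowledged `processing_time` seconds later. Until that acknowledgement
    arrives, the queue makes the message visible again every
    `visibility_timeout` seconds, and each time another consumer picks it up.
    """
    if visibility_timeout <= 0:
        raise ValueError("visibility_timeout must be positive")

    deliveries = 0
    t = 0.0
    # Keep redelivering until the first acknowledgement or the end of the window.
    while t < horizon and t < processing_time:
        deliveries += 1
        t += visibility_timeout
    return deliveries
```

In this model, `count_deliveries(10, 30, 600)` yields a single delivery because the acknowledgement arrives before the timeout, while `count_deliveries(90, 30, 600)` yields three deliveries of the same message. If the acknowledgement never arrives (`processing_time=float("inf")`), the message is delivered once per timeout interval for the whole window, which is the kind of duplication and backlog growth the bullets above describe.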

Remediation Actions

  • Database Failover: Coordinated and executed a failover to a stable database replica with assistance from the service provider.
  • Service Pause: Temporarily halted the affected service to prevent additional load and contain the impact.
  • Queue Reconfiguration: Updated the queue configuration to stop repeated message resurfacing and reprocessing.
  • Database Cleanup: Deleted redundant entries that had been continuously generated due to the queue misbehavior.
  • System Validation: Performed thorough system health checks to confirm the stability of all components.
  • Service Restoration: Resumed normal service operation once all validations passed.
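
As an illustration of the database cleanup step, the sketch below removes duplicate rows while keeping the earliest row for each message. It is an assumption-laden example only: the report does not state which database is used or how the redundant entries were identified, and the table `processed_events` and its columns `id`, `message_id`, and `payload` are hypothetical. SQLite is used solely so that the example is self-contained.

```python
import sqlite3


def delete_duplicate_rows(conn: sqlite3.Connection) -> int:
    """Keep the earliest row per message_id, delete the rest, return rows deleted."""
    cur = conn.execute(
        """
        DELETE FROM processed_events
        WHERE id NOT IN (
            SELECT MIN(id) FROM processed_events GROUP BY message_id
        )
        """
    )
    conn.commit()
    return cur.rowcount


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table holding duplicates left by repeated processing."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_events ("
        "id INTEGER PRIMARY KEY, message_id TEXT NOT NULL, payload TEXT)"
    )
    rows = [("m1", "a"), ("m1", "a"), ("m1", "a"), ("m2", "b"), ("m2", "b")]
    conn.executemany(
        "INSERT INTO processed_events (message_id, payload) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn
```

Calling `delete_duplicate_rows(build_demo_db())` leaves one row per `message_id`. Running the cleanup a second time deletes nothing, so the operation is safe to repeat.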

Future Preventative Measures

  • Queue Configuration Review: Enforce stricter visibility timeout settings and introduce safeguards against automatic message reprocessing loops.
  • Enhanced Monitoring & Alerting: Implement alerts for abnormal queue behavior, duplicate message spikes, and API latency thresholds.
  • Downstream Protection: Assess the downstream effects of infrastructure changes, particularly those affecting processing behavior.
  • Change Management Enhancements: Apply more rigorous validation and testing for configuration changes impacting queue and database behavior.
  • Data Integrity Checks: Establish automated checks for identifying and removing redundant or duplicate records in critical systems.
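
Several of the measures above (safeguards against reprocessing loops, duplicate-spike alerting, and duplicate detection) can be sketched together in one consumer-side guard. This is a hedged illustration rather than the implemented design: the class `GuardedConsumer`, its parameters, and the default thresholds are all hypothetical, and a production system would keep this state in durable storage rather than in memory.

```python
from typing import Callable, Dict, List, Set


class GuardedConsumer:
    """Hypothetical consumer wrapper: deduplicates by message ID, caps the
    number of receives per message, and tracks the duplicate ratio."""

    def __init__(self, handler: Callable[[str], None],
                 max_receives: int = 3, alert_ratio: float = 0.5) -> None:
        self.handler = handler
        self.max_receives = max_receives  # receives allowed before dead-lettering
        self.alert_ratio = alert_ratio    # duplicate share that should raise an alert
        self.processed: Set[str] = set()
        self.receives: Dict[str, int] = {}
        self.dead_letter: List[str] = []
        self.total = 0
        self.duplicates = 0

    def receive(self, message_id: str, body: str) -> str:
        self.total += 1
        self.receives[message_id] = self.receives.get(message_id, 0) + 1

        if message_id in self.processed:
            # Already handled once: skip the work and count the duplicate.
            self.duplicates += 1
            return "skipped_duplicate"

        if self.receives[message_id] > self.max_receives:
            # Break the redelivery loop by setting the message aside.
            if message_id not in self.dead_letter:
                self.dead_letter.append(message_id)
            return "dead_lettered"

        try:
            self.handler(body)
        except Exception:
            # Not acknowledged; the queue is expected to redeliver it.
            return "failed"

        self.processed.add(message_id)
        return "processed"

    def duplicate_ratio(self) -> float:
        return self.duplicates / self.total if self.total else 0.0

    def should_alert(self) -> bool:
        return self.duplicate_ratio() >= self.alert_ratio
```

With this guard in place, a message that resurfaces after it has been handled is skipped instead of generating further database writes, a message that keeps failing is set aside after `max_receives` attempts, and `should_alert()` provides a simple signal that a monitoring job could poll for duplicate spikes.
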
Posted May 13, 2025 - 03:11 PDT

Resolved

The service provider has implemented corrective actions on their end, which have successfully restored the affected services. Our teams have completed health checks and confirmed that the issue has been resolved.
Posted May 03, 2025 - 05:08 PDT

Update

Upon further investigation, our teams have determined that the underlying database issue may be affecting access to the Flexera One UI across the North America (NAM) region. As a result, customers may experience difficulty logging into the platform, potentially impacting access to all Flexera One applications. A high-severity case has been opened with the service provider, and we are upgrading the priority of this issue to P1.
Posted May 03, 2025 - 03:56 PDT

Identified

Our teams have identified that the issue stems from a database problem originating on our service provider’s end. A support ticket has already been opened, and we are actively coordinating with the provider to work toward a resolution. Additionally, the priority of this issue has been downgraded due to its currently limited impact.
Posted May 03, 2025 - 03:26 PDT

Investigating

Incident Description: We are currently investigating an issue affecting the User Interface (UI) of IT Asset Management (ITAM) services in the US region. Impacted users may experience problems with the UI not loading as expected.

Priority: P1

Restoration Activity: Our technical team is actively working on identifying the root cause and restoring services. Further updates will be provided as we continue our efforts to resolve the incident.
Posted May 03, 2025 - 02:57 PDT
This incident affected: Flexera One - IT Asset Management - North America (IT Asset Management - US Login Page).