Cloud Management Platform - NAM - Self-Service Shard 3 & 4 - Platform Currently Inaccessible

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Cloud Management Platform - NAM - Self-Service Shard 3 & 4 - Platform Inaccessible

Timeframe: May 26, 2025, 6:14 AM PDT to May 26, 2025, 8:58 AM PDT

Incident Summary

 

On Monday, May 26, 2025, at 6:14 AM PDT, our teams detected a service disruption impacting both the Self-Service and Cloud Management Platform in the North America (NAM) region. Customers hosted on Shard 3 and Shard 4 were unable to access the affected platforms.

Initial investigation revealed connectivity issues between internal servers and the backend service cluster. The first attempts to restore the services were unsuccessful due to persistent load balancer issues. Further investigation showed that newly launched worker nodes were not being automatically registered with the load balancer’s target groups, which left backend services unreachable and caused the outage for customers on Shard 3 and Shard 4. The engineering team implemented an immediate fix by manually adding all worker nodes to the load balancer’s target groups, successfully restoring platform access by 8:58 AM PDT. Some minor slowness was observed afterwards as the system stabilized.

Root Cause

 

The root cause of the incident was a target registration failure at the cloud service load balancer (LB) level, specifically because a target limit quota had been exceeded.

  • New worker nodes launched as part of autoscaling were not automatically registered with the load balancer’s target groups.
  • This prevented backend services from receiving traffic, leading to an outage on both Shard 3 and Shard 4.
  • The issue was caused by a quota limitation at the cloud service infrastructure level; a sketch of how such a registration gap can be detected follows this list.
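
The report does not identify the cloud provider or its APIs. Purely as illustration, the following is a minimal sketch of how such a registration gap could be detected, assuming an AWS-style environment with boto3; the target group ARN and Auto Scaling group name are hypothetical placeholders.

    # Sketch only: compare the worker nodes the Auto Scaling group reports in service
    # against the targets actually registered with the load balancer's target group.
    # Assumes an AWS-style setup with boto3; identifiers are placeholders.
    import boto3

    TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/shard3-backend/..."  # placeholder
    AUTO_SCALING_GROUP = "shard3-worker-asg"  # placeholder

    elbv2 = boto3.client("elbv2")
    autoscaling = boto3.client("autoscaling")

    # Worker nodes that autoscaling believes are in service.
    asg = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[AUTO_SCALING_GROUP]
    )["AutoScalingGroups"][0]
    launched = {i["InstanceId"] for i in asg["Instances"]}

    # Targets actually registered with the load balancer's target group.
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    registered = {t["Target"]["Id"] for t in health["TargetHealthDescriptions"]}

    # Nodes launched by autoscaling but never registered as targets -- the gap
    # that left backend services unreachable in this incident.
    missing = launched - registered
    if missing:
        print("Unregistered worker nodes:", sorted(missing))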

Remediation Actions

 

Immediate Action:

 

  • Manual Registration of Worker Nodes: Added all active worker nodes to the load balancer’s target groups to restore connectivity (a sketch of this step follows this list).

  • Restoration of Backend Service Connectivity: Ensured that backend services on Shard 3 and Shard 4 became reachable after re-establishing the target group associations.

  • Quota Increase Request to Cloud Service Provider: Contacted the cloud service provider to increase the load balancer target group quota, which was approved and applied without delay.
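
As a rough illustration of the manual registration step (first bullet above), and again assuming an AWS-style target group with boto3 rather than the provider's actual tooling, re-adding the missing worker nodes might look like the sketch below; identifiers are placeholders, and a call like this would be rejected until the target quota increase was applied.

    # Sketch only: manually register worker nodes with the load balancer's target group.
    # Assumes boto3 and the "missing" node set from the detection sketch; values are placeholders.
    import boto3

    TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/shard3-backend/..."  # placeholder
    missing = ["i-0123456789abcdef0", "i-0fedcba9876543210"]  # placeholder worker node IDs

    elbv2 = boto3.client("elbv2")
    elbv2.register_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": node_id} for node_id in missing],
    )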

 

Permanent Solution:

 

  • Traffic Optimization: Removed unnecessary load from the load balancer to better utilize available slots and prevent reaching quota limits.

  • Proactive Quota Monitoring: Implemented monitoring and alerting to detect when the number of targets on a load balancer falls below expected levels (one possible shape for this alerting is sketched below).
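
The report does not name the monitoring tooling used. As one possible shape for the target monitoring above, assuming AWS CloudWatch via boto3, an alarm on the healthy-target count might be configured roughly as follows; the alarm name, dimensions, threshold, and notification topic are placeholders.

    # Sketch only: alert when the number of healthy targets behind the load balancer
    # drops below the expected level. Assumes AWS CloudWatch; all values are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="shard3-backend-healthy-targets-low",
        Namespace="AWS/ApplicationELB",
        MetricName="HealthyHostCount",
        Dimensions=[
            {"Name": "TargetGroup", "Value": "targetgroup/shard3-backend/abc123"},
            {"Name": "LoadBalancer", "Value": "app/shard3-lb/def456"},
        ],
        Statistic="Minimum",
        Period=60,
        EvaluationPeriods=3,
        Threshold=4,  # expected minimum number of registered, healthy worker nodes
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
    )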

Future Preventative Measures 

 

  • Proactive Monitoring: Implemented alerting to notify when the number of available targets goes below the threshold on the LB.

  • Documentation Updates: Update internal documentation to reflect best practices and lessons learned.

  • Team Retrospective: Conduct a cross-functional review to integrate findings into operational procedures.

Posted Jun 05, 2025 - 05:02 PDT

Resolved

The platform is now stable and fully accessible, with no further issues observed. Our teams implemented an interim solution that successfully restored access. A permanent fix will be implemented during a future maintenance window, which will be communicated in advance. This incident has been resolved.
Posted May 26, 2025 - 09:47 PDT

Monitoring

Access to the platform remains available through the temporary solution applied earlier. We are actively monitoring performance, as some slowness may still be observed. Work on the long-term resolution continues in parallel.
Posted May 26, 2025 - 09:15 PDT

Update

Access to the platform has been restored through a temporary solution. Customers may continue to experience some slowness during this period. Work on the long-term resolution remains in progress.
Posted May 26, 2025 - 09:04 PDT

Update

Our teams have identified issues with traffic routing and are progressing with mitigation efforts, including the setup of an alternate environment. The platform remains inaccessible at this time, and we will provide further updates as work continues.
Posted May 26, 2025 - 08:41 PDT

Identified

Our teams have identified an underlying infrastructure issue and are actively working on a mitigation plan, which includes routing traffic through an alternate path. Restoration efforts are ongoing, and we are closely monitoring the situation. Further updates will be shared as progress continues.
Posted May 26, 2025 - 07:38 PDT

Investigating

Issue Description: We are currently investigating an issue impacting both the Self-Service and Cloud Management Platform in the North America (NAM) region. Customers on Shard 3 and Shard 4 may be unable to access these services at this time.

Priority: P1

Restoration Activity: Our technical teams are actively engaged and assessing the situation. We are exploring potential solutions to restore functionality as quickly as possible.
Posted May 26, 2025 - 06:14 PDT
This incident affected: Legacy Cloud Management (Cloud Management Dashboard - Shard 3, Cloud Management Dashboard - Shard 4, Self-Service - Shard 3, Self-Service - Shard 4).