Description: Cloud Management Platform - NAM - Self-Service Shard 3 & 4 - Platform Inaccessible
Timeframe: May 26, 2025, 6:14 AM PDT to May 26, 2025, 8:58 AM PDT
Incident Summary
On Monday, May 26, 2025, at 6:14 AM PDT, our teams detected a service disruption impacting the Self-Service and Cloud Management Platforms in the North America (NAM) region. Customers hosted on Shard 3 and Shard 4 were unable to access the affected platforms.
Initial investigation revealed connectivity issues between internal servers and the backend service cluster. Early attempts to restore the services were unsuccessful due to persistent load balancer issues. Further investigation showed that newly launched worker nodes were not being automatically registered with the load balancer’s target groups, which left backend services unreachable and caused the outage for customers on Shard 3 and Shard 4. The engineering team implemented an immediate fix by manually registering all worker nodes with the load balancer, restoring platform access by 8:58 AM PDT. Some minor slowness was observed afterwards as the system stabilized.
Root Cause
The root cause of the incident was a target registration failure at the cloud provider’s Load Balancer (LB) level: the LB’s target limit quota had been exceeded, so newly launched worker nodes could not be registered with its target groups.
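For context, a check along the following lines can confirm this condition. It is a minimal sketch that assumes the provider is AWS (Elastic Load Balancing v2) accessed via boto3; the load balancer ARN and the limit name are illustrative assumptions, not details from this incident.

    # Minimal sketch, assuming AWS/boto3; the ARN below is a placeholder.
    import boto3

    elbv2 = boto3.client("elbv2")

    # Account-level ELBv2 limits; the per-load-balancer target limit is among them.
    limits = {l["Name"]: int(l["Max"]) for l in elbv2.describe_account_limits()["Limits"]}
    target_limit = limits.get("targets-per-application-load-balancer")  # assumed limit name

    # Count targets currently registered across the load balancer's target groups.
    lb_arn = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:loadbalancer/app/EXAMPLE/ID"  # placeholder
    groups = elbv2.describe_target_groups(LoadBalancerArn=lb_arn)["TargetGroups"]
    registered = sum(
        len(elbv2.describe_target_health(TargetGroupArn=g["TargetGroupArn"])["TargetHealthDescriptions"])
        for g in groups
    )

    print(f"Registered targets: {registered} of limit {target_limit}")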
Remediation Actions
Immediate Actions:
· Manual Registration of Worker Nodes: Added all active worker nodes to the load balancer’s target groups to restore connectivity (see the illustrative sketch following this list).
· Restoration of Backend Service Connectivity: Ensured that backend services on Shard 3 and Shard 4 became reachable after re-establishing the target group associations.
· Quota Increase Request to Cloud Service Provider: Contacted the cloud service provider to increase the load balancer target quota; the increase was approved and applied without delay.
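For illustration, the manual registration step could resemble the sketch below, assuming the platform sits behind an AWS Application Load Balancer and uses boto3; the target group ARN and instance IDs are hypothetical placeholders rather than values from the incident.

    # Minimal sketch, assuming AWS/boto3; all identifiers are placeholders.
    import boto3

    elbv2 = boto3.client("elbv2")

    target_group_arn = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/EXAMPLE/ID"  # placeholder
    worker_node_ids = ["i-0aaa1111bbbb2222c", "i-0ddd3333eeee4444f"]  # placeholder instance IDs

    # Register the worker nodes that automation failed to attach to the target group.
    elbv2.register_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id} for instance_id in worker_node_ids],
    )

    # Confirm the targets pass health checks before declaring connectivity restored.
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    for description in health["TargetHealthDescriptions"]:
        print(description["Target"]["Id"], description["TargetHealth"]["State"])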
Permanent Solutions:
· Traffic Optimization: Removed unnecessary load from the load balancer to make better use of the available target slots and prevent reaching the quota limit.
· Proactive Quota Monitoring: Implemented monitoring and alerting to detect when the number of registered targets on a load balancer falls below expected levels (a sketch follows this list).
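A monitoring check of roughly this shape could back the alerting described above. It is a sketch under the same AWS/boto3 assumption; the threshold, metric namespace, and target group ARN are illustrative only.

    # Minimal sketch, assuming AWS/boto3; threshold and names are illustrative.
    import boto3

    elbv2 = boto3.client("elbv2")
    cloudwatch = boto3.client("cloudwatch")

    target_group_arn = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/EXAMPLE/ID"  # placeholder
    EXPECTED_MIN_HEALTHY = 8  # illustrative threshold for this target group

    # Count targets that are currently registered and passing health checks.
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    healthy = sum(
        1 for d in health["TargetHealthDescriptions"]
        if d["TargetHealth"]["State"] == "healthy"
    )

    # Publish the count so an alarm can fire when it drops below the expected level.
    cloudwatch.put_metric_data(
        Namespace="Platform/LoadBalancer",  # illustrative namespace
        MetricData=[{"MetricName": "HealthyRegisteredTargets", "Value": healthy, "Unit": "Count"}],
    )

    if healthy < EXPECTED_MIN_HEALTHY:
        print(f"ALERT: only {healthy} healthy targets (expected at least {EXPECTED_MIN_HEALTHY})")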
Future Preventative Measures
· Proactive Monitoring: Implemented alerting to notify when the number of available targets on the LB drops below the expected threshold.
· Documentation Updates: Update internal documentation to reflect best practices and lessons learned.
· Team Retrospective: Conduct a cross-functional review to integrate findings into operational procedures.