Description: Cloud Management Platform - Self-Service Shard 3 & Shard 4 - Service Degradation
Timeframe: March 14, 2025, 6:17 AM PST to March 14, 2025, 7:51 AM PST
Incident Summary
On March 14, 2025, at 6:17 AM PST, monitoring alerts notified our teams of an issue affecting customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 and Shard 4. Due to this issue, affected customers might have been unable to use self-service capabilities, such as creating or accessing Cloud Apps. Additionally, customers on Shard 4 may have experienced intermittent login issues with the CM & RightScale platform. Our teams promptly began investigating and found a Front-End server in an unresponsive state. While replacing that server, they identified a second server in a similar state. The second server was replaced at 7:51 AM PST, and our teams continued to monitor the platform for stability.
Root Cause
Initial investigations revealed that a Front-End server had become unresponsive, disrupting self-service functionality. While replacing the affected server, which was completed at 7:27 AM PST, our team identified a second unresponsive server. This second server was replaced at 7:51 AM PST. Logs confirmed system stability after this replacement.
Remediation Actions
· Identified and replaced the unresponsive Front-End server at 7:27 AM PST.
· Detected and replaced an additional unresponsive server at 7:51 AM PST.
· Verified logs and system performance to confirm service restoration.
· Monitored the platform to ensure continued stability.
Future Preventative Measures
· Review existing automated health checks for Front-End servers to detect and mitigate failures proactively.
· Enhance logging and alerting mechanisms for quicker identification of unresponsive components.
· Conduct a post-mortem review to assess improvements in redundancy and failover strategies.
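
As an illustration of the first measure above, the following is a minimal sketch of a health check that probes every Front-End server on each pass and flags any server that fails several consecutive probes. It is not the platform's actual implementation: the names HealthTracker and sweep, the server labels, and the failure threshold are all hypothetical, and the probe is passed in as a plain function so the logic can be exercised without a network. Probing the whole pool on each pass means a second unresponsive server would be flagged in the same sweep as the first, rather than being noticed during a replacement.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class HealthTracker:
    """Counts consecutive failed probes per server (hypothetical helper)."""

    failure_threshold: int = 3  # assumed value, not taken from the incident
    failures: Dict[str, int] = field(default_factory=dict)

    def record(self, server: str, healthy: bool) -> bool:
        """Record one probe result; return True if the server should be flagged."""
        if healthy:
            # A successful probe resets the consecutive-failure count.
            self.failures[server] = 0
            return False
        self.failures[server] = self.failures.get(server, 0) + 1
        return self.failures[server] >= self.failure_threshold


def sweep(
    tracker: HealthTracker,
    servers: Iterable[str],
    probe: Callable[[str], bool],
) -> List[str]:
    """Probe every server once and return those at or over the threshold."""
    flagged = []
    for server in servers:
        try:
            healthy = probe(server)
        except Exception:
            # A probe that raises (timeout, connection error) counts as a failure.
            healthy = False
        if tracker.record(server, healthy):
            flagged.append(server)
    return flagged
```

In use, a scheduler would call sweep at a fixed interval with a real probe (for example, an HTTP request with a short timeout) and pass the flagged servers to whatever replacement or alerting process is in place. Requiring several consecutive failures is one common way to avoid replacing a server over a single transient error, at the cost of slightly slower detection.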