Description: Cloud Management Platform - Self-Service Shard 3 & Shard 4 - Service Degradation
Timeframe: March 16, 2025, 9:21 AM PST to March 16, 2025, 10:38 AM PST
Incident Summary
On March 16, 2025, at 9:21 AM PST, monitoring alerts notified us of a recurrence of the issue affecting customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 and Shard 4. Due to this issue, affected customers might have been unable to use self-service capabilities, such as creating or accessing Cloud Apps. Additionally, customers on Shard 4 may have experienced intermittent login issues with the CM & RightScale platforms. Our teams promptly began investigating and found a front-end server in an unresponsive state. The server was replaced at 10:24 AM PST, and we continued to monitor the platform for stability. After all validations were completed, the service was declared restored at 10:38 AM PST.
Root Cause
Investigation confirmed that, as in the occurrence on March 14, 2025, a Front-End server had become unresponsive, which impacted self-service functionality. The server was replaced, and additional validation checks were performed to ensure system stability before full service restoration.
Remediation Actions
· Identified and replaced the unresponsive Front-End server.
· Conducted thorough validation checks to confirm stability before declaring resolution.
· Monitored system logs and platform performance post-recovery.
Future Preventative Measures
· Review the existing automated health checks for Front-End servers so that failures are detected and mitigated proactively.
· Strengthen redundancy measures to minimize the impact of similar failures in the future.
· Conduct a post-mortem review to identify improvements to redundancy and failover strategies.
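
As an illustration of the health-check measure above, the following is a minimal sketch of how an unresponsive Front-End server might be flagged automatically. It is not the platform's actual monitoring. The names `http_probe` and `HealthTracker`, the use of an HTTP endpoint as the probe target, and the threshold of three consecutive failures are all assumptions made for illustration.

```python
import http.client
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Hypothetical probe: True if the URL answers with HTTP 2xx in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts derive from OSError;
        # malformed or truncated responses raise HTTPException.
        return False


class HealthTracker:
    """Flags a server once it fails a given number of probes in a row."""

    def __init__(self, probe: Callable[[str], bool], failure_threshold: int = 3):
        self.probe = probe
        self.failure_threshold = failure_threshold  # assumed value, tune as needed
        self.consecutive_failures = 0

    def check(self, target: str) -> bool:
        """Run one probe; return True if the target should be flagged."""
        if self.probe(target):
            # Any successful probe resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

Requiring several consecutive failures before flagging a server is one way to avoid replacing a server because of a single transient timeout. The appropriate threshold and probe interval would depend on how quickly a failure needs to be detected.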