Description: Cloud Management Platform – 500 Internal Server Error during scheduled release
Timeframe: January 24, 2025, at 11:50 PM PST to January 25, 2025, at 12:33 AM PST
Incident Summary
On Friday, January 24, 2025, at 11:50 PM PST, our monitoring alerts detected an issue affecting our Cloud Management Platform. Some customers may have experienced intermittent API call failures returning "500 - Internal Server Error."
During the investigation, the technical team identified an unresponsive server as the root cause. To fully restore connectivity and ensure stability, the team replaced the server as a permanent fix. The issue was fully resolved by January 25, 2025, at 12:33 AM PST.
Root Cause
The root cause was a loss of connectivity between the unresponsive server and the shard instance. This connectivity issue affected a specific API call but did not cause a system-wide outage. Our alerting systems detected the issue promptly and notified our teams.
Remediation Actions
· Connectivity Restoration: Restored connectivity between the replacement server and the shard instance.
· Server Replacement: As an additional precaution, replaced the unresponsive server again to ensure stability.
· Documentation Update: Updated internal documentation to include steps for verifying connectivity after server replacements and during server unresponsiveness. This ensures future issues of this nature are proactively addressed.
Future Preventative Measures
Connectivity Validation:
· Implement checks to verify connectivity between servers and shard instances after server replacements.
· Integrate these checks into the standard server replacement process.
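The connectivity check described above could be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: the function names `can_connect` and `unreachable_shards` and the shard host/port list are hypothetical, and the sketch assumes that a successful TCP connection from the replacement server to each shard instance is an acceptable first signal of connectivity. It does not verify application-level health.

```python
import socket
from typing import Iterable, List, Tuple

# Hypothetical helper: returns True if a TCP connection to (host, port)
# can be opened within `timeout` seconds, False on any socket error.
def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False

# Hypothetical helper: given the shard endpoints a server must reach,
# return the ones that could NOT be reached. An empty list means the
# post-replacement check passed.
def unreachable_shards(
    shards: Iterable[Tuple[str, int]], timeout: float = 3.0
) -> List[Tuple[str, int]]:
    return [(host, port) for host, port in shards
            if not can_connect(host, port, timeout)]
```

Run from the newly replaced server as the last step of the replacement procedure, a non-empty result from `unreachable_shards` would indicate that the server should not be put back into service until connectivity is restored.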
Load Balancer Monitoring: