Description: Snow Atlas - APAC - Service Disruption Impacting Platform Access
Timeframe: July 27, 2025, 5:34 PM PDT to July 27, 2025, 7:11 PM PDT
Incident Summary
On Sunday, July 27, 2025, at 5:34 PM PDT, our monitoring systems detected a service disruption impacting the Snow Atlas platform in the APAC region. Users encountered “service unavailable” errors across multiple products.
Investigation revealed that the incident was caused by the unavailability of the messaging broker service, which prevented inter-service communication.
The messaging service ran out of memory, which caused the messaging servers to shut down, and the servers did not recover on their own. Our teams manually restarted the servers, restoring normal service by 7:11 PM PDT.
Root Cause
Investigation by our technical teams determined that the central messaging broker service exhausted its available memory, triggering the shutdown of all messaging servers in the region. This resulted in a loss of inter-service communication, causing all services in APAC to fail. As the servers did not auto-recover, a manual restart was required to restore functionality.
Remediation Actions
· Manual Restart – Our teams manually restarted the messaging servers in the APAC region, restoring normal service by 7:11 PM PDT.
Future Preventative Measures
· Post-Mortem Review – Conduct a detailed post-mortem to determine the root cause of the messaging broker’s memory exhaustion and the failure of servers to auto-recover. Define and implement improvement actions based on the findings.
· Message Handling Optimization – Review and optimize message handling processes to prevent backlog-related delays during recovery.
· Enhanced Monitoring – Strengthen monitoring of the messaging service to enable earlier detection of similar failures.
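As an illustration of the kind of early-warning check the Enhanced Monitoring item describes, the sketch below classifies a broker memory reading against warning and critical thresholds so that an alert can fire before memory is fully exhausted. It is a minimal example under stated assumptions, not the monitoring actually deployed: the `MemorySample` type, the `classify_memory` function, and the 75% and 90% thresholds are all hypothetical, and how the memory figures would be read from the messaging broker is not specified in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySample:
    """One memory reading from a messaging server (hypothetical shape)."""
    used_bytes: int
    limit_bytes: int


def classify_memory(sample: MemorySample,
                    warn_ratio: float = 0.75,
                    critical_ratio: float = 0.90) -> str:
    """Return "ok", "warning", or "critical" for a memory reading.

    The default thresholds are illustrative assumptions, not values
    taken from the incident report.
    """
    if sample.limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = sample.used_bytes / sample.limit_bytes
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Example: a server using 8 GiB of a 10 GiB limit is at 80% usage.
    reading = MemorySample(used_bytes=8 * 1024**3, limit_bytes=10 * 1024**3)
    print(classify_memory(reading))
```

A "warning" result could be used to page the on-call team while the servers are still running, rather than after they have shut down.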