Description: Snow Atlas - APAC - Service Disruption Impacting Platform Access
Timeframe: October 27, 2024, 4:07 PM PDT to October 27, 2024, 8:48 PM PDT
Incident Summary
On Sunday, October 27th at 4:07 PM PDT, the Snow Atlas platform experienced a significant service disruption affecting users in the APAC region. Users encountered "503 Service Temporarily Unavailable" errors while attempting to access various features, including the Computers and Applications overviews. This issue impacted general access across multiple areas of the platform in the APAC region, while other regions remained unaffected.
During the incident investigation, we observed that several Kubernetes containers providing customer-facing UI and data services were stuck in a restarting state due to connectivity issues with the messaging server. The primary cause was traced to a Kubernetes node whose kubelet service was reporting errors. This node failure caused the messaging server container running on that node to shut down, and the container did not restart on another available node.
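For illustration, the following is a minimal diagnostic sketch, using the Python kubernetes client, of the kind of checks involved in spotting restart-looping containers and a failing kubelet. The namespace and restart threshold are assumptions for illustration, not actual Snow Atlas values.

# Diagnostic sketch: find restart-looping containers and nodes whose kubelet
# is reporting problems. Namespace and restart threshold are illustrative only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

# Flag pods whose containers are stuck restarting.
for pod in core.list_namespaced_pod("atlas-apac").items:  # hypothetical namespace
    for status in (pod.status.container_statuses or []):
        if status.restart_count > 3:
            print(f"{pod.metadata.name}/{status.name}: {status.restart_count} restarts "
                  f"on node {pod.spec.node_name}")

# Check the Ready condition reported by each node's kubelet; a status of False or
# Unknown points at a kubelet-level failure rather than an application fault.
for node in core.list_node().items:
    ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
    if ready is None or ready.status != "True":
        print(f"Node {node.metadata.name} is not Ready: "
              f"{getattr(ready, 'reason', 'no Ready condition reported')}")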
A manual rolling restart was performed to restore functionality across most pods, allowing services to resume for the majority of customers by 8:48 PM PDT. Following the rolling restart, service was restored for all customers except one, whose issue was attributed to a configuration discrepancy. Our teams restored that customer's database from the full backup, which resolved the issue.
Root Cause
The incident was caused by a failure of the kubelet service on a Kubernetes node, which caused the messaging server container on that node to shut down and remain inactive. The failure also prevented the messaging server container from self-recovering and automatically restarting on an available node. Manual intervention was required: a rolling restart was performed, restoring functionality across most pods. We are currently awaiting an explanation from the managed Kubernetes service provider on why the kubelet service did not behave as intended.
Remediation Actions
· Manual Rolling Restart of Kubernetes Cluster: A manual rolling restart was performed to restore functionality across most pods, allowing services to resume for the majority of customers; an illustrative sketch of this step and the forced removal below appears after this list.
· Forced Removal of Broken Messaging Server Instance: The affected messaging server instance, which did not restart as expected, was forcibly removed, enabling other replicas to continue operating without disruption.
· Database Restoration for Impacted Customer: A database backup was restored for the remaining affected customer, resolving the access issue caused by a configuration discrepancy.
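The sketch below illustrates, under assumed deployment, pod, and namespace names, how the rolling restart and the forced removal described above can be performed with the Python kubernetes client. It is a simplified illustration, not the exact procedure our teams ran.

# Sketch of the two manual interventions, using the Python kubernetes client.
# Deployment, pod, and namespace names are placeholders.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

# 1) Rolling restart: patch the pod template with a restart annotation, the same
#    mechanism `kubectl rollout restart` relies on.
restart_patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment("ui-frontend", "atlas-apac", restart_patch)  # hypothetical names

# 2) Forced removal of the stuck messaging server pod so the healthy replicas keep
#    serving (comparable to `kubectl delete pod --force --grace-period=0`).
core.delete_namespaced_pod(
    "messaging-server-0", "atlas-apac",  # hypothetical pod and namespace
    grace_period_seconds=0,
    body=client.V1DeleteOptions(grace_period_seconds=0),
)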
Future Preventative Measures
· Messaging Server Service Monitoring Enhancements: We are working to improve monitoring for the messaging server services, implementing automated alerts to notify support teams if one or more containers fail to operate as expected. This will enable quicker detection and intervention; an illustrative sketch of such a check appears after this list.
· Messaging Server Service Deprecation: The messaging server service that failed to recover autonomously is already scheduled for deprecation, with most systems dependent on it already migrated to alternative services.
· Support Tickets with Microsoft Azure: Support tickets have been raised with Microsoft Azure to investigate why the affected node was not automatically removed despite issues being flagged, as well as other key aspects of the incident.
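As a sketch of the planned monitoring enhancement, the following loop flags messaging server containers that are not ready or are crash-looping. The namespace, label selector, and notify() hook are placeholders; a production setup would more likely plug into the existing alerting stack rather than a standalone script.

# Sketch of an automated check that alerts when messaging server containers are
# unhealthy. Namespace, label selector, and notify() are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def notify(message: str) -> None:
    # Placeholder for a real alerting integration (pager, chat webhook, etc.).
    print(f"ALERT: {message}")

while True:
    pods = core.list_namespaced_pod(
        "atlas-apac", label_selector="app=messaging-server"  # hypothetical values
    )
    for pod in pods.items:
        for status in (pod.status.container_statuses or []):
            waiting = status.state.waiting if status.state else None
            if not status.ready or (waiting and waiting.reason == "CrashLoopBackOff"):
                notify(f"{pod.metadata.name}/{status.name} unhealthy "
                       f"(ready={status.ready}, restarts={status.restart_count})")
    time.sleep(60)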