Description: Snow Atlas - APAC, UK South & West Europe - SAM Core Errors
Timeframe: November 17, 2024, at 11:04 PM PDT to November 18, 2024, at 9:45 AM PDT
Incident Summary
On Sunday, November 17, 2024, at 11:04 PM PDT, we received customer reports of issues with SAM Core functionality. The impact initially appeared confined to tenants in the APAC region, but subsequent reports showed that customers in the West Europe and UK South regions were also affected. Our initial investigation determined that the errors were limited to specific pages.
During the investigation, our engineering team found that errors for one affected customer began after an application restart that occurred while the Database Server appeared unavailable; such restarts are typically triggered by health checks. Similar patterns were observed for other tenants. The web service in all production clusters showed a high rate of connection timeout errors to the database servers. The database servers themselves remained operational and logged no errors, but connections to them were intermittently slow or unresponsive. The slowness cleared on its own over time, but persistent timeouts during the outages led to application restarts.
To restore service, SRE restarted the affected services on November 18, 2024, at 3:51 AM PDT, successfully resolving the issue for impacted tenants. After extended monitoring across all regions, the issue was declared resolved at 9:45 AM PDT.
Root Cause
Upon investigation, the root cause was identified as a network issue on the Kubernetes nodes in the cloud service. This issue caused intermittent connectivity problems between the SAM Core application and the Database Servers, resulting in a high rate of connection timeouts.
Remediation Actions
· Targeted Resolution: SRE restarted the services for affected tenants, resolving the reported issues and restoring functionality for those users.
· Preventative Restart: To ensure no further disruptions, SRE proactively restarted the services for all tenants across all production clusters.
Future Preventative Measures
· Enhanced Monitoring: Implement more granular monitoring of network health and database connectivity so that similar issues are detected proactively.
· Resilience Improvements: Collaborate with the SAM Core team to implement retry logic and exception handling during application startup, reducing susceptibility to transient database unavailability.
· Network Diagnostics: Work with our service provider to investigate the network issue and to identify and address its underlying causes, reducing the risk of recurrence.
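
As an illustration of the retry logic mentioned under Resilience Improvements, the following is a minimal sketch of a startup connection helper with capped exponential backoff and jitter. It is not the SAM Core implementation: the function name `connect_with_retry`, its parameters, and the default values are all hypothetical, and the `connect` callable stands in for whatever database driver call the application actually uses.

```python
import random
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are exhausted.

    All names and defaults here are illustrative assumptions, not
    values taken from the incident report.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many restarting instances do not reconnect in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

With a helper of this kind, a transient database timeout during startup would lead to a short wait and another attempt instead of an immediate startup failure. The error is still raised once the attempts are exhausted, so a genuinely unavailable database remains visible to health checks.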