Description: Snow Atlas - APAC, US, UK South & West Europe - SAM Core Errors
Timeframe: December 1, 2024, 4:35 PM PDT to December 1, 2024, 7:49 PM PDT
Incident Summary
On Sunday, December 1st, 2024, at 4:35 PM PDT, customers reported a recurrence of a known issue with the SAM Core functionality. The incident impacted customers across the APAC, US, West Europe, and UK South regions.
Initial investigation revealed that errors were limited to specific pages. The issue was traced back to application restarts triggered by automated health checks. These restarts occurred when the database server appeared unavailable due to intermittent slow or unresponsive connections between the web services and the database servers. The database servers themselves remained operational and did not log any errors during the incident.
To mitigate the issue, the SRE team restarted the affected services at 6:34 PM PDT, which resolved the problem for most customers. By 7:49 PM PDT, service was restored for all customers except a few tenants affected by scheduled maintenance; configuration updates were applied to resolve those tenants' issues.
Root Cause
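The root cause was identified as a network issue on the Kubernetes nodes in the cloud service, which caused intermittent connectivity problems between the SAM Core application and the database servers. As a result, the automated health checks saw the database servers as unavailable and triggered application restarts, which produced the errors on the affected pages.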
Remediation Actions
· Targeted Resolution: SRE restarted the services for affected tenants, resolving the reported issues and restoring functionality for those users.
· Preventative Restart: To ensure no further disruptions, SRE proactively restarted the services for all tenants across all production clusters.
Future Preventative Measures
· Enhanced Monitoring: Implement more granular monitoring for network health and database connectivity to proactively detect similar issues.
· Resilience Improvements: Our technical teams have updated the application startup logic and exception handling, reducing susceptibility to transient database unavailability.
· Platform Migration: The service has been migrated to a different platform as a permanent fix, eliminating the root cause and preventing recurrence of similar connection issues.
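The Resilience Improvements item above can be illustrated with a minimal sketch of startup logic that tolerates a briefly unreachable database. This is not the SAM Core implementation: the names connect_with_retry and TransientDatabaseError, and the attempt and delay values, are hypothetical and chosen only for illustration. The sketch assumes the database client raises a distinguishable error on timeouts or dropped connections, and it retries with exponential backoff instead of failing on the first error.

```python
import random
import time


class TransientDatabaseError(Exception):
    """Hypothetical error standing in for a timeout or dropped connection."""


def connect_with_retry(connect, max_attempts=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call `connect` until it succeeds or `max_attempts` is exhausted.

    `connect` is any zero-argument callable that returns a connection or
    raises TransientDatabaseError. All defaults are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except TransientDatabaseError:
            if attempt == max_attempts:
                # Give up only after repeated failures, so a single slow
                # connection does not abort application startup.
                raise
            # Exponential backoff (0.5s, 1s, 2s, ...) capped at max_delay,
            # plus up to 10% jitter so restarted instances do not retry
            # in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
```

A caller would pass its own connection function, for example connect_with_retry(open_database), where open_database is whatever the application uses to open its database connection. Injecting the sleep function keeps the retry logic testable without real delays.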