Snow Atlas - APAC, UKS & WEU - Service Disruption Affecting SAM Core

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Snow Atlas - APAC, US, UK South & West Europe - SAM Core Errors

Timeframe: December 1, 2024, 4:35 PM PST to December 1, 2024, 7:49 PM PST

Incident Summary

On Sunday, December 1, 2024, at 4:35 PM PST, customers reported a recurrence of a known issue with SAM Core functionality. The incident affected customers in the APAC, US, West Europe, and UK South regions.

Initial investigation revealed that errors were limited to specific pages. The issue was traced back to application restarts triggered by automated health checks. These restarts occurred when the database server appeared unavailable due to intermittent slow or unresponsive connections between the web services and the database servers. The database servers themselves remained operational and did not log any errors during the incident.
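The restart behavior described above can be illustrated with a minimal sketch of a health check that tolerates isolated probe failures. This is not the Snow Atlas implementation: the `HealthTracker` class, its `record` method, and the threshold of 3 are hypothetical names and values chosen only to show how a consecutive-failure threshold separates intermittent timeouts from a sustained outage.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Counts consecutive failed database probes (hypothetical helper)."""

    failure_threshold: int = 3  # assumed value, not taken from the report
    consecutive_failures: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True if a restart is warranted."""
        if probe_ok:
            # Any successful probe resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


tracker = HealthTracker(failure_threshold=3)

# Intermittent timeouts interleaved with successes never reach the threshold.
intermittent = [tracker.record(ok) for ok in [False, True, False, False, True]]

# A sustained outage reaches the threshold on the third failed probe.
sustained = [tracker.record(ok) for ok in [False, False, False]]
```

With a threshold of 1, every transient timeout would mark the application unhealthy, which matches the failure pattern the report describes. A higher threshold delays detection of a genuine outage, so the value is a trade-off rather than a fixed recommendation.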

To mitigate the issue, the SRE team restarted the affected services at 6:34 PM PST, which resolved the problem for most customers. By 7:49 PM PST, service had been restored for all customers except a few tenants affected by scheduled maintenance. Configuration updates were applied to resolve those tenants' issues.

Root Cause

Upon investigation, the root cause was identified as a network issue on the Kubernetes nodes in the cloud service, which caused intermittent connectivity problems between the SAM Core application and the database servers. This led to:

· High Rate of Connection Timeouts: The network issues disrupted the application’s ability to reliably connect to the database servers, even though the servers themselves remained operational with no logged errors.

· Health Check Triggers: Automated health checks detected these transient connectivity issues and incorrectly flagged the application as unhealthy, resulting in unnecessary restarts.

Remediation Actions

· Targeted Resolution: SRE restarted the services for affected tenants, resolving the reported issues and restoring functionality for those users.

· Preventative Restart: To reduce the risk of further disruption, SRE proactively restarted the services for all tenants across all production clusters.

Future Preventative Measures

· Enhanced Monitoring: Implement more granular monitoring for network health and database connectivity to proactively detect similar issues.

· Resilience Improvements: Our technical teams have updated the logic and exception handling during application startup, reducing susceptibility to transient database unavailability.

· Platform Migration: Migrated the service to a different platform as a permanent fix to eliminate the root cause and avoid the recurrence of similar connection issues.
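
As an illustration of the resilience improvement listed above, the following is a minimal sketch of retrying a database connection with exponential backoff during startup. The report does not describe the actual change, so this shows one common approach only: `connect_with_retry`, `flaky_connect`, the attempt count, and the delays are all hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,          # assumed value
    base_delay: float = 0.5,    # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated dependency: fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_connect() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return "connected"


# Record the requested delays instead of actually sleeping.
delays: List[float] = []
result = connect_with_retry(flaky_connect, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test. In a real service the default `time.sleep` would apply, and only errors known to be transient should be retried.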

Posted Dec 11, 2024 - 00:19 PST

Resolved

The issue has been resolved following a service restart. Our team will continue to monitor the platform closely for any residual issues.

Posted Dec 01, 2024 - 21:15 PST

Monitoring

Our team has restarted the service, which has resolved the issue for the majority of our customers. We are continuing to monitor the results to ensure full resolution.

Posted Dec 01, 2024 - 20:29 PST

Identified

The issue has been identified, and our team is implementing restorative actions. We are actively monitoring the services to ensure stability.

Posted Dec 01, 2024 - 19:25 PST

Update

We are continuing to investigate this issue.

Posted Dec 01, 2024 - 19:08 PST

Investigating

Incident Description: We are currently investigating an issue affecting Snow Atlas customers, so far identified in the APAC & WEU regions. Impacted customers may experience errors or limited functionality across various sections of the SAM Core interface. Further assessment is ongoing to determine if other regions are affected.

Priority: P2

Restoration Activity: Our technical teams are actively engaged and working to identify the root cause and resolve the issue. Further updates will be provided as progress is made.

Posted Dec 01, 2024 - 18:11 PST

This incident affected: Snow Atlas (Snow Atlas - Australia, Snow Atlas - Europe, Snow Atlas - UK South).