Snow Atlas - APAC, UK South & West Europe - SAM Core Errors
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Snow Atlas - APAC, UK South & West Europe - SAM Core Errors

Timeframe: November 17, 2024, at 11:04 PM PDT to November 18, 2024, at 9:45 AM PDT

Incident Summary

On Sunday, November 17, 2024, at 11:04 PM PDT, we received customer reports of issues with SAM Core functionality. The impact initially appeared confined to tenants in the APAC region, but subsequent reports showed that customers in the West Europe and UK South regions were also affected. Our initial investigation determined that the errors were limited to specific pages.

During the investigation, our engineering team found that errors for one affected customer began after an application restart. Such restarts are typically triggered by health checks when the Database Server appears unavailable. Similar patterns were observed for other tenants. The web service in all production clusters showed a high rate of connection timeout errors to the database servers. Although the database servers remained operational and logged no errors, connections were intermittently slow or unresponsive. The slowness cleared on its own over time, but persistent timeouts while it lasted led to application restarts.

To restore service, SRE restarted the affected services on November 18, 2024, at 3:51 AM PDT, successfully resolving the issue for impacted tenants. After extended monitoring across all regions, the issue was declared resolved at 9:45 AM PDT.

Root Cause

Upon investigation, the root cause was identified as a network issue on the Kubernetes nodes in the cloud service. This issue caused intermittent connectivity problems between the SAM Core application and the Database Servers, resulting in a high rate of connection timeouts.
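
As an illustration only, intermittent connectivity of this kind can be measured from a node with a simple TCP connect probe. The sketch below is not the tooling used during the incident; the function names `probe_tcp` and `failure_rate`, and the host and port in the usage note, are hypothetical and chosen for this example.

```python
import socket
import time
from typing import List, Optional


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the attempt
    failed or did not complete within `timeout`."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def failure_rate(results: List[Optional[float]]) -> float:
    """Fraction of probe results that failed (recorded as None)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r is None) / len(results)
```

Calling `probe_tcp("db.example.internal", 1433)` at a fixed interval (both values are placeholders) and tracking `failure_rate` over a sliding window would show a pattern like the one described above: the database itself reports no errors, while a share of connection attempts from the node time out.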

Remediation Actions

· Targeted Resolution: SRE restarted the services for affected tenants, resolving the reported issues and restoring functionality for those users.

· Preventative Restart: To ensure no further disruptions, SRE proactively restarted the services for all tenants across all production clusters.

Future Preventative Measures 

· Enhanced Monitoring: Our teams are working to implement more granular monitoring for network health and database connectivity to proactively detect similar issues.

· Resilience Improvements: We will collaborate with the SAM Core team to implement retry logic and exception handling during application startup, making the application less susceptible to transient database unavailability.

· Network Diagnostics: We will work with our service provider to investigate the network issue, identifying and addressing its underlying factors to reduce the risk of recurrence.
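
To illustrate the retry logic mentioned under Resilience Improvements, the following is a minimal sketch of retrying a database connection at startup with exponential backoff. It is not the SAM Core implementation; `connect_with_retry`, its parameters, and the default values are assumptions made for this example, and `connect` stands in for whatever function opens the real database connection.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_attempts is exhausted.

    Waits base_delay, then 2x, 4x, ... (capped at max_delay) between
    attempts. Only connection and timeout errors are retried; any other
    exception propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

With a wrapper like this, a brief period of database unavailability during startup delays the application rather than failing it outright. In practice a small random jitter is often added to each delay so that many instances restarting at once do not retry in lockstep.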

Posted Nov 27, 2024 - 21:50 PST

Resolved
The system is currently operational, and no further disruptions have been observed. The team suspects that recurring network problems impacting performance may have caused the issue.

A full root cause analysis will be conducted to identify the underlying cause and implement long-term measures to ensure stability and prevent future recurrences. A formal post-mortem report will also be shared in the coming days.
Posted Nov 18, 2024 - 09:55 PST
Update
Our teams are analyzing the issue to identify the underlying cause and implement a resolution. Efforts are ongoing, and we are monitoring the situation closely to minimize further impact. We will provide updates as more information becomes available.
Posted Nov 18, 2024 - 07:10 PST
Investigating
Our teams restarted the services, which resolved the issue for some customers. We are still investigating the issue and will provide further updates once they are available.
Posted Nov 18, 2024 - 02:46 PST
Update
We are continuing to monitor for any further issues.
Posted Nov 18, 2024 - 01:49 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 18, 2024 - 01:36 PST
Investigating
Our teams are currently investigating an issue affecting the Snow Atlas platform, where users may encounter errors when accessing SAM Core functionality. This issue impacts multiple customers in the APAC, UK South & West Europe regions, resulting in pages throwing errors or loading incorrectly for affected users within Snow Atlas SAM Core.

Priority: P2

Restoration Activity: Our technical team is actively working to resolve the issue and restore full functionality. We are monitoring the situation closely and will keep you informed of any developments.
Posted Nov 17, 2024 - 23:23 PST
This incident affected: Snow Atlas (Snow Atlas - Australia, Snow Atlas - Europe, Snow Atlas - UK South).