Snow Atlas - APAC - Service Disruption Impacting Platform Access
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Snow Atlas - APAC - Service Disruption Impacting Platform Access

Timeframe: October 27, 2024, 4:07 PM PDT to October 27, 2024, 8:48 PM PDT

Incident Summary

On Sunday, October 27th at 4:07 PM PDT, the Snow Atlas platform experienced a significant service disruption affecting users in the APAC region. Users encountered "503 Service Temporarily Unavailable" errors while attempting to access various features, including the Computers and Applications overviews. The issue impacted general access across multiple areas of the platform in the APAC region, while other regions remained unaffected.

During the incident investigation, we observed that several Kubernetes containers providing UI and data services to customers were stuck in a restarting state because of connectivity issues with the messaging server. The primary cause was traced to a Kubernetes node whose kubelet service was reporting errors. This node failure caused the messaging server container running on that node to shut down, and the container then failed to restart on another node.
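As an illustration of this kind of triage, the sketch below uses the official Kubernetes Python client to list nodes whose kubelet is no longer reporting Ready and pods stuck in a restart loop. The "snow-atlas" namespace and the restart threshold are hypothetical placeholders, not values from the incident.

from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running in-cluster
core = client.CoreV1Api()

# Flag nodes whose kubelet has stopped reporting Ready.
for node in core.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status != "True":
            print(f"Node {node.metadata.name} not Ready: {cond.reason} - {cond.message}")

# Flag pods stuck in a restart loop, e.g. UI/data services that lost the messaging server.
for pod in core.list_namespaced_pod("snow-atlas").items:   # namespace is hypothetical
    for cs in pod.status.container_statuses or []:
        if cs.restart_count > 5:                           # threshold is illustrative
            print(f"Pod {pod.metadata.name}, container {cs.name}: "
                  f"{cs.restart_count} restarts")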

A manual rolling restart was performed to restore functionality across most pods, allowing services to resume for the majority of customers by 8:48 PM PDT. After the rolling restart, service was restored for all customers except one; the remaining outage was attributed to a configuration issue. Our teams restored that customer's database from a full backup, which resolved the issue.

Root Cause


The incident was caused by a failure of the kubelet service on one Kubernetes node, which caused the messaging server container on that node to shut down and remain inactive. This failure also prevented the messaging server container from initiating self-recovery and automatically restarting on an available node. Manual intervention was required: a rolling restart was performed, restoring functionality across most pods. We are currently awaiting an explanation from the managed Kubernetes service provider as to why the kubelet service was not behaving as intended.
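For illustration only, the following is a minimal sketch of how such a rolling restart can be driven with the official Kubernetes Python client, assuming the affected workloads are Deployments; the "snow-atlas" namespace is a hypothetical placeholder, not the actual resource name. It bumps the restartedAt annotation on the pod template, the same mechanism kubectl rollout restart uses, so pods are recreated gradually rather than all at once.

from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

NAMESPACE = "snow-atlas"   # hypothetical namespace

def rolling_restart(deployment_name: str) -> None:
    # Bumping this annotation changes the pod template hash, so the Deployment
    # controller replaces pods one batch at a time (a rolling restart).
    patch = {"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
    }}}}}
    apps.patch_namespaced_deployment(deployment_name, NAMESPACE, body=patch)

for deploy in apps.list_namespaced_deployment(NAMESPACE).items:
    rolling_restart(deploy.metadata.name)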

Remediation Actions


·        Manual Rolling Restart of Kubernetes Cluster: A manual rolling restart was performed to restore functionality across most pods, allowing services to resume for the majority of customers.

·        Forced Removal of Broken Messaging Server Instance: The affected messaging server instance, which did not restart as expected, was forcibly removed, enabling the other replicas to continue operating without disruption (a minimal sketch of this step follows this list).

·        Database Restoration for Impacted Customer: A database backup was restored for the remaining affected customer, resolving the access issue caused by a configuration discrepancy.
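A minimal sketch of the forced-removal step referenced above, using the official Kubernetes Python client; the pod and namespace names are hypothetical placeholders. Force-deleting the stuck pod lets its controller schedule a replacement on a healthy node.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Force-delete the pod that is stuck on the failed node.
core.delete_namespaced_pod(
    name="messaging-server-0",     # hypothetical pod name
    namespace="snow-atlas",        # hypothetical namespace
    grace_period_seconds=0,        # skip the graceful shutdown that the dead node cannot perform
    body=client.V1DeleteOptions(propagation_policy="Background"),
)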

Future Preventative Measures 


·        Messaging Server Service Monitoring Enhancements: We are working to improve monitoring for the messaging server services, implementing automated alerts to notify support teams if one or more containers fail to operate as expected. This will enable quicker detection and intervention (a monitoring sketch follows this list).

·        Messaging Server Service Deprecation: The messaging server service that failed to recover autonomously is already scheduled for deprecation, with most systems dependent on it already migrated to alternative services.

·        Support Tickets with Microsoft Azure: Support tickets have been raised with Microsoft Azure to investigate why the affected node was not automatically removed despite issues being flagged, as well as other key aspects of the incident.
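As a sketch of the monitoring enhancement described in the first item above, the loop below polls messaging-server container statuses and raises an alert when a container is not ready or is crash-looping. The namespace, label selector, polling interval, and alert hook are hypothetical placeholders.

import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def alert(message: str) -> None:
    # Placeholder: in practice this would page the support team (e.g. via an on-call tool).
    print(f"ALERT: {message}")

while True:
    pods = core.list_namespaced_pod(
        "snow-atlas", label_selector="app=messaging-server")   # hypothetical namespace/labels
    for pod in pods.items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting if cs.state else None
            if not cs.ready or (waiting and waiting.reason == "CrashLoopBackOff"):
                alert(f"{pod.metadata.name}/{cs.name} unhealthy "
                      f"(restarts={cs.restart_count})")
    time.sleep(60)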

Posted Nov 12, 2024 - 02:13 PST

Resolved
Our team has restarted services for the remaining affected customer, which resolved the issue. All services have returned to normal.
Posted Oct 27, 2024 - 22:00 PDT
Identified
Our team has confirmed that all customers are operating normally, except for one. We are actively working to restore service for this remaining customer as quickly as possible.
Posted Oct 27, 2024 - 19:53 PDT
Update
Our initial investigation has identified a connectivity issue with a server, causing disruptions in the system's operations. While the logs suggest an unexpected server shutdown, the exact cause is still under review. Our team is working to resolve the issue and restore normal service as quickly as possible.
Posted Oct 27, 2024 - 18:33 PDT
Investigating
Issue Description: We are currently experiencing a service disruption affecting the Snow Atlas platform in the APAC region. Users may encounter "503 Service Temporarily Unavailable" errors when accessing various features, such as the Computers and Applications overviews. The issue is impacting general access across multiple areas of the platform in the APAC region, with no reports of issues in other regions.

Priority: P1

Restoration Activity: Our technical team is actively working on identifying the root cause and restoring services. Currently, no workaround is available. Further updates will be provided as we continue our efforts to resolve the incident.

We are closely monitoring the situation and will keep you informed as progress is made.
Posted Oct 27, 2024 - 17:03 PDT
This incident affected: Snow Atlas (Snow Atlas - Australia).