Snow Atlas - West Europe & East US 2 - SAM Core Permission Errors
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Snow Atlas - West Europe & East US 2 - SAM Core Permission Errors

Timeframe: November 11, 2022, from 6:00 AM PDT to 7:00 AM PDT

Incident Summary

On Monday, November 11, 2022, starting at 6:00 AM PDT, users of the Snow Atlas platform in the West Europe and East US 2 regions encountered permission-related errors when accessing SAM Core functionality. Multiple tenants in these two regions had limited access to certain areas of the platform. Other regions were unaffected, and no customer-reported incidents were received during this time.

The issue was traced to a scheduled maintenance event by our cloud service provider at around 6:00 AM PDT. By 6:15 AM PDT, all affected services had restarted and were operational. Full platform stabilization was delayed until 7:00 AM PDT because of the high volume of services attempting to reconnect simultaneously.

Root Cause

The issue stemmed from scheduled maintenance by our cloud service provider at approximately 6:00 AM PDT. During this event:

  1. Several virtual machines (VMs) in the West Europe and East US 2 regions experienced a brief freeze of approximately 6 seconds due to host migrations in the data center.
  2. This brief disruption caused critical Snow Atlas services, including SLM and Inventory, to lose their connections to database instances.
  3. These connection losses triggered automatic service restarts to restore functionality.
  4. Although services were fully restarted and operational by 6:15 AM PDT, platform stabilization was delayed due to a high volume of concurrent reconnection attempts across services.
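To illustrate why step 4 can delay stabilization, the following minimal sketch counts how many reconnection attempts land in the busiest one-second window when every service reconnects at once, compared with when each service waits a random delay first. It is a toy model, not Snow Atlas code: the function name, the service count, and the delay window are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def peak_attempts_per_second(num_services, max_delay_s, seed=0):
    """Return the number of reconnection attempts in the busiest 1-second bucket.

    Each (hypothetical) service reconnects after a random delay drawn
    uniformly from [0, max_delay_s]. A max_delay_s of 0 models every
    service reconnecting at the same instant.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    delays = [rng.uniform(0, max_delay_s) for _ in range(num_services)]
    buckets = Counter(int(delay) for delay in delays)
    return max(buckets.values())


if __name__ == "__main__":
    # Hypothetical numbers: 200 services, no delay vs. a 30-second spread.
    print("simultaneous:", peak_attempts_per_second(200, 0))
    print("spread over 30 s:", peak_attempts_per_second(200, 30))
```

With no delay, the peak equals the total number of services, so a database must absorb every reconnection in the same second. Spreading the attempts lowers that peak at the cost of a slightly later first connection for some services.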

Remediation Actions

  · Service Restarts: All affected services were automatically restarted, restoring their operational state by 6:15 AM PDT.
  · Traffic Monitoring: Continuous monitoring was performed to ensure no further connection losses occurred during the recovery process.
  · Full Stabilization: By 7:00 AM PDT, the platform was stabilized, and normal operations resumed.

Future Preventative Measures

  · Our teams are investigating the impact of concurrent reconnections during recovery scenarios to reduce delays and improve future stabilization times.
  · Mitigation strategies will be explored to prevent similar disruptions during scheduled maintenance events.
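The report does not say which mitigation strategies are under consideration. One common general-purpose pattern for this class of problem is to retry a lost database connection in place, with exponential backoff and random jitter, before falling back to a restart. The sketch below shows that pattern only as an assumption about what such a mitigation could look like; `backoff_delay`, `reconnect_in_place`, and all default values are hypothetical names and numbers, not part of Snow Atlas.

```python
import random
import time


def backoff_delay(attempt, base_s=0.5, cap_s=30.0, rng=random):
    """Full-jitter backoff: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng.uniform(0, min(cap_s, base_s * 2 ** attempt))


def reconnect_in_place(connect, max_attempts=6, sleep=time.sleep, **backoff_kwargs):
    """Call connect() until it succeeds or max_attempts is reached.

    `connect` is a caller-supplied function (hypothetical here) that returns
    a connection or raises ConnectionError. Between failed attempts the
    caller sleeps for a jittered delay, so many services recovering from the
    same event do not all retry at the same moment. If every attempt fails,
    the last ConnectionError is re-raised and the caller can fall back to a
    full restart.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:  # no sleep after the final attempt
                sleep(backoff_delay(attempt, **backoff_kwargs))
    raise last_error
```

A service would pass its own connection function, for example `reconnect_in_place(open_db_connection)`, where `open_db_connection` is whatever routine the service already uses. With the illustrative defaults, the retry window can outlast a freeze of a few seconds, but suitable values depend on the workload and would need to be tuned.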

Posted Nov 22, 2024 - 03:38 PST

Resolved
The issue affecting access to SAM Core functionality on the Snow Atlas platform has been resolved. A brief interruption occurred due to a maintenance event in the data center, causing some service components to temporarily lose connection. All services were fully restored shortly afterward, and the platform is now operating normally.
Posted Nov 12, 2024 - 07:05 PST
Investigating
Issue Description: We are currently experiencing an issue affecting the Snow Atlas platform, where users may encounter permission-related errors when accessing SAM Core functionality. This issue is impacting multiple tenants in the West Europe and East US 2 regions, resulting in limited access to certain areas within Snow Atlas SAM Core.

Priority: P2

Restoration Activity: Our technical team is actively working to resolve the issue and restore full functionality. We are monitoring the situation closely and will keep you informed of any developments.
Posted Nov 12, 2024 - 06:34 PST
This incident affected: Snow Atlas (Snow Atlas - America, Snow Atlas - Europe).