Snow Atlas - APAC - Service Disruption Impacting Platform Access

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Snow Atlas - APAC - Service Disruption Impacting Platform Access

Timeframe: July 27, 2025, 5:34 PM PDT to July 27, 2025, 7:11 PM PDT

Incident Summary

On Sunday, July 27, 2025, at 5:34 PM PDT, our monitoring systems detected a service disruption impacting the Snow Atlas platform in the APAC region. Users encountered “service unavailable” errors across multiple products.

Investigation revealed that the incident was caused by the unavailability of the messaging broker service, which prevented inter-service communication.

The root cause was traced to the messaging service running out of memory, which shut down the messaging servers. The servers did not recover on their own, so our teams restarted them manually, restoring normal service by 7:11 PM PDT.

Root Cause

Investigation by our technical teams determined that the central messaging broker service exhausted its available memory, triggering the shutdown of all messaging servers in the region. This resulted in a loss of inter-service communication, causing all services in APAC to fail. As the servers did not auto-recover, a manual restart was required to restore functionality.
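For illustration only, the auto-recovery gap described above can be sketched as a bounded restart loop with exponential backoff. This is a minimal sketch under stated assumptions, not the recovery mechanism used on the platform: `recover_broker`, `is_healthy`, and `restart` are hypothetical names, and the attempt count and delays are placeholder values.

```python
import time
from typing import Callable


def recover_broker(
    is_healthy: Callable[[], bool],   # hypothetical health probe for one messaging server
    restart: Callable[[], None],      # hypothetical restart action for that server
    max_attempts: int = 3,            # placeholder value, not taken from the incident
    base_delay: float = 1.0,          # seconds; placeholder value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart an unhealthy server a bounded number of times.

    Returns True if the server is healthy (immediately or after a restart),
    and False if every attempt failed and a human should be paged.
    """
    if is_healthy():
        return True
    for attempt in range(max_attempts):
        restart()
        # Wait longer after each failed attempt: base, 2x base, 4x base, ...
        sleep(base_delay * 2 ** attempt)
        if is_healthy():
            return True
    return False
```

Bounding the attempts is a deliberate choice in this sketch: a server that keeps running out of memory would otherwise restart in a loop, so a `False` result is meant to escalate to manual intervention.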

Remediation Actions

  • Our teams restarted the failed messaging servers, which allowed them to sync with other servers. Once message flow resumed, dependent services recovered automatically.
  • After the restart, our teams verified service health and customer access, then continued to monitor the platform to ensure stability.

Future Preventative Measures

  • Post-Mortem Review – Conduct a detailed post-mortem to determine the root cause of the messaging broker’s memory exhaustion and the failure of servers to auto-recover. Define and implement improvement actions based on the findings.
  • Message Handling Optimization – Review and optimize message handling processes to prevent backlog-related delays during recovery.
  • Enhanced Monitoring – Strengthen monitoring of the messaging service to enable earlier detection of similar failures.
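
As one possible shape for the enhanced monitoring item above, the sketch below classifies broker memory usage against warning and critical thresholds and estimates the time left before memory runs out. It is a hedged illustration only: the function name, the thresholds, and the linear growth estimate are assumptions, not details of the monitoring planned for the platform.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical thresholds, chosen for illustration only.
WARN_FRACTION = 0.70
CRIT_FRACTION = 0.85


@dataclass
class MemoryAlert:
    level: str                              # "ok", "warning", or "critical"
    used_fraction: float                    # latest usage divided by the limit
    seconds_to_exhaustion: Optional[float]  # None if usage is flat or falling


def evaluate_memory(
    samples: Sequence[Tuple[float, float]],  # (timestamp_seconds, used_bytes), oldest first
    limit_bytes: float,
) -> MemoryAlert:
    """Classify the latest sample and project when the limit would be reached."""
    if not samples:
        raise ValueError("at least one sample is required")
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")

    t_first, used_first = samples[0]
    t_last, used_last = samples[-1]
    fraction = used_last / limit_bytes

    # Simple linear projection between the first and last sample (an assumption).
    eta: Optional[float] = None
    elapsed = t_last - t_first
    if elapsed > 0:
        rate = (used_last - used_first) / elapsed  # bytes per second
        if rate > 0:
            eta = max(0.0, (limit_bytes - used_last) / rate)

    if fraction >= CRIT_FRACTION:
        level = "critical"
    elif fraction >= WARN_FRACTION:
        level = "warning"
    else:
        level = "ok"
    return MemoryAlert(level, fraction, eta)
```

Alerting on the projected time to exhaustion, and not only on the current level, is what would give earlier warning when memory is still below the threshold but climbing quickly.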

Posted Aug 15, 2025 - 05:35 PDT

Resolved

Following the implementation of the fix, the platform has remained stable. After a sustained period of monitoring, we are formally declaring the issue resolved.
Posted Jul 27, 2025 - 22:43 PDT

Monitoring

Our team identified a fault in a messaging service that was contributing to the disruption and implemented a fix. We are observing signs of recovery across the affected services. The platform is being closely monitored to ensure continued stability, and we will share additional updates as they become available.
Posted Jul 27, 2025 - 19:43 PDT

Investigating

Issue Description: We are currently experiencing a service disruption affecting the Snow Atlas platform in the APAC region. Users may encounter "service unavailable" errors when accessing Snow Atlas services. The issue is impacting general access across multiple areas of the platform in the APAC region, with no reports of issues in other regions.

Priority: P1

Restoration Activity: Our technical team is actively working on identifying the root cause and restoring services. Further updates will be provided as we continue our efforts to resolve the incident.

We are closely monitoring the situation and will keep you informed as progress is made.
Posted Jul 27, 2025 - 19:13 PDT
This incident affected: Snow Atlas (Snow Atlas - Australia).