Snow Atlas – APAC & East US 2 – Daily Update Job Failures

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Snow Atlas – APAC & East US 2 – Daily Update Job Failures

Timeframe:

  • Asia‑Pacific (APAC): July 29, 2025, 4:00 AM PDT to July 30, 2025, 4:00 AM PDT
  • East US 2 (EUS2): July 29, 2025, 10:00 PM PDT to July 30, 2025, 10:00 PM PDT

Incident Summary

On July 29, 2025, at 4:00 AM PDT, Snow Atlas experienced a disruption in its scheduled daily update jobs (DUJs) in the Asia‑Pacific (APAC) region. While customers were able to log in and access the platform, updated data was not available during this period.

Later the same day, similar failures occurred in the East US 2 (EUS2) region, where scheduled updates also did not complete. Some customers in North America missed two consecutive updates on July 29 and July 30.

During the disruption, technical teams confirmed that the failures were affecting a large number of tenants. To avoid further complications, the update jobs were not manually retriggered, and the focus remained on identifying and implementing a permanent fix.

The failures were traced to a mismatch between two dependent service components introduced during a recent deployment. In some cases, prior configuration overrides in the EUS2 region prevented tenants from immediately receiving the corrected version, contributing to a second missed update.

The issue was resolved once the dependent services were realigned and configuration overrides cleared on July 30, 2025, at 12:45 AM PDT. Daily update jobs then resumed successfully: the APAC region was confirmed to be running normally by 4:00 AM PDT the same day, and the EUS2 region by 10:00 PM PDT. From that point forward, daily updates continued without interruption. No customer data was lost, and validations confirmed that updates were fully restored.

Root Cause

Primary Root Cause

The disruption was caused by dependent service updates that were not fully synchronized during a recent deployment. This misalignment prevented scheduled daily update jobs (DUJs) from completing successfully. In the North America region, prior configuration overrides delayed the application of the corrected version, contributing to a second missed update for some customers.

Contributing Factors

  • Service Synchronization: The synchronization required between dependent service updates was not captured during refinement, leading to widespread DUJ failures.
  • Configuration Overrides: In the North America region, existing overrides prevented some tenants from receiving the corrected update in time, resulting in consecutive DUJ failures.
  • Validation Gaps: The importance of coordinating these service updates was not highlighted during planning, allowing the issue to occur.
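
To illustrate the configuration-override factor above, here is a minimal sketch of a check that lists tenants that would not pick up a corrected version because an override takes precedence over the regional rollout. The report does not describe how overrides are stored or applied, so the `Tenant` class, the `override_version` field, the tenant IDs, and the version numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    region: str
    # Hypothetical per-tenant override that pins a component version.
    override_version: Optional[str] = None


def effective_version(tenant, regional_version):
    """Assume an override, when present, wins over the regional rollout."""
    return tenant.override_version or regional_version[tenant.region]


def tenants_missing_fix(tenants, regional_version, required_version):
    """Return IDs of tenants that would not run the required (fixed) version."""
    return [
        t.tenant_id
        for t in tenants
        if effective_version(t, regional_version) != required_version
    ]


# Illustrative data only; none of these values come from the incident.
tenants = [
    Tenant("t-apac-1", "APAC"),
    Tenant("t-eus2-1", "EUS2"),
    Tenant("t-eus2-2", "EUS2", override_version="1.4.0"),
]
regional_version = {"APAC": "1.4.1", "EUS2": "1.4.1"}

stale = tenants_missing_fix(tenants, regional_version, required_version="1.4.1")
print(stale)  # ['t-eus2-2']
```

Running a check of this kind right after a fix is rolled out would flag pinned tenants before the next scheduled job, rather than after it fails.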

Remediation Actions

  1. Service Alignment: Dependent service components were realigned to ensure proper synchronization, addressing the mismatch that caused daily update job failures.
  2. Configuration Correction: Prior overrides in the North America region were removed, ensuring all tenants received the corrected version.

Post-Recovery Validations: Comprehensive checks were performed to confirm that daily update jobs resumed successfully and that customer data remained intact.
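
One way such a validation could be expressed is a freshness check over each tenant's last successful update. This is only a sketch under assumptions: the report does not say which checks were run, and the `stale_tenants` function, the 24-hour threshold, and the sample tenant data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_tenants(last_success, now, max_age=timedelta(hours=24)):
    """Return tenant IDs whose last successful update is missing or older than max_age."""
    return sorted(
        tenant_id
        for tenant_id, finished_at in last_success.items()
        if finished_at is None or now - finished_at > max_age
    )


# Illustrative data only; none of these values come from the incident.
now = datetime(2025, 7, 30, 12, 0, tzinfo=timezone.utc)
last_success = {
    "tenant-a": now - timedelta(hours=2),   # updated recently
    "tenant-b": now - timedelta(hours=50),  # missed two daily runs
    "tenant-c": None,                       # no successful run recorded
}

result = stale_tenants(last_success, now)
print(result)  # ['tenant-b', 'tenant-c']
```

An empty result after the first post-fix run in each region would be consistent with updates being fully restored.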

Future Preventative Measures

  1. Improved End-to-End Testing: Efforts are underway to strengthen end-to-end testing processes to better identify potential issues prior to deployment.
  2. Enhanced Service Dependency Mapping: Work is being done to ensure that dependencies between services are clearly documented and coordinated during planning and refinement, reducing the likelihood of future misalignments.
  3. Deployment Observability Enhancements: Plans are in progress to improve visibility into deployment status, enabling earlier detection of issues during rollouts and quicker corrective actions when necessary.
  4. Streamlined Update Job Management: A new mechanism has been introduced to provide greater control over update job execution and synchronization flows, reducing the risk of partial or misaligned updates in future deployments. Additionally, efforts are underway to deprecate the DUJ framework altogether. The deployment associated with this incident was part of that broader initiative, which aims to eliminate these issues at their root.
Posted Aug 06, 2025 - 20:20 PDT

Resolved

The issue impacting Daily Update Jobs (DUJs) in the APAC and East US 2 regions has been resolved. Both regions have now been updated, and we have confirmed that DUJs are completing successfully in the US region as well.
Posted Jul 31, 2025 - 09:37 PDT

Update

We are continuing to monitor the situation closely. No new issues have been observed since the fix was applied, and upcoming Daily Update Jobs (DUJs) are expected to complete successfully. We will provide another update once the US‑based jobs are confirmed to have run as expected, or sooner if there are any changes.
Posted Jul 30, 2025 - 12:48 PDT

Monitoring

The failures observed in the APAC region have been identified and resolved. However, a configuration oversight in the US region resulted in it not receiving the updated version containing the fix. This issue was promptly identified, and the fix has now been successfully deployed in the US region as well.

All upcoming Daily Update Jobs (DUJs) are expected to complete successfully. We will continue to monitor the situation closely until US-based jobs are confirmed to be running as expected and the issue is fully resolved.
Posted Jul 30, 2025 - 01:07 PDT

Update

Monitoring has confirmed that today’s Daily Update Jobs (DUJs) for our US customers have also failed. Our teams are actively investigating the underlying cause and are continuing to monitor service performance closely. We will provide further updates as additional information becomes available.
Posted Jul 30, 2025 - 00:20 PDT

Update

Our team continues to monitor the situation closely. While access to Snow Atlas remains available, some customers may continue to experience delays in data updates. We are also monitoring the next round of Daily Update Jobs to determine if any further manual intervention is required. Additional updates will be provided as more information becomes available.
Posted Jul 29, 2025 - 19:24 PDT

Investigating

Incident Description: We are currently investigating an issue affecting the Snow Atlas platform in the APAC and East US 2 regions. Customers may experience failed Daily Update Jobs (DUJs), which could cause delays in data processing. As a result, the latest data may not be visible in Snow Atlas.

Priority: P2

Restoration Activity: Our technical team is actively investigating. Preliminary findings point to a conflict encountered during scheduled update processing, which has impacted the completion of Daily Update Jobs. Work is underway to clear any backlog and ensure updates run successfully. We will continue to monitor progress closely and provide further updates as more information becomes available.
Posted Jul 29, 2025 - 17:36 PDT
This incident affected: Snow Atlas (Snow Atlas - America, Snow Atlas - Australia).