Description: Snow Atlas - Australia Southeast (ASE), East US2 & West Europe - Inventory Data Not Loading
Timeframe: January 29, 2025, 2:42 PM PST to January 31, 2025, 10:10 AM PST
Incident Summary
On January 29, 2025, at 2:42 PM PST, our monitoring systems detected an issue affecting inventory data availability for Snow Atlas customers in the Australia Southeast, East US2, and West Europe regions. While the platform itself remained accessible, affected customers reported that upon logging into their tenants, the Snowboard appeared blank, and the "Search for Computers" page displayed no data. Attempts to retrieve inventory data resulted in a "No information available" error.
Our technical teams were promptly engaged and suspected that the issue was caused by a recent deployment. Initial investigations confirmed that all infrastructure services were running as expected, ruling out infrastructure-related failures. An attempt was made to restart the service for one affected customer, but this did not resolve the issue.
To mitigate the impact, our teams migrated the database for the affected customers while continuing to work on a permanent fix. By January 30, 2025, at 4:18 AM PST, the issue was resolved for most customers. However, a configuration issue was identified for a subset of customers, which prevented their data from being restored. After the configuration settings were updated, the issue was fully resolved by January 31, 2025, at 10:10 AM PST.
Root Cause
The issue was triggered by a recent deployment that introduced a misconfiguration. The platform remained operational, but the misconfiguration prevented inventory data from being displayed correctly for affected customers. In addition, a configuration discrepancy for certain customers delayed complete resolution.
Remediation Actions
· Database Migration: The affected customers’ databases were migrated to restore access to inventory data while further investigation into the root cause continued.
· Service Restart Attempt: A restart of the service was attempted as part of troubleshooting, but this action did not resolve the issue.
· Configuration Correction: A configuration discrepancy impacting certain customers was identified and updated to restore full functionality.
· Enhanced System Monitoring: Continuous monitoring was performed post-mitigation to ensure service stability and prevent recurrence.
Future Preventative Measures
· Thorough Root Cause Analysis: Conduct an in-depth root cause analysis to identify the exact failure points and the measures needed to prevent similar incidents.
· Deployment Validation Enhancements: Strengthen pre-deployment validation processes to minimize the risk of misconfigurations.
· Automated Configuration Verification: Implement automated validation mechanisms to ensure configurations are applied correctly across all customer environments before deployment.
· Expanded Monitoring and Alerting: Enhance real-time monitoring and alerting capabilities to proactively detect and diagnose inventory data inconsistencies.
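As an illustration of the automated configuration verification described above, the following is a minimal sketch of a pre-deployment check that flags incomplete or mismatched tenant configurations. It is not the actual Snow Atlas implementation: the setting names (inventory_db_host, inventory_db_name, region), the TenantConfig type, and both function names are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass

# Hypothetical required settings; the real configuration keys are not
# stated in this report.
REQUIRED_KEYS = {"inventory_db_host", "inventory_db_name", "region"}


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    settings: dict


def find_config_problems(config, expected_region):
    """Return human-readable problems found in one tenant's configuration."""
    problems = []
    for key in sorted(REQUIRED_KEYS):
        if key not in config.settings:
            problems.append(f"{config.tenant_id}: missing '{key}'")
        elif not str(config.settings[key]).strip():
            problems.append(f"{config.tenant_id}: empty '{key}'")
    # Only compare the region when one is actually set.
    region = str(config.settings.get("region", "")).strip()
    if region and region != expected_region:
        problems.append(
            f"{config.tenant_id}: region '{region}' does not match "
            f"expected '{expected_region}'"
        )
    return problems


def validate_all(configs, expected_region):
    """Check every tenant; an empty result means the deployment may proceed."""
    return [
        problem
        for config in configs
        for problem in find_config_problems(config, expected_region)
    ]
```

A deployment pipeline could call validate_all for each region and stop the rollout whenever the returned list is not empty, so that a discrepancy affecting only a subset of customers is caught before release rather than during recovery.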