Description: Cloud Management Platform - NAM - Self-Service, CWF, and CM Shards 3 & 4 - Performance Degradation
Timeframe: October 27, 2025, 9:30 PM PST to October 27, 2025, 11:38 PM PST
Incident Summary
On Monday, October 27, 2025, at 9:30 PM PST, our teams identified performance degradation affecting the Self-Service, CWF, and Cloud Management (CM) functionalities on Shard 3 and Shard 4 of the Cloud Management Platform (CMP) in the North America (NAM) region. The issue arose following a scheduled maintenance activity during which the database infrastructure supporting these shards was downgraded to a lower instance type as part of a planned infrastructure change.
During post-maintenance validation, the teams observed a slowdown across the CMP, with some operations taking longer than anticipated to complete. As a result, customers in the NAM region may have experienced reduced performance or delays while using the Self-Service, CWF, and CM functionalities.
To address the degradation, we scaled the database infrastructure up to a higher resource configuration, which restored normal performance on Shards 3 and 4 by 10:00 PM PST. We continued to monitor the situation closely, and by 11:38 PM PST all affected functionalities, including Self-Service, CWF, and CM, were confirmed to be fully restored and operating normally.
Root Cause
During the scheduled maintenance, the database infrastructure supporting Shard 3 and Shard 4 was downgraded to a lower instance type. The change had been validated successfully in the staging environment using the same configuration.
However, the production environment required greater computational resources due to a surge in workload and concurrent user activity. The lower instance type could not meet this demand, resulting in elevated resource utilization, increased database latency, and overall performance degradation across the CMP.
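For illustration only: the elevated utilization and latency described above would typically be confirmed from database monitoring data. The sketch below assumes the shard databases are AWS RDS instances monitored through CloudWatch; the instance identifiers and region are placeholders, since the underlying database technology is not named in this report.

```python
import datetime
import boto3

# Hypothetical identifiers; the actual shard databases, region, and
# instance names were not disclosed in this report.
DB_INSTANCES = ["cmp-nam-shard3-db", "cmp-nam-shard4-db"]
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

def recent_average(metric_name, db_instance, minutes=30):
    """Average value of an RDS CloudWatch metric over the last `minutes` minutes."""
    now = datetime.datetime.now(datetime.timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance}],
        StartTime=now - datetime.timedelta(minutes=minutes),
        EndTime=now,
        Period=300,                 # 5-minute buckets
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else None

for db in DB_INSTANCES:
    cpu = recent_average("CPUUtilization", db)        # percent
    read_latency = recent_average("ReadLatency", db)  # seconds
    print(f"{db}: CPU={cpu}%  ReadLatency={read_latency}s")
```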
Remediation Actions
· Our teams initiated an investigation immediately upon detecting the performance degradation. Database performance metrics were analyzed and revealed resource contention and high latency.
· The database instances were scaled up to a higher resource configuration to restore normal performance levels (a rough sketch of this step follows this list).
· Continuous post-restoration monitoring was carried out to ensure stability and validate recovery across all impacted functionalities.
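As a rough sketch of the scale-up step referenced in the second item above, and under the same assumption that the shard databases are AWS RDS instances managed with boto3, the change might look like the following; the instance identifiers and target class are placeholders rather than the values used during the incident.

```python
import boto3

# Hypothetical values; the real instance identifiers, region, and target
# instance class used during the incident are not part of this report.
rds = boto3.client("rds", region_name="us-west-2")
TARGET_CLASS = "db.r5.2xlarge"

for db_instance in ["cmp-nam-shard3-db", "cmp-nam-shard4-db"]:
    # Request the larger instance class and apply it immediately rather
    # than waiting for the next maintenance window.
    rds.modify_db_instance(
        DBInstanceIdentifier=db_instance,
        DBInstanceClass=TARGET_CLASS,
        ApplyImmediately=True,
    )
    # Block until the instance reports "available" again before moving on
    # (the waiter polls describe_db_instances with default timeouts).
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=db_instance)
    print(f"{db_instance} scaled to {TARGET_CLASS}")
```

Applying the change immediately trades a brief modification window for faster recovery, which is the usual choice when the platform is already degraded.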
Future Preventative Measures
· Enhanced Pre-Production Testing - Conduct comprehensive load and performance testing in pre-production environments that accurately replicate production-scale workloads before implementing infrastructure changes (a minimal load-test sketch follows this list).
· Extended Maintenance Windows - Plan longer maintenance windows for infrastructure-level changes to allow sufficient time for post-maintenance validation and, if necessary, rollback.
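To make the first preventative measure more concrete, a minimal load-test sketch is shown below. It assumes a PostgreSQL-compatible pre-production database reachable with psycopg2; the endpoint, concurrency level, and sample query are placeholders and would need to be calibrated against real production traffic profiles before they could stand in for production-scale validation.

```python
import concurrent.futures
import statistics
import time

import psycopg2  # assumes a PostgreSQL-compatible database; not confirmed by this report

# Hypothetical pre-production endpoint and workload; tune concurrency and
# query mix to mirror observed production peaks before any instance change.
DSN = "host=cmp-preprod-shard3-db dbname=cmp user=loadtest"
CONCURRENT_USERS = 200
QUERIES_PER_USER = 50
SAMPLE_QUERY = "SELECT count(*) FROM service_requests WHERE status = 'open'"

def simulate_user(_):
    """Run a fixed number of queries on one connection and record latencies."""
    latencies = []
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for _ in range(QUERIES_PER_USER):
            start = time.perf_counter()
            cur.execute(SAMPLE_QUERY)
            cur.fetchall()
            latencies.append(time.perf_counter() - start)
    return latencies

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    all_latencies = [l for user in pool.map(simulate_user, range(CONCURRENT_USERS))
                     for l in user]

print(f"p50={statistics.median(all_latencies):.3f}s  "
      f"p95={statistics.quantiles(all_latencies, n=20)[18]:.3f}s")
```

Comparing the reported p50/p95 latencies against agreed performance baselines before and after a candidate instance change would give an objective go/no-go signal for the maintenance.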