Description: Cloud Management Platform - Self-Service Shard 4 - Potential Errors while launching Cloud Apps or performing custom actions
Timeframe: April 18, 2025, 7:34 PM PST to April 18, 2025, 8:46 PM PST
Incident Summary
On Friday, April 18, 2025, at 7:34 PM PST, our monitoring systems detected intermittent database connection errors within the Cloud Management Platform (CMP), specific to Self-Service Shard 4. The issue primarily affected the scheduling service, which handles Cloud App launch workflows and custom operations. As a result, customers may have experienced slowness, timeouts, or failures when attempting to launch Cloud Apps or perform custom actions.
The errors first appeared in CMP logs as connection resets between the scheduling service and the database. The issue was isolated to Shard 4 and did not affect other shards. After the impacted pods were redeployed, the errors ceased, and no further anomalies were observed during continued monitoring.
Root Cause
During the investigation, the connection reset issues were traced back to a minor upgrade automatically applied by our cloud service provider to the underlying Database Service instance. This upgrade introduced temporary connectivity disruptions between the CMP scheduling service and its database.
Because the upgrade was applied automatically and without coordination, it briefly disrupted active service connections, leading to intermittent failures and degraded performance.
Remediation Actions
· Redeployed affected pods in Shard 4 to re-establish healthy connections between the scheduling service and the database.
· Closely monitored system behaviour after the redeployment to confirm that the errors did not recur.
· Validated full restoration of scheduling service functionality with no additional user-impacting symptoms.
Future Preventative Measures
· Auto-Upgrade Disabled: Our teams have disabled automatic upgrades to prevent uncoordinated changes to production systems.
· Service Hardening: We will improve the scheduling service's resilience to transient database connection errors through better error handling.
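As an illustration of the kind of error handling the Service Hardening item could involve, the sketch below retries a database operation with exponential backoff after a connection reset. It is a minimal example under stated assumptions, not the scheduling service's actual implementation: the names TransientDBError, run_with_retry, operation, and reconnect are hypothetical, and the attempt count and delays are placeholder values.

```python
import random
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a database driver's connection-reset error."""


def run_with_retry(operation, reconnect, max_attempts=4, base_delay=0.2,
                   max_delay=5.0, sleep=time.sleep):
    """Run operation(); on a transient error, back off, reconnect, and retry.

    operation and reconnect are caller-supplied callables (assumed names).
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDBError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Exponential backoff with full jitter, capped at max_delay,
            # so that many pods do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
            reconnect()  # discard the stale connection before the next attempt
```

With handling along these lines, a brief connectivity disruption such as the one caused by the database upgrade might be absorbed by the service itself rather than requiring a pod redeployment, provided the disruption is shorter than the total retry window.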