Cloud Management Platform - Self-Service Shard 4 - Potential Errors while launching Cloud Apps or performing custom actions

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Cloud Management Platform - Self-Service Shard 4 - Potential Errors while launching Cloud Apps or performing custom actions

Timeframe: April 18, 2025, 7:34 PM PDT to April 18, 2025, 8:46 PM PDT

Incident Summary

On Friday, April 18, 2025, at 7:34 PM PDT, our monitoring systems detected intermittent database connection errors in the Cloud Management Platform (CMP), limited to Self-Service Shard 4. The issue primarily affected the scheduling service, which handles Cloud App launch workflows and custom operations. As a result, customers may have experienced slowness, timeouts, or failures when launching Cloud Apps or running custom operations.

The errors first appeared in CMP logs as connection resets between the scheduling service and the database. The issue was isolated to Shard 4 and did not affect other shards. After we redeployed the impacted pods, the errors ceased, and continued monitoring showed no further anomalies.

Root Cause

The investigation traced the connection resets to a minor upgrade that our cloud service provider applied automatically to the underlying Database Service instance. The upgrade temporarily disrupted connectivity between the CMP scheduling service and its database.

Because the upgrade was applied automatically and without coordination, it briefly interrupted active service connections, causing intermittent failures and degraded performance.

Remediation Actions

· Redeployed the affected pods in Shard 4 to re-establish healthy connections between the scheduling service and the database.

· Closely monitored system behaviour after the redeployment to confirm that the errors did not recur.

· Validated that scheduling service functionality was fully restored, with no further user-impacting symptoms.

Future Preventative Measures

· Auto-Upgrade Disabled: Our teams have disabled automatic upgrades of the Database Service instance to prevent uncoordinated changes to production systems.

· Service Hardening: We will improve the scheduling service's resilience to transient database connection failures through better error handling.
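
As an illustration of the kind of error handling the Service Hardening item refers to, the sketch below retries an operation that fails with a connection-level error, waiting longer between attempts. It is not the CMP scheduling service's actual code: the function name call_with_retry, its parameters, and the choice of exponential backoff with jitter are all assumptions made for this example.

```python
import random
import time


def call_with_retry(operation, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Run operation(), retrying on connection errors with exponential backoff.

    Hypothetical helper for illustration only. ConnectionError is Python's
    built-in base class for ConnectionResetError and related errors.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # Out of attempts: surface the original error to the caller.
            if attempt == max_attempts:
                raise
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A real database driver usually raises its own exception types for dropped connections, so the except clause would need to be adapted to the driver in use. Retrying is also only safe for operations that can be repeated without side effects.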

Posted Apr 22, 2025 - 03:15 PDT

Resolved

Our team restarted the services, which successfully cleared the errors. The platform was monitored for an extended period, and no further errors were observed.
Posted Apr 18, 2025 - 21:02 PDT

Investigating

Incident Description:
Our team has identified a potential issue that could impact customers using the Cloud Management Platform (CMP) on Self-Service Shard 4. As a result, customers may experience errors when launching Cloud Apps. They might also face intermittent custom operation failures.

Priority: P2

Restoration activity:
Our technical teams are actively investigating and are working to resolve the issue as quickly as possible.
Posted Apr 18, 2025 - 20:19 PDT
This incident affected: Legacy Cloud Management (Cloud Management Dashboard - Shard 4, Self-Service - Shard 4).