Cloud Management Platform - Self-Service Shard 3 & Shard 4 - Service Degradation

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Cloud Management Platform - Self-Service Shard 3 & Shard 4 - Service Degradation

Timeframe: March 14, 2025, 6:17 AM PDT to March 14, 2025, 7:51 AM PDT

Incident Summary

On March 14, 2025, at 6:17 AM PDT, monitoring alerts notified our teams of an issue affecting customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 and Shard 4. As a result, affected customers might have been unable to use self-service capabilities, such as creating or accessing Cloud Apps. Additionally, customers on Shard 4 may have experienced intermittent login issues with the CM & RightScale platform. Our teams promptly began investigating and found a front-end server in an unresponsive state. While replacing that server, they identified a second server in a similar state. The first server was replaced at 7:27 AM PDT and the second at 7:51 AM PDT, after which our teams continued to monitor the platform for stability.

Root Cause

Initial investigation revealed that a front-end server had become unresponsive, disrupting self-service functionality. After replacing the affected server at 7:27 AM PDT, our team identified a second unresponsive server, which was replaced at 7:51 AM PDT. Logs confirmed system stability after this second replacement.
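The timeline above can be turned into elapsed times with simple arithmetic. The sketch below is illustrative only: the variable and function names are assumptions, and the timestamps are the ones stated in this report, treated as local times on the same day.

```python
from datetime import datetime

# Timestamps taken from this report (same day, same local time zone).
FMT = "%Y-%m-%d %H:%M"
detected = datetime.strptime("2025-03-14 06:17", FMT)         # monitoring alert
first_replaced = datetime.strptime("2025-03-14 07:27", FMT)   # first server replaced
second_replaced = datetime.strptime("2025-03-14 07:51", FMT)  # second server replaced


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


time_to_first_replacement = minutes_between(detected, first_replaced)
time_between_replacements = minutes_between(first_replaced, second_replaced)
time_to_restore = minutes_between(detected, second_replaced)

print(time_to_first_replacement, time_between_replacements, time_to_restore)
```

Under these assumptions, the first replacement came 70 minutes after the alert, the second 24 minutes later, and the full incident window was 94 minutes.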

Remediation Actions 

·        Identified and replaced the unresponsive front-end server at 7:27 AM PDT.

·        Detected and replaced a second unresponsive server at 7:51 AM PDT.

·        Verified logs and system performance to confirm service restoration.

·        Monitored the platform to ensure continued stability.

Future Preventative Measures 

·        Review existing automated health checks for front-end servers so that failures are detected and mitigated proactively.

·        Enhance logging and alerting mechanisms for quicker identification of unresponsive components.

·        Conduct a post-mortem review to identify improvements to redundancy and failover strategies.
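As an illustration of the kind of automated health check mentioned above, here is a minimal sketch, not the platform's actual implementation. It assumes each front-end server exposes an HTTP endpoint that can be probed; the server names, the probe target, and the failure threshold are all hypothetical. Every server in the pool is probed in each round, so several unresponsive servers would be reported together.

```python
from dataclasses import dataclass
from typing import Callable, List
import urllib.error
import urllib.request


@dataclass
class ServerHealth:
    """Tracks consecutive failed probes for one server (hypothetical model)."""
    name: str
    consecutive_failures: int = 0


def http_probe(url: str, timeout_s: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and HTTP error statuses count as failures.
        return False


def run_check_round(
    servers: List[ServerHealth],
    probe: Callable[[str], bool],
    failure_threshold: int = 3,
) -> List[str]:
    """Probe every server once; return the names that reached the threshold."""
    unresponsive = []
    for server in servers:
        if probe(server.name):
            server.consecutive_failures = 0
        else:
            server.consecutive_failures += 1
        if server.consecutive_failures >= failure_threshold:
            unresponsive.append(server.name)
    return unresponsive


# Usage with a stand-in probe: "fe-1" never responds, "fe-2" always does.
pool = [ServerHealth("fe-1"), ServerHealth("fe-2")]
fake_probe = lambda name: name != "fe-1"
rounds = [run_check_round(pool, fake_probe) for _ in range(3)]
print(rounds)
```

Requiring several consecutive failures before flagging a server is one way to avoid replacing a server over a single transient timeout. In a real deployment the probe passed in would be something like `http_probe` pointed at each server's health URL, and the returned names would feed an alert or an automated replacement step.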

Posted Mar 21, 2025 - 03:18 PDT

Resolved

All services are now operating as expected, and logs continue to show stable performance. This incident has been resolved.
Posted Mar 14, 2025 - 08:22 PDT

Monitoring

The affected services have been restored, and logs indicate stable performance. We will continue to monitor the system to ensure stability.
Posted Mar 14, 2025 - 08:02 PDT

Update

The team has identified the issue and is actively working on mitigation. We will provide further updates as progress is made.
Posted Mar 14, 2025 - 07:32 PDT

Update

The impacted server has been replaced, and we are monitoring the system to ensure full recovery.
Posted Mar 14, 2025 - 07:29 PDT

Identified

Initial investigations indicate that a front-end server has become unresponsive. Our team is actively working on a replacement to restore functionality. We will provide further updates as progress is made.
Posted Mar 14, 2025 - 07:14 PDT

Investigating

Issue Description: We are currently investigating an issue that may impact customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 and Shard 4. As a result, customers may be unable to use self-service capabilities, including creating or accessing Cloud Apps. Additionally, customers on Shard 4 may experience intermittent issues when logging into the RightScale platform.

Priority: P1

Restoration Activity:
Our technical teams are actively engaged and assessing the situation. We are also exploring potential solutions to restore functionality as quickly as possible.
Posted Mar 14, 2025 - 06:55 PDT
This incident affected: Legacy Cloud Management (Self-Service - Shard 3, Self-Service - Shard 4).