Cloud Management Platform - Self-Service Shard 3 & Shard 4 - Service Degradation

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Cloud Management Platform - Self-Service Shard 3 & Shard 4 - Service Degradation

Timeframe: March 16, 2025, 9:21 AM PDT to March 16, 2025, 10:38 AM PDT

Incident Summary

On March 16, 2025, at 9:21 AM PDT, monitoring alerts detected a recurrence of the issue affecting customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 and Shard 4. During the incident, affected customers may have been unable to use self-service capabilities, such as creating or accessing Cloud Apps. Additionally, customers on Shard 4 may have experienced intermittent login issues with the CM & RightScale platforms. Our teams promptly began investigating and found a Front-End server in an unresponsive state. The server was replaced at 10:24 AM PDT, and the team continued to monitor the platform for stability. After all validation checks passed, the service was declared restored at 10:38 AM PDT.

Root Cause

Investigation confirmed that, as in the occurrence on March 14, 2025, a Front-End server had become unresponsive, impacting self-service functionality. The server was replaced, and additional validation checks were performed to confirm system stability before service was fully restored.

Remediation Actions 

·        Identified and replaced the unresponsive Front-End server.

·        Conducted thorough validation checks to confirm stability before declaring resolution.

·        Monitored system logs and platform performance post-recovery.

Future Preventative Measures 

·        Review existing automated health checks for Front-End servers to detect and mitigate failures proactively.

·        Strengthen redundancy measures to minimize the impact of similar failures in the future.

·        Conduct a post-mortem review to assess improvements in redundancy and failover strategies.

Posted Mar 21, 2025 - 05:02 PDT

Resolved

This incident has been resolved.
Posted Mar 16, 2025 - 11:11 PDT

Update

The platform has returned to normal operations. Our team will continue to monitor the system to ensure sustained stability.
Posted Mar 16, 2025 - 10:43 PDT

Monitoring

The technical team has identified and replaced the affected server and is now conducting validation checks to ensure stability.
Posted Mar 16, 2025 - 10:29 PDT

Identified

Issue Description: We are currently investigating an issue that may impact customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 and Shard 4. As a result, customers may be unable to use self-service capabilities, including creating or accessing Cloud Apps. Additionally, customers on Shard 4 may experience intermittent issues when logging into the RightScale platform.

Priority: P1

Restoration Activity: Our technical teams are actively engaged and assessing the situation. We are also exploring potential solutions to restore functionality as quickly as possible.
Posted Mar 16, 2025 - 09:49 PDT
This incident affected: Legacy Cloud Management (Self-Service - Shard 3, Self-Service - Shard 4).