Cloud Management Platform - US - Self-Service Shard 3 & 4 - Intermittent Errors while making specific API calls

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Cloud Management Platform – 500 Internal Server Error during scheduled release

Timeframe: January 24, 2025, at 11:50 PM PST to January 25, 2025, at 12:33 AM PST

Incident Summary

On Friday, January 24, 2025, at 11:50 PM PST, our monitoring alerts detected an issue affecting our Cloud Management Platform, where some customers may have experienced intermittent API call failures with "500 - Internal Server Error" responses.

During the investigation, the technical team identified an unresponsive server as the root cause. To fully restore connectivity and ensure stability, the team replaced the server as a permanent fix. The issue was fully resolved by January 25, 2025, at 12:33 AM PST.

Root Cause

The root cause was a loss of connectivity between the server and the shard instance. This connectivity issue impacted a specific API call but did not cause a system-wide outage. Our alerting systems detected the issue promptly and notified our teams.

Remediation Actions

· Connectivity Restoration: Restored connectivity between the replaced server and the shard instance.

· Server Replacement: As an additional precaution, replaced the unresponsive server again to ensure stability.

· Documentation Update: Updated internal documentation to include steps for verifying connectivity after server replacements and during server unresponsiveness, so that future issues of this nature can be addressed proactively.

Future Preventative Measures

Connectivity Validation:

· Implement checks to verify connectivity between servers and shard instances after server replacements.

· Integrate these checks into the standard server replacement process.
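As an illustration only, a post-replacement connectivity check could be sketched as below. This is a minimal sketch, not our actual tooling: it assumes that "connectivity" can be approximated by opening a TCP connection from the replaced server to each shard endpoint, and the function names, hosts, ports, and timeout are all hypothetical.

```python
import socket
from typing import Iterable, List, Tuple

Endpoint = Tuple[str, int]  # (host, port); values are supplied by the caller


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_endpoints(endpoints: Iterable[Endpoint], timeout: float = 3.0) -> List[Endpoint]:
    """Return the endpoints that could not be reached, in input order.

    An empty list means every endpoint accepted a connection. A replacement
    runbook step could fail if this list is non-empty.
    """
    return [(host, port) for host, port in endpoints if not can_connect(host, port, timeout)]
```

A replacement procedure could call `unreachable_endpoints` with the shard endpoints relevant to the new server and treat any non-empty result as a failed validation step. A TCP check only confirms that the port is reachable; an application-level health request would be a stronger check where one is available.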

Load Balancer Monitoring:

· Enhance monitoring of load balancer components to identify and address unresponsiveness proactively.

· Implement redundancy measures to minimize the impact of single-server failures.
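One common way to detect unresponsiveness early is to flag a backend after several consecutive failed health checks. The sketch below shows that idea under stated assumptions: the class name, the threshold of three, and the backend identifiers are hypothetical, and the health check results are assumed to come from an existing probe.

```python
from collections import defaultdict
from typing import Dict, Set


class UnresponsiveTracker:
    """Flags a backend once it fails `threshold` health checks in a row."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        # Consecutive failure count per backend identifier.
        self._failures: Dict[str, int] = defaultdict(int)

    def record(self, backend: str, healthy: bool) -> bool:
        """Record one check result; return True if the backend is now flagged."""
        if healthy:
            # Any successful check resets the streak.
            self._failures[backend] = 0
        else:
            self._failures[backend] += 1
        return self._failures[backend] >= self.threshold

    def flagged(self) -> Set[str]:
        """Return all backends currently at or above the failure threshold."""
        return {b for b, n in self._failures.items() if n >= self.threshold}
```

Requiring consecutive failures rather than a single failed check reduces alerts from transient blips, at the cost of slightly slower detection. With redundant servers in place, a flagged backend could be removed from rotation while the remaining servers continue to serve requests.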

Posted Feb 11, 2025 - 04:32 PST

Resolved

The service has been replaced and the issue has been resolved.

Posted Jan 25, 2025 - 00:46 PST

Identified

Incident Description:
Our team has identified a potential issue that could impact customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 & 4. As a result, some customers may experience intermittent errors while making specific API calls.

Priority: P3

Restoration activity:
Our technical teams are actively involved and are working to restore the services. They have identified the unresponsive service and are working to replace it.

Posted Jan 25, 2025 - 00:12 PST

This incident affected: Legacy Cloud Management (Self-Service - Shard 3, Self-Service - Shard 4).