Cloud Management Platform - US - Self-Service Shard 3 & 4 - Intermittent Errors while making specific API calls

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Cloud Management Platform – 500 Internal Server Error during scheduled release

Timeframe: January 24, 2025, at 11:50 PM PST to January 25, 2025, at 12:33 AM PST

Incident Summary

On Friday, January 24, 2025, at 11:50 PM PST, our monitoring alerts detected an issue affecting our Cloud Management Platform, where some customers may have experienced intermittent API call failures returning "500 - Internal Server Error."

During the investigation, the technical team identified an unresponsive server as the root cause. To fully restore connectivity and ensure stability, the team replaced the server as a permanent fix. The issue was fully resolved by January 25, 2025, at 12:33 AM PST.

Root Cause

The root cause was the loss of connectivity between the server and the shard instance. This connectivity issue impacted a specific API call but did not cause a system-wide outage. The alerting systems were able to detect the issue in a timely manner and alert our teams.

Remediation Actions 

·        Connectivity Restoration: Restored connectivity between the replaced server and the shard instance.

·        Server Replacement: As an additional precaution, replaced the unresponsive server again to ensure stability.

·        Documentation Update: Updated internal documentation to include steps for verifying connectivity after server replacements and during server unresponsiveness. This ensures future issues of this nature are proactively addressed.

Future Preventative Measures  

Connectivity Validation:

·        Implement checks to verify connectivity between servers and shard instances after server replacements.

·        Integrate these checks into the standard server replacement process.
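
The connectivity check described above could be sketched roughly as follows. This is a minimal illustration, not the actual tooling used by the team: it assumes that "connectivity" can be approximated by opening a TCP connection from the replaced server to each shard endpoint, and the function names and the host/port pairs are hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_endpoints(endpoints, timeout: float = 3.0):
    """Return the (host, port) pairs that could not be reached.

    `endpoints` is a hypothetical list of shard instance addresses; an empty
    result means every endpoint accepted a connection.
    """
    return [(host, port) for host, port in endpoints if not can_connect(host, port, timeout)]
```

Run as the last step of a server replacement, a non-empty result from `unreachable_endpoints` would signal that the new server should not yet be put back into service.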

Load Balancer Monitoring:

·        Enhance monitoring of load balancer components to identify and address unresponsiveness proactively.

·        Implement redundancy measures to minimize the impact of single-server failures.
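
One common way to detect an unresponsive backend is to count consecutive failed health probes and flag the server once a threshold is reached. The sketch below illustrates that idea only; the class name, the default threshold of 3, and the way probe results are obtained are assumptions, not details of the platform's monitoring.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks consecutive failed probes for a single backend server."""

    failure_threshold: int = 3  # assumed value, chosen for illustration
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result.

        Returns True when the backend should be treated as unresponsive,
        i.e. the number of consecutive failures has reached the threshold.
        """
        if healthy:
            # Any successful probe resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

Requiring several consecutive failures avoids alerting on a single dropped probe, at the cost of slightly slower detection.
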
Posted Feb 11, 2025 - 04:32 PST

Resolved

The service has been replaced and the issue has been resolved.
Posted Jan 25, 2025 - 00:46 PST

Identified

Incident Description:
Our team has identified a potential issue that could impact customers using the Cloud Management Platform (CMP) on Self-Service Shard 3 & 4. As a result, some customers may experience intermittent errors while making specific API calls.

Priority: P3

Restoration activity:
Our technical teams are actively involved and are working to restore the services. They have identified the unresponsive service and are working to replace it.
Posted Jan 25, 2025 - 00:12 PST
This incident affected: Legacy Cloud Management (Self-Service - Shard 3, Self-Service - Shard 4).