Flexera One - RightScale Self-Service - NA - Slow Response/Errors
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One - RightScale Self-Service - Shard 3 - NA - Slow Response/Errors

Timeframe: March 14th, 7:06 PM to March 14th, 7:59 PM PDT

Incident Summary:

On March 14th at 7:06 PM PDT, RightScale Self-Service in Shard 3 began returning slow responses and server errors to customers using the service portal. This impaired their ability to manage resources and may have caused delays or disruptions in their operations.

The root cause was traced to an unusual failure in a critical system component: the component kept functioning, but at significantly reduced performance. Resolving this degraded state required replacing the component.

To address the situation and restore normal service, the faulty component was replaced at 7:59 PM PDT. Following a series of health checks and thorough monitoring, the incident was successfully resolved.

Root Cause:

The root cause of the incident was an unusual failure in a critical system component, which left the component running but at significantly reduced performance. This degradation led to slow response times and server errors, ultimately affecting customers' ability to access and manage resources via the service portal.

Contributing Cause:

The monitoring system was unable to detect the component's partially functional state and alert the team. Without timely detection and intervention, the issue persisted and continued to affect customers' ability to access and manage resources via the service portal.
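
As an illustration of the detection gap described above, the following is a minimal sketch of a health check that distinguishes a degraded component from a healthy or fully failed one. It is not Flexera's monitoring implementation. The names `Probe` and `classify`, and the thresholds of 2 seconds and a 5% error rate, are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Probe:
    """One synthetic request against the component (hypothetical type)."""
    ok: bool          # True if the request succeeded
    latency_s: float  # observed response time in seconds


def classify(
    probes: Sequence[Probe],
    latency_threshold_s: float = 2.0,  # assumed threshold, tune per service
    max_error_rate: float = 0.05,      # assumed threshold, tune per service
) -> str:
    """Return 'unknown', 'down', 'degraded', or 'healthy' for a probe window."""
    if not probes:
        return "unknown"

    failures = sum(1 for p in probes if not p.ok)
    if failures == len(probes):
        # Nothing succeeded: a simple up/down check would catch this case.
        return "down"

    error_rate = failures / len(probes)

    # 95th percentile latency of successful requests (nearest-rank method).
    latencies = sorted(p.latency_s for p in probes if p.ok)
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]

    # The component still answers, but too slowly or with too many errors.
    # This is the partially functional state an up/down check would miss.
    if error_rate > max_error_rate or p95 > latency_threshold_s:
        return "degraded"
    return "healthy"
```

A check of this kind raises an alert on the `degraded` result as well as on `down`, so a component that still responds but performs poorly is surfaced to the team.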

Corrective Actions:

  1. Once the faulty component was identified, we replaced it with a fully functional one to restore the system's stability and performance.
  2. We will conduct a thorough evaluation of the existing monitoring systems to identify shortcomings and implement improvements, with particular attention to detecting components that are only partially functional.
  3. We will explore ways to improve our alert mechanisms so that they provide more timely notifications of potential issues, allowing the team to respond and intervene promptly.
Posted Mar 24, 2023 - 15:02 PDT

Resolved
Incident Description: On March 14th, we experienced an intermittent issue affecting the RightScale Self-Service Shard 3 instance. Customers may have encountered slow responses or internal server errors when attempting to access the RightScale Self-Service Portal. As a result, they may have been unable to launch or manage cloud resources effectively, leading to potential delays or disruptions in their cloud operations.

Priority: P2

Restoration Activity: Technical staff have identified the issue and completed remediation activities. In addition, health checks and monitoring have confirmed services are now stable. This incident has been resolved.
Posted Mar 20, 2023 - 10:22 PDT
This incident affected: Legacy Cloud Management (Cloud Management Dashboard - Shard 3, Self-Service - Shard 3).