Description: Flexera One – RightScale Self Service – NA – Errors Accessing Available CloudApps
Timeframe: July 28th, 2:08 PM to August 4th, 7:00 PM PDT
Incident Summary
On July 28th at 2:08 PM PDT, during scheduled maintenance on the Cloud Management and Self-Service environments to migrate production credentials from our legacy service to our new Cloud Service in the US, we encountered issues with the RightScale Self-Service Portal and with Billing Services in Cloud Cost Optimization. As a result, customers may have experienced errors while accessing Available CloudApps from RightScale, which may have prevented them from deploying, configuring, and managing applications across cloud service provider platforms. In addition, some customers may have experienced issues accessing the Billing configuration page in Cloud Cost Optimization.
Upon investigation, staff found that the authentication service on one of the clusters was experiencing resource contention due to several failing connection requests. At 4:11 PM PDT, staff reconfigured the impacted service. Technical staff then discovered that one of the load balancers was overwhelmed with incoming traffic. At 5:02 PM PDT, the impacted load balancer was restarted; however, health checks revealed that incoming requests were still failing.
Staff continued their investigation overnight, and on July 29th at 6:44 AM PDT deployed additional optimizations into the environment. On August 1st at 12:27 PM PDT, staff updated the instances backing the authentication service, migrating them to an enhanced instance type with more memory. The infrastructure was further scaled up to handle sudden traffic surges and prevent server overload. Monitoring and health checks showed that the Billing Configuration feature was accessible again via Cloud Cost Optimization; however, Available CloudApps were still inaccessible via RightScale Self Service.
Additional SMEs were engaged to assist with the investigation, and staff worked directly with the impacted customers to mitigate and isolate the problem. Meanwhile, staff continued to research and implement a long-term solution. On August 2nd at 1:43 PM PDT, additional resources were deployed to provide further relief to the service, and enhanced alerting and monitoring capabilities were deployed into the environment.
After further analysis and research, on August 4th at 4:43 PM PDT, technical staff identified the root cause to be a missing field in the Self Service configuration that was required to authenticate the credentials for incoming requests. In addition, staff discovered that the service lacked error handling to detect missing fields in the Self Service configuration. At 6:13 PM PDT, technical staff deployed code fixes to re-introduce the missing field to the new cloud service, after which subsequent calls authenticated successfully with the service. After additional monitoring and health checks, the incident was declared resolved on August 4th at 7:00 PM PDT.
Primary Root Cause:
Technical staff identified the root cause to be a missing field in the Self Service configuration that was required to authenticate the credentials for incoming requests.
Contributing Causes:
• The service lacked error handling to detect missing fields in the Self Service configuration
• Repeated failing incoming requests overwhelmed the service, resulting in downstream impact on dependent applications
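The two contributing causes above point at the same gap: a required configuration field was absent, and nothing validated the configuration before requests started failing against it. A minimal sketch of the fail-fast pattern that addresses this, using hypothetical field names (not Flexera's actual Self Service configuration schema):

```python
# Hypothetical sketch: validate required configuration fields up front,
# so a missing field produces one clear startup error instead of a
# stream of failing authentication requests that overwhelms the service.

REQUIRED_FIELDS = ("client_id", "client_secret", "auth_endpoint")  # assumed names

class ConfigError(Exception):
    """Raised when the service configuration is incomplete."""

def validate_config(config: dict) -> dict:
    """Return the config unchanged, or raise listing every missing field."""
    missing = [field for field in REQUIRED_FIELDS if not config.get(field)]
    if missing:
        # Fail fast at load time rather than letting each incoming
        # request fail later during credential authentication.
        raise ConfigError(f"missing required configuration fields: {missing}")
    return config
```

Running validation at service startup (or on configuration reload) would have surfaced the missing field immediately, before any downstream traffic was affected:

```python
try:
    validate_config({"client_id": "abc", "auth_endpoint": "https://example.invalid"})
except ConfigError as exc:
    print(exc)  # names client_secret as the missing field
```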
Corrective Actions: