Flexera One – RightScale Self Service – Errors Accessing Available CloudApps
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One – RightScale Self Service – NA – Errors Accessing Available CloudApps

Timeframe: July 28th, 2:08 PM to August 4th, 7:00 PM PDT

Incident Summary

On July 28th at 2:08 PM PDT, during scheduled maintenance on the Cloud Management and Self-Service environments to migrate production credentials from our legacy service to our new Cloud Service in the US, we encountered issues with our RightScale Self-Service Portal and with Billing Services in Cloud Cost Optimization. As a result, customers may have experienced errors while accessing Available CloudApps from RightScale. This may have prevented customers from deploying, configuring, and managing applications across different cloud service provider platforms. In addition, some customers may have experienced issues accessing the Billing configuration page from Cloud Cost Optimization.

Upon investigation, staff found that the authentication service on one of the clusters was experiencing resource contention caused by several failing connection requests. At 4:11 PM PDT, staff reconfigured the impacted service. Technical staff also discovered that one of the load balancers was overwhelmed with incoming traffic, and at 5:02 PM PDT the impacted load balancer was restarted. However, health checks revealed that incoming requests were still failing.

Staff continued the investigation overnight and, on July 29th at 6:44 AM PDT, deployed additional optimizations into the environment. On August 1st at 12:27 PM PDT, staff updated the instances on the authentication service and migrated them to an enhanced version with more memory. The infrastructure was further scaled up to handle sudden traffic surges and prevent server overload. Monitoring and health checks showed that the Billing Configuration feature was accessible again via Cloud Cost Optimization. However, Available CloudApps were still inaccessible via RightScale Self Service.

Additional subject matter experts (SMEs) were engaged to assist with the investigation. Staff also worked directly with the impacted customers to isolate and mitigate the problem. Meanwhile, staff continued to research and work on implementing a long-term solution. On August 2nd at 1:43 PM PDT, additional resources were deployed to provide further relief to the service. Staff also deployed enhanced alerting and monitoring capabilities in the environment.

After further analysis and research, on August 4th at 4:43 PM PDT, technical staff identified the root cause as a missing field in the Self Service configuration that was required to authenticate the credentials for incoming requests. In addition, staff discovered that Self Service lacked the error handling needed to detect missing fields. At 6:13 PM PDT, technical staff deployed code fixes to re-introduce the missing field to the new cloud service, after which subsequent calls authenticated successfully with the service. After additional monitoring and health checks, the incident was declared resolved on August 4th at 7:00 PM PDT.

Primary Root Cause:

Technical staff identified the root cause to be a missing field in the Self Service configuration that was required to authenticate the credentials for incoming requests.

Contributing Causes:

• Self Service lacked the error handling needed to detect missing fields
• Multiple failing incoming requests overwhelmed the service, resulting in a downstream impact on the applications
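
As an illustration of the missing-field error handling described above, the following is a minimal sketch in Python and not Flexera's actual fix. The report does not name the missing field, so the field names, the exception class, and the function are all hypothetical. The idea is to check the configuration for required fields before any authentication call is made and to fail immediately with an error that names every absent field, rather than letting requests fail repeatedly downstream.

```python
from typing import Any, List, Mapping

# Hypothetical field names: the report does not say which field was missing.
REQUIRED_FIELDS = ("client_id", "credential_ref", "auth_endpoint")


class MissingConfigFieldError(ValueError):
    """Raised when a required configuration field is absent or empty."""

    def __init__(self, missing: List[str]) -> None:
        self.missing = missing
        super().__init__(
            "missing required config field(s): " + ", ".join(missing)
        )


def validate_config(config: Mapping[str, Any]) -> None:
    """Fail fast, naming every required field that is absent or empty."""
    missing = [
        name for name in REQUIRED_FIELDS
        if config.get(name) in (None, "")
    ]
    if missing:
        raise MissingConfigFieldError(missing)
```

Calling validate_config at service startup or immediately after a credential migration would surface a missing field as a single explicit error instead of a stream of failed authentication requests.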

Corrective Actions:

  • Technical staff deployed code fixes to re-introduce the missing field to the new cloud service, following which subsequent calls have been successful in the environment
  • Enhanced alerting and monitoring capabilities have been implemented in the environment
  • Error handling capabilities to detect missing fields from the Self Service have been deployed in production
  • Change and release processes will be enhanced to improve coordination and testing among technical staff prior to and during deployments to avoid future outages
Posted Aug 18, 2022 - 21:00 PDT

Resolved
Health checks and monitoring have confirmed that services have been stable since the code fix was implemented last night. This incident has been resolved.
Posted Aug 05, 2022 - 09:58 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 04, 2022 - 21:19 PDT
Update
Technical teams are analyzing all the components to isolate and identify the problem.
Posted Aug 04, 2022 - 06:45 PDT
Update
We are continuing to troubleshoot this issue. We are reviewing the activity logs to isolate and identify the contributing factor(s).
Posted Aug 03, 2022 - 13:34 PDT
Investigating
Incident Description: Some customers may experience errors while accessing Available CloudApps from RightScale. This may prevent customers from deploying, configuring, and managing applications across different cloud service provider platforms.


Restoration activity: Technical teams have been engaged and are investigating.
Posted Aug 03, 2022 - 12:43 PDT
This incident affected: Legacy Cloud Management (Self-Service - Shard 3, Self-Service - Shard 4) and Flexera One - Cloud Management - North America (Cloud Cost Optimization - US).