Description: Flexera One - Cloud Cost Optimization (CCO) - NAM - Service degradation
Timeframe: August 7, 2025, 8:00 PM PDT to August 8, 2025, 8:38 AM PDT
Incident Summary
On August 7, 2025, at 8:00 PM PDT, our teams identified an issue within the Cloud Cost Optimization (CCO) platform that affected a subset of customers in the North America (NAM) region. Although the platform remained accessible, impacted users experienced reduced functionality in areas such as recommendations, bill ingestion, shared costs, cost allocation, and related services.
Additionally, some users received “500 – Internal Server Error” messages when attempting these operations. The issue was traced to a specific API endpoint, which disrupted the execution of most policy templates for the affected customers.
Our technical teams engaged promptly to restore normal service functionality. They found that the policy engine had been overwhelmed by the simultaneous execution of multiple large policies, leading to memory exhaustion. To mitigate this, our teams adjusted memory allocations to better handle the load and increased overall memory capacity. As a result, services were restored by August 8, 2025, at 8:38 AM PDT.
Following the restoration, our teams continued to monitor the policy runs, which required additional time because of dependencies between runs. After thorough monitoring, the issue was confirmed fully resolved on August 8, 2025, at 1:20 PM PDT.
Root Cause
The incident was caused by a backend issue in which the policy engine was overwhelmed by multiple large policies executing simultaneously. The resulting out-of-memory conditions in the cluster degraded performance and caused API failures.
Remediation Actions
· Issue Identification: Detected the failing API endpoint and traced the problem to the backend policy engine.
· Resource Scaling: Increased memory allocation for affected pods to stabilize the services.
· Optimization: Applied memory utilization optimizations to improve efficiency.
· Validation: Confirmed recovery as the UI became responsive and API calls succeeded.
· Customer Guidance: Advised impacted customers to re-run non-meta policies manually.
· Extended Monitoring: Closely monitored meta parent policy runs, which required more time due to dependencies.
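The resource-scaling step above is typically applied as Kubernetes memory requests and limits on the affected pods. A hypothetical fragment is shown below; the sizes are illustrative and do not reflect the actual values used during the incident:

```yaml
# Illustrative pod resource settings (values are examples only)
resources:
  requests:
    memory: "4Gi"   # guaranteed allocation for the policy-engine pod
  limits:
    memory: "8Gi"   # hard ceiling before the pod is OOM-killed
```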
Future Preventative Measures
· Memory Management Improvements: Enhance memory monitoring and tuning within the policy engine.
· Policy Execution Optimization: Optimize how the system handles concurrent large policy runs.
· Workload Controls: Implement controls that prevent heavy workloads from overwhelming the policy engine.
· Engineering Enhancements: Deliver improvements under active epics focused on the policies framework.
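Workload controls of the kind described above are often implemented as a cap on how many policies may execute at once, so that a burst of large policies queues rather than exhausting memory. A minimal sketch in Python, using a semaphore to bound concurrency; the run_policy function and the limit value are illustrative assumptions, not the actual engine implementation:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Cap on simultaneous policy runs; a real value would be tuned
# against the engine's available memory (illustrative only).
MAX_CONCURRENT_RUNS = 4
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_RUNS)

def run_policy(policy_id):
    # Hypothetical stand-in for the engine's actual execution call.
    return f"completed {policy_id}"

def run_with_limit(policy_id):
    # Block until a slot frees up instead of letting every large
    # policy execute at once and exhaust memory.
    with _slots:
        return run_policy(policy_id)

def run_all(policy_ids):
    # The pool may hold more workers than slots; the semaphore,
    # not the pool size, enforces the memory-driven limit.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(run_with_limit, policy_ids))
```

Excess runs simply wait for a slot, trading some latency for predictable memory use.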