Flexera One - Cloud Cost Optimization (CCO) - NAM - Service degradation

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - Cloud Cost Optimization (CCO) - NAM - Service degradation

Timeframe: August 7, 2025, 8:00 PM PDT to August 8, 2025, 8:38 AM PDT

Incident Summary

On August 7, 2025, at 8:00 PM PDT, our teams identified an issue within the Cloud Cost Optimization (CCO) platform that affected a subset of customers in the North America (NAM) region. Although the platform remained accessible, impacted users experienced reduced functionality in areas such as recommendations, bill ingestion, shared costs, cost allocation, and related services.

Additionally, some users received “500 – Internal Server Error” messages when attempting these operations. The issue was traced to a single API endpoint whose failures disrupted the execution of most policy templates for the affected customers.

Our technical teams promptly engaged to restore normal service functionality. They discovered that the policy engine was overwhelmed due to the simultaneous execution of multiple large policies, leading to memory exhaustion. To mitigate this, our teams adjusted memory allocations to better handle the load and increased overall memory capacity. As a result, services were restored by August 8, 2025, at 8:38 AM PDT.

Following the restoration, our teams continued to monitor the policy runs, which required additional time due to existing dependencies. After thorough monitoring, the issue was confirmed as fully resolved by August 8, 2025, at 1:20 PM PDT.

Root Cause

The incident was caused by the backend policy engine becoming overwhelmed when multiple large policies executed simultaneously. As a result, services in the cluster ran out of memory, which degraded performance and caused API failures.

Remediation Actions

· Issue Identification: Detected the failing API endpoint and traced the problem to the backend policy engine.

· Resource Scaling: Increased memory allocation for affected pods to stabilize the services (a sketch of this kind of change follows this list).

· Optimization: Applied memory utilization optimizations to improve efficiency.

· Validation: Confirmed recovery once the UI became responsive and API calls succeeded.

· Customer Guidance: Advised impacted customers to re-run non-meta policies manually.

· Extended Monitoring: Closely monitored meta parent policy runs, which required more time due to dependencies.
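For illustration, the resource-scaling step above can be expressed as a Kubernetes change. The following Go sketch uses client-go to raise the memory request and limit on a policy-engine Deployment; the "cco" namespace, the "policy-engine" names, and the 4Gi/8Gi sizes are assumptions for the example, not the actual values changed during the incident.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load the local kubeconfig; inside a cluster,
        // rest.InClusterConfig() would be used instead.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Strategic-merge patch raising the memory request and limit
        // on the hypothetical "policy-engine" container.
        patch := []byte(`{"spec":{"template":{"spec":{"containers":[
          {"name":"policy-engine",
           "resources":{"requests":{"memory":"4Gi"},
                        "limits":{"memory":"8Gi"}}}]}}}}`)

        _, err = client.AppsV1().Deployments("cco").Patch(
            context.Background(), "policy-engine",
            types.StrategicMergePatchType, patch, metav1.PatchOptions{})
        if err != nil {
            panic(err)
        }
        fmt.Println("policy-engine memory request/limit raised")
    }

The same change is commonly made declaratively through the Deployment manifest or imperatively with kubectl set resources; either way, the container-level memory request and limit are the knobs being raised.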

Future Preventative Measures

· Memory Management Improvements: Enhance memory monitoring and tuning within the policy engine.

· Policy Execution Optimization: Optimize how the system handles concurrent large policy runs.

· Workload Controls: Implement safeguards, such as concurrency limits and queueing, for heavy workloads (a sketch of one such control follows this list).

· Engineering Enhancements: Deliver improvements under active epics focused on the policies framework.
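To sketch the workload-control idea, the snippet below caps how much estimated memory worth of policy runs may execute concurrently and queues the rest, instead of letting unbounded parallelism exhaust the pod. It uses the weighted semaphore from golang.org/x/sync; the policyRun type, the cost estimates, and runPolicy are illustrative assumptions, not Flexera's implementation.

    package main

    import (
        "context"
        "fmt"
        "sync"

        "golang.org/x/sync/semaphore"
    )

    // policyRun is a hypothetical unit of work with an estimated
    // peak memory cost in MiB.
    type policyRun struct {
        id      string
        costMiB int64
    }

    func main() {
        // Admit at most ~6 GiB of estimated policy memory at once;
        // further runs wait here instead of exhausting the pod.
        const budgetMiB = 6 * 1024
        sem := semaphore.NewWeighted(budgetMiB)

        runs := []policyRun{
            {"rightsize-recs", 2048},
            {"bill-ingest", 3072},
            {"shared-costs", 1024},
            {"cost-allocation", 2048},
        }

        var wg sync.WaitGroup
        ctx := context.Background()
        for _, r := range runs {
            wg.Add(1)
            go func(r policyRun) {
                defer wg.Done()
                // Blocks until enough of the memory budget is free.
                if err := sem.Acquire(ctx, r.costMiB); err != nil {
                    return
                }
                defer sem.Release(r.costMiB)
                runPolicy(r)
            }(r)
        }
        wg.Wait()
    }

    // runPolicy stands in for the real policy execution.
    func runPolicy(r policyRun) {
        fmt.Printf("executing %s (%d MiB budget)\n", r.id, r.costMiB)
    }

Weighting admission by estimated memory rather than by a simple count of runs lets a few large policies or many small ones share the same budget, which matches the failure mode seen in this incident.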

Posted Aug 22, 2025 - 03:59 PDT

Resolved

Backend services have stabilized, and the platform is operating as expected with no further outstanding concerns. This incident is now resolved.
Posted Aug 07, 2025 - 13:27 PDT

Monitoring

The disruption was traced to a backend service issue, and corrective actions have been taken to restore functionality. The platform is now operating as expected.

Customer Action: If any non-meta policies remain in a pending state, we recommend manually triggering them using the “Run now” option. Meta parent policies may take additional time to complete automatically.

We will continue to monitor the platform to ensure full recovery.
Posted Aug 07, 2025 - 08:56 PDT

Investigating

Incident Description: We are currently experiencing an issue within the Cloud Cost Optimization (CCO) platform affecting a subset of our customers in the NAM region. While the platform remains accessible, affected customers may experience degraded functionality related to recommendations, bill ingestion, shared costs, cost allocation, and other related services. Users may also encounter “500 – Internal Server Error” messages during these operations.

Priority: P2

Restoration Activity: Our technical team is actively investigating the root cause and working to restore services as quickly as possible. We will continue to share updates as we make progress toward a resolution.
Posted Aug 07, 2025 - 05:33 PDT
This incident affected: Flexera One - Cloud Management - North America (Cloud Cost Optimization - US).