Flexera One - Cloud Cost Optimization - NAM - Delay in Bill processing
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One - Cloud Cost Optimization - NAM - Delay in Bill processing

Timeframe: August 31st, 2024, 04:10 AM PDT to September 2nd, 2024, 08:41 AM PDT

Incident Summary

On Saturday, August 31st, 2024, at 04:10 AM PDT, we detected an issue affecting the Cloud Cost Optimization platform in the North America (NAM) region where some of our customers experienced delays in bill processing. Although the user interface remained accessible, it displayed outdated data because bill imports were failing. The incident was initially attributed to exceeding a Cloud service's daily processing limit, which temporarily blocked bill processing for several affected organizations.

After diagnosing the issue, our technical team manually ran imports following the reset of the daily Cloud processing limit. Our team also collaborated closely with the service provider to implement a temporary quota increase and other enhancements, which helped stabilize processing. By September 1st, 2024, at 07:15 AM PDT, bill processing had returned to normal for all customers except one.

Further investigation revealed that the root cause of the issue was a recently deployed code change. This change introduced additional data for all organizations, but the data volume for one customer exceeded the platform's processing capacity, causing import failures. To restore service, the team reverted the feature deployment for that specific customer, resolving the immediate problem. After extended monitoring, we declared the issue resolved on September 2nd, 2024, at 08:41 AM PDT.

Our teams are working on a permanent fix for this issue, which is being tracked under a problem record.

Root Cause

The root cause of the incident was traced to a recent code change, which introduced additional data processing across all customer accounts. The resulting data volume exceeded the platform's processing capacity, triggering failures in the import process. In addition, the platform breached Google Cloud's daily processing limit, which blocked further imports and delayed bill processing for affected customers.
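
For illustration only: this report does not name the exact Google Cloud service or client API involved, but a daily quota breach of the kind described typically surfaces to an import job as a hard error that cannot succeed until the quota resets. The minimal Python sketch below assumes the imports run as BigQuery jobs (an assumption, not a detail from this report) and shows how such a failure can be detected and deferred rather than blindly retried.

```python
# Illustrative sketch only: assumes bill imports run as BigQuery jobs,
# which is not stated in this report.
from google.api_core.exceptions import Forbidden
from google.cloud import bigquery

def run_import(client: bigquery.Client, sql: str) -> bool:
    """Run one import job; return False if a processing quota is exhausted."""
    try:
        client.query(sql).result()  # blocks until the job completes
        return True
    except Forbidden as exc:
        # Daily quota breaches surface as HTTP 403 errors mentioning the
        # exceeded quota; retrying before the daily reset cannot succeed,
        # so the import is deferred instead. (Simplified string check.)
        if "quota" in str(exc).lower():
            return False
        raise
```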

Remediation Actions


· Manual Imports: Our technical team manually ran imports for several affected organizations to reduce the backlog and the impact on customers (a minimal sketch of this kind of backlog drain follows this list).

· Auto-Processing Disabled: Auto-processing was temporarily disabled during troubleshooting so the platform could handle the manual imports without interference. It was re-enabled after all manual imports were complete.

· Collaboration with Service Provider: The team worked closely with the service provider to implement a temporary quota increase and additional system enhancements, helping to stabilize the platform.

· Revert Feature Deployment: To resolve the immediate issue, the team reverted the recent feature deployment for the customer generating the largest amount of traffic.
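
The sketch below shows one way a manual backlog drain like the one described above could work: re-run imports for deferred organizations while staying under an assumed daily job budget, and pause until the quota window resets when the budget is spent or a run fails. All names here (pending_orgs, run_import, DAILY_JOB_BUDGET) are hypothetical, not Flexera's internal API.

```python
import time

# Hypothetical backlog drain under an assumed daily processing quota.
DAILY_JOB_BUDGET = 1_000            # assumed per-day job quota
QUOTA_RESET_WAIT_SECONDS = 60 * 60  # assumed re-check interval

def drain_backlog(pending_orgs: list[str], run_import) -> None:
    jobs_used = 0
    while pending_orgs:
        if jobs_used >= DAILY_JOB_BUDGET:
            time.sleep(QUOTA_RESET_WAIT_SECONDS)  # wait out the quota window
            jobs_used = 0
        if run_import(pending_orgs[0]):  # True: import finished cleanly
            pending_orgs.pop(0)
            jobs_used += 1
        else:                            # quota hit mid-run: pause, don't retry
            time.sleep(QUOTA_RESET_WAIT_SECONDS)
            jobs_used = 0
```

Disabling auto-processing while such a drain runs, as the team did, keeps the automated scheduler from competing with the manual imports for the same quota.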

Future Preventative Measures


· Scalable Data Processing Solution: A long-term solution is being planned to enhance the platform's ability to handle larger data volumes and prevent a recurrence of this incident.

· Improved Quota Monitoring and Management: Improved monitoring tools will be implemented to track quota usage and alert before limits are exceeded (see the illustrative sketch after this list).

· Pre-Deployment Testing for Large Data Sets: Our teams are reviewing and implementing pre-deployment tests that simulate data volumes from the largest traffic generators.
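
As a hedged sketch of what the quota monitoring item above could look like, the snippet below alerts when daily usage crosses a threshold instead of waiting for the limit to be breached. fetch_jobs_run_today and send_alert are hypothetical hooks, and the 80% threshold is an assumption; the report specifies neither.

```python
# Hypothetical quota watcher: warn before the daily limit is breached.
DAILY_JOB_LIMIT = 1_000  # assumed daily processing quota
ALERT_THRESHOLD = 0.8    # assumed warning threshold (80% of quota)

def check_quota(fetch_jobs_run_today, send_alert) -> None:
    used = fetch_jobs_run_today()
    if used >= DAILY_JOB_LIMIT * ALERT_THRESHOLD:
        send_alert(
            f"Quota warning: {used}/{DAILY_JOB_LIMIT} daily jobs used "
            f"({used / DAILY_JOB_LIMIT:.0%}); throttle imports before the limit is hit."
        )
```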

Posted Sep 05, 2024 - 23:53 PDT

Resolved
We have identified that the issue was potentially caused by a code change deployed last week, which introduced a larger data volume across the system than expected. To address this, we have made adjustments to the deployment, which resolved the immediate problem. We are also implementing further changes to optimize performance, with an update scheduled for tomorrow.

Currently, the import backlog is under control, and system usage is stable. Given these improvements, we are closing this incident. We will continue to work on long-term solutions to manage data volumes effectively and prevent similar issues in the future.

We will keep monitoring the system closely to ensure continued stability and performance.
Posted Sep 02, 2024 - 09:40 PDT
Update
We have observed significant improvements following the remedial actions taken by our teams. Our extended monitoring indicates that we are no longer hitting the limits after the fix was implemented. We will continue to investigate the root cause of the initial issue to prevent future occurrences and will provide further updates as we work towards a full resolution.
Posted Sep 02, 2024 - 05:34 PDT
Update
Our team has made significant progress in reducing the import backlog, which has now largely returned to normal levels. We are carefully managing system resources to ensure effective processing and are closely monitoring the situation to maintain stability.

Additionally, we are in discussions with our service provider to implement enhancements and upgrades to system capacity and are working to expedite this process. We continue to investigate the root cause of the initial issue to prevent any future occurrences and will provide further updates as we work toward a full resolution.
Posted Sep 01, 2024 - 08:25 PDT
Update
Our team has been collaborating closely with the service provider and has implemented enhancements that have resulted in positive outcomes. Additionally, we are working to implement a temporary increase in system capacity for the short term.
Posted Aug 31, 2024 - 17:22 PDT
Identified
The technical team has identified that the delays were caused by exceeding certain system limits, which affected processing. To mitigate the impact, we have initiated manual processes to address the backlog, and most tasks are progressing well, although some have needed additional attention.

Auto-processing was temporarily paused during troubleshooting and will be re-enabled once the backlog is back to normal levels. We are closely monitoring the system to ensure everything runs smoothly when auto-processing resumes.

While there are signs of improvement and a steady reduction in the backlog, we are also monitoring the system limits carefully to avoid further delays. We have engaged our service provider to assist in resolving the issue and are exploring options to expedite the resolution.

We will provide further updates as we continue to work toward a resolution.
Posted Aug 31, 2024 - 10:20 PDT
Investigating
Incident Description:
We are currently investigating an issue affecting our Cloud Cost Optimization platform in the North America (NAM) region. As a result, some customers may experience delays in bill processing. While the user interface remains operational, it may display outdated data.

Priority: P2

Restoration activity:
Our technical teams are actively involved and are evaluating the situation. Additionally, we are exploring potential solutions to rectify the issue as quickly as possible.
Posted Aug 31, 2024 - 04:35 PDT
This incident affected: Flexera One - Cloud Management - North America (Cloud Cost Optimization - US).