Description: Flexera One - Cloud Cost Optimization - NAM - Delay in Bill processing
Timeframe: August 31st, 2024, 04:10 AM PDT to September 2nd, 2024, 08:41 AM PDT
Incident Summary
On Saturday, August 31st, 2024, at 04:10 AM PDT, we detected an issue affecting the Cloud Cost Optimization platform in the North America (NAM) region, where some of our customers experienced delays in bill processing. Although the user interface remained accessible, it displayed outdated data due to a failure in processing imports. The incident was initially attributed to the platform exceeding Cloud service processing limits, which temporarily blocked bill processing for several affected organizations.
After diagnosing the issue, our technical team manually ran imports once the daily Cloud processing limit had reset. Our team also collaborated closely with the service provider to implement a temporary quota increase and other enhancements, which helped stabilize the platform. By September 1st, 2024, at 07:15 AM PDT, bill processing had returned to normal for all customers except one.
Further investigation revealed that the root cause of the issue was a recently deployed code change. This change introduced additional data for all organizations, and for one customer the resulting data volume exceeded the platform's processing capacity, causing processing issues. To restore service, the team reverted the feature deployment for that specific customer, resolving the immediate problem. After extended monitoring, we declared the issue resolved on September 2nd, 2024, at 08:41 AM PDT.
Our teams are working on a permanent fix for this issue, which is being tracked under a problem record.
Root Cause
The root cause of the incident was traced to a recent code change, which introduced additional data processing across all customer accounts. The resulting volume of data exceeded the platform’s processing capabilities, triggering a failure in the import process. Additionally, the platform breached Google Cloud's daily processing limit, leading to a block on further processing and delays in bill processing for affected customers.
Remediation Actions
· Manual Imports: Our technical team manually ran imports for several affected organizations to reduce the backlog and the impact on the customers.
· Auto-Processing Disabled: Auto-processing was temporarily disabled during troubleshooting to ensure the platform could handle the manual imports without interference. It was re-enabled after all manual imports were complete.
· Collaboration with Service Provider: The team worked closely with the service provider to implement a temporary quota increase and additional system enhancements, helping to stabilize the platform.
· Revert Feature Deployment: To resolve the immediate issue, the team reverted the recent feature deployment for the customer generating the largest amount of traffic.
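As an illustration of the last remediation step, a per-customer revert is often expressed as a feature flag with an organization-level exclusion list. The following is a minimal sketch under that assumption; it does not describe how the revert was actually carried out, and the function name, parameters, and organization IDs are hypothetical.

```python
from typing import AbstractSet


def feature_enabled(org_id: str, default_on: bool, disabled_orgs: AbstractSet[str]) -> bool:
    """Return True if the feature should run for this organization.

    default_on:    hypothetical global switch for the new feature.
    disabled_orgs: hypothetical set of organization IDs for which the
                   feature has been reverted.
    """
    return default_on and org_id not in disabled_orgs


# Hypothetical IDs: the feature stays on for everyone except one organization.
reverted = {"org-large"}
print(feature_enabled("org-small", True, reverted))
print(feature_enabled("org-large", True, reverted))
```

A switch of this kind keeps the change in place for organizations it works for while excluding the one whose data volume causes failures.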
Future Preventative Measures
· Scalable Data Processing Solution: A long-term solution is being planned to enhance the platform’s ability to handle larger data volumes, ensuring that such incidents do not occur in the future.
· Improved Quota Monitoring and Management: Improved monitoring tools will be implemented to track quota usage and prevent limits from being exceeded.
· Pre-Deployment Testing for Large Data Sets: Our teams are working to review and implement pre-deployment tests that simulate data volumes from the largest traffic generators.
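The quota monitoring measure above could take the form of a threshold check that runs before each import is scheduled. The sketch below is illustrative only: the class, function names, thresholds, and byte figures are assumptions and do not reflect the monitoring tools being implemented or any provider API.

```python
from dataclasses import dataclass


@dataclass
class QuotaStatus:
    """Hypothetical snapshot of daily processing quota usage."""
    used_bytes: int
    limit_bytes: int

    @property
    def fraction_used(self) -> float:
        return self.used_bytes / self.limit_bytes


def classify_quota(status: QuotaStatus, warn_at: float = 0.7, block_at: float = 0.9) -> str:
    """Map current usage to 'ok', 'warn', or 'block' (thresholds are assumed values)."""
    if status.fraction_used >= block_at:
        return "block"
    if status.fraction_used >= warn_at:
        return "warn"
    return "ok"


def can_schedule(status: QuotaStatus, estimated_bytes: int, block_at: float = 0.9) -> bool:
    """Allow an import only if projected usage stays below the block threshold."""
    projected = (status.used_bytes + estimated_bytes) / status.limit_bytes
    return projected < block_at


status = QuotaStatus(used_bytes=50, limit_bytes=100)
print(classify_quota(status))
print(can_schedule(status, estimated_bytes=45))
```

Checking the projected usage, rather than only the current usage, lets a single unusually large import be deferred before it pushes the platform past the daily limit.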