Postmortem -
Sep 5, 23:53 PDT
Resolved -
We have determined that the issue was likely caused by a code change deployed last week, which increased data volume across the system beyond expected levels. To address this, we have made adjustments to the deployment, which resolved the immediate problem. We are also implementing further changes to optimize performance, with an update scheduled for tomorrow.
Currently, the import backlog is under control, and system usage is stable. Given these improvements, we are closing this incident. We will continue to work on long-term solutions to manage data volumes effectively and prevent similar issues in the future.
We will keep monitoring the system closely to ensure continued stability and performance.
Sep 2, 09:40 PDT
Update -
We have observed significant improvements following the remedial actions taken by our teams. Our extended monitoring indicates that, since the fix was implemented, we are no longer exceeding system limits. We will continue to investigate the root cause of the initial issue to prevent future occurrences and will provide further updates as we work toward a full resolution.
Sep 2, 05:34 PDT
Update -
Our team has made significant progress in reducing the import backlog, which has now largely returned to normal levels. We are carefully managing system resources to ensure effective processing and are closely monitoring the situation to maintain stability.
Additionally, we are in discussions with our service provider to implement enhancements and upgrades to system capacity and are working to expedite this process. We continue to investigate the root cause of the initial issue to prevent any future occurrences and will provide further updates as we work toward a full resolution.
Sep 1, 08:25 PDT
Update -
Our team has been collaborating closely with the service provider and has implemented enhancements that have yielded positive results. Additionally, we are working to put a temporary increase in system capacity in place.
Aug 31, 17:22 PDT
Identified -
The technical team has identified that the delays were caused by exceeding certain system limits, which affected processing. To mitigate the impact, we have initiated manual processes to address the backlog, and most tasks are progressing well, although some have needed additional attention.
Auto-processing was temporarily paused during troubleshooting and will be re-enabled once the backlog is back to normal levels. We are closely monitoring the system to ensure everything runs smoothly when auto-processing resumes.
While there are signs of improvement and a steady reduction in the backlog, we are also monitoring the system limits carefully to avoid further delays. We have engaged our service provider to assist in resolving the issue and are exploring options to expedite the resolution.
We will provide further updates as we continue to work toward a resolution.
Aug 31, 10:20 PDT
Investigating -
Incident Description:
We are currently investigating an issue affecting our Cloud Cost Optimization platform in the North America (NAM) region. As a result, some customers may experience delays in bill processing. While the user interface remains operational, it may display outdated data.
Priority: P2
Restoration activity:
Our technical teams are actively engaged in assessing the issue and are exploring potential solutions to restore normal processing as quickly as possible.
Aug 31, 04:35 PDT