Description: Cloud Management Platform - NAM - Slowness in provisioning infrastructure
Timeframe: September 3, 2024, 7:38 PM PDT to September 3, 2024, 7:52 PM PDT
On September 3, 2024, at 7:38 PM PDT, our monitoring systems alerted us to an issue affecting the Cloud Management Platform in the North American (NAM) region. Affected customers may have experienced slowness or intermittent errors when provisioning infrastructure.
Although the platform was not completely down, the disruption delayed the processing of requests.
The incident occurred during a scheduled pre-maintenance activity in which our teams were performing routine checks, including network testing. One of these checks, a network test intended to verify connectivity, inadvertently interfered with production network routes. This caused a temporary loss of connectivity, which degraded the platform's performance.
Upon recognizing the issue, our team immediately reverted the network configuration changes. After the changes were reverted, the service began to improve and was fully recovered by 7:45 PM PDT.
Once the network was stable and fully operational, our teams monitored the platform closely to confirm its ongoing stability and to verify that there were no lingering issues. At 7:52 PM PDT, the incident was declared resolved, with the platform functioning normally.
The incident was triggered during a scheduled pre-maintenance activity in which network connectivity was being tested as part of routine checks. During the test, the network configuration applied for testing inadvertently interfered with existing network routes in the production environment. This caused a brief loss of connectivity, which resulted in slowness and provisioning errors.
· Immediate Rollback: Once the issue was identified, the team reverted the changes to restore the previous network configuration.
· Network Recovery: Although the changes were reverted quickly, full network functionality did not return until 7:45 PM PDT; performance was monitored throughout the recovery.
· Post-Incident Monitoring: The platform was closely monitored until the incident was declared resolved at 7:52 PM PDT, to confirm stability and rule out further issues.
· Enhanced Network Testing Protocols: We will revise the network testing processes to prevent future interference with production environments.
· Additional Pre-Maintenance Checks: We have added checks to our pre-maintenance preparation procedures so that any network change is validated before it can impact live services.
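
As an illustration only, one kind of pre-maintenance check could compare the address ranges a network test intends to use against the ranges already routed in production, and block the test if any of them overlap. The sketch below is hypothetical: the function name `find_route_conflicts` and the sample prefixes are assumptions made for this example and do not describe the platform's actual tooling or network layout.

```python
import ipaddress


def find_route_conflicts(test_prefixes, production_prefixes):
    """Return (test_prefix, production_prefix) pairs whose ranges overlap.

    Hypothetical helper: both arguments are lists of CIDR strings.
    """
    conflicts = []
    for test_prefix in test_prefixes:
        test_net = ipaddress.ip_network(test_prefix)
        for prod_prefix in production_prefixes:
            prod_net = ipaddress.ip_network(prod_prefix)
            # Prefixes of different IP versions cannot overlap, so skip them.
            if test_net.version != prod_net.version:
                continue
            if test_net.overlaps(prod_net):
                conflicts.append((test_prefix, prod_prefix))
    return conflicts


# Sample values, assumed for illustration only.
PRODUCTION_PREFIXES = ["10.0.0.0/16", "192.168.10.0/24"]
TEST_PREFIXES = ["10.0.5.0/24", "172.16.0.0/24"]

if __name__ == "__main__":
    for test_prefix, prod_prefix in find_route_conflicts(
        TEST_PREFIXES, PRODUCTION_PREFIXES
    ):
        print(f"BLOCK: test prefix {test_prefix} overlaps production {prod_prefix}")
```

A check of this kind would run before the test configuration is applied, and a non-empty result would stop the activity for review. It covers only overlapping prefixes; other ways a test could affect production routing would need separate validation.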