Cloud Management Platform- NAM- Slowness in provisioning infrastructure

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Cloud Management Platform- NAM - Slowness in provisioning infrastructure

Timeframe: September 3, 2024, 7:38 PM PDT to September 3, 2024, 7:52 PM PDT

‌

Incident Summary

On September 03, 2024, at 7:38 PM PDT, our monitoring systems alerted us to an issue affecting the Cloud Management Platform in the North American (NAM) region. The affected customers may have experienced slowness or intermittent errors when provisioning infrastructure.

Although the system wasn’t completely down, the disruption resulted in a delay in processing requests.

The incident occurred during a scheduled pre-maintenance activity, where our teams were performing routine checks, including network testing. As part of these checks, a network test was conducted to verify connectivity, but it unknowingly interfered with the production network routes. This caused a temporary loss of connectivity, leading to the platform’s degraded performance.

Upon recognizing the issue, immediate action was taken by our team, and the network configuration changes were immediately reverted. Post reverting the changes the service started to improve and was fully recovered by 7:45 PM PDT.

Once the network was stable and fully operational, our teams conducted extensive monitoring of the platform to ensure its ongoing stability and to confirm that there were no lingering issues. At 7:52 PM PDT the incident was declared resolved, with the platform functioning normally.

‌

Root Cause

The incident was triggered during a scheduled pre-maintenance activity, where network connectivity was being tested as part of routine checks. During the test, network configurations inadvertently interfered with existing network routes in the production environment. This caused a brief loss of connectivity, which resulted in slowness and provisioning errors.

‌

Remediation Actions

· Immediate Rollback: Once the issue was identified, the team reverted the changes to restore the previous network configuration.

· Network Recovery: Although changes were reverted quickly, it took some time for full network functionality to return, during which performance was monitored.

· Post-Incident Monitoring: The platform was closely monitored for an extended period to ensure stability and prevent further issues.

‌

Future Preventative Measures

· Enhanced Network Testing Protocols: We will revise the network testing processes to prevent future interference with production environments.

· Additional Pre-Maintenance Checks: We have now put additional checks in our pre-maintenance preparation procedures to ensure that any network changes are validated to avoid impacting live services.

Posted Sep 12, 2024 - 05:01 PDT

Resolved

This incident has been resolved.

Posted Sep 03, 2024 - 22:35 PDT

Monitoring

This incident was caused by a brief connectivity issue that self-restored . Our teams are currently monitoring the services to ensure ongoing stability.

Posted Sep 03, 2024 - 21:33 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 03, 2024 - 21:09 PDT

Investigating

We have been alerted to a problem affecting our Cloud Management Platform in NAM , APAC & Europe regions. The impacted customers may experience slowness/errors in provisioning infrastructure.

Priority: P3

Restoration activity:
Our technical teams are actively involved and are evaluating the situation. Additionally, we are exploring potential solutions to rectify the issue as quickly as possible.

Posted Sep 03, 2024 - 20:31 PDT

This incident affected: Legacy Cloud Management (Cloud Management Dashboard - Shard 3, Cloud Management Dashboard - Shard 4, Self-Service - Shard 3, Self-Service - Shard 4).