Flexera One - IT Visibility - NA - Service Disruption

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - IT Visibility - NA - Service Disruption

Timeframe: September 18th, 2023, 11:43 AM to 2:04 PM PDT

Incident Summary

On Monday, September 18th, 2023, at 11:43 AM PDT, we experienced an issue with our IT Visibility Platform, potentially impacting customers in North America. As a result, customers may have encountered issues when accessing Flexera One UI Data Dashboards, Data Exports, API access, and ServiceNow integration.

Our preliminary investigation showed that a widespread failure of pods was contributing to the problem. Our technical team also ran health checks against the other regions and confirmed that the problem was isolated to North America. All attempts to restart our internal services failed.
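For illustration, the region-isolation check described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of the tooling used during the incident: the function names, the input format (one pass/fail result per region), and the region names other than North America are hypothetical.

```python
from typing import Mapping


def failing_regions(health: Mapping[str, bool]) -> list[str]:
    """Return the regions whose health check failed, sorted by name.

    `health` maps a region name to True (check passed) or False (check failed).
    This input shape is an assumption made for the example.
    """
    return sorted(region for region, ok in health.items() if not ok)


def is_isolated_to(health: Mapping[str, bool], region: str) -> bool:
    """Return True only if `region` is the one and only failing region."""
    return failing_regions(health) == [region]


if __name__ == "__main__":
    # Hypothetical results: only "NA" comes from the report; "EU" and "APAC"
    # are placeholder region names.
    results = {"NA": False, "EU": True, "APAC": True}
    print(failing_regions(results))
    print(is_isolated_to(results, "NA"))
```

A check of this kind only narrows the scope of a failure. It does not identify the cause, which in this incident was established later through contact with the service provider.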

At 12:11 PM PDT, further investigation indicated a potential network outage affecting our operations, leading our teams to suspect a problem with our service provider. By 12:18 PM PDT, we confirmed that the issue originated from a network problem at the service provider's end.

Upon establishing contact with the service provider, we discovered that they were experiencing an issue affecting the availability of multiple Zones within the US Region, where network mappings were not properly propagated to the underlying hardware.

We closely monitored the service provider's troubleshooting efforts, and by 2:04 PM PDT, they had successfully restored their services tied to our operations. Our team confirmed that all our internal services, including IT Visibility Dashboards and data, were fully operational and accessible to customers without any additional problems.

We decided to keep monitoring the environment for a few more hours while the service provider worked towards complete recovery, so that any recurrence would be detected quickly. After this extended monitoring period, our team verified that our services remained operational without any further issues, and we considered the incident resolved.

Root Cause

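The root cause was a network issue at our service provider: network mappings were not properly propagated to the underlying hardware, which affected the availability of multiple Zones within the US Region.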

Remediation Actions

  1. Preliminary Investigation: Our preliminary investigation showed that a widespread failure of pods was contributing to the problem.
  2. Further Investigation: At 12:11 PM PDT, further investigation indicated a potential network outage affecting our operations.
  3. Contact with Service Provider: We contacted the service provider and learned of an issue on their end affecting the availability of multiple Zones within the US Region.
  4. Service Provider Resolution: The service provider restored their services tied to our operations by 2:04 PM PDT.
  5. Continued Monitoring: We kept monitoring the environment for a few more hours so that any recurrence would be detected quickly.

Future Preventative Measure

Obtain Post-Mortem from the Service Provider: We will obtain a post-mortem report from the service provider to understand the incident's cause and the steps they've taken to prevent future occurrences. If necessary, we will engage in discussions to enhance preventive measures.

Posted Oct 03, 2023 - 18:58 PDT

Resolved

This incident has been resolved.
Posted Sep 18, 2023 - 17:35 PDT

Monitoring

All of our services are presently online and operating stably. We are closely monitoring the situation and awaiting final confirmation from our vendor before officially closing out the incident.
Posted Sep 18, 2023 - 14:08 PDT

Update

Based on the most recent update from our vendor, they are actively resolving network latency and error issues in the US Region Availability Zones. These problems are specifically impacting certain instances in these zones, where network mappings are not being propagated to the underlying hardware.
Posted Sep 18, 2023 - 13:27 PDT

Identified

We have identified this issue to be on the vendor side. We are continuing to experience intermittent service failures, and our technical team is actively monitoring the situation. We apologize for any inconvenience this may have caused.
Posted Sep 18, 2023 - 12:25 PDT

Update

Our technical team suspects that this may be connected to a network disruption impacting one of our vendors. We are conducting a more detailed investigation.
Posted Sep 18, 2023 - 12:21 PDT

Investigating

Incident Description: We are currently experiencing an issue with our IT Visibility Platform, potentially impacting customers in North America. As a result, customers may experience issues when accessing Flexera One UI Data Dashboards, Data Exports, API access, and ServiceNow integration.

Priority: P1

Restoration Activity: Our technical teams are actively addressing the issue. We've observed some disruptions in our systems, and we're currently examining the situation.
Posted Sep 18, 2023 - 12:03 PDT
This incident affected: Flexera One - IT Visibility - North America (IT Visibility US).