Flexera One - Technology Intelligence Platform/IT Visibility - EU Region - Dashboards, Reports, and APIs Unavailable

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - Technology Intelligence Platform/IT Visibility - EU Region – Service Disruption

Timeframe: February 6, 2025, 4:15 AM PST – February 6, 2025, 6:15 AM PST

Incident Summary

On February 6, 2025, at 4:15 AM PST, we experienced a service disruption affecting Flexera One IT Visibility Dashboards, Reports, and Export APIs (including Technology Intelligence) in the EU region. While the platform remained accessible, customers may have encountered difficulties accessing dashboards, generating reports, using API services, and Power BI out-of-the-box (OOTB) reports.

The issue was traced to a failure in connecting to an external service provider’s infrastructure, which led to initialization failures for critical services. As new nodes were deployed, they were unable to establish a connection to the provider’s services due to an issue on their end. This disruption prevented services from coming online, leading to intermittent unavailability of several key services.

Upon detecting the issue at 4:32 AM PST, our technical team swiftly responded by rotating the affected nodes and manually intervening to restore the impacted services. This action intermittently restored service at 4:56 AM PST. At 5:23 AM PST, the service provider confirmed they had identified the issue on their end, related to difficulties in managing their infrastructure.

By 6:00 AM PST, our technical team took direct action to alter the service configuration for IT Visibility, scaling out a number of services to prevent further requests to the external service handling proxy and optimize resource usage. At 6:15 AM PST, core services, including critical components, were manually scaled in to restore full functionality, effectively ending the customer-impacting element of the incident.

Shortly after, the service provider confirmed full resolution, ensuring error rates returned to normal after deploying a configuration change to address the connectivity issue.

Root Cause

Primary Root Cause:

External Service Provider Outage: The issue arose due to a problem at the service provider's end, which prevented successful connections to their infrastructure. This disruption impacted the initialization of critical services required for normal platform operation.

Contributing Factors:

Pod Rotation Issues: As new nodes were deployed, these services were unable to connect to the external service provider’s infrastructure, leading to service initialization failures.
Increased Request Load: Automated retries and health check failures caused a significant volume of requests to the service provider, further delaying resolution.

Remediation Actions

Service Restoration: The team rotated the affected nodes, restored connectivity to the service provider, and manually adjusted configurations to bring services back online.
Service Configuration and Scaling Adjustments: At 6:00 AM PST, our technical team took action to scale services and reduce unnecessary load on the external provider’s infrastructure, ensuring improved performance and stability. By 6:15 AM PST, core services were manually scaled in to restore full functionality, effectively stabilizing the environment and ending the customer-impacting element of the incident.
Service Provider Confirmation: Shortly after, the service provider confirmed that their issue had been fully resolved, and error rates returned to normal after they deployed a configuration change to address the connectivity issue.

Future Preventative Measures

Local Configuration Mirroring: To prevent similar issues, we plan to mirror critical configurations locally, reducing dependency on external providers and mitigating service disruption risks.
Infrastructure Enhancements: Critical components will be deployed across multiple nodes to maintain availability, ensuring that services remain accessible even if one node fails.
Runbook Development: We are refining runbooks based on this incident to improve our incident response processes.
Post-Incident Review: A post-mortem analysis will be conducted to assess the steps taken, identify areas for improvement, and refine our operational procedures moving forward.

Posted Feb 18, 2025 - 15:03 PST

Resolved

Our technical team has taken action to bring services back online, and our third-party service provider has addressed the underlying issue. Services have returned to a stable state.

We will continue to monitor to ensure ongoing stability. This incident is now resolved.

Posted Feb 06, 2025 - 06:50 PST

Investigating

Incident Description: We are currently experiencing an issue affecting Flexera One IT Visibility Dashboards, Reports, and Export APIs (including Technology Intelligence) in the EU region. While the platform remains accessible, customers may encounter difficulties accessing dashboards, generating reports, and using API services. Power BI out-of-the-box (OOTB) reports may also be affected.

Priority: P2

Restoration Activity: The issue has been identified as originating from a third-party service provider, which is currently experiencing an outage. The service provider is actively investigating and working to resolve the issue. We are closely monitoring their progress and will provide updates as more information becomes available.

Posted Feb 06, 2025 - 05:46 PST

This incident affected: Flexera One - IT Visibility - Europe (IT Visibility EU).