Software Vulnerability Research – Service Disruption

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Software Vulnerability Research – Service Disruption

Timeframe: March 28, 2025, 1:26 AM PST to March 28, 2025, 3:26 AM PST

Incident Summary

On March 28th, 2025, at 1:26 AM PST, we detected an issue affecting automated access to vulnerability data within the Software Vulnerability Research (SVR) platform. A subset of customers using integrations or scheduled processes to retrieve SVR data may have experienced interruptions. The SVR web application remained fully accessible initially, and customers manually accessing data through the interface were not impacted.

As the incident progressed, the SVR web application experienced degraded performance, followed by a brief period of inaccessibility. Technical teams identified instability in the SVR API infrastructure, with underlying services restarting at regular intervals. The impacted API services were isolated, and gateway behavior was monitored closely while the investigation was underway.

Further analysis identified a newly provisioned authentication server as the failing component. This server was experiencing degraded performance, which led to intermittent authentication failures and caused cascading crashes across dependent services. To address the issue, the failing authentication server instances were fully replaced, which led to partial recovery. However, stale cached connections continued to cause instability. The SVR application was then redeployed in full to clear these connections and stabilize the environment.

By 2:45 AM PST, the issue was resolved, and normal functionality was restored. After a period of stability monitoring, the incident was officially closed at 3:26 AM PST.

Root Cause

Upon investigation, the root cause was traced to a newly created authentication server that exhibited degraded performance, leading to API authentication failures and eventual server crashes. These failures triggered cascading disruptions of dependent services within the SVR API infrastructure.
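The cascading pattern described above, where a slow or failing authentication dependency takes down the services that call it, is commonly contained with a circuit breaker. The following is a minimal sketch of that general technique, not a description of how the SVR API infrastructure is implemented. The names `CircuitBreaker` and `CircuitOpenError`, the threshold, and the cool-down period are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Fails fast after repeated errors instead of letting callers pile up."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical value
        self.reset_after = reset_after              # cool-down in seconds (hypothetical)
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # Success: close the breaker and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A dependent service would wrap each authentication request in `breaker.call(...)` and treat `CircuitOpenError` as a fast, recoverable rejection rather than waiting on a degraded server.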

Remediation Actions

Initial Response:

·        Recreated API server instances to attempt resolution.

·        Isolated affected services and closely monitored gateway behavior.

Root Cause Identification:

·        Investigated system logs, API behavior, and network connectivity.

·        Identified the newly provisioned authentication server as the critical failing dependency.

Infrastructure Recovery:

·        Fully replaced authentication server instances.

·        Observed partial recovery post-replacement.

Full Application Redeployment:

·        Redeployed the SVR application to clear cached connections and stabilize the platform.

Incident Closure:

·        Monitored the environment for stability and declared the issue resolved at 3:26 AM PST.
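The full redeployment was needed because cached connections outlived the authentication servers they pointed to. One general way to limit that exposure is to give cached connections a maximum age and an explicit invalidation hook. The sketch below illustrates that idea only. The `ConnectionCache` class, the `connect` factory, and the `max_age` value are hypothetical and are not taken from the SVR platform.

```python
import time


class ConnectionCache:
    """Caches one connection per backend and discards entries that are too old."""

    def __init__(self, connect, max_age=60.0, clock=time.monotonic):
        self._connect = connect   # hypothetical factory: backend name -> connection
        self._max_age = max_age   # seconds before a cached connection is considered stale
        self._clock = clock
        self._entries = {}        # backend name -> (connection, created_at)

    def get(self, backend):
        entry = self._entries.get(backend)
        if entry is not None:
            conn, created_at = entry
            if self._clock() - created_at < self._max_age:
                return conn
            self.invalidate(backend)  # too old: drop it and reconnect below
        conn = self._connect(backend)
        self._entries[backend] = (conn, self._clock())
        return conn

    def invalidate(self, backend):
        """Drop the cached connection, for example after the backend is replaced."""
        entry = self._entries.pop(backend, None)
        if entry is not None:
            close = getattr(entry[0], "close", None)
            if callable(close):
                close()
```

With a hook like `invalidate`, replacing a backend instance could be followed by a targeted cache flush instead of a redeployment of the whole application.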

Future Preventative Measures

·        Enhanced Monitoring: Implement direct monitoring and alerting for server health at the application level.

·        Dependency Health Checks: Add backend service health checks to detect and isolate unstable dependencies earlier.

·        Proactive Automation: Improve automation to trigger alerts and recovery workflows in the event of service degradation.

·        Monitoring Coverage Review: Review and expand monitoring across all critical components to ensure visibility into service-level dependencies.
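The dependency health checks and alerting described in the measures above can be sketched as a small poller that counts consecutive failures per dependency and raises one alert when a threshold is crossed. This is an illustrative sketch under stated assumptions, not the monitoring that will be deployed. The `DependencyMonitor` class, the check callables, the `alert` callback, and the threshold are all hypothetical.

```python
from typing import Callable, Dict, Set


class DependencyMonitor:
    """Tracks consecutive health-check failures and alerts once per outage."""

    def __init__(
        self,
        checks: Dict[str, Callable[[], bool]],  # hypothetical: name -> health probe
        alert: Callable[[str], None],           # hypothetical alerting hook
        threshold: int = 3,
    ):
        self.checks = checks
        self.alert = alert
        self.threshold = threshold
        self.failures = {name: 0 for name in checks}
        self.unhealthy: Set[str] = set()

    def run_once(self) -> Set[str]:
        """Run every check once and return the names currently marked unhealthy."""
        for name, check in self.checks.items():
            try:
                ok = bool(check())
            except Exception:
                ok = False  # a probe that raises counts as a failed check
            if ok:
                self.failures[name] = 0
                self.unhealthy.discard(name)
                continue
            self.failures[name] += 1
            if self.failures[name] >= self.threshold and name not in self.unhealthy:
                self.unhealthy.add(name)
                self.alert(name)  # fire once when the dependency becomes unhealthy
        return set(self.unhealthy)
```

Calling `run_once()` on a fixed interval and routing traffic away from any name in the returned set is one way to isolate an unstable dependency before its failures spread to dependent services.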

Posted Apr 09, 2025 - 01:36 PDT

Resolved

The SVR platform has remained stable across both the web application and automated access during the extended monitoring period. No further issues have been observed, and technical teams have confirmed that all services are functioning as expected. This incident has been resolved.
Posted Mar 28, 2025 - 03:29 PDT

Monitoring

The redeployment of the application has been completed. Initial sanity checks have passed, and the platform is currently functioning as expected. The incident has now moved into the monitoring phase to ensure continued stability across both the web application and automated access.
Posted Mar 28, 2025 - 02:59 PDT

Update

We are currently redeploying the Software Vulnerability Research (SVR) application to address ongoing service instability. As part of this process, a brief interruption is expected across the platform, affecting both the user interface and automated access to vulnerability data.

Our teams are actively working to restore full functionality and are closely monitoring system behavior during this transition. We will provide further updates as progress continues.
Posted Mar 28, 2025 - 02:46 PDT

Investigating

The API servers became unresponsive again after being reintroduced, and the issue has resurfaced. In addition, the SVR web application is now experiencing degraded performance.

We are actively investigating both the recurrence of the automated access issue and the new impact to the application. Monitoring and diagnostics are ongoing.
Posted Mar 28, 2025 - 02:27 PDT

Monitoring

The services supporting automated access to vulnerability data have been brought back online and are now responding as expected. Users are able to authenticate successfully, and we are closely monitoring performance and stability at this time. Further updates will be provided as needed.
Posted Mar 28, 2025 - 02:15 PDT

Update

Our teams have removed a backend component contributing to the issue and have temporarily isolated the services supporting automated access to vulnerability data. We are actively assessing the situation and working toward recovery. Further updates will be shared as progress continues.
Posted Mar 28, 2025 - 02:06 PDT

Investigating

Incident Description: We are currently investigating an issue affecting automated access to vulnerability data within the Software Vulnerability Research (SVR) platform. A subset of customers using integrations or scheduled processes to retrieve SVR data may be experiencing interruptions.

Note: This incident was initially classified as P2 during the early stages of investigation and troubleshooting, when the impact appeared to be limited to automated access only. As the situation progressed and broader platform instability was observed, including issues affecting the SVR web application, the priority was elevated to P1 to reflect the full scope of impact.

Priority: P1

Restoration Activity: Our technical teams have identified instability in the underlying API services that support automated access. Restoration efforts are underway, including isolating affected components and monitoring service behavior. Further updates will be provided as progress continues.
Posted Mar 28, 2025 - 01:25 PDT
This incident affected: Software Vulnerability Research.