Software Vulnerability Research - Service Disruption
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Software Vulnerability Research - Service Disruption

Timeframe: March 21, 2024, 1:18 PM PDT to March 21, 2024, 2:53 PM PDT

Incident Summary

On Thursday, March 21, 2024, at 1:18 PM PDT, we identified an issue with our Software Vulnerability Research platform that disrupted access for some customers in the EU-West region.

Upon detection of the issue, our technical teams immediately initiated an investigation to identify and resolve the root cause.

Further investigation revealed that the issue stemmed from a recent network change.

At 2:52 PM PDT, our technical teams reverted this change, restoring normal service for affected customers in the EU-West region. They then ran health checks, confirmed that the service was operating normally, and declared the incident resolved at 2:53 PM PDT.

Root Cause

The disruption to Software Vulnerability Research services was caused by a recent network change made in the staging environment. The change unexpectedly affected certain production components and disrupted traffic between components, causing intermittent outages and, eventually, service downtime. Specifically, the dual-path connectivity for server zones introduced conflicts and congestion within the network.

Remediation Actions

  • Immediate Investigation: Technical teams began investigating as soon as the issue was detected.
  • Root Cause Identification: Determined that the issue was caused by a recent network change made in the staging environment that unexpectedly affected production components.
  • Change Reversion: Reverted the problematic network change at 2:52 PM PDT to restore normal service operations.
  • Health Checks and Verification: Ran health checks after the reversion to confirm that the service had resumed normal operations.

Future Preventative Measures

  • Perform Health Checks Across Other Regions: Engaged network and system teams to conduct comprehensive health checks across all regions and environments.
  • Review Network Design: Initiated a comprehensive review of the network design, focusing on areas where similar flaws could exist. Our technical teams confirmed that the affected design is used only in the impacted region; other regions do not share it.
  • Review and Strengthen Change Management Practices: We are evaluating our existing change management practices and policies to identify gaps. This includes implementing stricter controls and validation checks so that every change is thoroughly reviewed before implementation.
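
As an illustration of the kind of validation check described in the last item, the sketch below flags a change that is declared for one environment but touches components belonging to another, which is the pattern described in the Root Cause section. It is a minimal, hypothetical example and does not represent our actual tooling: the NetworkChange structure, the find_scope_violations function, the inventory, and all component names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkChange:
    """A proposed network change (hypothetical structure, for illustration only)."""
    change_id: str
    declared_environment: str            # e.g. "staging"
    affected_components: tuple[str, ...]


def find_scope_violations(
    change: NetworkChange,
    component_environments: dict[str, str],
) -> list[str]:
    """Return the components a change touches outside its declared environment.

    Components missing from the inventory are also reported, because their
    environment cannot be confirmed.
    """
    violations = []
    for component in change.affected_components:
        environment = component_environments.get(component)
        if environment != change.declared_environment:
            violations.append(component)
    return violations


# Hypothetical inventory mapping component names to their environment.
INVENTORY = {
    "stg-router-1": "staging",
    "stg-zone-a": "staging",
    "prod-zone-a": "production",
}

# Hypothetical change declared as staging-only that also touches production.
change = NetworkChange(
    change_id="CHG-EXAMPLE",
    declared_environment="staging",
    affected_components=("stg-router-1", "prod-zone-a"),
)

violations = find_scope_violations(change, INVENTORY)
if violations:
    print(
        f"Block {change.change_id}: touches components outside "
        f"{change.declared_environment}: {violations}"
    )
```

A check of this shape would typically run before a change is approved. Treating unknown components as violations is a deliberately conservative choice in this sketch, since an incomplete inventory is itself a reason to pause a change.
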
Posted Apr 29, 2024 - 00:21 PDT

Resolved
Incident Description: We encountered an outage affecting our Software Vulnerability Research platform, which may have disrupted access for some customers during this period.

Priority: P1

Impact Start: March 21, 2024, 1:20 PM PDT
Impact End: March 21, 2024, 2:55 PM PDT

Restoration Activity: Our technical team identified and resolved the issue, restoring the service to its normal state. Comprehensive health checks have since verified the system's stability.
Posted Mar 21, 2024 - 16:21 PDT
This incident affected: Software Vulnerability Research.