Description: Software Vulnerability Research – Service Disruption
Timeframe: March 28, 2025, 1:26 AM to 3:26 AM PST
Incident Summary
On March 28, 2025, at 1:26 AM PST, we detected an issue affecting automated access to vulnerability data within the Software Vulnerability Research (SVR) platform. A subset of customers using integrations or scheduled processes to retrieve SVR data may have experienced interruptions. Initially, the SVR web application remained fully accessible, and customers accessing data manually through the interface were not impacted.
As the incident progressed, the SVR web application experienced degraded performance, followed by a brief period of inaccessibility. Technical teams identified instability in the SVR API infrastructure, with underlying services restarting at regular intervals. The impacted API services were isolated, and gateway behavior was monitored closely while the investigation was underway.
Further analysis identified a newly provisioned authentication server as the failing component. This server was experiencing degraded performance, causing intermittent authentication failures and cascading crashes across dependent services. To address the issue, the failing authentication server instances were fully replaced, which resulted in partial recovery; however, stale cached connections continued to cause instability. The SVR application was then redeployed in full to clear these connections and stabilize the environment.
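For illustration only, the following minimal sketch (not the actual SVR implementation; all names and parameters are hypothetical) shows how a client-side connection pool can keep handing out cached connections that still target a replaced backend, which is why a full redeployment was needed to clear them.

    # Hypothetical sketch: a client-side pool that caches connections.
    # After the backend is replaced, cached entries keep pointing at the
    # old (now failing) instance until the pool is recycled -- which is
    # what a full application redeployment does implicitly.
    import time

    class CachedConnection:
        def __init__(self, endpoint):
            self.endpoint = endpoint          # resolved once, at creation time
            self.created_at = time.time()

        def is_stale(self, max_age_seconds=300):
            # Without an age or health check, a cached connection is
            # reused indefinitely, even against a retired backend.
            return time.time() - self.created_at > max_age_seconds

    class ConnectionPool:
        def __init__(self, endpoint_resolver):
            self._resolver = endpoint_resolver  # callable returning the current endpoint
            self._cache = []

        def acquire(self):
            # Reuse a cached connection if one exists; otherwise resolve
            # the current backend endpoint and create a fresh connection.
            while self._cache:
                conn = self._cache.pop()
                if not conn.is_stale():
                    return conn               # may still target the old host
            return CachedConnection(self._resolver())

        def release(self, conn):
            self._cache.append(conn)

        def flush(self):
            # Discard every cached connection at once (the effect a
            # redeployment has on in-process caches).
            self._cache.clear()

In a sketch like this, calling flush() after the backend is replaced, or enforcing an age limit such as the is_stale check, would remove stale connections without requiring a full redeployment.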
By 2:45 AM PST, the issue was resolved, and normal functionality was restored. After a period of stability monitoring, the incident was officially closed at 3:26 AM PST.
Root Cause
Upon investigation, the root cause was traced to a newly created authentication server that exhibited degraded performance, leading to API authentication failures and eventual server crashes. These failures triggered cascading disruptions of dependent services within the SVR API infrastructure.
Remediation Actions
Initial Response:
· Recreated API server instances in an initial attempt to restore service.
· Isolated affected services and closely monitored gateway behavior.
Root Cause Identification:
· Investigated system logs, API behavior, and network connectivity.
· Identified the newly provisioned authentication server as the critical failing dependency.
Infrastructure Recovery:
· Fully replaced authentication server instances.
· Observed partial recovery post-replacement.
Full Application Redeployment:
· Redeployed the SVR application to clear cached connections and stabilize the platform.
Incident Closure:
· Monitored the environment for stability and declared the issue resolved at 3:26 AM PST.
Future Preventative Measures
· Enhanced Monitoring: Implement direct monitoring and alerting for server health at the application level.
· Dependency Health Checks: Add backend service health checks to detect and isolate unstable dependencies earlier (an illustrative sketch follows this list).
· Proactive Automation: Improve automation so that alerts and recovery workflows are triggered as soon as service degradation is detected.
· Monitoring Coverage Review: Review and expand monitoring across all critical components to ensure visibility into service-level dependencies.
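As a concrete illustration of the dependency health check and proactive automation items above, the sketch below polls a backend dependency (such as the authentication service involved in this incident) and raises an alert after repeated consecutive failures. The endpoint URL, thresholds, and alerting hook are assumptions for illustration, not the SVR platform's actual tooling.

    # Hypothetical dependency health check: poll a backend service and
    # trigger an alert / recovery hook after repeated failures.
    # The URL, thresholds, and alert mechanism are illustrative only.
    import time
    import urllib.request
    import urllib.error

    AUTH_HEALTH_URL = "http://auth.internal/healthz"   # assumed endpoint
    FAILURE_THRESHOLD = 3                              # consecutive failures before alerting
    CHECK_INTERVAL_SECONDS = 30

    def check_once(url, timeout=5):
        """Return True if the dependency answers with HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def alert(message):
        # Placeholder for paging / incident-automation integration.
        print(f"[ALERT] {message}")

    def monitor(url=AUTH_HEALTH_URL):
        consecutive_failures = 0
        while True:
            if check_once(url):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= FAILURE_THRESHOLD:
                    # Surface the unstable dependency and start recovery
                    # workflows before failures cascade to dependent services.
                    alert(f"{url} failed {consecutive_failures} consecutive checks")
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        monitor()

In practice, logic of this kind would typically live in existing monitoring tooling rather than a standalone script; the point is that the authentication dependency is probed directly and an alert fires before failures cascade to dependent services.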