Description: Software Vulnerability Research - All regions - Intermittent login issues
Timeframe: June 7, 2025, 07:57 PM PST to June 8, 2025, 12:50 AM PST
Incident Summary
On Saturday, June 7, 2025, at 07:57 PM PST, our teams detected an issue impacting the Software Vulnerability Research (SVR) application in all regions. Although the application remained operational, some users intermittently experienced issues while attempting to log in. Initial diagnostics indicated that endpoint health checks were passing and application pages were auto-resolving as expected. However, the application itself was unresponsive in several instances.
Our technical teams initiated an investigation and identified that the unresponsiveness was likely due to memory utilization reaching critical levels on the underlying messaging service, leading to system instability.
In collaboration with our cloud service provider, we determined that the messaging service was overloaded by an excessive buildup of stale and active connections. This overload caused communication delays within the system, which degraded the application's responsiveness and user experience.
To mitigate the issue, our technical teams created and executed a shutdown script to stop active connections and clear stale sessions from the messaging service. After monitoring the application and confirming that stability had been restored, the incident was declared resolved on June 8, 2025, at 12:50 AM PST.
Root Cause
The root cause of the incident was an overloaded messaging service, which had accumulated a large number of stale and active connections. This overload hindered internal message flow and degraded application performance, despite endpoint health checks returning successful results.
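The gap described above, where endpoint health checks pass while internal message flow is degraded, can be illustrated with a probe that exercises the message path itself. The following is a minimal sketch under stated assumptions, not the check used in production: `message_path_healthy`, `make_loopback`, and the `publish`/`receive` callables are hypothetical names, and an in-memory queue stands in for the real messaging service, which this report does not name.

```python
import queue
import time
import uuid


def message_path_healthy(publish, receive, timeout_s=2.0):
    """Publish a unique probe and report whether it returns within timeout_s.

    publish(msg) sends a message; receive(timeout) returns the next message,
    or None if nothing arrives in time. Both are hypothetical adapters around
    whatever messaging client is actually in use.
    """
    probe = uuid.uuid4().hex
    deadline = time.monotonic() + timeout_s
    publish(probe)
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # probe did not come back before the deadline
        msg = receive(remaining)
        if msg is None:
            return False  # receive timed out
        if msg == probe:
            return True  # round trip completed
        # Any other message is ignored and the wait continues.


def make_loopback():
    """In-memory stand-in for a messaging service (assumption, demo only)."""
    q = queue.Queue()

    def publish(msg):
        q.put(msg)

    def receive(timeout):
        try:
            return q.get(timeout=timeout)
        except queue.Empty:
            return None

    return publish, receive


if __name__ == "__main__":
    publish, receive = make_loopback()
    print("healthy:", message_path_healthy(publish, receive, timeout_s=0.5))
```

A check of this shape fails when messages stop flowing, even if the HTTP endpoint still answers, which is the condition the endpoint health checks did not surface during this incident.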
Remediation Actions
· Our technical teams developed and executed a shutdown script to stop active connections and clear stale sessions from the messaging service.
· Monitored the application post-remediation to ensure stable functionality.
· Declared the incident resolved after sustained stability and no recurrence.
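The selection step of such a cleanup can be sketched as follows. This is an illustrative example only and not the script that was executed: the `Connection` record, the `find_stale` helper, and the 30-minute idle threshold are all assumptions, and a real messaging service would expose connection metadata through its own management interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Connection:
    # Hypothetical record; real services expose different fields.
    conn_id: str
    last_activity: datetime


def find_stale(connections, now, max_idle=timedelta(minutes=30)):
    """Return the connections that have been idle for longer than max_idle."""
    return [c for c in connections if now - c.last_activity > max_idle]


if __name__ == "__main__":
    now = datetime(2025, 6, 8, 4, 0, tzinfo=timezone.utc)
    conns = [
        Connection("a", now - timedelta(minutes=5)),
        Connection("b", now - timedelta(hours=2)),
    ]
    # Dry run: report what would be closed instead of closing anything.
    for c in find_stale(conns, now):
        print(f"would close {c.conn_id}")
```

Keeping the selection separate from the close action allows a dry run to be reviewed before any connections are terminated.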
Future Preventative Measures
· New Messaging Service Instance - Provisioned a new messaging service instance with optimized configuration settings.
· Traffic Redirection - Redirected all application traffic to the new instance to ensure consistent performance.
· Old Instance Isolation - Retained the previous messaging service instance in isolation for further investigation.
· Provider Postmortem Engagement - Engaged with the cloud provider for a detailed analysis of the root cause and long-term prevention strategies.