Description: Flexera One UI - NA – Platform Experienced Performance Delays
Timeframe: April 18th, 2024, 1:11 PM to April 18th, 2024, 3:36 PM PDT
Incident Summary
On April 18th at 2:29 PM PDT, the major incident management team received an alert about an issue affecting the Flexera One Platform in the NAM region. The issue caused noticeable performance degradation in the UI and slower response times for users.
We promptly began an investigation and scaled up resources to mitigate the issue. By 3:01 PM PDT, performance had returned to normal levels. However, the underlying cause of the increased resource demand had not yet been identified.
At 3:32 PM PDT, log analysis revealed a significant surge in requests to one of the critical services, which placed a heavy load on system resources. Further analysis attributed the surge primarily to backend tasks associated with child policies generated from meta parent policies. The influx of requests degraded the critical service, with downstream impact on its associated resources.
At 3:36 PM PDT, our technical teams confirmed that the affected service had returned to its normal state, and we scaled the resources back to their usual levels. A further round of health checks then validated sustained stability, marking the resolution of the incident.
Root Cause
A surge in backend tasks associated with child policies generated from meta parent policies significantly increased requests to one of the critical services. The added load strained system resources and degraded the critical service.
Remediation Actions
Future Preventative Measures