Description: Flexera One - IT Asset Management - EU - Web UI slowness
Timeframe: February 3, 2025, at 12:04 AM PST to February 3, 2025, at 10:10 AM PST
Incident Summary
On February 3, 2025, at 12:04 AM PST, we received multiple reports from customers in the EU region who were experiencing slowness and errors in the Flexera One IT Asset Management (ITAM) Web UI caused by a high number of open or blocked database sessions. Our technical teams were promptly engaged to investigate and remediate the issue. During the investigation, they found that a database server was failing to retrieve stored procedures efficiently, resulting in cross-database locking and degraded UI performance. To mitigate the issue, a failover was performed, which significantly improved performance by 4:42 AM PST; the UI loaded properly even though session counts remained elevated. Extended monitoring confirmed that cross-database blocking had ceased after the failover, and the incident was declared resolved at 10:10 AM PST. The elevated session counts were attributed to backlogged resolver processing, and no underlying product issue was identified.
Root Cause
The root cause of the incident was identified as a database performance bottleneck triggered by:
· Stored Procedure Retrieval Failure: The database server failed to retrieve stored procedures efficiently, which generated frequent repeated requests.
· System Procedure Blocking: The repeated requests caused a system procedure to become blocked.
· Cross-Database Locking: The blocked system procedure produced cross-database locking, which significantly degraded ITAM Web UI performance and caused the slowness and errors customers reported (a detection sketch follows at the end of this section).
Contributing Factors: Backlogged resolver processing kept session counts elevated after the failover; no underlying product issue was identified.
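The cross-database locking described above was diagnosed from live blocking chains on the database server. As a hedged illustration only, the following Python sketch shows one way such chains can be enumerated, assuming a Microsoft SQL Server backend, the pyodbc driver, and a hypothetical connection string; it is not the exact tooling used during the incident.

    import pyodbc

    # Hypothetical connection string; the server name and authentication are assumptions.
    CONN_STR = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=itam-eu-db.example.internal;"
        "DATABASE=master;Trusted_Connection=yes;"
    )

    # Standard SQL Server DMVs: any request whose blocking_session_id is non-zero
    # is waiting on another session; blocked and blocking sessions working in
    # different databases indicate cross-database blocking like the pattern seen here.
    BLOCKING_QUERY = """
    SELECT  r.session_id,
            r.blocking_session_id,
            DB_NAME(r.database_id) AS database_name,
            r.wait_type,
            r.wait_time,
            s.program_name
    FROM    sys.dm_exec_requests AS r
    JOIN    sys.dm_exec_sessions AS s ON s.session_id = r.session_id
    WHERE   r.blocking_session_id <> 0
    ORDER BY r.wait_time DESC;
    """

    def list_blocked_sessions(conn_str: str = CONN_STR) -> list:
        """Return the currently blocked requests, longest-waiting first."""
        with pyodbc.connect(conn_str, timeout=10) as conn:
            return conn.cursor().execute(BLOCKING_QUERY).fetchall()

    if __name__ == "__main__":
        for row in list_blocked_sessions():
            print(row)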
Remediation Actions
· Initial Investigation and Session Termination: The team began investigating the high session count and terminated the sessions causing the most blocking; however, new blocking sessions continued to emerge.
· Failover: As a decisive remediation step, a failover was performed. This action significantly improved the system's responsiveness and reduced the impact on customers.
· Post-Failover Monitoring: The team closely monitored the system after the failover (a simplified monitoring sketch follows this list). While the session count remained elevated, it stabilized, and no further cross-database blocking was observed.
· Extended Monitoring and Restoration Declaration: After extended monitoring confirmed the system's stability and the resolution of the performance issue, the incident was declared restored at 10:10 AM PST.
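For context, here is a minimal sketch of the kind of post-failover check described above: it polls the user-session count and the blocked-request count and flags any renewed blocking. The poll interval and connection string are illustrative assumptions rather than the team's actual tooling, and a SQL Server backend with pyodbc is assumed as in the earlier sketch.

    import time
    import pyodbc

    SESSION_COUNT_SQL = (
        "SELECT COUNT(*) FROM sys.dm_exec_sessions WHERE is_user_process = 1;"
    )
    BLOCKED_COUNT_SQL = (
        "SELECT COUNT(*) FROM sys.dm_exec_requests WHERE blocking_session_id <> 0;"
    )

    def watch(conn_str: str, interval_s: int = 60) -> None:
        """Print one status line per interval; flag any renewed blocking."""
        with pyodbc.connect(conn_str, timeout=10) as conn:
            cur = conn.cursor()
            while True:
                sessions = cur.execute(SESSION_COUNT_SQL).fetchval()
                blocked = cur.execute(BLOCKED_COUNT_SQL).fetchval()
                status = "OK" if blocked == 0 else "BLOCKING DETECTED"
                print(f"sessions={sessions} blocked={blocked} status={status}")
                time.sleep(interval_s)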
Future Preventative Measures
· Enhanced Monitoring: Implement more granular monitoring of blocked sessions and blocking chains to proactively detect similar issues (see the alerting sketch after this list).
· Capacity Planning and Scalability: Review and update capacity planning for the database cluster to ensure it can handle unexpected spikes in activity.
· Server Performance Optimization: Conduct a thorough review of database server performance metrics and identify potential bottlenecks.
· Resolver Processing Review: Investigate the backlogged resolver processing that caused elevated session counts after the failover and implement improvements to avoid future backlogs.
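As one possible shape for the enhanced monitoring item above, the sketch below alerts only when blocking persists across several consecutive polls, so short-lived blocking that clears on its own does not generate noise. The webhook URL, thresholds, and interval are hypothetical, and the same SQL Server/pyodbc assumptions apply as in the earlier sketches.

    import time
    import pyodbc
    import requests

    # Hypothetical alerting endpoint; replace with the real monitoring integration.
    ALERT_WEBHOOK = "https://alerts.example.internal/hook"

    BLOCKED_COUNT_SQL = (
        "SELECT COUNT(*) FROM sys.dm_exec_requests WHERE blocking_session_id <> 0;"
    )

    def monitor(conn_str: str, threshold: int = 5, strikes: int = 3,
                interval_s: int = 30) -> None:
        """Alert only after `strikes` consecutive polls exceed `threshold`."""
        consecutive = 0
        with pyodbc.connect(conn_str, timeout=10) as conn:
            cur = conn.cursor()
            while True:
                blocked = cur.execute(BLOCKED_COUNT_SQL).fetchval()
                consecutive = consecutive + 1 if blocked > threshold else 0
                if consecutive >= strikes:
                    requests.post(
                        ALERT_WEBHOOK,
                        json={
                            "summary": "Sustained session blocking on ITAM EU database",
                            "blocked_sessions": blocked,
                        },
                        timeout=5,
                    )
                    consecutive = 0
                time.sleep(interval_s)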