Flexera One - IT Asset Management - NA - Slow Load Times and Errors

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One – IT Asset Management (ITAM) – NA – Slow Load Times and Errors

Timeframe: September 4th, 2024, 11:00 AM to September 4th, 2024, 12:30 PM PDT

Incident Summary

On Wednesday, September 4th, 2024, at 11:00 AM PDT, we experienced an issue affecting the IT Asset Management (ITAM) platform in the NA region. Customers may have encountered slow load times and errors when trying to access or perform any functions within ITAM. These issues were primarily reported by users in the NA region, with no impact observed in the EU or APAC regions.

As part of an automated process designed to optimize system performance during peak load times, a new instance was deployed in the NA region. However, this instance inadvertently contributed to traffic misrouting within the environment, leading to degraded performance and errors. Our technical team promptly removed the problematic instance, but this did not resolve the issue.

Further investigation revealed that a scheduled maintenance activity in another region had unintentionally affected the traffic configuration for the NA environment, which caused the continued slow load times and errors.

By 12:30 PM PDT, the traffic routing was corrected, restoring normal performance. After additional validation and internal health checks, multiple customers confirmed that the ITAM platform was functioning as expected. The incident was officially declared resolved at 12:46 PM PDT.

Root Cause

Primary Root Cause

The disruption was caused by an internal process error during scheduled maintenance, which inadvertently affected traffic routing in the NA region and resulted in slow load times and errors.

Contributing Factors

• Cross-Region Impact: The process targeting another environment unintentionally impacted the NA region.
• Limited Region-Specific Testing: Testing was conducted in a single region, which did not catch the issue before it impacted the NA environment.

Remediation Actions

• Traffic Configuration Correction: The misconfiguration in the traffic routing was identified and corrected, ensuring that traffic was properly routed to the NA environment without further errors.
• Health Checks and Extended Monitoring: Health checks were conducted immediately after the fix, followed by extended monitoring to ensure platform stability.
• Customer Communication and Verification: We reached out to affected customers to verify that the ITAM platform was fully operational and performing as expected, with no residual impact.
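
The report does not describe how the health checks or the extended monitoring were implemented. As a minimal sketch of the general idea, assuming each check yields a simple healthy/unhealthy result, a platform could be treated as stable only after an unbroken run of healthy checks. The function name is_stable and the default threshold of 5 are hypothetical and not taken from the incident.

```python
from typing import Sequence


def is_stable(results: Sequence[bool], required_consecutive: int = 5) -> bool:
    """Return True only if the most recent checks were all healthy.

    `results` is an ordered history of health-check outcomes, oldest first.
    `required_consecutive` is an assumed threshold, not a value from the report.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(results) < required_consecutive:
        # Not enough history yet to call the platform stable.
        return False
    # Only the tail of the history matters: one failure resets the wait.
    return all(results[-required_consecutive:])
```

Requiring a run of healthy results, rather than a single passing check, reflects why monitoring continues after a fix: one good response does not rule out intermittent errors.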

Future Preventative Measures

In response to this incident, a thorough Root Cause Analysis (RCA) was conducted with all relevant teams to identify areas for improvement. The discussions emphasized the importance of strengthening region-specific testing and improving our operational processes to avoid similar issues in the future. Based on these findings, we have outlined the following preventative measures to enhance system performance, reduce cross-region impacts, and ensure more rigorous testing and validation before deploying changes.

• Region-Targeting Methodology Review: All region-specific jobs and processes are now explicitly defined and properly validated, helping to prevent cross-region impacts during maintenance or updates.
• Enhanced UAT Testing: We will implement more extensive region-specific testing in user acceptance testing (UAT) before deploying to production, including simulation tests that replicate the live environment, to validate that the platform performs consistently across all production regions.
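
To illustrate what explicit region targeting with validation can look like, the following is a minimal sketch and not a description of Flexera's tooling. It assumes each maintenance job declares the one region it is meant to change and lists the regions its changes would actually touch; the MaintenanceJob type, the validate_region_targets function, and the region labels are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed region labels; the report names only NA, EU, and APAC.
KNOWN_REGIONS = {"NA", "EU", "APAC"}


@dataclass(frozen=True)
class MaintenanceJob:
    name: str
    declared_region: str               # the region the job is meant to change
    target_regions: frozenset[str]     # regions its changes would actually touch


def validate_region_targets(job: MaintenanceJob) -> list[str]:
    """Return a list of problems; an empty list means the job passes the check."""
    problems = []
    if job.declared_region not in KNOWN_REGIONS:
        problems.append(f"unknown declared region: {job.declared_region}")
    if not job.target_regions:
        problems.append("job declares no targets")
    # Any target outside the declared region is a potential cross-region impact.
    stray = sorted(job.target_regions - {job.declared_region})
    if stray:
        problems.append(f"job would also touch: {', '.join(stray)}")
    return problems
```

For example, a job declared for EU whose targets include both EU and NA would be reported with "job would also touch: NA" and could be blocked before it runs.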

Posted Sep 16, 2024 - 19:29 PDT

Resolved

We have continued to observe sustained stability, with no further occurrences. This incident has been resolved.

Posted Sep 05, 2024 - 12:02 PDT

Update

This incident has been resolved.

Posted Sep 04, 2024 - 12:48 PDT

Update

Our further investigation has revealed that recent operational actions may have inadvertently affected system configurations, leading to intermittent errors and slow load times.

The issue has now been resolved, and our health checks confirm that the system is stable and functioning normally. We will continue to monitor the situation to ensure ongoing stability and performance.

Posted Sep 04, 2024 - 12:42 PDT

Update

Our assessment indicates that an alarm triggered by our monitoring systems deployed a new instance, which is suspected to be contributing to the issue. The impacted instance has been removed, but the problem persists. We are investigating further.

Posted Sep 04, 2024 - 12:21 PDT

Investigating

Incident Description: We are currently investigating an issue affecting the IT Asset Management (ITAM) platform in the NA region. As a result, some customers may experience slow load times or errors when trying to access ITAM views.

Priority: P1

Restoration Activity: Our technical teams are actively involved and are evaluating the situation. Additionally, we are exploring potential solutions to rectify the issue as quickly as possible.

Posted Sep 04, 2024 - 11:52 PDT

This incident affected: Flexera One - IT Asset Management - North America (IT Asset Management - US Login Page).