Flexera One - SaaS Manager - APAC - Delay in New Data Availability
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One - SaaS Manager - APAC - Delay in New Data Availability

Timeframe: March 28th, 2024, 6:02 PM to 8:49 PM PDT

Incident Summary

At 6:02 PM PDT on Thursday, March 28, 2024, our alerting systems detected a service degradation affecting the SaaS Manager in the APAC region. While pages were still accessible, the incident impacted our ability to write new data.

Following detection, our technical teams initiated an investigation to assess the potential impact on customer data and availability. Initially, we attempted to mitigate the issue by scaling up our infrastructure; however, this action had a limited effect on resolving the service degradation.

Upon deeper analysis, we identified the root cause as specific database instances in a compromised state within our service provider's environment. Consequently, at 7:09 PM PDT, a support case was raised with our service provider to address the issue.

At 8:00 PM PDT, backend changes were initiated by our service provider to migrate us to a larger instance to stabilize the cluster. Upon completion of the backend changes, health checks passed, indicating resolution of the immediate issue.

However, a processing backlog persisted due to an issue with the reaggregation queue. Subsequent investigation revealed that certain jobs were not being processed, necessitating a restart of the appropriate services to clear the backlog, which was completed at 8:49 PM PDT.
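
As an illustration of this step, the sketch below shows one way a stalled queue can be detected and its workers restarted. It is a minimal example under assumed infrastructure: the Redis-backed queue, the queue name, the stall threshold, and the service unit name are all hypothetical, not details of the actual reaggregation pipeline.

```python
import subprocess
import time

import redis  # assumes a Redis-backed job queue; the real broker may differ

QUEUE = "saas-manager:reaggregation"     # hypothetical queue name
STALL_THRESHOLD = 1000                   # jobs; hypothetical alerting threshold
WORKER_SERVICE = "reaggregation-worker"  # hypothetical systemd unit

def queue_depth(conn: redis.Redis) -> int:
    """Return the number of jobs waiting in the reaggregation queue."""
    return conn.llen(QUEUE)

def main() -> None:
    conn = redis.Redis()
    before = queue_depth(conn)
    if before < STALL_THRESHOLD:
        return  # backlog is within normal bounds

    # Sample again after a short wait: a backlog that is large *and* not
    # shrinking suggests the workers have stopped consuming jobs.
    time.sleep(60)
    after = queue_depth(conn)
    if after >= before:
        # Restart the worker service so it reconnects and resumes draining.
        subprocess.run(["systemctl", "restart", WORKER_SERVICE], check=True)

if __name__ == "__main__":
    main()
```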

Once the backlog was cleared and the queues were drained, we continued to monitor the system to confirm stability. With the system stable, the incident was closed and normal operations resumed.

Root Cause

The service degradation stemmed from memory and CPU utilization spikes on specific database instances within our service provider's environment. The vendor's database infrastructure was unable to manage these spikes effectively, leading to the degradation of the SaaS Manager service.

Remediation Actions

  1. Immediate Mitigation: Upon detection of the issue, our technical teams attempted to scale up the database to the next tier. This measure proved ineffective because the compromised state of the instances prevented us from scaling up independently; as a result, intervention from the service provider was required.
  2. Vendor Engagement: We engaged the service provider to address the resource-intensive operations and restore service stability. The vendor identified the need to scale up and migrate our services to a larger instance, and the services returned to normal after the vendor's backend changes.
  3. Queue Clearing and Service Restart: Following the backend changes, the backlog was inspected and cleared, and the relevant services were restarted to ensure smooth operation.
  4. Continuous Monitoring: After resolution of the incident, continuous monitoring of the system was instituted to detect any recurrence of high resource utilization spikes promptly and to enable preemptive remediation (see the sketch after this list).
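
As a minimal sketch of what host-level monitoring for such spikes can look like, the loop below polls CPU and memory utilization and raises an alert when either crosses a threshold. The thresholds, polling interval, and alert hook are illustrative assumptions, not our production tooling.

```python
import time

import psutil  # cross-platform host metrics; production monitoring may differ

CPU_THRESHOLD = 85.0     # percent; illustrative threshold
MEMORY_THRESHOLD = 90.0  # percent; illustrative threshold
INTERVAL_SECONDS = 30

def alert(message: str) -> None:
    """Placeholder alert hook; a real deployment would page on-call."""
    print(f"ALERT: {message}")

def watch() -> None:
    while True:
        cpu = psutil.cpu_percent(interval=1)   # sampled over one second
        memory = psutil.virtual_memory().percent
        if cpu > CPU_THRESHOLD:
            alert(f"CPU utilization spike: {cpu:.1f}%")
        if memory > MEMORY_THRESHOLD:
            alert(f"Memory utilization spike: {memory:.1f}%")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    watch()
```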

Future Preventative Measures

  1. Cluster Tier Adjustment: The cluster has been upgraded to a higher tier, as recommended by the service provider, to improve its autoscaling capability.
  2. Memory Optimization Code Change: A code change has been implemented to optimize memory usage, reducing the system's memory requirements (illustrated below).
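
The specific change is internal to SaaS Manager; as a general illustration of the pattern involved, the hypothetical sketch below contrasts an eager aggregation that materializes an entire result set in memory with a streaming version whose peak memory use stays flat regardless of result-set size.

```python
def aggregate_usage_eager(cursor) -> int:
    # Before: fetchall() materializes every row at once, so peak memory
    # grows with the size of the result set.
    rows = cursor.fetchall()
    return sum(row[0] for row in rows)

def aggregate_usage_streaming(cursor) -> int:
    # After: iterating the cursor pulls rows incrementally, keeping only
    # one row in memory at a time.
    return sum(row[0] for row in cursor)
```
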
Posted May 06, 2024 - 19:36 PDT

Resolved
This incident has been resolved.
Posted Apr 01, 2024 - 16:54 PDT
Update
This incident has been resolved.
Posted Mar 28, 2024 - 22:12 PDT
Monitoring
We have implemented measures to scale up the environment, with positive results. We are continuing to monitor the environment for further progress.
Posted Mar 28, 2024 - 19:56 PDT
Update
We are actively implementing various measures, including scaling up efforts, to address the issue and ensure optimal service for our customers.
Posted Mar 28, 2024 - 19:14 PDT
Investigating
Incident Description: Our monitoring systems have identified a service degradation within our SaaS Manager platform, impacting new data availability for customers in the APAC region. While the platform remains accessible, customers may experience instances of stale data being displayed.

Priority: P2

Restoration Activity: Our technical teams are actively involved and are evaluating the situation. Additionally, we are exploring potential solutions to rectify the issue as quickly as possible.
Posted Mar 28, 2024 - 18:17 PDT
This incident affected: Flexera One - IT Asset Management - APAC (IT Asset Management - APAC SaaS Manager).