IT Visibility - US - Data processing is currently Paused

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - IT Visibility - NAM - Data Processing Delayed

Timeframe: November 14th, 4:39 PM PST to November 20th, 3:25 PM PST

Incident Summary

On Tuesday, November 14th, at 4:39 PM PST, an issue arose with the processing of IT Visibility data in the US region. While customers were able to use the ITV UI without disruption, there was a delay in the processing of new data.

Our technical team noted a series of incidents over the preceding 24 hours in which two nodes within the database cluster underwent multiple restarts, disrupting both read and write operations. This, in turn, led to a backlog in the associated service. To address the issue, an S1 case was promptly opened to engage our service provider for a thorough investigation.

On November 14th, at 5:42 PM PST, our service provider identified that input/output operations per second (IOPS) limits had been exceeded, triggering server restarts as the system attempted to allocate additional memory.

Following a comprehensive investigation and discussions with the service provider, a configuration adjustment was implemented at 9:00 PM PST to restrict streaming concurrency. At 10:03 PM PST, a significant backlog of write operations still persisted in the database, so the team raised the IOPS allocation to the highest allowable limit.
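
For context, a concurrency restriction of this kind can be approximated with a simple bounded-write pattern. The sketch below is a minimal, hypothetical illustration in Python; the actual streaming service, database client (db_client.bulk_write), and concurrency limit are not disclosed in this report and are assumed here.

    import asyncio

    # Hypothetical sketch only: the real streaming service, client, and limit
    # values are not named in this report and are assumed for illustration.
    MAX_CONCURRENT_WRITE_BATCHES = 4  # assumed cap chosen to stay within IOPS limits

    write_gate = asyncio.Semaphore(MAX_CONCURRENT_WRITE_BATCHES)

    async def write_batch(db_client, batch):
        # The semaphore bounds the number of in-flight write batches so the
        # database is not pushed past its provisioned IOPS.
        async with write_gate:
            await db_client.bulk_write(batch)  # placeholder for the real write call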

During monitoring at 10:43 PM PST, the team observed that data ingestion was up to date. However, the processing service was still struggling because the database could not handle write requests quickly enough.

Overnight, the team monitored the system and made adjustments to improve processing speed. However, due to storage constraints, the best scaling option involved transitioning to a higher tier, which required setting up a new cluster and migrating data. Because this process has historically taken around two days, the team decided to give the current system additional time to stabilize and complete processing before committing to the transition.

On Wednesday, November 15th, at 8:15 PM PST, following significant improvements observed throughout the day, the team found that the database's writer queue had dropped significantly and that the system was operating at full capacity from both the application and database perspectives. However, the downstream service had accumulated a backlog of a few hours.

As a proactive measure, on Thursday, November 16th, at 5:50 PM PST, the database scaling was ultimately initiated, while our teams continued to work on enhancing the code.

Based on historical data, such scaling processes typically take two to three days. As of Friday, November 17th, at 12:12 AM PST, the accumulated backlog still amounted to around 26 hours. We diligently addressed issues affecting the ongoing scaling process, and at 2:26 PM PST the technical team found that the growth of the backlog had slowed, indicating that the database was still scaling, with the second node almost ready and two more to go.

Over the weekend, the team continued to monitor progress and address any obstacles that arose. On Monday, November 20th, at 4:44 AM PST, the new node completed the synchronization process and became operational. Database throughput improved significantly with two new nodes in a ready state, and one of them was promoted to primary to handle write operations.

On Monday, November 20th, at 3:25 PM PST, after continuous monitoring and additional health checks, the technical team confirmed that the system was fully up to date, with data available in real time. The incident was then considered resolved.

Root Cause Analysis:

Primary Root Causes:

  1. IOPS Limits Exceeded: System disruptions and restarts occurred because input/output operations per second (IOPS) limits were exceeded.
  2. Scaling Challenges: Transitioning to a higher tier and scaling the database led to delays, disruptions, and a processing backlog.

Contributing Factors:

  1. Concurrency Challenges: High levels of concurrency affected processing efficiency, necessitating adjustments.
  2. Storage Limitations: Scaling decisions were influenced by storage constraints, impacting overall system performance.
  3. Data Ingestion Discrepancy: Incoming data outpaced the system's processing and write capacity, making it difficult to handle write requests efficiently.

Remediation Actions

  1. Concurrency Optimization: Adjusted the configuration to restrict streaming concurrency on November 14th at 9:00 PM PST.
  2. IOPS Efficiency Enhancement: Raised the IOPS allocation to the highest allowable limit on November 14th at 10:03 PM PST.
  3. Proactive Scaling: Initiated comprehensive scaling for storage and data ingestion on November 16th at 5:50 PM PST.
  4. System Stabilization Period: Allowed dedicated time for system stabilization before considering further transitions.
  5. Node Synchronization: Monitored and synchronized the new database nodes over the weekend of November 17th to November 19th.
  6. Continuous Health Monitoring: Conducted regular health checks and maintained continuous monitoring throughout the incident (a simple illustrative sketch follows this list).
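
As a rough illustration of the continuous health monitoring noted in item 6, a periodic backlog check of the following shape could be used. This is a hypothetical sketch: the metric source (get_backlog_hours), alerting hook, threshold, and interval are assumptions and are not described in this report.

    import time

    BACKLOG_ALERT_HOURS = 2.0       # assumed alerting threshold
    CHECK_INTERVAL_SECONDS = 300    # assumed check interval

    def monitor_backlog(get_backlog_hours, alert):
        # get_backlog_hours() and alert() are placeholders for whatever metric
        # source and notification channel the team actually used.
        while True:
            backlog = get_backlog_hours()
            if backlog > BACKLOG_ALERT_HOURS:
                alert(f"IT Visibility processing backlog is {backlog:.1f} hours")
            time.sleep(CHECK_INTERVAL_SECONDS)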

Future Preventative Measure

Infrastructure Enhancement for Performance Assurance: As part of our proactive strategy for future risk mitigation, we have moved our data infrastructure to a high-performance tier. This enhancement uses advanced storage technology that provides a surplus of input/output operations per second (IOPS), significantly exceeding our system's anticipated requirements, and it is expected to have a notable positive impact on overall system performance.

Posted Dec 19, 2023 - 12:50 PST

Resolved

Backlog processing has completed, and all data is now up to date. This incident is now resolved.
Posted Nov 20, 2023 - 15:29 PST

Update

Our recent enhancements are showing positive outcomes, contributing to system stability. Concurrently, we have implemented measures to boost overall efficiency. We'll keep you informed of any significant developments.
Posted Nov 20, 2023 - 06:37 PST

Update

We encountered a brief period of system instability and took immediate corrective actions. The system is currently in recovery. Simultaneously, we have implemented measures to enhance data processing efficiency. We will continue to provide updates as any further developments occur.
Posted Nov 19, 2023 - 10:56 PST

Update

We are closely monitoring the progress of the ongoing scaling changes. Meanwhile, backlog processing is still in progress, and we will continue to provide updates as progress is made.
Posted Nov 17, 2023 - 12:02 PST

Update

Backlog processing is taking longer than expected. We are scaling up our environment to improve processing throughput.
Posted Nov 17, 2023 - 00:25 PST

Update

Backlog processing is continuing as planned.
Posted Nov 15, 2023 - 21:09 PST

Update

We are currently addressing backlog processing, with our technical team closely overseeing the situation. They are also proactively developing plans to overcome any potential obstacles that may arise during this process.
Posted Nov 15, 2023 - 08:16 PST

Monitoring

The fix has been implemented successfully, and data processing has resumed. We are currently working through the backlog.
Posted Nov 14, 2023 - 22:50 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 14, 2023 - 21:33 PST

Investigating

Incident Description:
Technical staff have identified an issue with IT Visibility processing in the US instance. As a result, backend processing has been paused while the issue is investigated.

Services are still available; however, data is not being updated at this time.

Priority: 2

Restoration activity:
Technical teams have been engaged and are currently investigating.
Posted Nov 14, 2023 - 16:54 PST
This incident affected: Flexera One - IT Visibility - North America (IT Visibility US).