Description: Flexera One - IT Visibility - NAM – Data Processing Delayed
Timeframe: November 14th, 4:39 PM to November 20th, 3:25 PM PST
Incident Summary
On Tuesday, November 14th, at 4:39 PM PST, an issue arose with the processing of IT Visibility data in the US region. While customers were able to use the ITV UI without disruption, there was a delay in the processing of new data.
Our technical team noted a series of incidents over the preceding 24 hours in which two nodes within the database cluster underwent multiple restarts, disrupting both read and write operations. This, in turn, led to a backlog in the associated processing service. To address the issue, an S1 case was promptly opened to engage our service provider for a thorough investigation.
On November 14th, at 5:42 PM PST, our service provider identified that the database's input/output operations per second (IOPS) limits had been exceeded, triggering server restarts as the system attempted to allocate additional memory.
Following a comprehensive investigation and discussions with the service provider, a configuration adjustment was implemented at 9:00 PM PST to restrict streaming concurrency. At 10:03 PM PST, a significant backlog of write operations still persisted in the database, so the team raised the provisioned IOPS to the highest allowable limit.
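The specific database and configuration mechanism involved are not named in this summary. As a rough, hypothetical illustration of the kind of adjustment described above, the Python sketch below shows one common way to cap how many streaming write batches run against a database at once, so that bursts of writes cannot push IOPS demand past the provisioned limit (all names and values are assumptions, not the actual implementation):

# Illustrative sketch only; the real service, client, and limits are not
# specified in this report.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_WRITES = 4  # hypothetical streaming concurrency cap

class StubDbClient:
    # Stand-in for the real database client; simulates a bulk write.
    def bulk_insert(self, batch):
        time.sleep(0.01)  # pretend I/O
        return len(batch)

db_client = StubDbClient()
write_slots = threading.Semaphore(MAX_CONCURRENT_WRITES)

def write_batch(batch):
    # Holding a semaphore slot ensures no more than MAX_CONCURRENT_WRITES
    # batches hit the database at the same time, keeping IOPS demand bounded.
    with write_slots:
        return db_client.bulk_insert(batch)

def process_stream(batches):
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_WRITES) as pool:
        list(pool.map(write_batch, batches))

if __name__ == "__main__":
    process_stream([["record"] * 100 for _ in range(20)])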
During monitoring at 10:43 PM PST, data ingestion was observed to be up to date. However, the processing service was still falling behind, as the database could not handle write requests quickly enough.
Overnight, the team continued monitoring and made adjustments to improve processing speed. However, due to storage constraints, the best scaling option involved transitioning to a higher tier, which required setting up a new cluster and migrating data, a process that has historically taken around 2 days. The team therefore decided to give the current system additional time to stabilize and complete processing before committing to the transition.
On Wednesday, November 15th, at 8:15 PM PST, following significant improvements observed throughout the day, the team found that the database's write queue had dropped substantially and the system was operating at full capacity from both the application and database perspectives. However, the downstream service had accumulated a backlog of a few hours.
As a proactive measure, on Thursday, November 16th, at 5:50 PM PST, database scaling was ultimately initiated, while our teams continued to work on code enhancements.
Based on historical data, such scaling processes typically take 2-3 days. On Friday, November 17th, at 12:12 AM PST, the accumulated backlog still amounted to around 26 hours. We diligently addressed any issues affecting the ongoing scaling process, and at 2:26 PM PST the technical team found that the growth of the backlog had slowed; the database was still scaling, with the second node almost ready and two more to go.
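For context on how a backlog figure like the 26 hours above translates into recovery time, the sketch below walks through the simple arithmetic: a backlog drains only as fast as processing throughput exceeds the incoming data rate. The rates used are illustrative assumptions, not measurements from this incident:

# Illustrative arithmetic only; the rates below are assumptions, not
# measurements taken during this incident.
def hours_to_drain(backlog_hours, ingest_rate, process_rate):
    # backlog_hours: backlog expressed as hours' worth of incoming data
    # ingest_rate:   incoming data volume per hour
    # process_rate:  processed data volume per hour (must exceed ingest_rate)
    surplus = process_rate - ingest_rate
    if surplus <= 0:
        raise ValueError("backlog cannot drain unless processing outpaces ingestion")
    return backlog_hours * ingest_rate / surplus

# Example: a 26-hour backlog with processing running 30% faster than ingestion
# clears in roughly 26 / 0.3, i.e. about 87 hours at a steady rate, which is
# why additional database throughput was needed.
print(round(hours_to_drain(26, ingest_rate=1.0, process_rate=1.3), 1))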
Over the weekend, the team continued to monitor progress and address any obstacles that arose. On Monday, November 20th, at 4:44 PM PST, the new node completed synchronization and became operational. Database throughput improved significantly with two new nodes in a ready state, and one of them was promoted to primary to handle write operations.
On Monday, November 20th, at 3:25 PM PST, after continuous monitoring and additional health checks, the technical team confirmed that the system was fully up to date, with data available in real time. The incident was subsequently considered resolved.
Root Cause Analysis:
Primary Root Causes:
The database cluster exceeded its provisioned IOPS limits, causing two nodes to restart repeatedly; the resulting interruptions to read and write operations created a backlog of write requests that the processing service could not clear quickly enough.
Contributing Factors:
Storage constraints limited the available scaling options; moving to a higher performance tier required provisioning a new cluster and migrating data, a process that historically takes around 2-3 days and extended the time needed to clear the backlog.
Remediation Actions
Streaming concurrency was restricted through a configuration adjustment, provisioned IOPS were raised to the highest allowable limit, and the database was migrated to a higher performance tier with new cluster nodes, one of which was promoted to primary to handle write operations.
Future Preventative Measure
Infrastructure Enhancement for Performance Assurance: As part of our proactive strategy for mitigating future risk, we have moved our data infrastructure to a higher-performance tier. This upgrade uses advanced storage technology that provides IOPS headroom well in excess of the system's anticipated requirements and is expected to deliver a notable improvement in overall system performance.