Description: Flexera One – IT Asset Management – NDI Inventory Data delayed
Timeframe:
March 13th, 2024, 7:08 PM to March 27th, 2024, 7:13 AM PDT
April 17th, 2024, 12:12 AM to April 21st, 2024, 4:12 PM PDT
Incident Summary
On Wednesday, April 17th, at 12:12 AM PDT, we experienced a recurrence of an issue causing delays in processing NDI inventory files uploaded to the IT Asset Management platform in the NA region. This may have impacted multiple organizations, resulting in delayed updates to their inventory records.
After identifying the issue, our technical teams implemented several measures to restore normal operations, including deploying additional inventory nodes to increase throughput; these nodes were operational by 11:25 AM PDT. Despite these efforts, monitoring over the subsequent hours did not show significant improvement, and the technical team continued to observe inventory drops within the 24-hour timeframe.
During the investigation, we observed an unusual increase in wait times on one of the databases, associated with a process that relocates computers to new organizational units, with the majority of occurrences traced back to a single tenant. The impact may have been exacerbated by other tenants sharing the same database.
To address this, at 1:09 PM PDT we implemented traffic segregation to isolate inventory nodes, aiming to accelerate processing. This produced an immediate improvement, with a significant increase in data processing. Later in the day, however, throughput declined again, indicating that traffic from a single source might not have been the sole bottleneck, and we reverted the change. We nevertheless scaled up inventory resources to expedite processing while continuing to explore additional solutions.
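As a rough sketch of what such segregation can look like, uploads from a flagged tenant could be routed to a dedicated pool of inventory nodes. The tenant identifiers, node names, and routing rule below are assumptions for illustration and do not reflect the platform's actual implementation.

```python
# Hypothetical sketch of tenant-based traffic segregation.
# Pool names, tenant IDs, and the routing rule are illustrative only.

ISOLATED_TENANTS = {"tenant-heavy-001"}   # tenants whose uploads are segregated

NODE_POOLS = {
    "default": ["inv-node-1", "inv-node-2", "inv-node-3"],
    "isolated": ["inv-node-4"],            # dedicated nodes added during the incident
}


def route_upload(tenant_id: str, file_name: str) -> str:
    """Pick an inventory node for an uploaded NDI file.

    Uploads from flagged tenants go to the isolated pool so a single
    heavy source cannot slow processing for everyone else.
    """
    pool = "isolated" if tenant_id in ISOLATED_TENANTS else "default"
    nodes = NODE_POOLS[pool]
    # Spread files across the pool's nodes.
    return nodes[hash((tenant_id, file_name)) % len(nodes)]


if __name__ == "__main__":
    print(route_upload("tenant-heavy-001", "upload-0001.ndi"))  # isolated pool
    print(route_upload("tenant-other-042", "upload-0002.ndi"))  # default pool
```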
Throughout the remainder of the day, our monitoring indicated a slight improvement, with a corresponding decrease in backlog. On Thursday, April 18th, at 7:28 AM PDT, processing rates increased once more, yet did not reach satisfactory levels. Following investigations and discussions, the technical teams decided to further augment inventory resources. However, given the existing backlog, it was anticipated that processing would extend into the weekend to fully catch up.
Throughout the rest of the incident, teams continued to implement improvements. This involved disabling non-urgent tasks to prevent bottlenecks and making further adjustments to inventory resources. Teams also conducted manual interventions over the following days, such as further segregating traffic from specific tenants, to promptly clear the backlog queues.
On Sunday, April 21st, at 9:51 AM PDT, our technical teams confirmed that backlog processing was nearly complete, with new inventory being processed at optimal levels. However, because incoming inventory rates over the weekend are typically lower than on weekdays, the decision was made to extend monitoring into the following weekdays. Later that same day, at 4:12 PM PDT, the backlog was fully cleared.
Following extended monitoring into Monday, we observed continued positive outcomes, with data processing maintaining its weekday normal rate. After successful health checks and confirmation from our technical teams, we concluded the incident as resolved on Monday, April 22nd.
Root Cause
Primary root cause:
Traffic congestion leading to processing bottlenecks on the IT Asset Management platform in the NA region, resulting in delays in processing NDI inventory files.
Contributing factors:
• Unusual increase in wait times on one of the databases associated with relocating computers to new organizational units.
• Majority of occurrences traced back to a single tenant, potentially exacerbated by other tenants sharing the same database (see the illustrative sketch after this list).
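To illustrate the blocking mechanism behind this contributing factor, the simulation below, which is not taken from the platform's code, shows how one tenant's long-running operation holding a shared database resource delays another tenant's writes; the tenant names, durations, and lock object are hypothetical.

```python
# Illustrative simulation of cross-tenant blocking on a shared database.
# Tenant names, durations, and the shared lock are assumptions for the example.
import threading
import time

shared_db_lock = threading.Lock()      # stands in for a contended resource on the shared database


def relocate_computers(tenant_id: str, seconds: float) -> None:
    """Simulates the organizational-unit relocation holding the shared resource."""
    with shared_db_lock:
        time.sleep(seconds)             # long-running work while holding the resource
    print(f"{tenant_id}: relocation finished after {seconds}s")


def process_inventory(tenant_id: str) -> None:
    """Simulates another tenant's inventory write waiting on the same resource."""
    start = time.monotonic()
    with shared_db_lock:
        waited = time.monotonic() - start
    print(f"{tenant_id}: inventory write waited {waited:.1f}s")


if __name__ == "__main__":
    heavy = threading.Thread(target=relocate_computers, args=("tenant-a", 2.0))
    other = threading.Thread(target=process_inventory, args=("tenant-b",))
    heavy.start()
    time.sleep(0.1)                     # ensure tenant-a acquires the resource first
    other.start()
    heavy.join()
    other.join()
```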
Remediation Actions
• Traffic Segregation Initiative: Implementing traffic segregation to isolate inventory nodes and accelerate processing.
• Inventory Resource Scaling: Scaling up inventory resources to expedite data processing.
• Task Prioritization: Disabling non-urgent tasks to alleviate bottlenecks.
• Manual Intervention: Conducting manual interventions, such as further segregating traffic from tenants to promptly clear backlog queues.
• Continuous Monitoring and Adjustment: Continued monitoring and adjustments to ensure optimal processing efficiency.
Future Preventative Measures
Problem Record Establishment: A problem record has been opened to track the progress of both short and long-term measures, facilitating ongoing monitoring and improvement of system performance and reliability.
Short Term Measures:
• Queue System Implementation: Deploying a queue system per tenant to prevent blocking between tenants and mitigate processing delays (a minimal sketch combining this with the concurrency adjustment below follows this list).
• Concurrency Setting Adjustment: Modifying concurrency settings to reduce the server's vulnerability to blockages caused by a single source, enhancing overall system stability.
• Monitoring Enhancement: Strengthening monitoring capabilities to swiftly detect and address any emerging issues, ensuring proactive management of potential bottlenecks.
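As a minimal sketch of the first two measures, the example below pairs a per-tenant queue with a fixed number of workers per tenant, so a backlog from one tenant cannot starve the others. The worker counts, queue structure, and processing stub are assumptions for illustration, not the platform's actual configuration.

```python
# Minimal sketch of per-tenant queues with a per-tenant concurrency cap.
# Worker counts, queue structure, and the process_file stub are illustrative.
import queue
import threading
from collections import defaultdict

WORKERS_PER_TENANT = 2     # concurrency cap: one tenant can never use more workers than this

tenant_queues = defaultdict(queue.Queue)    # one FIFO queue per tenant


def process_file(tenant_id: str, ndi_file: str) -> None:
    # Placeholder for the actual NDI inventory processing step.
    print(f"processed {ndi_file} for {tenant_id}")


def worker(tenant_id: str) -> None:
    """Drain one tenant's queue; a backlog here cannot block other tenants."""
    q = tenant_queues[tenant_id]
    while True:
        ndi_file = q.get()
        if ndi_file is None:               # sentinel: shut this worker down
            q.task_done()
            return
        try:
            process_file(tenant_id, ndi_file)
        finally:
            q.task_done()


def start_tenant(tenant_id: str) -> None:
    for _ in range(WORKERS_PER_TENANT):
        threading.Thread(target=worker, args=(tenant_id,), daemon=True).start()


if __name__ == "__main__":
    for tenant in ("tenant-a", "tenant-b"):
        start_tenant(tenant)
    # Even if tenant-a uploads a large backlog, tenant-b's workers stay free.
    for i in range(5):
        tenant_queues["tenant-a"].put(f"upload-{i:04}.ndi")
    tenant_queues["tenant-b"].put("upload-0000.ndi")
    for tenant in ("tenant-a", "tenant-b"):
        for _ in range(WORKERS_PER_TENANT):
            tenant_queues[tenant].put(None)
        tenant_queues[tenant].join()
```

In this arrangement, adjusting WORKERS_PER_TENANT is the concurrency setting: it caps how much of the shared capacity any single source can consume.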
Long Term Plan:
UDI-ITAM Tenant-Specific Inventory Ingestion Pipeline Implementation: Scheduled for Q3 2024, this initiative involves developing and deploying a dedicated inventory ingestion pipeline for each tenant, improving system efficiency and resilience and preventing recurrence of this type of incident.