Flexera One - IT Asset Management - NA - Batch Failures Resulting in Temporary Active Directory User Import Suspension
Incident Report for Flexera System Status Dashboard
Postmortem

Description:  Flexera One - IT Asset Management - NA - Batch Failures Resulting in Temporary Active Directory User Import Suspension

Timeframe:  July 9th 2024, 12:30 PM PDT to July 12th 2024, 11:22 PM PDT

Incident Summary

On Tuesday, July 9th, 2024, at 12:30 PM PDT, our IT Asset Management platform in the North America region experienced a recurrence of a known issue involving database connection errors during overnight batch processing tasks. This problem affected multiple organizations, leading to batch task failures and daily reconciliation delays.

Our technical teams were promptly engaged, and their assessment indicated that these database connection errors were likely due to a bug in the Active Directory (AD) user import process, which had been causing a high number of blocked sessions and subsequent connection failures. As an initial mitigation, we temporarily disabled the AD user import process for all customers (the AD block) and adjusted our monitoring thresholds for quicker issue detection. While the block was in place, it interrupted the normal flow of user data imports. Our teams then discussed strategies to manage the recurring AD import failures and expedite the recovery process.

On July 12th, at 1:59 PM PDT, we lifted the previously implemented AD block so that we could monitor AD imports over the weekend, detect any blocked session alerts, and terminate any processes posing a risk. This was done to prevent further database connection failures and to ensure a successful import and reconciliation process for our customers. During the weekend, we monitored the services closely and observed no further failures while our technical teams worked on a permanent fix.

On July 18th, at 8:56 AM PDT, after extended monitoring throughout the week, the incident was declared resolved.

The permanent fix has been successfully tested in the UAT environment and is pending deployment to production. This is being tracked under a problem record.

Root Cause

Upon investigation, our technical teams identified that the database connection errors were primarily caused by a bug in the Active Directory (AD) user import process. The import process was causing a large number of blocked sessions, which led to database connection failures. The blocked sessions overwhelmed the system, preventing the successful completion of batch processing tasks and disrupting the reconciliation process.

Remediation Actions:

Short-term Measures:

o   Proactively lifted the AD block on the inventory load balancer.

o   Monitored for blocked session alerts to detect and prevent further failures.

o   Verified functionality through performance testing.
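
As an illustration only, the kind of blocked session check described above could be sketched as follows. This is not the monitoring Flexera runs: the `Session` record, the `find_head_blockers` function, and both thresholds are hypothetical, and the sketch assumes a snapshot of session data (session ID, blocking session ID, wait time) has already been read from the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    """One row of a hypothetical session snapshot."""
    session_id: int
    blocked_by: Optional[int]  # session_id of the blocker, or None if not blocked
    wait_ms: int               # how long this session has been waiting


def find_head_blockers(sessions, min_wait_ms=30_000, alert_at=5):
    """Return {head_blocker_id: blocked_count} for heads at or above alert_at.

    A head blocker is the session at the front of a blocking chain. Only
    sessions that have waited at least min_wait_ms are counted.
    """
    by_id = {s.session_id: s for s in sessions}

    def head_of(session_id):
        # Walk up the chain until reaching a session that is not itself
        # blocked, is missing from the snapshot, or has been visited (a cycle).
        seen = set()
        current = session_id
        while True:
            s = by_id.get(current)
            if s is None or s.blocked_by is None or current in seen:
                return current
            seen.add(current)
            current = s.blocked_by

    counts = {}
    for s in sessions:
        if s.blocked_by is not None and s.wait_ms >= min_wait_ms:
            head = head_of(s.blocked_by)
            counts[head] = counts.get(head, 0) + 1
    return {head: n for head, n in counts.items() if n >= alert_at}
```

Any session ID returned by such a check would be a candidate for review or termination. The appropriate thresholds depend on the workload and are not stated in this report.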

Long-term Fix:

o   Split the ReconcileUsers procedure into three segments for better performance: ReconcileUsersDeletion, ReconcileUsers, and ReconcileUsersUnknownOU.
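
The general shape of such a split can be sketched as below. This is a minimal illustration under stated assumptions, not the actual procedures: the `users` table, the work each phase performs (inferred only from the segment names), and the use of SQLite through Python's `sqlite3` module are all hypothetical. The point shown is that each segment runs in its own short transaction, which in general means locks are held for less time than in one long-running procedure.

```python
import sqlite3


def reconcile_users_deletion(conn, imported):
    """Phase 1 (assumed): remove users that no longer appear in the import."""
    with conn:  # commits on success, rolls back on error
        names = {name for name, _ in imported}
        existing = [row[0] for row in conn.execute("SELECT name FROM users")]
        conn.executemany(
            "DELETE FROM users WHERE name = ?",
            [(n,) for n in existing if n not in names],
        )


def reconcile_users(conn, imported):
    """Phase 2 (assumed): insert or update users that have a known OU."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO users(name, ou) VALUES(?, ?)",
            [(n, ou) for n, ou in imported if ou is not None],
        )


def reconcile_users_unknown_ou(conn, imported):
    """Phase 3 (assumed): record users whose OU could not be determined."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO users(name, ou) VALUES(?, 'Unknown')",
            [(n,) for n, ou in imported if ou is None],
        )


def run_reconcile(conn, imported):
    """Run the three phases in order, each in its own transaction."""
    for phase in (reconcile_users_deletion, reconcile_users,
                  reconcile_users_unknown_ou):
        phase(conn, imported)
```

Because each phase commits separately, a failure in a later phase does not roll back the earlier ones, so any real implementation would need each segment to be safe to re-run.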

Deployment:

o   Fix deployed to UAT environments and verified.

o   Requested production deployment, which is expected to be completed in August 2024.

Future Preventative Measures

  1. Optimization: Continuously review and optimize critical procedures to enhance performance and prevent similar incidents.
  2. Monitoring: Implement robust monitoring systems to detect anomalies and proactively address potential issues.
  3. Communication: Maintain transparent communication with customers and stakeholders during incidents and resolution processes.
Posted Aug 11, 2024 - 22:06 PDT

Resolved
We have extended our monitoring throughout this week and have detected no anomalies or recurrence of the issue. Our temporary measures have ensured sustained stability, and our long-term action plans are progressing smoothly towards implementation. Consequently, we have marked this incident as resolved and transitioned it to a problem management investigation.

This investigation will oversee the completion of the long-term fix, expected by the end of this week. A post-mortem report will be published in the coming days, detailing the actions taken for long-term measures.
Posted Jul 18, 2024 - 08:56 PDT
Update
There have been no new reports of issues, and our temporary measures continue to maintain system stability. According to our technical team's latest analysis, our testing and validation processes are on track. We anticipate moving forward with a full deployment by the end of this week.
Posted Jul 17, 2024 - 10:05 PDT
Update
The recent updates have successfully passed initial functionality tests. We are currently performing performance tests to ensure these enhancements work reliably. Our goal is to maintain consistent stability and efficiency of our systems.

Further updates will be provided as progress is made.
Posted Jul 16, 2024 - 06:52 PDT
Monitoring
The testing and validation phase is still ongoing, and our technical teams continue to make significant progress. While manual interventions and temporary measures have prevented any new issues, we have decided to prolong our monitoring process to ensure that the fixes remain effective and contribute to a stable, long-term solution. We appreciate your understanding as we work diligently to enhance our systems. We will continue to provide updates as we progress.
Posted Jul 15, 2024 - 16:10 PDT
Update
We have actively monitored the environment throughout the weekend and can confirm that there have been no batch failures or alerts related to connection issues. Our team continues to oversee batch processing tasks and Active Directory imports to ensure system stability.

Additionally, our technical team has completed the coding process and is now engaged in thorough testing and validation to ensure that all updates function as anticipated before their production deployment.

We are committed to maintaining a stable and reliable service and will provide updates as new information becomes available.
Posted Jul 15, 2024 - 08:05 PDT
Update
We have been actively monitoring the environment throughout the weekend. As of our latest evaluation, there have been no alerts or additional complications reported. Our team continues to closely observe the batch processing tasks and Active Directory imports. We are diligently working to identify the specific causes of the blockages to prevent future occurrences and ensure a stable service environment.

Thank you for your patience and understanding as we address this issue. We will provide further updates as they become available.
Posted Jul 14, 2024 - 09:12 PDT
Identified
Our teams have been actively discussing strategies to mitigate the problem and expedite the recovery process. In line with the proposed approach to managing recurring AD import failures, our technical team has proactively lifted the previously enforced AD block. This action is intended to enhance our monitoring and management capabilities throughout the weekend. We are closely monitoring for any blocked session alerts and are prepared to immediately terminate any processes that pose a risk, to prevent further database and connection failures.

Our primary objective continues to be ensuring a seamless and successful import/reconcile process for our customers. We plan to reassess the situation after the weekend and will decide whether to reimpose the AD block based on the outcomes. This strategy is aimed at maintaining system stability while addressing the underlying issue.

We will continue to provide updates as more information becomes available and further actions are taken.
Posted Jul 12, 2024 - 14:37 PDT
Update
Our technical teams have made substantial progress on the fix. Once development is finalized, we will conduct comprehensive testing and validation before proceeding with implementation. Concurrently, we are monitoring the environment to ensure ongoing stability. We will keep you updated as we make progress.
Posted Jul 12, 2024 - 11:03 PDT
Update
Due to the AD inventory block, we expect no further batch processing failures to occur. However, the temporary AD disablement is affecting the normal flow of user data imports for all customers.
Posted Jul 11, 2024 - 14:45 PDT
Investigating
Incident Description: We are currently experiencing a recurring issue with connection errors during overnight batch processing tasks on the IT Asset Management platform in the NA region. This issue is impacting multiple organizations, resulting in batch task failures and delayed reconciliations.

Our assessment indicates that these connection errors are likely due to a potential bug in the Active Directory (AD) user import process, which has been causing a high number of blocked sessions and subsequent connection failures over the past few nights.

Priority: P2

Restoration Activity: To mitigate the connection errors affecting overnight batch processing, we have temporarily disabled the AD user import process for all customers and adjusted our monitoring thresholds for quicker issue detection. Our team is working on a permanent fix to address the bug and ensure long-term stability. Continuous monitoring is in place to prevent further disruptions.
Posted Jul 11, 2024 - 14:12 PDT
This incident affected: Flexera One - IT Asset Management - North America (IT Asset Management - US Inventory Upload, IT Asset Management - US Batch Processing System).