Description: Flexera One - IT Asset Management - NA - Batch Failures Resulting in Temporary Active Directory User Import Suspension
Timeframe: July 9th 2024, 12:30 PM PDT to July 12th 2024, 11:22 PM PDT
Incident Summary
On Tuesday, July 9th, 2024, at 12:30 PM PDT, our IT Asset Management platform in the North America region experienced a recurrence of a known issue involving database connection errors during overnight batch processing tasks. This problem affected multiple organisations, leading to batch task failures and daily reconciliation delays.
Our technical teams were promptly engaged, and their assessment indicated that these database connection errors were likely due to a bug in the Active Directory (AD) user import process, which had been causing a high number of blocked sessions and subsequent connection failures. In response, our teams actively discussed strategies to mitigate the problem and expedite the recovery process. As part of the proposed approach to managing recurring AD import failures, our technical team proactively lifted the previously enforced AD block. This action was intended to enhance our monitoring and management capabilities over the weekend.
On July 12th, at 1:59 PM PDT, we removed the previously implemented AD block to better monitor and detect any blocked session alerts and to terminate any processes posing a risk. This was done to prevent further database and connection failures and to ensure a seamless and successful import/reconciliation process for our customers. During the weekend, we monitored the services closely and did not observe any further failures while our technical teams worked on a permanent fix.
On July 18th, at 8:56 AM PDT, after extended monitoring throughout the week, the incident was declared resolved.
The permanent fix has been successfully tested in the UAT environment and is pending deployment to production. This is being tracked under a problem record.
Root Cause
Upon investigation, our technical teams identified that the database connection errors were primarily caused by a bug in the Active Directory (AD) user import process. The import process was inadvertently causing a large number of blocked sessions, which led to database connection failures. The blocked sessions overwhelmed the system, preventing the successful completion of batch-processing tasks and disrupting the reconciliation process.
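The blocking pattern described above can be illustrated with a minimal sketch. It assumes session data has already been collected as simple records (a session ID and the ID of the session blocking it, if any); the `Session` type, the `find_head_blockers` function, and the sample data are hypothetical and are not part of the Flexera One platform or its monitoring tooling. The sketch walks each blocking chain to its root so that the session at the head of a pile-up can be identified.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Session:
    """Hypothetical snapshot of one database session."""
    session_id: int
    blocked_by: Optional[int]  # ID of the blocking session, or None if not blocked


def find_head_blockers(sessions: Iterable[Session], min_blocked: int = 1) -> Dict[int, int]:
    """Map each head blocker to the number of sessions it blocks, directly or indirectly."""
    sessions = list(sessions)
    blocked_by = {s.session_id: s.blocked_by for s in sessions}
    counts: Dict[int, int] = {}
    for s in sessions:
        if s.blocked_by is None:
            continue  # this session is not waiting on anyone
        current = s.session_id
        seen = set()
        # Follow the chain upward until reaching a session that is not blocked.
        while blocked_by.get(current) is not None and current not in seen:
            seen.add(current)
            current = blocked_by[current]
        if current in seen:
            continue  # circular wait (deadlock); no single head blocker
        counts[current] = counts.get(current, 0) + 1
    return {sid: n for sid, n in counts.items() if n >= min_blocked}


# Illustrative data only: session 1 heads a chain of two waiters, session 4 blocks one.
snapshot = [
    Session(1, None),
    Session(2, 1),
    Session(3, 2),
    Session(4, None),
    Session(5, 4),
]
print(find_head_blockers(snapshot))
```

Raising `min_blocked` would restrict the result to sessions responsible for larger pile-ups, which is one plausible way to decide which process to review first.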
Remediation Actions:
Short-term Measures:
o Proactively lifted the AD block on the inventory load balancer.
o Monitored and detected any blocked session alerts to prevent further failures.
o Verified functionality through performance testing.
Long-term Fix:
o Split the ReconcileUsers procedure into 3 segments for better performance: ReconcileUsersDeletion, ReconcileUsers, and ReconcileUsersUnknownOU.
Deployment:
o Fix deployed to UAT environments and verified.
o Requested production deployment, which is expected to be completed in August 2024.
Future Preventative Measures