Description: Flexera One - IT Asset Management - EU - Inventory Uploads Failing
Timeframe: April 3, 2025, 5:19 AM PST to April 3, 2025, 7:39 PM PST
Incident Summary
On April 3, 2025, at 5:19 AM PST, internal monitoring alerted the ITAM teams to an issue impacting inventory data processing for customers in the EU region. Initial investigation revealed that while the system remained accessible and services such as beacon status files and activity files were functioning, inventory uploads were failing. Multiple customers also reported inventory upload failures.
The issue began on April 2, 2025, at 1:00 PM PST, but its intermittent nature delayed detection. After further assessment, the incident was escalated to Priority 1 at 7:21 AM PST on April 3, once it was confirmed that inventory uploads were failing across the EU region and that all customers in the region were potentially affected.
Investigations confirmed that while servers were running all scheduled tasks successfully and could process test NDI files, they were not receiving data as expected from upstream systems. Concurrently, the technical team identified that requests were resulting in 502, 503, and 401 errors, indicating potential issues in the authentication layer.
The root cause was traced to a critical server responsible for handling authentication requests, which experienced degradation due to an underlying storage issue.
Root Cause
During the investigation, our technical teams identified that the disruption was caused by a failure in a critical authentication server, triggered by storage saturation due to excessive log file accumulation. The saturation degraded server performance, which surfaced as HTTP 502, 503, and 401 errors on incoming inventory upload requests. As a result, the downstream servers received no inventory data even though they were themselves operating normally.
Remediation Actions
· Upscaled server resources: Replaced the authentication server instance with a new one and increased its storage allocation, while keeping the rest of the infrastructure the same as before.
· Cleanup: Executed automated scripts to remove excess log files and free up storage.
· Restored service: Inventory upload functionality resumed at 7:39 PM PST on April 3, 2025.
· Backlog monitoring: The teams closely monitored the backlog of inventory data and processing throughput.
· Issue closure: After confirming that all backlog was cleared and uploads were stable, the issue was marked as fully resolved at 2:02 AM PST on April 5, 2025.
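This report does not describe the cleanup scripts themselves. Purely as an illustration, a log purge of this kind might look like the Python sketch below. The function name purge_old_logs, the 14-day retention period, the *.log pattern, and the dry-run default are all assumptions made for the example, not details of the scripts used during the incident.

```python
import time
from pathlib import Path


def purge_old_logs(log_dir, max_age_days=14, pattern="*.log", dry_run=True):
    """Remove files in log_dir that match pattern and are older than max_age_days.

    All parameter defaults are hypothetical. With dry_run=True nothing is
    deleted; the function only reports which files would be removed.
    Returns the list of paths that were (or would be) removed.
    """
    cutoff = time.time() - max_age_days * 86400  # age limit in seconds
    removed = []
    for path in Path(log_dir).glob(pattern):
        # Only regular files whose last modification predates the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Running with dry_run=True first makes it possible to review the candidate files before anything is deleted, which is a cautious default on a server that is already degraded.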
Future Preventative Measures
· Implement proactive storage monitoring: Configure alerts for log file growth and disk space utilization on critical infrastructure components.
· Automated log management: Deploy scripts to automatically rotate and purge old logs to prevent storage saturation.
· Capacity planning review: Reassess baseline resource allocation (CPU, memory, disk) for high-impact servers to prevent performance bottlenecks.
· Documentation and runbooks: Update incident response procedures to include checks for authentication-related server saturation and storage issues.
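As a rough illustration of the proactive storage monitoring measure, a disk utilization check could be sketched in Python as follows. The function names, the 80% and 90% thresholds, and the monitored path are hypothetical. A production setup would more likely rely on the alerting features of an existing monitoring platform than on a standalone script.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of disk space in use on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent, warn_at=80.0, critical_at=90.0):
    """Map a utilization percentage to an alert level (thresholds are assumptions)."""
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    print(f"/ is {pct:.1f}% full: {classify_usage(pct)}")
```

Separating the measurement from the classification keeps the thresholds easy to tune per server, and a warning level below the critical one leaves time to act before storage saturates.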