Flexera One - IT Asset Management - EU - Inventory Uploads Failing

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One - IT Asset Management - EU - Inventory Uploads Failing

Timeframe: April 3, 2025, 5:19 AM PST to April 3, 2025, 7:39 PM PST

Incident Summary

On April 3, 2025, at 5:19 AM PST, internal monitoring alerted the ITAM teams to an issue impacting inventory data processing for customers in the EU region. Initial investigation revealed that while the system remained accessible and services such as beacon status files and activity files were functioning, inventory uploads were failing. Multiple customers also reported failing inventory uploads.

The issue began on April 2, 2025, at 1:00 PM PST, but its intermittent nature delayed detection. After further assessment, the incident was escalated to Priority 1 at 7:21 AM PST on April 3, once it was confirmed that inventory uploads were failing across the EU region and that all customers in the region were potentially affected.

Investigations confirmed that while servers were running all scheduled tasks successfully and could process test NDI files, they were not receiving data as expected from upstream systems. Concurrently, the technical team identified that requests were resulting in 502, 503, and 401 errors, indicating potential issues in the authentication layer.

The root cause was traced to a critical server responsible for handling authentication requests, which experienced degradation due to an underlying storage issue.

Root Cause

During the investigation, our technical teams identified that the disruption was caused by a failure in a critical authentication server, triggered by storage saturation due to excessive log file accumulation. This degraded the server's performance, which manifested as HTTP 502, 503, and 401 errors when processing incoming inventory uploads. As a result, the inventory processing servers did not receive data, even though they were themselves operating normally.

Remediation Actions

·        Upscaled server resources: Replaced the authentication server instance with a new one that has a larger storage allocation, keeping the rest of the infrastructure unchanged.

·        Cleanup: Executed automated scripts to remove excess log files and free up storage.

·        Restored service: Inventory upload functionality resumed at 7:39 PM PST on April 3, 2025.

·        Backlog monitoring: The teams closely monitored the backlog of inventory data and processing throughput.

·        Issue closure: After confirming that all backlog was cleared and uploads were stable, the issue was marked as fully resolved at 2:02 AM PST on April 5, 2025.

Future Preventative Measures

·        Implement proactive storage monitoring: Configure alerts for log file growth and disk space utilization on critical infrastructure components.

·        Automated log management: Deploy scripts to automatically rotate and purge old logs to prevent storage saturation.

·        Capacity planning review: Reassess baseline resource allocation (CPU, memory, disk) for high-impact servers to prevent performance bottlenecks.

·        Documentation and runbooks: Update incident response procedures to include checks for authentication-related server saturation and storage issues.
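
As an illustration only of the first two measures above (storage alerts and automated log purging), the following is a minimal sketch in Python. It does not describe Flexera's actual tooling: the function names, the 80% alert threshold, the age-based retention rule, and the `*.log` file pattern are all assumptions chosen for the example.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def disk_usage_percent(path: str) -> float:
    """Return the percentage of space used on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_alert(used_percent: float, threshold_percent: float = 80.0) -> bool:
    """Return True when utilization reaches the (hypothetical) alert threshold."""
    return used_percent >= threshold_percent


def purge_old_logs(
    log_dir: str, max_age_days: float, now: Optional[float] = None
) -> List[str]:
    """Delete *.log files directly inside `log_dir` older than `max_age_days`.

    Age is judged by file modification time. Returns the sorted names of the
    files that were deleted. Files with other extensions are left untouched.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for entry in Path(log_dir).glob("*.log"):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

A scheduled job could call `should_alert(disk_usage_percent(log_dir))` to raise an alert and `purge_old_logs(log_dir, max_age_days=7)` to reclaim space, where `log_dir` and the seven-day retention period are placeholders. In practice an established log rotation utility may be preferable to a custom script.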

Posted Apr 17, 2025 - 01:55 PDT

Resolved

This issue has been resolved. The backlog has reached an acceptable level and continues to decrease steadily.
Posted Apr 05, 2025 - 02:04 PDT

Update

Our teams have confirmed, after extended monitoring, that inventory uploads are functioning normally without any issues. Due to the recent disruption, a backlog has accumulated; however, measures have been implemented to accelerate its processing. We are closely monitoring the platform and will share further updates as they become available.
Posted Apr 04, 2025 - 06:08 PDT

Monitoring

Our teams have scaled up the infrastructure, and uploads are now functioning as expected. We will continue to monitor the services closely.
Posted Apr 03, 2025 - 19:58 PDT

Identified

Our teams have identified the root cause of the issue and are actively working on a fix. We will provide further updates as soon as they become available.
Posted Apr 03, 2025 - 11:31 PDT

Update

Our technical teams have identified that a key service is returning 502 errors when attempting to connect, preventing data ingestion. They are actively investigating the root cause and working on a resolution.
Posted Apr 03, 2025 - 10:16 PDT

Update

Our teams are actively investigating the issue and have isolated it to a specific service component. They are now focused on identifying the root cause and working toward a swift resolution to restore services.
Posted Apr 03, 2025 - 07:37 PDT

Investigating

Incident Description: We are currently addressing an issue impacting inventory data processing for IT Asset Management (ITAM) services in the EU region for all customers. While the system remains accessible, affected users may face issues with inventory uploads failing.

Priority: P2

Restoration Activity: Our technical team is actively working on identifying the root cause and restoring services. Further updates will be provided as we continue our efforts to resolve the incident.
Posted Apr 03, 2025 - 05:28 PDT
This incident affected: Flexera One - IT Asset Management - Europe (IT Asset Management - EU Inventory Upload).