FlexNet Operations Intermittent Connectivity
Incident Report for Flexera System Status Dashboard
Postmortem

FlexNet Operations Cloud Incident Report

Description: FNP Activation request causing performance issue

Reported On: August 24, 2019

Tracking #: SSRE-328

Summary

Between August 24 and 31, the system experienced periods of degraded or unresponsive performance. This was the result of unexpected client behavior during the license activation process, which caused excessively high CPU load on a primary database and slowed processing for all services relying on it.

Investigation / Analysis

System alerts began triggering around 2 AM Pacific on August 24, indicating a database problem. Restarts were performed in an effort to clear and reset connections to the system, and traffic was moved to a secondary replica server. Neither proved helpful: the problem continued after the restarts and followed the traffic to the secondary server. Some high-volume traffic was then rerouted to the disaster recovery site. This proved effective for that traffic but did not resolve the high-CPU condition in the primary data center.

Further analysis indicated that a single SQL query was consuming the majority of the CPU. A review of the code showed that the query in question is dynamically generated based on the activity of the client attempting to license itself using FNP Activation. In this case, the client's activity resulted in the creation and execution of an unreasonably large, non-performant SQL statement.

Researching the FNP client activity revealed anomalous behavior. The client is expected to increment a sequence number with each request; in this case, that was not occurring. The FNO server application code did not adequately take that scenario into consideration when generating the SQL statement. The intermittent nature of the service disruption can be tied directly to this client's activity.
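The report does not include the actual query, but the failure mode described above can be sketched in a hypothetical reconstruction (the table and column names below are illustrative, not taken from the incident): a server that emits one predicate term per unconfirmed sequence number produces a query whose size is controlled entirely by client behavior.

```python
def build_activation_query(expected_seq: int, client_seq: int) -> str:
    """Hypothetical sketch of flawed dynamic SQL generation.

    The server expects the client to increment its sequence number on
    every request. If the client is stuck repeating an old number, the
    gap between client_seq and expected_seq grows without bound, and so
    does the generated predicate: one OR term per candidate record.
    """
    terms = [f"(client_seq = {n})" for n in range(client_seq, expected_seq + 1)]
    return "SELECT * FROM activations WHERE " + " OR ".join(terms)

# A well-behaved client yields a short query ...
small = build_activation_query(expected_seq=5, client_seq=4)
# ... while a stuck client yields a predicate with tens of thousands of
# terms, which the database burns CPU parsing and evaluating.
large = build_activation_query(expected_seq=50_000, client_seq=0)
```

A statement like `large` must be hard-parsed and evaluated term by term on every request, which is consistent with the sustained high-CPU condition described in the analysis.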

Root Cause

Anomalous client behavior during FNP Trusted Storage Activation exposed an application design flaw: the application dynamically created a SQL statement that consumed all available database CPU when executed.

Resolution / Corrective Action

The immediate resolution was to alter the data related to the misbehaving client. This prevented the generation and execution of the problematic SQL statement.

The application has also been updated and deployed (September 12) with new SQL that accounts for similar scenarios.
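The report does not detail the deployed SQL change. One common defensive pattern for this class of problem, sketched here with illustrative names not taken from the incident, is to validate the client's sequence number against a bounded window and issue a fixed, parameterized query whose cost does not depend on client behavior.

```python
MAX_SEQ_GAP = 100  # illustrative bound, not from the incident report

def build_activation_query_safe(expected_seq: int, client_seq: int):
    """Reject anomalous sequence numbers instead of widening the query.

    Returns a parameterized SQL statement plus its bind values, so the
    predicate size stays constant regardless of how the client behaves.
    """
    if not (expected_seq - MAX_SEQ_GAP <= client_seq <= expected_seq):
        raise ValueError("client sequence number outside expected window")
    sql = "SELECT * FROM activations WHERE client_seq BETWEEN ? AND ?"
    return sql, (client_seq, expected_seq)
```

Bind parameters also let the database reuse a single cached execution plan, avoiding the repeated hard parsing that a freshly generated statement forces on every request.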

Posted Sep 16, 2019 - 11:38 PDT

Resolved
On September 1, a database change was completed to better manage certain inbound queries suspected of causing the performance issue. Since the change, we have continued to closely monitor the systems and have confirmed that all indicators point to this issue being resolved. In investigating the root cause, we identified further optimizations that could be made to the code and are targeting these for an upcoming FlexNet Operations LLM release.
Posted Sep 05, 2019 - 07:32 PDT
Update
Based on the diagnostic information received following the update to 2019 R1.SP2.2 on August 29th, we have made an update to the database to handle some inbound queries that were causing the high resource utilization on the application servers. This has resulted in significantly improved performance and stability. At this time, we continue to investigate what caused the suspect inbound queries. The system has been 100% available for the past 24 hours. We believe all of the following open issues are resolved:

* UI sluggishness and general navigation
* Client applications or devices fail to activate
* File uploads via ESD

We will continue to update this page periodically, but FlexNet Operations is operable, stable, and fully functional.
Posted Sep 01, 2019 - 11:21 PDT
Update
Flexera Engineering, Database Administration, and Site Reliability Engineering teams are continuing to troubleshoot and investigate the root cause of the intermittent connectivity issues while keeping the applications operable. We will continue to update this page regularly and appreciate your patience.
Posted Aug 31, 2019 - 10:30 PDT
Update
While FlexNet Operations is currently operable, it is experiencing the following issues with suggested remediation and next steps:

Issues:
* UI sluggishness and general navigation issues, which can be remediated by reloading the page after a few seconds
* Client applications or devices may fail to activate. A retry of the activation call is usually successful.
* File uploads via ESD may fail. Retrying the upload, or using the FTP upload option, is typically effective.

The update to production yesterday provided some diagnostic information that our teams are using to help identify the root cause. We will continue to update this page regularly as the situation develops.
Posted Aug 30, 2019 - 16:16 PDT
Investigating
All teams - Product Engineering, Database Administration, Site Reliability Engineering - continue to research the cause of the intermittent connectivity issues. Simultaneously, processes are in place to minimize impact and maintain stability as much as possible.

We have experienced some brief outages within the last few hours. Our teams are continuing to monitor and investigate.
Posted Aug 30, 2019 - 09:40 PDT
Update
The team has updated the production environment with 2019 R1.SP2.2. While this update has not completely resolved the outstanding issue, it has provided additional diagnostic information. The teams will use this information and are continuing to troubleshoot and investigate the root cause of the intermittent connectivity while keeping the applications operable. We will continue to update this page regularly. Thank you for your patience.
Posted Aug 29, 2019 - 17:03 PDT
Update
All teams are continuing to troubleshoot and investigate the root cause of the intermittent connectivity issues while keeping the applications operable. We will continue to update this page regularly. Thank you for your patience.
Posted Aug 29, 2019 - 08:29 PDT
Update
All teams - Product Engineering, Database Administration, Site Reliability Engineering - continue to research the cause of the intermittent connectivity issues. Simultaneously, processes are in place to minimize impact and maintain stability as much as possible.

The systems seem to have stabilized over the last couple of hours. The teams continue to monitor and investigate.
Posted Aug 28, 2019 - 16:16 PDT
Update
Flexera Engineering and SRE teams are continuing to troubleshoot and investigate the root cause of the intermittent connectivity issues while keeping the applications operable. We will continue to update this page regularly and appreciate your patience.
Posted Aug 28, 2019 - 12:05 PDT
Update
While both FlexNet Operations LLM and ALM are currently operable, they are experiencing the following issues with suggested remediation and next steps:

LLM Issues:
* UI sluggishness and general navigation issues, which can be remediated by reloading the page after a few seconds
* Client applications or devices may fail to activate. A retry of the activation call is usually successful.
* File uploads via ESD may fail. Retrying the upload, or using the FTP upload option, is typically effective.

ALM Issue:
* File uploads via ESD may fail. Retrying the upload, or using the FTP upload option, is typically effective.

All teams - Product Engineering, Database Administration, Site Reliability Engineering - continue to research the root cause of the intermittent connectivity issues and will continue to update this page regularly.
Posted Aug 27, 2019 - 16:26 PDT
Update
The systems have been stable and operational since 11 AM PDT, and the teams continue to monitor them.

All teams - Product Engineering, Database Administration, Site Reliability Engineering - continue to research the root cause of the intermittent connectivity issues and will continue to update this page regularly.
Posted Aug 26, 2019 - 15:59 PDT
Update
All teams - Product Engineering, Database Administration, Site Reliability Engineering - continue to research the cause of the intermittent connectivity issues. Simultaneously, an automated process has been put in place to minimize impact and maintain stability as much as possible.

The systems seem to have stabilized over the last couple of hours. The teams continue to monitor and investigate.
Posted Aug 26, 2019 - 12:03 PDT
Monitoring
All teams - Product Engineering, Database Administration, Site Reliability Engineering - continue to research the cause of the intermittent connectivity issues. Simultaneously, processes are in place to minimize impact and maintain stability as much as possible.

The systems seem to have stabilized over the last couple of hours. The teams continue to monitor and investigate.
Posted Aug 26, 2019 - 06:25 PDT
Update
The Product Development and DBA teams are continuing to troubleshoot. The SRE team is working to keep disruption to a minimum through close monitoring and proactive instance management. By actively clearing stale processes and refreshing instances, we are trying to reduce connection disruptions.
Posted Aug 25, 2019 - 11:04 PDT
Update
Additional configuration changes have been applied in an effort to further stabilize database connectivity.

Root cause and remediation efforts are still underway.
Posted Aug 25, 2019 - 09:58 PDT
Update
Product Engineering has been engaged to further assist in application troubleshooting.
Posted Aug 25, 2019 - 07:56 PDT
Investigating
The changes applied resolved the issue for a period of time, but the degradation has returned. We continue investigations.
Posted Aug 25, 2019 - 06:37 PDT
Monitoring
We have re-routed connections from the application servers to the databases and achieved the desired results. We will continue monitoring to confirm the issue has been resolved.
Posted Aug 25, 2019 - 05:51 PDT
Investigating
We are currently investigating exceptions encountered during the database reboot process and are continuing the investigation into other potential causes.
Posted Aug 25, 2019 - 05:02 PDT
Identified
We have identified a potential cause of the degraded performance on the primary database and are performing an emergency reboot of the system.
Posted Aug 25, 2019 - 03:20 PDT
Update
We have now made some configuration changes in an effort to reduce contention and are monitoring the results. Investigation is ongoing.
Posted Aug 25, 2019 - 02:49 PDT
Update
We have executed a precautionary measure and failed the primary database over to the secondary, and are monitoring to determine whether there is improvement. We continue to investigate the underlying cause.
Posted Aug 25, 2019 - 01:57 PDT
Update
We are continuing to investigate this issue.
Posted Aug 25, 2019 - 00:37 PDT
Investigating
It appears as though the same issue from earlier today has returned. We are experiencing intermittent connectivity issues and investigating.
Posted Aug 24, 2019 - 23:39 PDT
Monitoring
The applications have remained stable since approximately 10:15 AM PDT. We are continuing our monitoring efforts and root cause investigation.
Posted Aug 24, 2019 - 16:06 PDT
Update
The systems have been stable for approximately 40 minutes. We are continuing to monitor and assess the root cause of the issue.
Posted Aug 24, 2019 - 10:49 PDT
Update
The High Availability components have been restarted and appear to be functioning correctly. We are continuing to investigate root cause and remediation approach.
Posted Aug 24, 2019 - 10:19 PDT
Update
We are continuing to investigate and troubleshoot the issues. The Network Engineering team has verified there are no issues related to network functionality. The DBA team is working through High Availability components to determine if there are any issues leading to disruption.
Posted Aug 24, 2019 - 09:39 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2019 - 08:25 PDT
Update
We are experiencing intermittent connectivity issues in the LLM PROD applications. We are currently investigating the cause of the issue. It appears to be related to application connectivity to the database.
Posted Aug 24, 2019 - 07:35 PDT
Update
The issue appears to be related to database connectivity from the applications. We are investigating the proper course of action.
Posted Aug 24, 2019 - 06:52 PDT
Update
We are continuing to investigate the issue from both the application and database perspectives.
Posted Aug 24, 2019 - 06:06 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2019 - 05:25 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2019 - 04:50 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2019 - 04:16 PDT
Update
We are continuing to investigate this issue.
Posted Aug 24, 2019 - 03:38 PDT
Investigating
We are experiencing intermittent connectivity issues in the LLM PROD applications. We are currently investigating the cause of the issue.
Posted Aug 24, 2019 - 03:02 PDT