Virtual machine migration as root cause

In this use case, we’ll walk you through a situation in which a typical web application (the sample web application in this case) suffers from response time degradation. The application comprises a number of services that are served by multiple Tomcat servers running on a virtualized Windows machine.

By following the Dynatrace workflow, you’ll see which of the components serving the application were affected by this response time degradation incident. More importantly, you’ll understand the root cause of the problem, see how the problem evolved over time, and know how to ensure that the problem doesn’t recur in the future.

The root cause of the problem in this example is virtual machine migration (vMotion). vMotion events can lead to ESXi CPU exhaustion. As a result, one or more of the virtual machines running on the same ESXi host, although not overloaded themselves, are unable to perform their tasks. This is a typical CPU resource contention scenario in which CPU ready time measurements play a key role. 

Problem first appears

It’s 2:22 PM. You look at your home dashboard and see that a problem alert has appeared on the Problems tile. The Applications, Hosts, and VMware tiles signal which elements of your environment are affected by this problem. You can tell at a glance that this problem, which began on the infrastructure level, now affects 5 hosts, at least 2 services (front-end and database), and 1 application.

Tile counters change from white to red on the Dynatrace homepage when problems are detected
How problems evolve and merge

CPU resource contention incidents caused by vMotion tend to evolve gradually.

Depending on the duration of each incident, you may initially see several seemingly unrelated problems that are eventually consolidated into a single problem (the number of alerts may vary; you might see 5 or even 8 separate alerts). Usually, hosts on the infrastructure level are affected first, followed by server-side services. Service performance degradation in turn leads to application slowdown.

Dynatrace correlates all such incidents with vMotion events that were recorded around the time the incidents occurred. Once the vMotion events are identified as the root cause, Dynatrace merges all related incidents into a single problem. For more information on problem evolution, see What is the life cycle of a problem?

Begin your investigation

Click the Problems tile to see the Problems page. In this case, there is a response time degradation problem with the application. This is serious because it means that real users are experiencing this degradation.

Active problems are highlighted in red at the top of the Problems page. By looking at the count distribution of incidents on the left, you can see all the applications, services, and hosts that are affected by this problem.

The duration of the problem is relatively short: only 14 minutes in this case. By correlating the affected components with the time frame and duration of the incidents that occurred around that time, Dynatrace determines that all of these incidents are part of the same problem.

Don’t stop here, though. Dig deeper by clicking the problem card. You’ll then see how complex the problem is and what it impacts. This will help you decide how to ensure that this problem doesn’t appear again in the future.

Here you can see that over 150 user actions per minute are affected by this problem (actually 159/min). Flash points extend to server-side services (7 affected) and infrastructure (6 components affected).

Gain insight into the problem

Click Analyze root cause to see health indicators related to the ESXi host where this problem originates. The 100% CPU call-out confirms that you’re dealing with a CPU saturation problem on this particular component (an ESXi host). 

High CPU consumption on the ESXi host is highlighted here.

CPU usage on this ESXi host has reached 100%. A quick look at the Events section reveals what triggered this high CPU consumption—around the same time that Dynatrace identified this problem (2:28 PM), a virtual machine called ‘cpu-3-m5’ was migrated to this ESXi host.

Click Consuming virtual machines below the chart to confirm that the recently migrated machine really caused this problem.

Understand the root cause

Now we can confirm that it is in fact the virtual machine ‘eT-m5-Win7-64bit’ that is experiencing difficulties. The CPU Ready time measurement for this virtual machine is extremely high.

Now you can see that all the CPU resources of this ESXi host are being consumed by virtual machines other than ‘eT-m5-Win7-64bit’.

The CPU Ready time measurement of 80% tells you that virtual machine ‘eT-m5-Win7-64bit’ spends most of its time waiting for the hypervisor to assign it CPU cycles. The other virtual machines on this ESXi host consume nearly 96% of the host’s available CPU, leaving virtual machine ‘eT-m5-Win7-64bit’ with only 2.33% of the host’s CPU capacity—not nearly enough to enable it to perform its tasks.
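To see how these percentages fit together, here is a minimal arithmetic sketch. The host capacity and per-VM usage figures below are hypothetical assumptions chosen to reproduce the percentages above; they are not values reported by Dynatrace:

```python
# Hypothetical ESXi host: 8 physical cores at 2.4 GHz each (assumed figures).
host_capacity_mhz = 8 * 2400  # 19,200 MHz total

# Hypothetical per-VM CPU usage in MHz (only the VM names come from this example).
vm_usage_mhz = {
    "cpu-3-m5": 9_000,        # the recently migrated VM
    "other-vms": 9_400,       # the remaining VMs on this host, combined
    "eT-m5-Win7-64bit": 447,  # the starved VM running the web application
}

# Each VM's share of the host's total CPU capacity, in percent.
share_pct = {vm: mhz / host_capacity_mhz * 100 for vm, mhz in vm_usage_mhz.items()}

print(round(share_pct["eT-m5-Win7-64bit"], 2))  # -> 2.33
print(round(share_pct["cpu-3-m5"] + share_pct["other-vms"], 1))  # -> 95.8 (nearly 96%)
```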

CPU Ready time is the root cause

A CPU Ready time measurement above 5% is a symptom of CPU contention taking place on an ESXi host; a measurement above 10% indicates a serious problem. High Ready time means that the host doesn’t have enough resources to satisfy the demand of all the virtual machines it hosts. As a result, the virtual machines compete for CPU cycles, and the machines that lose this competition can’t perform their jobs.
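As a back-of-the-envelope check: vCenter reports CPU ready time as milliseconds accumulated per sampling interval, and the commonly used conversion to a percentage, combined with the thresholds above, can be sketched as follows (the function names and the 20-second real-time sampling interval are illustrative assumptions, not part of the Dynatrace product):

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready summation (milliseconds accumulated over one
    sampling interval) into a percentage of that interval. For multi-vCPU
    machines the summation covers all vCPUs, so it is averaged per vCPU."""
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0

def classify(ready_pct: float) -> str:
    """Apply the rule-of-thumb thresholds: above 5% is a contention
    symptom, above 10% is a serious problem."""
    if ready_pct > 10.0:
        return "serious"
    if ready_pct > 5.0:
        return "symptom"
    return "ok"

# 16,000 ms of ready time in a 20 s interval on a single-vCPU VM -> 80%.
pct = cpu_ready_percent(16_000)
print(pct, classify(pct))  # 80.0 serious
```

An 80% figure like the one in this example therefore means the VM spent four fifths of each sampling interval waiting for CPU, far beyond the serious-problem threshold.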

Note that if you were to monitor only the load of this embattled virtual machine, you wouldn’t detect any significant issue. This is because the problem originates with the ESXi host, which is unable to provide adequate processing cycles to the virtual machine.

Understand the dependencies

Still not 100% convinced that a lack of resources for virtual machine ‘eT-m5-Win7-64bit’ is the root cause of this problem (application response time degradation)? It’s time to investigate the dependencies.

Return to the Problem details page and click the graph showing component dependencies to view how the problem evolved.

The graph, which shows the dependencies between all affected entities, links to a special player that lets you replay how the root cause of the problem was identified.

Now you can see that all the virtual machines involved in this problem reside on the same ESXi host. And the most adversely affected virtual machine is responsible for running the web application.

We can now say with confidence that we know the root cause of this problem.

This problem has dependencies spanning all tiers. You can follow how the problem evolved over time using the special player.

Prevent the recurrence of this problem

How can you safeguard against this problem occurring again in the future?

  • Move the virtual machine that runs your application to a different ESXi host where it will have sufficient resources.
  • Adjust your resource pool policy so that in the future the virtual machine that runs your application receives priority access to resources.