Root cause analysis of infrastructure issues

How does root cause analysis work?

One of the most powerful Dynatrace features is the ability to correlate events across time, user actions, and varying monitoring perspectives. Dynatrace not only monitors the infrastructure (hosts, networking, and virtualization) that supports your application’s services; it also correlates the actions of individual users with the server-side services that support your web application. By correlating events across these monitoring perspectives, Dynatrace can pinpoint the root cause of problems in your application-delivery chain.

Dynatrace root-cause analysis works across both physical and virtualized infrastructure components.

How does Dynatrace identify the root cause of problems?

Once Dynatrace identifies a problem in one of your web application’s infrastructure components, it looks for correlations between the problem and other events that took place around the same time, for example performance degradation on one of your users’ mobile devices. Problems are seldom one-time events; they usually appear in regular patterns and are often shown to be symptoms of larger infrastructure issues. If any other transactions that used the same components also experienced problems, then those transactions will be factored into the problem’s root-cause analysis.
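
The idea of correlating a problem with events that touched the same component around the same time can be illustrated with a minimal sketch. This is not Dynatrace’s actual algorithm, which is not described here; the Event type, the correlated_events function, the 300-second window, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical monitored event; the field names are assumptions."""
    name: str
    component: str
    timestamp: float  # seconds since an arbitrary epoch


def correlated_events(problem, events, window_seconds=300.0):
    """Return events on the same component that occurred within
    window_seconds of the problem (before or after it)."""
    return [
        e for e in events
        if e is not problem
        and e.component == problem.component
        and abs(e.timestamp - problem.timestamp) <= window_seconds
    ]


# Illustrative data only; not taken from a real environment.
problem = Event("CPU saturation", "host-1", 1000.0)
events = [
    problem,
    Event("User action duration degradation", "host-1", 1120.0),
    Event("Slow response", "host-2", 1010.0),       # different component
    Event("Failed transaction", "host-1", 5000.0),  # outside the time window
]

related = correlated_events(problem, events)
print([e.name for e in related])
```

In this sketch only the user action degradation is treated as related, because it shares the component and falls inside the time window. A real correlation engine would also follow dependencies between components rather than require an exact match.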

When Dynatrace detects a correlation between an infrastructure problem and other monitored events or transactions, it presents you with details of the correlation and the related root cause analysis.

Dynatrace identifies a problem that affects 3 applications and 190 real users. Root cause is CPU saturation on the infrastructure level.
Problem detail page. Click the visual resolution path image for an expanded view.
Expanded visual resolution path view with Problem Evolution replay feature.

How can I find out which processes/components contributed to an infrastructure problem?

From any Problem detail page, click the Analyze root cause button to see the relevant Host page in the context of the problem you’re analyzing. Here you’ll see health details of the host where the infrastructure problem occurred.

Click the Consuming processes button at the bottom of any Host page to view a list of the processes that contributed to the health of the host.

Host page in the context of a problem. Note that extreme CPU consumption on the host has resulted in User action duration degradation problems.

After clicking the Consuming processes button, you’ll be presented with the list of processes or components that contributed to this infrastructure problem. Click a specific process or component to drill into the details for further analysis.
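
Conceptually, a consuming-processes list is a ranking of processes by how much of a resource each one used. The sketch below shows that ranking under stated assumptions: the top_consumers function, the cpu_percent field, and the sample values are hypothetical, and it does not call any Dynatrace API.

```python
def top_consumers(processes, metric="cpu_percent", limit=3):
    """Return the `limit` processes with the highest value for `metric`,
    highest first."""
    ranked = sorted(processes, key=lambda p: p[metric], reverse=True)
    return ranked[:limit]


# Illustrative per-process measurements; not real monitoring output.
processes = [
    {"name": "java", "cpu_percent": 71.5},
    {"name": "nginx", "cpu_percent": 4.2},
    {"name": "postgres", "cpu_percent": 18.9},
    {"name": "sshd", "cpu_percent": 0.1},
]

top = top_consumers(processes, limit=2)
for p in top:
    print(f"{p['name']}: {p['cpu_percent']}% CPU")
```

The process at the top of such a ranking is the natural first candidate to drill into when the host-level problem is CPU saturation.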