How do I adjust the sensitivity of problem detection?

One key feature of Dynatrace is its ability to continuously monitor every aspect of your applications, services, and infrastructure and to automatically learn all the baseline metrics related to the performance of these components. Dynatrace automatically learns the baseline response times of your applications and services, factoring in variables such as geo-location, browser type, operating system, connection bandwidth, and user actions.

Such multidimensional reference values (for example, all users from New York viewing your application with a Chrome browser on Windows) are collected for all statistically meaningful combinations of these factors. This intelligent, automatic baselining allows Dynatrace to detect anomalies at a highly granular level and to notify you of problems in real time. Typical application and service-level anomalies reported by Dynatrace include failure rate increases, response time degradations, and spikes or drops in application traffic.

On top of this automatic learning of reference values, Dynatrace allows you to define thresholds that specify how far performance must deviate from the baseline before a problem alert is generated. Keep in mind that these threshold settings only adjust the levels at which Dynatrace alerts you to detected anomalies. They don’t affect automatic performance baselining.

There are some use cases in which adjusting these alerting thresholds may be beneficial:

  • Setting higher thresholds for applications and services that are still in development or are in the testing stage.
  • Setting lower thresholds for mission-critical services within your infrastructure (where default thresholds may be too tolerant).

Dynatrace distinguishes between an absolute threshold and a relative threshold, and applies each to both the median and the slowest 10 percent of a given metric. As you can see in the example below, the median thresholds for response time degradation are set to 100 ms (absolute) and 50% (relative) above the auto-learned baseline. The thresholds for the slowest 10% of requests are set to 1,000 ms (absolute) and 10% (relative) above the auto-learned baseline.

Dynatrace anomaly detection threshold settings also allow you to specify how many actions per minute must be observed before Dynatrace sends out problem alerts related to anomalies. This setting allows you to disable alerting for low-traffic applications and services, where baselining and alerting often lead to unnecessary alerts.
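To make the interplay of these settings concrete, here is a minimal Python sketch. It is not Dynatrace code: the class and function names are hypothetical, the minimum of 10 actions per minute is an assumed value, and the rule that a deviation counts only when both the absolute and the relative threshold are exceeded is an assumption that the text above does not state.

```python
from dataclasses import dataclass


@dataclass
class DegradationThresholds:
    # Hypothetical container; the first four values mirror the example above.
    median_abs_ms: float = 100.0
    median_rel_pct: float = 50.0
    slowest10_abs_ms: float = 1000.0
    slowest10_rel_pct: float = 10.0
    min_actions_per_min: float = 10.0  # assumed value, not from the text


def exceeds(observed_ms, baseline_ms, abs_ms, rel_pct):
    """Assumption: BOTH the absolute and the relative threshold must be exceeded."""
    delta = observed_ms - baseline_ms
    return delta > abs_ms and delta > baseline_ms * rel_pct / 100.0


def should_alert(t, actions_per_min, median_ms, median_base_ms,
                 slowest10_ms, slowest10_base_ms):
    # Low-traffic entities never alert, whatever the deviation.
    if actions_per_min < t.min_actions_per_min:
        return False
    return (
        exceeds(median_ms, median_base_ms, t.median_abs_ms, t.median_rel_pct)
        or exceeds(slowest10_ms, slowest10_base_ms,
                   t.slowest10_abs_ms, t.slowest10_rel_pct)
    )
```

Under these assumptions, a median of 320 ms against a 200 ms baseline would alert (120 ms exceeds both 100 ms and 50% of 200 ms), while the same deviation against a 400 ms baseline would not, because 120 ms is below 50% of 400 ms.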

Dynatrace offers anomaly detection thresholds for three types of anomalies: action duration degradation, load spikes and drops, and increases in failure rate, as shown below:

As an alternative to defining thresholds globally across your entire environment, you can disable global settings and instead fine-tune threshold settings for individual applications and services using the application- and service-specific settings pages (see Application setup settings below). To do this, set the Use global anomaly detection settings switch to the Off position and enter your custom settings. You can reverse this action at any time to return to the globally defined thresholds.

Custom service thresholds work the same way, except that Dynatrace doesn’t alert you to traffic spikes/drops for services.
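The relationship between global and custom settings can be pictured with a short Python sketch. Everything here is illustrative: the function name, the dictionary keys, and the idea of representing settings as dictionaries are assumptions made for this example, not a Dynatrace API.

```python
def effective_settings(entity_kind, use_global, global_settings, custom_settings):
    """Return the settings that apply to one application or service.

    entity_kind: "application" or "service" (hypothetical labels).
    use_global:  state of the "Use global anomaly detection settings" switch.
    """
    chosen = global_settings if use_global else custom_settings
    settings = dict(chosen)  # copy so the caller's dictionaries stay unchanged
    if entity_kind == "service":
        # Services are not alerted on traffic spikes or drops.
        settings.pop("traffic_spikes_and_drops", None)
    return settings
```

Switching `use_global` back to `True` in this sketch corresponds to returning to the globally defined thresholds.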

Override auto-baselines with fixed thresholds

In addition to automatically detecting all your applications, services, and running processes, Dynatrace also monitors your development and testing services, including build processes such as Jenkins. When Dynatrace isn’t able to collect enough statistically relevant data for such services, automatic baselining isn’t the best approach to anomaly detection. For situations like these, where your development team knows the acceptable limits better, Dynatrace provides fixed thresholds. Fixed thresholds allow you to overrule Dynatrace’s smart multidimensional baselining by setting hard limits on response times and error rates that are not to be exceeded. You can specify fixed thresholds for services and applications at the global level or for specific application and service instances.

To enable fixed thresholds for anomaly detection in your applications globally, go to Settings > Anomaly detection > Applications and select using fixed thresholds from the Detect action duration degradations droplist. 

To specify an upper limit of 3 seconds for a specific application response time:

  1. Click the Applications tile on your Dynatrace homepage.
  2. Select the application you want to edit. 
  3. Click Edit in the menu bar. 
    The Application setup settings page appears. 
  4. Select Anomaly detection.
  5. Select using fixed thresholds from the Detect action duration degradations droplist. 
  6. Type a value in milliseconds (3000 for this example) in the Alert if the action duration degrades to… field.

In this example, if the application has a response time higher than 3 seconds, Dynatrace will generate a problem as usual, but the detected impact of the problem will reflect this fixed threshold and the amount by which the threshold was exceeded. In the example problem shown below, the application’s fixed threshold of 3 seconds was exceeded by a response time of 16.9 seconds.

Note: To avoid confusion as to why Dynatrace has generated a given problem, check if the problem was triggered based on one of your fixed thresholds.
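As a rough illustration of the arithmetic behind the example above, the following Python sketch compares a response time with a fixed threshold and reports the amount by which it was exceeded. The function name is hypothetical, and the sketch only models the comparison; it says nothing about how Dynatrace evaluates problems internally.

```python
def fixed_threshold_excess_ms(response_time_ms, limit_ms):
    """Return how far the response time exceeds the fixed limit, or None if it doesn't."""
    if response_time_ms <= limit_ms:
        return None
    return response_time_ms - limit_ms


# Values from the example: a 3-second limit and a 16.9-second response time.
excess = fixed_threshold_excess_ms(16_900, 3_000)
print(excess)  # 13900, that is, 13.9 seconds over the limit
```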