How are new problems evaluated and raised?

Dynatrace continuously measures incoming traffic levels against defined thresholds to determine when a detected slowdown or error-rate increase justifies the generation of a new problem event. Rapid response-time degradations of applications and services are evaluated over sliding 5-minute time intervals, while slowly developing degradations are evaluated over 15-minute time intervals.
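The sliding-interval evaluation described above can be pictured as a rolling check over the most recent samples. The following is a minimal sketch, assuming a simple average over the window and an illustrative 2x degradation factor; the function and parameter names are hypothetical, not part of any Dynatrace API:

```python
from collections import deque

def sliding_window_degraded(samples_ms, window, baseline_ms, factor=2.0):
    """Return True if the average response time over the most recent
    `window` samples exceeds the baseline by `factor`.

    Toy sketch of sliding-interval evaluation; the averaging and the
    2x factor are illustrative assumptions, not Dynatrace's actual
    statistics.
    """
    recent = deque(samples_ms, maxlen=window)  # keep only the newest samples
    if len(recent) < window:
        return False  # not enough data for a full interval yet
    return (sum(recent) / window) > baseline_ms * factor
```

A rapid degradation would be evaluated with a short window (the 5-minute interval), a slow degradation with a longer one (the 15-minute interval).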

Understanding thresholds

Dynatrace utilizes three types of thresholds:

  • Automatic baselines: Multidimensional baselining automatically detects individual reference values that adapt over time. These reference values cope with dynamic changes in your application and service response times, error rates, and load.
  • Built-in static thresholds: Dynatrace uses built-in static thresholds for all infrastructure events (for example, detecting high CPU, low disk space, or low memory).
  • User-defined static thresholds: With customizable anomaly detection settings (available at Settings > Anomaly detection), you can overwrite the default static thresholds for infrastructure events. You can also switch from automatic baselining for application and service anomaly detection to static thresholds. With static thresholds, the detected baseline thresholds are overwritten by your custom static thresholds for individual dimensions.

The methodology used for raising events with automatic baselining is completely different from that used for static thresholds. The following sections provide details about both methods:

Automatic baselining

Dynatrace uses automatic baselining to learn the typical reference values of application and service response times, error rates, and load.

With respect to response times, Dynatrace collects references for the median (above which are the slowest 50% of all callers) and the 90th percentile (the slowest 10% of all callers). A slowdown event is raised if the typical response time for either the median or the 90th percentile degrades.
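The dual median/90th-percentile check can be sketched as follows. This is a rough illustration, assuming a nearest-rank percentile and a hypothetical 1.5x degradation factor; the names are invented for the example:

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile of a sample (illustrative helper)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def slowdown_detected(response_times_ms, ref_median, ref_p90, factor=1.5):
    """Flag a slowdown if either the observed median or the observed
    90th percentile degrades past its learned reference value.
    The 1.5x factor is an illustrative assumption."""
    current_median = statistics.median(response_times_ms)
    current_p90 = percentile(response_times_ms, 90)
    return current_median > ref_median * factor or current_p90 > ref_p90 * factor
```

Checking both percentiles catches slowdowns that hit only a slow tail of callers as well as slowdowns that affect the typical caller.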

Application baselining calculates reference values for 4 different dimensions:

  • User action: An application's user action (e.g., orange.jsf, login.jsp, logout, or specialOffers.jsp).
  • Geolocation: Hierarchically organized list of geolocations where user sessions originate from. Geolocations are organized into continents, countries, regions, and cities.
  • Browser: Hierarchically organized list of browser families, such as Firefox and Chrome. The topmost categories are the browser families. These are followed by the browser versions within each browser family.
  • Operating system: Hierarchically organized list of operating systems, such as Windows and Linux. The topmost categories are the operating systems. These are followed by the individual OS versions.

Service baselining calculates a reference value for the Service method dimension:

  • Service method: A service's individual service methods (e.g., getBookingPage or getReportPage).

In the case of database services, the service method represents the individual SQL statements that are queried (e.g., `call verify_location(?)` or `select booking0_.id from Booking booking0_ where booking0_.user_name<>?`). A reference value is additionally calculated for the predefined service method groups "static requests" and "dynamic requests".

For database services, a reference value is calculated for each of the predefined service method groups "inserts and updates" and "selects".

Automatic baselining attempts to figure out the best reference values for incoming application and service traffic. To do this, Dynatrace automatically generates a baseline cube for your actual incoming application and service traffic. This means that if your traffic comes mainly from New York, and most of your users use the Chrome browser, your baseline cube will contain the following reference values:

`USA – New York – Chrome – Reference response time: 2 s, error rate: 0%, load: 2 actions/min`

If your application also receives traffic from Beijing, but with a completely different response time, the baseline cube will automatically adapt and thereafter contain the following reference values:

`USA – New York – Chrome – Reference response time: 2 s, error rate: 0%, load: 2 actions/min`
`China – Beijing – QQ Browser – Reference response time: 4 s, error rate: 1%, load: 1 action/min`

The baseline cube is calculated two hours after your application or service is first detected by Dynatrace OneAgent, so that Dynatrace can analyze two hours of actual traffic to calculate preliminary reference values and identify where your traffic comes from. The calculation is repeated every day so that Dynatrace can continue to adapt to changes in your traffic.
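The baseline cube described above can be pictured as a mapping from dimension combinations to learned reference values. The sketch below mirrors the two example entries; the dictionary structure and lookup function are illustrative assumptions, not an actual Dynatrace data model:

```python
# Toy model of a baseline cube: reference values keyed by a
# (geolocation, browser) dimension tuple. Values mirror the examples
# above ("apm" = actions per minute).
baseline_cube = {
    ("USA - New York", "Chrome"):      {"response_time_s": 2, "error_rate": 0.00, "load_apm": 2},
    ("China - Beijing", "QQ Browser"): {"response_time_s": 4, "error_rate": 0.01, "load_apm": 1},
}

def reference_for(geolocation, browser):
    """Look up the learned reference values for one slice of traffic."""
    return baseline_cube.get((geolocation, browser))
```

Because references are kept per dimension combination, the slower Beijing traffic does not inflate the expected response time for New York users.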

To avoid over-alerting and reduce notification noise, the automatic anomaly-detection modes don't alert on fluctuating applications and services that haven't run for at least 20% of a full week.

Alerting on response time degradations and error rate increases begins once the baseline cube is ready and the application or service has run for at least 20% of a week.

Dynatrace application traffic anomaly detection is based on the assumption that most business traffic follows predictable daily and weekly traffic patterns. Dynatrace automatically learns each application's unique traffic patterns. Alerting on traffic spikes and drops begins after a learning period of one week because baselining requires a full week's worth of traffic to learn daily and weekly patterns.

Following the learning period, Dynatrace forecasts the next week’s traffic and then compares the actual incoming application traffic with the prediction. If Dynatrace detects a deviation from forecasted traffic levels that falls outside of reasonable statistical variation, Dynatrace raises either an Unexpected low traffic or an Unexpected high traffic problem.
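The forecast comparison can be sketched as a simple band check. In this illustration a symmetric 50% tolerance stands in for Dynatrace's statistical variation bounds, which are not published here; names and the tolerance value are assumptions:

```python
def traffic_anomaly(forecast_apm, actual_apm, tolerance=0.5):
    """Compare actual traffic (actions/min) against the forecast and
    flag deviations outside a tolerated band. Returns 'low', 'high',
    or None. The 50% band is purely illustrative.
    """
    if actual_apm < forecast_apm * (1 - tolerance):
        return "low"   # would raise an Unexpected low traffic problem
    if actual_apm > forecast_apm * (1 + tolerance):
        return "high"  # would raise an Unexpected high traffic problem
    return None        # within reasonable statistical variation
```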

Advantages of automatic baselining:

  • Works out of the box without manual configuration of thresholds.
  • No manual effort required to set specific thresholds for geolocations, browsers, etc.
  • Adapts automatically to changes in traffic patterns.

Disadvantages of automatic baselining:

  • Requires a learning period within which Dynatrace learns normal traffic patterns.

Summary

  • Baselines are evaluated within 5-min and 15-min sliding time intervals.
  • Automatic detection of reference values for response times, error rates, and load.
  • A combination of 4 dimensions for applications and 1 dimension for services.
  • Baseline cube calculation is initially performed 2 hours after your application or service is first detected by Dynatrace, and thereafter on a daily basis.
  • Applications and services have to run for at least 20% of a week before slowdown and error rate alerts are raised.
  • Applications have to run for at least a full week before traffic spike and drop alerts are raised.
  • Slowdown events are detected for the median and 90th percentile.

Static thresholds

Dynatrace infrastructure monitoring is based on numerous built-in, predefined static thresholds. These thresholds relate to resource contentions like CPU spikes, memory, and disk usage. You can change these default thresholds by navigating to Settings > Anomaly detection > Infrastructure.

For applications and services, you can disable automatic baselining-based reference-value detection anytime and switch to user-defined static thresholds. If you set a static threshold for response time and error rate on an application or service level, events will be raised if the static threshold is breached. A slowdown event is raised if the static thresholds for either the median or the 90th percentile response times are breached.
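A static-threshold breach check is straightforward compared to baselining: there is no learning, just a fixed comparison per percentile. A minimal sketch, with hypothetical names for the user-defined thresholds:

```python
def static_threshold_events(current_median_ms, current_p90_ms,
                            threshold_median_ms, threshold_p90_ms):
    """Return which response-time percentiles breached their
    user-defined static thresholds; a non-empty result corresponds
    to raising a slowdown event. Names are illustrative assumptions."""
    breached = []
    if current_median_ms > threshold_median_ms:
        breached.append("median")
    if current_p90_ms > threshold_p90_ms:
        breached.append("p90")
    return breached
```

Note there are no learned reference values here, which is why static thresholds alert immediately but also why they must be maintained by hand as the environment changes.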

Advantages of static thresholds:

  • Begins to alert immediately without a learning period.

Disadvantages of static thresholds:

  • Significant manual effort is required to set static thresholds for each service method or user action.
  • It can be challenging to set static thresholds for dynamic services.
  • They don't adapt to changing environments.

Summary

  • Infrastructure monitoring is built upon predefined static thresholds for numerous metrics.
  • Immediately begins to alert on static thresholds without a learning period.
  • Events are raised for threshold breaches of the median and 90th percentiles.