dawid_rampalski, Dynatrace Advisor

Summary

This article provides insights into the Metric Event engine and offers tips for setting up anomaly detection rules that are consistent and stable based on your data. Both Metric Events and Davis Anomaly Detectors use similar principles, so this guide will be helpful for both applications.

For basic information about these anomaly detection functionalities, please visit our documentation:

 

Understanding the engine

To understand what to expect from the alert, let's first dive into how Dynatrace polls data based on the configuration. In the two screenshots below, you can see the most important parameters for our alerting engines:

Screenshots: Metric Event and Davis Anomaly Detector configurations

Sliding window

Let's start with the sliding window, since it gives context for all the other parameters. The sliding window defines the timeframe of every query used to fetch data points and operates on a "Last X minutes" basis. Setting it to 5, as in the example above, means the engine polls the last 5 minutes of data every minute. Anomaly detection engines expect a data resolution of 1 minute, so a 5-minute poll is expected to return 5 data samples. Each sample is then compared against the threshold.
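
As a minimal sketch, the following DQL query reproduces what a single execution with a 5-minute sliding window sees (dt.host.cpu.usage and the split by host are placeholders for whatever metric you alert on):

timeseries max(dt.host.cpu.usage), by:{dt.entity.host}, interval:1m, from:now()-5m

Each series returned by this query should contain 5 one-minute samples, which is exactly the set of samples the engine compares against the threshold.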

 

Threshold

Whether it's a static threshold or one based on automated baselines, its main job is to sort the received samples into two buckets: violating or dealerting. Each received sample will be compared against the threshold individually.

 

Alert on missing data

It may happen that, when requesting the last X minutes of data, the number of samples returned does not match the expected count. This flag decides how those missing samples are treated.

When enabled, missing data samples will be treated as violating samples.

When disabled, missing data is not treated as a violation but will still contribute to dealerting.

 

Violating samples

Defines how many violating samples in a given execution are required to raise a problem.

 

Dealerting samples

Defines how many dealerting samples in a given execution are required to close an active problem.
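
To put the parameters together with hypothetical values: with a 5-minute sliding window, 3 violating samples, and 5 dealerting samples, every execution evaluates 5 one-minute samples; a problem is raised as soon as at least 3 of them violate the threshold, and an open problem is closed only once all 5 samples are on the dealerting side.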

 

Understanding the results

Sometimes, after setting up anomaly detection configuration, alerts behave in unexpected ways. This chapter provides methods to identify the root cause of such behavior.

 

Query your data

The best way to understand how the configuration behaves is to understand the data. Here are some tips on how to view your metric in a way that closely resembles what the anomaly detection engine sees.

  • set the timeframe to be as long as your sliding window
  • use a resolution of 1m in Data explorer / interval:1m in DQL
  • try to fetch the most recent data to check for latency
  • check dt.sfm.server.metrics.latencies (DQL) / dsfm:server.metrics.latencies (Data explorer) during the unexpected alerts

Now let's explore some examples to visualize the indicators: 

One of the typical issues is alerting on missing data where the metric is ingested with timestamps in the past. It could look like this:

timeseries max(dt.cloud.aws.alb.connections.active), by:{aws.resource.name}, interval:1m
| filter aws.resource.name == "easytravel-angular-large-live"

 

Screenshot: the queried metric shows a gap in the most recent minutes

If the metric latency is consistent, a gap like this can be seen when querying the last X minutes of the metric.

If no gap is present in the fresh data, or the data is sparse in general, it's worth checking our latency metrics for the specific timeframe where the issue appeared. An example for a metric like the one above:

timeseries { max(dt.sfm.server.metrics.latencies) }, by:{metric_key}, filter:contains(metric_key,"cloud.aws.alb.connections.active")

For Data explorer: 

dsfm:server.metrics.latencies:filter(contains("metric_key","cloud.aws.alb.connections.active")):max

While these metrics do not allow filtering on specific metric dimensions, using the max aggregation makes it possible to see the highest latency reported at a given time and estimate whether it could have affected problem generation.
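
To focus on the timeframe of the unexpected alert, the same query can be scoped explicitly; the relative timeframe below is only an illustration and should be adjusted to when your alert fired:

timeseries { max(dt.sfm.server.metrics.latencies) }, by:{metric_key}, filter:contains(metric_key,"cloud.aws.alb.connections.active"), from:now()-2h, to:now()-1h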

Solution

Both Metric Events and Davis Anomaly Detectors have an "offset" parameter that can be defined to tell the engine to move the sliding window into the past.

Screenshots: offset setting in Davis Anomaly Detection and Metric Event configurations

Here is a simple visualization of the "offset":

 

Effects of the offset on the sliding window
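
As a rough sketch of what this means for the earlier ALB example: with a 5-minute sliding window and a 10-minute offset, each execution evaluates data that is between 15 and 10 minutes old, which can be approximated with a query like this (the timeframe values are purely illustrative):

timeseries max(dt.cloud.aws.alb.connections.active), by:{aws.resource.name}, interval:1m, from:now()-15m, to:now()-10m
| filter aws.resource.name == "easytravel-angular-large-live"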

 

Check for unstable configurations

With larger sliding windows, it is possible to set up a configuration where both the violating and the dealerting condition are met at the same time.

Unstable configuration

Here is an example of a 10/30/10 (violating/sliding window/dealerting) configuration. Within the 30-minute sliding window, you can get 10 violating samples, 10 dealerting samples, and 10 other random samples. The engine receives both signals to open and close the problem at the same time, which can lead to unexpected behavior where problems open and close randomly.

Solution

If your problems behave erratically, always check for unstable configurations. A stable configuration in this case would require at least 21 dealerting samples: 10/30/21. If there are 21 dealerting samples, then there is no room for an additional 10 violating samples within the sliding window.
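
More generally (this condition simply follows from the example above), a configuration avoids this overlap whenever the two sample counts cannot fit into the window at the same time:

violating samples + dealerting samples > sliding window

For a 10/30/x setup this means at least 21 dealerting samples, since 10 + 21 = 31 > 30.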

 

Complex queries

Most issues can be solved using the two approaches above, but when the query is complex, it's not as straightforward. While it's not possible to provide solutions for all potential issues, here is a set of best practices that should help untangle most complex scenarios:

  • Try to divide your query into multiple smaller parts. Pick the base and start adding your modifiers one by one to see how they affect the results (see the sketch after this list)
  • When using multiple metrics, try adding them sequentially to observe their impact on the result
  • Keep the number of resulting dimensions as low as possible
  • In DQL, try to avoid scalar value calculations; these add unique dimensions to your alerts and, if conditions are met, will cause a fresh problem to be created each time they are calculated
  • In DQL, try to avoid arrayMovingXXX operations, as they can lead to unexpected results. If there is no alternative, always use sliding window timeframes to investigate the query output. The resulting values may vary significantly from one execution to the next, depending on the data points within the window
  • Try to avoid default and nonempty operators. Instead, use "Alert on missing data" to your advantage
  • Do not use limit and sort operators
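
Here is a minimal sketch of the first point, using dt.host.cpu.usage and a made-up host entity ID purely as placeholders: start from the bare metric, then add one modifier at a time and compare the 1-minute output after each step.

// step 1: base metric only
timeseries max(dt.host.cpu.usage), interval:1m

// step 2: add the split by host and the filter, then compare the output with step 1
timeseries max(dt.host.cpu.usage), by:{dt.entity.host}, interval:1m
| filter dt.entity.host == "HOST-ABC123DEF4567890"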

 

What's next

If this article was not enough to determine what happened with your alert, please open a support ticket with the following details:

  • Description of the issue you are facing with the anomaly detection. If possible, share timeframes (and time zones) where you expect a specific outcome - especially important for cases where the problem was not generated
  • Link to your Metric event / Davis Anomaly Detector config
  • Link to the specific problem as an example of the issue
  • Screenshots of the underlying data with 1m resolution - in case remote access is not possible
  • Notebook share link with results of investigation, where possible