This article provides insights into the Metric Event engine and offers tips for setting up anomaly detection rules that are consistent and stable based on your data. Both Metric Events and Davis Anomaly Detectors use similar principles, so this guide will be helpful for both applications.
For basic information about these anomaly detection features, please refer to our documentation.
To understand what to expect from an alert, let's first dive into how Dynatrace polls data based on the configuration. The two screenshots below show the most important parameters for our alerting engines:
Let's start with the sliding window, as it gives context for all the other parameters. The sliding window defines the timeframe of every query used to fetch data points and operates on a "last X minutes" basis. Setting it to 5, as in the example above, makes the engine poll for the last 5 minutes of data every minute. The anomaly detection engines expect a data resolution of 1 minute, so a query for the last 5 minutes is expected to return 5 data samples. Each sample is then compared against the threshold.
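To make that polling cadence concrete, here is a minimal Python sketch of how the sliding window can be modeled. It only illustrates the behavior described above and is not the actual engine implementation; the names and the 5-minute window are example assumptions.

from datetime import datetime, timedelta, timezone

SLIDING_WINDOW_MINUTES = 5  # the configured "last X minutes"

def query_window(now: datetime) -> tuple[datetime, datetime]:
    # Every execution (once per minute) requests the last X minutes of data.
    return now - timedelta(minutes=SLIDING_WINDOW_MINUTES), now

now = datetime.now(timezone.utc).replace(second=0, microsecond=0)
start, end = query_window(now)
# At 1-minute resolution, one sample per minute is expected in the response.
print(f"Querying {start:%H:%M}-{end:%H:%M} UTC, expecting {SLIDING_WINDOW_MINUTES} samples")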
The threshold, whether static or based on automated baselines, has one main job: sorting the received samples into two buckets, violating or dealerting. Each received sample is compared against the threshold individually.
It may happen that when requesting the last X minutes of data, the number of samples returned does not match the expected count. The "Alert on missing data" flag decides how those missing samples are treated.
When enabled, missing data samples will be treated as violating samples.
When disabled, missing data is not treated as a violation but will still contribute to dealerting.
The violating samples setting defines how many violating samples in a given execution are required to raise a problem.
The dealerting samples setting defines how many dealerting samples in a given execution are required to close an active problem.
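Putting the parameters above together, a single execution can be modeled roughly like this. The following Python sketch is a simplified illustration of the evaluation described in this article, not the actual engine code; the function name, parameter names, and the "above threshold" condition are assumptions for the example.

from typing import Optional

def evaluate(samples: list[Optional[float]],  # one entry per expected minute, None = missing sample
             threshold: float,                # static threshold; a baseline-derived one works the same way
             alert_on_missing_data: bool,
             violating_needed: int,           # samples required to raise a problem
             dealerting_needed: int) -> str:
    violating = dealerting = 0
    for value in samples:
        if value is None:
            # Missing samples: violating when "Alert on missing data" is enabled,
            # otherwise they still contribute to dealerting.
            if alert_on_missing_data:
                violating += 1
            else:
                dealerting += 1
        elif value > threshold:  # example uses an "above threshold" violating condition
            violating += 1
        else:
            dealerting += 1
    # Note: with an unstable configuration (see below) both counts can reach
    # their limits in the same execution; this sketch simply checks violating first.
    if violating >= violating_needed:
        return "raise or keep problem"
    if dealerting >= dealerting_needed:
        return "close active problem"
    return "no state change"

# Example: 5-minute window, one missing sample, "Alert on missing data" enabled
print(evaluate([12.0, None, 15.0, 9.0, 16.0], threshold=10.0,
               alert_on_missing_data=True, violating_needed=3, dealerting_needed=3))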
Sometimes, after setting up anomaly detection configuration, alerts behave in unexpected ways. This chapter provides methods to identify the root cause of such behavior.
The best way to understand how the configuration behaves is to understand the data. Here are some tips on how to view your metric in a way that closely resembles what the anomaly detection engine sees.
Now let's explore some examples to visualize the indicators:
One of the typical issues is alerting on missing data where the metric is ingested with timestamps in the past. It could look like this:
timeseries max(dt.cloud.aws.alb.connections.active), by:{aws.resource.name}, interval:1m
| filter aws.resource.name == "easytravel-angular-large-live"
If the metric latency is consistent, a gap like this can be seen when querying the last X minutes of the metric.
If no gap is present in the fresh data, or the data is sparse in general, it's worth checking our latency metrics for the specific timeframe where the issue appeared. Here is an example for a metric like the one above:
timeseries { max(dt.sfm.server.metrics.latencies) }, by:{metric_key}, filter:contains(metric_key,"cloud.aws.alb.connections.active")
For the Data Explorer:
dsfm:server.metrics.latencies:filter(contains("metric_key","cloud.aws.alb.connections.active")):max
While those latency metrics do not allow filtering on specific metric dimensions, using the max aggregation makes it possible to see the highest latency reported at a given time and estimate whether it could have affected problem generation.
Both Metric Events and Davis Anomaly Detectors have an "offset" parameter that can be defined to tell the engine to move the sliding window into the past.
Here is a simple visualization of the "offset":
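In the terms of the earlier window sketch, the offset simply shifts the whole queried timeframe further into the past, so the engine only evaluates data that has had time to arrive. A minimal Python illustration, with example values rather than recommended ones:

from datetime import datetime, timedelta, timezone

SLIDING_WINDOW_MINUTES = 5
OFFSET_MINUTES = 10  # example offset chosen to tolerate late-arriving data

def query_window(now: datetime) -> tuple[datetime, datetime]:
    # The "last X minutes" window ends OFFSET_MINUTES in the past instead of now.
    end = now - timedelta(minutes=OFFSET_MINUTES)
    return end - timedelta(minutes=SLIDING_WINDOW_MINUTES), end

now = datetime.now(timezone.utc).replace(second=0, microsecond=0)
start, end = query_window(now)
print(f"Evaluating {start:%H:%M}-{end:%H:%M} UTC instead of the freshest {SLIDING_WINDOW_MINUTES} minutes")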
With larger sliding windows, it is possible to set up a configuration where both the violating and the dealerting condition are met at the same time.
Here is an example of a 10/30/10 (violating/sliding window/dealerting) configuration. Within the 30-minute sliding window, you can get 10 violating samples, 10 dealerting samples, and 10 other random samples. The engine receives both signals to open and close the problem at the same time, which can lead to unexpected behavior where problems open and close randomly.
If your problems behave erratically, always check for unstable configurations. A stable configuration in this case would require at least 21 dealerting samples: 10/30/21. If there are 21 dealerting samples, then there is no room for an additional 10 violating samples within the sliding window.
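A quick way to check a configuration for this kind of instability is to verify that the violating and dealerting sample counts cannot both fit into one sliding window. A small Python sketch of that rule of thumb (the helper name is illustrative):

def is_stable(sliding_window: int, violating_needed: int, dealerting_needed: int) -> bool:
    # Stable only if one window cannot contain enough violating and enough
    # dealerting samples to trigger both conditions in the same execution.
    return violating_needed + dealerting_needed > sliding_window

print(is_stable(sliding_window=30, violating_needed=10, dealerting_needed=10))  # False: 10 + 10 <= 30, unstable
print(is_stable(sliding_window=30, violating_needed=10, dealerting_needed=21))  # True: 10 + 21 > 30, stable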
Most issues can be solved using the two approaches above, but when the query is complex, it's not as straightforward. While it's not possible to provide solutions for all potential issues, here is a set of best practices that should help untangle most complex scenarios:
Avoid arrayMovingXXX operations, as they can lead to unexpected results. If there is no alternative, always use sliding window timeframes to investigate the query output. The resulting values may vary significantly from one execution to the next, depending on the data points within the window.
Avoid the default and nonempty operators. Instead, use "Alert on missing data" to your advantage.
Avoid the limit and sort operators.
In case this article was not enough to find out what happened with your alert, please open a support ticket with the following details: