This article provides insight into the Metric Event engine and offers tips for setting up anomaly detection rules that behave consistently and stably on your data. Both Metric Events and Davis Anomaly Detectors follow similar principles, so this guide applies to both.
For basic information about these anomaly detection features, please refer to the Dynatrace documentation.
To understand what to expect from the alert, let's first dive into how Dynatrace polls data based on the configuration. In the two screenshots below, you can see the most important parameters for our alerting engines:
Let's start with the sliding window, as it gives context for all the other parameters. The sliding window defines the timeframe of every query used to fetch data points and operates on a "Last X minutes" basis. Setting it to 5, as in the example above, results in the engine polling for the last 5 minutes of data every minute. The anomaly detection engines expect a data resolution of 1 minute, so when polling the last 5 minutes they expect to receive 5 data samples in response. Each sample is then compared against the threshold.
Whether it's a static threshold or one based on automated baselines, the threshold's main job is to sort the received samples into two buckets: violating or dealerting. Each received sample is compared against the threshold individually.
It may happen that, when requesting the last X minutes of data, the number of samples returned does not match the expected number. The "Alert on missing data" flag decides how those missing samples are treated.
When the flag is enabled, missing data samples are treated as violating samples.
When it is disabled, missing data is not treated as a violation but still contributes to dealerting.
The violating samples parameter defines how many violating samples in a given execution are required to raise a problem, and the dealerting samples parameter defines how many dealerting samples in a given execution are required to close an active problem.
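To tie these parameters together, here is a minimal Python sketch of how a single evaluation cycle could combine them. This is not Dynatrace code; the function name, parameter names, and structure are assumptions made purely for illustration, based on the descriptions above.

# Minimal sketch of one evaluation cycle, assuming the behavior described above.
# Not Dynatrace code; names and structure are illustrative only.
def evaluate_window(samples, threshold, sliding_window, violating_needed,
                    dealerting_needed, alert_on_missing_data, problem_open):
    # The engine expects one sample per minute, so a 5-minute sliding window
    # should return 5 samples; anything fewer counts as missing samples.
    missing = sliding_window - len(samples)

    violating = sum(1 for value in samples if value > threshold)
    dealerting = len(samples) - violating

    if alert_on_missing_data:
        # Missing samples are treated as violating samples.
        violating += missing
    else:
        # Missing samples are not violations but still contribute to dealerting.
        dealerting += missing

    if not problem_open and violating >= violating_needed:
        return "raise problem"
    if problem_open and dealerting >= dealerting_needed:
        return "close problem"
    return "no change"

# Example: 5-minute sliding window, static threshold of 100, 3 violating
# samples required to open and 5 dealerting samples required to close.
# Only 4 samples arrived, so 1 sample is missing.
print(evaluate_window([120, 130, 90, 140], threshold=100, sliding_window=5,
                      violating_needed=3, dealerting_needed=5,
                      alert_on_missing_data=True, problem_open=False))

In this example the three samples above 100 plus the one missing sample (counted as violating because the flag is enabled) reach the required three violating samples, so a problem would be raised.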
Sometimes, after setting up an anomaly detection configuration, alerts behave in unexpected ways. This chapter provides methods to identify the root cause of such behavior.
The best way to understand how the configuration behaves is to understand the data. Here are some tips on how to view your metric in a way that closely resembles what the anomaly detection engine sees.
Now let's explore some examples to visualize the indicators:
One typical issue is alerting on missing data when the metric is ingested with timestamps in the past. It could look like this:
timeseries max(dt.cloud.aws.alb.connections.active), by:{aws.resource.name}, interval:1m
| filter aws.resource.name == "easytravel-angular-large-live"
If the metric latency is consistent, a gap like this can be seen when querying the last X minutes of the metric.
If no gap is present in the fresh data, or the data is sparse in general, it's worth checking the latency metrics for the specific timeframe where the issue appeared. Here is an example for a metric like the one above:
timeseries { max(dt.sfm.server.metrics.latencies) }, by:{metric_key}, filter:contains(metric_key,"cloud.aws.alb.connections.active")
For the Data Explorer:
dsfm:server.metrics.latencies:filter(contains("metric_key","cloud.aws.alb.connections.active")):max
While these latency metrics do not allow filtering by specific metric dimensions, using the max aggregation makes it possible to find out what latency was reported at a given time and estimate whether it could have affected problem generation.
Both Metric Events and Davis Anomaly Detectors have an "offset" parameter that tells the engine to move the sliding window into the past.
Here is a simple visualization of the "offset":
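As a rough sketch of the same idea, the snippet below shows how an offset shifts the timeframe the engine queries. This is illustrative Python only, not Dynatrace code; the function name and the exact timestamps are assumptions based on the description above.

from datetime import datetime, timedelta

# Sketch of how an offset moves the sliding window into the past.
# Illustrative only; not Dynatrace code.
def query_timeframe(now, sliding_window_minutes, offset_minutes=0):
    end = now - timedelta(minutes=offset_minutes)
    start = end - timedelta(minutes=sliding_window_minutes)
    return start, end

now = datetime(2025, 1, 1, 12, 0)
# Without an offset, a 5-minute window queries the last 5 minutes: 11:55-12:00.
print(query_timeframe(now, sliding_window_minutes=5))
# With a 5-minute offset, the same window queries 11:50-11:55, which gives
# late-arriving data time to be ingested before it is evaluated.
print(query_timeframe(now, sliding_window_minutes=5, offset_minutes=5))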
With larger sliding windows it is possible to set up a configuration where both the violating and the dealerting conditions are met at the same time.
Here is an example of a 10/30/10 (violating/sliding window/dealerting) configuration. Within the 30-minute sliding window, you can get 10 violating samples, 10 dealerting samples, and 10 other random samples. The engine receives both signals to open and close the problem at the same time, which can lead to unexpected behavior where problems open and close randomly.
If your problems behave erratically, always check for unstable configurations. A stable configuration in this case requires at least 21 dealerting samples: 10/30/21. With 21 dealerting samples there is no room for an additional 10 violating samples within the sliding window; in general, a configuration is stable when the required violating and dealerting sample counts together exceed the sliding window.
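This rule of thumb can be expressed as a quick check. The snippet below is an illustrative sketch based on the reasoning above, not an official formula or API.

# Quick stability check for a violating/sliding window/dealerting configuration.
# Illustrative sketch based on the reasoning above; not Dynatrace code.
def is_stable(violating, sliding_window, dealerting):
    # Both conditions can be met at once only if the window is large enough
    # to hold the required violating AND dealerting samples at the same time.
    return violating + dealerting > sliding_window

print(is_stable(10, 30, 10))   # False: 10 + 10 <= 30, unstable
print(is_stable(10, 30, 21))   # True: 10 + 21 > 30, stable
print(is_stable(3, 5, 3))      # True: 3 + 3 > 5, stable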
Most issues can be solved using the two approaches above, but when the query is complex, it's not as straightforward. While it's not possible to provide solutions for all potential issues, here is a set of best practices that should help untangle most complex scenarios:
Avoid arrayMovingXXX operations, as they can lead to unexpected results. If there is no alternative, always use sliding window timeframes to investigate the query output; the resulting values may vary significantly from one execution to the next, depending on the data points within the window.
Avoid the default and nonempty operators. Instead, use "Alert on missing data" to your advantage.
Avoid the limit and sort operators.
In case this article was not enough to find out what happened with your alert, please open a support ticket with the following details:
This is very nice @dawid_rampalski, thanks!
One thing I wish is that we could have a separate sliding window for dealerting samples. Having the dealerting condition share the same sliding window as the alerting condition really limits us when it comes to auto-closing problems. I may want something to alert quickly (like when 3 out of 5 samples are violating), but I don't want it to dealert until there have been 10, 20, or maybe 30 minutes of stability.
Because the violating samples and dealerting samples share the same sliding window, we have to choose between fast detection and potentially frequent problems opening and closing all day, or slower detection and just one problem open all day.
For example, say some count metric has occasional spikes throughout the day (like maybe it averages a value of 3 and occasionally spikes to 10 every 10 or 15 minutes). And that is fine. However, sometimes an issue occurs, and it spikes to 10 every minute or two. We need to know when this happens fairly quickly, so we set up a seasonal threshold detector with 3/5/5. Great, now if this happens, we get notified quickly. But what if this issue comes in chunks where it does this every 10 or 20 minutes, and then it's calm again? With 3/5/5, we'll get a problem, it'll close, then another problem 10 minutes later, then it closes, and so on... It's actually one issue all day, but it will look like a bunch of separate problems.
Yes, I know there is also the Frequent Issue Detection, but that is a binary option that we don't have a lot of control over and not really the same thing that we want here.
Anyways, to mitigate the frequent problems all day issue, we could instead set up a detector with 5/15/10 or maybe 10/20/15, but then we may start getting alerted about the occasional spikes that are completely normal, plus it will now take longer before we alert on the problem.
If we had separate dealerting windows, we could set up a 3/5 violating threshold, but a 10/20 dealerting threshold, or maybe even a 15/30, so that we can ensure that the problem only closes when it has been completely stable for a very long time.
In other words, the trigger condition and the close conditions should be completely separate from each other. It would be even cooler if we could have a completely separate query for reset if we wanted! "Alert when this metric is over a certain value for this long, but only dealert when that metric and this metric are under a certain value for this long."
That's how SolarWinds Orion (or I think it's called SolarWinds Platform now...) does alerts and it is super useful. 9 times out of 10 we don't set custom dealerting queries in SolarWinds, but it's useful for those special scenarios where we need it. However, we do frequently set much longer dealerting timeframes to ensure that an alert only clears when the issue is completely over.
Hi @36Krazyfists ,
I think this is a great idea for an improvement! If this feature is needed for your deployment, you can create a Product Idea topic here: https://community.dynatrace.com/t5/Dynatrace-product-ideas/idb-p/DynatraceProductIdeas
This is currently the main way to communicate feature requests like this directly to our PM and dev teams.
Feel free to drop your story there as well. At this moment we don't have this kind of functionality out of the box with Davis Anomaly Detection.