This article provides insight into the Metric Event engine and offers tips for setting up anomaly detection rules that behave consistently and stably on your data. Both Metric Events and Davis Anomaly Detectors follow similar principles, so this guide applies to both.
For basic information about these anomaly detection features, please refer to the Dynatrace documentation.
To understand what to expect from the alert, let's first dive into how Dynatrace polls data based on the configuration. In the two screenshots below, you can see the most important parameters for our alerting engines:
Let's start with the sliding window, as it gives context for all the other parameters. The sliding window defines the timeframe of every query used to fetch data points and operates on a "Last X minutes" basis. Setting it to 5, as in the example above, results in the engine polling for the last 5 minutes of data every minute. The anomaly detection engines expect a data resolution of 1 minute, so when polling the last 5 minutes they expect to receive 5 data samples in response. Each sample is then compared against the threshold.
Whether it's a static threshold or one based on automated baselines, the threshold's main job is to sort the received samples into two buckets: violating or dealerting. Each received sample is compared against the threshold individually.
It may happen that, when requesting the last X minutes of data, the number of samples returned does not match the expected number. The "Alert on missing data" flag decides how those missing samples are treated.
When the flag is enabled, missing data samples are treated as violating samples.
When it is disabled, missing data is not treated as a violation but still contributes to dealerting.
The violating samples parameter defines how many violating samples in a given execution are required to raise a problem, and the dealerting samples parameter defines how many dealerting samples in a given execution are required to close an active problem.
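To tie these parameters together, here is a minimal Python sketch of how a single evaluation cycle could combine them. This is not Dynatrace code; the function name, parameter names, and structure are assumptions made purely for illustration, based on the descriptions above.

# Minimal sketch of one evaluation cycle, assuming the behavior described above.
# Not Dynatrace code; names and structure are illustrative only.
def evaluate_window(samples, threshold, sliding_window, violating_needed,
                    dealerting_needed, alert_on_missing_data, problem_open):
    # The engine expects one sample per minute, so a 5-minute sliding window
    # should return 5 samples; anything fewer counts as missing samples.
    missing = sliding_window - len(samples)

    violating = sum(1 for value in samples if value > threshold)
    dealerting = len(samples) - violating

    if alert_on_missing_data:
        # Missing samples are treated as violating samples.
        violating += missing
    else:
        # Missing samples are not violations but still contribute to dealerting.
        dealerting += missing

    if not problem_open and violating >= violating_needed:
        return "raise problem"
    if problem_open and dealerting >= dealerting_needed:
        return "close problem"
    return "no change"

# Example: 5-minute sliding window, static threshold of 100, 3 violating
# samples required to open and 5 dealerting samples required to close.
# Only 4 samples arrived, so 1 sample is missing.
print(evaluate_window([120, 130, 90, 140], threshold=100, sliding_window=5,
                      violating_needed=3, dealerting_needed=5,
                      alert_on_missing_data=True, problem_open=False))

In this example the three samples above 100 plus the one missing sample (counted as violating because the flag is enabled) reach the required three violating samples, so a problem would be raised.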
Sometimes, after setting up an anomaly detection configuration, alerts behave in unexpected ways. This chapter provides methods to identify the root cause of such behavior.
The best way to understand how the configuration behaves is to understand the data. Here are some tips on how to view your metric in a way that closely resembles what the anomaly detection engine sees.
Now let's explore some examples to visualize the indicators:
One typical issue is alerting on missing data when the metric is ingested with timestamps in the past. It could look like this:
timeseries max(dt.cloud.aws.alb.connections.active), by:{aws.resource.name}, interval:1m
| filter aws.resource.name == "easytravel-angular-large-live"
If the metric latency is consistent, a gap like this can be seen when querying the last X minutes of the metric.
If no gap is present in the fresh data, or the data is sparse in general, it's worth checking the latency metrics for the specific timeframe where the issue appeared. Here is an example for a metric like the one above:
timeseries { max(dt.sfm.server.metrics.latencies) }, by:{metric_key}, filter:contains(metric_key,"cloud.aws.alb.connections.active")
For the Data Explorer:
dsfm:server.metrics.latencies:filter(contains("metric_key","cloud.aws.alb.connections.active")):max
While these latency metrics do not allow filtering by specific metric dimensions, using the max aggregation makes it possible to find out what latency was reported at a given time and estimate whether it could have affected problem generation.
Both Metric Events and Davis Anomaly Detectors have an "offset" parameter that tells the engine to move the sliding window into the past.
Here is a simple visualization of the "offset":
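As a rough sketch of the same idea, the snippet below shows how an offset shifts the timeframe the engine queries. This is illustrative Python only, not Dynatrace code; the function name and the exact timestamps are assumptions based on the description above.

from datetime import datetime, timedelta

# Sketch of how an offset moves the sliding window into the past.
# Illustrative only; not Dynatrace code.
def query_timeframe(now, sliding_window_minutes, offset_minutes=0):
    end = now - timedelta(minutes=offset_minutes)
    start = end - timedelta(minutes=sliding_window_minutes)
    return start, end

now = datetime(2025, 1, 1, 12, 0)
# Without an offset, a 5-minute window queries the last 5 minutes: 11:55-12:00.
print(query_timeframe(now, sliding_window_minutes=5))
# With a 5-minute offset, the same window queries 11:50-11:55, which gives
# late-arriving data time to be ingested before it is evaluated.
print(query_timeframe(now, sliding_window_minutes=5, offset_minutes=5))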
With larger sliding windows it is possible to set up a configuration where both the violating and the dealerting conditions are met at the same time.
Here is an example of a 10/30/10 (violating/sliding window/dealerting) configuration. Within the 30-minute sliding window, you can get 10 violating samples, 10 dealerting samples, and 10 other random samples. The engine receives both signals to open and close the problem at the same time, which can lead to unexpected behavior where problems open and close randomly.
If your problems behave erratically, always check for unstable configurations. A stable configuration in this case requires at least 21 dealerting samples: 10/30/21. With 21 dealerting samples there is no room for an additional 10 violating samples within the sliding window; in general, a configuration is stable when the required violating and dealerting sample counts together exceed the sliding window.
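This rule of thumb can be expressed as a quick check. The snippet below is an illustrative sketch based on the reasoning above, not an official formula or API.

# Quick stability check for a violating/sliding window/dealerting configuration.
# Illustrative sketch based on the reasoning above; not Dynatrace code.
def is_stable(violating, sliding_window, dealerting):
    # Both conditions can be met at once only if the window is large enough
    # to hold the required violating AND dealerting samples at the same time.
    return violating + dealerting > sliding_window

print(is_stable(10, 30, 10))   # False: 10 + 10 <= 30, unstable
print(is_stable(10, 30, 21))   # True: 10 + 21 > 30, stable
print(is_stable(3, 5, 3))      # True: 3 + 3 > 5, stable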
Most issues can be solved using the two approaches above, but when the query is complex, it's not as straightforward. While it's not possible to provide solutions for all potential issues, here is a set of best practices that should help untangle most complex scenarios:
Avoid arrayMovingXXX operations, as they can lead to unexpected results. If there is no alternative, always use sliding window timeframes to investigate the query output; the resulting values may vary significantly from one execution to the next, depending on the data points within the window.
Avoid the default and nonempty operators. Instead, use "Alert on missing data" to your advantage.
Avoid the limit and sort operators.
In case this article was not enough to find out what happened with your alert, please open a support ticket with the following details:
This is very nice @dawid_rampalski, thanks!
One thing I wish is that we could have a separate sliding window for dealerting samples. Having the dealerting condition share the same sliding window as the alerting condition really limits us when it comes to auto-closing problems. I may want something to alert quickly (like when 3 out of 5 samples are violating), but I don't want it to dealert until there have been 10, 20, or maybe 30 minutes of stability.
Because the violating samples and dealerting samples share the same sliding window, we have to choose between fast detection and potentially frequent problems opening and closing all day, or slower detection and just one problem open all day.
For example, say some count metric has occasional spikes throughout the day (like maybe it averages a value of 3 and occasionally spikes to 10 every 10 or 15 minutes). And that is fine. However, sometimes an issue occurs, and it spikes to 10 every minute or two. We need to know when this happens fairly quickly, so we set up a seasonal threshold detector with 3/5/5. Great, now if this happens, we get notified quickly. But what if this issue comes in chunks where it does this every 10 or 20 minutes, and then it's calm again? With 3/5/5, we'll get a problem, it'll close, then another problem 10 minutes later, then it closes, and so on... It's actually one issue all day, but it will look like a bunch of separate problems.
Yes, I know there is also the Frequent Issue Detection, but that is a binary option that we don't have a lot of control over and not really the same thing that we want here.
Anyways, to mitigate the frequent problems all day issue, we could instead set up a detector with 5/15/10 or maybe 10/20/15, but then we may start getting alerted about the occasional spikes that are completely normal, plus it will now take longer before we alert on the problem.
If we had separate dealerting windows, we could set up a 3/5 violating threshold, but a 10/20 dealerting threshold, or maybe even a 15/30, so that we can ensure that the problem only closes when it has been completely stable for a very long time.
In other words, the trigger condition and the close conditions should be completely separate from each other. It would be even cooler if we could have a completely separate query for reset if we wanted! "Alert when this metric is over a certain value for this long, but only dealert when that metric and this metric are under a certain value for this long."
That's how SolarWinds Orion (or I think it's called SolarWinds Platform now...) does alerts and it is super useful. 9 times out of 10 we don't set custom dealerting queries in SolarWinds, but it's useful for those special scenarios where we need it. However, we do frequently set much longer dealerting timeframes to ensure that an alert only clears when the issue is completely over.
Hi @36Krazyfists ,
I think this is a great idea for an improvement! If this feature is needed for your deployment, you can create a Product Idea topic here: https://community.dynatrace.com/t5/Dynatrace-product-ideas/idb-p/DynatraceProductIdeas
This is currently the main way to communicate feature requests like this directly to our PM and dev teams.
Feel free to drop your story there as well. At this moment we don't have this kind of functionality out of the box with Davis Anomaly Detection.