Alerting
Questions about alerting and problem detection in Dynatrace.

Anomaly Detection delay compared to metric events

carolosjk
Observer

Hi all,

 

Over the past couple of months we've been asked to replicate a lot of alerts from other monitoring tools in Dynatrace, which has been a bit of a challenge. The possibilities with DQL and anomaly detection are pretty much endless, and you can get an amazing outcome on the alerting condition, but one problem keeps recurring.

The anomaly detection configurations have a big delay compared to metric events. This is a really big issue for time-sensitive alerts, where every minute of downtime matters and costs a lot of money. Essentially, every anomaly detection configuration creates an event and a problem 3+ minutes after the violating sample is ingested into Grail. The behavior is the same whether the DQL query uses a log or a metric. Here is an example:

You have a really simple alert for metric A: when it goes above the value 10, a problem gets created.
You configure an anomaly detector and a metric event, each with 1 violating sample and a 3-minute window.
Let's say a datapoint is ingested with the value 15. We then see the following behavior:
The metric event usually creates an event and opens a new problem after 30-60 seconds.
The anomaly detection configuration creates an event and opens a problem after 3+ minutes.

Here is a screenshot with the events and problems from the above scenario:

carolosjk_0-1776851834918.png

The datapoint with value 15 for the metric was ingested at about 11:39:30 AM.
At 11:40:05 the metric event configuration created the first event and opened the problem P-xxx529.
At 11:42:43 the anomaly detection configuration created the first event and opened the problem P-xxx531.
That is a difference of over two and a half minutes between the two. These results are pretty much the same in every test case we've had, whether real or simulated.
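To make the arithmetic behind these timestamps explicit, here is a minimal Python sketch that computes the delays. The variable names are made up for illustration, and the ingestion time is only approximate ("about 11:39:30"), so the results are rough figures rather than exact measurements.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps taken from the scenario above (ingestion time is approximate).
ingested = datetime.strptime("11:39:30", FMT)
metric_event_opened = datetime.strptime("11:40:05", FMT)
anomaly_detector_opened = datetime.strptime("11:42:43", FMT)

# Delay from ingestion to the first event/problem, in seconds.
metric_event_delay = (metric_event_opened - ingested).total_seconds()
anomaly_detector_delay = (anomaly_detector_opened - ingested).total_seconds()

# Gap between the two alerting paths, in seconds.
gap = (anomaly_detector_opened - metric_event_opened).total_seconds()

print(f"metric event delay:     {metric_event_delay:.0f} s")
print(f"anomaly detector delay: {anomaly_detector_delay:.0f} s")
print(f"gap between the two:    {gap:.0f} s")
```

With these values the metric event fires roughly 35 seconds after ingestion, the anomaly detector roughly 193 seconds (3 min 13 s) after ingestion, and the gap between them is 158 seconds (2 min 38 s).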

I understand that anomaly detection uses the Grail data warehouse, while metric events use classic metrics, and there is probably a time difference for the data to be ingested and processed in each path. But the difference in delay is a deal breaker for time-sensitive alerts. For some simple alerts we can use metric events, but when we want more complex DQL logic, such as parsing or joining other data, anomaly detection is the only option.

We have tried using the event properties dt.davis.analysis_time_budget:0 and dt.davis.analysis_trigger_delay:0, with no difference in the delay before the event and problem are created. We have also tried setting both the violating samples and the time window in anomaly detection to 1 minute, which again makes no difference. The only workaround we have found that works is a workflow that executes every minute and sends a notification via mail or another integration, but this is neither scalable nor cost-effective.

I would love to know if you have any suggestions on this topic, or if you have found a way to get alerts and problems faster with anomaly detection. Has anyone else run into the same issue?

Thank you for your time!
Best regards,
Karolos

4 REPLIES

Julius_Loman
DynaMight Legend

Very good question! I have the same experience and only found this in the changelog, but it seems to apply only to problem-opening events, not to Anomaly detectors.

 @DavidBruendl can you advise?

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

dannemca
DynaMight Guru

I am following this thread!!

Site Reliability Engineer @ Kyndryl

Hey Antonio,

 

This is really interesting. I wasn't aware of that metric, and it explains why some metric events I tested in the past had a big delay in producing an event/problem. It seems like the same metric doesn't exist on Grail, right? I can't find anything similar there.
