Alerting
Questions about alerting and problem detection in Dynatrace.

Problem not opened in case of 100% failure rate

tesp11331
Participant

Hi All,

I see on my tenant that for some services, in case of a high failure rate (even 100%), no problem is opened or the opening is delayed. See for instance:

[Screenshot: tesp11331_2-1770813661418.png]

For another service (with the same alert condition) the problem was immediately opened:

[Screenshot: tesp11331_3-1770813861427.png]

Why the different behavior?

Thanks

Regards

Pasquale

5 REPLIES

Julius_Loman
DynaMight Legend

Most likely, this is because of frequent issue detection. Can you check whether you have any such event on the service? Also check your anomaly detection settings on the service (or in the settings hierarchy).

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

tesp11331
Participant

I don't think this is related to a frequent issue; we usually do not have such failure rates on those services. It is also strange that, in the case of the first chart, the problem was eventually opened, but only later and not at the beginning when the failure rate was at 100%.

The alert condition is the same for all services (absolute threshold=3% and relative=60%).
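For illustration, here is a minimal sketch of how such a pair of thresholds could combine. This is an assumption about the semantics (both the absolute and the relative increase over the baseline must be exceeded), not Dynatrace's actual evaluation code:

```python
# Illustrative sketch (not Dynatrace's actual code): how a baseline-relative
# check with both an absolute and a relative increase threshold could combine.
# Assumption: both conditions must be violated before an event is considered.

def violates(observed_pct: float, baseline_pct: float,
             abs_increase_pct: float = 3.0, rel_increase_pct: float = 60.0) -> bool:
    """True if the observed failure rate exceeds the baseline by at least
    abs_increase_pct percentage points AND by rel_increase_pct percent."""
    absolute_hit = (observed_pct - baseline_pct) >= abs_increase_pct
    relative_hit = (baseline_pct > 0
                    and (observed_pct - baseline_pct) / baseline_pct * 100 >= rel_increase_pct)
    return absolute_hit and relative_hit

# A 100% failure rate over a near-zero baseline trips both thresholds,
# while a small bump can trip one threshold but not the other:
print(violates(observed_pct=100.0, baseline_pct=1.0))  # True
print(violates(observed_pct=4.0, baseline_pct=2.0))    # False: +100% relative, but only +2 points absolute
```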

Hey, by any chance did you find an answer to this, as I'm in the same boat?

Hi, unfortunately I didn't.

t_pawlak
Leader


Hi,

My guess is that this is mostly related to how Dynatrace evaluates the violation over time, not only to the visible percentage on the chart.

Even if the chart shows 100% failure rate for a short period, Dynatrace usually raises the event only after the anomaly detection logic collects enough violating samples within its sliding window. For anomaly detection, the default behavior is typically 3 violating one-minute samples out of 5 minutes before the event is raised. So a short spike can appear immediately on the graph, while the problem itself is opened a bit later.
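To make that concrete, here is a minimal sketch of such a sliding-window rule. It only illustrates the behavior described above (it is not Dynatrace's implementation), and the 3-of-5 values are simply the typical defaults mentioned:

```python
# Minimal sketch of a "3 violating 1-minute samples out of a 5-minute sliding
# window" rule, illustrating why a short spike does not open a problem.
from collections import deque

def first_alert_minute(per_minute_violations, window=5, required=3):
    """Return the (0-based) minute at which the rule would first raise an
    event, or None if it never does. Input is a list of booleans, one per minute."""
    recent = deque(maxlen=window)
    for minute, violated in enumerate(per_minute_violations):
        recent.append(violated)
        if sum(recent) >= required:
            return minute
    return None

# A single 1-minute spike to 100% never opens a problem under this rule,
# while a sustained violation opens one only at the 3rd violating minute:
print(first_alert_minute([True, False, False, False, False]))  # None
print(first_alert_minute([True, True, True, True]))            # 2
```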

A second factor can be the traffic volume / number of requests behind the percentage. Two services may have the same configured thresholds, but if one service has only a few failing calls and the other has sustained failing traffic, the evaluation can behave differently even when the displayed failure rate looks similar.
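As a rough illustration of the traffic-volume point (the 10-requests-per-minute minimum below is a hypothetical value standing in for an over-alerting protection limit, not a documented default):

```python
# Illustration: why two services with the same displayed failure rate can be
# treated differently once the number of requests behind the percentage matters.

def evaluated_failure_rate(failed: int, total: int, min_requests: int = 10):
    """Return the failure rate in percent, or None when there is too little
    traffic in the interval to evaluate at all."""
    if total < min_requests:
        return None
    return 100.0 * failed / total

print(evaluated_failure_rate(2, 2))      # None: 100% of only 2 requests is not evaluated
print(evaluated_failure_rate(300, 500))  # 60.0: sustained failing traffic is evaluated
```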

I would therefore check:

1. the service anomaly detection settings (and inherited settings),
2. whether the service had enough request volume during that time,
3. whether frequent issue detection/suppression played any role,
4. and whether there is any difference between problem opening time and notification sending delay from the alerting profile.

So in short: the different behavior is most likely caused by sample-based evaluation and sliding-window logic, possibly combined with different request volumes on those services.
For example, go to the service settings and check the anomaly detection configuration, especially the “Avoid over alerting” option.
[Screenshot: avoid1.jpg — service anomaly detection settings, "Avoid over alerting" option]
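If you prefer to compare the two services via the API rather than the UI, something like the following could work. This is only a sketch, assuming the Settings 2.0 endpoint and the builtin:anomaly-detection.services schema, plus an access token with settings read permission; the environment URL, token, and service IDs are placeholders to replace with your own values.

```python
# Hedged sketch: fetch the anomaly detection settings objects scoped to a
# service via the Dynatrace Settings 2.0 API, so two services can be compared.
import requests

ENV_URL = "https://{your-environment-id}.live.dynatrace.com"  # placeholder
API_TOKEN = "dt0c01.XXXX"                                      # placeholder token with settings read scope

def get_service_anomaly_settings(service_id: str) -> list:
    """Return the settings objects for builtin:anomaly-detection.services
    scoped to the given service entity ID."""
    resp = requests.get(
        f"{ENV_URL}/api/v2/settings/objects",
        headers={"Authorization": f"Api-Token {API_TOKEN}"},
        params={
            "schemaIds": "builtin:anomaly-detection.services",
            "scopes": service_id,  # e.g. "SERVICE-1234567890ABCDEF"
            "fields": "objectId,value,scope",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

# Compare the two services side by side to spot inherited differences:
# print(get_service_anomaly_settings("SERVICE-AAAA..."))
# print(get_service_anomaly_settings("SERVICE-BBBB..."))
```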
