Alert fatigue, Dashboards, recommendations?

Kenny_Gillette — Tue, 30 Dec 2025 14:57:22 GMT

How are DT engineers working through alert fatigue?

How are DT engineers figuring out the best time settings for the items below (services, web, etc.)?

I know adjusting these settings is a balancing act between speed (MTTD - Mean Time to Detect) and accuracy (reducing false positives).

Anybody got any good dashboards or notebooks that help with this?

I have various Problems dashboards that I can share if helpful, but they don't look at how many problems occur at various times. Here is a DQL query we started using to figure this out for some services.

fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false) or contains(event.name, "Response time degradation", caseSensitive:false)
| fields timestamp, event.name, resolved_problem_duration, affected_entity_ids, root_cause_entity_id
| summarize count(), by: {resolved_problem_duration}
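
Grouping by the raw resolved_problem_duration tends to give one row per distinct duration, so a bucketed variant may be easier to chart. This is a minimal sketch, assuming bin() accepts a duration value with a duration interval. The 5m bucket width and the names duration_bucket and problems are illustrative choices, not part of the original query.

```
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false) or contains(event.name, "Response time degradation", caseSensitive:false)
// Assumption: bin() works on durations; 5m is an arbitrary bucket width
| fieldsAdd duration_bucket = bin(resolved_problem_duration, 5m)
// Count problems per problem type and duration bucket
| summarize problems = count(), by: { event.name, duration_bucket }
| sort duration_bucket asc
```

If most problems land in the shortest buckets, that may hint that a longer observation window would suppress them, at the cost of a slower MTTD.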

Re: Alert fatigue, Dashboards, recommendations?

AntonioSousa — Tue, 30 Dec 2025 21:58:40 GMT

@Kenny_Gillette ,

I have done this exercise with several clients. What seems to work best from my perspective:

1. Keep the 1-minute setting for "Failure rate" so MTTD stays fast.
2. In "Failure rate", start with a higher "Absolute threshold". 1% and 5% have worked for me. More details below.
3. In "Response time", an absolute threshold of 300 ms is my favorite. It really depends on a lot of factors, though. In Portugal, where the cloud is not nearby, latency is always a problem, which is why I normally suggest a higher value.
4. Pareto tells us that 80% of the problems will be on 20% of the Services. In my case, I normally see 95%/5%, so setting specific thresholds for specific Services yields great results. You can see them if you summarize by affected_entity_ids.
5. In the end, the best approach is to correct all those errors. Response time is more difficult to fix, though.

Regarding #2 above, you can graph it with

fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
| parse event.description, "DATA 'increased to ' ([0-9.]+):fail_raw ' %' DATA"
| fieldsAdd fail_rate = bin(toDouble(fail_raw),1)
| filter isNotNull(fail_rate)
| summarize count = count(), by: { fail_rate }
| sort fail_rate asc

and you might get something like

In this case, more than a third of the problems were with error percentages below 6%.
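
The per-Service concentration from the Pareto point can be checked in a similar way. This is a minimal sketch, assuming affected_entity_ids is an array field, so that expand yields one record per affected entity. The problems alias and the limit of 20 are illustrative choices.

```
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
// Assumption: affected_entity_ids is an array; expand gives one record per entity
| expand affected_entity_ids
// Count problems per affected Service, noisiest first
| summarize problems = count(), by: { affected_entity_ids }
| sort problems desc
| limit 20
```

The Services at the top of this list are the candidates for Service-specific thresholds.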
