30 Dec 2025 02:57 PM
How are DT engineers working through Alert fatigue?
How are DT engineers figuring out the best time settings for the below (services, Web, etc.)?
I know adjusting these settings is a balancing act between speed (MTTD - Mean Time to Detect) and accuracy (reducing false positives).
Anybody got any good dashboards or notebooks that help with this?
I have various Problems dashboards that I can share if helpful, but they don't look at how many problems occur at various times. Here is a DQL query we started using to figure out some services.
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false) or contains(event.name, "Response time degradation", caseSensitive:false)
| fields timestamp, event.name, resolved_problem_duration, affected_entity_ids, root_cause_entity_id
| summarize count(), by: {resolved_problem_duration}
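To get at the other half of the question, how many problems occur at various times, here is a minimal sketch along the same lines; it assumes bucketing with bin() on the timestamp fits your use case, and hour_bucket / problem_count are just illustrative names:

fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false) or contains(event.name, "Response time degradation", caseSensitive:false)
// count closed service problems per 1-hour bucket rather than per duration value
| summarize problem_count = count(), by: {hour_bucket = bin(timestamp, 1h)}
| sort hour_bucket asc

Plotting problem_count over hour_bucket in a notebook should show when the noisy periods are.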
30 Dec 2025 09:58 PM
I have done this exercise with several clients. What seems to work best from my perspective:
Regarding #2 above, you can graph it with
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
| parse event.description, "DATA 'increased to ' ([0-9.]+):fail_raw ' %' DATA"
| fieldsAdd fail_rate = bin(toDouble(fail_raw),1)
| filter isNotNull(fail_rate)
| summarize count = count(), by: { fail_rate }
| sort fail_rate asc
and you might get something like this:
In this case, more than a third of the problems were with error percentages below 6%.
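If you want to quantify that share instead of reading it off the chart, here is a minimal sketch reusing the same parse pattern; the 6% cut-off is only the example threshold from above, and it assumes countIf() is available as an aggregation in your environment:

fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
| parse event.description, "DATA 'increased to ' ([0-9.]+):fail_raw ' %' DATA"
| fieldsAdd fail_rate = toDouble(fail_raw)
| filter isNotNull(fail_rate)
// share of failure-rate problems whose parsed rate stayed below the 6% example threshold
| summarize total = count(), below_threshold = countIf(fail_rate < 6)
| fieldsAdd pct_below = toDouble(below_threshold) / toDouble(total) * 100

That single pct_below number makes it easier to compare candidate thresholds before touching the anomaly detection settings.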