Alerting
Questions about alerting and problem detection in Dynatrace.

Alert fatigue, Dashboards, recommendations?

Kenny_Gillette
DynaMight Leader

How are DT engineers working through Alert fatigue?

How are DT engineers figuring out the best time windows for the settings below (services, web, etc.)?

Kenny_Gillette_0-1767106262783.png

I know adjusting these settings is a balancing act between speed (MTTD - Mean Time to Detect) and accuracy (reducing false positives).

 

Anybody got any good dashboards or notebooks that help with this?

I have various Problems dashboards that I can share if helpful, but they don't look at how many problems occur at various times. Here is a DQL query we started using to look at some services.

// Closed Davis problems where the first affected entity is a service
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
// Only failure-rate and response-time problem types
| filter contains(event.name, "Failure rate increase", caseSensitive:false) or contains(event.name, "Response time degradation", caseSensitive:false)
| fields timestamp, event.name, resolved_problem_duration, affected_entity_ids, root_cause_entity_id
// How many problems ended up with each resolved duration
| summarize count(), by: {resolved_problem_duration}

 

Kenny_Gillette_2-1767106535373.png
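
Since the dashboards above don't show how many problems open up at various times, here is a rough sketch of a variation on the same query that bins problems by hour. It only reuses fields already present in that query; the 1h bin size is just an assumption and can be widened (e.g. to 1d) for longer timeframes.

// Sketch: count closed service problems per hour to see when they cluster
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false) or contains(event.name, "Response time degradation", caseSensitive:false)
// Assumed bin size of one hour; adjust to taste
| fieldsAdd interval = bin(timestamp, 1h)
| summarize problem_count = count(), by: { interval }
| sort interval asc

Swapping the grouping field from interval to event.name gives a quick view of which detection types fire most often, which helps when chasing alert fatigue.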

 

Dynatrace Certified Professional
1 Reply

AntonioSousa
DynaMight Guru

@Kenny_Gillette,

I have done this exercise with several clients. What seems to work best from my perspective:

  1. Maintain 1 minute for "Failure rate" so MTTD is fast
  2. In "Failure rate" start with a higher "Absolute threshold". 1% and 5% have worked for me. More details below.
  3. In "Response time", an absolute threshold of 300 ms is my favorite. But it really depends on a lot of factors; in Portugal, where the cloud regions are not nearby and latency is always a problem, I normally suggest a higher value.
  4. Pareto tells us that 80% of the problems will be on 20% of the Services. In my case, I normally see 95%/5%, so setting specific thresholds for specific Services yields great results... You can see them if you summarize by affected_entity_ids, as in the sketch after this list.
  5. But in the end, the best approach is to correct the errors themselves. Response time is more difficult, though...
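
Regarding #4, a minimal sketch of that summarize, reusing the filters from your query (with "Failure rate increase" as the example event name and the limit of 20 as an arbitrary cutoff):

// Sketch: which services account for most of the failure-rate problems (Pareto view)
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
| summarize problem_count = count(), by: { affected_entity_ids }
| sort problem_count desc
| limit 20

The few entities at the top of that list are usually the ones worth setting service-specific thresholds for (or fixing outright).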

 

Regarding #2 above, you can graph it with 

// Closed failure-rate problems on services
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
// Extract the failure percentage from the problem description
| parse event.description, "DATA 'increased to ' ([0-9.]+):fail_raw ' %' DATA"
// Bucket the extracted value into 1% bins
| fieldsAdd fail_rate = bin(toDouble(fail_raw), 1)
| filter isNotNull(fail_rate)
| summarize count = count(), by: { fail_rate }
| sort fail_rate asc

and you might get something like

AntonioSousa_0-1767131787112.png

In this case, more than a third of the problems had error percentages below 6%.
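
To put a number on that share without eyeballing the chart, the same parse can feed a single-row summary. This is just a sketch: the 6% cutoff is an assumption to adjust, and it uses countIf to count the matching rows.

// Sketch: share of failure-rate problems below a chosen percentage (6% is an assumed cutoff)
fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
| parse event.description, "DATA 'increased to ' ([0-9.]+):fail_raw ' %' DATA"
| fieldsAdd fail_rate = toDouble(fail_raw)
| filter isNotNull(fail_rate)
| summarize total = count(), below_threshold = countIf(fail_rate < 6)
| fieldsAdd share_below_pct = 100.0 * below_threshold / total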

Antonio Sousa
