<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Alert fatigue, Dashboards, recommendations? in Alerting</title>
    <link>https://community.dynatrace.com/t5/Alerting/Alert-fatigue-Dashboards-recommendations/m-p/292319#M6140</link>
    <description>&lt;P&gt;&lt;a href="https://community.dynatrace.com/t5/user/viewprofilepage/user-id/36140"&gt;@Kenny_Gillette&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I have done this exercise with several clients. Here is what seems to work best, from my perspective:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Keep 1 minute for "Failure rate" so that MTTD stays fast.&lt;/LI&gt;&lt;LI&gt;In "Failure rate", start with a higher "Absolute threshold". 1% and 5% have worked for me. More details below.&lt;/LI&gt;&lt;LI&gt;In "Response time", an absolute threshold of 300 ms is my favorite, but it really depends on many factors. In Portugal, where the Cloud is not near and latency is always a problem, I normally suggest a higher value.&lt;/LI&gt;&lt;LI&gt;Pareto tells us that 80% of the problems will come from 20% of the Services. In my case, I normally see 95%/5%, so setting specific thresholds for specific Services yields great results. You can identify those Services if you summarize by&amp;nbsp;affected_entity_ids.&lt;/LI&gt;&lt;LI&gt;In the end, the best approach is to fix all those errors. Response time is harder to fix, though.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regarding #2 above, you can graph it with this query:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
| parse event.description, "DATA 'increased to ' ([0-9.]+):fail_raw ' %' DATA"
| fieldsAdd fail_rate = bin(toDouble(fail_raw),1)
| filter isNotNull(fail_rate)
| summarize count = count(), by: { fail_rate }
| sort fail_rate asc&lt;/LI-CODE&gt;&lt;P&gt;and you might get something like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AntonioSousa_0-1767131787112.png" style="width: 999px;"&gt;&lt;img src="https://community.dynatrace.com/t5/image/serverpage/image-id/31373iCB7BFB59E4A6B209/image-size/large?v=v2&amp;amp;px=999" role="button" title="AntonioSousa_0-1767131787112.png" alt="AntonioSousa_0-1767131787112.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;In this case, more than a third of the problems had failure rates below 6%.&lt;/P&gt;</description>
    <pubDate>Tue, 30 Dec 2025 21:58:40 GMT</pubDate>
    <dc:creator>AntonioSousa</dc:creator>
    <dc:date>2025-12-30T21:58:40Z</dc:date>
    <item>
      <title>Alert fatigue, Dashboards, recommendations?</title>
      <link>https://community.dynatrace.com/t5/Alerting/Alert-fatigue-Dashboards-recommendations/m-p/292298#M6139</link>
      <description>&lt;P&gt;How are DT engineers working through alert fatigue?&lt;/P&gt;&lt;P&gt;How are DT engineers figuring out the best time settings for the options below (services, web, etc.)?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Kenny_Gillette_0-1767106262783.png" style="width: 400px;"&gt;&lt;img src="https://community.dynatrace.com/t5/image/serverpage/image-id/31370i6EB2256A42932A2F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Kenny_Gillette_0-1767106262783.png" alt="Kenny_Gillette_0-1767106262783.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I know adjusting these settings is a balancing act between&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;speed&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;(MTTD - Mean Time to Detect) and&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;accuracy&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;(reducing false positives).&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Does anybody have any good dashboards or notebooks that help with this?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I have various Problems dashboards that I can share if helpful, but they don't look at how many problems occur at various times. Here is a DQL query we started to use to figure this out for some services.&lt;/P&gt;&lt;P&gt;fetch dt.davis.problems&lt;BR /&gt;| filter event.status == "CLOSED"&lt;BR /&gt;| filter affected_entity_types[0] == "dt.entity.service"&lt;BR /&gt;| filter contains(event.name, "Failure rate increase", caseSensitive:false) or contains(event.name, "Response time degradation", caseSensitive:false)&lt;BR /&gt;| fields timestamp, event.name, resolved_problem_duration, affected_entity_ids, root_cause_entity_id&lt;BR /&gt;| summarize count(), by: {resolved_problem_duration}&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Kenny_Gillette_2-1767106535373.png" style="width: 400px;"&gt;&lt;img 
src="https://community.dynatrace.com/t5/image/serverpage/image-id/31372i1BEBACE16820AB5D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Kenny_Gillette_2-1767106535373.png" alt="Kenny_Gillette_2-1767106535373.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 30 Dec 2025 14:57:22 GMT</pubDate>
      <guid>https://community.dynatrace.com/t5/Alerting/Alert-fatigue-Dashboards-recommendations/m-p/292298#M6139</guid>
      <dc:creator>Kenny_Gillette</dc:creator>
      <dc:date>2025-12-30T14:57:22Z</dc:date>
    </item>
    <item>
      <title>Re: Alert fatigue, Dashboards, recommendations?</title>
      <link>https://community.dynatrace.com/t5/Alerting/Alert-fatigue-Dashboards-recommendations/m-p/292319#M6140</link>
      <description>&lt;P&gt;&lt;a href="https://community.dynatrace.com/t5/user/viewprofilepage/user-id/36140"&gt;@Kenny_Gillette&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I have done this exercise with several clients. Here is what seems to work best, from my perspective:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Keep 1 minute for "Failure rate" so that MTTD stays fast.&lt;/LI&gt;&lt;LI&gt;In "Failure rate", start with a higher "Absolute threshold". 1% and 5% have worked for me. More details below.&lt;/LI&gt;&lt;LI&gt;In "Response time", an absolute threshold of 300 ms is my favorite, but it really depends on many factors. In Portugal, where the Cloud is not near and latency is always a problem, I normally suggest a higher value.&lt;/LI&gt;&lt;LI&gt;Pareto tells us that 80% of the problems will come from 20% of the Services. In my case, I normally see 95%/5%, so setting specific thresholds for specific Services yields great results. You can identify those Services if you summarize by&amp;nbsp;affected_entity_ids.&lt;/LI&gt;&lt;LI&gt;In the end, the best approach is to fix all those errors. Response time is harder to fix, though.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regarding #2 above, you can graph it with this query:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;fetch dt.davis.problems
| filter event.status == "CLOSED"
| filter affected_entity_types[0] == "dt.entity.service"
| filter contains(event.name, "Failure rate increase", caseSensitive:false)
| parse event.description, "DATA 'increased to ' ([0-9.]+):fail_raw ' %' DATA"
| fieldsAdd fail_rate = bin(toDouble(fail_raw),1)
| filter isNotNull(fail_rate)
| summarize count = count(), by: { fail_rate }
| sort fail_rate asc&lt;/LI-CODE&gt;&lt;P&gt;and you might get something like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AntonioSousa_0-1767131787112.png" style="width: 999px;"&gt;&lt;img src="https://community.dynatrace.com/t5/image/serverpage/image-id/31373iCB7BFB59E4A6B209/image-size/large?v=v2&amp;amp;px=999" role="button" title="AntonioSousa_0-1767131787112.png" alt="AntonioSousa_0-1767131787112.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;In this case, more than a third of the problems had failure rates below 6%.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Dec 2025 21:58:40 GMT</pubDate>
      <guid>https://community.dynatrace.com/t5/Alerting/Alert-fatigue-Dashboards-recommendations/m-p/292319#M6140</guid>
      <dc:creator>AntonioSousa</dc:creator>
      <dc:date>2025-12-30T21:58:40Z</dc:date>
    </item>
  </channel>
</rss>

