10 Jan 2025 02:17 PM
Hello,
When creating an alert profile you can see:
Solved! Go to Solution.
10 Jan 2025 03:31 PM
Hi,
I guess because resource and slowdown events can create more alerts at 0 minutes for smaller peaks, something like a CPU going up and down frequently, etc.
But just guessing...
Best regards
10 Jan 2025 04:07 PM
I think overall this is a good discussion. The Alert Profile determines how long a problem must be open before it qualifies for that profile and the associated notification method. So even if you make the case that 0 is too noisy for short-lived segments, that could also be handled in the sampling segment, depending on the metric and detection level.
Really good conversation 🙂
11 Jan 2025 01:43 AM
Practically it makes sense to me. Here is my theory.
Availability alerts: Something has already broken down and needs immediate fix. P1/P2.
Resource & Slowdown alerts: No immediate action needed, but a lower priority that should still be addressed. P3/P4.
This allows engineers to focus on what's relevant and needs action.
Also, we can rely on Davis AI to elevate the severity and correlate with availability alerts.
Lastly, 30 mins helps reduce noise, as most resource/slowdown spikes get auto-resolved within this timeframe.
11 Jan 2025 09:26 AM - edited 11 Jan 2025 09:27 AM
Hi,
Thanks for your reply,
Adding from the docs:
For events:
Apart from the threshold value, you can specify how often the threshold must be violated within a sliding time window to raise an event (violations don't have to be successive). It helps you to avoid alerting too aggressively on single threshold violations. You can set a sliding window of up to 60 minutes.
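That sliding-window behavior can be sketched roughly like this. This is just a toy model of what the docs describe, not Dynatrace's actual implementation; the function name and parameters are made up for illustration:

```python
from collections import deque

def sliding_window_alert(samples, threshold, window, min_violations):
    """Raise an event when at least `min_violations` of the last `window`
    samples exceed `threshold` (violations don't have to be successive).
    Toy sketch of the documented behavior, not Dynatrace code."""
    recent = deque(maxlen=window)  # rolling record of pass/fail per sample
    events = []
    for i, value in enumerate(samples):
        recent.append(value > threshold)
        if sum(recent) >= min_violations:
            events.append(i)
    return events

# A single spike doesn't fire; three violations within the window do.
sliding_window_alert([10, 95, 10, 96, 97, 10],
                     threshold=90, window=5, min_violations=3)
```

The point being: a lone threshold breach never raises an event, which is exactly the "avoid alerting too aggressively on single threshold violations" idea from the docs.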
For Alerts:
How long the problem is open before an alert is sent out—this enables you to avoid alerts for low-severity problems that don't affect customer experience and therefore don't require immediate attention
So this is most useful when mailing or paging, and less interesting when integrating with PagerDuty, ServiceNow...?
Also, looks like the wait is almost limitless:
KR Henk
13 Jan 2025 09:12 AM - edited 13 Jan 2025 09:28 AM
The alerting profiles are linked to integrations, and very often these are ticketing integrations like ServiceNow or Jira. If we immediately create a ticket every time there's some high resource consumption or a slowdown, we'd drown in tickets, as most of them would close automatically anyway before anyone has time to react. On the other hand, if we wait 30 mins to create the Problem event in the Dynatrace UI, then a person using Dynatrace to investigate something won't as easily notice that something is currently wrong or abnormal.
So yeah, this makes sense to me: the first UI notification is there to let people who are using Dynatrace at the time know something is up. The latter is an active notification or ticket to a team saying: check this out, do something right now. Both serve a purpose.
13 Jan 2025 09:21 AM - edited 14 Jan 2025 09:52 AM
Hi Kalle,
Great addition!
KR Henk
14 Jan 2025 08:01 AM
The philosophy behind having a delay specifically for resource and slowdown alerts in Dynatrace stems from the nature of these issues and the need to avoid unnecessary noise. Resource and slowdown problems are often transient and can resolve quickly without intervention. A delay allows Dynatrace to filter out these short-lived issues, ensuring that only significant or sustained problems trigger alerts. Not all resource usage spikes or slowdowns indicate critical issues; the delay helps ensure the problem is persistent enough to warrant attention, reducing false positives.
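To make that filtering concrete, here's a minimal sketch (my own toy example, not Dynatrace's API): only problems that are still open after the configured delay get a notification, so transient spikes that auto-resolve never page anyone.

```python
def problems_to_alert(problems, delay_minutes=30):
    """Given (name, open_duration_minutes, auto_resolved) tuples, return
    the ones worth notifying on: still open and older than the delay.
    Hypothetical helper illustrating the alerting-profile delay."""
    return [name for name, duration, resolved in problems
            if not resolved and duration >= delay_minutes]

# Short-lived spikes auto-resolve and never create a ticket;
# a sustained slowdown does.
problems_to_alert([("cpu-spike", 5, True),
                   ("slow-service", 45, False),
                   ("memory-leak", 12, False)])
```

Here only "slow-service" would page a team: the CPU spike resolved itself, and the 12-minute problem hasn't yet qualified for the 30-minute profile.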