Solved: Re: Why the 30 minute wait in the Default alerting profile

henk_stobbe · ‎10 Jan 2025

Hello,

When creating an alerting profile you can see:

Resource alert (After 30 mins; Include all entities )

Slowdown alert (After 30 mins; Include all entities )

I understand what it does, and I can ignore it (-; But I am very curious about the underlying philosophy...

So the alerting applies a threshold (or wait) after a problem occurs.

Would it not be far more elegant to wait 30 minutes before creating a problem in the first place?

And why, by default, is there a wait only for Resource and Slowdown?

KR Henk

AntonPineiro · ‎10 Jan 2025

Hi,

I guess it is because resource and slowdown problems would create more alerts at 0 minutes for smaller peaks, for example a CPU going up and down frequently.

But just guessing...

Best regards

❤️ Emacs ❤️ Vim ❤️ Bash ❤️ Perl

ChadTurner · ‎10 Jan 2025

I think overall this is a good discussion, because the alerting profile setting is how long a problem is open before it qualifies for that profile and the associated notification method. So even if you make the case that 0 is too noisy for short-lived segments, that could also be handled in the sampling segment, depending on the metric and detection level.

Really good conversation 🙂

-Chad

RohitBisht · ‎11 Jan 2025

Practically, it makes sense to me. Here is my theory.
Availability alerts: Something has already broken down and needs an immediate fix. P1/P2.
Resource & Slowdown alerts: No immediate action is required, but they are lower-priority issues that still need to be addressed. P3/P4.
This allows engineers to focus on what is relevant and needs action.
Also, we can rely on Davis AI to elevate the severity and correlate with availability alerts.
Lastly, the 30-minute wait reduces noise, as most resource/slowdown spikes resolve on their own within this timeframe.

RB

henk_stobbe · ‎11 Jan 2025

Hi,

Thanks for your reply,

Adding from the docs:

For events:

Apart from the threshold value, you can specify how often the threshold must be violated within a sliding time window to raise an event (violations don't have to be successive). It helps you to avoid alerting too aggressively on single threshold violations. You can set a sliding window of up to 60 minutes.
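
To make the sliding-window rule quoted above concrete, here is a minimal Python sketch. It is not Dynatrace code: the function name `first_event_index`, its parameters, and the one-sample-per-minute CPU values are all made up for illustration. It raises an event once at least `required` of the last `window` samples exceed the threshold, and the violations do not have to be consecutive.

```python
from collections import deque


def first_event_index(samples, threshold, window, required):
    """Return the index of the first sample at which an event would be raised,
    or None if the rule never fires. Hypothetical helper, not a Dynatrace API."""
    # Violation flags (True/False) for the most recent `window` samples only.
    recent = deque(maxlen=window)
    for i, value in enumerate(samples):
        recent.append(value > threshold)
        # Count violations inside the sliding window; they need not be adjacent.
        if sum(recent) >= required:
            return i
    return None


# Assumed data: one CPU reading per minute.
flapping_cpu = [50, 95, 60, 96, 55, 97, 40]
single_spike = [50, 95, 60, 50, 50]

print(first_event_index(flapping_cpu, threshold=90, window=5, required=3))  # 5
print(first_event_index(single_spike, threshold=90, window=5, required=3))  # None
```

With these assumed numbers, a single spike never raises an event, while three violations within five samples do.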

For Alerts:

How long the problem is open before an alert is sent out—this enables you to avoid alerts for low-severity problems that don't affect customer experience and therefore don't require immediate attention

So is this most useful when mailing or paging, and less interesting when integrating with PagerDuty, ServiceNow, ...?

Also, it looks like the wait is almost limitless:

Value must be between 0 and 10000

KR Henk

kalle_lahtinen · ‎13 Jan 2025

The alerting profiles are linked to integrations, and very often these can be ticketing integrations like ServiceNow or Jira. If we immediately created a ticket every time there was some high resource consumption or slowdown, we'd drown in tickets, as most of them would close automatically anyway before anyone had time to react to them. On the other hand, if we waited 30 mins to create the Problem event in the Dynatrace UI, then a person using Dynatrace to investigate something wouldn't notice as easily that something is currently wrong or abnormal.

So yeah, this makes sense to me: the first, the UI notification, is there to let people who are using Dynatrace at the time know that something is up. The latter is an active notification or ticket to a team saying: check this out, do something right now. Both serve a purpose.
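
The two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Dynatrace implements it: the `Problem` class, the `DELAY_MINUTES` table, and both function names are hypothetical, and the 0-minute delay for availability is assumed from the discussion in this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-severity delays in minutes. The 30-minute values mirror the
# default profile discussed in this thread; the 0 for availability is assumed.
DELAY_MINUTES = {"AVAILABILITY": 0, "RESOURCE": 30, "SLOWDOWN": 30}


@dataclass
class Problem:
    severity: str
    opened_at: int                   # minute at which the problem was opened
    closed_at: Optional[int] = None  # minute at which it closed; None while open


def visible_in_ui(problem: Problem, now: int) -> bool:
    """Stage 1: the problem is shown in the UI for as long as it is open."""
    still_open = problem.closed_at is None or now < problem.closed_at
    return problem.opened_at <= now and still_open


def should_notify(problem: Problem, now: int) -> bool:
    """Stage 2: notify or create a ticket only if the problem is still open
    after the delay configured for its severity."""
    if not visible_in_ui(problem, now):
        return False
    return now - problem.opened_at >= DELAY_MINUTES[problem.severity]


short_spike = Problem("RESOURCE", opened_at=0, closed_at=10)
sustained = Problem("RESOURCE", opened_at=0)

print(visible_in_ui(short_spike, now=5), should_notify(short_spike, now=5))   # True False
print(should_notify(short_spike, now=35))                                     # False
print(should_notify(sustained, now=29), should_notify(sustained, now=30))     # False True
```

In this sketch the short spike is visible to anyone looking at the UI while it lasts, but it never produces a ticket; only the sustained problem does.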

henk_stobbe · ‎13 Jan 2025

Hi Kalle,

Great addition!

KR Henk

Abidyaseen · ‎14 Jan 2025

The philosophy behind having a delay specifically for resource and slowdown alerts in Dynatrace stems from the nature of these issues and the need to avoid unnecessary noise. Resource and slowdown problems are often transient and can resolve quickly without intervention. A delay allows Dynatrace to filter out these short-lived issues, ensuring that only significant or sustained problems trigger alerts. Not all resource usage spikes or slowdowns indicate critical issues; the delay helps ensure that a problem is persistent enough to warrant attention, reducing false positives.