I understand there are two ways we can set alerts in Dynatrace - Static and Auto-baselining (AI solution).
My question is: does the latter work for infrastructure alerts like CPU and memory?
Today we have custom static alerts set at 70% CPU and memory consumption, but I feel they're not very effective. Our infrastructure is in AWS (EC2 and ECS Fargate). What would be the best way of setting infra alerts so as to reduce alert storms and false positives?
The auto-baselining for infrastructure alerts works very well in my opinion. Almost no false positives, and when they occur it's because the high CPU was temporary. And almost no false negatives; the only ones I've seen were CPU spikes lasting just a few seconds, which is expected, since ignoring such short spikes is what keeps alert spam down.
I too tried some static definitions in the past, believing I could do better than Davis. I was proved wrong several times, so now I stick with auto-baselining...
This is something that I'm still unsure about in Dynatrace with host monitoring... is there a baseline for it?
As far as I know it is always static thresholds, but you can choose to leave them as "automatic" or set them manually.
When we leave it as automatic, does Dynatrace create a baseline for it the same way it does for transaction response times, for example? I guess not.
Am I wrong? 🤔
CPU is much more volatile than response times. So the way Dynatrace calculates it is based on the thresholds that you point to, and that you can change:
This is the static approach, am I right? With a sliding window.
Where can we define sliding window size?
Sliding window size is 3 out of 5 samples (where a sample is a 1-minute sample made up of 6 measurements per minute). You can't modify the out-of-the-box mechanism's sliding window size, but you can use your own sliding window size in 'custom events for alerting'.
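To make the "3 out of 5 samples" idea concrete, here is a minimal Python sketch of how such a sliding window could be evaluated. It is only an illustration, not Dynatrace's actual implementation: the function name `sliding_window_alerts`, the 90% threshold and the sample values are all made up, and each value is assumed to already be the 1-minute aggregate of the 6 measurements.

```python
from collections import deque

def sliding_window_alerts(samples, threshold, violating=3, window=5):
    """For each 1-minute sample, report whether an alert would be active.

    An alert is active when at least `violating` of the last `window`
    samples are above `threshold` (hypothetical helper, not a Dynatrace API).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` results
    states = []
    for value in samples:
        recent.append(value > threshold)
        states.append(sum(recent) >= violating)
    return states

# Illustrative CPU % per minute: one isolated spike, then a sustained one.
cpu = [50, 95, 60, 55, 50, 92, 96, 91, 60, 50]
states = sliding_window_alerts(cpu, threshold=90)
```

With these made-up values the single spike in minute 2 never opens an alert, while the sustained spike does once its third violating sample arrives. That is the trade-off mentioned above: short spikes are ignored to keep alert noise down.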
The general rule of thumb to determine if the Dynatrace AI (Davis) applies to the entity you are monitoring comes down to whether or not the OneAgent is installed and touches that entity. Things such as host CPU, memory, disk and network all fall under the AI (Davis) because the host has the OneAgent installed on it.
Things that don't have a OneAgent, like K8 Cluster Metrics or Azure Subscription data such as Azure Firewall, Service Bus etc., are all pulled via API calls, so there is no OneAgent to leverage the AI (Davis). You can however use a workaround for these types of things, such as Custom Events for Alerting, and set the thresholds as adaptive.
Still not clear to me... do infrastructure alerts use baselines too?
If I have a host that keeps memory utilization always at 97% (due to some application requirement), will Dynatrace understand that this is "normal" and not alert?
Or, let's say I have a disk whose utilization is always at 5%, then suddenly it rises to 85% (still below the default static threshold), will Dynatrace understand that this is a problem (due to the spike)?
If your Metrics are set to "Automatic" then AI baselining will be used. If you have it set to custom and set the values static, then those values will be used and not AI Detection.
Static is static. If the metric breaches the threshold, it alerts, no ifs, ands or buts.
AI uses baselines. If the metric spikes, Dynatrace looks at the baseline and asks itself: does this deviate from the norm? If yes, it alerts; if not, it doesn't. If this spike happens every Tuesday night at 9 PM, that will be added into the baseline. That spike might be normal because jobs are running in off hours, but with a static threshold it would alert every time.
Think of it as a speed camera. Speed cameras would be static: speed above X, ticket sent. A policeman would be the AI and can decide for himself: 1-2 over the speed limit is fine, but 10 over is a ticket.
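To illustrate the difference in code, here is a rough Python sketch comparing a static threshold with a very simple seasonal baseline (mean plus a few standard deviations per weekday/hour slot). This is NOT how Davis computes its baselines; the helper names, the `k` and `min_margin` parameters and the sample numbers are all assumptions chosen just to show the idea.

```python
from statistics import mean, stdev

def static_alert(value, threshold=70.0):
    """Static rule: alert whenever the value is above the threshold."""
    return value > threshold

def build_baseline(history):
    """Group past values by time slot, e.g. ("Tue", 21), and return
    slot -> (mean, standard deviation). Hypothetical helper."""
    by_slot = {}
    for slot, value in history:
        by_slot.setdefault(slot, []).append(value)
    return {
        slot: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for slot, vals in by_slot.items()
    }

def baseline_alert(slot, value, baseline, k=3.0, min_margin=5.0):
    """Alert only if the value is well above what is normal for this slot."""
    mu, sigma = baseline[slot]
    return value > mu + max(k * sigma, min_margin)

# Made-up CPU history: a batch job runs every Tuesday at 21:00.
history = [(("Tue", 21), v) for v in (88, 90, 92, 89)]
history += [(("Tue", 14), v) for v in (20, 22, 19, 21)]
baseline = build_baseline(history)

# Tuesday 21:00 at 91%: normal for that slot, but above the static 70%.
job_static = static_alert(91)
job_baseline = baseline_alert(("Tue", 21), 91, baseline)

# Tuesday 14:00 at 60%: below the static 70%, but far from the usual ~20%.
odd_static = static_alert(60)
odd_baseline = baseline_alert(("Tue", 14), 60, baseline)
```

In this toy example the static rule fires on the recurring Tuesday job and stays silent on the unusual afternoon jump, while the baseline rule does the opposite.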
One real example here to illustrate why I believe there is no baseline involved in infrastructure anomaly detection:
Every day we have Problems generated for high network utilization due to the DB2 backups.
If the baseline really exists, why doesn't Davis understand the network behavior and just "ignore" the peaks?
The network anomaly detection is set to Automatic:
Have people been using Custom Alerts for CPU, memory and disk with success using "Auto-adaptive baseline"? We are trying it out in a lower environment and are curious what other people's results are. Thanks for the input.
We tested in our lower environment and it seems like auto-adaptive does NOT work well for CPU, memory and disk.