
Advice on alerting if CPU suddenly significantly goes and stays above normal baseline but under alerting threshold

JordanGreen
Frequent Guest

Hi 

On a few of our hosts, CPU usage spiked significantly and stayed at that abnormal level for a few days, as shown below. Is there a way to detect a sudden increase in CPU above our normal baseline? We didn't get an alert because usage didn't hit our 95% threshold, but we would like to be notified of a sudden, significant change in CPU, say an increase of 20-30% sustained for a period of time. Does anyone have a recommendation on the best way to do this?

 

JordanGreen_0-1755865814289.png

 

 

1 REPLY 1

Mizső
DynaMight Guru

Hi @JordanGreen 

Auto-adaptive thresholds for anomaly detection — Dynatrace Docs

In Managed, you can experiment with an auto-adaptive baseline metric event (tuning signal fluctuation and duration as well) under Anomaly detection, using this metric expression:

builtin:host.cpu.usage:splitBy("dt.entity.host")

Mizs_0-1755982681417.png

In SaaS, you can use Davis Anomaly Detection from a Notebook (again, tune fluctuation and duration):

Mizs_1-1755982876896.png

Mizs_2-1755982943241.png

In my experience, these CPU patterns are related to a single process, e.g. antivirus, compression, or a Java process during GC suspension, so you can also try monitoring process CPU usage with the parents transformation (to get the host information):

Managed metric expression:

builtin:tech.generic.cpu.usage:parents:splitBy("dt.entity.process_group_instance","dt.entity.host")

SaaS DQL:

timeseries usage = avg(dt.process.cpu.usage), by: { dt.entity.process_group_instance, dt.entity.host }
| fieldsAdd entityName(dt.entity.process_group_instance), entityName(dt.entity.host) 

 

I have another idea for a metric expression and a DQL query that you can also try.

Metric expression:

(builtin:host.cpu.usage:splitBy("dt.entity.host"):avg:sort(value(auto,descending)):rollup(avg,15m))-(builtin:host.cpu.usage:splitBy("dt.entity.host"):avg:sort(value(auto,descending)):rollup(avg,15m):timeshift(-1h))

DQL:

timeseries usage = avg(dt.host.cpu.usage), by: { dt.entity.host }
| fieldsAdd usage = arrayMovingAvg(usage, 15)
| sort arraySum(usage) desc
| join [ timeseries usage = avg(dt.host.cpu.usage), by: { dt.entity.host }, shift: -1h
| fieldsAdd usage = arrayMovingAvg(usage, 15)
| sort arraySum(usage) desc ], on: { dt.entity.host }, fields: { operand = usage }
| fieldsAdd expression = usage[] - operand[]
| fieldsRemove usage, operand
| fieldsAdd entityName(dt.entity.host)

Mizs_3-1755984922710.png

Long positive "hills" in this difference can be a good trigger for problem creation.
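To show what such a hill looks like, here is a minimal offline sketch in Python. It is not Dynatrace code. It assumes one-minute CPU samples, mirrors the 15-minute moving average and 1-hour shift used above, and the function names, the 20-point rise and the 10-minute minimum length are hypothetical values chosen only for illustration.

```python
from statistics import mean


def moving_avg(values, window):
    """Trailing moving average; uses a shorter window at the start of the series."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


def shifted_difference(values, window=15, shift=60):
    """Smoothed value minus the smoothed value `shift` samples earlier.

    Returns None where no earlier sample exists.
    """
    smooth = moving_avg(values, window)
    return [None if i < shift else smooth[i] - smooth[i - shift]
            for i in range(len(smooth))]


def find_hills(diff, min_rise=20.0, min_len=10):
    """Return (start, end) index pairs, end inclusive, where diff stays at or
    above min_rise for at least min_len consecutive samples."""
    hills, start = [], None
    for i, d in enumerate(diff):
        above = d is not None and d >= min_rise
        if above and start is None:
            start = i
        elif not above and start is not None:
            if i - start >= min_len:
                hills.append((start, i - 1))
            start = None
    if start is not None and len(diff) - start >= min_len:
        hills.append((start, len(diff) - 1))
    return hills


# Synthetic data: 3 hours at 30% CPU, then a step up to 60% for 3 hours.
cpu = [30.0] * 180 + [60.0] * 180
hills = find_hills(shifted_difference(cpu))
print(hills)  # [(189, 244)]
```

Note that the hill in this example lasts a little under an hour even though CPU stays high: once the shifted series also contains the new level, the difference drops back to zero. So a shift-based difference catches the jump itself, and the length of the shift limits how long the hill can be.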

I hope it helps.

Best regards,

János

Dynatrace Community RockStar 2024, Certified Dynatrace Professional
