We have Dynatrace 6.5 running in our environment.
We have set up a couple of incidents which, upon violation, trigger an email that in turn creates a production ticket.
We are seeing that alerts are also generated for short spikes. We want to tune the alerts so that an alert/ticket is triggered only if a metric's threshold is breached for more than 5 minutes.
I have tried to set the Evaluation Timeframe to 5 minutes and used aggregation of average, but it is not working as intended.
For example: I want to create an alert for the metric "Current CPU load" that fires if the load is greater than 60% for 5 minutes or more. How do I go about it?
Please see the snapshots below of the alert setup and the CPU load chart.
If you select a timeframe of one minute and use a PurePath duration measure with an average aggregation as input for this incident rule, the average PurePath duration of the last minute is calculated and checked for violation every 10 seconds. Measures remain in memory for one hour.
Check the average evaluation timeframe examples to understand how it is evaluated.
I have configured the alert for CPU Total time with a 50% threshold and an evaluation timeframe of 10 sec. During my analysis there was one point I could not understand: although the CPU Total time value dropped below the threshold, the incident did not stop and continued.
In the snip below, the incident did not end at the first dip (circled) but ended at the second one. I cannot work out how Dynatrace incident evaluation works here.
Hello @Ravi D.
What is your aggregation? For example: avg (average), count, last, max (maximum), min (minimum), sum, or first.
The action can be triggered when the incident is raised, when the incident is ended, or every time an incident begins or ends.
If execution is set for the beginning and end of an incident, and the incident has a duration of 0 seconds (the start and end time are the same), then the action will only be executed once.
Also have a look at the link below to understand incident rules.
I'm using the average aggregation for the CPU total time. I have already gone through the link you posted, and it didn't help me, which is why I'm here. It says that if the evaluation timeframe is one minute, the rule is checked every 10 sec over that 1 min timeframe and triggers an incident if the condition matches. It also says that, for the incident to end, the value of the measure should be below the threshold for the same 1 min.
But the scenario below confuses me, so please help me understand. The first snap shows the incident with its start and end time; the duration of the incident is 4 min and 50 s.
The second snap, which has the threshold area fully colored, also shows the incident timeframe. As per the Dynatrace doc, the incident should have ended at the highlighted portion, but it continued and ended at 7:49:20, without waiting 1 min to check that the measure value stayed below the threshold.
AppMon employs statistical methods to calculate expected application behavior from historical data and to compare current application behavior against the expected behavior.
Violations are identified if at least two significant measurements are above the threshold.
I recommend reading about significant measurements at the link below to understand the chart you shared with us.
In your screenshot we can see 5 values that together have an average of slightly more than 60 for the minutes 19:40 through 19:44. With my trained eye, I estimate the values to be roughly 41, 55, 90, 68 and 50, which gives an average of 60.8, so the incident triggers before 19:45. Then, when the window moves on and the 41 is replaced by the 15 of 19:45, the average drops to about 55.6, which is below 60, and the incident stops.
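To make the arithmetic above concrete, here is a minimal Python sketch of an average evaluated over a sliding 5-sample window. It is only a model of the reasoning in this post, not AppMon's actual evaluation code. The function name `rolling_average_violations` is made up for illustration, the values are the ones estimated from the screenshot, and it assumes one sample per minute.

```python
from collections import deque


def rolling_average_violations(values, window, threshold):
    """Return (average, violated) for every full window of `window` samples."""
    recent = deque(maxlen=window)  # oldest sample drops out automatically
    results = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            results.append((average, average > threshold))
    return results


# Values estimated from the screenshot for 19:40 .. 19:45 (assumed one per minute).
cpu_load = [41, 55, 90, 68, 50, 15]
results = rolling_average_violations(cpu_load, window=5, threshold=60)

for average, violated in results:
    print(f"average={average:.1f} violated={violated}")
```

The first window (19:40 to 19:44) averages above 60 even though only two of its five samples exceed 60, which is why a short spike is enough to raise the incident with the average aggregation.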
If you want to alert only when the CPU load is consistently greater than 60% for 5 minutes, you need to choose the "min" aggregation. Best regards,
The 'minimum' aggregation over a 5 minute period looks only at the lowest value across that entire 5 minutes. If the minimum over the entire period is above the threshold, then logically the measure must have been in violation for the whole 5 minute period. So if there is, say, a 1 minute period where the measure was above the threshold but then dropped beneath it again, the incident would not trigger.
Whenever I want an 'extended' period of violation before triggering an incident, I ask myself 'in a good situation, should this measure be higher or lower?' and select minimum if lower is good (the alert fires on values that are too high) and maximum if higher is good (the alert fires on values that are too low). Then I set the timeframe to however long I want the measure to be in violation before triggering.
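The min/max rule of thumb can be sketched in a few lines of Python. Again, this is only an illustration of the logic, not how AppMon implements it. The function `sustained_violation` and its `lower_is_better` parameter are hypothetical names, and the sample values are invented.

```python
def sustained_violation(values, threshold, lower_is_better=True):
    """True only if every sample in the timeframe is on the bad side of the threshold.

    lower_is_better=True  -> use the minimum (e.g. CPU load: bad when too high)
    lower_is_better=False -> use the maximum (e.g. throughput: bad when too low)
    """
    if not values:
        return False  # nothing measured, nothing to alert on
    if lower_is_better:
        return min(values) > threshold
    return max(values) < threshold


# CPU load over a 5 minute timeframe, threshold 60%.
short_spike = [41, 55, 90, 68, 50]  # breaches only briefly
sustained = [62, 75, 90, 68, 61]    # above 60 the whole time

print(sustained_violation(short_spike, 60))  # False
print(sustained_violation(sustained, 60))    # True
```

With the average aggregation the `short_spike` series would violate a threshold of 60 (its mean is 60.8), whereas the minimum ignores it because at least one sample stayed below the threshold.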
Does that help?