
Need to understand why incidents are firing off when the event is well below the threshold

dave_deleo
Inactive

We have a measure based on the metric Total GC utilization (java virtual machine). We set an upper Warning threshold of 15% and applied the measure to an incident we created. However, we are being bombarded with incidents even though the threshold is not being hit; the value is averaging 8%. The log shows the following:

ViolationThresholdType=Upper Warning,ViolationThresholdValue=15.0,TriggerValue=100.0

The end of the log entry shows that the upper boundary was hit.

Does anyone have any thoughts on what might be causing this?

Thanks,

David

4 REPLIES

david_n
Inactive

Hello David,

I would check the settings for the GC utilization incident and make sure that the aggregation is not set to maximum. It sounds like either the incident is being raised whenever your GC spikes to 15%, or the time frame you have set for the incident is too short. A very short time frame is almost the same as setting the aggregation to maximum, because the rule looks at very short intervals to decide whether to raise an incident.

Thanks,

David Nicholls
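
To illustrate the difference between the two aggregation settings, here is a minimal Python sketch. It is not Dynatrace code: the `violates` helper, the sample values, and the strict "greater than" comparison are all assumptions made for illustration. Only the 15% threshold comes from the thread.

```python
from statistics import mean

THRESHOLD = 15.0  # upper Warning threshold in percent (from the thread)


def violates(samples, aggregation):
    """Hypothetical check: does the aggregated window exceed the threshold?

    Assumes a violation means 'aggregate > threshold'; the real rule
    engine may compare differently.
    """
    aggregate = {"average": mean, "maximum": max}[aggregation]
    return aggregate(samples) > THRESHOLD


# Made-up GC utilization samples (percent) with one short spike.
window = [8.0, 7.0, 9.0, 40.0, 8.0, 8.0]

print(violates(window, "maximum"))  # True: the single 40% spike decides
print(violates(window, "average"))  # False: the mean is about 13.3%
```

With "maximum", one spike in the window is enough to raise an incident; with "average", the spike has to be large or long enough to pull the whole window over the threshold.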

dave_deleo
Inactive

Thanks, David, for the quick response! I checked the evaluation time frame: it is 10 seconds, and the aggregation is set to "average". So the rule takes the average within a 10-second time frame and creates an incident if that average hits the threshold. If that's true, 10 seconds would seem reasonable for a GC average, but I think the best way to find out is to test it. The next setting is 1 minute; I wish there were a middle choice 🙂 I will let you and the community know how it goes. Thanks again!

Hello David,

Glad to help. Remember, 10 seconds is the smallest interval at which Dynatrace captures measures. That is why it was firing alerts every time the GC average hit 15%. If the evaluation time frame is set to 10 seconds, DT evaluates your incident rule every time it captures the measure. I think it will work as you expect with a longer time frame. Feel free to let us know how the testing goes.

Thanks,

David Nicholls

dave_deleo
Inactive

Hi David and community! That did the trick. I did have to raise it to 5 minutes rather than 1 minute; 1 minute was still giving us lots of incidents. With a 15% threshold, I can see how a single spike within 1 minute could skew the average and cause the same problem. Bottom line: your recommendation solved the issue. Thanks much, David!
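
The arithmetic behind that result can be sketched in a few lines of Python. This is a simplified model, not Dynatrace behavior: it assumes exactly one 10-second sample at 100% (the TriggerValue in the log) per evaluation window, with every other sample at the 8% average mentioned in the original post. The function name `window_average` is hypothetical.

```python
SAMPLE_INTERVAL_S = 10  # smallest capture interval mentioned in the thread
THRESHOLD = 15.0        # upper Warning threshold in percent
BASELINE = 8.0          # typical GC utilization reported in the thread
SPIKE = 100.0           # TriggerValue from the log line


def window_average(window_seconds):
    """Average of a window holding one spike sample and baseline otherwise.

    Assumption: a single spike per window; real traffic will differ.
    """
    n = window_seconds // SAMPLE_INTERVAL_S
    return (SPIKE + (n - 1) * BASELINE) / n


for seconds in (10, 60, 300):
    avg = window_average(seconds)
    print(seconds, round(avg, 2), avg > THRESHOLD)
# 10 100.0 True
# 60 23.33 True
# 300 11.07 False
```

Under these assumptions a single spike still pushes the 1-minute average over 15%, while the 5-minute average stays below it, which is consistent with what was observed.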