Our customer is getting "Host CPU Unhealthy" incident alerts and wants to know what caused them. The alert description states "During the last 15 minutes this host reported a high overall CPU usage of 99%, a load greater than 2 (unix based systems only) or more than 15% system CPU usage."
However, there is no indication of which of those three conditions was met to trigger the alert. I checked the Host Health dashboard for the time of the incident, and CPU usage was around 55%.
Can anyone please let me know how to find out which condition triggered the alert, and what its measured value was?
Hi. In this case it is most likely the "System Load" measure, which we currently don't display in the Host Health dashboard. You can chart that measure in a custom chart. It's called "System Load", and you will find it under System Monitoring -> Host Performance.
In case "System Load" is the measure that breaches the threshold, are you able to let me know the exact names of the other two measures (referred to as system CPU usage and overall CPU usage in the alert message)?
All measures used for the Host Health Dashboard are in System Monitoring -> Host Performance -> ...
'CPU System Time' and 'CPU Idle Time' are used to calculate the percentages over time, whereby we alert on CPU Idle Time <= 1%.
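Once you have charted the three values for the incident window, a quick way to see which condition fired is to compare them against the thresholds quoted in the alert description. Below is a minimal Python sketch of that comparison. It is not product code: the function name and the sample load and system CPU values are hypothetical, and the thresholds are simply taken from the alert text (overall CPU usage of 99%, i.e. CPU Idle Time <= 1%; load greater than 2; more than 15% system CPU usage).

```python
def breached_conditions(cpu_usage_pct, system_load, system_cpu_pct):
    """Return the names of the alert conditions that the given values breach.

    Thresholds are assumptions copied from the alert description, not
    read from the product configuration.
    """
    reasons = []
    # Overall CPU usage of 99% corresponds to CPU Idle Time <= 1%.
    if cpu_usage_pct >= 99.0:
        reasons.append("overall CPU usage")
    # Load greater than 2 (unix based systems only).
    if system_load > 2.0:
        reasons.append("System Load")
    # More than 15% system CPU usage.
    if system_cpu_pct > 15.0:
        reasons.append("system CPU usage")
    return reasons


# 55% CPU usage is the value seen on the dashboard in this thread;
# the load (3.2) and system CPU (4%) values are made up for illustration.
print(breached_conditions(55.0, 3.2, 4.0))
```

With CPU usage at 55%, only a System Load above 2 or system CPU usage above 15% could have triggered the alert, which is why charting "System Load" is the first thing to check.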