We have configured the Unix System Monitor plugin to monitor our OS servers. In the Infrastructure host groupings, we couldn't set the warning and severe thresholds because of an override issue, so we split our hosts into four different host groups by filtering. When we tried to map the plugin measures to incidents, no incidents were logged for those measures. Please advise.
Can you post screenshots? I'm not really sure what you're configuring. Specifically, I'd like to see the incident configuration, including the measures that make up the conditions and their thresholds.
It sounds like you're setting thresholds within the infrastructure settings, which have no effect on hosts that are only monitored through plugin monitors. Each incident you want from monitor results needs to be created manually.
Regarding your rate incident not working, see my comments on that post. Calculated "rate" measurements don't work for incidents.
Regarding your CPU alert, I don't think anything is being overridden. In your screenshot for your incidents you're using "and" logic which means that both criteria need to be fulfilled in order for the incident to trigger. I don't imagine this is the behavior you want. Try switching to "or."
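To illustrate the difference, here is a minimal Python sketch of how "and" versus "or" combines conditions. This is not how the product evaluates incidents internally; the function name, measure names, and threshold values are all hypothetical and only show the logic.

```python
def incident_fires(values, thresholds, logic="and"):
    """Return True if the incident would trigger.

    values and thresholds are dicts keyed by (hypothetical) measure name.
    A condition is violated when the value exceeds its upper threshold.
    """
    violated = [values[name] > limit for name, limit in thresholds.items()]
    if logic == "and":
        # Every condition must be violated at the same time.
        return all(violated)
    # "or": a single violated condition is enough.
    return any(violated)


# Hypothetical readings: CPU is above its threshold, memory is not.
readings = {"cpu_percent": 95.0, "memory_percent": 40.0}
limits = {"cpu_percent": 90.0, "memory_percent": 80.0}

fires_with_and = incident_fires(readings, limits, logic="and")
fires_with_or = incident_fires(readings, limits, logic="or")
```

With these example readings only the CPU condition is violated, so the "and" variant stays quiet while the "or" variant triggers.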
After doing that, if you still have issues, an important step is to chart out the measurements you're using in the incident to make sure you're seeing what you expect (i.e. there are no issues gathering the data in the first place).
I have charted the measurements to check that the plugin measures are picking up data. However, the average and maximum values are very low. Could you please advise how I can set the threshold value for this? Please refer to the attachments.
Please also advise how to set the threshold values in GB for the Used DiskSpace (Disk) and Received Bytes/sec (Network) measures. I am not asking to change the unit; is there a calculation for setting the threshold values in GB instead of %? Attachments: cloud-charts.png, vdp-java-charts.png
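For reference, converting between a percentage threshold and an absolute one is plain arithmetic once the total capacity is known. The sketch below is only an illustration: the function names and the 500 GB capacity are assumptions, and whether the measure itself accepts an absolute threshold depends on the unit the monitor returns.

```python
def percent_to_gb(total_gb, percent):
    """Absolute amount (GB) that corresponds to a percentage of capacity."""
    return total_gb * percent / 100.0


def gb_to_percent(total_gb, used_gb):
    """Percentage of capacity that corresponds to an absolute amount (GB)."""
    return used_gb / total_gb * 100.0


# Assumed example: a 500 GB disk with 80% warning and 90% severe thresholds.
total_capacity_gb = 500.0
warning_gb = percent_to_gb(total_capacity_gb, 80.0)
severe_gb = percent_to_gb(total_capacity_gb, 90.0)
```

For the assumed 500 GB disk, an 80% warning corresponds to 400 GB used and a 90% severe threshold to 450 GB used.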
@James K. As suggested, I removed the default measure (CPU Total Time) and configured the plugin measure (Processor Time) in the CPU Utilization rule at 03:30 PM IST today. For the 1 hour 30 minutes since then, I haven't seen any host log an incident. Please see the attachments: plugins-measure-processor-time.png, incident-havent-logged-any-issues.png
Once I reverted this at 05:15 PM IST, the default CPU measures started logging incidents. Please refer to default-measures-cpu.png.
I'm not sure what the issue is here. I tried twice with the plugin measure Processor Time_CPU, but it is not logging any incidents.
Including both monitor results and agent gathered measures is really confusing things - is there any reason you're trying to use a monitor when it seems like you already have agents on those hosts?
Regardless, the only threshold I've seen in the screenshots you provided for the processor time monitor result measure was 90%. The values in those charts were all well below that, so they wouldn't trigger an incident.
@James K. I suspected the same, so I created a few charts to identify the hosts' values. The incidents have now started to log. Please see plugins-measure1-0314.png and plugins-measure-processor-time-0413.png. My concern now is how to set the threshold limits. Is there a calculation that would help me set the threshold values? For the default measures, we can directly set the thresholds to 90% severe and 80% warning. For these plugin measures, however, the values are very low, which makes it confusing how to set the upper severe, upper warning, and lower severe threshold levels. Please advise.
Please also advise: is there a measure for setting up a Process Availability incident for these OS servers at the system profile level?
I am really not clear on what you're asking. You set the thresholds on the measures being returned by the monitor results which are used in the incidents. You can do this either in the monitor configuration or by editing the measure.
Not sure what process availability incident you're asking about either. These might help:
Thanks for your time, I guess there is a misunderstanding.
Basically, the 90% threshold works fine for the normal/default CPU usage readings. But for this plugin measure, the readings average 12%, and the maximum recorded so far is 14%.
So the challenge now is to identify a new upper severe threshold (for example, 14%) or to come up with a calculation for an ideal threshold for these low CPU usage readings.
For better understanding, please see the attached chart, charted-measure-cpu-0414.png, where you can see that the values are very low.
No calculation for that. It really depends on what normal is in your systems and at what point you need to be alerted to take action. This varies between systems. Some may be fine waiting until 90% and some like to know when it is above 50%. Once you have more data, you can take a look at the peaks and base it off of that.
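If it helps, "basing it off the peaks" could look something like the Python sketch below. This is not an official formula; the function name, the margin values, and the sample readings are assumptions chosen only to illustrate the idea, and the right margins depend on when you actually need to act.

```python
def thresholds_from_peaks(samples, warning_margin=1.25, severe_margin=1.5):
    """Derive upper warning/severe thresholds from the highest observed reading.

    The margins are arbitrary illustration values, not recommendations.
    Results are capped at 100 because the readings are percentages.
    """
    peak = max(samples)
    warning = min(peak * warning_margin, 100.0)
    severe = min(peak * severe_margin, 100.0)
    return warning, severe


# Hypothetical CPU readings in percent, with a peak of 14.
observed = [11.0, 12.5, 14.0, 12.0]
warning, severe = thresholds_from_peaks(observed)
```

With a peak of 14% and these margins, the warning threshold lands at 17.5% and the severe threshold at 21%. Collecting data over a longer period before picking the peak makes the result less sensitive to a quiet week.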
Also just wanted to note, the max and min values you see in the table are based on the aggregation you have charted (the data points in the chart), not on the measurements directly. So the max column in that table is the largest average that was seen in that chart, not necessarily the maximum value that was seen overall. For that, you'd need to configure the series to use maximum aggregation in the chart. That can sometimes be confusing.
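A small Python sketch of why the two differ. The raw readings and the bucket size are made up, and the bucketing here is only a stand-in for whatever resolution the chart actually uses.

```python
from statistics import mean


def aggregate(samples, bucket_size, how):
    """Split samples into consecutive buckets and reduce each bucket with `how`."""
    buckets = [samples[i:i + bucket_size] for i in range(0, len(samples), bucket_size)]
    return [how(bucket) for bucket in buckets]


# Hypothetical raw readings with one short spike to 60.
raw = [10, 12, 11, 60, 9, 10, 11, 12]

avg_series = aggregate(raw, 4, mean)  # what a chart using average aggregation plots
max_series = aggregate(raw, 4, max)   # what a chart using maximum aggregation plots

max_of_averages = max(avg_series)  # the "max" column for the average series
true_max = max(max_series)         # the real peak across all raw readings
```

Here the largest average is 23.25 even though one raw reading reached 60, so a threshold based on the average series' max column would sit far below the real peak.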