I created a custom measure for CPU Total time (e.g. warning 60, severe 70), and then an incident based on that measure. My problem is that when I change the threshold values (either via the measure edit dialog, or via drill through with the Incident edit dialog), the incident then fails to honor the changed values. The measure (and incident) both display the new values, but the violation alerts keep coming based on the old values. I have tried removing and then re-adding the measure to the incident, but that does not change the behavior. Am I doing something wrong, or should I open a support ticket on this?
BTW, I scanned through dozens of incident-related Answers posts and did not see this issue raised.
A support ticket is a good idea to understand what is happening, but if you want to attempt a quick fix you may want to just 'copy and paste' the incident and essentially make a new one but with all the same configurations. I experienced an issue where my downtimes weren't being applied and simply recreating the incident in this manner resolved it.
I deactivated the original incident, and did a copy / paste to create a new incident with name "Oldname - Copy". But the violations, now coming from the Copy, continue to show the "old" threshold settings.
The old ones are what show up in the incident details dashboard.
I just replicated this issue in my 6.2 EasyTravel installation, which I use for reference. Here are the steps:
1. Create a new measure based on Host Performance/CPU Total Time (e.g. TEST CPU Total Time). Assign it some trigger thresholds (e.g. 10% upper warning, 20% upper severe) that your easyTravel server can easily be induced to violate, but that will not be violated constantly.
2. Create a new incident using this measure. Use out-of-the-box settings (e.g. 10 s evaluation, etc.), with two differences. First, have the incident fire for both warning and severe threshold violations. Second, turn off smart alerting so you can easily generate multiple incident violations.
3. [optional] for ease of reference have the incident generate an email.
4. After inducing a few violations, change the threshold levels (e.g. 9%, 19%), either via the measure edit dialog or by drilling through the incident dialog, and then induce some more violations.
The behavior is the same: the new incidents carry the "old" threshold settings.
BTW, my servers are at v.22.214.171.1247.
I've been trying to get a better understanding of how host metric based measures and incidents work in 6.2. Is this incident defined at the global infrastructure level or within one of the system profiles?
I'll mess around with easyTravel to recreate and investigate.
I've had a similar issue with a different measure after trying to adjust some of its values; I had to rename the measure to resolve it. Perhaps we should both raise support tickets, as this may not be an isolated occurrence.
Did further testing with EasyTravel, and I discovered that if you "confirm" an incident violation (in the incidents dashlet) then the incident will be refreshed and pick up any changed threshold values, which will show up in subsequent violations.
I also had a copy of the incident subscribed to the exact same measure, and it also started picking up the changed threshold values once the original incident violation was "confirmed".
That's an interesting find. If you have an open incident, the changes are not applied. In a way it makes sense to keep the old thresholds in case you still have open incidents that were based on them. But I can see that this is rather confusing. Thoughts on this behavior?
It's confusing, and possibly broken, I think.
If an incident is active (started but not ended) when the associated measure threshold is changed, then I agree, the new threshold value should not take effect until all started incidents are ended. But in this case, all the incidents were ended (they just were not "confirmed"), and the threshold change did not take effect.
The act of "confirming" just one of the incident violations caused the incident to start honoring the new value. So the current workaround is to confirm a violation after you make the threshold change. That is a little clunky and definitely not intuitive I think.
Would you suggest I open a support ticket for this?
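For anyone who wants to reason about the behavior described in this thread, here is a toy Python model of it. This is only a sketch of what was observed, not how the product is actually implemented: the class names, the cached-threshold idea, and the `confirm` method are all hypothetical. It assumes the incident evaluates against a snapshot of the measure's thresholds and only refreshes that snapshot when a violation is confirmed.

```python
class Measure:
    """Hypothetical stand-in for a measure with upper warning/severe thresholds."""

    def __init__(self, warning, severe):
        self.warning = warning
        self.severe = severe


class Incident:
    """Toy model: evaluates against a cached copy of the measure's thresholds."""

    def __init__(self, measure):
        self.measure = measure
        # Assumption: the incident snapshots the thresholds when it is created.
        self._cached = (measure.warning, measure.severe)

    def evaluate(self, value):
        """Return (level, warning, severe) for a violation, or None."""
        warning, severe = self._cached
        if value >= severe:
            return ("severe", warning, severe)
        if value >= warning:
            return ("warning", warning, severe)
        return None

    def confirm(self):
        """Assumption: confirming a violation refreshes the cached thresholds."""
        self._cached = (self.measure.warning, self.measure.severe)


if __name__ == "__main__":
    measure = Measure(warning=10, severe=20)
    incident = Incident(measure)

    # Edit the thresholds on the measure, as in step 4 above.
    measure.warning, measure.severe = 9, 19

    # Still reports the old thresholds (10, 20).
    print(incident.evaluate(15))

    # After confirming, new violations carry the new thresholds (9, 19).
    incident.confirm()
    print(incident.evaluate(15))
```

If the real behavior matches this model, the workaround of confirming one violation after each threshold change is what forces the refresh.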