We are currently testing thresholds for an Incident that should alert when the number of runs of a certain mainframe program falls outside a certain range.
Our Incident is configured with an upper-severe threshold of 3000 and a lower-severe threshold of 100. The evaluation timeframe for the Incident is 1 minute.
As you can see from the screenshot below (of the chart generated when right-clicking on the Incident and selecting Show Measures in Chart), the start and stop times of the Incident seem to be way off. The heat field at the top of the chart is configured to show information for the Incident in question, and the thresholds shown belong to the Measure that the Incident is based on.
My question is, why is this? How does Dynatrace calculate the start and stop times for these Incidents? It was my understanding that Measures are calculated and checked for violations every 10 seconds. Why then did the Incident not end sometime in the 16:00 - 17:00 timeframe when the Measure was consistently below the threshold?
Below is a screenshot of the Incident rule behind the strange violations. As mentioned before, the evaluation timeframe for the Incident is 1 minute, and the aggregation is set to count. The chart above also has its aggregation set to count.
Any clarification that could be provided would be greatly appreciated! We are trying to better understand the way Dynatrace is doing its alerting to safeguard ourselves from false-positives, or not being alerted when there truly is an issue.
Thank you very much,
- Kasey C.
Let me try this 🙂
When you set the aggregation to "count" in an incident rule, the rule evaluates the number of PurePaths that contributed to your measure.
When you pick a 1 min evaluation timeframe interval, it means that in 1 minute the number of purepaths that exceeded your measure threshold was greater (or lower) than you configured.
The start and stop times of those incidents are configured when you fill in the box under "Incident severity".
As I understand it, if you want the exact times of the incident start and stop, you should pick "last" as your incident aggregation. In that case, if your last measure value is higher or lower than your threshold, the incident will be raised or ended immediately.
Hope this helps!
Thank you for your response!
We are indeed looking to measure the number of PurePaths containing a certain mainframe program. For this reason we set the Incident aggregation to "Count".
The field Period (seconds) to suppress further Incidents after Incident End, which sits directly below the Incident Severity field, was intentionally left at 0, as we do not care if a second instance of the Incident occurs immediately after the first ends.
What we are instead worried about is why the Incident was triggered at a time when the threshold was not exceeded, and why it did not end for over an hour after the measure dropped back to healthy levels?
Unfortunately I do not believe using the "Last" aggregation will resolve our issue for the following reason. We want to make sure that the number of transactions containing the particular program stays within a particular range each minute. I believe the "Last" aggregation only looks at the last instance of the measure pulled from a PurePath, which means it is only looking at a single transaction. Or am I mistaken about this?
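The difference between the two aggregations can be illustrated with a minimal Python sketch. It is only a model of the reasoning above, not Dynatrace's actual implementation: the function names are hypothetical, and it assumes each PurePath contributes one measure value of 1.0 to the window.

```python
from typing import Sequence

UPPER_SEVERE = 3000  # upper-severe threshold from the Incident rule
LOWER_SEVERE = 100   # lower-severe threshold from the Incident rule


def count_aggregation(values: Sequence[float]) -> int:
    """Number of measure values (assumed one per PurePath) in the window."""
    return len(values)


def last_aggregation(values: Sequence[float]) -> float:
    """Only the most recent measure value in the window."""
    return values[-1]


def in_range(count: int) -> bool:
    """True when a per-minute count lies within the configured range."""
    return LOWER_SEVERE <= count <= UPPER_SEVERE


# Two made-up one-minute windows: one too busy, one healthy.
busy_minute = [1.0] * 3500
normal_minute = [1.0] * 500

print(in_range(count_aggregation(busy_minute)))    # False
print(in_range(count_aggregation(normal_minute)))  # True
# "last" sees a single value, so both minutes look identical to it.
print(last_aggregation(busy_minute) == last_aggregation(normal_minute))  # True
```

Under these assumptions, only "count" can tell the two minutes apart, which is why "last" would not fit this use case.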
- Kasey C.
If I'm wrong and someone from Dynatrace wants to correct me, please feel free 🙂
You are right about the "last" aggregation. It will just consider the last purepath from the transaction and, in your case, it wouldn't be the best choice.
As I understand it, the alarm was fired because of the time range. When you select a 1-minute interval and "count" as the aggregation, Dynatrace calculates the number of all PurePaths that contain your measure, considering only the data from the last minute.
Looking at the graph at the beginning of the incident, at 8:20 am, the number of PurePaths at that moment did not exceed the threshold, but the count of PurePaths over the preceding minute did. The same happened at 17:55: the number of PurePaths at that moment was below your threshold, but the count of all PurePaths from the last minute was still higher.
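To make the idea of "the data from the last minute" concrete, here is a small Python sketch of a trailing one-minute count. This is only my reading of the behaviour, not confirmed Dynatrace logic; the function name `trailing_count` and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable


def trailing_count(timestamps: Iterable[datetime],
                   now: datetime,
                   window: timedelta = timedelta(minutes=1)) -> int:
    """Count PurePath timestamps that fall in the interval (now - window, now]."""
    return sum(1 for ts in timestamps if now - window < ts <= now)


# Hypothetical PurePath start times around 8:20.
purepaths = [
    datetime(2020, 1, 1, 8, 18, 59),  # older than one minute, ignored
    datetime(2020, 1, 1, 8, 19, 5),
    datetime(2020, 1, 1, 8, 19, 30),
    datetime(2020, 1, 1, 8, 19, 59),
    datetime(2020, 1, 1, 8, 20, 0),
]

print(trailing_count(purepaths, now=datetime(2020, 1, 1, 8, 20, 0)))  # 4
```

The value compared against the thresholds would then be this trailing count, not the number of PurePaths arriving at the exact moment of evaluation.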
Thank you again for your reply! It is good to have validation on the "last" aggregation. I am still a little confused about the Incident triggering, as it does not seem to be reflected in the chart at all.
In the chart I posted above, the resolution is set to 1 minute to match the evaluation timeframe for the Incident.
The datapoint for 8:20am was below the configured threshold, as you said. However, the minute before 8:20am was below the upper-severe threshold as well. In fact, according to the chart, the most recent threshold violation before the alert was sent out at 8:20am was at 8:09am, a full 11 minutes earlier. The first violation visible on the chart after the alert occurs roughly 12 minutes later at 8:32am, meaning there is a 10+ minute window on either side where the threshold is not breached.
Even if the Incident were to have somehow fired at 8:20am, wouldn't the Incident have ended sometime in the next 12-minute period before Dynatrace shows the next threshold violation? Instead, the Incident does not end until over 8 hours later at 17:55, despite the fact that there were several 30-minute periods where the measurements were consistently below the upper-severe threshold (17:00-17:30 for example). Shouldn't the Incident have ended there?
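The behaviour expected here can be sketched in a few lines of Python. This is only a model of the expectation described above, not Dynatrace's actual evaluation logic; the function name `expected_incidents` is hypothetical and the per-minute counts are made up.

```python
from typing import List, Optional, Tuple

UPPER_SEVERE = 3000  # upper-severe threshold from the Incident rule
LOWER_SEVERE = 100   # lower-severe threshold from the Incident rule


def expected_incidents(counts: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) minute indexes of expected incidents, end exclusive.

    An incident opens at the first minute whose count is out of range and
    closes at the first following minute whose count is back in range.
    """
    incidents: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for minute, count in enumerate(counts):
        out_of_range = count > UPPER_SEVERE or count < LOWER_SEVERE
        if out_of_range and start is None:
            start = minute                      # incident begins
        elif not out_of_range and start is not None:
            incidents.append((start, minute))   # incident ends
            start = None
    if start is not None:
        incidents.append((start, len(counts)))  # still open at end of data
    return incidents


# Made-up counts for seven consecutive minutes.
per_minute = [2500, 3100, 3200, 2800, 2900, 50, 2000]
print(expected_incidents(per_minute))  # [(1, 3), (5, 6)]
```

Under this model, a 12-minute stretch of in-range counts after 8:20am would close the Incident within the first in-range minute, which is not what the chart shows.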
I am still trying to figure out why the Incidents started and ended when they did, as the Incident's behavior seems to violate what the documentation says.
Thank you again very much for your help!
- Kasey C.