I have a SOAP monitor running every 10 minutes. In the extended email incident plugin, I set the aggregate for HTTP status code to max, with an evaluation timeframe of 15 minutes. Previously the aggregate was avg, but every now and then the incidents reported an HTTP status code of 302 (the average of 200 and 404). I changed the aggregate to max to test the results. Now I see 404 in the incidents instead of 302, which is good, but the incidents are not ending at the right time. Can someone help figure out whether max is the best aggregate here, or whether there is a better option?
Below are the observations:
SOAP Monitor HTTP Status Code values:
Time: HTTP Status code
Incident started at 12:30: 404
Incident didn't end at 12:40 (though it should have)
Incident ended at 12:45 (because the evaluation timeframe was 15 mins)
Incident started at 1:00: 404
Incident didn't end at 1:10 (though it should have)
Incident ended at 1:15 (because the evaluation timeframe was 15 mins)
1. What are you trying to achieve, and what do you expect? In other words, could you let us know the use case? Is your plugin adding up HTTP status codes, or counting the 4xx or 3xx responses?
2. What metric/measure/BT are you using in the incident condition or monitor? Have you tried creating a count threshold for that measure?
Also, Dynatrace alerts are based on incidents: an incident starts when its conditions are met, it can remain active, and it may end. When an incident starts, the plugin sends the email; while the incident is active, it does not send another email; and when the incident ends, it sends an incident-ended acknowledgement.
Since your evaluation window is set to 15 minutes and you are using the max aggregation, an incident won't end until an entire 15-minute window contains no execution that violated the threshold. The behavior you describe is therefore expected: with executions 10 minutes apart, the failed 12:30 execution is still inside the window at 12:40 and only drops out of it at 12:45.
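To make that concrete, here is a rough Python sketch of a max aggregation over a 15-minute window, using the timestamps from the question. It is only a simulation, not how Dynatrace evaluates incidents internally: the half-open window (now - 15 min, now], the threshold of 400, and the names `window_aggregate` and `is_violating` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def window_aggregate(samples, now, window, agg):
    """Aggregate the values whose timestamp lies in (now - window, now]."""
    values = [v for t, v in samples if now - window < t <= now]
    return agg(values) if values else None

# Monitor executions every 10 minutes (the date is arbitrary).
samples = [
    (datetime(2020, 1, 1, 12, 20), 200),
    (datetime(2020, 1, 1, 12, 30), 404),
    (datetime(2020, 1, 1, 12, 40), 200),
    (datetime(2020, 1, 1, 12, 50), 200),
]
WINDOW = timedelta(minutes=15)
THRESHOLD = 400  # hypothetical: treat an aggregate of 400 or more as a violation

def is_violating(hour, minute, agg=max):
    """True if the aggregated status code over the window violates the threshold."""
    now = datetime(2020, 1, 1, hour, minute)
    value = window_aggregate(samples, now, WINDOW, agg)
    return value is not None and value >= THRESHOLD

for hh, mm in [(12, 30), (12, 40), (12, 45)]:
    print(f"{hh}:{mm:02d} violating={is_violating(hh, mm)}")
```

Under these assumptions the 404 from 12:30 is still inside the window at 12:40, so the max stays at 404, and it only falls out of the window at 12:45, which lines up with the incident end times you observed.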
As to which aggregation makes the most sense, it depends on what you would like to be alerted on. For instance, in many cases we like to use a more frequent execution (say, one per minute) and a 5-minute window with the min aggregation for something like HTTP status code, so that it takes 5 consecutive executions with the measure in violation before the incident triggers. This lets us 'ignore' short failures and only be alerted on extended events.
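A small Python sketch of that idea, assuming one execution per minute, a window holding the last 5 results, and a hypothetical threshold of 400. The function name `first_trigger` is made up for this example, and the sketch only evaluates once the window is full, which is also an assumption.

```python
from collections import deque

def first_trigger(codes, window_size=5, threshold=400):
    """Return the index of the first execution at which the min over the
    last `window_size` status codes violates the threshold, or None."""
    recent = deque(maxlen=window_size)  # keeps only the newest results
    for i, code in enumerate(codes):
        recent.append(code)
        # With min, the window violates only if every result in it violates.
        if len(recent) == window_size and min(recent) >= threshold:
            return i
    return None

# A two-minute blip never fills the window with failures.
short_failure = [200, 404, 404, 200, 200, 200, 200]
# Five failures in a row (indexes 3 to 7) trigger on the fifth one.
long_failure = [200, 404, 200, 404, 404, 404, 404, 404, 200]

print(first_trigger(short_failure))
print(first_trigger(long_failure))
```

The short blip never produces a window made up entirely of failures, so nothing triggers, while the extended failure triggers on its fifth consecutive bad execution.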
For a simple, easy-to-understand configuration for something like a status code, min and max work best. Because the codes are interpreted as integers, using averages means you need to work through the math to get the behavior you want, which is complicated by the fact that failures can return multiple codes (404, 500, 401, etc.). For this reason I like to stick to max and min for monitor status code incidents.
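As a quick illustration of why averages get awkward, the Python snippet below compares min, max, and avg for a few windows of status codes. The sample windows and the helper name `aggregates` are invented for this example.

```python
from statistics import mean

def aggregates(codes):
    """Return the min, max, and average of a window of status codes."""
    return {"min": min(codes), "max": max(codes), "avg": mean(codes)}

one_404_of_two = aggregates([200, 404])         # avg is 302
one_500_of_two = aggregates([200, 500])         # avg is 350
one_404_of_three = aggregates([200, 200, 404])  # avg is 268

for name, result in [
    ("200, 404", one_404_of_two),
    ("200, 500", one_500_of_two),
    ("200, 200, 404", one_404_of_three),
]:
    print(name, result)
```

The average shifts with both the failure code and the number of executions in the window, and it can land on a value such as 302 that looks like a real status code but was never returned. The max always reports a code that was actually returned, and the min only exceeds an error threshold when every execution in the window failed.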