I have received a requirement to alert when we have 3 or more HTTP 500 errors within a minute on one of our services. I have created a Custom Alert as follows:
and done some testing, and there seems to be some delay in the alerting process. For example, if I have 4 HTTP 500 errors at 10:03, the Custom Alert problem is opened at 10:10 and then closed at 10:12. Since I can't find any documentation about the alerting process or the 'for X minutes during any Y minutes period' setting on the custom alert, I'm not sure what to expect. Could someone please explain how often Dynatrace evaluates the threshold on Custom Alerts, how often it decides to raise or close an alert, and how the X and Y settings affect this?
Well, the actual timeseries is evaluated every time a cluster-consolidated metric payload is written to storage. So there is a small delay of 1 to 2 minutes until the metric results of all cluster nodes reach the storage and are written. A 7-minute delay sounds a bit too long from my perspective, but 1 to 2 minutes is the typical delay until the metric is checked and the alert is raised and notified on.
Thank you for answering 🙂 In my first test I triggered 4 HTTP 500 errors and nothing more, so the delay could be explained by the lack of subsequent traffic. But now I have done a new test with 5 HTTP 500 errors at 12:01 followed by requests without failures, and the alert is raised at 12:08 and closed at 12:09.
When the alert is raised at 12:08, it is reported to be open for 8 minutes:
and then after a minute (12:09) the problem is closed and changed to this:
There is something in the alerting process I don't quite understand 🙂 I would expect the problem to be raised no later than 12:03 and closed again at 12:05.
Sorry, I have to correct my previous answer, as the cluster nodes have to consolidate the incoming data. By default, a timeslot is closed and checked against the threshold only after 5 minutes have passed with no new data written to it.
The same applies to de-alerting: we can only decide that the condition is no longer valid after the consolidation run. If we detect after 5 minutes that the condition is no longer met, we correct the problem duration to the correct timeframe, and the heat field is also corrected back.
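To make the timing concrete, here is a minimal sketch of the consolidation delay as a toy model. It assumes only the 5-minute default mentioned above; the function name `earliest_check` and the calendar date are hypothetical, and this is not how Dynatrace implements the check internally.

```python
from datetime import datetime, timedelta

# Assumption: the default quiet period described in the answer above.
CONSOLIDATION_DELAY = timedelta(minutes=5)


def earliest_check(last_write: datetime,
                   delay: timedelta = CONSOLIDATION_DELAY) -> datetime:
    """Earliest time a timeslot can be closed and checked in this toy model."""
    return last_write + delay


# Hypothetical date; the time of day matches the second test in the thread.
errors_at = datetime(2020, 1, 1, 12, 1)
print(earliest_check(errors_at).strftime("%H:%M"))  # prints 12:06
```

In this model the 12:01 slot cannot be checked before 12:06, which is much closer to the observed 12:08 than to the expected 12:03. The model does not account for the remaining gap.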
May I know whether there is any permanent solution or best practice to mitigate this situation? We noticed the same scenario, where custom alerts trigger the email notification for earlier data after 4 to 5 minutes.
Below is the example:
I enabled the custom alert at 10:43 PM.
Dynatrace triggered the email at 10:47 PM, showing there was a violation at 10:38 (alerting condition met).
Is this expected behavior? If yes, kindly explain the logic Dynatrace uses to scan the logs for custom alerts.
As explained above, we consolidate the data across all cluster nodes before making the decision, so that we do not falsely alert on partial data. If we changed the strategy to alert on partial data, we would gain speed at the cost of a high number of false positive alerts.
But the app teams expect the alert notification in real time. Sending the email after 7 to 10 minutes would delay the remediation activities. Is there any workaround or configuration to change this behavior?
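A small sketch of why partial data can produce a false positive. It uses a failure-rate threshold rather than the error count from the original question, because a count threshold evaluated on partial data would under-count instead. All names and numbers here are hypothetical and only illustrate the trade-off.

```python
def failure_rate(node_reports):
    """Failure rate over a list of (errors, requests) tuples, one per node."""
    errors = sum(e for e, _ in node_reports)
    requests = sum(r for _, r in node_reports)
    return errors / requests if requests else 0.0


THRESHOLD = 0.05  # hypothetical 5 % failure-rate threshold

# Hypothetical per-node reports for one timeslot: (errors, requests).
reports = [(2, 2), (0, 98)]

partial = failure_rate(reports[:1])  # only the first node has reported
full = failure_rate(reports)         # all nodes consolidated

print(partial > THRESHOLD)  # prints True: would alert on partial data
print(full > THRESHOLD)     # prints False: no alert after consolidation
```

With only the first node's data the slot looks like a 100 % failure rate, while the consolidated slot is at 2 %, below the threshold.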