Need help/input regarding batch system monitoring

ct_27 · ‎04 Mar 2022

We have a broader issue: using Dynatrace to monitor the execution of nightly batch jobs and alert if they fail to run at all or fail to run within a given window. If anyone has useful tips on that, please share.

The issue at hand is that we have a batch server for which we need to know if it ever stops processing batch jobs. The server itself "thinks" it's processing jobs, but like a squirrel in the fall, it's collecting as many nuts as it can and just stowing them somewhere.

Another scenario: you have an email server that crashes and you want an email to go out when it stops working. Not too easy to send an email when your email server is dead.

To handle this, we've created a batch job that runs every 5 minutes and "Chirps" like a canary in a coal mine. The "Chirp" is recognized by the number "1" reported to Dynatrace via metrics v2.

Inspired by warrant canaries (https://en.wikipedia.org/wiki/Warrant_canary)
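For anyone wanting to try the same thing, here is a minimal sketch of what a chirp like this could look like against the Metrics API v2 ingest endpoint. It is not our actual job: the metric key `batch.canary.chirp`, the `server` dimension, the environment URL, and the token are all placeholder assumptions.

```python
import urllib.request

# Hypothetical placeholders: substitute your own environment URL, token and metric key.
DT_ENV_URL = "https://YOUR_ENV.live.dynatrace.com"
DT_API_TOKEN = "YOUR_TOKEN"          # assumed to carry the metrics ingest scope
METRIC_KEY = "batch.canary.chirp"    # made-up metric key for illustration


def build_chirp_line(metric_key, job_server, timestamp_ms=None):
    """Build one line of the metrics ingest line protocol with the value 1.

    Format: <metric key>,<dimension>=<value> <payload> [<timestamp in ms>]
    """
    line = f"{metric_key},server={job_server} 1"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


def send_chirp(env_url, api_token, line, timeout=10):
    """POST a single line to /api/v2/metrics/ingest and return the HTTP status."""
    req = urllib.request.Request(
        url=f"{env_url}/api/v2/metrics/ingest",
        data=line.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Example call from the scheduled batch job (not executed here):
#   send_chirp(DT_ENV_URL, DT_API_TOKEN, build_chirp_line(METRIC_KEY, "batch01"))
```

The point of running it from the batch scheduler itself is that the chirp only goes out if the scheduler is really executing jobs.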

Here is our current implementation as charted in Dynatrace. The purple dots represent our noisy canary. "Chirp" = 1

So, great, we have a canary. But I need to be alerted if the canary stops singing for 1 cycle.

- Turning on 'Alert if data is missing' just creates 1 endless problem in this scenario. So, that's not working.

- Setting the threshold to '1' and alerting if the metric drops BELOW '1' doesn't work, because in the Dynatrace world, if you don't send in a metric, Dynatrace sets the missing data point equal to whatever you set the Threshold to (in this case '1') instead of zero. Thus the metric technically never goes below '1' unless I send a '0'. I can't send a '0' because that would only happen when things are not working (like asking an email server to send you an email when it's down).

In the chart above though, you can see Dynatrace is doing the INVERSE of what we actually want. Dynatrace Alerts when the metric is coming in but turns OFF the alert whenever the metric STOPS coming in. The reason Dynatrace does this is because I set the Threshold = 2, so when a metric is not sent in Dynatrace records a '2' (instead of 0) and thus the metric is no longer BELOW 2 (it's equal to 2).

So, anyone have any ideas? Even if a completely different solution?

HigherEd

mgome · ‎04 Mar 2022

Add a count aggregation to your metric. This will give you a count of metrics received regardless of value and you can alert if the count drops below your desired count threshold.
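To illustrate the idea (not how Dynatrace evaluates it internally), here is a small sketch that counts data points per evaluation window and flags the windows that fall below a minimum count. The function names, the window size, and the sample timestamps are made up for the example, and whether an empty window is actually evaluated as a count of 0 in your environment is something to verify.

```python
def count_per_window(chirp_times, start, end, window):
    """Count data points in each [t, t + window) bucket between start and end."""
    n_windows = (end - start) // window
    counts = [0] * n_windows
    for t in chirp_times:
        if start <= t < start + n_windows * window:
            counts[(t - start) // window] += 1
    return counts


def silent_windows(counts, min_count=1):
    """Return the indexes of windows whose count is below min_count."""
    return [i for i, c in enumerate(counts) if c < min_count]


# One hour of chirps every 300 s, with the chirps at 1500, 1800 and 2100 missing.
chirps = [t for t in range(0, 3600, 300) if not (1500 <= t < 2400)]
counts = count_per_window(chirps, start=0, end=3600, window=600)

print(counts)                     # [2, 2, 1, 0, 2, 2]
print(silent_windows(counts))     # [3]
print(silent_windows(counts, 2))  # [2, 3]
```

Because the count ignores the reported value, a canary that always sends '1' still yields a number that drops when chirps go missing, as long as the window is at least as long as the chirp interval.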

ct_27 · ‎04 Mar 2022

Thank you. I had already tried changing to count but noticed no difference. I assumed that was because our metric reports either a '1' or '0', which is very similar to a count of '1' for one metric or '0' for no metrics. I will give it another try and also confirm our metrics v2 ingest call.

We're starting to have better success after we increased our reporting frequency from every 10 minutes to every 1 minute, but this is not a sustainable solution as we really only want to "Chirp" every 60 minutes.

Thank you for the input.

HigherEd