28 Nov 2021 11:46 PM - last edited on 29 Nov 2021 02:17 AM by MaciejNeumann
Any thoughts on monitoring a batch job that runs, say, every 6, 24, or 168 hours? The custom events for alerting feature only allows a rolling window of at most 60 minutes.
We have two options to report status.
1) We build a Metrics v2 API call into our batch job that stores the status as a metric every time the job runs.
2) We monitor a log file for the status value.
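For option 1, a minimal sketch of what the call at the end of the batch job could look like, assuming the Metrics v2 ingest endpoint (`/api/v2/metrics/ingest`) with its plain-text line protocol and an API token that is allowed to ingest metrics. The metric key `custom.batch.status`, the dimension `job`, and both function names are made up for illustration; they are not anything Dynatrace prescribes.

```python
import time
import urllib.request


def build_status_line(metric_key, job_name, status, timestamp_ms=None):
    """Build one line of the metrics ingest line protocol.

    Format: <metric key>,<dimension>=<value> <payload> <timestamp in ms>
    The dimension name "job" is a hypothetical choice.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return f"{metric_key},job={job_name} {status} {timestamp_ms}"


def send_status(base_url, api_token, line):
    """POST one line to the ingest endpoint and return the HTTP status code.

    base_url is the environment URL, e.g. https://<your-environment>.
    """
    request = urllib.request.Request(
        url=f"{base_url}/api/v2/metrics/ingest",
        data=line.encode("utf-8"),
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Example payload: status 1 = success, 0 = failure (our own convention).
print(build_status_line("custom.batch.status", "nightly_export", 1, 1638143160000))
```

The job would call `send_status(...)` as its last step, so a missing data point means the job never got that far.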
Issue: in both cases, if a catastrophic failure occurs, no status will be reported.
The alert needs to trigger if the metric stops reporting for 6, 12, or 168 hours, or if a metric value of 0 (or less than 1) is reported within that same window.
I have no way of obtaining a heartbeat or status between executions.
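To make the alert condition concrete, here is a rough sketch of the check in plain Python. It assumes the data points have already been fetched from somewhere as (timestamp, value) pairs and that the check runs outside the batch job on its own schedule; `should_alert` is a hypothetical helper, not a Dynatrace feature.

```python
from datetime import datetime, timedelta, timezone


def should_alert(points, window_hours, now=None):
    """Return True if the job looks dead or failed within the window.

    points: iterable of (timezone-aware datetime, numeric status) pairs.
    Alerts when no point falls inside the window (metric stopped
    reporting) or when any point inside the window is below 1.
    """
    now = now or datetime.now(timezone.utc)
    window_start = now - timedelta(hours=window_hours)
    recent = [value for ts, value in points if window_start <= ts <= now]
    if not recent:
        return True  # nothing reported: treat as catastrophic failure
    return any(value < 1 for value in recent)


now = datetime(2021, 11, 29, tzinfo=timezone.utc)
last_run = [(now - timedelta(hours=10), 1)]
print(should_alert(last_run, 6, now))   # last success is older than 6 h
print(should_alert(last_run, 24, now))  # last success is within 24 h
```

The window would be set per job (6, 12, or 168 hours), ideally with some slack so a run that finishes a few minutes late does not raise a false alarm.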
I've been trying to solve these kinds of use cases for a very long time (with AppMon and Dynatrace). A very common example is periodic CronJobs on the Hybris commerce platform.
In the end I found a working solution, though it involves quite a bit of "external" logic. The general process for me is the following:
For more detailed evaluation, I use my timeseries streamer to get the above data out of Dynatrace and into a time series database (InfluxDB in my case). There I can use the full logic of Flux queries to track things like:
My bottom line: it is (still) tricky to trigger events in Dynatrace based on low-frequency or missing metrics. For that purpose I get the data out of DT and into a system where I can perform advanced data manipulation and logic for alerting or visualization.