Solved: Cached JMX Metrics used in Alerts

smitty · 18 Dec 2025

Hi,

I have maybe an unusual use-case for setting up some Custom Davis Anomaly Alerts.

I have several Custom JMX Metrics set up through a Custom Extension that we use to monitor the health of our system. Most of these JMX MBeans return an integer value of 1 for healthy and 2 for unhealthy for specific checks performed in our application. To control the load of these checks, we cache the result of each check for 5, 10, or 20 minutes.

The Alerting DQL looks like this for one of them:

timeseries avg(`jmx.app-monitoring.metric_xxxx_healthy`),

filter: { in(entityAttr(dt.entity.process_group, "tags"),"Environment:PROD")

and in(entityAttr(dt.entity.process_group, "tags"),"Application:WebServiceApp")},

by: {dt.entity.process_group_instance}

I set up the Alert with:

Threshold 1

Alert if metric is above

Violations 3

Sliding window 11

Dealerting samples 5

With this, I'm getting alerted for a single failure that was cached for 5 minutes. I assume this happens because the Alert sees 3 violating checks at 1-minute intervals inside that same cached 5-minute time frame.

Since I set this up, I learned that the Alerts check the metric every minute, which I assume triggers a call to my JMX Metric MBean at the same 1-minute interval, but I'm not 100 percent sure.

My goal is to trigger the Alert only after three unhealthy checks, meaning three back-to-back checks, each cached for 5 minutes, with a value of 2 for unhealthy.

Any guidance on how to configure this?

Would I need something like 11 violations in a sliding window of 15 minutes?

It seems like I'm trying to match up two sliding windows, my cached result window and the Alert's sliding window, and may not be able to get a predictable result. Also, should I be using max instead of avg?

Any help on this unusual use case would be appreciated.

Thanks,

Smitty

Julius_Loman · 18 Dec 2025

JMX Extensions don't support intervals, unlike some other data sources, so the JMX attribute value is retrieved every minute. Correct me if I'm wrong, but you cache the data on the application side, so for a 5-minute cache, the real check is done once and the next 4 consecutive reads of the attribute retrieve the value from the cache.

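This means that for a metric that comes from an MBean attribute with a 5-minute cache, you need a sliding window of at least 11 minutes with at least 11 violations. Two back-to-back unhealthy checks produce about 10 violating one-minute samples, so 11 violations can only be reached once a third check has also failed.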
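To see why 11 violations in an 11-minute window lines up with "three unhealthy checks in a row", here is a minimal Python sketch. It is only a simplified model, not how Dynatrace is implemented: it assumes each cached result is returned for exactly 5 one-minute scrapes, and that the alert fires as soon as the number of samples above the threshold inside the sliding window reaches the configured violations. The helper names `scrape_series` and `first_alert_minute` are made up for this illustration.

```python
from collections import deque

HEALTHY, UNHEALTHY = 1, 2


def scrape_series(check_results, cache_minutes):
    """Expand real check results into one value per one-minute scrape.

    Assumption: every real check result is served from the cache for
    exactly `cache_minutes` consecutive scrapes.
    """
    series = []
    for result in check_results:
        series.extend([result] * cache_minutes)
    return series


def first_alert_minute(series, threshold, violations, window):
    """Return the 1-based minute at which the alert would first fire, or None.

    Simplified rule: fire when at least `violations` of the last `window`
    samples are above `threshold`.
    """
    recent = deque(maxlen=window)
    for minute, value in enumerate(series, start=1):
        recent.append(value > threshold)
        if sum(recent) >= violations:
            return minute
    return None


one_failure = scrape_series([HEALTHY, UNHEALTHY, HEALTHY, HEALTHY, HEALTHY], 5)
two_failures = scrape_series([HEALTHY, UNHEALTHY, UNHEALTHY, HEALTHY, HEALTHY], 5)
three_failures = scrape_series([HEALTHY, UNHEALTHY, UNHEALTHY, UNHEALTHY, HEALTHY], 5)

# Original settings: 3 violations in an 11-sample window.
print(first_alert_minute(one_failure, 1, 3, 11))      # 8 (fires on one cached failure)

# Suggested settings: 11 violations in an 11-sample window.
print(first_alert_minute(one_failure, 1, 11, 11))     # None
print(first_alert_minute(two_failures, 1, 11, 11))    # None
print(first_alert_minute(three_failures, 1, 11, 11))  # 16 (first scrape of the third failed check)
```

In this model the alert fires on the first scrape of the third unhealthy check rather than after its cache period ends. If the cache expiry and the one-minute scrapes are not perfectly aligned, the counts can shift by a sample, so treat the numbers as approximate.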

For your case, you can use avg. Davis Anomaly Detector needs 60 bins and works on data from the last hour only. JMX MBean attribute values are scraped at one-minute intervals, so aggregation does not make sense here. It would if you displayed it on a dashboard, though.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

smitty · 18 Dec 2025

Thanks for the explanation. I'll give the 11-minute window and 11 violations a try.

I'm still a bit new to Dynatrace DQL. When you say "so aggregation does not make sense here", do you mean my DQL should be changed to use fetch? I do use the metric on a dashboard and took this DQL from the dashboard to use in the Alert.

Julius_Loman · 19 Dec 2025

You collect it as a metric signal. Metrics are stored at 1-minute granularity. Unless you send the metric more often than once a minute, the max/min/avg are the same for each one-minute bucket.

Aggregation makes a difference if you query metrics for a longer timeframe or with wider bins, because Dynatrace then needs to calculate the value for each bin from several data points based on the aggregation.

Davis Anomaly Detector at the moment queries for the last 60 minutes only, and it requires strictly 60 bins afaik. So changing aggregation does not make much sense here, as there is only one data point in each bin.
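A small Python sketch of that point, using made-up per-minute values and a hypothetical `aggregate` helper (this is not Dynatrace code, just the arithmetic): with one data point per one-minute bin, avg, max, and min are identical, and they only diverge once several samples share a bin.

```python
from statistics import mean


def aggregate(samples, bin_minutes):
    """Group one-minute samples into bins and compute avg/max/min per bin."""
    bins = [samples[i:i + bin_minutes] for i in range(0, len(samples), bin_minutes)]
    return [{"avg": mean(b), "max": max(b), "min": min(b)} for b in bins]


# Hypothetical scraped values: 1 = healthy, 2 = unhealthy, one per minute.
samples = [1, 1, 1, 2, 2, 2, 2, 2, 1, 1]

# One sample per bin, as in the 60-bin, 60-minute alert query:
one_minute = aggregate(samples, 1)
print(all(b["avg"] == b["max"] == b["min"] for b in one_minute))  # True

# Five samples per bin, as on a dashboard with a longer timeframe:
five_minute = aggregate(samples, 5)
print(five_minute[0])  # avg is between 1 and 2, max is 2, min is 1
```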
