20 Jan 2025 11:11 AM
Dear experts,
I need to create a Problem based on two different metrics for the same service: dt.service.request.response_time and dt.service.request.failure_count. Both metrics refer to the same endpoint.
I have created a DQL query that shows both metrics in the same graph, but I don't know whether an alert can be raised on the following conditions:
- response_time > 300 ms
- failure_count > 20
- For 60 minutes
The problem should be raised to the trouble ticket only if all three conditions are met.
DQL:
timeseries interval: 1m, { response_time = avg(dt.service.request.response_time),
count(dt.service.request.failure_count)}, from:-1h, to:now(),
by: { endpoint.name }
| filter endpoint.name == "test"
Thank you very much for your knowledge and time.
20 Jan 2025 01:04 PM
This is a fun question!
There are a couple of ways to do this that come to mind, but I think the most suitable would be to use Davis Anomaly Detection.
It should be relatively straightforward from the instructions in the documentation. Just two things to keep in mind:
1. A Davis Anomaly Detector assumes it will receive one time series and works from that. Here you have two, so you'll have to combine them in some way to allow for alerting.
There's no single right answer, but you could try something like this:
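timeseries { response_time = avg(dt.service.request.response_time), failure_count = avg(dt.service.request.failure_count) },
by: { endpoint.name }
// flag each interval: 1 if the threshold is exceeded, else 0
// (check the metric's unit first: if response_time is reported in microseconds, 300 ms is 300000)
| fieldsAdd response_time_alert = iCollectArray(if(response_time[] > 300, 1, else: 0))
| fieldsAdd failure_count_alert = iCollectArray(if(failure_count[] > 20, 1, else: 0))
// element-wise sum of the two flags: 2 means both conditions hold in that interval
| fieldsAdd alert = response_time_alert[] + failure_count_alert[]
| fields timeframe, interval, alert, endpoint.name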
This creates a new timeseries, alert, which is 2 for every interval in which both of your first two conditions are met, and 0 or 1 otherwise.
Then simply set your Davis Anomaly Detector to alert with a threshold of 2!
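To make the combination step concrete, here is a minimal Python sketch of what the two flag arrays and their element-wise sum compute. It only models the logic of the query above; the function name combine_flags and the sample values are made up for illustration, and the thresholds assume the response time is already expressed in milliseconds.

```python
def combine_flags(response_time_ms, failure_count, rt_threshold=300, fc_threshold=20):
    """Return one value per sample: how many of the two conditions are met (0, 1 or 2)."""
    alert = []
    for rt, fc in zip(response_time_ms, failure_count):
        # Missing samples (None) are treated as "condition not met" (an assumption).
        rt_flag = 1 if rt is not None and rt > rt_threshold else 0
        fc_flag = 1 if fc is not None and fc > fc_threshold else 0
        alert.append(rt_flag + fc_flag)
    return alert


# Hypothetical per-minute samples for one endpoint.
response_time_ms = [250, 320, 310, 290]
failure_count = [25, 30, 10, 5]

alert = combine_flags(response_time_ms, failure_count)
print(alert)  # [1, 2, 1, 0]
```

Only the second sample reaches 2, because it is the only minute in which both thresholds are exceeded at the same time.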
2. For your third condition, you will want to adjust the sliding window settings in the anomaly detector (toggle Show advanced properties in the Customize parameters step):
60/60/* (60 violating samples within a sliding window of 60 samples, with any dealerting value) would correspond to what you are requesting. From experience, I would recommend a smaller time window though, as such slow-acting alerts can feel a bit too sluggish. But it's up to you, of course!
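As a rough illustration of why 60/60 means "for 60 minutes" and what a smaller window changes, the sketch below models the sliding-window rule in Python. It is a simplified model, not Davis's actual implementation: the function name should_raise is hypothetical, one sample per minute is assumed, and treating a value equal to the threshold as a violation is also an assumption.

```python
def should_raise(alert, threshold=2, violating_samples=60, sliding_window=60):
    """True if any window of `sliding_window` consecutive samples contains
    at least `violating_samples` values at or above `threshold`."""
    # 1 marks a violating sample, 0 a healthy one.
    violations = [1 if value >= threshold else 0 for value in alert]
    for end in range(sliding_window, len(violations) + 1):
        window = violations[end - sliding_window:end]
        if sum(window) >= violating_samples:
            return True
    return False


# 60 consecutive minutes with both conditions met: the 60/60 rule fires.
print(should_raise([2] * 60))

# A single healthy minute inside the hour is enough to prevent a 60/60 alert.
print(should_raise([2] * 30 + [1] + [2] * 29))

# A looser 3-out-of-5 rule reacts much sooner to the same kind of data.
print(should_raise([0, 2, 2, 1, 2], violating_samples=3, sliding_window=5))
```

The second call shows the trade-off mentioned above: with 60/60 every single minute has to violate, so one good sample resets the clock, whereas a shorter window with fewer required violations alerts faster and tolerates brief recoveries.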
22 Jan 2025 09:04 AM
Thank you very much. That's exactly what we are looking for
Victor