11 Feb 2025 04:11 PM
Hi!
I have a values-type metric. I'd like to display a timeseries that only counts datapoints whose value is greater than (or less than) an arbitrary value. I don't want to filter an aggregation, but analyze each value and remove the ones I don't want.
How can I achieve this?
Thanks for your help!
13 Feb 2025 07:53 AM
I am not sure if I got the requirement correctly. Here are my thoughts:
To get a timeseries (CPU usage in my example) where at least one value is greater than the predefined threshold, the iAny function can help:
timeseries cpu=avg(dt.host.cpu.usage), by: {dt.entity.host}
| filter iAny(cpu[]>80)
and to keep only the datapoints matching this condition in the final result, an iterative expression is useful:
timeseries cpu=avg(dt.host.cpu.usage), by: {dt.entity.host}
| filter iAny(cpu[]>80)
| fieldsAdd cpu = if(cpu[]>80, cpu[])
17 Feb 2025 04:09 PM
Hi @krzysztof_hoja !
Thanks for your help. But you are filtering the result of avg(dt.host.cpu.usage). I would like to filter each data point independently, before any aggregation. I would like to build a response time SLO, for example 🙂 I can do it easily with fetch logs, but I can't find a solution with the timeseries command ...
18 Feb 2025 08:43 PM
You cannot find it, because it does not exist in the generic case for timeseries 🙂
I used dt.host.cpu.usage with a breakdown by host somewhat on purpose. Let's consider this query:
timeseries { cpu=avg(dt.host.cpu.usage), cpu_t=sum(dt.host.cpu.usage, rollup:total) } , by: {dt.entity.ec2_instance, dt.entity.host}
| filter dt.entity.host == "HOST-937E3C790B64E8B5"
Besides the plain average I added a second timeseries: a sum with rollup:total. This additional series tells us how many contributions, i.e. raw measurements, happened. The result looks like this:
The value of cpu_t is constantly 6 every minute, because the CPU usage of a host is read every 10 seconds. But these individual measurements are not stored. What is stored, and is in fact the most granular "data point", is a statistical description of what happened, containing in the basic case 4 values: min, max, sum and count (sum and count allow calculating the average). From these components, bigger aggregates can be calculated, e.g. for host groups or for longer intervals.
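As an illustration, the four stored components can be queried side by side with the corresponding aggregation functions. This is just a sketch assuming the standard aggregations and the rollup:total usage shown above:

```
timeseries {
  cpu_min = min(dt.host.cpu.usage),       // stored minimum per bucket
  cpu_max = max(dt.host.cpu.usage),       // stored maximum per bucket
  cpu_avg = avg(dt.host.cpu.usage),       // derived from sum / count
  cpu_cnt = sum(dt.host.cpu.usage, rollup:total)  // number of raw contributions
}, by: {dt.entity.host}
```

Nothing in this query gets you back to the individual 10-second readings; every column is derived from the same per-minute statistical summary.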
If we take a look at a similar query:
timeseries { rt=avg(dt.service.request.response_time), rt_t=sum(dt.service.request.response_time, rollup:total) } , by: {dt.entity.service}
, filter: dt.entity.service == "SERVICE-CB0AFF6C5BC4EABE"
and the result:
you can see that the number of contributions is variable: these are actual requests. But this metric also has additional dimensions which allow a more granular view. Adding "endpoint.name" splits the series further:
You can look even deeper by splitting requests into successful and failed ones, but in the generic case you will not get datapoints representing single requests. You may just have one by chance, when only one request fell into a specific bucket.
The basic idea of a metric is to have an aggregated view of a process (bucketized time, selected dimensions only): you lose details, but you gain easy and fast access. If details are needed, for some cases we have spans to look deeper (service.request.response_time can be recreated from spans if no sampling occurs), but for some we simply do not keep the details.
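For the span-based route mentioned above, a per-request threshold check can be sketched like this. The field names (duration, start_time), the threshold, and the reuse of the service ID from the earlier example are assumptions, not a verified recipe:

```
fetch spans
| filter dt.entity.service == "SERVICE-CB0AFF6C5BC4EABE"
| summarize {
    slow  = countIf(duration > 500ms),  // requests over the threshold
    total = count()                     // all requests
  }, by: { bin(start_time, 1m) }
```

Because every span is a single request, the comparison happens before any aggregation, which is exactly what the metric path cannot do.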
20 Mar 2025 03:24 PM
Hi @krzysztof_hoja ! Thanks for the explanation. Is there a roadmap item to implement filtering on raw datapoints, even if it results in slower performance?
14 May 2025 10:13 AM
Hi @krzysztof_hoja !
This is still an ongoing issue for us. Due to the high cost of handling a large volume of logs, we rely primarily on metrics. However, Dynatrace SLOs suffer from a lack of precision caused by data aggregation. Our customer is comparing our new Dynatrace SLOs with their legacy SLOs from Splunk, which are based on raw log data, and the results are not consistent.
Is there anything new planned on your side to address this issue?
14 May 2025 03:54 PM
What's the actual definition of SLO?
15 May 2025 09:21 AM
Currently, on Splunk, SLOs are defined based on the ratio of response times that are under a threshold.
That is something like:
index = my_app duration>0
| eval threshold = 0.5
| eval count_under_threshold = if(duration < threshold, 1, 0)
| timechart sum(count_under_threshold) as count_under_threshold, count(duration) as count_all span=1d
| eval sli = 100*count_under_threshold/count_all
So, each duration is analyzed.
In Dynatrace with the timeseries command, we first need to aggregate durations into avg/max/min at 1-minute resolution, and only then compare the result to the threshold.
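For comparison, the log-based route you mention can do the per-record check. A rough DQL equivalent of the Splunk query above might look like this (the duration field, its unit, and the filter are assumptions about your log schema):

```
fetch logs
| filter isNotNull(duration) and duration > 0
| fieldsAdd under_threshold = if(duration < 0.5, 1, else: 0)  // per-record check, as in the Splunk eval
| summarize {
    count_under = sum(under_threshold),
    count_all   = count()
  }, by: { day = bin(timestamp, 1d) }
| fieldsAdd sli = 100.0 * count_under / count_all
```

Here each duration is analyzed individually, matching the Splunk behavior, at the cost of scanning raw log records.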
19 May 2025 07:48 AM
Ok, so it is for a specific set of requests...
Please consider creating two metrics for this purpose: a count of all spans/requests and a count of "slow" spans/requests.
Alternatively, you can use the built-in metric (dt.service.request.count) if it can act as the denominator (has the right content and/or all the dimensions needed to extract the subset you need).
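Once both counters exist, the SLI becomes a simple ratio of the two timeseries. A sketch, where the metric keys my_app.requests.slow and my_app.requests.total are hypothetical placeholders for the two metrics suggested above:

```
timeseries {
  slow  = sum(my_app.requests.slow, default: 0.0),  // default avoids gaps when no slow requests occur
  total = sum(my_app.requests.total)
}
| fieldsAdd sli = 100.0 * (total[] - slow[]) / total[]
```

Because the "slow" classification happens at ingest time, per request, the resulting SLI is exact rather than an approximation over pre-aggregated averages.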