Hello,
we recently wanted to implement a feature with Dynatrace that alerts users when their namespace over-requests CPU resources in the OpenShift cluster.
As a basis for this feature, we were planning on using a DQL query similar to the following:
timeseries {
  requests = sum(dt.kubernetes.container.requests_cpu, rollup: avg)
}, by: { dt.entity.cloud_application_namespace }, interval: 1m, from: -3h, to: now()
The idea is to look at the total CPU requests in a namespace, represented as one data point per timeslice. We use 'rollup: avg' because the default 'rollup: sum' does not make sense in this context: if a namespace requests 1 CPU for 5 minutes, summing the repeated per-minute samples over time would yield 5 cores, while only 1 core was actually occupied, so we want the output to be 1 core. Ideally we would like to take the previous 14 days into consideration, which limits us to 15-minute intervals as the finest resolution.
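To make the distinction concrete, here is a minimal sketch (plain Python, invented sample values) of the two rollup behaviours, assuming the request value is a gauge that is re-reported once per minute, as our example above implies:

# One container requesting 1 CPU, reported once per minute for 5 minutes.
samples = [1.0] * 5

rollup_sum = sum(samples)                  # 5.0 "cores" -- counts the same reserved core 5 times
rollup_avg = sum(samples) / len(samples)   # 1.0 core    -- matches what is actually occupied
print(rollup_sum, rollup_avg)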
The issue we see, however, is that if we run the query above against a sample namespace and plot different interval settings for a fixed timeframe, the values of the data points climb with the interval size, as can be seen in the attached screenshot. Our expectation was that, with 'rollup: avg', the values would stay roughly the same whether the interval is 1m, 15m, or 3h. From what we observe, we assume that the Dynatrace backend applies an additional interval-dependent aggregation, tied to the 'sum' aggregation method, on top of 'rollup: avg'.
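To illustrate what we suspect is happening, here is a small simulation (plain Python, all numbers invented; we do not know the actual backend implementation). If each bucket were the average of the per-minute namespace totals, the value would be interval-independent; if each bucket were instead the sum of every raw point falling into it, the value would scale linearly with the interval, which matches the climbing we see:

# 3 containers, each requesting 0.5 CPU, one raw point per container per minute.
minutes = 45
raw = [[0.5] * 3 for _ in range(minutes)]  # raw[minute][container]

def expected(interval):
    # Sum across containers per minute, then average the per-minute sums per bucket.
    per_minute = [sum(point) for point in raw]
    buckets = [per_minute[i:i + interval] for i in range(0, minutes, interval)]
    return [sum(b) / len(b) for b in buckets]

def suspected(interval):
    # Sum every raw point that falls into the bucket, across containers AND time.
    buckets = [raw[i:i + interval] for i in range(0, minutes, interval)]
    return [sum(sum(point) for point in bucket) for bucket in buckets]

for interval in (1, 15, 45):
    print(interval, expected(interval)[0], suspected(interval)[0])
# expected stays at 1.5 for every interval; suspected grows linearly: 1.5, 22.5, 67.5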
Our question is whether there is another approach that yields the desired result, or whether a feature request can be opened to decouple the 'sum' aggregation method from whatever other interval-dependent aggregation is done in the background.
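One workaround we have considered, sketched below in plain Python under the (unconfirmed) assumption that each bucket value is the sum of one raw data point per minute: dividing each value by the number of minutes per bucket would recover the average we are after. We would prefer not to rely on this, since it breaks if the backend aggregation is anything other than a straight per-minute sum:

def normalize(values, interval_minutes, raw_resolution_minutes=1):
    # Hypothetical correction: divide each bucket value by the number of raw points it contains.
    points_per_bucket = interval_minutes // raw_resolution_minutes
    return [v / points_per_bucket for v in values]

fifteen_min_values = [22.5, 22.5, 22.5]   # invented, inflated as in the simulation above
print(normalize(fifteen_min_values, 15))  # -> [1.5, 1.5, 1.5]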