Re: Introducing new Kubernetes metrics for improved user experience (and deprecating old ones)

florian_g · ‎13 Jul 2022

As Dynatrace, we're constantly trying to improve your user experience with our product. Sometimes, this comes with breaking changes, such as in this case. Specifically, we're consolidating some of our Kubernetes-related metrics, meaning that some currently existing metrics will soon be deprecated. This post explains the why and what of this move. It also describes the potential impact on your environment and steps to migrate to our improved metrics. Please read this carefully if you use Kubernetes-related metrics in custom dashboards, custom events for alerting, or SLOs – these assets might break with this change.

In the past, because of technical limitations, we had to provide multiple metrics to represent the same measurement on multiple levels of the Kubernetes entity hierarchy. For example, there were separate metrics for counting running pods on the workload level and on the namespace level. For some levels of the Kubernetes entity hierarchy, there was no metric available at all. If you needed such a measurement, for example to chart running pods on the cluster level, you had to come up with complex metric expressions. Many people were overwhelmed by the complexity of these queries, and we got lots of requests for improvements. Consequently, we decided to take a significant step in consolidating these metrics.

Which metrics are affected?

We have focused on metrics originating from the Kubernetes API – technically, we collect them using our ActiveGate. Consequently, the availability of the new metrics is bound to the version of the ActiveGate monitoring your Kubernetes cluster. In total, we add 30 new metrics with ActiveGate 1.247. We also use this chance to move towards our new Kubernetes metrics prefix: "builtin:kubernetes". The new prefix lets you spot new metrics within our product immediately. Navigate to the metrics browser and search for "builtin:kubernetes" as shown in the screenshot below. Please note that the search result will show you old and new metrics, as it performs a full-text search. It's also worth mentioning that you can find these metrics in our product already today, but they will only contain data for Kubernetes clusters monitored with an ActiveGate of version 1.247 or higher.

Unfolding the details of any of these new metrics shows the vast selection of dimensions available for splitting or filtering.

Deprecation: What happens to the old metrics?

The new metrics will eventually replace the old metrics – until then, both sets of metrics will be written in parallel. With cluster version 1.246, the old metrics are prefixed with "[Deprecated]" in the display name. Further details about the deprecation can be found in the metrics section of our product, within the description of each of these metrics. The information on which metric to use as the replacement may prove especially helpful.

New and old metrics are written in parallel until you upgrade to ActiveGate version 1.253 (~3 months). That means that with ActiveGate version 1.253, the old metrics won't be written anymore. Already persisted data can still be read and analyzed, but no data will be migrated into the new metrics.
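The parallel-write window described above can be expressed as a small check. This is only a sketch: the function name and its parameters are hypothetical, and `available_from` has to be taken from the overview table for the metric in question, since availability differs per metric. Versions are given as the minor number (for example 253 for ActiveGate 1.253).

```python
def metric_write_state(activegate_minor: int, available_from: int,
                       old_removed_in: int = 253) -> dict:
    """Report which metric generations a given ActiveGate version writes.

    activegate_minor: minor version of the ActiveGate, e.g. 250 for 1.250.
    available_from:   minor version that first writes the new metric
                      (assumption: looked up in the overview table).
    old_removed_in:   minor version that stops writing the old metrics
                      (1.253 according to the post).
    """
    return {
        "new_metrics_written": activegate_minor >= available_from,
        "old_metrics_written": activegate_minor < old_removed_in,
    }


# Example: an ActiveGate 1.250 and a metric that became available with 1.247.
print(metric_write_state(250, available_from=247))
```

Both flags being true means the environment is inside the migration window, which is the time to update dashboards, alerts, and SLOs.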

The impact on your environment and steps for a successful migration to our improved metrics

Metrics play a significant role in Dynatrace. Especially in an orchestration system like Kubernetes, metrics are key to provisioning, infrastructure optimization, failure analysis, and much more. Therefore, metrics are used in a variety of places, namely dashboards, WebUI pages, custom events for alerting, and our REST API.

All these places need to be updated to the new metrics. We, as Dynatrace, will update the Kubernetes dashboard presets and the Kubernetes monitoring WebUI pages over the next few releases. However, as there is no straightforward migration from an old metric to a new metric, we can't automatically migrate your custom dashboards, custom events for alerting, SLOs, or any other tool that uses our API to access these metrics.

We understand that this might come with some effort on your side. As always, we want to keep your efforts as minimal as possible, so we offer a web application, called Metric Audit Report, guiding you through this process. Specifically, the Metric Audit Report provides you with a list of affected custom dashboards, custom events for alerting, and SLOs, alongside additional helpful information, such as the corresponding owner. We hope this makes migration fast and easy, so you can quickly focus again on all the benefits of our new metrics around Kubernetes.

Also note that the new metrics work differently with management zones. The old metrics were tied to primary entities such as namespaces or workloads. Consequently, including these entities in a management zone also included the corresponding metrics. With the new metrics, all the entities are available in the metrics' dimensions. Hence, an additional "Dimensional rule for Metrics" is required. The template for setting up a management zone for a single Kubernetes cluster has included this additional rule since Dynatrace version 1.226. Consequently, most of your management zones likely already include this adaptation if you've used this template. However, if you configured management zones on your own, we recommend checking whether the new dimensional rule for metrics is included.

Overview of all new and deprecated metrics

New metric key	New metric name	Deprecated metric key(s)	Availability of new metric	Planned decommissioning of old/deprecated metrics
builtin:kubernetes.containers	Kubernetes: Container count	builtin:cloud.kubernetes.pod.containers	ActiveGate 245	ActiveGate 253
builtin:kubernetes.pods	Kubernetes: Pod count (by workload)	builtin:cloud.kubernetes.workload.pods builtin:cloud.kubernetes.namespace.runningPods builtin:cloud.kubernetes.workload.runningPods	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.pods	Kubernetes: Pod count (by node)	-	ActiveGate 245	-
builtin:kubernetes.workloads	Kubernetes: Workload count	builtin:cloud.kubernetes.namespace.workloads	ActiveGate 245	ActiveGate 253
builtin:kubernetes.nodes	Kubernetes: Node count	builtin:cloud.kubernetes.cluster.nodes	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.limits_cpu	Kubernetes: Resource quota - CPU limits	builtin:cloud.kubernetes.namespace.quota.cpuLimits	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.limits_cpu_used	Kubernetes: Resource quota - CPU limits used	builtin:cloud.kubernetes.namespace.quota.usedCpuLimits	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.requests_cpu	Kubernetes: Resource quota - CPU requests	builtin:cloud.kubernetes.namespace.quota.cpuRequests	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.requests_cpu_used	Kubernetes: Resource quota - CPU requests used	builtin:cloud.kubernetes.namespace.quota.usedCpuRequests	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.limits_memory	Kubernetes: Resource quota - memory limits	builtin:cloud.kubernetes.namespace.quota.memoryLimits	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.limits_memory_used	Kubernetes: Resource quota - memory limits used	builtin:cloud.kubernetes.namespace.quota.usedMemoryLimits	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.requests_memory	Kubernetes: Resource quota - memory requests	builtin:cloud.kubernetes.namespace.quota.memoryRequests	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.requests_memory_used	Kubernetes: Resource quota - memory requests used	builtin:cloud.kubernetes.namespace.quota.usedMemoryRequests	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.pods	Kubernetes: Resource quota - pod count	builtin:cloud.kubernetes.namespace.quota.pods	ActiveGate 245	ActiveGate 253
builtin:kubernetes.resourcequota.pods_used	Kubernetes: Resource quota - pod used count	builtin:cloud.kubernetes.namespace.quota.usedPods	ActiveGate 245	ActiveGate 253
builtin:kubernetes.workload.limits_cpu	Kubernetes: Pod - CPU limits (by workload)	builtin:cloud.kubernetes.pod.cpuLimits builtin:cloud.kubernetes.namespace.cpuLimits builtin:cloud.kubernetes.cluster.cpuLimit builtin:cloud.kubernetes.cluster.cpuLimitStatistics	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.limits_cpu	Kubernetes: Pod - CPU limits (by node)	builtin:cloud.kubernetes.node.cpuLimit builtin:cloud.kubernetes.cluster.cpuLimitStatistics builtin:cloud.kubernetes.cluster.cpuLimit	ActiveGate 245	ActiveGate 253
builtin:kubernetes.workload.requests_cpu	Kubernetes: Pod - CPU requests (by workload)	builtin:cloud.kubernetes.pod.cpuRequests builtin:cloud.kubernetes.namespace.cpuRequests builtin:cloud.kubernetes.cluster.cpuRequestedStatistics builtin:cloud.kubernetes.cluster.cpuRequested	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.requests_cpu	Kubernetes: Pod - CPU requests (by node)	builtin:cloud.kubernetes.node.cpuRequested builtin:cloud.kubernetes.cluster.cpuRequestedStatistics builtin:cloud.kubernetes.cluster.cpuRequested	ActiveGate 245	ActiveGate 253
builtin:kubernetes.workload.limits_memory	Kubernetes: Pod - memory limits (by workload)	builtin:cloud.kubernetes.pod.memoryLimits builtin:cloud.kubernetes.namespace.memoryLimits builtin:cloud.kubernetes.cluster.memoryLimitStatistics builtin:cloud.kubernetes.cluster.memoryLimit	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.limits_memory	Kubernetes: Pod - memory limits (by node)	builtin:cloud.kubernetes.node.memoryLimit builtin:cloud.kubernetes.cluster.memoryLimitStatistics builtin:cloud.kubernetes.cluster.memoryLimit	ActiveGate 245	ActiveGate 253
builtin:kubernetes.workload.requests_memory	Kubernetes: Pod - memory requests (by workload)	builtin:cloud.kubernetes.pod.memoryRequests builtin:cloud.kubernetes.namespace.memoryRequests builtin:cloud.kubernetes.cluster.memoryRequestedStatistics builtin:cloud.kubernetes.cluster.memoryRequested	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.requests_memory	Kubernetes: Pod - memory requests (by node)	builtin:cloud.kubernetes.node.memoryRequested builtin:cloud.kubernetes.cluster.memoryRequestedStatistics builtin:cloud.kubernetes.cluster.memoryRequested	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.cpu_allocatable	Kubernetes: Node - CPU allocatable	builtin:cloud.kubernetes.node.cores builtin:cloud.kubernetes.cluster.cores	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.memory_allocatable	Kubernetes: Node - memory allocatable	builtin:cloud.kubernetes.node.memory builtin:cloud.kubernetes.cluster.memory	ActiveGate 245	ActiveGate 253
builtin:kubernetes.node.pods_allocatable	Kubernetes: Node - pod allocatable count	-	ActiveGate 245	ActiveGate 253
builtin:kubernetes.workload.pods_desired	Kubernetes: Workload - desired pod count	builtin:cloud.kubernetes.workload.desiredPods builtin:cloud.kubernetes.namespace.desiredPods	ActiveGate 245	ActiveGate 253
builtin:kubernetes.workload.containers_desired	Kubernetes: Pod - desired container count	builtin:cloud.kubernetes.pod.desiredContainers	ActiveGate 245	ActiveGate 253
builtin:kubernetes.container.oom_kills	Kubernetes: Container - out of memory (OOM) kill count	-	ActiveGate 245	ActiveGate 253
builtin:kubernetes.container.restarts	Kubernetes: Container - restart count	builtin:cloud.kubernetes.pod.containerRestarts	ActiveGate 247	ActiveGate 253
builtin:kubernetes.node.conditions	Kubernetes: Node conditions	builtin:cloud.kubernetes.node_conditions	ActiveGate 249	ActiveGate 253
builtin:kubernetes.cluster.readyz	Kubernetes: Cluster readyz status	builtin:cloud.kubernetes.cluster.readyz	ActiveGate 249	ActiveGate 253
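For exported dashboard or alerting configurations, the table above can be turned into a simple lookup. The following is a minimal sketch, not a migration tool: it only copies a handful of rows from the table, the names `REPLACEMENTS` and `find_deprecated_keys` are made up for this example, and because there is no one-to-one migration, it only suggests candidate metrics rather than rewriting anything.

```python
import re

# Subset of the overview table: deprecated key -> suggested new key(s).
# Extend this with the remaining rows as needed.
REPLACEMENTS = {
    "builtin:cloud.kubernetes.pod.containers": ["builtin:kubernetes.containers"],
    "builtin:cloud.kubernetes.workload.pods": ["builtin:kubernetes.pods"],
    "builtin:cloud.kubernetes.namespace.runningPods": ["builtin:kubernetes.pods"],
    "builtin:cloud.kubernetes.workload.runningPods": ["builtin:kubernetes.pods"],
    "builtin:cloud.kubernetes.cluster.nodes": ["builtin:kubernetes.nodes"],
    "builtin:cloud.kubernetes.cluster.cpuLimit": [
        "builtin:kubernetes.workload.limits_cpu",
        "builtin:kubernetes.node.limits_cpu",
    ],
    "builtin:cloud.kubernetes.pod.containerRestarts": [
        "builtin:kubernetes.container.restarts"
    ],
}

# All deprecated keys in the table share the "builtin:cloud.kubernetes." prefix.
KEY_PATTERN = re.compile(r"builtin:cloud\.kubernetes\.[A-Za-z0-9_.]+")


def find_deprecated_keys(text: str) -> dict:
    """Return every old-prefix metric key found in text with suggested replacements.

    Keys that are not part of the REPLACEMENTS subset map to an empty list.
    """
    found = {}
    for match in KEY_PATTERN.findall(text):
        key = match.rstrip(".")  # drop a trailing sentence period, if any
        found[key] = REPLACEMENTS.get(key, [])
    return found


sample = 'builtin:cloud.kubernetes.workload.pods:splitBy("pod_phase")'
for old_key, new_keys in find_deprecated_keys(sample).items():
    print(old_key, "->", new_keys or "see overview table")
```

Matching whole keys with a regular expression, instead of plain substring search, avoids confusing keys that share a prefix, such as `cluster.cpuLimit` and `cluster.cpuLimitStatistics`.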

Brace yourselves - cloud-native deployments are coming.

jason_gs · ‎26 Jul 2022

The Metric Audit Report requires a URL to perform a scan. For Managed deployments where the FQDN is not publicly accessible, what options exist for running the report?

Thanks

Mizső · ‎26 Jul 2022

Hi Jason,

You should use a Cluster ActiveGate for this purpose. If you run synthetic tests via a Cluster ActiveGate, it should be accessible from the public internet on port 9999. You can set a public endpoint under cluster management / settings / public endpoints, if your security policies allow it.

This was my setup for the audit and it worked fine:

https://dynatrace-activegate-outside.xx.hu:9999/e/envid

Br, Mizső

Dynatrace Community RockStar 2024, Certified Dynatrace Professional

florian_g · ‎27 Jul 2022

Hi @jason_gs,
We had air-gapped environments in mind when building this. It's all JavaScript and runs locally on your machine. Your browser talks directly to the DT tenant. Consequently, even the following should be possible:
* load the page (and/or save it for offline use)
* cut your internet connection
* connect to your air-gapped network
* provide the internal URL to the DT tenant and hit run 🙂

AlanZ · ‎07 Sep 2022

Hi there,

We are using "builtin:cloud.kubernetes.workload.pods" split by "Pod phase" to detect and alert on failed pods (PHASE_FAILED). Which metric and splitting would be equivalent in the new approach, so that we do not lose functionality?

florian_g · ‎11 Jul 2023

Just wanted to note that you can now also use our k8s platform alerts to alert on "pod failure events". This should make things much easier 🙂 For more info please read here.

Mizső · ‎08 Sep 2022

Hi AlanZ,

I think this metric can be used for it:

Metric name: Kubernetes: Pod count (by workload)

Metric key: builtin:kubernetes.pods

Description: This metric measures the number of pods. The most detailed level of aggregation is workload. The value corresponds to the count of all pods.

Dimensions: current_pod_condition, Kubernetes workload (dt.entity.cloud_application), Kubernetes namespace (dt.entity.cloud_application_namespace), Kubernetes cluster (dt.entity.kubernetes_cluster), k8s.cluster.name, k8s.cronjob.name, k8s.daemonset.name, k8s.deployment.name, k8s.namespace.name, k8s.pod.name, k8s.statefulset.name, k8s.staticpod.name, k8s.workload.kind, k8s.workload.name, pod_phase, pod_status_reason

I have already started to use this:

builtin:kubernetes.pods:filter(and(eq(pod_phase,Failed))):splitBy():sort(value(auto,descending)):limit(10)
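If the same selector should be evaluated outside the UI, it can be sent to the Metrics API v2 query endpoint. Below is a rough sketch that only builds the request URL and the authorization header. The helper names, the example host, and the `now-2h` timeframe are assumptions made for this example, and the token is assumed to have permission to read metrics.

```python
from urllib.parse import urlencode

# The selector from above, split over several lines for readability.
SELECTOR = (
    "builtin:kubernetes.pods"
    ":filter(and(eq(pod_phase,Failed)))"
    ":splitBy()"
    ":sort(value(auto,descending))"
    ":limit(10)"
)


def build_query_url(environment_url: str, selector: str,
                    time_from: str = "now-2h") -> str:
    """Build a Metrics API v2 query URL for the given metric selector.

    environment_url is the base URL of the environment; for Managed this
    includes the /e/<environment-id> part.
    """
    query = urlencode({"metricSelector": selector, "from": time_from})
    return f"{environment_url.rstrip('/')}/api/v2/metrics/query?{query}"


def auth_headers(api_token: str) -> dict:
    """Return the HTTP header used for API token authentication."""
    return {"Authorization": f"Api-Token {api_token}"}


# Hypothetical host, for illustration only.
print(build_query_url("https://example.invalid/e/envid", SELECTOR))
```

The URL and headers can then be passed to any HTTP client. URL-encoding the selector matters here because it contains colons, commas, and parentheses.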

You should check Florian's other posts; there are many good ideas and best practices in them, for example:

dynatrace-api/metric-expressions-for-k8s.md at master · Dynatrace/dynatrace-api · GitHub

Br, Mizső

The_AM · ‎10 Nov 2023

We noticed that some other metrics are deprecated and have stopped reporting data, such as the cluster CPU metric.

Is there an update that could be provided on the table to reflect this and if there are alternatives?

Regards,
Andrew M.

Theodore_x86 · ‎11 Apr 2024

Hello.

Is there a memory requests metric for containers? We only see one for pods.

BR

Houston, we have a problem.

CrazyHarry · ‎12 Feb 2025

Is this the complete list of Kubernetes metrics available? Wasn't sure if there was some other master list

Mizső · ‎12 Feb 2025

Hi @CrazyHarry,

You can find the complete list in the documentation (GEN3 and Classic):

https://docs.dynatrace.com/docs/shortlink/built-in-metrics-on-grail#kubernetes-main

I hope it helps.

Best regards,

Mizső
