13 Jul 2022 04:51 PM - edited 20 Oct 2022 04:35 PM
As Dynatrace, we're constantly trying to improve your user experience with our product. Sometimes, this comes with breaking changes, such as in this case. Specifically, we're consolidating some of our Kubernetes-related metrics, meaning that some currently existing metrics will soon be deprecated. This post explains the why and what of this move. It also describes the potential impact on your environment and steps to migrate to our improved metrics. Please read this carefully if you use Kubernetes-related metrics in custom dashboards, custom events for alerting, or SLOs – these assets might break with this change.
In the past, based on some technical limitations, we had to provide multiple metrics to represent the same measurement on multiple levels of the Kubernetes entity hierarchy. For example, there were separate metrics for counting running pods on the workload vs. counting them on the namespace level. Also, for some levels of the Kubernetes entity hierarchy, there was no metric available at all. In case you needed that measurement, you had to come up with complex metric expressions, such as for charting running pods on a cluster level. Many people were overwhelmed by the complexity of these queries, and we got lots of requests for improvements. Consequently, we decided to take a significant step in consolidating these metrics.
We have focused on metrics originating from the Kubernetes API – technically, we do that using our ActiveGate. Consequently, the availability of the new metrics is bound to the version of the ActiveGate monitoring your Kubernetes cluster. In total, we add 30 new metrics with ActiveGate 1.247. We also use this chance to move towards our new Kubernetes metrics prefix: "builtin:kubernetes". The new prefix lets you spot new metrics within our product immediately. Navigate to the metrics browser and search for "builtin:kubernetes" as shown in the screenshot below. Please note that the search result will show both old and new metrics, as it performs a full-text search. It's also worth mentioning that you can already find these metrics in our product today, but they will only contain data for Kubernetes clusters monitored with an ActiveGate of version 1.247 or higher.
Unfolding the details of any of these new metrics shows the vast selection of dimensions available for splitting or filtering.
The new metrics will eventually replace the old metrics – until then, both metrics will be written in parallel. With cluster version 1.246 the old metrics are prefixed with "[Deprecated]" in the display name. Further details about the deprecation can be found in the metrics section of our product within the description for each of these metrics. Especially the information on which metric to use as the replacement may prove helpful.
New and old metrics are written in parallel until you upgrade to ActiveGate version 1.253 (~3 months). That means that with ActiveGate version 1.253, the old metrics won't be written anymore. Already persisted data can still be read and analyzed, but no data will be migrated into the new metrics.
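The cutoff described above can be expressed as a small version check. This is only a sketch: the 1.253 cutoff comes from this post, while the version-string format and the function names are assumptions made for illustration.

```python
from typing import Tuple

# Cutoff taken from the post: from ActiveGate 1.253 on, old metrics are no longer written.
OLD_METRICS_CUTOFF = (1, 253)


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a string such as '1.247.0' (format is an assumption)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def old_metrics_still_written(activegate_version: str) -> bool:
    """True while the ActiveGate is older than the cutoff version."""
    return parse_version(activegate_version) < OLD_METRICS_CUTOFF
```

If several ActiveGates monitor different clusters, the check would have to be applied per ActiveGate, since each one stops writing the old metrics when it is upgraded.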
Metrics play a significant role in Dynatrace. Especially in an orchestration system like Kubernetes, metrics are key to provisioning, infrastructure optimization, failure analysis, and much more. Therefore, metrics are used in a variety of places, namely dashboards, WebUI pages, custom events for alerting, and our REST API.
All these places need to be updated to the new metrics. We, as Dynatrace, will update the Kubernetes dashboard presets and the Kubernetes monitoring WebUI pages over the next few releases. However, as there is no straightforward migration from an old metric to a new metric, we can't automatically migrate your custom dashboards, custom events for alerting, SLOs, or any other tool that uses our API to access these metrics.
We understand that this might come with some effort on your side. As always, we want to keep your efforts as minimal as possible, so we offer a web application, called Metric Audit Report, guiding you through this process. Specifically, the Metric Audit Report provides you with a list of affected custom dashboards, custom events for alerting, and SLOs, alongside additional helpful information, such as the corresponding owner. We hope this makes migration fast and easy, so you can quickly focus again on all the benefits of our new metrics around Kubernetes.
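If you keep dashboard or alerting configurations as exported JSON, a quick local scan can give a first impression of what is affected. The sketch below is not how the Metric Audit Report works; it simply walks a JSON document and reports every string that mentions the old "builtin:cloud.kubernetes." prefix, which all deprecated keys in the table below share. The dashboard fragment and the function name are hypothetical, and other keys with the same prefix may exist, so treat hits as candidates to review.

```python
import json

# All deprecated keys listed in this post start with this prefix.
DEPRECATED_PREFIX = "builtin:cloud.kubernetes."


def find_deprecated_metric_strings(node, path="$"):
    """Yield (path, string) for every string value that mentions the old prefix."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_deprecated_metric_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_deprecated_metric_strings(value, f"{path}[{index}]")
    elif isinstance(node, str) and DEPRECATED_PREFIX in node:
        yield path, node


# Hypothetical dashboard fragment; real exports have a richer structure.
dashboard = json.loads("""
{
  "name": "K8s overview",
  "tiles": [
    {"query": "builtin:cloud.kubernetes.workload.pods:splitBy()"},
    {"query": "builtin:kubernetes.nodes:splitBy()"}
  ]
}
""")

for location, text in find_deprecated_metric_strings(dashboard):
    print(location, "->", text)
```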
Also note that the new metrics work differently with management zones. The old metrics were tied to primary entities such as namespaces or workloads. Consequently, including these entities in a management zone also included the corresponding metrics. With the new metrics, all the entities are available in the metrics' dimensions. Hence, an additional "Dimensional rule for Metrics" is required. The template for setting up a management zone for a single Kubernetes cluster has included this additional rule since Dynatrace version 1.226. Consequently, most of your management zones likely already contain this rule if you've used this template. However, if you configured management zones on your own, we recommend verifying that the new dimensional rule for metrics is included.
New metric key | New metric name | Deprecated metric key(s) | Availability of new metric | Planned decommissioning of old/deprecated metrics |
builtin:kubernetes.containers | Kubernetes: Container count | builtin:cloud.kubernetes.pod.containers | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.pods | Kubernetes: Pod count (by workload) | builtin:cloud.kubernetes.workload.pods, builtin:cloud.kubernetes.namespace.runningPods, builtin:cloud.kubernetes.workload.runningPods | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.pods | Kubernetes: Pod count (by node) | - | ActiveGate 245 | - |
builtin:kubernetes.workloads | Kubernetes: Workload count | builtin:cloud.kubernetes.namespace.workloads | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.nodes | Kubernetes: Node count | builtin:cloud.kubernetes.cluster.nodes | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.limits_cpu | Kubernetes: Resource quota - CPU limits | builtin:cloud.kubernetes.namespace.quota.cpuLimits | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.limits_cpu_used | Kubernetes: Resource quota - CPU limits used | builtin:cloud.kubernetes.namespace.quota.usedCpuLimits | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.requests_cpu | Kubernetes: Resource quota - CPU requests | builtin:cloud.kubernetes.namespace.quota.cpuRequests | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.requests_cpu_used | Kubernetes: Resource quota - CPU requests used | builtin:cloud.kubernetes.namespace.quota.usedCpuRequests | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.limits_memory | Kubernetes: Resource quota - memory limits | builtin:cloud.kubernetes.namespace.quota.memoryLimits | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.limits_memory_used | Kubernetes: Resource quota - memory limits used | builtin:cloud.kubernetes.namespace.quota.usedMemoryLimits | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.requests_memory | Kubernetes: Resource quota - memory requests | builtin:cloud.kubernetes.namespace.quota.memoryRequests | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.requests_memory_used | Kubernetes: Resource quota - memory requests used | builtin:cloud.kubernetes.namespace.quota.usedMemoryRequests | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.pods | Kubernetes: Resource quota - pod count | builtin:cloud.kubernetes.namespace.quota.pods | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.resourcequota.pods_used | Kubernetes: Resource quota - pod used count | builtin:cloud.kubernetes.namespace.quota.usedPods | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.workload.limits_cpu | Kubernetes: Pod - CPU limits (by workload) | builtin:cloud.kubernetes.pod.cpuLimits, builtin:cloud.kubernetes.namespace.cpuLimits, builtin:cloud.kubernetes.cluster.cpuLimit, builtin:cloud.kubernetes.cluster.cpuLimitStatistics | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.limits_cpu | Kubernetes: Pod - CPU limits (by node) | builtin:cloud.kubernetes.node.cpuLimit, builtin:cloud.kubernetes.cluster.cpuLimitStatistics, builtin:cloud.kubernetes.cluster.cpuLimit | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.workload.requests_cpu | Kubernetes: Pod - CPU requests (by workload) | builtin:cloud.kubernetes.pod.cpuRequests, builtin:cloud.kubernetes.namespace.cpuRequests, builtin:cloud.kubernetes.cluster.cpuRequestedStatistics, builtin:cloud.kubernetes.cluster.cpuRequested | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.requests_cpu | Kubernetes: Pod - CPU requests (by node) | builtin:cloud.kubernetes.node.cpuRequested, builtin:cloud.kubernetes.cluster.cpuRequestedStatistics, builtin:cloud.kubernetes.cluster.cpuRequested | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.workload.limits_memory | Kubernetes: Pod - memory limits (by workload) | builtin:cloud.kubernetes.pod.memoryLimits, builtin:cloud.kubernetes.namespace.memoryLimits, builtin:cloud.kubernetes.cluster.memoryLimitStatistics, builtin:cloud.kubernetes.cluster.memoryLimit | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.limits_memory | Kubernetes: Pod - memory limits (by node) | builtin:cloud.kubernetes.node.memoryLimit, builtin:cloud.kubernetes.cluster.memoryLimitStatistics, builtin:cloud.kubernetes.cluster.memoryLimit | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.workload.requests_memory | Kubernetes: Pod - memory requests (by workload) | builtin:cloud.kubernetes.pod.memoryRequests, builtin:cloud.kubernetes.namespace.memoryRequests, builtin:cloud.kubernetes.cluster.memoryRequestedStatistics, builtin:cloud.kubernetes.cluster.memoryRequested | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.requests_memory | Kubernetes: Pod - memory requests (by node) | builtin:cloud.kubernetes.node.memoryRequested, builtin:cloud.kubernetes.cluster.memoryRequestedStatistics, builtin:cloud.kubernetes.cluster.memoryRequested | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.cpu_allocatable | Kubernetes: Node - CPU allocatable | builtin:cloud.kubernetes.node.cores, builtin:cloud.kubernetes.cluster.cores | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.memory_allocatable | Kubernetes: Node - memory allocatable | builtin:cloud.kubernetes.node.memory, builtin:cloud.kubernetes.cluster.memory | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.node.pods_allocatable | Kubernetes: Node - pod allocatable count | - | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.workload.pods_desired | Kubernetes: Workload - desired pod count | builtin:cloud.kubernetes.workload.desiredPods, builtin:cloud.kubernetes.namespace.desiredPods | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.workload.containers_desired | Kubernetes: Pod - desired container count | builtin:cloud.kubernetes.pod.desiredContainers | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.container.oom_kills | Kubernetes: Container - out of memory (OOM) kill count | - | ActiveGate 245 | ActiveGate 253 |
builtin:kubernetes.container.restarts | Kubernetes: Container - restart count | builtin:cloud.kubernetes.pod.containerRestarts | ActiveGate 247 | ActiveGate 253 |
builtin:kubernetes.node.conditions | Kubernetes: Node conditions | builtin:cloud.kubernetes.node_conditions | ActiveGate 249 | ActiveGate 253 |
builtin:kubernetes.cluster.readyz | Kubernetes: Cluster readyz status | builtin:cloud.kubernetes.cluster.readyz | ActiveGate 249 | ActiveGate 253 |
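Because some deprecated keys appear in more than one row (for example, the cluster-level CPU limit metrics are listed for both the workload and the node variant), a lookup from old key to new key has to return a list of candidates rather than a single replacement. A minimal sketch with a few rows copied from the table; the helper names are made up for illustration, and choosing between candidates still depends on which dimensions you need:

```python
from collections import defaultdict

# A few rows copied from the table: new key -> deprecated keys it replaces.
REPLACEMENTS = {
    "builtin:kubernetes.pods": [
        "builtin:cloud.kubernetes.workload.pods",
        "builtin:cloud.kubernetes.namespace.runningPods",
        "builtin:cloud.kubernetes.workload.runningPods",
    ],
    "builtin:kubernetes.workload.limits_cpu": [
        "builtin:cloud.kubernetes.pod.cpuLimits",
        "builtin:cloud.kubernetes.namespace.cpuLimits",
        "builtin:cloud.kubernetes.cluster.cpuLimit",
        "builtin:cloud.kubernetes.cluster.cpuLimitStatistics",
    ],
    "builtin:kubernetes.node.limits_cpu": [
        "builtin:cloud.kubernetes.node.cpuLimit",
        "builtin:cloud.kubernetes.cluster.cpuLimitStatistics",
        "builtin:cloud.kubernetes.cluster.cpuLimit",
    ],
}


def invert(replacements):
    """Build deprecated key -> list of new keys that may replace it."""
    candidates = defaultdict(list)
    for new_key, old_keys in replacements.items():
        for old_key in old_keys:
            candidates[old_key].append(new_key)
    return dict(candidates)


CANDIDATES = invert(REPLACEMENTS)


def replacement_candidates(deprecated_key):
    """Return the new metric keys listed for a deprecated key (empty if unknown)."""
    return CANDIDATES.get(deprecated_key, [])
```

For example, `replacement_candidates("builtin:cloud.kubernetes.cluster.cpuLimit")` returns both the workload and the node variant, so the choice has to be made per chart or alert.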
26 Jul 2022 10:08 AM
The Metric Audit Report requires a URL to perform a scan. For Managed deployments where the FQDN is not publicly accessible, what options exist for running the report?
Thanks
26 Jul 2022 10:39 AM
Hi Jason,
You should use a Cluster ActiveGate for this purpose. If you run synthetic tests via a Cluster ActiveGate, it should be accessible from the public internet on port 9999. You can set a public endpoint under Cluster management / Settings / Public endpoints, if your security policies allow it.
This was my setup for the audit and it worked fine:
https://dynatrace-activegate-outside.xx.hu:9999/e/envid
Br, Mizső
27 Jul 2022 07:23 AM - edited 27 Jul 2022 08:33 AM
Hi @jason_gs ,
we had air-gapped environments in mind when building this. It's all JavaScript and runs locally on your machine. Your browser directly talks to the DT tenant. Consequently, even the following should be possible:
* load the page (and/or save it for offline use)
* cut your internet connection
* connect to your air-gapped network
* provide the internal URL to the DT tenant and hit run 🙂
07 Sep 2022 11:13 PM
Hi there,
We are using "builtin:cloud.kubernetes.workload.pods" split by "Pod phase" to detect and alert on failed pods (PHASE_FAILED). Which metric and splitting would be equivalent in the new approach, so that we do not lose functionality?
11 Jul 2023 01:24 PM
Just wanted to note that you can now also use our k8s platform alerts to easily alert on "pod failure events". This should make this much easier 🙂 For more info please read here.
08 Sep 2022 06:06 AM
Hi AlanZ,
I think this metric can be used for it:
Metric name: Kubernetes: Pod count (by workload)
Metric key: builtin:kubernetes.pods
Description: This metric measures the number of pods. The most detailed level of aggregation is workload. The value corresponds to the count of all pods.
Dimensions: current_pod_condition, Kubernetes workload (dt.entity.cloud_application), Kubernetes namespace (dt.entity.cloud_application_namespace), Kubernetes cluster (dt.entity.kubernetes_cluster), k8s.cluster.name, k8s.cronjob.name, k8s.daemonset.name, k8s.deployment.name, k8s.namespace.name, k8s.pod.name, k8s.statefulset.name, k8s.staticpod.name, k8s.workload.kind, k8s.workload.name, pod_phase, pod_status_reason
I have already started to use this:
builtin:kubernetes.pods:filter(and(eq(pod_phase,Failed))):splitBy():sort(value(auto,descending)):limit(10)
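For anyone who wants to try the selector above outside the UI, here is a rough sketch of how the request to the Metrics API v2 query endpoint (/api/v2/metrics/query) could be built. The base URL, environment ID, token value, and function name are placeholders I made up; the token would need permission to read metrics. The code only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

SELECTOR = (
    "builtin:kubernetes.pods:filter(and(eq(pod_phase,Failed)))"
    ":splitBy():sort(value(auto,descending)):limit(10)"
)


def build_metric_query(base_url, api_token, selector, time_from="now-2h"):
    """Build (but do not send) a GET request for the metrics query endpoint."""
    query = urlencode({"metricSelector": selector, "from": time_from})
    return Request(
        f"{base_url.rstrip('/')}/api/v2/metrics/query?{query}",
        headers={"Authorization": f"Api-Token {api_token}"},
    )


# Placeholder URL and token; replace with your own environment details.
request = build_metric_query(
    "https://example.invalid/e/ENVIRONMENT_ID", "dt0c01.SAMPLE", SELECTOR
)
print(request.full_url)
```

To send it, pass the request to urllib.request.urlopen and parse the JSON body. Note that splitBy() without arguments aggregates all failed pods into one series; splitting by one of the dimensions listed above, such as k8s.workload.name, would give one series per workload.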
You should check Florian's other posts; there are many good ideas and best practices in them, for example:
dynatrace-api/metric-expressions-for-k8s.md at master · Dynatrace/dynatrace-api · GitHub
Br, Mizső
10 Nov 2023 05:01 AM
We noticed that some other metrics are deprecated and have stopped reporting data, such as the cluster CPU metric.
Could the table be updated to reflect this and to list alternatives, if there are any?
11 Apr 2024 09:39 AM
Hello.
Is there a memory requests metric for containers? We only see one for pods.
BR