
Introducing new Kubernetes metrics for improved user experience (and deprecating old ones)

florian_g
Dynatrace Helper

At Dynatrace, we're constantly trying to improve your user experience with our product. Sometimes this comes with breaking changes, as in this case. Specifically, we're consolidating some of our Kubernetes-related metrics, which means that some existing metrics will soon be deprecated. This post explains the why and what of this move. It also describes the potential impact on your environment and the steps to migrate to our improved metrics. Please read this carefully if you use Kubernetes-related metrics in custom dashboards, custom events for alerting, or SLOs, because these assets might break with this change.

 

In the past, because of technical limitations, we had to provide multiple metrics to represent the same measurement on multiple levels of the Kubernetes entity hierarchy. For example, there were separate metrics for counting running pods on the workload level and on the namespace level. For some levels of the hierarchy, there was no metric available at all. If you needed such a measurement, for example to chart running pods on the cluster level, you had to come up with complex metric expressions. Many people were overwhelmed by the complexity of these queries, and we got lots of requests for improvements. Consequently, we decided to take a significant step and consolidate these metrics.

 

Which metrics are affected?

We have focused on metrics originating from the Kubernetes API, which we collect through our ActiveGate. Consequently, the availability of the new metrics is bound to the version of the ActiveGate monitoring your Kubernetes cluster. In total, we add 30 new metrics with ActiveGate 1.247. We also use this chance to move towards our new Kubernetes metrics prefix: "builtin:kubernetes". The new prefix lets you spot new metrics within our product immediately. Navigate to the metrics browser and search for "builtin:kubernetes" as shown in the screenshot below. Please note that the search result will show you both old and new metrics, as it performs a full-text search. It's also worth mentioning that you can already find these metrics in our product today, but they will only contain data for Kubernetes clusters monitored with an ActiveGate of version 1.247 or higher.

florian_g_4-1657726440312.png

 

Unfolding the details of any of these new metrics shows the vast selection of dimensions available for splitting or filtering.

florian_g_5-1657726440317.png

 

Deprecation: What happens to the old metrics?

The new metrics will eventually replace the old metrics. Until then, both sets of metrics will be written in parallel. With cluster version 1.246, the old metrics are prefixed with "[Deprecated]" in the display name. Further details about the deprecation can be found in the metrics section of our product, within the description of each of these metrics. The information on which metric to use as the replacement may prove especially helpful.

florian_g_6-1657726440319.png

 

New and old metrics are written in parallel until you upgrade to ActiveGate version 1.253 (~3 months). That means that with ActiveGate version 1.253, the old metrics won't be written anymore. Already persisted data can still be read and analyzed, but no data will be migrated into the new metrics.
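
The timeline above can be sketched as a small function. This is only an illustration of the rule described in this post, not product code: the function name and constants are made up for the example, and the starting version of 247 is an assumption taken from the text (the overview table further down lists 245, 247, or 249 depending on the metric).

```python
# Illustrative only: thresholds are taken from the text of this post.
OLD_METRICS_END = 253    # first ActiveGate minor version that no longer writes old metrics
NEW_METRICS_START = 247  # assumed default; the overview table lists a value per metric


def written_metric_sets(activegate_minor: int, new_from: int = NEW_METRICS_START) -> set:
    """Return which metric generations ("old", "new") an ActiveGate of this version writes."""
    written = set()
    if activegate_minor < OLD_METRICS_END:
        written.add("old")
    if activegate_minor >= new_from:
        written.add("new")
    return written


for version in (246, 250, 253):
    print(version, sorted(written_metric_sets(version)))
```

Between the two thresholds both generations are written, which is the window in which dashboards, alerts, and SLOs can be migrated without a gap in the data.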

 

florian_g_0-1658143849169.png

 

 

The impact on your environment and steps for a successful migration to our improved metrics

Metrics play a significant role in Dynatrace. Especially in an orchestration system like Kubernetes, metrics are key to provisioning, infrastructure optimization, failure analysis, and much more. Therefore, metrics are used in a variety of places, namely dashboards, WebUI pages, custom events for alerting, and our REST API.

 

All these places need to be updated to the new metrics. We at Dynatrace will update the Kubernetes dashboard presets and the Kubernetes monitoring WebUI pages over the next few releases. However, as there is no straightforward migration from an old metric to a new metric, we can't automatically migrate your custom dashboards, custom events for alerting, SLOs, or any other tool that uses our API to access these metrics.

 

We understand that this might come with some effort on your side. As always, we want to keep your effort as low as possible, so we offer a web application called Metric Audit Report that guides you through this process. Specifically, the Metric Audit Report provides you with a list of affected custom dashboards, custom events for alerting, and SLOs, alongside additional helpful information, such as the corresponding owner. We hope this makes migration fast and easy, so you can quickly focus again on all the benefits of our new Kubernetes metrics.
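
To illustrate the kind of check involved, here is a minimal sketch that walks an exported JSON document and reports every string that still mentions the old "builtin:cloud.kubernetes." prefix. It is not how the Metric Audit Report is implemented; the function name and the sample dashboard structure (field names such as "tiles" and "query") are invented for this example.

```python
import json

DEPRECATED_PREFIX = "builtin:cloud.kubernetes."


def find_deprecated_keys(node, path="$"):
    """Yield (json_path, value) for every string value that mentions the old prefix."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_deprecated_keys(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_deprecated_keys(value, f"{path}[{index}]")
    elif isinstance(node, str) and DEPRECATED_PREFIX in node:
        yield path, node


# Hypothetical dashboard export; the field names are assumptions for this example.
dashboard = json.loads("""
{
  "name": "K8s overview",
  "owner": "someone@example.com",
  "tiles": [
    {"title": "Running pods", "query": "builtin:cloud.kubernetes.workload.runningPods:splitBy()"},
    {"title": "Pods", "query": "builtin:kubernetes.pods:splitBy()"}
  ]
}
""")

hits = list(find_deprecated_keys(dashboard))
for json_path, value in hits:
    print(f"{dashboard['owner']}: {json_path} -> {value}")
```

Matching on the full old prefix rather than on "kubernetes" alone avoids flagging tiles that already use the new "builtin:kubernetes" metrics.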

florian_g_0-1658128684298.png

 

 

 

Overview of all new and deprecated metrics

New metric key | New metric name | Deprecated metric key(s) | Availability of new metric | Planned decommissioning of old/deprecated metrics
--- | --- | --- | --- | ---
builtin:kubernetes.containers | Kubernetes: Container count | builtin:cloud.kubernetes.pod.containers | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.pods | Kubernetes: Pod count (by workload) | builtin:cloud.kubernetes.workload.pods, builtin:cloud.kubernetes.namespace.runningPods, builtin:cloud.kubernetes.workload.runningPods | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.pods | Kubernetes: Pod count (by node) | - | ActiveGate 245 | -
builtin:kubernetes.workloads | Kubernetes: Workload count | builtin:cloud.kubernetes.namespace.workloads | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.nodes | Kubernetes: Node count | builtin:cloud.kubernetes.cluster.nodes | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.limits_cpu | Kubernetes: Resource quota - CPU limits | builtin:cloud.kubernetes.namespace.quota.cpuLimits | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.limits_cpu_used | Kubernetes: Resource quota - CPU limits used | builtin:cloud.kubernetes.namespace.quota.usedCpuLimits | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.requests_cpu | Kubernetes: Resource quota - CPU requests | builtin:cloud.kubernetes.namespace.quota.cpuRequests | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.requests_cpu_used | Kubernetes: Resource quota - CPU requests used | builtin:cloud.kubernetes.namespace.quota.usedCpuRequests | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.limits_memory | Kubernetes: Resource quota - memory limits | builtin:cloud.kubernetes.namespace.quota.memoryLimits | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.limits_memory_used | Kubernetes: Resource quota - memory limits used | builtin:cloud.kubernetes.namespace.quota.usedMemoryLimits | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.requests_memory | Kubernetes: Resource quota - memory requests | builtin:cloud.kubernetes.namespace.quota.memoryRequests | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.requests_memory_used | Kubernetes: Resource quota - memory requests used | builtin:cloud.kubernetes.namespace.quota.usedMemoryRequests | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.pods | Kubernetes: Resource quota - pod count | builtin:cloud.kubernetes.namespace.quota.pods | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.resourcequota.pods_used | Kubernetes: Resource quota - pod used count | builtin:cloud.kubernetes.namespace.quota.usedPods | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.workload.limits_cpu | Kubernetes: Pod - CPU limits (by workload) | builtin:cloud.kubernetes.pod.cpuLimits, builtin:cloud.kubernetes.namespace.cpuLimits, builtin:cloud.kubernetes.cluster.cpuLimit, builtin:cloud.kubernetes.cluster.cpuLimitStatistics | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.limits_cpu | Kubernetes: Pod - CPU limits (by node) | builtin:cloud.kubernetes.node.cpuLimit, builtin:cloud.kubernetes.cluster.cpuLimitStatistics, builtin:cloud.kubernetes.cluster.cpuLimit | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.workload.requests_cpu | Kubernetes: Pod - CPU requests (by workload) | builtin:cloud.kubernetes.pod.cpuRequests, builtin:cloud.kubernetes.namespace.cpuRequests, builtin:cloud.kubernetes.cluster.cpuRequestedStatistics, builtin:cloud.kubernetes.cluster.cpuRequested | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.requests_cpu | Kubernetes: Pod - CPU requests (by node) | builtin:cloud.kubernetes.node.cpuRequested, builtin:cloud.kubernetes.cluster.cpuRequestedStatistics, builtin:cloud.kubernetes.cluster.cpuRequested | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.workload.limits_memory | Kubernetes: Pod - memory limits (by workload) | builtin:cloud.kubernetes.pod.memoryLimits, builtin:cloud.kubernetes.namespace.memoryLimits, builtin:cloud.kubernetes.cluster.memoryLimitStatistics, builtin:cloud.kubernetes.cluster.memoryLimit | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.limits_memory | Kubernetes: Pod - memory limits (by node) | builtin:cloud.kubernetes.node.memoryLimit, builtin:cloud.kubernetes.cluster.memoryLimitStatistics, builtin:cloud.kubernetes.cluster.memoryLimit | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.workload.requests_memory | Kubernetes: Pod - memory requests (by workload) | builtin:cloud.kubernetes.pod.memoryRequests, builtin:cloud.kubernetes.namespace.memoryRequests, builtin:cloud.kubernetes.cluster.memoryRequestedStatistics, builtin:cloud.kubernetes.cluster.memoryRequested | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.requests_memory | Kubernetes: Pod - memory requests (by node) | builtin:cloud.kubernetes.node.memoryRequested, builtin:cloud.kubernetes.cluster.memoryRequestedStatistics, builtin:cloud.kubernetes.cluster.memoryRequested | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.cpu_allocatable | Kubernetes: Node - CPU allocatable | builtin:cloud.kubernetes.node.cores, builtin:cloud.kubernetes.cluster.cores | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.memory_allocatable | Kubernetes: Node - memory allocatable | builtin:cloud.kubernetes.node.memory, builtin:cloud.kubernetes.cluster.memory | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.node.pods_allocatable | Kubernetes: Node - pod allocatable count | - | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.workload.pods_desired | Kubernetes: Workload - desired pod count | builtin:cloud.kubernetes.workload.desiredPods, builtin:cloud.kubernetes.namespace.desiredPods | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.workload.containers_desired | Kubernetes: Pod - desired container count | builtin:cloud.kubernetes.pod.desiredContainers | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.container.oom_kills | Kubernetes: Container - out of memory (OOM) kill count | - | ActiveGate 245 | ActiveGate 253
builtin:kubernetes.container.restarts | Kubernetes: Container - restart count | builtin:cloud.kubernetes.pod.containerRestarts | ActiveGate 247 | ActiveGate 253
builtin:kubernetes.node.conditions | Kubernetes: Node conditions | builtin:cloud.kubernetes.node_conditions | ActiveGate 249 | ActiveGate 253
builtin:kubernetes.cluster.readyz | Kubernetes: Cluster readyz status | builtin:cloud.kubernetes.cluster.readyz | ActiveGate 249 | ActiveGate 253
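
If you maintain many assets, the overview above can be turned into a lookup from deprecated key to replacement candidates. The sketch below includes only a handful of rows from the table, and the names REPLACEMENTS and replacement_candidates are made up for this example. It deliberately returns candidates rather than rewriting anything: some old keys map to more than one new metric, and the new metric usually needs a different split or filter to reproduce the old value.

```python
# Subset of the overview table: deprecated key -> new metric key(s) that list it.
REPLACEMENTS = {
    "builtin:cloud.kubernetes.pod.containers": ["builtin:kubernetes.containers"],
    "builtin:cloud.kubernetes.workload.pods": ["builtin:kubernetes.pods"],
    "builtin:cloud.kubernetes.namespace.runningPods": ["builtin:kubernetes.pods"],
    "builtin:cloud.kubernetes.workload.runningPods": ["builtin:kubernetes.pods"],
    "builtin:cloud.kubernetes.cluster.nodes": ["builtin:kubernetes.nodes"],
    # Cluster-level CPU limits are listed under both the workload and the node metric.
    "builtin:cloud.kubernetes.cluster.cpuLimit": [
        "builtin:kubernetes.workload.limits_cpu",
        "builtin:kubernetes.node.limits_cpu",
    ],
    "builtin:cloud.kubernetes.pod.containerRestarts": ["builtin:kubernetes.container.restarts"],
}


def replacement_candidates(old_key: str) -> list:
    """Return the new metric keys listed for a deprecated key, or an empty list."""
    return REPLACEMENTS.get(old_key, [])


print(replacement_candidates("builtin:cloud.kubernetes.namespace.runningPods"))
print(replacement_candidates("builtin:cloud.kubernetes.cluster.cpuLimit"))
```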

 

One does not simply run a container...
5 Replies

jason_gs
Dynatrace Enthusiast

The Metric Audit Report requires a URL to perform a scan. For Dynatrace Managed, where the FQDN is not publicly accessible, what options exist for running the report?

 

Thanks

Hi Jason,

 

You should use a Cluster ActiveGate for this purpose. If you run synthetic tests via a Cluster ActiveGate, it should be accessible from the public internet on port 9999. You can set a public endpoint under Cluster Management / Settings / Public endpoints, if your security policies allow it.

 

This was my setup for the audit and it worked fine:

https://dynatrace-activegate-outside.xx.hu:9999/e/envid

 

Br, Mizső

Certified Dynatrace Associate

Hi @jason_gs ,
we had air-gapped environments in mind when building this. It's all JavaScript and runs locally on your machine. Your browser directly talks to the DT tenant. Consequently, even the following should be possible:
* load the page (and/or save it for offline use)
* cut your internet connection
* connect to your air-gapped network
* provide the internal URL to the DT tenant and hit run 🙂

One does not simply run a container...

AlanZ
Organizer

Hi there,

 

We are using "builtin:cloud.kubernetes.workload.pods" split by "Pod phase" to detect and alert on failed pods (PHASE_FAILED). Which metric and splitting would be equivalent in the new approach, so that we do not lose functionality?

Mizső
Helper

Hi AlanZ,

 

I think this metric can be used for it:

 

Metric name: Kubernetes: Pod count (by workload)

Metric key: builtin:kubernetes.pods

Description: This metric measures the number of pods. The most detailed level of aggregation is workload. The value corresponds to the count of all pods.

Dimensions: current_pod_condition, Kubernetes workload (dt.entity.cloud_application), Kubernetes namespace (dt.entity.cloud_application_namespace), Kubernetes cluster (dt.entity.kubernetes_cluster), k8s.cluster.name, k8s.cronjob.name, k8s.daemonset.name, k8s.deployment.name, k8s.namespace.name, k8s.pod.name, k8s.statefulset.name, k8s.staticpod.name, k8s.workload.kind, k8s.workload.name, pod_phase, pod_status_reason

 

I have already started to use this:

builtin:kubernetes.pods:filter(and(eq(pod_phase,Failed))):splitBy():sort(value(auto,descending)):limit(10)
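
If you want to try the selector above outside the UI, it can be passed to the Metrics API v2 query endpoint (/api/v2/metrics/query) as the metricSelector parameter. The following is only a sketch that builds the request URL without sending anything: the helper name, the placeholder base URL, and the chosen resolution and timeframe are assumptions for this example, and a real request also needs an API token with permission to read metrics.

```python
from urllib.parse import urlencode

# Selector copied from the reply above.
SELECTOR = (
    "builtin:kubernetes.pods:filter(and(eq(pod_phase,Failed)))"
    ":splitBy():sort(value(auto,descending)):limit(10)"
)


def build_query_url(base_url: str, metric_selector: str,
                    resolution: str = "5m", time_from: str = "now-2h") -> str:
    """Build a Metrics API v2 query URL; special characters are percent-encoded."""
    params = {
        "metricSelector": metric_selector,
        "resolution": resolution,
        "from": time_from,
    }
    return f"{base_url.rstrip('/')}/api/v2/metrics/query?{urlencode(params)}"


# Placeholder environment URL; replace it with your own tenant or environment URL.
url = build_query_url("https://example.invalid/e/envid", SELECTOR)
print(url)
```

Encoding the selector matters because it contains colons, commas, and parentheses, which should not be placed into a query string unescaped.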

 

You should check Florian's other posts; there are many good ideas and best practices in them, for example:

 

dynatrace-api/metric-expressions-for-k8s.md at master · Dynatrace/dynatrace-api · GitHub

Br, Mizső

Certified Dynatrace Associate