This is a troubleshooting guide for using Kubernetes CPU throttling data in Dynatrace. It is intended for customers who have questions or problems when using CPU throttling data.
The guide begins with general considerations about CPU throttling, continues with an explanation of how Dynatrace processes and displays CPU throttling data, and finally addresses specific scenarios and questions that arise when using CPU throttling data.
A container is considered CPU throttled if it requires more CPU resources than it is granted. At a more technical level, a container is considered CPU throttled if it is interrupted within a scheduling period even though it would still be able to run.
Suppose you have a container in an isolated environment with no CPU limits. Assume further that the process in the container has the following CPU usage behavior over time.
Each small box represents a time slice of 10 ms. If a box is green, the process in the container is running and needs CPU; if a box is gray, the process in the container is waiting for IO (storage, network, user input, and so on) and therefore does not need CPU. So there are phases in which the container needs CPU (green), interrupted by phases in which it does not need CPU (gray).
Now assume that the same container is running in a real-world environment and a CPU limit of 400 millicores has been set. A CPU limit of 400 millicores means that the process in the container is not allowed to use more than 400 millicores. With this CPU limit the following running behavior would result.
Every small red box here means that the process in the container could actually be run (see first diagram), but was throttled (interrupted) due to the limit. Throttling is generally enforced in individual 100ms scheduling periods.
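Concretely, using the numbers from this example: with the default 100 ms CFS scheduling period, a 400 millicore limit corresponds to a quota of 40 ms of CPU time per period (0.4 cores × 100 ms). Once the container has used up its 40 ms within a period, it is paused (the red boxes) until the next period begins.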
This is what throttling means. A running container is interrupted due to resource limits. This extends its original runtime.
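For reference, the 400 millicore limit from this example would typically be declared in the container's resources section of the pod specification. The following is only a minimal sketch with placeholder names, not a recommendation for your workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # placeholder name
spec:
  containers:
    - name: example-container  # placeholder name
      image: example-image     # placeholder image
      resources:
        requests:
          cpu: "200m"          # scheduling guarantee: 0.2 cores reserved on the node
        limits:
          cpu: "400m"          # hard cap: enforced via CFS throttling in 100 ms periods
```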
CPU throttling can happen for a number of reasons, e.g. because the container's CPU usage is close to its CPU limit, because the node has too few allocatable CPU resources for the pods running on it, or because of short CPU bursts within individual scheduling periods.
In a Kubernetes cluster in which a number of pods are supposed to make the best possible use of the overall available resources, a certain amount of throttling is normal and expected. To make the best use of cluster resources, it can even make sense to run a low-priority batch job, where response time is not important, with higher throttling.
Excessive throttling can become problematic when a process has to deliver short response times. In this case, care should be taken to ensure that the CPU throttling is not too high.
General advice: CPU throttling occurs when not enough CPU resources are available; at the same time, be careful not to over-provision workloads and end up wasting large amounts of resources. For further information, see 'Optimize resource utilization of Kubernetes clusters with SLOs'.
There is a difference regarding detail level of CPU usage and throttling data between Dynatrace Classic and the new Dynatrace platform. In Dynatrace Classic, CPU usage and throttling data is only available at workload level. On the new Dynatrace platform, this data is available at workload, pod and container level.
Kubernetes exposes two different throttling metrics via its Prometheus cAdvisor metrics.
| Metric | Kubernetes cAdvisor Metric Key | Description |
| --- | --- | --- |
| throttled_periods_total | container_cpu_cfs_throttled_periods_total | Measures CPU throttling in periods. The value is increased by one for each scheduling period in which the container is actually throttled. |
| throttled_seconds_total | container_cpu_cfs_throttled_seconds_total | Measures CPU throttling as a duration in seconds. The value is increased by the time the container was actually throttled. |
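To make the relationship between the two metrics concrete (the numbers are purely illustrative): if a container is throttled in 15 of the roughly 600 scheduling periods of one minute, for a combined throttled time of 0.3 seconds, then throttled_periods_total increases by 15 and throttled_seconds_total by 0.3 during that minute.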
Dynatrace provides the following Kubernetes CPU metrics. In order to make all these metrics easy to combine and compare, Dynatrace stores them with the unit 'core' / 'millicore'.
| Metric | Dynatrace Classic Metric Key | Dynatrace Platform Metric Key | Description |
| --- | --- | --- | --- |
| cpu_usage | builtin:kubernetes.workload.cpu_usage | dt.kubernetes.container.cpu_usage | Measures the total CPU consumed (user usage + system usage) by a container, in millicores. |
| cpu_throttled | builtin:kubernetes.workload.cpu_throttled | dt.kubernetes.container.cpu_throttled | Measures the total CPU throttling of a container, in millicores. This metric is based on the throttled_seconds_total metric mentioned above. |
| requests_cpu | builtin:kubernetes.workload.requests_cpu | dt.kubernetes.container.requests_cpu | Measures the CPU requests of a container, in millicores. |
| limits_cpu | builtin:kubernetes.workload.limits_cpu | dt.kubernetes.container.limits_cpu | Measures the CPU limits of a container, in millicores. |
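As an illustration of how the millicore value relates to the underlying counter (the numbers and the one-minute interval are hypothetical): if throttled_seconds_total grows by 3 seconds within a one-minute interval, this corresponds to 3 s / 60 s = 0.05 cores, so roughly 50 millicores of CPU throttling would be reported for that minute.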
In the Dynatrace Kubernetes Classic UI, CPU throttling can be analyzed on workload level in the 'Resources analysis' section of the details screen.
In the Dynatrace Kubernetes App, CPU throttling can be analyzed on workload, pod or container level in the 'Utilization' section of the details screen.
If a container that requires fast request/response times is experiencing excessive CPU throttling, the first thing to check is whether the container CPU usage is close to the container CPU limit.
If you are already using the new Dynatrace platform, this data is available on container level. Otherwise, if you are still using Dynatrace Classic, this data is only available on workload level. In this case, it is difficult to find out exactly which container is affected.
If possible, always try to break down the throttling analysis to the container level. A throttling analysis exclusively at the workload level does not reveal the problematic container.
If the CPU usage is close to the limit, the container limit should be increased. For more information, see 'Resource Management for Pods and Containers'. If the CPU usage is not close to the limit, see the following points for further possible reasons.
If the limit of a container is significantly higher than its usage but the container's CPU throttling is still high, this may be because the node has too few allocatable CPU resources for the number of pods running on it. This information can be found in the Dynatrace UI in the node details by checking the usage / allocatable / limits metrics. This data is available in Dynatrace Classic as well as on the new Dynatrace platform.
If the CPU usage is close to the allocatable CPUs, you have the following options.
Recommended measures:
Further measures:
Even if the container limit is apparently set high enough and the node has enough allocatable CPU resources, throttling can still occur. The reason for this may lie in a technical detail of the operating system. In the end, it is a matter of scale (or, more precisely, a matter of metric timeframe vs. scheduling period).
The Dynatrace CPU usage metric is a value averaged over one minute whereas the actual CPU throttling is enforced in 100ms periods. The CPU usage metric available in Dynatrace is determined once per minute and represents the average usage in that minute. The CPU throttling itself is enforced by the operating system and generally works with a 100ms scheduling period.
Consider the throttling example from above.
Although throttling occurs in some of the scheduling periods, the usage is smaller than the limit over the entire period on average. However, the throttling is not decided on the average, but in the small scheduling periods. Even if the average CPU usage of a container is below the CPU limit over a whole minute, it is still possible that the usage would exceed the limit in one of the many small (100ms) throttling periods and the container would therefore be throttled in these periods.
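As a hypothetical illustration: with a 400 millicore limit, a container that wants 80 ms of CPU within a single 100 ms period only gets its 40 ms quota and is throttled in that period; yet if it is otherwise idle for the rest of the minute, those 80 ms of work amount to an average usage of only about 1.3 millicores over that minute, far below the limit.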
This can happen, for example, when the container has short CPU bursts; high CPU throttling while the limit is not reached on average may therefore indicate short, high CPU bursts. In this case, you could further increase the CPU limit of the container (or remove it completely) so that the limit is not exceeded so easily within individual scheduling periods. In addition, increasing the CPU requests in a healthy way can also be helpful.
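As a rough sketch of what such an adjustment could look like (all values are placeholders, not recommendations for your workload), the container's resources section might be changed as follows:

```yaml
# Rough sketch (placeholder values): a realistic CPU request combined with a
# generously raised limit so that short bursts are not throttled.
resources:
  requests:
    cpu: "400m"    # what the container typically needs; used for scheduling
  limits:
    cpu: "1000m"   # headroom for short bursts; omitting the CPU limit entirely
                   # removes CFS throttling for this container altogether
```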
If you have any further questions or encounter any issues not listed above, please feel free to contact our support team.