on 15 Jul 2024 02:20 PM
This is a troubleshooting guide for using the Kubernetes 'High CPU throttling' alert in Dynatrace. It is intended for customers who have questions or problems using the CPU throttling alert.
The guide provides answers to the questions
The guide begins with general considerations about CPU throttling, continues with an explanation of the Dynatrace 'High CPU Throttling' alert, and finally addresses specific scenarios and questions when using the alert.
A container is considered CPU throttled if it requires more CPU resources than it is granted. More technically, a container is CPU throttled if it is paused during a scheduling period even though it is still able to run.
More detailed information on CPU throttling can be found in the guide 'Troubleshooting Kubernetes CPU Throttling Problems in Dynatrace'.
Suppose you have a container with a CPU limit of 400 millicore. Further assume that the container has the following CPU behavior over time.
Each small box represents a 10 ms time slice. A green box means that the process in the container is running and requires CPU resources. A red box means that the process could run but is throttled due to the CPU limit. A gray box means that the process is waiting for I/O (storage, network, user input, and so on) and therefore does not require CPU. Throttling is generally enforced per 100 ms scheduling period.
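As a rough sketch of the arithmetic behind this example, a CPU limit can be read as a CPU time budget per scheduling period. The helper name cpu_quota_ms is hypothetical, and the sketch assumes the 100 ms scheduling period mentioned above and that 1000 millicores correspond to one full CPU.

```python
def cpu_quota_ms(limit_millicores: int, period_ms: int = 100) -> float:
    """CPU time in milliseconds a container may use per scheduling period.

    Assumption: the limit is enforced as a quota per period, where
    1000 millicores correspond to one full CPU.
    """
    return limit_millicores * period_ms / 1000


# A 400 millicore limit allows 40 ms of CPU time per 100 ms period,
# which is four of the ten 10 ms boxes in the example.
print(cpu_quota_ms(400))
```

Once the container has used up this budget within a period, any further time in which it could run shows up as throttling (the red boxes).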
Kubernetes, or more precisely the operating system, exposes two CPU throttling metrics: 'throttled periods' and 'throttled seconds'. See "Prometheus cAdvisor metrics" for further details. These are absolute values with no reference to the actual usage, so in this form they are not very meaningful. A throttling / usage ratio is more useful here.
Because there are two Kubernetes throttling metrics, there are two possible definitions of a throttling / usage ratio.
This ratio is based on the scheduling periods of the operating system.
ratio-by-periods = container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
The metric throttled_periods_total increases by one whenever a container was interrupted during a period although it was able to run. The metric periods_total increases by one every period as long as a container is running. The higher throttled_periods is, the higher the ratio. This ratio is usually between 0 and 100%.
Using the throttling example above, let us look at the first scheduling period. container_cpu_cfs_throttled_periods_total would be 1, as throttling occurred in this period. container_cpu_cfs_periods_total would also be 1. So ratio-by-periods = container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total = 1 / 1 = 1 = 100%.
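The calculation above can be sketched in a few lines. This is a minimal illustration, not how any particular tool implements it. The function name and the idea of passing counter increases (deltas) over a time window are assumptions; returning 0 when no periods elapsed is a choice made for this sketch.

```python
def ratio_by_periods(throttled_periods_delta: int, periods_delta: int) -> float:
    """Share of scheduling periods in which the container was throttled.

    Both arguments are increases of the respective counters over the
    same time window (an assumption of this sketch).
    """
    if periods_delta == 0:
        return 0.0  # no scheduling periods elapsed, nothing to report
    return throttled_periods_delta / periods_delta


# First scheduling period of the example: 1 throttled period out of 1.
print(f"{ratio_by_periods(1, 1):.0%}")
```

Because a period counts as throttled no matter how long the throttling lasted, this ratio cannot distinguish 10 ms of throttling from 60 ms.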
The Grafana dashboard 'CPU Usage vs Throttling Percent' and the Prometheus 'CPUThrottlingHigh Alert' use this throttling ratio definition.
This ratio is based on the actual scheduling time in milliseconds.
ratio-by-seconds = container_cpu_cfs_throttled_seconds_total / container_cpu_usage_seconds_total
The metric throttled_seconds_total increases when a container is not running although it would be able to run. The metric usage_seconds_total increases when a container is running. Both metrics increase by the respective amount of elapsed time. The higher throttled_seconds is, the higher the ratio. This ratio can be higher than 100%!
Using the throttling example above, let us look at the first scheduling period. container_cpu_cfs_throttled_seconds_total would be 0.06, as throttling occurred for 60 ms. container_cpu_usage_seconds_total would be 0.04, as the container actually runs for 40 ms. So ratio-by-seconds = container_cpu_cfs_throttled_seconds_total / container_cpu_usage_seconds_total = 0.06 / 0.04 = 1.5 = 150%.
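The same example can be sketched in code. This is an illustration only. The function name and the use of counter increases over a window are assumptions, and returning 0 when there was no usage is a choice made for this sketch.

```python
def ratio_by_seconds(throttled_seconds_delta: float, usage_seconds_delta: float) -> float:
    """Throttled time relative to the CPU time actually used.

    Both arguments are increases of the respective counters over the
    same time window (an assumption of this sketch).
    """
    if usage_seconds_delta == 0:
        return 0.0  # no CPU usage in the window, ratio is undefined
    return throttled_seconds_delta / usage_seconds_delta


# First scheduling period of the example: 60 ms throttled, 40 ms running.
ratio = ratio_by_seconds(0.06, 0.04)
print(f"{ratio:.0%}")
```

The ratio exceeds 100% whenever the container was throttled for longer than it was running.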
Dynatrace uses the 'Throttling Ratio by Seconds' definition.
The advantage is that this ratio is more accurate than the ratio by periods because it uses exact time values. It is also more intuitive and meaningful: because it relates throttled time to the time actually used, it is easier to find thresholds that fit a variety of workloads. Values of 150% or more may seem confusing at first glance, but on closer inspection they are correct.
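To illustrate the difference between the two definitions, the following sketch compares two hypothetical containers over ten scheduling periods. The data is made up for illustration: both containers run for 40 ms in every period, but one is throttled for 10 ms and the other for 60 ms. The Period class and both function names are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Period:
    run_ms: int        # CPU time actually used in this 100 ms period
    throttled_ms: int  # time the container could run but was throttled


def ratio_by_periods(periods):
    """Share of periods in which any throttling occurred."""
    throttled = sum(1 for p in periods if p.throttled_ms > 0)
    return throttled / len(periods)


def ratio_by_seconds(periods):
    """Total throttled time relative to total CPU time used."""
    usage = sum(p.run_ms for p in periods)
    throttled = sum(p.throttled_ms for p in periods)
    return throttled / usage


# Hypothetical data: ten identical periods per container.
mild = [Period(run_ms=40, throttled_ms=10)] * 10
severe = [Period(run_ms=40, throttled_ms=60)] * 10

for name, periods in (("mild", mild), ("severe", severe)):
    print(name, f"{ratio_by_periods(periods):.0%}", f"{ratio_by_seconds(periods):.0%}")
# mild 100% 25%
# severe 100% 150%
```

Both containers show 100% by periods, while the ratio by seconds separates the mildly throttled container (25%) from the heavily throttled one (150%).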
For the Kubernetes anomaly detection 'High CPU throttling' alert, Dynatrace uses the 'Throttling Ratio by Seconds' definition as described above. It is defined as dynatrace-throttling-ratio = builtin:kubernetes.workload.cpu_throttled / builtin:kubernetes.workload.cpu_usage. For the exact definition, see "High CPU throttling" in 'Workload alerts'. This value can be higher than 100%! See section "Throttling Ratio by Seconds" above.
To check the Dynatrace throttling ratio, execute the metric expression in 'Data Explorer' in Dynatrace Classic or in 'Notebook' in the new Dynatrace Platform.
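If you prefer to check the values from a script, the same expression can in principle be passed as a metric selector to the Metrics API v2 query endpoint. The following is a minimal sketch that only builds the request URL. The environment URL is a placeholder, the time frame and resolution are arbitrary choices, and the plain expression without any aggregation or filtering is an assumption that may not match exactly what the alert evaluates.

```python
from urllib.parse import urlencode

# Expression as given in this guide; no aggregation or filter is applied here.
THROTTLING_RATIO_EXPRESSION = (
    "builtin:kubernetes.workload.cpu_throttled"
    " / builtin:kubernetes.workload.cpu_usage"
)


def build_query_url(environment_url: str,
                    time_from: str = "now-2h",
                    resolution: str = "1m") -> str:
    """Build a Metrics API v2 query URL for the throttling ratio.

    environment_url is a placeholder for your own environment address.
    """
    params = urlencode({
        "metricSelector": THROTTLING_RATIO_EXPRESSION,
        "from": time_from,
        "resolution": resolution,
    })
    return f"{environment_url.rstrip('/')}/api/v2/metrics/query?{params}"


# Placeholder address for illustration only.
print(build_query_url("https://example.invalid"))
```

The request itself additionally needs an API token with permission to read metrics, sent in the Authorization header.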
The alert can be configured in the 'Anomaly detection' settings at tenant level, cluster level, or namespace level. These settings are structured hierarchically: a setting at the more specific level overrides the value of the more general level. Specific settings at workload level are not supported!
The settings for the tenant level can be found in 'Settings' --> 'Anomaly detection' --> 'Kubernetes Workload'. The settings for the cluster and namespace level can be found in the cluster settings at 'Kubernetes settings' --> 'Anomaly detection' --> 'Workload'.
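The override behavior of the three levels can be pictured as follows. This is a conceptual sketch only, not Dynatrace code. The function name is hypothetical, and None stands for "nothing configured at this level".

```python
from typing import Optional


def effective_threshold(tenant: float,
                        cluster: Optional[float] = None,
                        namespace: Optional[float] = None) -> float:
    """Return the threshold that applies to workloads in a namespace.

    The most specific configured level wins: namespace, then cluster,
    then tenant. None means the level has no setting of its own.
    """
    for value in (namespace, cluster, tenant):
        if value is not None:
            return value
    raise ValueError("tenant level must always provide a value")


print(effective_threshold(tenant=50))                             # 50
print(effective_threshold(tenant=50, cluster=80))                 # 80
print(effective_threshold(tenant=50, cluster=80, namespace=150))  # 150
```

The same logic applies to the other settings described below, such as the toggle and the time periods.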
Enabling / Disabling: Using the toggle 'Detect high CPU throttling' (default: disabled), the alarm can be switched on or off for the entire tenant, for a specific cluster or for a specific namespace.
Threshold: The threshold for triggering the alert can be defined using the input field 'CPU throttling level is above' (default: 50%). As explained in the section 'Throttling Ratio by Seconds', this value can be higher than 100%.
Trigger Period: The value in the field 'of CPU usage for at least' (default: 10 minutes) specifies how long the threshold defined above must be exceeded in the total observation period defined below to trigger this alarm.
Observation Period: Finally, the value in the field 'within the last' (default: 15 minutes) indicates the total observation period.
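How the threshold, trigger period, and observation period interact can be sketched as a sliding window check. This is an illustration under stated assumptions, not the actual Dynatrace implementation: it assumes one ratio sample per minute and that the violating minutes do not need to be consecutive. The function name is hypothetical.

```python
def should_alert(ratios_percent,
                 threshold_percent: float = 50.0,
                 trigger_minutes: int = 10,
                 observation_minutes: int = 15) -> bool:
    """Check whether the alert condition holds for the latest window.

    ratios_percent: one throttling ratio sample per minute, oldest
    first (an assumption of this sketch).
    """
    window = ratios_percent[-observation_minutes:]
    violating = sum(1 for ratio in window if ratio > threshold_percent)
    return violating >= trigger_minutes


# Hypothetical data: 10 of the last 15 minutes are above 50%.
samples = [20, 30, 20, 10, 40] + [150] * 10
print(should_alert(samples))                         # True
# With a threshold of 200%, none of the samples violate.
print(should_alert(samples, threshold_percent=200))  # False
```

Raising the threshold or the trigger period makes the alert less sensitive, while a longer observation period gives short bursts of throttling more time to add up.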
Check the actual Dynatrace throttling ratio values for your desired workload. See section "How the Dynatrace 'High CPU throttling' Alert is defined".
Does the metric expression have data points?
If not: Is there any data for the builtin:kubernetes.workload.cpu_usage metric on this cluster at all?
If yes:
The fact that CPU throttling occurs even though usage is below the limit does not necessarily mean it is a fault.
For bigger clusters or clusters with generally higher throttling, over-alerting can occur.
The alarm settings are only default values that make sense for some typical Kubernetes use cases, but not for all of them. If the alarm is triggered too often for your specific use case, you can adjust the settings to your needs down to the namespace level.
Individually configuring an alarm for a distinct workload is not supported. The smallest granularity for configuration is the namespace the workload is running in.
When you are on the workload details page and click 'Anomaly detection settings' in the menu, you are actually forwarded to the namespace level settings.
Changes there apply to all workloads in that namespace. Doing the same from a different workload in that namespace forwards you to the same settings page.
The calculation of the throttling ratio as used by Dynatrace can produce values over 100%. This is normal and intended. See section "Throttling Ratio by Seconds" above.
In the alert settings it is possible to specify a threshold above 100%. Increase the threshold accordingly (see section "How the Dynatrace 'High CPU throttling' Alert can be configured").
If you have any further questions or encounter any issues not listed above, please feel free to contact our support team.