Troubleshooting Kubernetes 'High CPU Throttling' Alert Problems in Dynatrace

PeVa · ‎15 Jul 2024

Introduction
General Considerations
CPU Throttling
An Example
CPU Throttling Ratio
The Dynatrace 'High CPU Throttling' Alert
How the Dynatrace 'High CPU throttling' Alert is defined
How the Dynatrace 'High CPU throttling' Alert can be configured
Troubleshooting
I have enabled the alarm, but it does not trigger. What can I do?
An alarm is triggered even though the usage does not reach the limit according to the UI. How can th...
The alarm occurs far too often. What can I do?
The alarm does not make sense for some of my workloads. Can I deactivate or change settings for a di...
I have configured the alarm for specific workloads, but the values overwrite each other. Is this a b...
The alarm states that throttling is higher than 100%. How can this happen?
Further Support

Introduction

This is a troubleshooting guide for using the Kubernetes 'High CPU throttling' alert in Dynatrace. It is intended for customers who have questions or problems using the CPU throttling alert.

The guide answers the following questions:

I have enabled the alarm, but it does not trigger. What can I do?
An alarm is triggered even though the usage does not reach the limit according to the UI. How can this happen?
The alarm occurs far too often. What can I do?
The alarm does not make sense for some of my workloads. Can I deactivate or change settings for a distinct workload?
I have configured the alarm for specific workloads, but the values overwrite each other. Is this a bug?
The alarm states that throttling is higher than 100%. How can this happen?

The guide begins with general considerations about CPU throttling, continues with an explanation of the Dynatrace 'High CPU Throttling' alert, and finally addresses specific scenarios and questions when using the alert.

General Considerations

CPU Throttling

A container is considered CPU throttled if it requires more CPU resources than it is granted. At a more technical level, a container is CPU throttled if it is interrupted during a scheduling period even though it is still able to run.

More detailed information on CPU throttling can be found in the guide 'Troubleshooting Kubernetes CPU Throttling Problems in Dynatrace'.

An Example

Suppose you have a container with a CPU limit of 400 millicores. Further assume that the container shows the following CPU behavior over time.

Each small box represents a 10 ms time slice. A green box means that the process in the container is running and requires CPU resources. A red box means that the process could run, but is throttled due to the CPU limit. A gray box means that the process is waiting for I/O (storage, network, user input, and so on) and therefore does not require CPU. Throttling is generally enforced in individual 100 ms scheduling periods.
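The 400 millicore limit of this example translates into a CPU time budget per scheduling period. The following is a minimal Python sketch of that arithmetic, assuming the 100 ms scheduling period mentioned above. The function name is hypothetical and only serves as an illustration.

```python
def quota_ms_per_period(limit_millicores: int, period_ms: int = 100) -> float:
    """CPU time in ms that a container may consume in one scheduling period."""
    # 1000 millicores correspond to one full CPU, i.e. period_ms of CPU time
    # per period on a single core.
    return limit_millicores * period_ms / 1000


# The example container: 400 millicores -> 40 ms of CPU time per 100 ms period.
print(f"{quota_ms_per_period(400):.0f} ms per period")
```

Once this budget is used up, a container that still wants to run is throttled for the rest of the period.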

CPU Throttling Ratio

Kubernetes, or more precisely the operating system, provides two CPU throttling metrics: 'throttled periods' and 'throttled seconds'. See "Prometheus cAdvisor metrics" for further details. These are absolute values without any reference to the actual usage, and in this form they are not very meaningful. A ratio of throttling / usage is more useful here.

Because there are two Kubernetes throttling metrics, there are two possible definitions of a throttling / usage ratio.

Throttling Ratio by Periods

This ratio is based on the scheduling periods of the operating system.

ratio-by-periods = container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total

The metric throttled_periods_total increases by one whenever a container was interrupted during a period although it was able to run. The metric periods_total increases by one every period as long as a container is running. The higher throttled_periods is, the higher the ratio. This ratio is usually between 0 and 100%.

Using the throttling example above, let us look at the first scheduling period.

container_cpu_cfs_throttled_periods_total would be 1 as throttling occurred in this period. container_cpu_cfs_periods_total would also be 1. So ratio-by-periods = container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total = 1 / 1 = 1 = 100%.
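The calculation above can be sketched in Python. This is only an illustration: the slot encoding (R = running, T = throttled, I = idle), the order of the slots, and the second period are assumptions made for the example. Only the 40 ms running / 60 ms throttled split of the first period comes from the text.

```python
def ratio_by_periods(periods: list[str]) -> float:
    """Each period is a string of ten 10 ms slots: R = running, T = throttled, I = idle."""
    # A period is counted if the container was runnable in it at all.
    counted = [p for p in periods if "R" in p or "T" in p]
    # A counted period is a throttled period if throttling occurred at least once.
    throttled = [p for p in counted if "T" in p]
    if not counted:
        return 0.0
    return len(throttled) / len(counted)


first_period = "RRRRTTTTTT"          # 40 ms running, 60 ms throttled
hypothetical_second = "RRIIIIIIII"   # assumed: 20 ms running, then waiting for I/O

print(ratio_by_periods([first_period]))                       # 1.0
print(ratio_by_periods([first_period, hypothetical_second]))  # 0.5
```

Note that the ratio does not change with the amount of throttling inside a period: a period with 10 ms of throttling counts the same as one with 60 ms.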

The Grafana dashboard 'CPU Usage vs Throttling Percent' and the Prometheus 'CPUThrottlingHigh' alert use this throttling ratio definition.

Throttling Ratio by Seconds

This ratio is based on the actual scheduling time in milliseconds.

ratio-by-seconds = container_cpu_cfs_throttled_seconds_total / container_cpu_usage_seconds_total

The metric throttled_seconds_total increases when a container is not running although it would be able to run. The metric usage_seconds_total increases when a container is running. Both metrics increase by the respective number of milliseconds. The higher throttled_seconds is, the higher the ratio. This ratio can be higher than 100%!

Using the throttling example above, let us look at the first scheduling period.

container_cpu_cfs_throttled_seconds_total would be 0.06 as throttling occurred for 60ms. container_cpu_usage_seconds_total would be 0.04 as the container actually runs for 40ms. So ratio-by-seconds = container_cpu_cfs_throttled_seconds_total / container_cpu_usage_seconds_total = 0.06 / 0.04 = 1.5 = 150%.
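Both metrics are cumulative counters, so in practice the ratio for a time interval is computed from the difference between two readings. A minimal sketch of this, assuming both counters start at 0 before the first period of the example. The function name and the handling of zero usage are assumptions for illustration.

```python
from typing import Optional


def ratio_by_seconds(throttled_start: float, throttled_end: float,
                     usage_start: float, usage_end: float) -> Optional[float]:
    """Throttling ratio over an interval, from two readings of each counter."""
    throttled = throttled_end - throttled_start
    usage = usage_end - usage_start
    if usage <= 0:
        # No CPU time consumed in the interval: the ratio is undefined.
        return None
    return throttled / usage


# First scheduling period of the example: 60 ms throttled, 40 ms running.
ratio = ratio_by_seconds(0.0, 0.06, 0.0, 0.04)
print(f"{ratio:.0%}")
```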

Dynatrace uses the 'Throttling Ratio by Seconds' definition.

The advantage is that this ratio is more accurate than the ratio by periods because it uses exact time values. It also seems more intuitive and meaningful. Relating throttling to usage makes it easier to find thresholds that fit a variety of workloads. Values of 150% or more may seem confusing at first glance, but on closer inspection they are correct.

The Dynatrace 'High CPU Throttling' Alert

How the Dynatrace 'High CPU throttling' Alert is defined

For the Kubernetes anomaly detection 'High CPU throttling' alert, Dynatrace uses the 'Throttling Ratio by Seconds' definition as described above. It is defined as dynatrace-throttling-ratio = builtin:kubernetes.workload.cpu_throttled / builtin:kubernetes.workload.cpu_usage. For the exact definition, see "High CPU throttling" in 'Workload alerts'. This value can be higher than 100%! See section "Throttling Ratio by Seconds" above.
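Conceptually, this definition is a pointwise division of two time series. The Python sketch below illustrates it with made-up data points. How Dynatrace treats gaps or zero usage internally is not covered by this guide, so returning None for such points is an assumption of this example.

```python
from typing import Optional, Sequence


def throttling_ratio_series(throttled: Sequence[Optional[float]],
                            usage: Sequence[Optional[float]]) -> list[Optional[float]]:
    """Divide cpu_throttled by cpu_usage point by point (both in the same unit)."""
    ratios: list[Optional[float]] = []
    for t, u in zip(throttled, usage):
        if t is None or u is None or u == 0:
            # Assumption: a missing value or zero usage yields no data point.
            ratios.append(None)
        else:
            ratios.append(t / u)
    return ratios


# Hypothetical data points for one workload.
cpu_throttled = [60.0, None, 10.0, 30.0]
cpu_usage = [40.0, 20.0, 0.0, 120.0]
print(throttling_ratio_series(cpu_throttled, cpu_usage))  # [1.5, None, None, 0.25]
```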

To check the Dynatrace throttling ratio, execute the metric expression in 'Data Explorer' in Dynatrace Classic or in a 'Notebook' in the new Dynatrace Platform.

How the Dynatrace 'High CPU throttling' Alert can be configured

The alert can be configured in the 'Anomaly detection' settings at the tenant, cluster, or namespace level. These settings are structured hierarchically: the more specific level overrides the value of the more general level. Specific settings at the workload level are not supported!
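The override order described above can be sketched as a simple lookup. This is an illustration only, not the Dynatrace implementation: the function and parameter names are hypothetical, and a single threshold value stands in for the whole set of settings.

```python
from typing import Optional


def effective_threshold(tenant: float,
                        cluster: Optional[float] = None,
                        namespace: Optional[float] = None) -> float:
    """Return the threshold from the most specific level that defines one."""
    # Namespace beats cluster, cluster beats tenant.
    for value in (namespace, cluster):
        if value is not None:
            return value
    return tenant


print(effective_threshold(tenant=50))                             # 50
print(effective_threshold(tenant=50, cluster=80))                 # 80
print(effective_threshold(tenant=50, cluster=80, namespace=150))  # 150
```

There is deliberately no workload parameter: the namespace is the most specific level available.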

The settings for the tenant level can be found in 'Settings' --> 'Anomaly detection' --> 'Kubernetes Workload'. The settings for the cluster and namespace level can be found in the cluster settings at 'Kubernetes settings' --> 'Anomaly detection' --> 'Workload'.

Enabling / Disabling: Using the toggle 'Detect high CPU throttling' (default: disabled), the alarm can be switched on or off for the entire tenant, for a specific cluster or for a specific namespace.

Threshold: The threshold for triggering the alert can be defined using the input field 'CPU throttling level is above' (default: 50%). As explained in the section 'Throttling Ratio by Seconds', this value can be higher than 100%.

Trigger Period: The value in the field 'of CPU usage for at least' (default: 10 minutes) specifies for how long, within the observation period defined below, the threshold must be exceeded to trigger the alarm.

Observation Period: Finally, the value in the field 'within the last' (default: 15 minutes) indicates the total observation period.
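The interplay of threshold, trigger period, and observation period can be sketched as follows. This is a simplified model, not the actual Dynatrace evaluation logic. It assumes one data point per minute and that the violating minutes do not have to be consecutive. The function name is hypothetical.

```python
from typing import Optional, Sequence


def alert_condition_met(ratios_pct: Sequence[Optional[float]],
                        threshold_pct: float = 50.0,
                        trigger_minutes: int = 10,
                        observation_minutes: int = 15) -> bool:
    """Check the most recent observation period of a per-minute ratio series."""
    window = ratios_pct[-observation_minutes:]
    # Count the minutes in which the throttling ratio exceeded the threshold.
    violating = sum(1 for r in window if r is not None and r > threshold_pct)
    return violating >= trigger_minutes


# Hypothetical series: 10 minutes at 80%, then 5 minutes at 20%.
print(alert_condition_met([80.0] * 10 + [20.0] * 5))  # True
# Only 9 violating minutes in the last 15: condition not met.
print(alert_condition_met([80.0] * 9 + [20.0] * 6))   # False
```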

Troubleshooting

I have enabled the alarm, but it does not trigger. What can I do?

Check the actual Dynatrace throttling ratio values for your desired workload. See section "How the Dynatrace 'High CPU throttling' Alert is defined".

Does the metric expression have data points?

If not:

Check if 'Monitor workload and node resource metrics' is enabled in the Kubernetes cluster monitoring settings.
Check whether all required rights are available on the Kubernetes cluster by clicking on 'Test monitoring features'.
Are there any data points available for the builtin:kubernetes.workload.cpu_usage metric on this cluster at all?

If yes:

Check how many data points are actually above the 'Threshold'. Also consider the 'Trigger/Observation Period'. Cross-check these observations with the defined alert settings.
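The checks above can be expressed as a small decision procedure. The sketch below works on throttling ratio values that you have exported yourself, for example from Data Explorer. It assumes one value per minute, and all names and messages are hypothetical.

```python
from typing import Optional, Sequence


def diagnose(ratios_pct: Sequence[Optional[float]],
             threshold_pct: float = 50.0,
             trigger_minutes: int = 10,
             observation_minutes: int = 15) -> str:
    """Explain why a per-minute ratio series would or would not trigger the alarm."""
    values = [r for r in ratios_pct if r is not None]
    if not values:
        return "no data points: check monitoring settings and permissions"
    if max(values) <= threshold_pct:
        return "data points exist, but none is above the threshold"
    # Find the observation window with the most minutes above the threshold.
    best = 0
    for end in range(1, len(ratios_pct) + 1):
        window = ratios_pct[max(0, end - observation_minutes):end]
        count = sum(1 for r in window if r is not None and r > threshold_pct)
        best = max(best, count)
    if best < trigger_minutes:
        return f"above threshold for at most {best} minutes, {trigger_minutes} needed"
    return "condition met: the alarm should trigger"


print(diagnose([None] * 30))
print(diagnose([20.0] * 30))
print(diagnose([20.0] * 10 + [80.0] * 5 + [20.0] * 15))
print(diagnose([80.0] * 30))
```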

An alarm is triggered even though the usage does not reach the limit according to the UI. How can this happen?

The fact that CPU throttling occurs even though usage is below the limit does not necessarily mean it is a fault.

First, you should look at the guide "Troubleshooting Kubernetes CPU Throttling Problems in Dynatrace". It answers questions such as 'Why is my container CPU throttled although pod limits are set high enough and node has enough allocatable CPUs?'.
Furthermore, note that the throttling alert is not defined on the ratio usage / limit. It is defined on the ratio throttling / usage! See section "How the Dynatrace 'High CPU throttling' Alert is defined".
- Check the actual Dynatrace throttling ratio values for your desired workload.
- Check how many data points are actually above the 'Threshold'.
- Also consider the 'Trigger/Observation Period'. Cross-check these observations with the defined alert settings.
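The difference between the two ratios can be illustrated with a short burst. The sketch below uses the first period from the example in "An Example" (40 ms running, 60 ms throttled) and assumes, purely for illustration, that the container is idle for the following nine periods. Averaged over that second, usage is far below the limit, yet the throttling ratio is high.

```python
def usage_and_throttling(periods: list[tuple[int, int]],
                         limit_millicores: int = 400,
                         period_ms: int = 100) -> tuple[float, float]:
    """periods holds (running_ms, throttled_ms) per scheduling period."""
    usage_ms = sum(running for running, _ in periods)
    throttled_ms = sum(throttled for _, throttled in periods)
    wall_ms = period_ms * len(periods)
    # Average usage over the whole interval, expressed in millicores.
    avg_usage_millicores = usage_ms * 1000 / wall_ms
    usage_vs_limit = avg_usage_millicores / limit_millicores
    throttling_vs_usage = throttled_ms / usage_ms
    return usage_vs_limit, throttling_vs_usage


# One busy period followed by nine assumed idle periods (one second in total).
burst = [(40, 60)] + [(0, 0)] * 9
usage_ratio, throttling_ratio = usage_and_throttling(burst)
print(f"usage / limit:      {usage_ratio:.0%}")
print(f"throttling / usage: {throttling_ratio:.0%}")
```

Under these assumptions the container uses only about a tenth of its limit on average, while the throttling ratio is 150%.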

The alarm occurs far too often. What can I do?

Over-alerting can happen on larger clusters or on clusters with generally higher throttling.

The default alarm settings make sense for some typical Kubernetes use cases, but not for all of them. If the alarm is triggered too often for your specific use case, you can adjust the settings to your needs down to the namespace level.

Increase the 'Threshold' if you feel it is too low and the throttling ratio is permanently high.
Increase the 'Trigger/Observation Period' if the throttling ratio has frequent spikes that should not trigger the alarm.
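To get a feeling for the effect of such changes before applying them, you can replay different settings against throttling ratio values you have observed. The sketch below does this for a made-up series. It assumes one value per minute and a simplified sliding-window evaluation, which is not necessarily how Dynatrace evaluates the alert internally.

```python
from typing import Sequence


def minutes_condition_met(ratios_pct: Sequence[float],
                          threshold_pct: float,
                          trigger_minutes: int,
                          observation_minutes: int) -> int:
    """Count the minutes at which the alert condition holds for the trailing window."""
    met = 0
    for end in range(observation_minutes, len(ratios_pct) + 1):
        window = ratios_pct[end - observation_minutes:end]
        if sum(r > threshold_pct for r in window) >= trigger_minutes:
            met += 1
    return met


# Hypothetical hour: repeated 10-minute bursts at 80% followed by 5 quiet minutes.
series = ([80.0] * 10 + [20.0] * 5) * 4

print(minutes_condition_met(series, 50.0, 10, 15))   # default settings
print(minutes_condition_met(series, 100.0, 10, 15))  # higher threshold
print(minutes_condition_met(series, 50.0, 12, 15))   # longer trigger period
```

For this series the default settings meet the condition continuously, while either adjustment silences it completely. Real workloads will usually fall somewhere in between.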

The alarm does not make sense for some of my workloads. Can I deactivate or change settings for a distinct workload?

Individually configuring an alarm for a distinct workload is not supported. The smallest granularity for configuration is the namespace the workload is running in.

I have configured the alarm for specific workloads, but the values overwrite each other. Is this a bug?

Individually configuring an alarm for a distinct workload is not supported. When you are at workload details and click 'Anomaly detection settings' from the menu, you are actually forwarded to the namespace level settings.

Changes there apply to all workloads in that namespace. Doing the same from a different workload in that namespace forwards you to the same settings page.

The alarm states that throttling is higher than 100%. How can this happen?

The calculation of the throttling ratio as used by Dynatrace can produce values over 100%. This is normal and intended. See section "Throttling Ratio by Seconds" above.

In the alert settings it is possible to specify a threshold above 100%. Increase the threshold accordingly (see section "How the Dynatrace 'High CPU throttling' Alert can be configured").

Further Support

If you have any further questions or encounter any issues not listed above, please feel free to contact our support team.

Mizső · ‎18 Jul 2024

Hi @PeVa,

Thanks for sharing this very good summary about this topic. I had to explain many times the cpu throttling issue to our customers. In the future it will be easier with this post.

Best regards,

Mizső
