on 02 Jun 2025 01:23 PM - edited on 03 Jun 2025 09:09 AM by HannahM
To increase the reliability of our 'High CPU Throttling' alert in Kubernetes, we are changing how the alert works with the release of Dynatrace v1.316.0. In Kubernetes environments, CPU throttling is a common issue that can impact application performance and reliability. Detecting and alerting on high CPU throttling is essential, but choosing the right signal is just as important to avoid false positives and alert fatigue.
In this post, I’ll explain why we chose to use the CPU Throttling / CPU Limits ratio as the primary signal for our High CPU Throttling Alert, instead of CPU Throttling / CPU Usage as we did in the past. More information on how the alert worked before v1.316.0 can be found in the blog post Troubleshooting Kubernetes 'High CPU Throttling' Alert Problems in Dynatrace.
In the past, we computed the underlying signal using Throttling / Usage. The idea here: if a container is being throttled heavily relative to how much CPU it's using, that could indicate a problem.
However, in practice, especially in idle or low-traffic environments, this ratio proved to be unstable. Here's why:
Low usage values can cause the ratio to spike dramatically, even if the actual throttling is minimal. These spikes are often not indicative of real performance issues, but rather artifacts of low activity. As a result, we saw a high number of false positives, particularly in staging or development environments where workloads are sporadic. This instability made it difficult to trust the alerts and led to unnecessary investigations.
To address this, we shifted to using the Throttling / Limits ratio. This approach compares the amount of throttled CPU time to the CPU limit set for the container. Here’s why this signal is more reliable:
CPU limits are static, so the denominator in the ratio doesn’t fluctuate wildly. This makes the signal much more stable, even in idle environments. It provides a clearer picture of how much of the allocated CPU is being throttled, regardless of how much is actually being used. By using this ratio, we significantly reduced false positives and improved the signal-to-noise ratio of our alerts.
Let’s say a container has a CPU limit of 1 core. If it experiences 0.2 cores' worth of throttling, the ratio is:
Throttling / Limits = 0.2 / 1 = 20%
This is a straightforward and stable signal. Compare that to a scenario where usage is only 0.05 cores:
Throttling / Usage = 0.2 / 0.05 = 400%
That 400% might look alarming, but in reality, the container is barely doing anything. This is exactly the kind of misleading signal we are now avoiding.
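The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the two ratios, not how Dynatrace computes the alert internally. The function name `throttling_ratio` and the extra usage samples (0.8, 0.4, 0.1) are assumptions made for the example; only the 0.2 cores of throttling, the 1-core limit, and the 0.05-core usage come from the scenario above.

```python
def throttling_ratio(throttled_cores: float, denominator_cores: float) -> float:
    """Return throttled CPU as a fraction of the given denominator (limit or usage)."""
    if denominator_cores <= 0:
        raise ValueError("denominator must be positive")
    return throttled_cores / denominator_cores


# Values from the example above: 0.2 cores of throttling, CPU limit of 1 core.
throttled = 0.2
limit = 1.0

# Throttling / Limits: the denominator is static, so the signal stays put.
limits_ratio = throttling_ratio(throttled, limit)

# Throttling / Usage: hypothetical usage samples, ending at the 0.05 cores
# from the example, to show how the ratio grows as usage drops.
usage_samples = [0.8, 0.4, 0.1, 0.05]
usage_ratios = [throttling_ratio(throttled, usage) for usage in usage_samples]

for usage, usage_ratio in zip(usage_samples, usage_ratios):
    print(
        f"usage={usage:.2f} cores  "
        f"throttling/usage={usage_ratio:.0%}  "
        f"throttling/limits={limits_ratio:.0%}"
    )
```

With throttling held constant, the usage-based ratio climbs as usage falls, while the limits-based ratio does not move. That is the behavior that produced false positives in idle environments.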
Choosing the right metric for alerting is about practical reliability. By using the Throttling / Limits ratio, we’ve created a more stable and actionable alert that helps us focus on real issues, not noise.