I keep seeing CPU and memory saturation alerts on Kubernetes hosts/nodes that report CPU maxed out at 100%. But when I open the problem in Dynatrace, the node/host actually shows 13% CPU usage. The same thing happens with memory saturation alerts.
So either these are false alerts, or the problems are actually about CPU/memory saturation on individual pods, and there are many pods on each node. Is there a way to get Dynatrace to be more specific about which entity is having the issue?
So I see these problems:
The problem will advise of CPU request saturation at 94% on the node.
When I go into this node, regular CPU usage is normal but the CPU request saturation is really high. I want to know which pods or containers are contributing to this request saturation, and I don't see that level of detail on the host/node page.
A CPU request can be higher than the real CPU usage. The CPU request is the minimum amount of CPU reserved for a pod when it starts. In your alert, the sum of the CPU requests reached 94% of the allocatable CPU on that node. E.g. if your node has 100 allocatable CPU cores, the pods on this host have requested 94 cores in total. It does not mean that they will use them.
I think your problem is a configuration issue: your node is a little overbooked in terms of CPU requests. Maybe there is an autoscaling mechanism in place and some extra pods were started under load.
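To make the request arithmetic concrete, here is a minimal Python sketch of how request saturation relates to allocatable CPU, and how you could rank the pods that contribute most. The pod names, the request values and the 4-core node are made up for illustration, and the helper functions are hypothetical; this is not how Dynatrace computes the metric internally, just the same ratio worked by hand.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    q = quantity.strip()
    if q.endswith("m"):
        # "m" means millicores: 1000m = 1 core
        return int(q[:-1]) / 1000
    return float(q)


def request_saturation(pods, allocatable_cores: float) -> float:
    """Sum of CPU requests as a percentage of the node's allocatable CPU."""
    total = sum(parse_cpu(p["cpu_request"]) for p in pods)
    return 100 * total / allocatable_cores


def top_requesters(pods, n: int = 3):
    """Pods sorted by CPU request, largest first."""
    return sorted(pods, key=lambda p: parse_cpu(p["cpu_request"]), reverse=True)[:n]


# Hypothetical pods scheduled on one node (values are assumptions).
pods = [
    {"name": "search-5c2a", "cpu_request": "1500m"},
    {"name": "checkout-7d9f", "cpu_request": "2"},
    {"name": "metrics-agent", "cpu_request": "250m"},
]
allocatable = 4.0  # allocatable cores on the hypothetical node

saturation = request_saturation(pods, allocatable)
print(f"CPU request saturation: {saturation:.2f}%")
for pod in top_requesters(pods):
    print(pod["name"], parse_cpu(pod["cpu_request"]), "cores requested")
```

With these made-up numbers the pods request 3.75 of 4 cores, so the node is close to fully booked on requests even if actual CPU usage stays low.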
Here is an example of part of a dashboard for a small cluster at node level (requests and limits relative to the allocatable resources):
You can use these metrics to check the CPU requests at namespace level and see which namespaces are the top "consumers" of CPU requests. There is also a built-in default Dynatrace dashboard, Kubernetes namespace resource quotas.
Kubernetes: Resource quota - CPU requests
This metric measures the cpu requests quota. The most detailed level of aggregation is resource quota. The value corresponds to the cpu requests of a resource quota.
Kubernetes: Resource quota - CPU requests used
This metric measures the used cpu requests quota. The most detailed level of aggregation is resource quota. The value corresponds to the used cpu requests of a resource quota.
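As a rough illustration of how the two values relate, the sketch below computes the used share of a CPU requests quota per namespace. It assumes the metrics mirror Kubernetes ResourceQuota objects, whose status carries `hard` and `used` entries under the `requests.cpu` key; the namespace names and numbers are invented, and `quota_utilization` is a hypothetical helper.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1]) / 1000
    return float(q)


def quota_utilization(quota: dict) -> float:
    """Used CPU requests as a percentage of the quota's hard limit."""
    hard = parse_cpu(quota["status"]["hard"]["requests.cpu"])
    used = parse_cpu(quota["status"]["used"]["requests.cpu"])
    return 100 * used / hard


# Hypothetical ResourceQuota status per namespace (values are assumptions).
quotas = {
    "team-a": {"status": {"hard": {"requests.cpu": "8"},
                          "used": {"requests.cpu": "6500m"}}},
    "team-b": {"status": {"hard": {"requests.cpu": "4"},
                          "used": {"requests.cpu": "1"}}},
}

# Rank namespaces by how much of their CPU requests quota is used.
ranking = sorted(quotas, key=lambda ns: quota_utilization(quotas[ns]), reverse=True)
for ns in ranking:
    print(ns, f"{quota_utilization(quotas[ns]):.2f}%")
```

If that assumption holds, a namespace without a ResourceQuota object would have nothing to report for these two metrics.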
I hope it helps.
Thank you, Mizső!
So my takeaway is to see if I can find out more about the CPU limits on the nodes.
I did try to build something with these two metrics, but sadly there is no data at all in my environment (which is surprising, because these use namespace as an entity and dimension, and I have namespaces set up).