Hi all,
I have been playing with some DQL expressions for detecting a workload in a non-ready state. In most cases, the cause is one or more pods that have either crashed or terminated, or have been intentionally evicted from their respective node.
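For reference, below is roughly the shape of the DQL I have been experimenting with. It is only a sketch: the metric key (dt.kubernetes.pods) and the pod_phase / dt.entity.cloud_application dimensions are assumptions from my own environment, so they may need to be swapped for whatever your Kubernetes metrics actually expose.

```
// Sketch: per-workload count of pods that are not in the Running phase.
// NOTE: the metric key "dt.kubernetes.pods" and the dimensions "pod_phase"
// and "dt.entity.cloud_application" are assumptions -- substitute the keys
// that exist in your environment.
timeseries notReady = sum(dt.kubernetes.pods),
  filter: pod_phase != "Running",
  by: { dt.entity.cloud_application },
  interval: 1m,
  from: now() - 2h
| fieldsAdd workload = entityName(dt.entity.cloud_application)
| filter arraySum(notReady) > 0
```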
Unfortunately, one issue I have run into, which gives me some concern about our organization enabling Pod Not Ready alerting across Dynatrace, is that this alerting doesn't take into account several factors that would help determine whether a true alert condition exists.
For example, last night our OpenShift team performed maintenance on an OpenShift cluster. The maintenance caused all nodes in the cluster to restart one after another, which in turn evicted the pods on each node and recreated them on another node.
I had the above alert enabled on my specific workload and experienced a lot of alerting noise, which could have been avoided if Dynatrace had more appropriately identified what was going on. For example:
The alert itself showed the following:
What you can extrapolate from the above data, if you had no other context, is that "something" in the workload is flapping.
However, it would be ideal if the alert took into consideration that this was a StatefulSet, and that each of those "blips" was a different pod within that StatefulSet being moved around on purpose.
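To illustrate what I mean, splitting the same kind of query per pod rather than only per workload makes it fairly obvious that each blip belongs to a different pod of the StatefulSet, rather than one pod flapping. As before, the metric key and the pod-level dimension (k8s.pod.name here) are assumed names for the sketch:

```
// Sketch: the same not-ready signal, split per pod, so consecutive "blips"
// caused by different pods of the StatefulSet show up as separate series
// instead of one flapping workload.
// NOTE: "dt.kubernetes.pods", "pod_phase" and "k8s.pod.name" are assumed names.
timeseries notReady = sum(dt.kubernetes.pods),
  filter: pod_phase != "Running",
  by: { dt.entity.cloud_application, k8s.pod.name },
  interval: 1m,
  from: now() - 2h
| filter arraySum(notReady) > 0
| sort k8s.pod.name asc
```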
I have tried playing around with sample windows and intervals, but the challenge is that, depending on the node restart order and whether the pods within the workload are on those nodes, the outcome can look like either "blips" or one long outage.
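For completeness, the kind of tweaking I mean is essentially just widening the interval on the same query (still using the assumed metric key). A wider window smooths out a single pod's blip, but if the node restart order happens to hit several of the workload's pods back to back, the widened window merges them into what looks like one long outage:

```
// Sketch: same query as above with a wider interval. This hides an isolated
// single-pod blip, but back-to-back evictions across nodes merge into what
// looks like one continuous not-ready period for the workload.
// NOTE: metric key and dimension names are assumptions, as before.
timeseries notReady = sum(dt.kubernetes.pods),
  filter: pod_phase != "Running",
  by: { dt.entity.cloud_application },
  interval: 5m,
  from: now() - 2h
```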
Example