Hi all,
I have been playing with some DQL expressions for detecting a workload in a non-ready state. In most cases, the cause is one or more pods that have either crashed or terminated, or have been intentionally evicted from their respective node.
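For reference, below is roughly the shape of the DQL I have been experimenting with. It is only a sketch: the metric key (dt.kubernetes.pods) and the pod_phase / dt.entity.cloud_application dimensions are assumptions from my own environment, so they may need to be swapped for whatever your Kubernetes metrics actually expose.

```
// Sketch: per-workload count of pods that are not in the Running phase.
// NOTE: the metric key "dt.kubernetes.pods" and the dimensions "pod_phase"
// and "dt.entity.cloud_application" are assumptions -- substitute the keys
// that exist in your environment.
timeseries notReady = sum(dt.kubernetes.pods),
  filter: pod_phase != "Running",
  by: { dt.entity.cloud_application },
  interval: 1m,
  from: now() - 2h
| fieldsAdd workload = entityName(dt.entity.cloud_application)
| filter arraySum(notReady) > 0
```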
Unfortunately, one issue I have run into, which gives me some concern about our organization enabling Pod Not Ready alerting across Dynatrace, is that this alerting doesn't take into account several factors that would help determine whether a true alert condition exists.
For example, last night our OpenShift team performed maintenance on an OpenShift cluster. The maintenance caused all nodes in the cluster to restart one after another, which in turn evicted the pods on each node and recreated them on another node.
I had the above alert enabled on my specific workload and experienced a lot of alerting noise, which could have been avoided if Dynatrace had more appropriately identified what was going on. For example:
The alert itself showed the following:
What you can extrapolate from the above data, if you had no other context, is that "something" in the workload is flapping.
However, it would be ideal if the alert took into consideration that this was a StatefulSet, and that each of those "blips" was a different pod within that StatefulSet being moved around on purpose.
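To illustrate what I mean, splitting the same kind of query per pod rather than only per workload makes it fairly obvious that each blip belongs to a different pod of the StatefulSet, rather than one pod flapping. As before, the metric key and the pod-level dimension (k8s.pod.name here) are assumed names for the sketch:

```
// Sketch: the same not-ready signal, split per pod, so consecutive "blips"
// caused by different pods of the StatefulSet show up as separate series
// instead of one flapping workload.
// NOTE: "dt.kubernetes.pods", "pod_phase" and "k8s.pod.name" are assumed names.
timeseries notReady = sum(dt.kubernetes.pods),
  filter: pod_phase != "Running",
  by: { dt.entity.cloud_application, k8s.pod.name },
  interval: 1m,
  from: now() - 2h
| filter arraySum(notReady) > 0
| sort k8s.pod.name asc
```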
I have tried playing around with sample windows and intervals, but the challenge is that, depending on the node restart order and whether the pods within the workload are on those nodes, the outcome can look like either "blips" or one long outage.
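For completeness, the kind of tweaking I mean is essentially just widening the interval on the same query (still using the assumed metric key). A wider window smooths out a single pod's blip, but if the node restart order happens to hit several of the workload's pods back to back, the widened window merges them into what looks like one long outage:

```
// Sketch: same query as above with a wider interval. This hides an isolated
// single-pod blip, but back-to-back evictions across nodes merge into what
// looks like one continuous not-ready period for the workload.
// NOTE: metric key and dimension names are assumptions, as before.
timeseries notReady = sum(dt.kubernetes.pods),
  filter: pod_phase != "Running",
  by: { dt.entity.cloud_application },
  interval: 5m,
  from: now() - 2h
```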
Example