04 Oct 2023 11:00 PM - last edited on 08 Oct 2023 01:15 PM by MaciejNeumann
Classic dashboard, simple metric expression:
builtin:kubernetes.pods
:last
:filter(eq("pod_phase","Failed"))
:splitBy("k8s.namespace.name","dt.entity.kubernetes_cluster")
:sum
:sort(value(sum,descending))
It shows failed pods in descending order and works great! The timeframe says the last 5 minutes, but those pods never seem to go away from that query. Should they?
Timeframe: 2023-10-04 16:44 - 16:49, Auto (1m), 4 record(s)
The number of records just keeps going up, even though the pods were rebuilt long ago.
thanks
05 Oct 2023 07:40 PM - edited 05 Oct 2023 07:40 PM
Hi @nutsy4sure
I guess it depends. At one of my clients I see a similar situation. My metric expression is:
builtin:kubernetes.pods
:filter(and(eq(pod_phase,Failed)))
:splitBy("dt.entity.cloud_application")
:sort(value(auto,descending))
:limit(10)
The dashboard is filtered for the last 5 minutes (it has not been changed for 9 days because nobody cares about the failed pods; they are still there). 😉
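If it helps to see why: builtin:kubernetes.pods reports the current number of pods per phase in every one-minute slot, and a pod stays in the Failed phase until it is actually deleted from the cluster, so each slot keeps counting it. A minimal sketch to verify this per namespace (just the raw series, with no extra aggregations; the dimension names are the ones from your original expression):
builtin:kubernetes.pods
:filter(eq("pod_phase","Failed"))
:splitBy("k8s.namespace.name")
:sum
If the chart shows a flat non-zero line across the timeframe, the pods still exist as Failed objects in the cluster and the query is behaving as designed.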
I hope it helps.
Best regards,
Mizső
05 Oct 2023 07:51 PM
Thank you for the reply, but it sounds like you are just confirming my experience. The pods failed and VMware reacted the way it should and spun up a new one. At that point, I don't care about the failed pod. Under different circumstances I might not want it to disappear, for investigation purposes, but I want this dashboard to be "real time", and days-old failures that VMware long ago recovered from don't reflect the current status.
05 Oct 2023 08:37 PM
Hi @nutsy4sure,
It is not VMware, it is Kubernetes. And based on the metric expression, this is the actual status for the last five minutes: 1 running pod and 4 failed pods.
You should find another solution. Maybe you could count one of the Kubernetes events which refers to the failed pod (with, e.g., a pod name dimension). Then it can be visualized well.
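For example, something along these lines, as a rough sketch only (the metric key and dimension names below are placeholders, not real built-in identifiers; the actual ones depend on how Kubernetes events are ingested in your environment):
<your.kubernetes.events.metric>
:filter(eq("reason","Failed"))
:splitBy("pod_name")
:sum
:sort(value(sum,descending))
:limit(10)
Because an event is counted only at the moment it happens, the series drops back to zero once the failures stop, which should give the "real time" view you are after.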
I hope it helps.
Best regards,
Mizső