topic Re: Alerting on Long running pods in Container platforms

Alerting on Long running pods

PujithAnne — Sun, 15 Dec 2024 18:55:57 GMT

Hi Team,

Is there any option to alert on long-running pods in Kubernetes?

Re: Alerting on Long running pods

Maheedhar_T — Sun, 15 Dec 2024 20:36:16 GMT

Hi @PujithAnne,
Can you please explain what exactly you mean by an alert for long-running pods?
Do you want to be alerted if a pod runs for more than x days?

Re: Alerting on Long running pods

PujithAnne — Mon, 16 Dec 2024 02:09:35 GMT

Yes, I want to get an alert for pods in a specific namespace that have been running for more than 1 day.

Re: Alerting on Long running pods

gopher — Mon, 16 Dec 2024 04:35:36 GMT

@PujithAnne can you please describe the alert scenario in a bit more detail so we can advise?
Is there a specific scenario that you are trying to capture (not just that it's running longer than xx)?

I'm assuming you are looking for an alert where a pod or a job has not terminated and is left in a 'Terminating' status? Those are covered by default.

Or is it really that a pod has not received a termination signal and is left running? That would not be covered, and you'd need some custom metrics and alerts to capture it.

Re: Alerting on Long running pods

PujithAnne — Mon, 16 Dec 2024 05:31:51 GMT

We have pods running a job that needs to complete and terminate within 2-3 hours. We are seeing pods that run for more than 12 hours, consuming resources on the Kubernetes cluster. I have to delete the pods if they are older than 1 day to reduce the resource consumption.

Thanks,

Pujith

Re: Alerting on Long running pods

gopher — Mon, 16 Dec 2024 05:50:17 GMT

@PujithAnne,

Out of the box, you're probably up the creek without a paddle. Uptime isn't kept as a searchable counter (as far as I know). There is an uptime counter in Prometheus for kube pod metrics you could look at, but I'm not sure whether there would be suitable logic in Grail to work out if it's above a set (pre-calculated) value.
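For anyone who does have these metrics in a Prometheus-compatible backend, here is a rough sketch of what such an alert could look like as a plain Prometheus rule file. It assumes kube-state-metrics is being scraped (it exposes kube_pod_start_time as a Unix timestamp and kube_pod_status_phase per phase). The namespace "batch-jobs", the alert name, and the 1-day threshold are placeholders, and whether the same logic can be expressed in Grail is still the open question from above.

```yaml
groups:
  - name: long-running-pods          # hypothetical group name
    rules:
      - alert: PodRunningTooLong     # hypothetical alert name
        # Pod age in seconds = now - start time. The "and on" join keeps only
        # pods whose current phase is Running, so finished pods are ignored.
        expr: |
          (time() - kube_pod_start_time{namespace="batch-jobs"}) > 86400
          and on (namespace, pod)
          kube_pod_status_phase{namespace="batch-jobs", phase="Running"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} has been running for more than 1 day"
```

Lowering 86400 to something like 14400 (4 hours) would flag jobs soon after they overrun the expected 2-3 hour window, rather than only after a full day.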

Alternatively, you can use Kubernetes-native methods like activeDeadlineSeconds to specify the maximum running time of a pod in your pod (or Job) spec.
***Use with extreme caution***: when the time is up, Kubernetes will terminate the pod. But you'll get what you're after, and the pod won't run for longer than expected. You could theoretically set it to, say, 6 hours, or some other point where you would consider the run a lost cause and effectively safe to terminate.

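apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  activeDeadlineSeconds: 3600  # Pod is killed and marked Failed (DeadlineExceeded) after 1 hour
  restartPolicy: Never
  containers:
    - name: job-runner         # placeholder container; replace with your own workload
      image: busybox:1.36
      command: ["sleep", "7200"]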
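Since the pods in question are running a job, it may be worth noting that the same field also exists at the Job level (batch/v1), where it caps the whole Job rather than a single pod. A minimal sketch, assuming the workload is (or can be) defined as a Kubernetes Job; the name, image, and command are placeholders, and 21600 seconds is just the '6 hours' example from above:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                 # placeholder name
spec:
  activeDeadlineSeconds: 21600      # 6 hours; Job is marked Failed (DeadlineExceeded) and its running pods are terminated
  backoffLimit: 2                   # retries on failure; the deadline still applies across all retries
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job-runner          # placeholder container; replace with your own workload
          image: busybox:1.36
          command: ["sleep", "7200"]
```

With this variant the deadline takes precedence over backoffLimit, so a Job that keeps retrying still stops once the limit is reached.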

Good luck