15 Dec 2024 06:55 PM
Hi Team,
Is there any option to alert on long-running pods in Kubernetes?
15 Dec 2024 08:36 PM
Hi @PujithAnne ,
Can you please explain what exactly you mean by an alert for long-running pods?
Do you want to be alerted if a pod runs for more than x days?
16 Dec 2024 02:09 AM
Yes, I want to get an alert for pods in a specific namespace that have been running for more than 1 day.
16 Dec 2024 04:35 AM
@PujithAnne can you please describe the alert scenario in a bit more detail so we can advise?
Is there a specific scenario that you are trying to capture (not just that it's running longer than xx)?
I'm assuming that you are looking for an alert where a pod or a job has not terminated and is left in a 'Terminating' status? These are covered by default.
Or is it really that a pod has not received a termination signal and is left running? This would not be covered, and you'd need some custom metrics and alerts to capture it.
16 Dec 2024 05:31 AM
We have pods running a job that needs to complete and terminate within 2-3 hours. We are seeing pods that have been running for over 12 hours, consuming resources on the Kubernetes cluster. I have to delete pods that are older than 1 day to reduce the resource consumption.
Thanks,
Pujith
16 Dec 2024 05:50 AM
@PujithAnne ,
Out of the box, you're probably up the creek without a paddle. Uptime isn't kept as a searchable counter (as far as I know). There is an uptime counter in Prometheus for kube pod metrics you could look at, but I'm not sure there would be suitable logic in Grail to work out whether it's above a set (pre-calculated) value.
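If you end up scripting the check yourself, the logic is simple enough. Here is a rough sketch, not a tested integration: it assumes you feed it the `items` list from `kubectl get pods -A -o json` and compares each pod's `status.startTime` against a threshold. The function names `parse_k8s_time` and `long_running_pods` are made up for illustration, and what you do with the result (raise an event, send a mail, delete the pod) is up to you.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as '2024-12-15T06:55:00Z' as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def long_running_pods(pods, namespace, max_age, now=None):
    """Return (name, age) for Running pods in `namespace` older than `max_age`.

    `pods` is assumed to be the `items` list from `kubectl get pods -A -o json`.
    """
    now = now or datetime.now(timezone.utc)
    result = []
    for pod in pods:
        if pod["metadata"]["namespace"] != namespace:
            continue
        status = pod.get("status", {})
        # Only pods that are still running count as "long running".
        if status.get("phase") != "Running":
            continue
        start = status.get("startTime")
        if not start:
            continue  # pod has not been started yet
        age = now - parse_k8s_time(start)
        if age > max_age:
            result.append((pod["metadata"]["name"], age))
    return result
```

For your case you would call something like `long_running_pods(items, "your-namespace", timedelta(days=1))` on a schedule and alert on anything it returns.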
Alternatively, you can use Kubernetes-native methods like activeDeadlineSeconds to specify the maximum running time of a pod. It goes in the Pod or Job spec (as far as I know, it is not accepted in a Deployment's pod template).
***Use with extreme caution.*** When the time is up, it will terminate the pod. But you'll get what you're after, and the pod won't run for longer than expected. You could theoretically set it to, say, 6 hours, or whatever point you would consider the pod a lost cause and effectively safe to terminate.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  activeDeadlineSeconds: 3600  # Pod will be terminated after 1 hour (3600 seconds)
Good luck