27 Nov 2024 11:01 AM - edited 05 Dec 2024 07:31 AM
This Full-Self-Service article helps with root cause analysis of failed OneAgent deployments or updates on Kubernetes/OpenShift and explains the CrashLoopBackOff state.
Issue: OneAgent installation/update: Pod failed to start with CrashLoopBackOff
Solution: Check logs for root cause and address on your side
Tasks: Follow the steps outlined below
Alternative: Contact Dynatrace Customer Success and Support via chat or ticket
CrashLoopBackOff is a Kubernetes state representing a restart loop in a Pod: a container in the Pod starts, crashes, and is then restarted, over and over again. Kubernetes waits an increasing back-off time between restarts to give you a chance to fix the error. As such, CrashLoopBackOff is not an error in itself, but indicates that an error is occurring that prevents a Pod from starting properly.
First of all, please make sure to gather the logs for troubleshooting and future reference. Even if the problem resolves after a restart or re-installation, you may still want to know the root cause and what happened that day.
kubectl describe -n dynatrace [pod-name]
and check the Events section of the output for useful information about what happened.
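The log-gathering step can be sketched as a small shell script. The namespace and pod name below are placeholders (assumptions); replace them with the crashing pod on your cluster:

```shell
# Placeholders: adjust the namespace and pod name to your failing pod.
NS="dynatrace"
POD="oneagent-pod"

if command -v kubectl >/dev/null 2>&1; then
  # Pod events and state transitions (look at the Events section)
  kubectl describe pod -n "$NS" "$POD" > "describe-${POD}.txt" 2>&1 || true
  # --previous returns the logs of the crashed container instance,
  # not the freshly restarted one
  kubectl logs -n "$NS" "$POD" --previous > "logs-${POD}.txt" 2>&1 || true
  status="collected"
else
  status="kubectl-missing"
fi
```

Keep these files; they are useful later if the issue needs to be escalated.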
In Support, we see such reports with OneAgent on Kubernetes/OpenShift when another security tool prevents the execution of the oneagentwatchdog process. In that case, you'll see this OneAgent log:
Failed to execute /opt/dynatrace/oneagent/agent/lib64/oneagentwatchdog: error code: 13 (Permission denied)
Validate that oneagentwatchdog has execute permission (ls -l or stat <filename> show the full permissions).
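A quick way to check the execute bit from a shell; this sketch uses a temporary file as a stand-in (assumption: on the node you would point it at /opt/dynatrace/oneagent/agent/lib64/oneagentwatchdog instead):

```shell
# Stand-in file; on the node, check the oneagentwatchdog binary instead.
f=$(mktemp)
chmod 644 "$f"     # simulate a binary with the execute bit missing

if [ -x "$f" ]; then
  perm_ok=yes
else
  perm_ok=no       # corresponds to error code 13 (Permission denied)
fi
echo "executable: $perm_ok"
rm -f "$f"
```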
If the execute bit is set, look into the logs of your security tooling to identify and fix the root cause.
We also see such failures when attempting a downgrade from a higher to a lower OneAgent version, which is not supported. You'll see these errors in the logs:
[ERROR] Downgrading OneAgent is not supported, please uninstall the old version first
[ERROR] Attempted downgrade from 1.<...> to 1.<...>
Uninstall and reinstall the OneAgent.
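To check whether a planned version change would be a downgrade, you can compare the two version strings before rolling out; this sketch uses sort -V, and the version numbers are illustrative placeholders (assumptions, not real releases):

```shell
# Illustrative versions; substitute the installed and target versions.
installed="1.299.50"
target="1.281.20"

# sort -V orders version strings numerically; if the target sorts
# lowest and differs from the installed version, it is a downgrade.
lowest=$(printf '%s\n%s\n' "$installed" "$target" | sort -V | head -n1)
if [ "$target" = "$lowest" ] && [ "$target" != "$installed" ]; then
  result="downgrade"
else
  result="upgrade-or-same"
fi
echo "$installed -> $target: $result"
```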
If you see a log indicating Failed to pull images with <IP>: i/o timeout, make sure the mentioned IP is reachable. Possible reasons include network outages, unavailable nodes, or firewall misconfigurations.
Additionally, incorrect variables or settings can cause the failure. Review all configuration files and environment variables, and make sure certificates are valid. You can also fall back to the default configuration to rule out a misconfiguration.
Another possibility is that the Pod runs into resource issues and requests more CPU or memory than is available, leading to crashes. Check the resource allocation using kubectl describe pod [pod-name] and adjust resource requests and limits as needed.
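For reference, this is where requests and limits live in a Pod spec; the sketch below is a minimal example with illustrative values (assumptions, not Dynatrace sizing recommendations):

```yaml
# Illustrative values only; size them for your environment.
apiVersion: v1
kind: Pod
metadata:
  name: oneagent-pod           # placeholder name
  namespace: dynatrace
spec:
  containers:
    - name: oneagent
      image: <registry>/oneagent   # placeholder image
      resources:
        requests:
          cpu: "100m"          # scheduler guarantees at least this much
          memory: "512Mi"
        limits:
          cpu: "300m"          # container is throttled above this
          memory: "1Gi"        # exceeding this gets the container OOM-killed
```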
As mentioned above, OneAgent installation issues on Kubernetes can typically be resolved by customers themselves. Edge cases may exist where our code causes the application to fail. In that case, contact Dynatrace Support to investigate, providing the output of kubectl logs [pod-name] and the other information gathered earlier. You can also try disabling individual OneAgent features to isolate which feature might be causing the issue.