27 Nov 2024 11:01 AM - edited 05 Dec 2024 07:31 AM
This Full-Self-Service article helps with root cause analysis of failed OneAgent deployments or updates on Kubernetes/OpenShift and explains the CrashLoopBackOff state.
Issue: OneAgent installation/update: Pod failed to start with CrashLoopBackOff
Solution: Check logs for root cause and address on your side
Tasks: Follow the steps outlined below
Alternative: Contact Dynatrace Customer Success and Support via chat or ticket
CrashLoopBackOff is a Kubernetes state representing a restart loop in a Pod: a container in the Pod starts, crashes, and is then restarted, over and over again. Kubernetes waits an increasing back-off time between restarts to give you a chance to fix the error. As such, CrashLoopBackOff is not an error in itself, but indicates that an error is occurring that prevents a Pod from starting properly.
First of all, please make sure to gather the logs for troubleshooting and future reference. Even if the problem resolves after a restart or re-installation, you may still want to know the root cause and what happened that day.
kubectl describe -n dynatrace [pod-name]
and check the Events section of the output for useful information about what happened.
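The log-gathering step can be sketched as a small shell script. The namespace and pod name below are placeholders (assumptions); replace them with the crashing pod on your cluster:

```shell
# Placeholders: adjust the namespace and pod name to your failing pod.
NS="dynatrace"
POD="oneagent-pod"

if command -v kubectl >/dev/null 2>&1; then
  # Pod events and state transitions (look at the Events section)
  kubectl describe pod -n "$NS" "$POD" > "describe-${POD}.txt" 2>&1 || true
  # --previous returns the logs of the crashed container instance,
  # not the freshly restarted one
  kubectl logs -n "$NS" "$POD" --previous > "logs-${POD}.txt" 2>&1 || true
  status="collected"
else
  status="kubectl-missing"
fi
```

Keep these files; they are useful later if the issue needs to be escalated.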
In Support, we see such reports with OneAgent on Kubernetes/OpenShift when another security tool prevents the execution of the oneagentwatchdog process. In that case, you'll see this OneAgent log:
Failed to execute /opt/dynatrace/oneagent/agent/lib64/oneagentwatchdog: error code: 13 (Permission denied)
Validate that oneagentwatchdog has execute permission (ls -l or stat <filename> show the full permissions).
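A quick way to check the execute bit from a shell; this sketch uses a temporary file as a stand-in (assumption: on the node you would point it at /opt/dynatrace/oneagent/agent/lib64/oneagentwatchdog instead):

```shell
# Stand-in file; on the node, check the oneagentwatchdog binary instead.
f=$(mktemp)
chmod 644 "$f"     # simulate a binary with the execute bit missing

if [ -x "$f" ]; then
  perm_ok=yes
else
  perm_ok=no       # corresponds to error code 13 (Permission denied)
fi
echo "executable: $perm_ok"
rm -f "$f"
```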
If the execute bit is set, look into the logs of your security tooling to identify and fix the root cause.
We also see such failures when attempting a downgrade from a higher to a lower OneAgent version, which is not supported. You'll see these errors in the logs:
[ERROR] Downgrading OneAgent is not supported, please uninstall the old version first
[ERROR] Attempted downgrade from 1.<...> to 1.<...>
Uninstall and reinstall the OneAgent.
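To check whether a planned version change would be a downgrade, you can compare the two version strings before rolling out; this sketch uses sort -V, and the version numbers are illustrative placeholders (assumptions, not real releases):

```shell
# Illustrative versions; substitute the installed and target versions.
installed="1.299.50"
target="1.281.20"

# sort -V orders version strings numerically; if the target sorts
# lowest and differs from the installed version, it is a downgrade.
lowest=$(printf '%s\n%s\n' "$installed" "$target" | sort -V | head -n1)
if [ "$target" = "$lowest" ] && [ "$target" != "$installed" ]; then
  result="downgrade"
else
  result="upgrade-or-same"
fi
echo "$installed -> $target: $result"
```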
If you see a log indicating Failed to pull images with <IP>: i/o timeout, make sure the mentioned IP is reachable. Possible reasons include network outages, unavailable nodes, or firewall misconfigurations.
Additionally, incorrect variables or settings can cause the failure. Review all configuration files and environment variables, and make sure certificates are valid. You can also fall back to the default configuration to rule out a misconfiguration.
Another possibility is that the Pod runs into resource issues and requests more CPU or memory than is available, leading to crashes. Check the resource allocation using kubectl describe pod [pod-name] and adjust resource requests and limits as needed.
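For reference, this is where requests and limits live in a Pod spec; the sketch below is a minimal example with illustrative values (assumptions, not Dynatrace sizing recommendations):

```yaml
# Illustrative values only; size them for your environment.
apiVersion: v1
kind: Pod
metadata:
  name: oneagent-pod           # placeholder name
  namespace: dynatrace
spec:
  containers:
    - name: oneagent
      image: <registry>/oneagent   # placeholder image
      resources:
        requests:
          cpu: "100m"          # scheduler guarantees at least this much
          memory: "512Mi"
        limits:
          cpu: "300m"          # container is throttled above this
          memory: "1Gi"        # exceeding this gets the container OOM-killed
```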
As mentioned above, OneAgent installation issues on Kubernetes can typically be resolved by customers themselves. Edge cases may exist where our code causes the application to fail. In that case, contact Dynatrace Support to investigate, providing the output of kubectl logs [pod-name] and the other information gathered earlier. You can also try disabling individual OneAgent features to isolate which feature might be causing the issue.