Cloud Native Full Stack: Workloads require manual restart after full OpenShift cluster reboot to get Deep Monitoring

deni
Pro

Hi,

I'm observing a reproducible behavior in an OpenShift cluster monitored with Dynatrace Cloud Native Full Stack and would like to understand whether this is expected or if others have seen something similar.

Environment

  • OpenShift 4.19

  • Dynatrace Operator 1.8.1

  • Cloud Native Full Stack enabled

  • Dynatrace CSI Driver running

  • Dynatrace Webhook running

  • ActiveGate running

What happens

After a full cluster shutdown/startup, the cluster comes back healthy and workloads start successfully.

However, multiple processes appear in Dynatrace as:

  • Restart required

  • Failed to enable

Examples include:

  • Spring Boot application workloads

  • kube-apiserver

  • kubelet

  • openshift-apiserver

  • etcd-related processes

Application workload example

For our Spring Boot application we verified the following:

Immediately after cluster startup:

  • Application pod is running

  • Service is reachable

  • Process appears in Dynatrace

  • Deep Monitoring is not fully active

  • Dynatrace reports Restart required

After performing a rollout restart of the deployment:

  • /opt/dynatrace/oneagent-paas is mounted

  • OneAgent libraries are loaded

  • Deep Monitoring becomes enabled

  • Services and process details appear correctly

What makes this interesting

This is not a one-time occurrence.

We can reproduce it after every full cluster reboot:

  1. Shut down the entire cluster.

  2. Start the cluster again.

  3. Workloads start successfully.

  4. Dynatrace reports multiple processes as Restart required.

  5. Manual pod restart fixes application workloads.

Additional observation

Some processes occasionally disappear from Host → Processes view while remaining visible and active in Process Group view. Opening the host through the process relationship sometimes makes the process visible again.

Question

Has anyone seen similar behavior with Cloud Native Full Stack after a complete OpenShift cluster restart?

Is it expected that workloads may start before the Dynatrace CSI driver/webhook are fully ready, requiring a restart to receive Deep Monitoring?

Are there any recommended practices to ensure workloads are instrumented automatically after cluster recovery without requiring manual rollout restarts?

Thanks!

Regards, Deni

Dynatrace Integration Engineer at CodeAttest

Julius_Loman
DynaMight Legend

This is not a standard situation and should not happen. Dynatrace uses priorityClass to have its components started first. 

I'd recommend either opening a support case or checking your Dynatrace component logs and the pod events for any Dynatrace startup issues. One possibility is that downloading the Dynatrace images takes too long, so the pods do not wait for it and start without Dynatrace. But this is just my assumption, and it needs to be diagnosed in your environment.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

@Julius_Loman

A bit more context from my side:

This is my own lab Bare Metal OpenShift cluster which I use to learn and test Dynatrace features.

The applications are demo workloads and the traffic is synthetic/test traffic.

The entire environment was built from scratch by me, including the OpenShift setup, storage configuration, networking, Dynatrace deployment, application development and deployment, and supporting services. Because of that, it is entirely possible that I have introduced a configuration issue somewhere rather than encountering an actual Dynatrace product problem.

My goal is not only to make the monitoring work, but also to better understand:

  • how Cloud Native Full Stack injection works,

  • the startup dependencies between Operator, Webhook, CSI Driver and workloads,

  • where to look when instrumentation does not happen as expected,

  • which logs and components are most useful during troubleshooting.

Given the behavior I'm seeing after a full cluster reboot, could you suggest what evidence you would collect first?

For example:

  • Which Dynatrace component logs would you inspect first?

  • Are there specific webhook, CSI or Operator messages that indicate failed or missed injection?

  • Is there a way to verify whether a pod started before Dynatrace injection became available?

  • Are there any OpenShift events or Dynatrace diagnostics that would help prove or disprove a startup ordering issue?
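For the start-order question, one rough check is to compare when a workload's containers started with when the Dynatrace CSI driver pod on the same node last became Ready. The following is a minimal Python sketch, assuming both pods have been exported as JSON (for example with `oc get pod <name> -o json`) and loaded into dicts. The function names are hypothetical; only the standard Kubernetes status fields `containerStatuses[].state.running.startedAt` and `conditions[].lastTransitionTime` are relied on.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a Kubernetes timestamp such as 2025-01-01T08:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def ready_since(pod):
    """Return when the pod's Ready condition last turned True, or None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            return parse_ts(cond["lastTransitionTime"])
    return None


def earliest_container_start(pod):
    """Return the earliest startedAt among currently running containers."""
    starts = [
        parse_ts(cs["state"]["running"]["startedAt"])
        for cs in pod.get("status", {}).get("containerStatuses", [])
        if "running" in cs.get("state", {})
    ]
    return min(starts) if starts else None


def started_before_driver(workload_pod, csi_pod):
    """True if a workload container is running and started before the
    CSI driver pod was Ready (or the driver pod is not Ready at all)."""
    started = earliest_container_start(workload_pod)
    if started is None:
        return False  # nothing is running yet, so nothing started early
    driver_ready = ready_since(csi_pod)
    return driver_ready is None or started < driver_ready
```

This only compares timestamps, so it is one piece of evidence rather than proof of a missed injection. The container start time is used instead of the pod's `startTime` on the assumption that pod objects survive the reboot while their containers are restarted.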

I'm mainly trying to learn the correct troubleshooting approach and understand what "good" versus "bad" startup behavior should look like in a Cloud Native Full Stack environment.

Thanks!

Regards, Deni

Dynatrace Integration Engineer at CodeAttest

@deni I'd recommend looking at Troubleshooting posts for Kubernetes here in the community, for example at Pod injection troubleshooting.

Nowadays, top AI models will give you very good recommendations, but why not ask Dynatrace intelligence in the first place? Be sure to have agentic mode enabled; it will give you much better answers.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

@Julius_Loman Thanks for the recommendation, will check the articles.

I did try Dynatrace Assist (including Agentic mode) before opening this discussion, but I still couldn't fully explain the behavior. I've also been cross-checking findings with ChatGPT while digging through the Kubernetes events and Dynatrace injection details.

Today after powering the OpenShift cluster back on I found that one of the affected pods showed the following events during startup:

FailedMount: driver name csi.oneagent.dynatrace.com not found in the list of registered CSI drivers

FailedMount: dynatrace-bootstrapper-config not registered

NetworkPluginNotReady: no CNI configuration file in /etc/kubernetes/cni/net.d/
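Events like these can be filtered programmatically when many pods are affected. The following is a minimal Python sketch, assuming the events were exported as JSON (for example with `oc get events -o json`) and that each item carries the standard `reason` and `message` fields. The function name `summarize_events` is hypothetical, and the driver name is the one shown in the events above.

```python
from collections import Counter

# CSI driver name as it appears in the FailedMount event above.
CSI_DRIVER = "csi.oneagent.dynatrace.com"


def summarize_events(events):
    """Count events per reason and collect FailedMount messages that
    mention the Dynatrace CSI driver."""
    by_reason = Counter(e.get("reason", "Unknown") for e in events)
    csi_failures = [
        e["message"]
        for e in events
        if e.get("reason") == "FailedMount" and CSI_DRIVER in e.get("message", "")
    ]
    return {"by_reason": dict(by_reason), "csi_mount_failures": csi_failures}
```

Running this over the events of all pods on a node after a reboot would show how many of them hit the driver registration failure, as opposed to unrelated startup noise such as the CNI message.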

The pod eventually started successfully and was injected:

  • oneagent.dynatrace.com/injected: true
  • dynakube.dynatrace.com/injected: true
  • LD_PRELOAD is present
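The same three markers can be checked across many pods with a small script. The following is a minimal Python sketch, assuming the pod was exported as JSON (for example with `oc get pod <name> -o json`). It also assumes that `LD_PRELOAD` is visible as an environment variable in the pod spec, which may not hold in every setup. The function name `injection_markers` is hypothetical; the annotation keys are the ones listed above.

```python
# Annotation keys as listed above.
INJECTION_ANNOTATIONS = (
    "oneagent.dynatrace.com/injected",
    "dynakube.dynatrace.com/injected",
)


def injection_markers(pod):
    """Report which injection annotations are set to "true" and which
    containers declare an LD_PRELOAD variable in their spec."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    containers = pod.get("spec", {}).get("containers", [])
    return {
        "annotations": {
            key: annotations.get(key) == "true" for key in INJECTION_ANNOTATIONS
        },
        "ld_preload": {
            c["name"]: any(
                env.get("name") == "LD_PRELOAD" for env in (c.get("env") or [])
            )
            for c in containers
        },
    }
```

These markers only indicate that the pod was mutated for injection. They do not by themselves show that the agent files were delivered into the mounted directory, so the contents of the mount still need to be checked inside the container.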

However, inside the container I can see inconsistent behavior.

One workload gets:

/opt/dynatrace/oneagent-paas
└── oneagent-paas

but the directory is otherwise empty and Dynatrace reports the process as "Restart required".

What makes this interesting is that after the cluster has been running for some time, many of the processes suddenly switch to Deep Monitoring without me restarting anything. Before powering the cluster on this morning, Dynatrace was actually showing most of them as fully monitored. After startup they reverted back to "Restart required" again.

This makes me wonder whether I'm looking at some startup race condition where workloads come up while CSI registration and Dynatrace initialization are still in progress.

Does the CSI driver registration failure shown above look significant to you, or would you expect Dynatrace to recover automatically from that situation once the CSI driver becomes available?

Thanks!
Regards, Deni

Dynatrace Integration Engineer at CodeAttest

Looks like a race condition related to the download of Dynatrace OneAgent images from the repository (yours, public, or cluster, depending on what you have configured). Be sure to check the logs of the Dynatrace pods (webhook, CSI driver).

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner
