
Cloud Native Full Stack: Workloads require manual restart after full OpenShift cluster reboot to get Deep Monitoring

deni
Pro

Hi,

I'm observing a reproducible behavior in an OpenShift cluster monitored with Dynatrace Cloud Native Full Stack and would like to understand whether this is expected or if others have seen something similar.

Environment

  • OpenShift 4.19

  • Dynatrace Operator 1.8.1

  • Cloud Native Full Stack enabled

  • Dynatrace CSI Driver running

  • Dynatrace Webhook running

  • ActiveGate running

What happens

After a full cluster shutdown/startup, the cluster comes back healthy and workloads start successfully.

However, multiple processes appear in Dynatrace as:

  • Restart required

  • Failed to enable

Examples include:

  • Spring Boot application workloads

  • kube-apiserver

  • kubelet

  • openshift-apiserver

  • etcd-related processes

Application workload example

For our Spring Boot application we verified the following:

Immediately after cluster startup:

  • Application pod is running

  • Service is reachable

  • Process appears in Dynatrace

  • Deep Monitoring is not fully active

  • Dynatrace reports Restart required

After performing a rollout restart of the deployment:

  • /opt/dynatrace/oneagent-paas is mounted

  • OneAgent libraries are loaded

  • Deep Monitoring becomes enabled

  • Services and process details appear correctly
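The mount check from the first bullet can be scripted against the output of `oc get pod <name> -o json`. This is a minimal sketch in Python, assuming the standard Kubernetes Pod schema. The function name `has_oneagent_mount` and the sample pod objects are hypothetical; the path is the one observed above.

```python
# Path observed in the restarted pod described in this post.
ONEAGENT_MOUNT_PATH = "/opt/dynatrace/oneagent-paas"


def has_oneagent_mount(pod: dict, mount_path: str = ONEAGENT_MOUNT_PATH) -> bool:
    """Return True if any application container in the pod mounts mount_path."""
    containers = pod.get("spec", {}).get("containers", [])
    return any(
        mount.get("mountPath") == mount_path
        for container in containers
        for mount in container.get("volumeMounts", [])
    )


# Hypothetical, trimmed-down pod objects for illustration only.
pod_after_rollout_restart = {
    "spec": {
        "containers": [
            {
                "name": "app",
                "volumeMounts": [
                    {"name": "oneagent-bin", "mountPath": "/opt/dynatrace/oneagent-paas"},
                ],
            }
        ]
    }
}
pod_after_cluster_reboot = {
    "spec": {"containers": [{"name": "app", "volumeMounts": []}]}
}

print(has_oneagent_mount(pod_after_rollout_restart))
print(has_oneagent_mount(pod_after_cluster_reboot))
```

In practice the dictionaries would come from `json.load` on the `oc` output rather than from literals.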

What makes this interesting

This is not a one-time occurrence.

We can reproduce it after every full cluster reboot:

  1. Shut down the entire cluster.

  2. Start the cluster again.

  3. Workloads start successfully.

  4. Dynatrace reports multiple processes as Restart required.

  5. Manual pod restart fixes application workloads.

Additional observation

Some processes occasionally disappear from Host → Processes view while remaining visible and active in Process Group view. Opening the host through the process relationship sometimes makes the process visible again.

Question

Has anyone seen similar behavior with Cloud Native Full Stack after a complete OpenShift cluster restart?

Is it expected that workloads may start before the Dynatrace CSI driver/webhook are fully ready, requiring a restart to receive Deep Monitoring?

Are there any recommended practices to ensure workloads are instrumented automatically after cluster recovery without requiring manual rollout restarts?

Thanks!

Regards, Deni

Dynatrace Integration Engineer at CodeAttest
2 REPLIES

Julius_Loman
DynaMight Legend

This is not a standard situation and should not happen. Dynatrace uses priorityClass to have its components started first. 

I'd recommend either opening a support case or checking your Dynatrace component logs and inspecting the pod events for any Dynatrace startup issues. One possibility is that downloading the Dynatrace images takes too long, so the workload pods do not wait for it and start without Dynatrace. This is only my assumption, though, and it needs to be diagnosed in your environment.
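One way to test the slow image download assumption is to measure the gap between the kubelet's `Pulling` and `Pulled` events for the Dynatrace pods. The sketch below is in Python and assumes the events were exported with something like `oc get events -n <namespace> -o json` and use the core/v1 Event fields `reason`, `involvedObject`, `firstTimestamp` and `lastTimestamp`. The function names and the sample events are hypothetical.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Kubernetes serializes these timestamps as e.g. "2025-01-01T10:00:05Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def image_pull_seconds(events: list[dict]) -> dict[str, float]:
    """Map pod name to seconds between its first Pulling and last Pulled event."""
    started: dict[str, datetime] = {}
    finished: dict[str, datetime] = {}
    for event in events:
        pod = event.get("involvedObject", {}).get("name")
        ts = event.get("lastTimestamp") or event.get("firstTimestamp")
        if not pod or not ts:
            continue  # skip events without a usable name or timestamp
        when = parse_ts(ts)
        if event.get("reason") == "Pulling":
            started[pod] = min(when, started.get(pod, when))
        elif event.get("reason") == "Pulled":
            finished[pod] = max(when, finished.get(pod, when))
    # Pods whose image was already cached have no Pulling event and are omitted.
    return {
        pod: (finished[pod] - started[pod]).total_seconds()
        for pod in started
        if pod in finished
    }


# Hypothetical events for illustration only.
sample_events = [
    {"reason": "Pulling", "involvedObject": {"name": "csi-driver-example"},
     "lastTimestamp": "2025-01-01T10:00:05Z"},
    {"reason": "Pulled", "involvedObject": {"name": "csi-driver-example"},
     "lastTimestamp": "2025-01-01T10:03:35Z"},
    {"reason": "Pulled", "involvedObject": {"name": "webhook-example"},
     "lastTimestamp": "2025-01-01T10:00:10Z"},
]

pull_times = image_pull_seconds(sample_events)
print(pull_times)
```

Events are retained only for a limited time, so they need to be collected soon after the cluster comes back up.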

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

@Julius_Loman

A bit more context from my side:

This is my own lab Bare Metal OpenShift cluster which I use to learn and test Dynatrace features.

The applications are demo workloads and the traffic is synthetic/test traffic.

I built the entire environment from scratch, including the OpenShift setup, storage configuration, networking, Dynatrace deployment, application development and deployment, and supporting services. Because of that, it is entirely possible that I introduced a configuration issue somewhere rather than encountering an actual Dynatrace product problem.

My goal is not only to make the monitoring work, but also to better understand:

  • how Cloud Native Full Stack injection works,

  • the startup dependencies between Operator, Webhook, CSI Driver and workloads,

  • where to look when instrumentation does not happen as expected,

  • which logs and components are most useful during troubleshooting.

Given the behavior I'm seeing after a full cluster reboot, could you suggest what evidence you would collect first?

For example:

  • Which Dynatrace component logs would you inspect first?

  • Are there specific webhook, CSI or Operator messages that indicate failed or missed injection?

  • Is there a way to verify whether a pod started before Dynatrace injection became available?

  • Are there any OpenShift events or Dynatrace diagnostics that would help prove or disprove a startup ordering issue?
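As a starting point for the third question, the timestamps Kubernetes already records can be compared directly. This is a rough sketch in Python, assuming pod JSON from `oc get pod -o json` with the standard `status.conditions` and `status.containerStatuses` fields. Because pod objects can survive a reboot, it compares container start times rather than `creationTimestamp`. All function names and sample objects are hypothetical.

```python
from datetime import datetime
from typing import Optional


def parse_ts(ts: str) -> datetime:
    # Kubernetes serializes these timestamps as e.g. "2025-01-01T10:02:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def ready_since(pod: dict) -> Optional[datetime]:
    """Return when the pod's Ready condition last became True, or None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            return parse_ts(cond["lastTransitionTime"])
    return None


def earliest_container_start(pod: dict) -> Optional[datetime]:
    """Return the earliest startedAt among currently running containers."""
    starts = [
        parse_ts(status["state"]["running"]["startedAt"])
        for status in pod.get("status", {}).get("containerStatuses", [])
        if "running" in status.get("state", {})
    ]
    return min(starts) if starts else None


def started_before(workload_pod: dict, component_pod: dict) -> bool:
    """True if the workload's containers started before the component was Ready."""
    start = earliest_container_start(workload_pod)
    ready = ready_since(component_pod)
    if start is None:
        return False  # nothing running yet, so nothing to compare
    if ready is None:
        return True  # the component is not Ready at all
    return start < ready


# Hypothetical, trimmed-down pod objects for illustration only.
webhook_pod = {"status": {"conditions": [
    {"type": "Ready", "status": "True", "lastTransitionTime": "2025-01-01T10:02:00Z"},
]}}
early_app = {"status": {"containerStatuses": [
    {"state": {"running": {"startedAt": "2025-01-01T10:00:30Z"}}},
]}}
late_app = {"status": {"containerStatuses": [
    {"state": {"running": {"startedAt": "2025-01-01T10:05:00Z"}}},
]}}

print(started_before(early_app, webhook_pod))
print(started_before(late_app, webhook_pod))
```

The same comparison could be run against the CSI driver pod on the node where the workload is scheduled.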

I'm mainly trying to learn the correct troubleshooting approach and understand what "good" versus "bad" startup behavior should look like in a Cloud Native Full Stack environment.

Thanks!

Regards, Deni

Dynatrace Integration Engineer at CodeAttest
