Container platforms
Questions about Kubernetes, OpenShift, Docker, and more.

Rollout restart OneAgent pods for worker nodes to appear on the deployment status page

Theodore_x86
Advisor

Hello Community team!

For the last two months I have performed many Dynatrace deployments on Kubernetes environments (full Kubernetes observability). In every case, after a successful DynaKube deployment, and with all OneAgent pods, CSI driver pods, and the ActiveGate pod up and running, the worker nodes do not become visible in the UI until I restart the OneAgent daemonset twice!
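For reference, the restart is a plain daemonset rollout restart. A sketch, assuming the default DynaKube name `dynakube` and the `dynatrace` namespace (check your actual daemonset name with `kubectl get ds -n dynatrace`):

```shell
# Restart the OneAgent daemonset and wait for the rollout to finish
# (daemonset name and namespace are assumptions; adjust to your deployment)
kubectl -n dynatrace rollout restart daemonset/dynakube-oneagent
kubectl -n dynatrace rollout status daemonset/dynakube-oneagent
```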

Has anyone else had the same experience? Why this happens beats me. While the worker nodes are missing from the Deployment status page, no significant errors appear in the pod logs.

Without the Nodes on the deployment page, we do not have injection status for the processes/pods.

Any comment on the above would be much appreciated.

Thank you!

Houston, we have a problem.
7 REPLIES

Julius_Loman
DynaMight Legend

What is in the OneAgent logs? I'd say you encountered the situation where your OneAgent has to connect through the ActiveGate deployed by the operator, but the ActiveGate has not been started yet.

Dynatrace also introduced a feature where the Operator generates a self-signed certificate for the ActiveGate and pushes it to the OneAgents. If the OneAgent has the custom.pem supplied, it validates the certificates, so the OneAgent probably cannot connect to any ActiveGate outside the cluster, and the internal (operator-managed) ActiveGate is not yet known to it.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

Theodore_x86
Advisor

Hello. I need to return to this issue, as I faced it one more time on a new AKS cluster.

Without a OneAgent pod restart, the worker nodes do not appear on the Deployment status page. Yesterday, after a restart, 4 out of 6 nodes appeared. I had to request another restart for the other 2 to appear.

@Julius_Loman let me reply to your comments (not quite promptly, sorry for that 🙂 )

The ActiveGate pod is always up. I do not see how the new self-signed certificate feature could affect communication; after all, when we restart the OneAgent pods, communication is established.

Nothing meaningful has been found in the logs yet. Everything seems to run smoothly, except that we do not see the nodes without a rollout restart.

BR

 

Houston, we have a problem.

Still, a race condition may appear. Can you share what is in the logs of the OneAgent that did not connect? But check the ruxitagent_host files, not the console output of the OneAgents (which basically provides the watchdog log). That means executing a shell in the OneAgent pod:

kubectl exec -n dynatrace -ti dynakube-oneagent-r2tvr -- /bin/sh

 and check the log files starting with ruxitagent_host in /mnt/volume_storage_mount/var_log/os/

What can happen is that the OneAgents start before the ActiveGate has successfully connected and registered itself in the cluster as a communication endpoint. If the OneAgents have the custom.pem file provided by the operator, they will validate the remote side. If the ActiveGate is not yet connected to the cluster, the OneAgents try to connect to the rest of the known communication endpoints and fail due to the certificate. If you restart the OneAgent pods, they may already have the ActiveGate in the communication endpoint list on startup (the Operator updates the list). The list of communication endpoints does not get updated for a running OneAgent pod until it has connected.
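To see which communication endpoints the operator is currently handing to the OneAgents, you can inspect the connection-info ConfigMap it maintains. A sketch, assuming the default DynaKube name `dynakube` (the ConfigMap name is an assumption; list the actual ones with `kubectl get cm -n dynatrace`):

```shell
# Dump the ConfigMap holding the OneAgent communication endpoints
# (ConfigMap name is an assumption based on the DynaKube name)
kubectl -n dynatrace get configmap dynakube-oneagent-connection-info -o yaml
```

If the ActiveGate address is missing here while the AG pod is already running, you are likely looking at the race described above.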

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

Thanks @Julius_Loman for your comments.

I see your point here; however, I do not have access to the AKS cluster myself and perform any actions together with the administrator. Monitoring works fine now, so there is no reason to disrupt anything.

Next time I perform an installation on AKS and see this behavior, I will immediately check the logs on the dynakube-oneagent pod and update the findings here. Until then, I may as well sit tight.

I would be surprised if the reason for this were AG connectivity, because while we wait for the OneAgents to appear in the Dynatrace UI, the AG is registered properly and we see Kubernetes metrics on the Kubernetes page, meaning the AG reaches the Kubernetes API and returns results to the DT cluster. But again, let's wait and see what the OneAgent logs say next time.

Thank you!

Houston, we have a problem.

@Theodore_x86 it's not the AG connectivity. It's the race condition: when the OA starts at the same time as the AG, the OA connection string (where it should connect) is populated with the endpoints known to the cluster at that time, and it won't include the AG. If the OAs are not allowed to connect anywhere else due to network policies or certificates, the connection string won't be updated with the AG address.

I think it's this case.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

Thanks @Julius_Loman .

It will be the first thing to check next time.

BR

Houston, we have a problem.

@Theodore_x86, this recently happened for me again, and I confirm the issue above.

What happened in my scenario:

  • k8s cluster had connection only to the Dynatrace cluster (no AG - basically empty environment)
  • DynaKube deployed in Cloud Native FullStack mode with ActiveGate
  • No additional CA configured, no custom certificate configured for the AG or OA
  • Operator creates a certificate for AG and exposes it to OneAgent pods
  • Operator creates AG and OA deployments
  • At this time, the cluster does not yet know about the AG deployed in Kubernetes, as it has not been started. The Operator creates the oneagent-connection-info ConfigMap with the list of endpoints (only the cluster at this time). This is passed to the OA.
  • Since the OA has a custom certificate specified (the one created by the Operator for the AG), it validates the certificate. The following error can be observed:
    2026-01-20 20:35:35.926 UTC [00004fbb] warning [comm  ] Certificate check failed with cainfo from { serverCAInfo: ["/opt/dynatrace/oneagent/agent/conf/ruxitserverfull.pem", "/var/lib/dynatrace/oneagent/agent/customkeys/custom.pem"], proxyCAInfo: [] }
  • The certificate check fails, as the certificate of the only endpoint (the Dynatrace cluster) is not issued by either of those two CA files.
  • After the AG connects, the Operator updates the ConfigMap with the connection info (so sooner than 15 minutes), but the OneAgent pods never restart on their own.
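A quick way to confirm this scenario is to grep the host-agent logs inside an affected OneAgent pod for the warning quoted above (pod name and log path are taken from earlier in this thread; your pod name will differ):

```shell
# List host-agent log files containing the certificate failure
kubectl -n dynatrace exec dynakube-oneagent-r2tvr -- /bin/sh -c \
  'grep -l "Certificate check failed" /mnt/volume_storage_mount/var_log/os/ruxitagent_host*.log'
```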

A recommended solution would probably be to delay the deployment of the OA until the AG comes online and the ConfigMap is updated. The Operator does delay it, but only until a communication endpoint exists; it does not check the SSL certificate. The following log message can be observed:
{"level":"info","ts":"2026-01-20T20:11:04.149Z","logger":"dynakube-oneagent","msg":"At least one communication host is provided, deploying OneAgent"}

If the OA had a connection to any AG with the default self-signed certificate, or to an AG with your custom certificate (added to trustedCA), this would not happen, as the OA would connect there first and reconnect afterwards to the AG managed by the Operator.

This will probably become another Product Idea: improve the Dynatrace Operator to check connectivity before rolling out the OneAgent when a custom certificate is generated or supplied via trustedCA.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner
