This happened right after the agent update. The installer log showed the following error, indicating a timeout when OneAgent tried to connect to all of the cluster members:
DynatraceOneAgent failed to connect to Dynatrace
Cluster Node https://.........
I ran "wget https://Managed-cluster:8443/communication" on the agent node and saw "connected", along with a certificate mismatch error, which is expected. I also ran "netstat -an | grep 8443" on both the agent node and the cluster node and saw "established" connections. I ran tcpdump on the cluster and could see traffic flowing in both directions. I restarted both OneAgent and the cluster, but it is still not working. What else can we check to get OneAgent connecting to the Managed cluster again without timing out?
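The checks above (wget, netstat, tcpdump) mostly confirm the TCP layer. A minimal sketch, assuming Python 3 is available on the agent node, that reports how far a connection gets: TCP connect, TLS handshake, or an HTTP response. The function name `probe` and the stage labels are made up for illustration; the host, port, and path are the ones from the wget command. This is a generic reachability check, not how OneAgent itself talks to the cluster.

```python
import socket
import ssl


def probe(host, port, path="/communication", timeout=5.0):
    """Return the last stage that succeeded: 'none', 'tcp', 'tls', or 'http'."""
    stage = "none"
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            stage = "tcp"
            # Certificate checks are disabled on purpose: a certificate
            # mismatch is expected here, and only reachability is tested.
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                stage = "tls"
                request = (
                    f"GET {path} HTTP/1.1\r\n"
                    f"Host: {host}\r\n"
                    "Connection: close\r\n\r\n"
                )
                tls.sendall(request.encode("ascii"))
                if tls.recv(1024).startswith(b"HTTP/"):
                    stage = "http"
    except OSError:
        # Timeouts, refused connections, and TLS errors are all OSError subclasses.
        pass
    return stage


if __name__ == "__main__":
    print(probe("Managed-cluster", 8443))
```

If this returns "tcp" but never reaches "tls" or "http", the sessions can show as established in netstat while the failure sits above the network layer.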
Did you try uninstalling and then reinstalling the agent?
I did not try uninstalling the agent. Agent updates are set to automatic. The "established" sessions are there, so communication seems fine at the network layer. The "failed to connect" error may be caused by a failure at the application layer. We would rather find out what caused the failure than uninstall and reinstall the agent; otherwise we may have to repeat this on future agent updates. Is there any way to turn on debug mode on both sides to get more information?
You will have to open a support ticket. Debug options are turned on by the Dynatrace DevOps team.
Is there a proxy involved in your environment?
Please keep in mind that there are multiple connections from a host to the cluster. The Linux OneAgent has 2 or 3 connections, each representing one module. In addition, each deeply instrumented application has its own connection. So the communication you see may belong to deeply instrumented processes. (You will have to restart the monitored processes to update the instrumentation.)
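To see which processes own the established connections, one option is to group the output of "netstat -anp" by its PID/Program column. A minimal sketch, assuming the usual net-tools column layout on Linux; the function name `connections_by_process` is made up for illustration, and netstat normally needs root to show owners of other users' processes.

```python
from collections import Counter


def connections_by_process(netstat_output, port=8443):
    """Count ESTABLISHED TCP connections to `port`, grouped by owning process."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        foreign, state, owner = fields[4], fields[5], fields[6]
        if state == "ESTABLISHED" and foreign.rsplit(":", 1)[-1] == str(port):
            counts[owner] += 1
    return counts


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(
        ["netstat", "-anp"], capture_output=True, text=True
    ).stdout
    for owner, count in connections_by_process(output).most_common():
        print(count, owner)
```

If every connection on port 8443 belongs to application processes and none to the agent's own modules, that would fit the explanation above.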
If you restarted OneAgent, check the configuration the agent is starting with: look at the beginning of the logs/os/ruxitagent_host_* file.
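A minimal sketch for pulling the start of the most recent host log, assuming Python 3 is available. The function name `newest_log_head` and the 40-line default are made up for illustration; the directory is passed in by the caller because the agent's install location is not given here.

```python
import glob
import os


def newest_log_head(log_dir, pattern="ruxitagent_host_*", lines=40):
    """Return (path, first `lines` lines) of the most recently modified match."""
    candidates = glob.glob(os.path.join(log_dir, pattern))
    if not candidates:
        return None, []
    newest = max(candidates, key=os.path.getmtime)
    with open(newest, errors="replace") as handle:
        # zip stops after `lines` items, so the rest of the file is not read.
        head = [line.rstrip("\n") for _, line in zip(range(lines), handle)]
    return newest, head


if __name__ == "__main__":
    # "logs/os" is relative to the agent directory, as in the path above.
    path, head = newest_log_head("logs/os")
    print(path)
    print("\n".join(head))
```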
We are not using a proxy. We have this problem on all hosts running Red Hat Enterprise Linux Server release 6.9 (Santiago) (kernel 2.6.32-696.16.1.el6.x86_64).
I don't know how many instrumented processes you have, but if possible, I'd recommend restarting all of them or rebooting the system completely.