We have an environment where two agents are connected to the collector/server but are running without instrumentation. I believe the relevant part of the log file is below (it is similar for both agents):
2015-10-29 03:36:20 [991f0221] info [native] Trying to connect to Server/Collector for up to 19 seconds
2015-10-29 03:36:41 [991f0221] severe [native] Exception while connecting to collector, info:<connect()/apr_socket_connect(), 70007, Connection timed out>
2015-10-29 03:36:41 [991f0221] warning [native] Unable to register with Server/Collector <server>:9998, CONTINUING WITHOUT INSTRUMENTATION.
2015-10-29 03:36:41 [a954f221] info [native] Instrumentation channel connected successfully
2015-10-29 03:36:41 [a954f221] info [native] Connected to Server/Collector <server>:9998
Does this look like an issue that could be solved by increasing a timeout setting? And does anyone have experience with what could be causing this?
This problem just popped up, so it may be resolved at least temporarily when the app is recycled, but I would like to fix the underlying cause to prevent it from recurring.
I assume you have checked that port 9998 is open all the way from your agent to the server, right? If it is open and the connection still takes more than 19 seconds, how many miles or kilometers are there between your agent and your collector? Is your collector very heavily loaded?
Note that the "Unable to register" message appears about 21 seconds after "Trying to connect", just past the 19-second limit stated in the log. So this looks like a simple connect timeout.
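To double-check the timing, the gap between the two log entries can be computed directly from the timestamps in the posted log. This is only a small sketch; the helper name elapsed_seconds is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used at the start of each agent log line.
FMT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# "Trying to connect" vs. "Unable to register" from the log above.
gap = elapsed_seconds("2015-10-29 03:36:20", "2015-10-29 03:36:41")
print(gap)  # 21.0
```

The result is slightly above the 19 seconds the agent says it will wait, which is what you would expect if the connect attempt simply ran out of time.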
Increasing the timeout is an option, but I'm concerned about the distance issue mentioned by David. As a test, you could add wait=30 to the -agentpath parameter. It's just another comma-delimited token on the end; don't forget the comma. But even if this solves it, I would still suspect that your collector is too far away.
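To make the "another token on the end" part concrete, here is a rough sketch of what the resulting string would look like. Only wait=30 and port 9998 come from this thread; the library path, the name= token, and the host name are placeholders I am assuming for illustration, so substitute whatever your existing -agentpath already contains.

```python
def append_agent_option(agentpath: str, key: str, value) -> str:
    """Append key=value as one more comma-delimited token on the end."""
    # Strip any trailing comma first so we never emit an empty token.
    return f"{agentpath.rstrip(',')},{key}={value}"


# Placeholder values (assumptions): path, agent name, and collector host.
current = "-agentpath:/path/to/agent/library=name=MyAgent,server=collector.example.com:9998"

print(append_agent_option(current, "wait", 30))
# -agentpath:/path/to/agent/library=name=MyAgent,server=collector.example.com:9998,wait=30
```

The JVM has to be restarted for the changed parameter to take effect.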
As @David and @Joseph mentioned, you need to verify whether the latency between the Agent and the Collector is actually low. The link between these two components has to be a low-latency one. This generally boils down to having the agent and collector on the same LAN in the same data center.
What does <server> stand for in the log you posted? Is this the hostname/IP address of the Dynatrace Server or of the Collector? Even though Dynatrace settings such as the agent string refer to it as "Server", this must point to the Dynatrace Collector. (The "Server" naming is a legacy artifact from the time before the Collector existed.) You can prevent unintended Agent-to-Server connections by disabling the Embedded Collector in Settings > dynaTrace Server > Services > General.
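One rough way to check this from the agent host is to time a plain TCP connect to the collector port. The sketch below is not a Dynatrace tool; the function names are made up for illustration, and the 19-second default simply mirrors the limit shown in the log.

```python
import socket
import time


def connect_time(host: str, port: int, timeout: float = 19.0) -> float:
    """Return the seconds needed to open a TCP connection to host:port.

    Raises OSError (including timeouts) if the connection cannot be made.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


def sample_connect_times(host: str, port: int, attempts: int = 5,
                         timeout: float = 19.0) -> list:
    """Try several connects; each entry is a duration in seconds or None on failure."""
    results = []
    for _ in range(attempts):
        try:
            results.append(connect_time(host, port, timeout))
        except OSError:
            # Refused, unreachable, or timed out.
            results.append(None)
    return results
```

Calling sample_connect_times("<server>", 9998) from the agent machine, with <server> replaced by your collector address, gives a list of connect durations. On a same-LAN link these should be a few milliseconds; entries of several seconds, or None, would point to a network or firewall problem rather than to the agent itself.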
Yes, the address is pointing to the collector. This environment has been up for a while and I'm relatively new to it. Based on what I'm hearing, the distance between the collector and agent is the most likely culprit. It's a production environment, so there's not much room for trying out solutions. I think the agents are reconnecting nightly and have been for some time, but I don't see any missing data in the past, so I wanted to check whether there could be any other explanation for why they didn't connect in time at startup. Thank you for your help. I'll keep watching.