This product reached the end of support date on March 31, 2021.

How do you monitor Agent/collector connection issues?


Kind of a peripheral question but I'm looking for some ideas on monitoring Agent/collector connection issues.
We had a case where the server team (a separate area of the company) applied a patch and restarted the server over the weekend. The agents did not reconnect to the collector, and nobody noticed until Monday morning, which left a large gap in our Dynatrace data. Has anyone set up any sort of monitoring of the monitor and, if so, how?


Hi Eric,

The only option I can think of is to create an incident on the Dynatrace Self-Monitoring system profile. There you can use the measures "Number of Agents" or "Number of Connected Agents" and set a threshold for the number of agents you expect to be connected. You can give the incident a larger evaluation window so that it fires on sustained drops in the agent count (for example, if the count stays 20 below the expected number for an hour, you can assume there might be an issue).
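
To make the threshold-plus-window idea concrete, here is a minimal Python sketch of the rule outside of Dynatrace. It is not how the incident engine is implemented; the function name, the parameters, and the (timestamp, count) sample format are all assumptions for illustration, and you would have to feed it agent counts exported by whatever means you have available.

```python
from datetime import datetime, timedelta


def sustained_agent_drop(samples, expected, drop_threshold, window):
    """Return True if the agent count stayed low for the whole trailing window.

    samples        -- list of (datetime, connected_agent_count) tuples (assumed format)
    expected       -- number of agents you expect to be connected
    drop_threshold -- how many agents may be missing before it counts as a drop
    window         -- timedelta the drop must last, e.g. timedelta(hours=1)
    """
    if not samples:
        return False

    latest = max(ts for ts, _ in samples)
    earliest = min(ts for ts, _ in samples)
    cutoff = latest - window

    # Without samples covering the full window we cannot call the drop sustained.
    if earliest > cutoff:
        return False

    recent = [count for ts, count in samples if ts >= cutoff]
    # Every sample in the window must be at least drop_threshold below expected.
    return all(expected - count >= drop_threshold for count in recent)
```

For example, with 100 expected agents, a threshold of 20 and a one-hour window, a run of samples at 75 agents across the whole hour returns True, while a single sample at 95 inside that hour returns False.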

Additionally, when planned work takes longer than 1 hour (the largest incident window) and the agents are expected to be unavailable, you can schedule a downtime for the incident so that it does not fire.

Finally, the drawback of this option is that you won't know which specific agents are missing, since you're only looking at overall numbers.

Other than this, I can't think of any other option. There are OOTB incidents for when an agent (process) loses connectivity unexpectedly, but there are none for when an agent cannot or will not connect in the first place; that information ends up in the agent logs. However, monitoring all agent logs for signs of connectivity failures can be expensive, especially in large environments.
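
If you do want to try the log route on a handful of critical hosts, a rough Python sketch could look like the one below. The search patterns, the `*.log` glob and the function names are placeholders I made up, not actual Dynatrace agent log messages or locations, so check your own agent logs for the real wording before relying on it.

```python
import re
from pathlib import Path

# Hypothetical patterns: replace with the messages your agent logs really contain.
FAILURE_PATTERNS = [
    r"connection refused",
    r"unable to connect",
    r"connect(ion)? .*failed",
]


def scan_lines(lines, patterns=FAILURE_PATTERNS):
    """Return the lines that match any of the failure patterns (case-insensitive)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [
        line.rstrip("\n")
        for line in lines
        if any(rx.search(line) for rx in compiled)
    ]


def scan_log_dir(directory, glob="*.log", patterns=FAILURE_PATTERNS):
    """Scan every matching file in a directory; return {file name: matching lines}."""
    hits = {}
    for path in sorted(Path(directory).glob(glob)):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as handle:
            matched = scan_lines(handle, patterns)
        if matched:
            hits[path.name] = matched
    return hits
```

Running something like this from a scheduled job on a few important servers keeps the cost down compared with watching every agent log, and unlike the agent-count incident it tells you which host is affected.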

Hope this helps.



Thanks Radu! That's exactly what I was looking for!