13 Feb 2025 12:51 PM - last edited on 06 May 2025 11:58 AM by Michal_Gebacki
I've added the Kubernetes Monitoring Statistics extension and have been viewing the supplied dashboard. I see that one of our clusters has a lot of failing queries to the /metrics and /metrics/cadvisor paths, with a status_reason of ConnectionTimeout and an access_type of DirectIp. How can I troubleshoot this further? I've run some curl commands from the k8s node where the ActiveGate pod is running, as well as from a netshoot container I added to the ActiveGate pod, and I get responses just fine.
The curl command I ran is below, where the cluster server endpoint is the Kubernetes control plane value reported by the kubectl cluster-info command.
curl -X GET https://<cluster server endpoint>:6443/metrics -H "Authorization: Bearer <token>"
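Since the failing queries have access_type DirectIp, I assume they go to each node's kubelet rather than the API server (the /metrics/cadvisor path is served by the kubelet), so I'm also thinking of repeating the test against the kubelet port directly. A rough sketch, assuming the default kubelet port 10250 and the standard service account token path; the node IP and the -k (skip certificate verification) flag are placeholders to adapt:
curl -k -X GET https://<node IP>:10250/metrics -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
curl -k -X GET https://<node IP>:10250/metrics/cadvisor -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"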
26 May 2025 09:18 AM
Hey @sivart_89 !
Sorry for the delay. Our Community gets a lot of activity every day, and unfortunately some posts sometimes don't get as much attention as they deserve. Did you find an answer to your question, or would you like me to seek further assistance for you?
29 May 2025 06:09 PM
No problem @IzabelaRokita. This is still occurring. I'm not sure what the impact is, if any, or how we could troubleshoot this further. If anyone has any ideas here, that would be great.
03 Jun 2025 07:58 AM - edited 03 Jun 2025 08:57 AM
Just chiming in to say that we're in the same boat: we're struggling to understand why we're getting this error and what its impact is.
Additionally, we're observing frequent OOM errors: the ActiveGate JVM runs out of heap shortly after timeouts accessing the /metrics/cadvisor endpoint, even though we have assigned limits of 1 CPU / 4Gi to the container (with StatefulSet replicas=3).
This is a typical snippet from our AG logs when things start to go south:
2025-06-02 08:06:01 UTC INFO [<xxxxxxx>] [HttpClientStatisticsSfmConsumerImpl] Query failed for endpoint /metrics/cadvisor on DirectIp with statusReason: ConnectionTimeout. [Suppressing further identical messages for 1 hour]
org.apache.http.conn.ConnectTimeoutException: Connect to 100.87.9.37:10250 [/100.87.9.37] failed: Connect timed out
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
2025-06-02 08:06:03 UTC WARNING [<xxxxxxx>] [<collector.core>, CollectorSelfMonitoringUpdater] 'Kubernetes Monitoring - Pipeline Thread' is exhausted (active threads = 20, max threads = 20), queue size = 324 [Suppressing further messages for 5 minutes]
Then, shortly after that, we get OOMs like these:
2025-06-02 08:09:09 UTC SEVERE [<xxxxxxxx>] [CommunicationServletImpl] Failed to handle request from https://10.95.43.221:42160, X-Client-Host=some-app-xyz, User-Agent=ruxit/1.309.66.20250401-150134 0xb6542d03b5a7dfe4 xxxxxxxx, content-length=101059, content-type=application/bin, host=10.96.247.81 - POST /communication | Reason: Java heap space
java.lang.OutOfMemoryError: Java heap space
Not 100% sure if the OOMs are directly related to the timeouts against /metrics/cadvisor, but it does look somewhat suspicious.
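One thing that might help narrow it down is checking whether Kubernetes itself is also OOM-killing the container, as opposed to only the JVM hitting its internal heap limit. A rough sketch; the namespace and pod names are placeholders for our setup:
kubectl -n dynatrace get pods
# restart counts > 0 on the ActiveGate pods would hint at container-level kills
kubectl -n dynatrace describe pod <activegate-pod> | grep -A 5 "Last State"
# "Reason: OOMKilled" here would mean the 4Gi container limit was hit, not just the Java heap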
We're a bit hesitant to increase the memory limit further, as this might make things even worse due to increased garbage collection time. Also, I believe horizontal scaling does not help with the Kubernetes monitoring capability, as only one AG in the StatefulSet seems to query the cluster API at any given time (not 100% sure if this is true; it's just an assumption based on the AG logs and the big differences in resource consumption observed across the StatefulSet replicas).
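One way to sanity-check that assumption would be to compare the replicas directly (requires metrics-server; the namespace, pod names, and the ActiveGate log path are placeholders, as the log location may differ per version):
kubectl -n dynatrace top pod
# if only one replica shows high CPU/memory, that supports the "single active k8s monitor" idea
kubectl -n dynatrace exec <activegate-pod> -- grep -c "Kubernetes Monitoring" <path to ActiveGate log file>
# counting Kubernetes monitoring log lines per replica should show which pod actually runs the queries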
I would be very interested to hear from other users who have successfully dealt with this.