ConnectionTimeout for /metrics and /metrics/cadvisor in Kubernetes

sivart_89
Mentor

I've added the Kubernetes Monitoring Statistics extension and have been viewing the supplied dashboard. I see that one of our clusters has a lot of failing queries to the paths /metrics and /metrics/cadvisor, with a status_reason of ConnectionTimeout and an access_type of DirectIp. How can I troubleshoot this further? I've been running some curl commands from the k8s node where the ActiveGate pod is scheduled, as well as from a netshoot container I've added to the ActiveGate pod, and I'm getting responses just fine.

The curl command I ran is below, where the cluster server endpoint is the k8s control plane value reported by kubectl cluster-info.

curl -X GET https://<cluster server endpoint>:6443/metrics -H "Authorization: Bearer <token>"
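
For what it's worth, the failing DirectIp queries appear to target the kubelet on each node (port 10250 by default) rather than the API server on 6443, so it may be worth exercising that path explicitly. A rough check from the netshoot container could look like the one below; <node-ip> is a placeholder, and the default token mount path and kubelet port are assumptions for this cluster. Even a 401/403 response here would rule out a network-level timeout.

# Read the mounted service account token and query the kubelet directly.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sk --connect-timeout 5 -H "Authorization: Bearer ${TOKEN}" "https://<node-ip>:10250/metrics/cadvisor" | head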



IzabelaRokita
Community Team

Hey @sivart_89 !
Sorry for the delay. Our Community gets lots of activity every day, and unfortunately, sometimes, some posts don't get as much attention as they deserve. Did you find an answer to your question, or would you like me to seek further assistance for you? 

No problem @IzabelaRokita. This is still occurring. I wasn't sure what the impact is here, if any, or how we could troubleshoot it further. If anyone has any ideas, that would be great.


Enrico_F
DynaMight Pro

Just chiming in to say we're in the same boat: we're also struggling to understand why we're getting that error and what its impact is.

Additionally, we're observing frequent OOM errors: the ActiveGate JVM runs out of heap shortly after timeouts accessing the /metrics/cadvisor endpoint, even though we have assigned limits of 1 CPU / 4Gi to the container (with StatefulSet replicas=3).

This is a typical snippet from our AG logs when things start to go south:

2025-06-02 08:06:01 UTC INFO    [<xxxxxxx>] [HttpClientStatisticsSfmConsumerImpl] Query failed for endpoint /metrics/cadvisor on DirectIp with statusReason: ConnectionTimeout. [Suppressing further identical messages for 1 hour]
org.apache.http.conn.ConnectTimeoutException: Connect to 100.87.9.37:10250 [/100.87.9.37] failed: Connect timed out
  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)

2025-06-02 08:06:03 UTC WARNING [<xxxxxxx>] [<collector.core>, CollectorSelfMonitoringUpdater] 'Kubernetes Monitoring - Pipeline Thread' is exhausted (active threads = 20, max threads = 20), queue size = 324 [Suppressing further messages for 5 minutes]

then shortly after that we get OOMs like these:

2025-06-02 08:09:09 UTC SEVERE [<xxxxxxxx>] [CommunicationServletImpl] Failed to handle request from https://10.95.43.221:42160, X-Client-Host=some-app-xyz, User-Agent=ruxit/1.309.66.20250401-150134 0xb6542d03b5a7dfe4 xxxxxxxx, content-length=101059, content-type=application/bin, host=10.96.247.81 - POST /communication | Reason: Java heap space
java.lang.OutOfMemoryError: Java heap space

Not 100% sure if the OOMs are directly related to the timeouts against /metrics/cadvisor, but it does look somewhat suspicious.

We're a bit unsure about increasing the memory limit further, as this might make things even worse due to increased garbage collection time... Also, I believe horizontal scaling does not help with the Kubernetes monitoring capability, as only one AG in the StatefulSet seems to query the cluster API at any given time (not 100% sure if this is true; just an assumption based on AG logs and the big differences in resource consumption observed across the StatefulSet replicas).
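
For what it's worth, the per-replica skew can be eyeballed with something like the command below (requires metrics-server; the namespace and label selector are assumptions, so adjust them to whatever your ActiveGate StatefulSet pods actually carry):

# Compare CPU/memory usage across the ActiveGate replicas.
kubectl -n dynatrace top pods -l app.kubernetes.io/name=activegate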

I would be very interested to hear from other users who have successfully dealt with this.

Enrico_F
DynaMight Pro

After some analysis it turns out that, in our case, these timeouts are due to restricted connectivity on hardened k8s clusters. Specifically, all direct connection attempts from the AG pods to port 10250 (the default kubelet API port) on master and infrastructure nodes are blocked by default.
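
In case it helps anyone confirm the same thing: a crude reachability check from inside the ActiveGate pod looked roughly like the one below. The pod name and node IP are placeholders, and it assumes bash is available in the container (otherwise run it from a netshoot/debug sidecar instead).

# Probe the kubelet port on a master node; a timeout here (but not against
# worker nodes) points at node-level firewalling rather than the ActiveGate.
kubectl -n dynatrace exec <activegate-pod> -- bash -c 'timeout 5 bash -c "</dev/tcp/<master-node-ip>/10250" && echo reachable || echo blocked'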

Unfortunately there is currently no option to restrict these requests to only a subset of cluster nodes, e.g. nodes with a specific label.

To improve this I've raised the idea Restrict AG kubelet API monitoring to specific nodes, as suggested to me by Dynatrace support.

gopher
Champion

@sivart_89 @Enrico_F,
Based on the above, I'm assuming you're using the containerized ActiveGates.

Here's the fun part about containerized ActiveGates: they don't load balance. Additional replicas are just there in case of failover, and to chew through your budget.

So even if you have 3 ActiveGates, only the primary will be doing all the work. I've raised RFEs around this and given up.
Why are you getting OOMs? Basically, the more Prometheus-level data you scrape (including /metrics and /metrics/cadvisor), the bigger the memory footprint on the containerized ActiveGates; it is not uncommon to need somewhere between 8Gi and 16Gi. Prometheus servers should generally get at least 4Gi, and then you need to add the requirements for the other functions.

Solutions for the OOMs and for getting the metrics (rough sketches for options 1 and 3 below):

1. Increase the memory limit to more than 8Gi.
2. Have multiple DynaKubes split by function, e.g. one for Kubernetes monitoring and one for agent traffic. This helps you focus resource utilisation on the different AG containers.
3. Implement an OTel collector for the Prometheus-scraping components (anything outside of standard AG Kubernetes monitoring).
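
For option 1, assuming you deploy via the Dynatrace Operator with a DynaKube custom resource, the bump would look roughly like this (the DynaKube name, namespace and the actual numbers are placeholders; double-check the field names against your CRD version):

# Raise the ActiveGate container resources on the DynaKube; values are examples only.
kubectl -n dynatrace patch dynakube <dynakube-name> --type merge -p '
spec:
  activeGate:
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 8Gi
'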

Personally, I would go down the path of option 3. It will significantly reduce the requirements on the containerized ActiveGates, resolve the OOMs and increase stability, and it should also address the above issue if configured correctly. You can also scale OTel and do metric sharding on the Prometheus scraping for load balancing if required.
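
To give a feel for option 3, a minimal collector configuration could look something like the sketch below. This is only a sketch: the Dynatrace endpoint and token, the node-based scrape job and the relaxed TLS settings are assumptions you would adapt, and the collector's service account still needs RBAC for node discovery and kubelet access.

# Sketch: scrape the kubelet /metrics/cadvisor endpoints via Prometheus node
# discovery and forward the metrics to Dynatrace over OTLP/HTTP.
cat <<'EOF' > otel-collector-config.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cadvisor
          scheme: https
          tls_config:
            insecure_skip_verify: true
          authorization:
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          metrics_path: /metrics/cadvisor
exporters:
  otlphttp:
    endpoint: https://<your-environment-id>.live.dynatrace.com/api/v2/otlp
    headers:
      Authorization: "Api-Token <ingest-token>"
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]
EOF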

Anyway, have fun.
