13 Feb 2025 12:51 PM - last edited on 06 May 2025 11:58 AM by Michal_Gebacki
I've added the Kubernetes Monitoring Statistics extension and have been viewing the supplied dashboard. I see that one of our clusters has a lot of failing queries to the /metrics and /metrics/cadvisor paths, with a status_reason of ConnectionTimeout and an access_type of DirectIp. How can I troubleshoot this further? I've run some curl commands from the k8s node where the ActiveGate pod is running, as well as from a netshoot container I added to the ActiveGate pod, and I get responses just fine.
The curl command I ran is below, where the cluster server endpoint is the Kubernetes control plane value reported by the kubectl cluster-info command.
curl -X GET https://<cluster server endpoint>:6443/metrics -H "Authorization: Bearer <token>"
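Since the failing queries have access_type DirectIp, I assume they go to each node's kubelet rather than the API server (the /metrics/cadvisor path is served by the kubelet), so I'm also thinking of repeating the test against the kubelet port directly. A rough sketch, assuming the default kubelet port 10250 and the standard service account token path; the node IP and the -k (skip certificate verification) flag are placeholders to adapt:
curl -k -X GET https://<node IP>:10250/metrics -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
curl -k -X GET https://<node IP>:10250/metrics/cadvisor -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"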
26 May 2025 09:18 AM
Hey @sivart_89 !
Sorry for the delay. Our Community gets a lot of activity every day, and unfortunately some posts sometimes don't get as much attention as they deserve. Did you find an answer to your question, or would you like me to seek further assistance for you?
29 May 2025 06:09 PM
No problem @IzabelaRokita. This is still occurring. I'm not sure what the impact is, if any, or how we could troubleshoot this further. If anyone has any ideas here, that would be great.
03 Jun 2025 07:58 AM - edited 03 Jun 2025 08:57 AM
Just chiming in to say that we're in the same boat: we're struggling to understand why we're getting this error and what its impact is.
Additionally, we're observing frequent OOM errors: the ActiveGate JVM runs out of heap shortly after timeouts accessing the /metrics/cadvisor endpoint, even though we have assigned limits of 1 CPU / 4Gi to the container (with StatefulSet replicas=3).
This is a typical snippet from our AG logs when things start to go south:
2025-06-02 08:06:01 UTC INFO [<xxxxxxx>] [HttpClientStatisticsSfmConsumerImpl] Query failed for endpoint /metrics/cadvisor on DirectIp with statusReason: ConnectionTimeout. [Suppressing further identical messages for 1 hour]
org.apache.http.conn.ConnectTimeoutException: Connect to 100.87.9.37:10250 [/100.87.9.37] failed: Connect timed out
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
2025-06-02 08:06:03 UTC WARNING [<xxxxxxx>] [<collector.core>, CollectorSelfMonitoringUpdater] 'Kubernetes Monitoring - Pipeline Thread' is exhausted (active threads = 20, max threads = 20), queue size = 324 [Suppressing further messages for 5 minutes]
Then, shortly after that, we get OOMs like these:
2025-06-02 08:09:09 UTC SEVERE [<xxxxxxxx>] [CommunicationServletImpl] Failed to handle request from https://10.95.43.221:42160, X-Client-Host=some-app-xyz, User-Agent=ruxit/1.309.66.20250401-150134 0xb6542d03b5a7dfe4 xxxxxxxx, content-length=101059, content-type=application/bin, host=10.96.247.81 - POST /communication | Reason: Java heap space
java.lang.OutOfMemoryError: Java heap space
Not 100% sure if the OOMs are directly related to the timeouts against /metrics/cadvisor, but it does look somewhat suspicious.
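One thing that might help narrow it down is checking whether Kubernetes itself is also OOM-killing the container, as opposed to only the JVM hitting its internal heap limit. A rough sketch; the namespace and pod names are placeholders for our setup:
kubectl -n dynatrace get pods
# restart counts > 0 on the ActiveGate pods would hint at container-level kills
kubectl -n dynatrace describe pod <activegate-pod> | grep -A 5 "Last State"
# "Reason: OOMKilled" here would mean the 4Gi container limit was hit, not just the Java heap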
We're a bit hesitant to increase the memory limit further, as this might make things even worse due to increased garbage collection time. Also, I believe horizontal scaling does not help with the Kubernetes monitoring capability, as only one AG in the StatefulSet seems to query the cluster API at any given time (not 100% sure if this is true; it's just an assumption based on the AG logs and the big differences in resource consumption observed across the StatefulSet replicas).
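One way to sanity-check that assumption would be to compare the replicas directly (requires metrics-server; the namespace, pod names, and the ActiveGate log path are placeholders, as the log location may differ per version):
kubectl -n dynatrace top pod
# if only one replica shows high CPU/memory, that supports the "single active k8s monitor" idea
kubectl -n dynatrace exec <activegate-pod> -- grep -c "Kubernetes Monitoring" <path to ActiveGate log file>
# counting Kubernetes monitoring log lines per replica should show which pod actually runs the queries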
I would be very interested to hear from other users who have successfully dealt with this.