08 Oct 2025 06:58 AM - last edited on 08 Oct 2025 07:57 AM by MaciejNeumann
We have the Dynatrace Operator deployed on our large Kubernetes platform.
Due to issues with high resource consumption, we split our DynaKube configuration into two: one for Kubernetes API metrics and the other for service-level monitoring metrics and traces.
On the Kubernetes ActiveGates, even though we have 4 replicas, we see that only one is actually fetching metrics from the local Kubernetes API endpoint instead of all 4 balancing the load. This absence of load balancing is causing a resource crunch on that one replica, and the data on the dashboard is not consistent.
Is there a way we can properly load balance across all replicas?
Is vertical scaling the only option in this case? If so, how do we calculate an optimal size?
13 Oct 2025 12:09 AM
Agree, they don't load balance; the additional replicas are just there in case of failover, and to chew through your budget.
Only agent traffic is somewhat load balanced.
So even if you have 3 ActiveGates, only the primary will be doing all the work. I've raised RFEs around this and given up.
This also leads to consistent OOMs. Basically, the more metric/Prometheus-level data you scrape (including /metrics and /cadvisor), the bigger the memory footprint on the containerized ActiveGates; it is not uncommon to need somewhere between 8Gi and 16Gi.
There are a couple of solutions to the OOMs and to getting the metrics load balanced:
1. Have multiple DynaKubes split by function, e.g. one for Kubernetes monitoring and one for agent traffic. This lets you target resource utilisation at the different ActiveGate containers, but it is complex and more tech debt to maintain (see the sketch after this list).
2. Implement OTel collector(s) for the metric and Prometheus scraping components (anything outside of standard ActiveGate Kubernetes monitoring).
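For option 1, a minimal sketch of what the split might look like, assuming the dynatrace-operator DynaKube CRD (v1beta1-style fields); the tenant URL, names, and replica counts are placeholders, not recommendations:

```yaml
# Sketch only: two DynaKubes split by function so each AG pool has one job.
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: k8s-monitoring
  namespace: dynatrace
spec:
  apiUrl: https://<tenant>.live.dynatrace.com/api
  activeGate:
    capabilities:
      - kubernetes-monitoring   # AGs dedicated to the Kubernetes API workload
    replicas: 2
---
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: agent-traffic
  namespace: dynatrace
spec:
  apiUrl: https://<tenant>.live.dynatrace.com/api
  activeGate:
    capabilities:
      - routing                 # AGs dedicated to OneAgent traffic routing
    replicas: 3
```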
Personally, I have gone down the route of OTel Collectors and this works without issue. Collectors can also use an HPA for scaling as required. You can go down the tracing path as well if you like.
Overall, this will significantly reduce the requirements on the containerized ActiveGates, resolve the OOMs, increase stability, and should also address the above issue if configured correctly. If you do use an HPA for scaling, you can implement metric sharding on the Prometheus scraping, which is the supported approach for load balancing of metrics (a sketch follows).
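A minimal sketch of what sharded Prometheus scraping could look like in a collector config, assuming the collector runs as a StatefulSet and each pod gets its ordinal injected as a SHARD env var; the job name, shard count, and export endpoint are assumptions:

```yaml
# Sketch only: distribute scrape targets across collector replicas via hashmod.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods        # hypothetical scrape job
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Hash every target, then keep only those whose hash matches this
            # replica's shard number (the StatefulSet pod ordinal, as SHARD).
            - source_labels: [__address__]
              modulus: 3                   # total number of collector replicas
              target_label: __tmp_hash
              action: hashmod
            - source_labels: [__tmp_hash]
              regex: ${env:SHARD}
              action: keep
processors:
  batch: {}
exporters:
  otlphttp:
    # Assumption: exporting to the Dynatrace OTLP ingest endpoint.
    endpoint: https://<tenant>.live.dynatrace.com/api/v2/otlp
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp]
```

One caveat with plain hashmod: the modulus is static in the config, so an HPA scaling past it leaves extra replicas idle; the OpenTelemetry Target Allocator is the managed way to redistribute targets as replicas come and go.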
Anyway, have fun.
07 Nov 2025 05:48 PM - edited 07 Nov 2025 05:49 PM
We have already split our DynaKubes by function.
Our clusters are just so big, and hence resource hungry.
We are running 2 replicas now for Kubernetes metrics, and of course only one of them is serving the purpose since there is no load balancing. However, the memory is just not enough no matter how much we scale vertically: we are currently at 24 GB memory limits and still see data getting purged. Our clusters are also very big, i.e., 25k pods in a cluster. We are now at a point where we can't go up on memory anymore, because it would put other pods on our infra nodes at risk by causing scheduling issues.
If load balancing is not possible, the only thing that can save us is standalone VM-based ActiveGates with all the compute we need on them.
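For reference, the vertical scaling described above is set on the ActiveGate section of the DynaKube spec; a minimal fragment, with values only mirroring the numbers mentioned here, not a sizing recommendation:

```yaml
# Fragment of a DynaKube spec: where AG memory requests/limits are configured.
spec:
  activeGate:
    capabilities:
      - kubernetes-monitoring
    replicas: 2
    resources:
      requests:
        memory: 16Gi        # illustrative request
      limits:
        memory: 24Gi        # the ceiling described above
```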
07 Nov 2025 09:12 PM
Agree, this is mandatory, I have the same issue.
11 Nov 2025 05:57 AM
Hi
Thanks for sharing the request. We are aware of the situation and are looking into ways to solve it with horizontal scale. The proposed solution of deploying separate DynaKubes to split the responsibilities of the AGs is the right approach and future-proof. This allows you to scale traffic routing horizontally, which typically causes the majority of the load.
We're working on a solution to offload Prometheus scraping from the AG to further reduce the load and provide a nicely horizontally scalable solution.
Timelines aren't clear yet, though.