08 Oct 2025 06:58 AM
We have the Dynatrace Operator deployed on our large Kubernetes platform.
Due to issues with high resource consumption, we split our DynaKube configuration into two: one for Kubernetes API metrics and the other for service-level monitoring metrics and traces.
On the Kubernetes ActiveGates, even though we have 4 replicas, we see that only one is actually fetching metrics from the local Kubernetes API endpoint instead of all 4 balancing the load. This absence of load balancing is causing a resource crunch on that one replica, and the data on the dashboard is not consistent.
Is there a way we can properly load balance across all replicas?
Is vertical scaling the only option in this case? If so, how do we do an optimal calculation?
13 Oct 2025 12:09 AM
Agreed, they don't load balance; the additional replicas are just there in case of failover, and to chew through your budget.
Only agent traffic is somewhat load balanced.
So even if you have 3 ActiveGates, only the primary will be doing all the work. I've raised RFEs around this and given up.
This also leads to consistently getting OOMs. Basically, the more metric/Prometheus-level data you scrape (including /metrics and /cadvisor), the bigger the memory footprint of the containerized ActiveGates; it is not uncommon to need somewhere between 8Gi and 16Gi.
There are a couple of solutions to the OOMs and to the metrics load balancing:
1. Have multiple DynaKubes split by function, e.g. one for Kubernetes monitoring and another for agent traffic. This helps you direct resource utilisation to the separate ActiveGate containers, but it is complex and adds tech debt to maintain (see the sketch after this list).
2. Implement OTel Collector(s) for the metric and Prometheus scraping components (anything outside of standard ActiveGate Kubernetes monitoring).
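For option 1, the split could look roughly like the below. This is only a sketch: the field names follow the v1beta1 DynaKube CRD, and the tenant URL, capability split, replica counts and sizing are all placeholders to adjust to your environment.

```yaml
# DynaKube 1: Kubernetes API monitoring only
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: k8s-monitoring
  namespace: dynatrace
spec:
  apiUrl: https://<tenant>.live.dynatrace.com/api
  activeGate:
    capabilities:
      - kubernetes-monitoring
    replicas: 2
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi
---
# DynaKube 2: agent/routing traffic only
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: agent-traffic
  namespace: dynatrace
spec:
  apiUrl: https://<tenant>.live.dynatrace.com/api
  activeGate:
    capabilities:
      - routing
    replicas: 3
```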
Personally, I have gone down the route of OTel Collectors and this works without issue. Collectors can also use an HPA for scaling as required. You can go down the tracing path as well if you like.
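For reference, a minimal collector config for that setup might look like the below: a Prometheus receiver scraping annotated pods, exporting to the Dynatrace OTLP metrics endpoint. The tenant URL, token and scrape selection are placeholders; tune the memory_limiter to your sizing.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # keep only pods annotated prometheus.io/scrape: "true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

processors:
  # memory_limiter first in the pipeline, so the collector sheds load instead of OOMing
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch: {}

exporters:
  otlphttp:
    endpoint: https://<tenant>.live.dynatrace.com/api/v2/otlp
    headers:
      Authorization: "Api-Token <your-token>"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```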
Overall, this will significantly reduce the requirements on the containerized ActiveGates, resolve the OOMs, increase stability, and should also address the above issue if configured correctly. If you do use an HPA for scaling, you can implement metric sharding in the Prometheus scraping, which is the supported approach for load balancing metrics.
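The sharding itself is the standard Prometheus hashmod relabel pattern, roughly as below. Assumption: each collector replica is given its own shard index (e.g. derived from a StatefulSet pod ordinal); the modulus and index here are hardcoded placeholders. Note that a plain HPA changes the replica count without re-sharding, which is the problem the OpenTelemetry Operator's target allocator solves by assigning targets to collectors dynamically.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # hash every target address into one of N buckets...
            - source_labels: [__address__]
              modulus: 4            # total number of collector shards
              target_label: __tmp_hash
              action: hashmod
            # ...and keep only the bucket owned by this replica
            - source_labels: [__tmp_hash]
              regex: "0"            # this replica's shard index
              action: keep
```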
Anyway, have fun.