05 Mar 2025 05:46 PM
Hello community,
Would someone know exactly how extensions are distributed on an ActiveGate group? We use several database extensions, and they all seem to be running on one of the two ActiveGates in the group. I was under the impression that load would be distributed across both ActiveGates in the group. Is this incorrect?
It would seem to make more sense to balance load across the two... rather than load one up and have the other as a hot standby.
Thoughts or direction to any documentation on ActiveGate load balancing (not using an external load balancer) would be much appreciated!
Matt
06 Mar 2025 07:48 AM
Hi,
I am under the same impression as you. Not sure if that is documented somewhere, but it does seem to be one up and the other as a hot standby.
If you check the logs, you can see that an extension is only executed on one of them, until it is restarted or becomes unavailable.
Best regards
06 Mar 2025 02:12 PM
Thank you, but this doesn't really explain how it works. I'm interested in knowing how the ActiveGates decide how to distribute and run extensions within an ActiveGate group.
06 Mar 2025 04:13 PM
Hi,
Yes, it looks like something internal that is not shared publicly. I hope someone from Dynatrace can answer you, because I am interested as well.
Best regards
01 Aug 2025 08:58 AM - edited 01 Aug 2025 11:23 AM
Hi Matt,
This is how it works: load balancing is only performed during the initial assignment of the monitoring configuration to the ActiveGate group. The monitoring configuration is assigned to the first ActiveGate whose CPU and memory soft limits (from its configured performance profile) are not exceeded, and which successfully runs the fast check (i.e., it can connect to the endpoint). In other words, the configuration simply goes to the first ActiveGate that says it can run the extension; network connectivity plays a role in this. This explains why most, if not all, extensions end up assigned to the same ActiveGate, and probably why in your case all database extensions were installed on the same ActiveGate.
The extensions don't move to other ActiveGates afterwards, unless the ActiveGate becomes unavailable. As @AntonPineiro mentioned, the other ActiveGates in the same group serve as hot standby in case the ActiveGate with the extensions installed becomes unavailable.
I hope this helps.
Kind regards,
Johan
01 Aug 2025 08:22 PM
Hi Matt,
Extensions in an ActiveGate group don’t actively load balance across AGs in real time. Extension Framework 2.0 only redistributes in two cases:
1️⃣ Failover: If the primary AG running the extension goes down, another AG in the group takes over.
2️⃣ Resource limits: If CPU or RAM soft limits (defined by the selected performance profile: Default, High Limits, or Dedicated) are exceeded, new monitoring configs can be assigned to another AG.
The attached screenshot shows the resource consumption thresholds for each profile. As long as the first AG can handle the load within its soft limits, the cluster assigns the extensions there and keeps the second AG as a standby for failover or overflow scenarios.
Unfortunately, there’s no round-robin balancing today without an external load balancer. More details here: https://docs.dynatrace.com/docs/ingest-from/extensions/extension-limits
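The soft-limit idea can be illustrated with a small sketch. The threshold numbers below are placeholders, not the real per-profile values; see the extension-limits documentation linked above for those:

```python
# Hypothetical soft limits per performance profile -- values are placeholders,
# NOT the real Dynatrace numbers (check the extension-limits docs for those).
PROFILE_SOFT_LIMITS = {
    "Default":     {"cpu_pct": 30, "ram_mb": 2048},
    "High Limits": {"cpu_pct": 60, "ram_mb": 4096},
    "Dedicated":   {"cpu_pct": 90, "ram_mb": 8192},
}

def can_take_new_config(profile: str, current_cpu_pct: float, current_ram_mb: float) -> bool:
    """New monitoring configs are only assigned elsewhere once an AG
    exceeds the soft limits of its configured performance profile."""
    limits = PROFILE_SOFT_LIMITS[profile]
    return current_cpu_pct < limits["cpu_pct"] and current_ram_mb < limits["ram_mb"]

print(can_take_new_config("Default", current_cpu_pct=25, current_ram_mb=1500))  # True
print(can_take_new_config("Default", current_cpu_pct=45, current_ram_mb=1500))  # False -> overflow to next AG
```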
thanks, and all the best
ahmed
15 Aug 2025 02:06 AM
I had a similar question, so I had a conversation with an AI bot, and I thought the response I received was helpful. I then found this post and will add the conversation here so it's available to the Community.
ME: I have some questions about Dynatrace Extension 2.0 ActiveGate grouping. If I have 3 ActiveGates in a group named EXT-Collectors and I run an Extension 2.0 extension such as SSL Certificate Checker on that ActiveGate group, will Dynatrace automatically spread the load across the 3 ActiveGates, or will it just assign some number to each ActiveGate? I ask because what if one of the 3 ActiveGates crashes, will it still run the extensions? Or what if one of the ActiveGates has too much load on the host, will Dynatrace move some of the jobs to another ActiveGate with more resources available?
BOT:
I know exactly what you’re asking — because Extensions 2.0 ActiveGate execution doesn’t work quite like Synthetic locations or OneAgent failover, and the way Dynatrace spreads jobs can surprise people.
Here’s how it actually works for Extension 2.0 on an ActiveGate group:
When you assign an Extension 2.0 to an ActiveGate group (e.g., EXT-Collectors with AG01, AG02, AG03):
Dynatrace splits the monitored entities for that extension across the AGs in the group.
The assignment is static at deployment time — meaning each monitored endpoint or instance gets bound to one specific ActiveGate for execution.
It’s not dynamic load-balancing based on CPU/memory like Kubernetes — Dynatrace doesn’t shuffle jobs around minute-by-minute to follow spare capacity.
If one ActiveGate in the group goes down (e.g., AG02 crashes or loses connectivity to the cluster):
Failover happens automatically — the endpoints bound to AG02 will be reassigned to other AGs in the group.
The reassignment isn’t instantaneous — typically there’s a short gap (depends on execution interval) until the next execution cycle picks a healthy AG.
When the failed AG comes back online, the workload may shift back again.
This is per monitored endpoint, not per single execution run — so you won’t get “half a job” moving mid-run.
Dynatrace does not actively monitor ActiveGate resource utilization to balance jobs dynamically in real time.
The scheduler’s logic:
Try to spread endpoints evenly across AGs in the group.
If an AG becomes unavailable, redistribute those endpoints evenly across the remaining AGs.
When the AG comes back, re-balance again.
If one AG is technically online but pegged at 100% CPU, Dynatrace won’t “evacuate” jobs — you’d have to remove it from the group or fix the host.
Since you have ~900 SSL checks and three AGs:
The system will initially give each AG ~300 endpoints.
If AG02 fails, AG01 & AG03 will each take ~150 extra endpoints until AG02 recovers.
Keep intervals staggered to avoid all jobs firing at the same time (reduces CPU spikes).
Watch the ActiveGate Extension Execution metrics (memory, CPU, job queue length) in the AG self-monitoring dashboard — that’s your early warning for overload.
If one AG is consistently busier (e.g., more expensive checks), you can manually reassign some endpoints to balance them.
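The split-and-failover arithmetic above can be sketched like this. Note that this is a toy model of the bot's even-split description (which differs from the first-fit behavior described earlier in the thread), not confirmed Dynatrace behavior; all names and numbers are illustrative:

```python
def split_endpoints(endpoints: list[str], gates: list[str]) -> dict[str, list[str]]:
    """Evenly assign endpoints across the available ActiveGates (round-robin)."""
    assignment: dict[str, list[str]] = {ag: [] for ag in gates}
    for i, ep in enumerate(endpoints):
        assignment[gates[i % len(gates)]].append(ep)
    return assignment

endpoints = [f"cert-{n}" for n in range(900)]
gates = ["AG01", "AG02", "AG03"]

initial = split_endpoints(endpoints, gates)
print({ag: len(eps) for ag, eps in initial.items()})  # 300 per ActiveGate

# AG02 crashes: its endpoints are redistributed across the survivors.
survivors = ["AG01", "AG03"]
after_failover = split_endpoints(endpoints, survivors)
print({ag: len(eps) for ag, eps in after_failover.items()})  # 450 each (~150 extra per survivor)
```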
In short:
Yes, Dynatrace spreads the jobs across all AGs in the group, but in a static evenly-split way, not CPU-aware load balancing.
Yes, failover happens automatically if an AG is down.
No, it will not shift load away from an overloaded-but-online AG unless you intervene.
ME: So if an AG gets overloaded, will scripts start to fail and time out, causing false alarms?
BOT:
Yes — that’s exactly the risk.
Here’s what happens in Extension 2.0 when an ActiveGate is overloaded but still technically “alive”:
The scheduler will not migrate jobs away from an AG just because it’s slow or pegged at 100% CPU.
If that AG’s extension worker queue grows too long:
Jobs execute late → time-sensitive checks (like SSL expiry checks) may miss their window.
Jobs exceed their timeout → the extension reports a failure for that endpoint.
Dynatrace treats those failures as real — so you get problem alerts, even though the target system is fine.
Each monitored endpoint’s check is independent, so if one AG is stuck, only the endpoints assigned to it will start failing.
The other AGs in the group keep running fine.
If the AG is completely unreachable, failover kicks in and jobs get redistributed — but if it’s merely slow, Dynatrace assumes it’s still capable.
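The queue-backlog effect described above can be sketched with a crude single-worker model. The numbers are purely illustrative and this is not how the real Extension Execution Controller schedules work:

```python
def simulate_queue(job_runtime_s: float, jobs_queued: int, interval_s: float, timeout_s: float):
    """Toy single-worker model: each queued job waits behind the ones before it.
    A job 'fails' when its queuing delay pushes it past its timeout; it
    'misses its window' when the delay alone exceeds the execution interval."""
    failures = misses = 0
    for position in range(jobs_queued):
        start_delay = position * job_runtime_s
        if start_delay + job_runtime_s > timeout_s:
            failures += 1   # reported as an endpoint failure -> false alarm
        elif start_delay > interval_s:
            misses += 1     # check skipped an interval
    return failures, misses

# A healthy AG: short queue, everything on time.
print(simulate_queue(job_runtime_s=2, jobs_queued=10, interval_s=60, timeout_s=30))   # (0, 0)

# An overloaded AG: 300 checks stack up behind slow executions, and most of
# them blow past their timeout even though the target endpoints are fine.
print(simulate_queue(job_runtime_s=2, jobs_queued=300, interval_s=60, timeout_s=30))  # (285, 0)
```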
With ~900 SSL checks split 300 per AG:
If AG02 gets overloaded (CPU spikes from heavy DBI, VMware pulls, etc.):
Its ~300 SSL checks may start failing or missing intervals.
You’d see a burst of SSL expiry “problem” events, even though the certificates are fine.
When AG02 recovers, those problems close — but the noise already went out.
To avoid false alarms from overload:
Monitor AG resource metrics:
CPU %, memory usage
Extension worker queue length (ExtensionExecutionController.executionTime, ExtensionExecutionController.queueLength in AG self-monitoring).
Distribute heavy jobs intentionally:
If certain SSL checks have long TLS handshakes (or slow endpoints), put fewer of them on a single AG.
Stagger execution intervals:
E.g., instead of all 900 running at 10:00, spread them in small batches over the hour.
Capacity plan:
Keep each AG at ≤60% sustained CPU during peak to leave failover headroom.
Have a hot spare:
Running 4 AGs in the group but only loading 3 to capacity means failover doesn’t overload the survivors.
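The staggering advice above can be sketched as follows. This is just the offset arithmetic; how you actually apply offsets in practice depends on the scheduling options your extension's monitoring configurations expose:

```python
def staggered_offsets(num_jobs: int, interval_s: int) -> list[int]:
    """Spread job start offsets evenly over one interval so all checks
    don't fire at the same instant (reduces CPU spikes on the AG)."""
    return [round(i * interval_s / num_jobs) for i in range(num_jobs)]

# 900 checks on a 1-hour interval: one check starts every 4 seconds
# instead of all 900 firing at the top of the hour.
offsets = staggered_offsets(900, 3600)
print(offsets[:5])   # [0, 4, 8, 12, 16]
print(max(offsets))  # 3596
```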
✅ Bottom line:
If an AG is overloaded but up, yes — Extension 2.0 scripts can start failing and cause false positives. Dynatrace won’t auto-move jobs until the AG is fully down. You need to keep headroom and actively watch the AG’s self-monitoring metrics to avoid that trap.