Solved: Load balancing of Dynatrace Managed with Nginx

kalle_lahtinen · ‎10 Aug 2020

Hi,

I'm trying to understand a bit better how the built-in load balancing of Dynatrace Managed with Nginx works. Let's say I have the domain name my.dynatrace-cluster.com, which points to 2 nodes. When I do an nslookup for that name, I get back the 2 IPs of those cluster nodes. What would happen in a scenario where, e.g., the disk breaks down on node 1 and that server is fully unavailable? A user types https://my.dynatrace-cluster.com into their browser, and the request happens to be directed to node 1. Would the request then just time out, since Nginx won't be able to direct the session to node 2? And if the user keeps refreshing the browser page, would it perhaps eventually reach the IP of node 2 and then get a response, assuming the browser is essentially trying out those IPs in a round-robin fashion?

ChadTurner · ‎11 Aug 2020

Both nodes are linked to that URL, so when one goes down, the user should not see any interruption. Instead, their session will be offloaded to the failover (second) node. That node will continue to support the UI and ingest any incoming metrics until the other node becomes healthy again. The metric data will then be shared between the two nodes, bringing node 1 back up to speed.

-Chad

kalle_lahtinen · ‎11 Aug 2020

I understand that the data collection would work fine in a situation like this. My question was purely about the Dynatrace UI accessed via browser. The browser uses DNS records to decide which IPs it should use for the https connection. How will the browser know, for example, that node 1 is not available? If the server is fully offline, there's no Nginx to direct the traffic to node 2. Sorry, but I still don't see from a technical perspective how the browser knows that a certain IP in the DNS record should be avoided?

JamesKitson · ‎11 Aug 2020

The browser wouldn't know. I don't know if there are any fancy DNS mechanisms that can handle that, but if it happened to select a dead address, I don't see how it would work after that, apart from the user refreshing and trying again.
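To make the "dead address" case concrete, here is a minimal Python sketch of what a client would have to do on its own to get past an offline node: resolve every address behind the name and try each one with a timeout. The function names `resolve_ipv4` and `first_reachable`, the port, and the timeout are assumptions for illustration only; this is not how any particular browser or Dynatrace component is implemented.

```python
import socket
from typing import Callable, Iterable, List, Optional


def resolve_ipv4(hostname: str, port: int = 443) -> List[str]:
    """Return the distinct IPv4 addresses DNS gives for hostname, in DNS order."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return list(dict.fromkeys(info[4][0] for info in infos))


def first_reachable(
    addresses: Iterable[str],
    port: int = 443,
    timeout: float = 3.0,
    connect: Callable = socket.create_connection,
) -> Optional[str]:
    """Try each address in turn and return the first that accepts a TCP connection.

    `connect` is injectable (hypothetical parameter) so the logic can be
    exercised without a network.
    """
    for ip in addresses:
        try:
            conn = connect((ip, port), timeout=timeout)
        except OSError:
            # Covers timeouts and refused connections: this node looks dead, move on.
            continue
        conn.close()
        return ip
    return None  # every address failed


# Example usage (hostname taken from the question above):
#   ips = resolve_ipv4("my.dynatrace-cluster.com")
#   print(first_reachable(ips))
```

A client that only tries the first address it was given would simply wait for the timeout on the dead node, which matches the behaviour described above. Whether a given browser retries the remaining addresses by itself is not something this thread establishes.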

Nginx running on the cluster nodes themselves distributes sessions across the nodes once a user has connected. If a server process dies while Nginx is still running, Nginx can shift its sessions over to the remaining nodes.

Where it is critical to avoid that, you can put a load balancer of some sort in front of the cluster nodes and probe each node with a health check to make sure it is alive before directing traffic to it.
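The health-check idea can be sketched roughly as follows in Python. This is only an illustration of the principle (probe first, then route), not a replacement for a real load balancer. The class name `HealthCheckedPool`, the `http_probe` helper, and the node URLs in the usage comment are hypothetical, and the sketch assumes that any 2xx response from a node's URL means the node is alive.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if url answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


class HealthCheckedPool:
    """Round-robin over nodes, skipping any node whose probe fails."""

    def __init__(self, nodes: Iterable[str], probe: Callable[[str], bool] = http_probe):
        self.nodes = list(nodes)
        self.probe = probe
        self._next = 0  # index of the node to try first on the next pick

    def pick(self) -> Optional[str]:
        """Return the next healthy node, or None if every probe fails."""
        for offset in range(len(self.nodes)):
            idx = (self._next + offset) % len(self.nodes)
            node = self.nodes[idx]
            if self.probe(node):
                self._next = (idx + 1) % len(self.nodes)
                return node
        return None


# Example usage with hypothetical node URLs:
#   pool = HealthCheckedPool(["https://10.0.0.1/", "https://10.0.0.2/"])
#   target = pool.pick()  # never returns a node that failed its probe
```

Because the probe runs before a node is handed out, a fully offline node 1 is skipped and the user is sent to node 2 without having to refresh. A production setup would normally probe on a schedule and cache the result rather than probing on every request.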

kalle_lahtinen · ‎11 Aug 2020

Thanks James! That is what I also suspected, but got some feedback in the vein of "well it just works", which I didn't find convincing enough 🙂