Dynatrace's documentation on Managed DNS configuration points to a setup
where the DNS name resolves to all of the cluster node IPs.
There is no indication whether it would be OK to have a VIP address in front of the cluster nodes or not. Every new opinion I get or read fails to make the answer any clearer. Besides, from past posts in the Community, I believe the following should be referenced:
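For reference, the documented round-robin setup looks roughly like this as a BIND-style zone sketch (hostnames and IPs below are hypothetical placeholders, not from the documentation):

```
; One A record per cluster node, so the cluster FQDN
; resolves to all node IPs and clients round-robin over them.
dynatrace.example.com.  300  IN  A  10.0.1.11   ; node 1
dynatrace.example.com.  300  IN  A  10.0.1.12   ; node 2
dynatrace.example.com.  300  IN  A  10.0.2.13   ; node 3
```

Note that in this model there is no VIP at all: every client receives the full list of node IPs and picks one itself.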
Now, some additional thoughts come to mind:
There are several reasons why a VIP might also be needed. In the case I'm dealing with at the moment, one of them is that, for redundancy reasons, the nodes are on different networks, and some are not even routable from all OneAgents. This puts a heavier load on one node, because it sits on the most "known"/routable network. The recent health dashboards have also given precious insight into this, and we have confirmed it on the OneAgent side with oneagentctl.
Now, I would love to hear insight into this VIP/load-balancer configuration, and whether there are other best practices for balancing traffic.
Good point, as I only briefly mentioned ActiveGates. I have used network zones in several cases, but in the particular one I'm referencing we don't even use ActiveGates, since all servers are basically in the same datacenter and there is no advantage in using them. Even with AGs/network zones, though, the same VIP issues would apply.
We used a virtual load balancer in front of the Dynatrace cluster nodes to access the tenant UI at one of our clients' Managed environments. This load balancer had its own rules set up to point to a single node when a specific context FQDN was matched. I hope this helps you do a similar setup with custom load balancing based on the request URL context.
dynatrace.abc.com - always routed to node 1
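In NGINX terms, such a rule could be sketched roughly like this (the original setup may have used a different load balancer; names and IPs are hypothetical):

```nginx
# Pin one FQDN to a single node, while another upstream
# would round-robin across all nodes for general traffic.
upstream dt_node1 {
    server 10.0.1.11:443;            # node 1 only
}
upstream dt_all_nodes {
    server 10.0.1.11:443;
    server 10.0.1.12:443;
    server 10.0.2.13:443;
}
server {
    listen 443 ssl;
    server_name dynatrace.abc.com;   # this hostname always goes to node 1
    location / {
        proxy_pass https://dt_node1;
    }
}
```

The same idea (route by Host header / URL context to a fixed backend) maps onto F5 or any other reverse proxy.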
Yes, the tenant UI load time was better than with the default setup. The client had this requirement because a service team was constantly monitoring data and frequently using the tenant for multiple load-test results. We first tried this on a lower environment, and the results were impressive in terms of data-load time in the UI.
Hello @AntonioSousa ,
I think we have to separate the UI/API communication from the OA traffic. For OA, a VIP will only make things more complicated. Sure, there are situations when not all nodes are reachable and balance is not achieved, but that should be solved either by using an AG, opening the missing network routes, or simply redesigning your Managed deployment.
For the UI/API part - it depends. Most enterprise customers I work with want and need a hostname in their own domain and use F5 or another reverse proxy, mostly for centralized common access to Dynatrace and other applications. In any case, the NGINX on the Managed node will automatically balance the UI/API traffic to the best node for you. This UI/API traffic might end up on a different node, depending on health. Look at the NGINX config on a Managed node for inspiration.
Typically I recommend not doing any VIP and sticking with the default configuration.
FYI, there is the /rest/health URL path on a Managed node to check its status. This is the right method of checking the status of a Managed node.
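As a sketch, that endpoint can be wired into a load-balancer health probe with something like the following (hostnames are placeholders; adjust the scheme and TLS handling to your environment):

```shell
# Probe each node's /rest/health endpoint; a healthy node
# answers HTTP 200. Replace the hostnames with your nodes.
for node in node1.example.com node2.example.com node3.example.com; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' "https://${node}/rest/health")
  echo "${node}: HTTP ${code}"
done
```

Most load balancers (F5, NGINX Plus, HAProxy) let you configure this path directly as the backend health monitor instead of scripting it.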
Thanks for all your insight. I have some doubts about some of the points:
Of course, I could have a VIP and still let some OAs have direct connections. I'm just not sure whether that, too, would create an imbalance between the nodes...
@AntonioSousa it is possible to handle this by setting the IP address for each node in the CMC. If you set it to a specific IP address, only that IP address will be propagated to the agents, so you can force agents to connect to the correct IP. I'm just not sure if you can set an IP which is not local - I never tried that, but I guess it works.
As for balancing agent traffic between nodes - did you try to validate it using cluster metrics in the local self-monitoring environment? They also show you traffic per AG located on the cluster nodes. Then you can check whether it's really unbalanced or whether there is another reason (different hardware, for example).
I personally don't see any benefit in setting up a VIP for OA traffic. If there are network routing issues, you can also use HTTP proxies to connect OAs to the desired endpoints, if that solves your case.
Yes, I have been using several dsfm metrics to check the imbalance. They have been great! I have also used oneagentctl to get the cluster nodes certain OneAgents are connecting to, and that too has provided very good information. We then got to the point where we discovered that certain routes were not allowed...
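For anyone wanting to reproduce that check on a OneAgent host, the call I mean is along these lines (the install path varies per platform, and output formatting may differ between versions):

```shell
# List the communication endpoints this OneAgent knows about.
# Default Linux install path shown; adjust for your deployment.
/opt/dynatrace/oneagent/agent/tools/oneagentctl --get-server
```

Comparing this output across hosts is how we spotted that some OneAgents could only reach one of the cluster networks.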
There are of course multiple ways to solve it, and there are certainly many other combinations that might occur elsewhere. The issue here is that Dynatrace only recommends a round-robin DNS setup, with no other options documented 🤔
There are some cases where OAs do use the round-robin DNS entries. The FQDN for the cluster is sent to OAs, and with oneagentctl I have seen some using them (marked with an asterisk). How OAs choose which entry in the received list to use is still a mystery to me...
@AntonioSousa I have never seen the FQDN of the cluster pushed to OneAgents. There is only one exception: if you set it explicitly using oneagentctl. Someone must have set this during installation or afterwards.