23 Apr 2020 01:06 PM
I am looking for assistance and clarification regarding two things.
First, I want to stop adaptive load reduction (ALR) from occurring. I am trying to figure out what traffic volume or trigger is causing ALR so that I can increase the necessary resources on the managed cluster node.
Second, I currently have a single managed cluster node and want to expand to two nodes to balance the traffic load and provide a layer of redundancy.
My concern is that the single node is currently located at one data center.
Is it impactful or less desirable to use a network practice known as OTV (Overlay Transport Virtualization) between the cluster nodes?
With OTV, both nodes reside in the same VLAN, but the VLAN is stretched and routed over IP, usually to a different physical location, so some latency occurs.
I am curious what latency tolerances are acceptable between the member nodes of a managed cluster.
Are there any commands or statistics a managed cluster can provide regarding performance between the nodes?
Please advise and thanks!
23 Apr 2020 01:19 PM
You have to ensure near-zero latency (< 10 ms) between cluster nodes. The nodes also have to have their time synchronized with NTP. See also: https://www.dynatrace.com/support/help/shortlink/managed-requirements#multi-node-installations
You can rely on typical network tools to check metrics/network reliability between nodes. You can also use Cassandra nodetool to check network histograms: https://cassandra.apache.org/doc/latest/troubleshooting/use_nodetool.html
nodetool is in /utils/cassandra-nodetool.sh
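For example, a quick check from one node could look like the following. This is just a sketch: it assumes a default installation directory of /opt/dynatrace-managed and uses a placeholder hostname for the second node, so adjust both for your environment.

# Round-trip latency to the other cluster node (hostname is a placeholder)
ping -c 20 node2.example.com

# Confirm the clock is synchronized via NTP (use whichever tool your distro ships)
chronyc tracking

# Cassandra read/write latency histograms as seen from this node
/opt/dynatrace-managed/utils/cassandra-nodetool.sh proxyhistograms

# Per-node streaming activity and connection state
/opt/dynatrace-managed/utils/cassandra-nodetool.sh netstats

proxyhistograms is handy here because it reflects the latency Cassandra itself observes when coordinating requests across nodes, not just raw ICMP round trips.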
23 Apr 2020 03:43 PM
Two nodes should be avoided. I consider a single node superior to a two-node cluster because it avoids the exposure to the "split brain" problem that a two-node cluster entails.
I would look at scaling vertically on the single node, and then going to a three-node cluster if that is not sufficient.
11 Jun 2020 09:44 AM
Hi Dave,
Thank you for the valuable input. Would it perhaps make sense to document the fact that we should avoid 2-node installations, for example here:
And here:
In my opinion, this is currently not expressed by Dynatrace's documentation.
11 Jun 2020 03:57 PM
Hi Kallie,
I agree we need to update our docs and have requested this of the documentation team.
Thanks,
dave
16 Jun 2020 06:52 AM
Thanks Dave! I have one follow-up question, since you appear to have some expertise in the subject matter. If we do end up with the split-brain scenario, are there any feasible methods to recover from it (while keeping the data), or is it best to just restore a previous Managed backup?
16 Jun 2020 05:12 PM
I believe the recovery mainly involves some commands you have to issue against Cassandra. Support can help you recover in the event it happens. @Radoslaw S. might have some instructions as well. I have never recovered from one personally...
16 Jun 2020 09:50 PM
Thanks again Dave, appreciate it! Good stuff to know 🙂
18 Jun 2020 08:40 AM
You have to run cassandra-nodetool.sh repair after a disconnection of 3 hours or more between nodes. This is because Cassandra has a mechanism called "hinted handoff", which sets aside any chunks of data that could not be synchronized and replays them when the node comes back; it can store a maximum of 3 hours of data.
Usually upgrades don't take that long, so a 2-node cluster should be safe in that respect.
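For reference, a minimal sketch of that repair run, again assuming the /opt/dynatrace-managed installation directory (the repair subcommand itself is standard Cassandra nodetool):

# After a disconnection longer than ~3 hours, re-synchronize the data
# that hinted handoff could no longer buffer
/opt/dynatrace-managed/utils/cassandra-nodetool.sh repair

# Optionally, check for remaining streaming activity afterwards
/opt/dynatrace-managed/utils/cassandra-nodetool.sh netstats

Repairs can be I/O-heavy, so a low-traffic window is a sensible time to run one.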
18 Jun 2020 09:02 AM
Sounds good, thanks for the information Radoslaw!
26 Apr 2020 04:17 PM
Dynatrace recommends that you use 3 nodes and not 2:
I agree that there should be an indication of what factors trigger ALR, as that is affecting one of my clients too.
16 Jun 2020 11:56 AM
Thank you all for the help on this. We will be going with a 3-node solution. The good news is that the servers will be in the same VLAN, and OTV will not be used to stretch VLANs across different physical or logical locations.
16 Jun 2020 05:57 PM
Ideally, a 3-node setup is what you would want.