I am looking for assistance and clarification on two things.
First, I want adaptive load reduction (ALR) to stop occurring.
I am trying to figure out what traffic volume or trigger is causing ALR, so that I can increase the necessary resources on the managed cluster node.
Second, I currently have a single managed cluster node and want to add a second one to balance the traffic load and provide a layer of redundancy.
My concern is that the single node I have today is located in one data center.
Is it harmful or undesirable to use a network technique known as OTV between the cluster nodes?
With OTV, both nodes reside in the same VLAN, but the VLAN is stretched and routed over IP, usually to a different physical location.
This introduces some latency.
What latency tolerances apply between the member nodes of a managed cluster?
Are there any commands or statistics that a managed cluster can provide regarding performance between the nodes?
Please advise and thanks!
You have to ensure near-zero latency (< 10 ms) between cluster nodes. The nodes must also have their clocks synchronized with NTP. See also: https://www.dynatrace.com/support/help/shortlink/managed-requirements#multi-node-installations
You can rely on typical network tools to check metrics/network reliability between nodes. You can also use Cassandra nodetool to check network histograms: https://cassandra.apache.org/doc/latest/troubleshooting/use_nodetool.html
nodetool is in /utils/cassandra-nodetool.sh
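If you want a quick number to compare against the < 10 ms figure, here is a minimal sketch that times TCP connects from one node to another. It is not a Dynatrace tool. The function names, the 10 ms default, and the host and port you pass in are all assumptions for illustration, so use whatever inter-node port is actually open in your environment.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Return TCP connect times in milliseconds for `samples` attempts.

    `host` and `port` are supplied by the caller; nothing here is
    specific to Dynatrace Managed.
    """
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection completes the TCP handshake, so the elapsed
        # time approximates one network round trip plus connection setup.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        results.append(elapsed * 1000.0)
    return results


def summarize(latencies_ms, limit_ms=10.0):
    """Summarize samples against a latency limit (10 ms is the figure quoted above)."""
    worst = max(latencies_ms)
    return {
        "median_ms": statistics.median(latencies_ms),
        "max_ms": worst,
        "within_limit": worst < limit_ms,
    }


# Hypothetical usage, run on one node against the other node's address:
#   print(summarize(tcp_connect_latency_ms("other-node.example", 443)))
```

Connect time includes some overhead beyond pure network latency, so treat the result as a rough upper bound and confirm with your usual network tools.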
Two nodes should be avoided. I consider a single node superior to a two-node cluster because it avoids exposure to the "split brain" problem that a two-node cluster entails.
I would look at scaling vertically on the single node, and then going to a three-node cluster if that is not sufficient.
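To illustrate why two nodes buy you nothing here, this is the generic majority-quorum arithmetic. It is a sketch of the general principle only. The thread does not describe how Dynatrace Managed components actually vote internally, and the function names are made up for the example.

```python
def majority(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - majority(nodes)


for n in (1, 2, 3):
    print(f"{n} node(s): majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

With two nodes the majority is two, so losing either node (or the link between them) leaves no majority, which is no better than a single node. Three nodes is the smallest size that survives one failure.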
Thank you for the valuable input. Would it perhaps make sense to document the fact that we should avoid 2-node installations, for example here:
In my opinion, this is currently not expressed by Dynatrace's documentation.
I believe the recovery mainly involves some commands you have to issue against Cassandra. Support can help you recover in the event it happens. @Radoslaw S. might have some instructions as well. I have never recovered from one personally...
You have to run cassandra-nodetool.sh repair if the nodes were disconnected from each other for at least 3 hours. This is because Cassandra has a mechanism called "hinted handoff", which sets aside the data that could not be synchronized to the unreachable node. It can store at most 3 hours of data.
Usually upgrades don't take that long, so a 2-node cluster should be safe in that respect.
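As a rough sketch of that rule, assuming the 3-hour window quoted above applies to your installation, you could check an outage against it like this. The function name and the constant are illustrative, not part of any Dynatrace or Cassandra tooling.

```python
from datetime import datetime, timedelta

# Hint window as quoted in this thread; verify the value for your installation.
HINT_WINDOW = timedelta(hours=3)


def repair_needed(disconnected_at, reconnected_at, hint_window=HINT_WINDOW):
    """Return True if the outage lasted at least as long as the hint window."""
    return (reconnected_at - disconnected_at) >= hint_window


# A 45-minute upgrade window stays inside the hint window.
print(repair_needed(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 45)))
# A 4-hour network outage does not, so a repair would be needed.
print(repair_needed(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 4, 0)))
```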
Dynatrace recommends that you use 3 nodes and not 2:
I too agree that there should be an indication of which factors trigger ALR, as it is affecting one of my clients too.
Thank you all for the help on this. We will be going to a 3-node solution. The good news is that the servers will be in the same VLAN, and OTV will not be used to stretch the VLANs across different physical or logical locations.