Solved: Moving from Dynatrace Managed single node cluster to 2 nodes and addressing ALR

runatyr · ‎23 Apr 2020

I am looking for assistance and clarification regarding two things:

I want adaptive load reduction (ALR) to stop occurring.

I am trying to figure out what traffic volume or trigger is causing ALR, so that I can increase the necessary resources on the managed cluster node.

I currently have a single managed cluster node and want to add a second one, to balance the traffic load across the nodes and provide a layer of redundancy.

My concern is that I currently have a single node located at one data center.

Would it be harmful or less desirable to use a network practice known as OTV (Overlay Transport Virtualization) between the cluster nodes?

With OTV, both nodes reside in the same VLAN, but the VLAN is stretched and routed over IP, usually to a different physical location.

This introduces some latency.

I am curious what latency tolerances apply between the member nodes of a managed cluster.

Are there any commands or statistics that a managed cluster can provide regarding performance between the nodes?

Please advise and thanks!

Radoslaw_Szulgo · ‎23 Apr 2020

You have to ensure near-zero latency (< 10 ms) between cluster nodes. The nodes must also have their clocks synchronized with NTP. See also: https://www.dynatrace.com/support/help/shortlink/managed-requirements#multi-node-installations

You can rely on typical network tools to check metrics/network reliability between nodes. You can also use Cassandra nodetool to check network histograms: https://cassandra.apache.org/doc/latest/troubleshooting/use_nodetool.html

nodetool is in /utils/cassandra-nodetool.sh
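As a rough illustration of the "typical network tools" approach, here is a minimal Python sketch that times TCP connections to a peer node and compares the samples against the 10 ms figure mentioned above. The function names are made up for this example, and the TCP connect time is only an approximation of one network round trip; it is not a Dynatrace-provided check.

```python
import socket
import statistics
import time

MAX_LATENCY_MS = 10.0  # threshold quoted in this reply; adjust if your requirements differ


def tcp_connect_ms(host, port, timeout=2.0):
    """Return the time in milliseconds needed to open a TCP connection to host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed again immediately; only the handshake is timed
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms, limit_ms=MAX_LATENCY_MS):
    """Summarize latency samples and flag whether the worst one stays under the limit."""
    return {
        "min_ms": min(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        "within_limit": max(samples_ms) < limit_ms,
    }
```

From one node you could collect, say, 20 samples with `tcp_connect_ms(peer_host, peer_port)` against a port that is open between the nodes (both values are placeholders you have to supply) and pass the list to `summarize`. Judging by the worst sample rather than the average is a deliberately conservative choice.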

Senior Product Manager,
Dynatrace Managed expert

dave_mauney · ‎23 Apr 2020

Two nodes should be avoided. I consider a single node superior to a two-node cluster because it avoids the exposure to the "split brain" problem that a two-node cluster entails.

I would look at scaling vertically on the single node, and then going to a three-node cluster if that is not sufficient.
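The reasoning behind this can be sketched with general majority-quorum arithmetic. This is a generic illustration, not a description of how Dynatrace Managed decides quorum internally, and the function names are invented for the example.

```python
def majority(nodes):
    """Smallest number of nodes that forms a strict majority of the cluster."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while a strict majority is still reachable."""
    return nodes - majority(nodes)


for n in (1, 2, 3):
    print(f"{n} node(s): majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
# 1 node(s): majority=1, tolerated failures=0
# 2 node(s): majority=2, tolerated failures=0
# 3 node(s): majority=2, tolerated failures=1
```

Under this model a two-node cluster tolerates no more failures than a single node, and if the link between the two nodes breaks, neither side holds a majority. Three nodes is the smallest size that survives the loss of one member.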

kalle_lahtinen · ‎11 Jun 2020

Hi Dave,

Thank you for the valuable input. Would it perhaps make sense to document the fact that we should avoid 2-node installations, for example here:

https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-managed/installation/dynatr...

And here:

https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-managed/basic-concepts/dyna...

In my opinion, this is currently not expressed by Dynatrace's documentation.

dave_mauney · ‎11 Jun 2020

Hi Kalle,

I agree we need to update our docs and have requested this of the documentation team.

Thanks,

dave

kalle_lahtinen · ‎16 Jun 2020

Thanks Dave! I have one follow-up question, since you appear to have some expertise in the subject matter. If we do end up in the split brain scenario, are there any feasible methods to recover from it (while keeping the data), or is it just best to restore a previous Managed backup?

dave_mauney · ‎16 Jun 2020

I believe the recovery mainly involves some commands you have to issue against Cassandra. Support can help you recover in the event it happens. @Radoslaw S. might also have some instructions. I have never recovered from one personally...

kalle_lahtinen · ‎16 Jun 2020

Thanks again Dave, appreciate it! Good stuff to know 🙂

Radoslaw_Szulgo · ‎18 Jun 2020

You have to run cassandra-nodetool.sh repair if the nodes have been disconnected from each other for 3 hours or more. This is because Cassandra has a mechanism called "hinted handoff", which sets aside all chunks of data that have not yet been synchronized. It can store at most 3 hours of data.

Usually upgrades don't take that long, so a 2-node cluster should be safe in that respect.
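The rule of thumb above can be written down as a small check. This is only a sketch: the 3-hour window is the figure quoted in this reply, and `needs_repair` is a hypothetical helper, not part of any Dynatrace or Cassandra tooling.

```python
from datetime import datetime, timedelta

HINT_WINDOW = timedelta(hours=3)  # maximum span of hinted data, as quoted in this reply


def needs_repair(disconnected_at, reconnected_at, window=HINT_WINDOW):
    """Return True if the outage lasted at least as long as the hint window."""
    return reconnected_at - disconnected_at >= window


# A 40-minute upgrade stays within the window; a 3.5-hour outage does not.
print(needs_repair(datetime(2020, 6, 18, 9, 0), datetime(2020, 6, 18, 9, 40)))   # False
print(needs_repair(datetime(2020, 6, 18, 9, 0), datetime(2020, 6, 18, 12, 30)))  # True
```

When the check returns True, the hints alone may no longer cover everything that was missed, which is the case where the repair command mentioned above applies.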


kalle_lahtinen · ‎18 Jun 2020

Sounds good, thanks for the information Radoslaw!

AntonioSousa · ‎26 Apr 2020

Dynatrace recommends that you use 3 nodes and not 2:

https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-managed/installation/dynatr...

I agree, too, that there should be an indication of which factors trigger ALR, as that is affecting one of my clients as well.

Antonio Sousa

runatyr · ‎16 Jun 2020

Thank you all for the help on this. We will be going to a 3-node solution. The good news is that the servers will be in the same VLAN, and OTV will not be used to stretch the VLANs across different physical or logical locations.

ChadTurner · ‎16 Jun 2020

Ideally, a 3-node setup is what you would want.

-Chad