
Deployment for Dynatrace Managed in a fail-over setup across 2 datacenters

EduardLaGrange
Observer

Hi, I have read the documentation (https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-managed/basic-concepts/dyna...) for the standard high-availability setup and I am trying to figure out the best way to provide fail-over for our small cluster deployed over 2 datacenters.

 

We currently have a 3-node cluster split over 2 datacenters. Presumably, if the DC with the single node ("DC2") goes down we should be OK to continue processing with the 2 nodes in "DC1".

 

However if "DC1" goes down we are basically dead - leaving a single node.

 

Adding another node to "DC2" does not improve the situation from a redundancy perspective.  Adding a node in each DC ("DC1" = 3 nodes, "DC2" = 2 nodes) still leaves us vulnerable if "DC1" goes down and also breaks the rule/guidance in the documentation - "If you plan to distribute nodes in separate data centers, you shouldn't deploy more than two nodes in each data center."

 

What are our options here if I want to survive the loss of either "DC1" or "DC2"?  Is Premium HA our only option?

6 REPLIES

Radoslaw_Szulgo
Dynatrace Guru

Thanks, @EduardLaGrange, for this question. It's a good one and has been asked many times.

 

The way you have it currently, with cluster nodes split across 2 datacenters, is actually doing more harm than good. This is because there's network latency between DCs and a higher risk of network issues between them, which can cause a split-brain situation.

 

The solution you pick should depend on your needs, for example as measured by Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Possible solutions for you:

 

1) Use just one data center and leverage backup-restore capability to recover to the 2nd data center in case of a disaster. This solution has medium RTO (about 1h - depending on the data size) and medium RPO (24h).

 

2) Use just one data center and leverage Monitoring as Code to back up the configuration (see the sketch after this list). In the 2nd data center keep the infrastructure and a separate cluster installation so you can quickly redeploy the configuration in case of a disaster in the 1st data center. This solution has high RPO (your monitored data is lost, only the configuration is persisted) and low RTO.

 

3) Use Premium High Availability to replicate the data between data centers. This solution has the lowest RTO and RPO.

 

...  Or you add a third data center and then there are more solutions.
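
To illustrate option 2, here is a minimal sketch of keeping a configuration backup outside the cluster so it can be re-applied to a standby installation. It assumes the environment Configuration API v1 is reachable and an API token with configuration-read scope; the URL, token, and list of object types are placeholders. In practice Monaco (Dynatrace's Monitoring as Code CLI) covers this download/deploy cycle end to end, so treat this purely as an illustration of the idea.

```python
# Illustration only: export a few configuration object listings to JSON files
# so they can be re-applied to a second cluster after a disaster.
# ENV_URL, API_TOKEN and CONFIG_TYPES are placeholders / assumptions.
import json
import pathlib
import requests

ENV_URL = "https://dc1-cluster.example.com/e/ENVIRONMENT-ID"  # placeholder
API_TOKEN = "dt0c01.EXAMPLE"                                  # placeholder
HEADERS = {"Authorization": f"Api-Token {API_TOKEN}"}

# A few Configuration API v1 endpoints; extend the list to what you need.
CONFIG_TYPES = ["dashboards", "autoTags", "managementZones", "alertingProfiles"]

backup_dir = pathlib.Path("config-backup")
backup_dir.mkdir(exist_ok=True)

for cfg in CONFIG_TYPES:
    resp = requests.get(f"{ENV_URL}/api/config/v1/{cfg}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    # Store the raw listing; individual objects can be fetched by id the same way.
    (backup_dir / f"{cfg}.json").write_text(json.dumps(resp.json(), indent=2))
    print(f"exported {cfg}")
```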

Senior Product Manager,
Dynatrace Managed expert

Thanks Radoslaw for the reply.

 

One more question here ...

 

If we lose one node in our 3-node cluster (all nodes in the same DC), are we still vulnerable to "split-brain" (but less so than having nodes split between 2 DCs)? I am asking this in light of the documentation stating that in a 3-node cluster we can survive the loss of one node. Is there a practical time limit for running with only 2 nodes?

 

 

If you have 3 nodes, you have 3 copies of the data (excluding log events, where there are 2 copies). So when 1 node is lost, you still have 2 copies, which constitute a majority, and the state of the data is consistent. There is no split-brain situation here; these two copies should have the same state. In such a situation, Cassandra writes "hints" to a file; these are the data updates that should be stored on the node that is down. Hints are kept in a sliding 3-hour window. If the node comes back within that window, the hints are replayed to it; after that, the data on the returning node needs to be repaired (resynchronized).

 

If another node is lost, then you're left with 1 copy of the data. No data is lost yet; however, it may not be up to date and needs a repair/bootstrap.

 

If you have only 2 nodes, then each of those two can have a different state of the data. That's why we call that situation a split-brain: each part "thinks" its data is the right one.
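
To make the majority arithmetic concrete, here is a small sketch (plain Python, not Cassandra or Dynatrace internals) of the quorum rule being described: with 3 copies a quorum is 2, so one lost node still leaves a consistent majority, while a 2-node cluster has no tie-breaking majority when the nodes disagree.

```python
# Sketch of the quorum rule discussed above (illustration only):
# data stays consistent if a majority of replicas agrees,
# i.e. quorum = floor(replicas / 2) + 1.

def quorum(replicas: int) -> int:
    return replicas // 2 + 1

def has_majority(replicas: int, nodes_down: int) -> bool:
    """True if the remaining replicas can still form a majority."""
    return (replicas - nodes_down) >= quorum(replicas)

for replicas, down in [(3, 1), (3, 2), (2, 1)]:
    print(f"{replicas} copies, {down} node(s) down -> "
          f"quorum={quorum(replicas)}, consistent majority: {has_majority(replicas, down)}")

# Output:
#   3 copies, 1 node(s) down -> quorum=2, consistent majority: True
#   3 copies, 2 node(s) down -> quorum=2, consistent majority: False
#   2 copies, 1 node(s) down -> quorum=2, consistent majority: False
```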

Senior Product Manager,
Dynatrace Managed expert

Thanks for clarifying that.

I was thinking that a 3-node cluster losing 1 node was a similar situation to a cluster built with only 2 nodes (with the split-brain issue), which is not the case.


techean
Dynatrace Champion

I agree with @Radoslaw_Szulgo 

@EduardLaGrange Kindly go ahead and add nodes in a 3rd data center. Try keeping one data center active at a time. Also, with some admin expertise, try replicating data across all data centers.

You can explore the other suggestions posted by Radoslaw too.

KG
