Hi, I have read the documentation (https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-managed/basic-concepts/dyna...) for the standard high-availability setup and I am trying to figure out the best way to provide fail-over for our small cluster deployed over 2 datacenters.
We currently have a 3 node cluster split over 2 datacenters. Presumably, if the DC with the single node ("DC2") goes down, we should be OK to continue processing with the 2 nodes in "DC1".
However, if "DC1" goes down we are basically dead, as that leaves a single node.
Adding another node to "DC2" does not improve the situation from a redundancy perspective. Adding a node in each DC ("DC1" = 3 nodes, "DC2" = 2 nodes) still leaves us vulnerable if "DC1" goes down and also breaks the rule/guidance in the documentation - "If you plan to distribute nodes in separate data centers, you shouldn't deploy more than two nodes in each data center."
What are our options here if I want to survive the loss of either "DC1" or "DC2"? Is Premium HA our only option?
Thanks, @EduardLaGrange, for this question. It's a good one and has been asked many times.
The way you currently have it, with cluster nodes split across 2 datacenters, is actually doing more harm than good. This is because there's network latency between the DCs and a higher risk of network issues between them, which can cause a split-brain situation.
The solution you pick should depend on your needs, for example as measured by Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Possible solutions for you:
1) Use just one data center and leverage backup-restore capability to recover to the 2nd data center in case of a disaster. This solution has medium RTO (about 1h - depending on the data size) and medium RPO (24h).
2) Use just one data center and leverage Monitoring as Code to back up the configuration. In the 2nd data center, keep infrastructure and a separate cluster installation so you can quickly redeploy the configuration in case of a disaster in the 1st data center. This solution has high RPO (your monitored data is lost; only the configuration is persisted) and low RTO.
3) Use Premium High Availability to replicate the data between data centers. This solution has the lowest RTO and RPO.
... Or you add a third data center and then there are more solutions.
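To illustrate why two data centers cannot survive the loss of either one while a third data center can, here is a minimal sketch. It assumes a simple majority model in which the cluster stays consistent only while more than half of its nodes are reachable. The function names and layouts are hypothetical and only count nodes; the real behavior of a Dynatrace Managed cluster also depends on its storage and replication settings.

```python
def quorum(total_nodes: int) -> int:
    """Smallest strict majority of the cluster (assumed model, not a Dynatrace API)."""
    return total_nodes // 2 + 1


def surviving_dc_losses(layout: dict[str, int]) -> dict[str, bool]:
    """For each data center, report whether the nodes left after losing
    that data center still form a majority of the original cluster."""
    total = sum(layout.values())
    needed = quorum(total)
    return {dc: (total - nodes) >= needed for dc, nodes in layout.items()}


# Hypothetical layouts taken from the discussion in this thread.
layouts = {
    "current (2+1)": {"DC1": 2, "DC2": 1},
    "add one to DC2 (2+2)": {"DC1": 2, "DC2": 2},
    "add one to each (3+2)": {"DC1": 3, "DC2": 2},
    "three DCs (1+1+1)": {"DC1": 1, "DC2": 1, "DC3": 1},
}

for name, layout in layouts.items():
    print(name, surviving_dc_losses(layout))
```

Under this model, every two-DC layout has at least one data center whose loss leaves the cluster without a majority, while the 1+1+1 layout keeps a majority whichever single data center is lost.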
Thanks Radoslaw for the reply.
One more question here ...
If we lose one node in our 3 node cluster (all nodes in the same DC), are we still vulnerable to "split-brain" (but less so than having nodes split between 2 DCs)? I am asking this in light of the documentation stating that in a 3 node cluster we can survive the loss of one node. Is there a practical time limit for running with only 2 nodes?
If you have 3 nodes, you have 3 copies of the data (excluding log events, where there are 2 copies). So when 1 node is lost, the 2 remaining copies constitute a majority and the state of the data is consistent. So there is no split-brain situation here: these two should have the same state. In this situation, Cassandra writes "hints" to a file; these are the data updates that should be stored on the node that is down. Hints are kept in a sliding 3h window. If the node comes back within that window, the hints are loaded onto it. If it comes back later, the data on the returning node needs to be repaired (resynchronized).
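The hint window described above can be sketched as a simple decision. This is only an illustration: the 3h value is taken from this post, the function name is hypothetical, and treating a downtime of exactly 3h as still inside the window is an assumption.

```python
from datetime import timedelta

# Hint window as stated in this thread; verify the actual value for your cluster.
HINT_WINDOW = timedelta(hours=3)


def recovery_action(downtime: timedelta, hint_window: timedelta = HINT_WINDOW) -> str:
    """Return what the returning node is expected to need after being down.

    Assumption: a downtime exactly equal to the window still counts as
    inside it; the thread does not specify the boundary case.
    """
    if downtime <= hint_window:
        return "replay hints"  # missed updates are loaded from the stored hints
    return "repair"  # hints no longer cover the outage; data must be resynchronized


print(recovery_action(timedelta(minutes=30)))
print(recovery_action(timedelta(hours=5)))
```

So the practical limit for running on 2 nodes without a later repair is the hint window; beyond it the cluster keeps working, but the returning node needs a repair.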
If another node is lost, you're left with 1 copy of the data. So no data is lost yet; however, it may not be up-to-date and needs a repair/bootstrap.
If you have only 2 nodes, each of the two can have a different state of the data. That's why we call that situation a split-brain: each part "thinks" its data is the right one.