Solved: Re: State of the Cluster when 1 or more node(s) fail

Srikar_Mohan2 · 18 Aug 2020

Hello, this documentation explains that Dynatrace will continue to function normally if one out of three nodes is unavailable in a 3-node cluster. This means that, in this situation, the 'new' PurePaths will continue to be written across 2 nodes (we won't be able to get the PurePaths that were on the node which is now unavailable), and metric and RUM data will be replicated across 2 nodes. In this case the replication factor of 3 will NOT be met... Is this a correct assumption, or will the 2nd replica be stored on one of the 2 available nodes to meet the replication factor of 3? Users will be able to chart metrics and view RUM data.

Now, what happens if two out of the three nodes in a 3-node cluster are unavailable? Will the 'new' PurePaths continue to be written on the 1 available node? Will new metric data be written to Cassandra? What about RUM data? What happens to the UI requests: will users be able to fetch PurePath data from the one available node and view historical metrics and RUM data?

Radoslaw_Szulgo · 18 Aug 2020

Replication factor in fact means how many copies (replicas) of the data the cluster should store. If there are fewer nodes than the replication factor, only one copy is stored per node.

"metric and RUM data will get replicated across 2 nodes. In this case the replication factor of 3 will NOT be met...is this a correct assumption"

Yes. However, we don't say that the replication factor is not met, but rather that the cluster is not in its desired state, with unassigned replicas left.

Now, if 1 node out of 3 goes down, we lose 1 copy of the data and all transaction storage (PurePaths, synthetic session details) on that node. However, a statistically representative number of PurePaths is available on the other nodes. While 2 out of 3 nodes are running, data is written only to those 2 nodes: 2 replicas are always there, and the third replica is pending until the node comes back up. In that case, of course, all metrics are available, and when a user runs a chart, data is retrieved from all the nodes that are running. When the 3rd node comes back up, data will be synchronized and repaired automatically while data is read (self-healing).
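The replica arithmetic described here can be sketched in a few lines of Python. This is only an illustration: `replica_status` is a hypothetical helper, not a Dynatrace or Cassandra API, and it assumes at most one copy of a given piece of data per running node. It does not model quorum.

```python
def replica_status(replication_factor: int, running_nodes: int) -> dict:
    """Return how many replicas can be written now and how many are pending.

    Assumption (for illustration only): each running node holds at most
    one copy of a given piece of data.
    """
    if replication_factor < 1 or running_nodes < 0:
        raise ValueError("replication_factor must be >= 1 and running_nodes >= 0")
    written = min(replication_factor, running_nodes)
    return {"written": written, "pending": replication_factor - written}


# Replication factor 3 with all 3 nodes running, then with only 2 running.
print(replica_status(3, 3))  # {'written': 3, 'pending': 0}
print(replica_status(3, 2))  # {'written': 2, 'pending': 1}
```

With 2 of 3 nodes running, one replica stays pending until the third node returns, which matches the "not a desired state" wording above.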

When another node goes down (only 1 out of 3 is available), a quorum can no longer be established, so our storage is affected: Elasticsearch will be down and Cassandra will not store the data. Basically, the cluster is down. While the UI might be available, it would return error pages to users.
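To make the quorum point concrete, here is a minimal Python sketch. It assumes a simple majority rule (more than half of the configured nodes must be running). The exact quorum and consistency settings used by Elasticsearch and Cassandra inside Dynatrace Managed are not stated in this thread, and the function names are hypothetical.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority (assumed rule)."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be >= 1")
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, running_nodes: int) -> bool:
    """True if the running nodes can still form a majority."""
    return running_nodes >= quorum_size(total_nodes)


# A 3-node cluster needs 2 running nodes for a majority.
for running in (3, 2, 1):
    print(running, has_quorum(3, running))
# 3 True
# 2 True
# 1 False
```

Under this assumption a 3-node cluster tolerates the loss of one node but not two, which is why a single remaining node cannot run alone.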

Senior Product Manager,
Dynatrace Managed expert

Srikar_Mohan2 · 18 Aug 2020

Thank you Radoslaw.

One follow-up question: if 2 out of 3 nodes are down, will the PurePaths also NOT be written to that one node? That would mean Cassandra, Elasticsearch, and the Server process will be considered down. In such scenarios, can a node not be promoted to master manually?

Radoslaw_Szulgo · 18 Aug 2020

As I mentioned, if 2 nodes out of 3 are down, data will not be stored properly. You can consider the cluster unhealthy and down in such a situation. PurePaths are not synchronized between nodes. Also, the 1 node that is left will not be able to run alone. Here's a nice blog post that can help you understand what happens specifically in Elasticsearch replication:

https://codingexplained.com/coding/elasticsearch/understanding-replication-in-elasticsearch

Senior Product Manager,
Dynatrace Managed expert
