Hello, this documentation explains that Dynatrace will continue to function normally if one out of three nodes is unavailable in a 3-node cluster. This means that, in this situation, 'new' PurePaths will continue to be written across 2 nodes (we won't be able to get the PurePaths that were on the node which is now unavailable), and metric and RUM data will be replicated across 2 nodes. In this case the replication factor of 3 will NOT be met. Is this a correct assumption, or will the 2nd replica be stored on one of the 2 available nodes to meet the replication factor of 3? Users will be able to chart metrics and view RUM data.
Now, what happens if two out of the three nodes in a 3-node cluster are unavailable? Will 'new' PurePaths continue to be written on the 1 available node? Will new metric data be written to Cassandra? What about RUM data? What happens to UI requests: will users be able to fetch PurePath data from the one available node and view historical metrics and RUM data?
Replication factor in fact means how many copies of the data (replicas) should be stored in the cluster. If there are fewer nodes than the replication factor, only 1 copy is stored per node.
"metric and RUM data will get replicated across 2 nodes. In this case the replication factor of 3 will NOT be met...is this a correct assumption"
Yes. However, we don't say the replication factor is not met; rather, the cluster is not in its desired state, with unassigned replicas left.
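The replica counting described above can be sketched in a few lines of Python. This is only an illustration of the rule stated in this answer (at most one copy per node, the rest left unassigned), not Dynatrace, Cassandra, or Elasticsearch code; the function name `replica_state` and its return format are made up for this example.

```python
def replica_state(replication_factor: int, live_nodes: int) -> dict:
    """Hypothetical helper: count stored vs. unassigned replicas.

    Assumes at most one copy of a given piece of data per node,
    as described in the answer above.
    """
    stored = min(replication_factor, live_nodes)
    return {"stored": stored, "unassigned": replication_factor - stored}


if __name__ == "__main__":
    # Replication factor 3 with 3, 2, and 1 nodes running.
    for live in (3, 2, 1):
        print(live, replica_state(3, live))
```

With 2 of 3 nodes running, this gives 2 stored replicas and 1 unassigned replica, which matches the "not a desired state" wording above.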
Now, if 1 node out of 3 goes down, we lose 1 copy of the data and all transaction storage (PurePaths, synthetic session details) on that node. However, a statistically representative number of PurePaths is available on the other nodes. If 2 out of 3 nodes are running, data is written only to those 2 nodes: 2 replicas are always there, and the third replica is pending until the node comes back up. In that case, of course, all metrics are available, and when a user runs a chart, data is retrieved from all the nodes that are running. When the 3rd node comes back up, data will be synchronized and repaired automatically while data is read (self-healing).
When another node goes down (only 1 out of 3 is available), a quorum can no longer be established, and our storage is affected: Elasticsearch will be down and Cassandra will not store the data. Basically, the cluster is down. While the UI might be available, it would return error pages to users.
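To make the quorum point concrete, here is a minimal Python sketch. It assumes the usual majority rule (more than half of the configured nodes must be running); the answer above only states that 1 node out of 3 cannot form a quorum, so the exact formula is an assumption, and the function names are hypothetical.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority (assumed rule)."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, live_nodes: int) -> bool:
    """True if the running nodes are enough to form a majority."""
    return live_nodes >= quorum_size(total_nodes)


if __name__ == "__main__":
    # A 3-node cluster with 3, 2, and 1 nodes running.
    for live in (3, 2, 1):
        print(f"{live}/3 nodes running, quorum: {has_quorum(3, live)}")
```

Under this assumption a 3-node cluster needs 2 running nodes, so it tolerates the loss of one node but not two.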
As I mentioned, if 2 nodes out of 3 are down, data will not be stored properly. You can consider the cluster unhealthy and down in such a situation. PurePaths are not synchronized between nodes. Also, the 1 node that is left will not be able to run alone. Here's a nice blog post that can help you understand what happens specifically in Elasticsearch replication :