We have two Dynatrace Managed nodes in a cluster. However, they currently appear to be out of sync due to a node being down for a few days. Is there a way to force a resync and/or verify that sync is working?
Sure. If you mean data synchronization in Cassandra, then you can check the status by executing
Thank you very much for this info!
Does this get detected and rectified automatically, or is it a manual action that has to be started?
What gets detected? Whether data synchronization is in progress?
Cassandra nodes synchronize automatically, unless there is an issue, e.g. with connectivity.
I mean whether it gets detected that nodes are out of sync. And does the cluster recover automatically?
I was wondering the same thing, as the sizes of the nodes still do not match. One is at 31.96 GB and the other is at 43.96 GB. Should these numbers always match?
Cassandra has a mechanism called "hinted handoff". With it, the Cassandra node serving a write request temporarily stores a missed write for a down node, for a window of 3 hours. If a node is down longer than 3 hours, it will effectively be out of sync once it recovers (that is, once the Cassandra process has started up again) and needs a "repair" from a Cassandra low-level perspective.
A repair can be invoked with a Cassandra command-line tool called "nodetool" and the proper options. The shell script mentioned above is just our "wrapper" script around nodetool. In this particular scenario, "a node recovers from a downtime of more than 3 hours while still being part of the same cluster as before", the correct nodetool execution on the recovering node is to invoke a full repair via:
This is best executed in a dedicated Linux screen session, as it may take hours depending on the data volume.
Regarding the size reported as "Load" by nodetool: the values do not necessarily need to match across nodes. Details on that would be beyond the scope of this comment. For an active repair, there is usually a sign in the Cassandra log (cassandra.log), and/or "nodetool compactionstats" reports compactions of type "Validate" on the recovering node.
In addition, if one of the nodes is offline for longer than 7 days, the other nodes drop all references to it and remove it from the cluster. If the offline node is then reactivated, it will never sync up with the cluster, as it is now effectively orphaned. This is basically what happened to our node and why it never synced up. Thanks for the info, everyone!
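To make the thresholds from this thread easier to reason about, here is a minimal Python sketch that maps a node's downtime to the expected recovery path. The function name `recovery_action`, the constants, and the returned labels are hypothetical and only for illustration; the 3-hour and 7-day values are simply the ones quoted in the replies above and should be checked against your own cluster's configuration.

```python
from datetime import timedelta

# Thresholds as quoted in this thread (assumptions, not verified defaults
# for any particular Dynatrace Managed or Cassandra version).
HINT_WINDOW = timedelta(hours=3)      # hinted handoff covers missed writes
REMOVAL_WINDOW = timedelta(days=7)    # after this the node is dropped from the cluster


def recovery_action(downtime: timedelta) -> str:
    """Return the expected recovery path for a node that was down for `downtime`."""
    if downtime < timedelta(0):
        raise ValueError("downtime must not be negative")
    if downtime <= HINT_WINDOW:
        # Missed writes were stored as hints and are replayed automatically.
        return "hinted handoff (automatic)"
    if downtime <= REMOVAL_WINDOW:
        # Hints have expired; the recovering node needs a manual full repair.
        return "manual full repair on the recovering node"
    # The remaining nodes no longer reference this node.
    return "node removed from cluster (orphaned)"


if __name__ == "__main__":
    # A node that was down "for a few days", as in the original question.
    print(recovery_action(timedelta(days=3)))
```

For a downtime of three days this lands in the middle case, which matches the advice above to run a full repair on the recovering node.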
Thanks, all, for your great and detailed answer. So if I can summarise:
Again, thanks all for the great answers!