Solved: Re: Node synchronization in a cluster

moffatrt · ‎10 Nov 2017

We have two Dynatrace Managed nodes in a cluster. However, they currently appear to be out of sync due to a node being down for a few days. Is there a way to force a resync and/or verify that sync is working?

Radoslaw_Szulgo · ‎10 Nov 2017

Sure. If you mean data synchronization in Cassandra, then you can check the status by executing:

/opt/dynatrace-managed/utils/cassandra-nodetool.sh

More details:

https://docs.datastax.com/en/cassandra/2.1/cassand...
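
As a rough illustration of such a status check: nodetool has a "status" subcommand that lists each node with a two-letter state such as UN (up/normal) or DN (down/normal). The sketch below assumes the wrapper script forwards its arguments to nodetool unchanged; the helper name count_down_nodes is made up for this example.

```shell
# Path as quoted in this thread. That the wrapper passes "status" through
# to nodetool is an assumption.
NODETOOL=/opt/dynatrace-managed/utils/cassandra-nodetool.sh

# Hypothetical helper: reads "nodetool status" output on stdin and prints
# how many node lines start with a D (down) state, e.g. "DN".
count_down_nodes() {
  # grep -c exits non-zero when the count is 0, so force success.
  grep -c '^D[NLJM] ' || true
}

# Usage on a cluster node:
#   "$NODETOOL" status | count_down_nodes
```

A result of 0 only means that every node is reachable. It does not prove that the data on the nodes is consistent.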

Senior Product Manager,
Dynatrace Managed expert

kristof_renders · ‎13 Nov 2017

Hi Radoslaw,

Thank you very much for this info!

Is this detected and rectified automatically, or is it a manual action that has to be started?

Cheers!

Radoslaw_Szulgo · ‎13 Nov 2017

What gets detected? Whether data synchronization is in progress?

Cassandra nodes synchronize automatically, unless there is an issue, e.g. with connectivity.

kristof_renders · ‎13 Nov 2017

Whether nodes are out of sync. And does the cluster recover automatically?

moffatrt · ‎13 Nov 2017

I was wondering the same thing, as the sizes of the nodes still do not match. One is at 31.96 GB and the other is at 43.96 GB. Should these numbers always match?

thomas_steinma1 · ‎13 Nov 2017

Cassandra has something called "hinted handoff". With that, the Cassandra node serving a write request temporarily stores writes that a down node has missed, for a window of 3 hours. If a node is down for longer than 3 hours, it is effectively out of sync once it recovers (i.e. once the Cassandra process has started up again) and needs a "repair" from a Cassandra low-level perspective.

A repair can be invoked with a Cassandra command-line tool called "nodetool" and the proper options. The shell script mentioned above is just our "wrapper" script around nodetool. In this particular scenario, "a node recovers from a downtime of more than 3 hours and is still part of the same cluster as before", the correct nodetool execution on the recovering node is a full repair via:

/opt/dynatrace-managed/utils/cassandra-nodetool.sh repair

This is best executed in a dedicated Linux screen session, as it may take hours depending on the data volume.
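
One possible way to run the repair in a screen session is sketched below. The session name cassandra-repair, the function name start_repair and the DRY_RUN switch are made up for this example. Only the wrapper path and the "repair" argument come from this thread.

```shell
# Path as quoted in this thread; the session name is a hypothetical choice.
NODETOOL=/opt/dynatrace-managed/utils/cassandra-nodetool.sh
SESSION=cassandra-repair

# Hypothetical helper: starts the repair in a detached screen session.
# With DRY_RUN=1 it only prints the command it would run.
start_repair() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "screen -dmS $SESSION $NODETOOL repair"
  else
    # -d -m starts the session detached, -S gives it a name.
    screen -dmS "$SESSION" "$NODETOOL" repair
  fi
}

# Usage on the recovering node:
#   start_repair
#   screen -r cassandra-repair    # reattach to watch progress
```

Note that a screen session started this way ends when the command finishes, so the final output is not kept. Starting screen interactively and running the repair inside it avoids that.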

Regarding the size reported as "Load" by nodetool: the values don't necessarily need to match. Details on that would go beyond this comment. For an active repair, there is usually a sign in the Cassandra log (cassandra.log), and/or "nodetool compactionstats" reports compactions of type "Validate" on the recovering node.
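
A minimal sketch of that compactionstats check follows. It assumes the wrapper forwards "compactionstats" to nodetool, and it matches "validat" case-insensitively because the exact label of the compaction type may differ between Cassandra versions. The function name repair_validation_running is made up for this example.

```shell
# Path as quoted in this thread. That the wrapper passes "compactionstats"
# through to nodetool is an assumption.
NODETOOL=/opt/dynatrace-managed/utils/cassandra-nodetool.sh

# Hypothetical helper: reads "nodetool compactionstats" output on stdin and
# succeeds if any line mentions a validation compaction.
repair_validation_running() {
  grep -qi 'validat'
}

# Usage on the recovering node:
#   if "$NODETOOL" compactionstats | repair_validation_running; then
#     echo "repair validation in progress"
#   fi
```

No match does not prove that no repair is running, since validation is only one phase of a repair.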

moffatrt · ‎13 Nov 2017

In addition, if one of the nodes is offline for longer than 7 days, the other nodes remove any reference to it and drop it from the cluster. If that node is then reactivated, it will never sync up with the cluster, as it is now effectively orphaned. This is basically what happened to our node and why it never synced up. Thanks for the info, everyone!
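
Putting the two thresholds from this thread together (3 hours for hinted handoff, 7 days until a node is dropped), the expected recovery path can be sketched as a small helper. The function name and the returned labels are made up for this example, and the thresholds are simply the values stated above, not something read from the cluster configuration.

```shell
# Hypothetical helper: maps a downtime in whole hours to the recovery path
# described in this thread.
recovery_action() {
  hours_down=$1
  if [ "$hours_down" -le 3 ]; then
    # Within the hinted handoff window: missed writes are replayed.
    echo "hinted-handoff"
  elif [ "$hours_down" -le 168 ]; then
    # More than 3 hours but at most 7 days (168 h): manual repair needed.
    echo "manual-repair"
  else
    # More than 7 days: the node has been dropped from the cluster.
    echo "orphaned"
  fi
}

# Usage:
#   recovery_action 72
```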

kristof_renders · ‎14 Nov 2017

Thanks, all, for your great and detailed answers. So if I can summarise:

Nodes that are out of sync (as described by Thomas) are not repaired automatically; a manual command has to be run.
If your node is offline for too long (7d+), it will be orphaned and no longer part of the cluster. I guess the best solution is to add a new node to the cluster? Would we still be able to access PP data from that node?
An open question remains whether Dynatrace actively reports on nodes being out of sync. Will we see it in the CMC, or even in the debug UI?

Again, thanks all for the great answers!

Cheers,
Kristof