Describe the bug
We are resharding a Redis cluster following the instructions here: https://redis.io/commands/cluster-setslot
One issue we see is that if we don't add a sleep after the reshard operations for a given master, the cluster often ends up in a 'Nodes don't agree about configuration!' state when we run the redis-cli --cluster check command. From our debugging, it seems that the replicas sometimes don't receive the GOSSIP messages. On one occasion we saw a replica move from one master to another.
If we add a 5 sec sleep after resharding a master, things seem to work. Now, our question is: how do we verify, using Redis commands, that GOSSIP propagation has completed for a reshard operation? We want to verify whether the 5 sec delay is enough or whether we should wait longer.
Is there a way to read the local slot map configuration that each node (master/replica) is using at a given point in time? Does the CLUSTER SLOTS command provide that info?
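For illustration, here is a minimal sketch of how such a check could look, assuming the raw text reply of CLUSTER NODES has already been collected from every master and replica (for example with redis-cli -h HOST -p PORT cluster nodes). Each node answers from its own local view, so comparing the replies shows whether the slot map has converged. The function names slot_view and views_agree are hypothetical helpers, not part of Redis or any client library.

```python
def slot_view(cluster_nodes_text):
    """Parse one node's CLUSTER NODES reply.

    Returns (owners, open_slots): owners maps each master's node ID to a
    tuple of (start, end) slot ranges; open_slots lists any bracketed
    migrating/importing entries such as "[8191->-<node-id>]".
    """
    owners = {}
    open_slots = []
    for line in cluster_nodes_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a well-formed node line
        node_id, flags = fields[0], fields[2].split(",")
        if "master" not in flags:
            continue  # replicas carry no slot list of their own
        ranges = []
        for token in fields[8:]:
            if token.startswith("["):
                open_slots.append(token)  # slot still migrating/importing
                continue
            start, _, end = token.partition("-")
            ranges.append((int(start), int(end or start)))
        owners[node_id] = tuple(sorted(ranges))
    return owners, sorted(open_slots)


def views_agree(replies):
    """True if every reply describes the same slot ownership and no
    reply reports a slot that is still migrating or importing."""
    views = [slot_view(reply) for reply in replies]
    if not views:
        return False
    reference = views[0][0]
    return all(owners == reference and not open_slots
               for owners, open_slots in views)
```

Calling views_agree with one reply per node returns False while any node still holds a stale or in-progress view, which can stand in for a fixed sleep. This only compares slot ownership; it does not compare config epochs.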
Comment From: madolson
I think this optimization might help you: https://github.com/redis/redis/pull/7571
That change immediately broadcasts the updated epoch and slot information instead of letting it get communicated naturally via gossip. It's not in any stable branch, but if you want to try checking out and building unstable, it might resolve your issue.
More generally, you might want to implement some type of retry mechanism into the code, since you are basically catching a transient state.
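As a rough sketch of what such a retry mechanism might look like, the helper below polls a caller-supplied check until it passes or a deadline expires, rather than sleeping for a fixed interval. The name wait_until and its parameters are hypothetical; the check callable is an assumption and would wrap whatever consistency test the resharding script already runs.

```python
import time


def wait_until(check, timeout=30.0, interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns a truthy value.

    Returns True as soon as check() passes, or False once `timeout`
    seconds have elapsed. `sleep` and `clock` are injectable so the
    helper can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A resharding script could call wait_until(cluster_is_consistent, timeout=60) after moving the slots of each master and abort if it returns False, where cluster_is_consistent is a placeholder for the script's own check.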
Comment From: madolson
The previously mentioned optimization is now in Redis, so I'm going to resolve this.