Describe the bug

When we create a 10-node, 5-shard cluster in Redis 7.2.7, the cluster does not always stabilize after creation: the cluster nodes file is rewritten continuously, with the shard-id of one of the shards flip-flopping between two values.

To reproduce

See the attached script, which creates a 10-node cluster. After the cluster is created, monitor the node0/cluster.nodes.conf file. If the problem has been reproduced, you will see the shard-id for one of the shards flip between two values every few seconds.

Something like this:

Starts with shard-id c8dc...

$ cat node0/cluster.nodes.conf | grep -E "44aba9ff402bc0c1c5c30f9b4ed5cbe09257d03a|c8d701e355e383137f6624529f51a69a8c00fba8"
c8d701e355e383137f6624529f51a69a8c00fba8 127.0.0.1:6703@16703,,tls-port=0,shard-id=c8dcc477811f4e4c6b88b0c68031ee1cddfbe764 master - 0 1742243123451 4 connected 9830-13106
44aba9ff402bc0c1c5c30f9b4ed5cbe09257d03a 127.0.0.1:6708@16708,,tls-port=0,shard-id=c8dcc477811f4e4c6b88b0c68031ee1cddfbe764 slave c8d701e355e383137f6624529f51a69a8c00fba8 0 1742243120000 4 connected

Switches to e4f89...

$ cat node0/cluster.nodes.conf | grep -E "44aba9ff402bc0c1c5c30f9b4ed5cbe09257d03a|c8d701e355e383137f6624529f51a69a8c00fba8"
c8d701e355e383137f6624529f51a69a8c00fba8 127.0.0.1:6703@16703,,tls-port=0,shard-id=e4f8949f01637a67c70f747ba1e710f2ccffb844 master - 0 1742243123451 4 connected 9830-13106
44aba9ff402bc0c1c5c30f9b4ed5cbe09257d03a 127.0.0.1:6708@16708,,tls-port=0,shard-id=e4f8949f01637a67c70f747ba1e710f2ccffb844 slave c8d701e355e383137f6624529f51a69a8c00fba8 0 1742243127588 4 connected

Switches back to c8dc...

$ cat node0/cluster.nodes.conf | grep -E "44aba9ff402bc0c1c5c30f9b4ed5cbe09257d03a|c8d701e355e383137f6624529f51a69a8c00fba8"
c8d701e355e383137f6624529f51a69a8c00fba8 127.0.0.1:6703@16703,,tls-port=0,shard-id=c8dcc477811f4e4c6b88b0c68031ee1cddfbe764 master - 0 1742243127000 4 connected 9830-13106
44aba9ff402bc0c1c5c30f9b4ed5cbe09257d03a 127.0.0.1:6708@16708,,tls-port=0,shard-id=c8dcc477811f4e4c6b88b0c68031ee1cddfbe764 slave c8d701e355e383137f6624529f51a69a8c00fba8 0 1742243130000 4 connected

This just keeps repeating.
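To compare two snapshots of a node's line programmatically rather than by eye, a small helper can pull out the shard-id field. This is only a sketch: `extract_shard_id` and `shard_id_changed` are hypothetical names, not Redis functions, and the parsing assumes the `shard-id=` key appears in the line as it does in the output above, terminated by a comma, whitespace, or end of line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Redis): copy the value of the
 * "shard-id=" field from one cluster nodes line into out.
 * Returns 0 on success, -1 if the field is missing, empty, or does
 * not fit in out_size bytes including the terminating NUL. */
static int extract_shard_id(const char *line, char *out, size_t out_size) {
    static const char key[] = "shard-id=";
    assert(line != NULL && out != NULL);

    const char *start = strstr(line, key);
    if (start == NULL) return -1;
    start += sizeof(key) - 1;

    /* Assumption: the value ends at a comma, whitespace, or end of line. */
    size_t len = strcspn(start, ", \t\r\n");
    if (len == 0 || len >= out_size) return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}

/* Returns 1 if the two lines carry different shard-ids, 0 if they carry
 * the same one, and -1 if either line has no usable shard-id field. */
static int shard_id_changed(const char *before, const char *after) {
    char a[64], b[64];
    if (extract_shard_id(before, a, sizeof(a)) != 0) return -1;
    if (extract_shard_id(after, b, sizeof(b)) != 0) return -1;
    return strcmp(a, b) != 0;
}
```

Calling `shard_id_changed` on the same node's line from two consecutive reads of node0/cluster.nodes.conf returns 1 each time the flip described above has occurred between the reads.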

Expected behavior

Once the cluster is up, the shard-id should not keep changing in the cluster nodes file. The flip-flopping causes excessive updates to the file.

Additional information

create_cluster.sh.gz

The script can be run from the directory containing the redis-server and redis-cli binaries. It creates nodeX directories for 10 nodes, then runs redis-cli to create the cluster and add the replicas to the primaries.

Comment From: sundb

@jdork0 Thanks. It was introduced by https://github.com/redis/redis/pull/13468; I'll check it.

Comment From: jdork0

@sundb I've been testing with this else branch removed from updateShardId, and it looks promising.

//        } else {
//            clusterNode *masternode = node->slaveof;
//            if (memcmp(masternode->shard_id, shard_id, CLUSTER_NAMELEN) != 0)
//                assignShardIdToNode(masternode, shard_id, CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
        }

If I understand correctly, with this change shard-id updates could only come from the master of the shard.

Comment From: sundb

@jdork0 You're right, and I made the same fix. The master and slave generate different shard-ids at the beginning, and if they are allowed to update each other, two sets of shard-ids keep circulating in the cluster. I'm still checking whether this fix causes a regression, so we need to be careful how we handle it. Do you want to make a PR to fix it?
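
The "two sets of shard-ids" effect can be illustrated with a toy model. This is a deliberately simplified sketch, not the Redis implementation: all names (`toy_shard`, `toy_exchange`, `toy_rounds_to_converge`) are hypothetical, ids are shortened, and it assumes the master's and replica's messages cross in flight, so each side applies the id the other held before the round.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_ID_LEN 8 /* hypothetical; real shard-ids are longer */

/* Toy state: the shard-id currently held by a master and by its replica. */
typedef struct {
    char master_id[TOY_ID_LEN + 1];
    char replica_id[TOY_ID_LEN + 1];
} toy_shard;

/* Bounded copy that always NUL-terminates. */
static void toy_set(char *dst, const char *src) {
    strncpy(dst, src, TOY_ID_LEN);
    dst[TOY_ID_LEN] = '\0';
}

/* One round in which both messages cross: each side sees the id the
 * other held BEFORE the round. The replica always adopts the master's
 * id; the master adopts the replica's id only if the flag is set
 * (the flag stands in for the else branch discussed above). */
static void toy_exchange(toy_shard *s, bool replica_may_update_master) {
    char from_master[TOY_ID_LEN + 1];
    char from_replica[TOY_ID_LEN + 1];
    toy_set(from_master, s->master_id);
    toy_set(from_replica, s->replica_id);

    toy_set(s->replica_id, from_master);
    if (replica_may_update_master) toy_set(s->master_id, from_replica);
}

/* Returns the number of rounds until both sides hold the same id,
 * or -1 if they still disagree after max_rounds exchanges. */
static int toy_rounds_to_converge(const char *master_id, const char *replica_id,
                                  bool replica_may_update_master, int max_rounds) {
    assert(max_rounds >= 0);
    toy_shard s;
    toy_set(s.master_id, master_id);
    toy_set(s.replica_id, replica_id);

    for (int round = 0; ; round++) {
        if (strcmp(s.master_id, s.replica_id) == 0) return round;
        if (round == max_rounds) return -1;
        toy_exchange(&s, replica_may_update_master);
    }
}
```

In this model, starting from ids "c8dc" and "e4f8", allowing updates in both directions just swaps the two ids every round and never converges, while master-only updates converge after one round. Real gossip timing is more complex, so this only shows why mutual updates can keep two ids alive.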