A very peculiar bug, albeit easy to replicate, found in some in-house tests for production

I have checked it on 3.0.5 and on the latest build from the 3.0 branch on GitHub. Steps to reproduce are below:

build a cluster with 6 nodes and replication turned on

./redis-trib.rb create --replicas 1 192.168.10.25:8000 192.168.10.25:8001 192.168.10.25:8002 192.168.10.25:8003 192.168.10.25:8004 192.168.10.25:8005

the cluster will have 3 masters and 3 slaves, one slave replicating each of the masters

192.168.10.25:8004> cluster nodes
74efdfbbacd99745a27d43aabce947d80d3a9051 192.168.10.25:8002 master - 0 1454072496887 3 connected 10923-16383
6d367efab8a48baf7d1c0e924049e86099dbb272 192.168.10.25:8000 master - 0 1454072495885 1 connected 0-5460
478e1f5a49363ff8a78dd4192cee0389c9344763 192.168.10.25:8003 slave 6d367efab8a48baf7d1c0e924049e86099dbb272 0 1454072497888 4 connected
bcb8b8d4b4349860fc2d7fea2a0d99e07d12ab7a 192.168.10.25:8005 slave 74efdfbbacd99745a27d43aabce947d80d3a9051 0 1454072497088 6 connected
532b72609d8264caeab21fc39bb380b98b81cc34 192.168.10.25:8004 myself,slave 07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 0 0 5 connected
07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 192.168.10.25:8001 master - 0 1454072497389 2 connected 5461-10922

  • slots info

192.168.10.25:8004> cluster slots
1) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "192.168.10.25"
      2) (integer) 8002
   4) 1) "192.168.10.25"
      2) (integer) 8005
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "192.168.10.25"
      2) (integer) 8000
   4) 1) "192.168.10.25"
      2) (integer) 8003
3) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "192.168.10.25"
      2) (integer) 8001
   4) 1) "192.168.10.25"
      2) (integer) 8004

migrate all slots of a particular master to another master (in this case from the node on port 8000, which owns all 5461 slots in the range 0-5460, to the node on port 8001)

./redis-trib.rb reshard --from 6d367efab8a48baf7d1c0e924049e86099dbb272 --to 07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 --slots 5461 --yes 192.168.10.25:8001

the slave of the original slot holder also migrates

  • slot info

1) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "192.168.10.25"
      2) (integer) 8002
   4) 1) "192.168.10.25"
      2) (integer) 8005
2) 1) (integer) 0
   2) (integer) 10922
   3) 1) "192.168.10.25"
      2) (integer) 8001
   4) 1) "192.168.10.25"
      2) (integer) 8004
   5) 1) "192.168.10.25"
      2) (integer) 8003

  • node info (the node on port 8001 now has 2 slaves and the one on port 8000 is left without slaves)

74efdfbbacd99745a27d43aabce947d80d3a9051 192.168.10.25:8002 master - 0 1454072791540 3 connected 10923-16383
6d367efab8a48baf7d1c0e924049e86099dbb272 192.168.10.25:8000 master - 0 1454072792041 1 connected
478e1f5a49363ff8a78dd4192cee0389c9344763 192.168.10.25:8003 slave 07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 0 1454072792542 7 connected
bcb8b8d4b4349860fc2d7fea2a0d99e07d12ab7a 192.168.10.25:8005 slave 74efdfbbacd99745a27d43aabce947d80d3a9051 0 1454072789534 6 connected
532b72609d8264caeab21fc39bb380b98b81cc34 192.168.10.25:8004 myself,slave 07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 0 0 5 connected
07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 192.168.10.25:8001 master - 0 1454072790538 7 connected 0-10922

migrate all the slots back

./redis-trib.rb reshard --from 07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 --to 6d367efab8a48baf7d1c0e924049e86099dbb272 --slots 5461 --yes 192.168.10.25:8001

the slave doesn't migrate back

  • slot info

1) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "192.168.10.25"
      2) (integer) 8002
   4) 1) "192.168.10.25"
      2) (integer) 8005
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "192.168.10.25"
      2) (integer) 8000
3) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "192.168.10.25"
      2) (integer) 8001
   4) 1) "192.168.10.25"
      2) (integer) 8004
   5) 1) "192.168.10.25"
      2) (integer) 8003

  • node info

74efdfbbacd99745a27d43aabce947d80d3a9051 192.168.10.25:8002 master - 0 1454072917769 3 connected 10923-16383
6d367efab8a48baf7d1c0e924049e86099dbb272 192.168.10.25:8000 master - 0 1454072917769 8 connected 0-5460
478e1f5a49363ff8a78dd4192cee0389c9344763 192.168.10.25:8003 slave 07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 0 1454072918772 7 connected
bcb8b8d4b4349860fc2d7fea2a0d99e07d12ab7a 192.168.10.25:8005 slave 74efdfbbacd99745a27d43aabce947d80d3a9051 0 1454072916766 6 connected
532b72609d8264caeab21fc39bb380b98b81cc34 192.168.10.25:8004 myself,slave 07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 0 0 5 connected
07598e66b97494e18dd55ce3c8cd44d6ace0a2c0 192.168.10.25:8001 master - 0 1454072916766 7 connected 5461-10922

If I then migrate all slots away from the remaining master and migrate them all back, the cluster ends up with one master holding all 3 slaves and the other 2 masters with no slaves, even though the slots are still equally distributed.

I can't say definitively what the expected behaviour should be, but logically either:

  • there should be no mechanism that detects "all slots have migrated away, so let's migrate the slave too", or
  • if such a mechanism is in place, then when the original master gets some/all of its slots back it should also get a slave (not necessarily the original one), provided there are enough slaves.

NOW THE PECULIAR BITS.. :)

These I found after spending some more time trying to understand the issue; they may be useful for a correct/clear analysis.

The issue does not exist on 3.0.5 (with redis-trib.rb from 3.0.5). The slave does not migrate in the first place (even if the master loses all its slots), but this code exists:

in function clusterUpdateSlotsConfigWith()

    /* If at least one slot was reassigned from a node to another node
     * with a greater configEpoch, it is possible that:
     * 1) We are a master left without slots. This means that we were
     *    failed over and we should turn into a replica of the new
     *    master.
     * 2) We are a slave and our master is left without slots. We need
     *    to replicate to the new slots owner. */
    if (newmaster && curmaster->numslots == 0) {
        redisLog(REDIS_WARNING,
            "Configuration change detected. Reconfiguring myself "
            "as a replica of %.40s", sender->name);
        clusterSetMaster(sender);
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
    }

The issue happens on the latest build from the 3.2 branch (with redis-trib.rb from the latest of the 3.2 branch). The slave migrates when all the slots are moved away, but does not migrate back when the slots are moved back, even though this code exists:

in function clusterCron()

        /* Orphaned master check, useful only if the current instance
         * is a slave that may migrate to another master. */
        if (nodeIsSlave(myself) && nodeIsMaster(node) && !nodeFailed(node)) {
            int okslaves = clusterCountNonFailingSlaves(node);

            /* A master is orphaned if it is serving a non-zero number of
             * slots, have no working slaves, but used to have at least one
             * slave, or failed over a master that used to have slaves. */
            if (okslaves == 0 && node->numslots > 0 &&
                node->flags & REDIS_NODE_MIGRATE_TO)
            {
                orphaned_masters++;
            }
            if (okslaves > max_slaves) max_slaves = okslaves;
            if (nodeIsSlave(myself) && myself->slaveof == node)
                this_slaves = okslaves;
        }
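
For context, a bit further down in clusterCron() these counters feed the actual migration decision. As far as I can tell from the 3.2 sources it is roughly the following (paraphrased from memory, so possibly not verbatim):

        if (nodeIsSlave(myself)) {
            /* If there are orphaned masters, and we are one of the slaves of
             * the master with the maximum number of working slaves, consider
             * migrating to one of the orphaned masters. */
            if (orphaned_masters && max_slaves >= 2 && this_slaves == max_slaves)
                clusterHandleSlaveMigration(max_slaves);
        }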

The most peculiar bit: the difference in behaviour comes from an update in redis-trib.rb.

The move-slot flow is:

  • set the slot to receiving (importing) in the destination
  • set the slot to migrating in the source
  • the actual "cluster setslot <slot> node <node-id>" on all nodes, or only on the master nodes <<-- this is the difference

If the final "cluster setslot <slot> node <node-id>" is done only on the master nodes, the slave migrates but does not migrate back. If it is done on all nodes, the slave does not migrate in the first place.

The above is consistent from 3.0.5 onward.. :-)

Executing the final setslot only on masters was introduced somewhere after 3.0.6:

in redis-trib.rb
    move_slot...

        # Set the new node as the owner of the slot in all the known nodes.
        if !o[:cold]
            @nodes.each{|n|
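                # Slaves are skipped here: this guard is the post-3.0.6 change discussed above.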
                next if n.has_flag?("slave")
                n.r.cluster("setslot",slot,"node",target.info[:name])
            }
        end

I am guessing that both the migration and the migration back should happen as per the server code, but the eventual percolation of the info across the cluster doesn't happen, or gets overridden by older info; but I am just guessing..!

Hope the above is of some use in resolving this.

I believe the system info is not very relevant; all the config details are below (only the port changes per node):

port 8000
dir ./
bind 192.168.10.25
dbfilename redis-2-0.rdb
pidfile ./rdbredis-2-0.pid
logfile ./rdbredis-2-0.log
syslog-ident test-db1
daemonize yes
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 7000
tcp-backlog 511
timeout 0
tcp-keepalive 0
slave-serve-stale-data yes
slave-read-only no
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
appendonly no
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes

Comment From: antirez

Hello, this is very useful, thanks. I'm investigating the bug right now, news ASAP.

Comment From: antirez

Initial analysis:

1. The slave migrates when it sees that all of its master's slots are captured by another master (for example during a failover). Now that we no longer send SETSLOT NODE to the slaves, the information is propagated like in a failover on reshardings as well, so the slave migrates. And this is a good idea after all, and hard to avoid anyway, since there is no easy way to differentiate between losing slots because of a failover and losing them because of a migration.

2. We could force the slave to migrate always, even when SETSLOT NODE is used, by detecting that the master dropped to zero slots in favor of another master, as we do when the configuration is updated via PING/UPDATE packets, so that the slave behavior is consistent whatever redis-trib does (a rough sketch is below). This is probably a good idea, for consistency.

3. The reason why the slave does not migrate back is that replica migration is not triggered. Why? Because replica migration has a rule: only migrate to masters that used to have slaves in the past, or to masters created by a slave promotion (a failover). When we remove the slots from a given master, it remains without slaves, and the slave was conceptually moved away because of the reconfiguration. So the information that the master used to have slaves is lost.
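
For point 2, a rough sketch of what I have in mind inside the CLUSTER SETSLOT <slot> NODE <nodeid> handler, mirroring the clusterUpdateSlotsConfigWith() branch quoted above (just a sketch, not tested, where 'n' is assumed to be the node resolved from the <nodeid> argument):

    /* Hypothetical sketch, not the actual patch: after the slot has been
     * assigned to node 'n' via SETSLOT, mimic what we already do when the
     * configuration is updated via PING/UPDATE packets. */
    clusterNode *curmaster = nodeIsMaster(myself) ? myself : myself->slaveof;

    if (n != curmaster && curmaster->numslots == 0) {
        redisLog(REDIS_WARNING,
            "All slots lost to %.40s via SETSLOT. Reconfiguring myself "
            "as its replica.", n->name);
        clusterSetMaster(n);
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
    }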

How to fix 3? There are two solutions, basically.

Solution A: When a master gets slots, if the other masters in the cluster have slaves (at least one master has a slave, actually), we could assume this new master must have slaves as well, so we set the MIGRATE_TO flag. If there are spare slaves, they'll migrate.
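
A minimal sketch of how A could look (the helper name is hypothetical, and the place where the flag gets set, i.e. when a master 'n' gains its first slot, is just an assumption, not a final patch):

    /* Hypothetical helper: non-zero if at least one master in the cluster
     * currently has a working slave attached. */
    int clusterSomeMasterHasSlaves(void) {
        dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
        dictEntry *de;
        int found = 0;

        while((de = dictNext(di)) != NULL) {
            clusterNode *node = dictGetVal(de);
            if (nodeIsMaster(node) && node->numslaves > 0) {
                found = 1;
                break;
            }
        }
        dictReleaseIterator(di);
        return found;
    }

    /* Sketch: when master 'n' goes from zero slots to one slot, flag it as
     * a valid replica migration target if the rest of the cluster already
     * uses slaves. If there are spare slaves, one will migrate to it. */
    if (n->numslots == 1 && clusterSomeMasterHasSlaves())
        n->flags |= REDIS_NODE_MIGRATE_TO;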

Solution B: We never remove the MIGRATE_TO flag from the original master, except when a CLUSTER RESET is received by the node.

However, when solution B is used, if we get a fresh master that was never part of the cluster, and move slots to it, it will not get a slave, even if there are spare slaves available. Instead with solution A, it gets slaves.

Maybe solution B is to be preferred after all, since it is simpler to describe and more consistent: replica migration only targets masters that used to have slaves and are considered orphaned. So the system administrator has the power to set up a master without slaves.

However, at the same time, one could argue that solution A is better because it is more automatic, and new masters will get slaves allocated automatically if there are enough. One could also argue that it's a very strange use case to add a master and assign it new slots without also wanting it to have slaves.

Comment From: irfanurrehman

Hi, thanks for having a look and for the super quick analysis. I too somehow prefer solution B, because it looks simpler to understand and implement (probably :) )

Meanwhile, as a different thought: could something be done about the original "master had slaves" info getting lost, so that it doesn't get lost in this scenario? It's not really a cluster reset per se, more of an update, isn't it? Or maybe solution B would eventually do the same thing. You are a better judge; I just hope that it gets resolved soon.. Thanks again.

Comment From: shaharmor

I think that solution A is better, as it allows for more automatic management of slaves.

Comment From: ramonsnir

+1 for Solution A

Comment From: antirez

Implementing A. I was unsure between the two, but given the feedback... A :-)

Comment From: irfanurrehman

cool, I just wish for a solution soon, either way.. :) :+1:

Comment From: antirez

Sure! Just fixed, now testing the fix and writing a regression test, and it will be merged in all the branches.

Comment From: irfanurrehman

thanks a ton!!

Comment From: irfanurrehman

just one more thing: I also raised a couple of other issues (this issue is referenced in them) which are quite related; do you think they might also get fixed by the fix you have done now.. ?

Comment From: oranagra

Fixed in Redis 3.2.