The current design of the cluster has problems dealing with a master node crash during slot migration. Some notes about the current design first:
1. The importing flag and the migrating flag are local to the master node.
2. When slot distribution is propagated through gossip, the owner of a slot is the only source that can spread that information.
3. The epoch cannot carry enough information to resolve a config conflict between nodes from different 'slices'; it is only suitable for resolving conflicts inside the same 'slice'.
More explanation about 2 & 3:
Suppose we are migrating slot x from A to B and have called cluster setslot x node {B-id} on all master nodes (slave nodes reject this command). If B then crashes before it has pinged any of its slave nodes, one of those slaves gets promoted after a failover. The new B will never know that it owns slot x, because the old B was the single point of failure that could spread that information.
The epoch is similar to the term in the Raft protocol; it is useful for leader election. I call a master node plus its slave nodes a 'slice'. A conflict looks like this: node B thinks slot x belongs to node C, while node A thinks slot x belongs to node A itself; when node A pings node B, node B notices the conflict. If C and A belong to the same slice, this is a conflict within the same slice; otherwise it is a conflict between different slices.
A conflict between different slices can't be resolved simply by comparing epochs. Suppose we are migrating slot x from A to B, and node A crashes just after we have called cluster setslot x node {B-id} on node B. The new A still thinks it owns slot x (due to problem 1 above), so the conflict here is between two different slices. The new A may have a bigger epoch than B (after B bumps its epoch locally), or a smaller one. Either way, the correct owner of slot x is B; that does not depend on who has the bigger epoch. So the epoch-based conflict resolution algorithm is completely broken here.
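For reference, the slot migration sequence being discussed looks roughly like this (using the notation above; <B-host>, <B-port>, the batch size and timeout are placeholders, and the key-moving loop is abbreviated):
# On the target B: mark slot x as importing from A
CLUSTER SETSLOT x IMPORTING {A-id}
# On the source A: mark slot x as migrating to B
CLUSTER SETSLOT x MIGRATING {B-id}
# On A, repeated until slot x is empty: fetch a batch of keys and move them to B
CLUSTER GETKEYSINSLOT x 100
MIGRATE <B-host> <B-port> "" 0 5000 KEYS key1 key2 ...
# Finalize: assign slot x to B, on B and A first and then on the other masters
CLUSTER SETSLOT x NODE {B-id}
The importing and migrating flags set in the first two steps exist only in the memory of B and A respectively, which is problem 1 above; the scenarios in this thread are about one of these nodes crashing part-way through this sequence.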
Comment From: CMGS
We are facing the same issue. Any response?
Comment From: trevor211
It's a pity that the current design only keeps the importing and migrating slot status on masters. Maybe we could improve this. @oranagra
Comment From: zuiderkwast
We have experienced this issue and we think this is a serious one. Can we prioritize it?
An entire test suite seems to be disabled due to this issue: https://github.com/redis/redis/blob/7.0-rc1/tests/cluster/tests/21-many-slot-migration.tcl#L3
# TODO: Test is currently disabled until it is stabilized (fixing the test
# itself or real issues in Redis).
# TODO: This test currently runs without replicas, as failovers (which may
# happen on lower-end CI platforms) are still not handled properly by the
# cluster during slot migration (related to #6339).
Comment From: zuiderkwast
I've figured out that redis-cli --cluster fix can repair most of these situations.
It does this by sending CLUSTER NODES and CLUSTER COUNTKEYSINSLOT to all the nodes, finding the most likely slot owner, finding slot migrations that can be resumed, etc. The problem is that it has to be run manually and it's based on best effort (guesswork).
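Roughly, the same diagnosis can be done by hand with the commands the fix subcommand relies on (host, port and slot number are placeholders):
# Each node's view of slot ownership plus its local migrating/importing flags
redis-cli -h <host> -p <port> CLUSTER NODES
# How many keys each candidate owner actually holds for the disputed slot
redis-cli -h <host> -p <port> CLUSTER COUNTKEYSINSLOT <slot>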
If we add a new command like CLUSTER MIGRATE (#2807) where Redis itself handles the migration (rather than a CLI tool), together with propagation of SETSLOT MIGRATING/IMPORTING to replicas, this command could hopefully resolve these conditions automatically without errors. Does that make sense?
Comment From: PingXie
I think it is possible to replicate the slot migration states (both importing_from and migrating_to) to replicas and maintain the migration relationship in either the source or target shard in the face of any failovers. Furthermore, I don't think this would require the nodes involved to broadcast the migration states to the entire cluster.
With a solution like this, when node B crashes after cluster setslot x node {B-id}, the new primary would either have seen the replicated cluster setslot x node {B-id} and assumed ownership, or at a minimum would know that slot x is still in the migrating state, in which case another cluster setslot x node {B'-id} would easily fix the problem.
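As a sketch (placeholder IDs, same notation as above), the repair after such a failover would then amount to re-issuing the finalization against the promoted node:
# B' was promoted while the migration was still open; with the migration state
# replicated, it either already owns slot x or an operator/tool re-finalizes it:
CLUSTER SETSLOT x NODE {B'-id}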
I will work on a code change next.
Comment From: zuiderkwast
@PingXie I had the same idea and I tried a small part of it (replicating only SETSLOT IMPORTING, checking the output of CLUSTER NODES, trying a failover to see that it's still in the importing state, etc.). Replicating SETSLOT seems to work and it would also be backward compatible: replicas running old Redis versions will just ignore SETSLOT (the error is ignored). This solution would allow redis-cli --cluster fix to continue and finish the migration after a failover.
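The manual experiment described above could look roughly like this (ports are placeholders, and it assumes SETSLOT IMPORTING is replicated as in the experiment):
# On the target primary B: start importing slot x from A
redis-cli -p <B-port> CLUSTER SETSLOT x IMPORTING {A-id}
# Trigger a manual failover from B's replica
redis-cli -p <B-replica-port> CLUSTER FAILOVER
# After the promotion, verify that the new primary still reports slot x as importing
redis-cli -p <B-replica-port> CLUSTER NODES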
During promotion to new master, a node could even do some of the redis-cli --cluster fix work itself, for example check if any other node owns the slot and sort out at least one of the race conditions in this issue: If the new master (B') is IMPORTING a slot and no other node is the owner, it can conclude that it must be the owner.
However, in general this cannot be automatic for all cases, and there can still be epoch collisions if a failover and a SETSLOT NODE happen at the same time. If we implement #2807 to use cluster consensus, I think it can be a more robust solution with better consistency guarantees.
Comment From: PingXie
During promotion to new master, a node could even do some of the redis-cli --cluster fix work itself, for example check if any other node owns the slot and sort out at least one of the race conditions in this issue: If the new master (B') is IMPORTING a slot and no other node is the owner, it can conclude that it must be the owner.
@zuiderkwast the lingering importing state is not an issue IMO if we replicate both importing and migrating states to replicas and always follow the order of finalizing the slot on the destination first. This leaves the source with the possibility of a lingering migrating state - if the source primary crashes while the destination is finalizing the slot migration. But because the migrating state is replicated to the source replica, the new source primary is still aware of the migrating slot and hence won't automatically assume official slot ownership.
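A sketch of that finalization order, in the thread's notation:
# 1. On B: finalize on the destination shard first; with replicated migration
#    states this also reaches B', so a destination failover cannot lose the slot.
CLUSTER SETSLOT x NODE {B-id}
# 2. On A, only afterwards: finalize on the source shard; until this happens A'
#    still sees the migrating flag, so after a failover it won't reclaim slot x.
CLUSTER SETSLOT x NODE {B-id}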
However, in general this cannot be automatic for all cases, and there can still be epoch collisions if a failover and a SETSLOT NODE happen at the same time. If we implement https://github.com/redis/redis/issues/2807 to use cluster consensus, I think it can be a more robust solution with better consistency guarantees.
I am not sure about the performance impact if we start asking for cluster consensus on the slot migration path, especially when working with a large cluster. In any case, I have a draft PR (#10517) that I think, without resorting to cluster consensus, could address both deficiencies called out in this issue: the lack of replication for migration states, and the race between source failover and destination finalizing slot migration. I think it would help improve the reliability of current slot migration significantly. Curious to know your thoughts.
Comment From: zuiderkwast
the lingering importing state is not an issue IMO if we replicate both importing and migrating states to replicas and always follow the order of finalizing the slot on destination first.
@PingXie There is a possibility that the target node (B) sends cluster pong to the source node (A) before SETSLOT has been replicated to B'. If B crashes here, B' will be in importing state and the slot will have no owner.
There is also a possibility that redis-cli sends SETSLOT to B and SETSLOT to A, and then B crashes before SETSLOT has been replicated to B'.
Both of these scenarios result in a lingering importing node and a slot without owner.
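Until this is handled inside Redis itself, the resulting state can at least be detected and repaired externally (host and port are placeholders):
# Reports uncovered slots and nodes left in importing/migrating state
redis-cli --cluster check <host>:<port>
# Best-effort repair, as discussed above
redis-cli --cluster fix <host>:<port>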
I am not sure about the performance impact if we start asking for cluster consensus on the slot migration path, especially when working with a large cluster.
That can probably be avoided if we can agree on Madelyn's comments on the CLUSTER MIGRATE issue.
Comment From: PingXie
Both of these scenarios result in a lingering importing node and a slot without owner.
I think these boil down to one key requirement, which is to ensure that B' sees the new slot ownership before A and A' do, and that is it. As you mentioned, my current implementation does have a hole, since there is no explicit ordering between the slot finalization on B' and the broadcasting to the rest of the cluster. Something like WAIT is what I am hoping to use to enforce the ordering. We'd also need to ensure B doesn't start broadcasting the new slot ownership before WAIT returns, which implies that B should finalize its slot after B'.
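A rough sketch of that ordering from B's side, assuming SETSLOT is replicated to replicas as proposed above (the numreplicas and timeout values are just examples, and whether WAIT itself would cover a replicated SETSLOT is an open implementation question):
# On B: finalize slot x; the change is written to B's replication stream
CLUSTER SETSLOT x NODE {B-id}
# On B: block until at least one replica (B') has acknowledged the change,
# and only then let B broadcast the new ownership and update A and A'
WAIT 1 1000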
That can probably be avoided if we can agree on Madelyn's comments on the CLUSTER MIGRATE issue.
If I understand @madolson's comments correctly, I think the core of her idea is along similar lines to what I explained above, which is to ensure that the slot is finalized in the order of B' followed by A and A'. I'd like to explore the possibility of introducing such an ordering on top of the existing slot migration protocol. The backward compatibility and forkless slot migration seem appealing.