I've encountered an issue during a manual failover in a Redis cluster, where the cluster state seen by some nodes becomes incorrect.

The normal scenario for a failover is as follows:

sequenceDiagram
actor user
user->>NodeA: redis-cli cluster failover
NodeA->>NodeB: manual failover start
activate NodeB
NodeB->>NodeA: ping with offset
NodeB->>NodeA: ping with offset
NodeA->>NodeC: auth failover
NodeC->>NodeA: vote
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: master, 0-0<br/>NodeC: master, 101-200
NodeA->>NodeB: PONG, i'm master, slot 0-100
Note over NodeB: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
deactivate NodeB
NodeA->>NodeC: PONG, i'm master, slot 0-100
Note over NodeC: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: master, 0-0<br/>NodeC: master, 101-200
NodeB->>NodeA: PONG, i'm replica
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
NodeB->>NodeC: PONG, i'm replica
Note over NodeC: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200

However, if NodeA's message to NodeC is delayed due to network latency or other reasons, the state viewed by NodeC becomes incorrect:

sequenceDiagram
actor user
user->>NodeA: redis-cli cluster failover
NodeA->>NodeB: manual failover start
activate NodeB
NodeB->>NodeA: ping with offset
NodeB->>NodeA: ping with offset
NodeA->>NodeC: auth failover
NodeC->>NodeA: vote
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: master, 0-0<br/>NodeC: master, 101-200
NodeA->>NodeB: PONG, i'm master, slot 0-100
Note over NodeB: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
deactivate NodeB
NodeB->>NodeA: PONG, i'm replica
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
NodeB->>NodeC: PONG, i'm replica
Note over NodeC: cluster nodes<br/>NodeA: replica of NodeB<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
Note over NodeA: delayed some reason...
NodeA->>NodeC: PONG, i'm master, slot 0-100
Note over NodeC: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200

In this case, NodeC sees NodeA and NodeB as replicas of each other (circular replication), and the slots they were serving appear to have no owner. This state persists until NodeA's PONG reaches NodeC. The situation can easily be reproduced by dropping packets from NodeA to NodeC with iptables.

I propose delaying the transition into this incorrect state when such a status change is detected. Specifically, if a sender claims to have become a replica while it still owns slots and its claimed new master is, in the local view, a replica of that sender, then processing the sender's demotion should be delayed. This approach can prevent the temporary circular replication and slot loss, and may also avoid related problems (eg: https://github.com/redis/redis/pull/10489#issuecomment-1728593084 , https://github.com/lettuce-io/lettuce-core/issues/2578), though I'm not sure about those.
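
For illustration, a minimal sketch of the condition described above, using simplified stand-in structs rather than the real cluster.c types (`should_defer_demotion` and the `node` struct are hypothetical names, not actual Redis code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified local view of a node; not the real clusterNode struct. */
typedef struct node {
    const char *name;
    struct node *replica_of;   /* NULL if we currently see it as a master */
    int owned_slots;           /* number of slots we believe it owns */
} node;

/* True if applying "sender is now a replica of claimed_master" would put us
 * in the broken intermediate state above: the sender still owns slots and,
 * in our current view, its claimed master is itself a replica of the sender. */
bool should_defer_demotion(const node *sender, const node *claimed_master) {
    if (sender->owned_slots == 0) return false;     /* nothing would be lost */
    if (claimed_master == NULL) return false;       /* no master claimed */
    return claimed_master->replica_of == sender;    /* would become circular */
}

int main(void) {
    /* NodeC's view just before B's PONG arrives: B is the master of 0-100,
     * and A is still recorded as B's replica. */
    node b = { "NodeB", NULL, 101 };
    node a = { "NodeA", &b, 0 };

    /* B's PONG claims "I'm a replica of A": defer instead of applying it. */
    return should_defer_demotion(&b, &a) ? 0 : 1;
}
```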

The proposed behavior does delay the transition of a master to a replica in the event of a network partition. However, the scenario where a node learns that the old master has become a replica before it learns that the new master has been promoted is very rare and unlikely to occur in most situations. Additionally, experiencing one or two extra MOVED redirects because of the delay is safer than not being able to find any node serving the slots at all.

If you need any further explanation or details about the situation, please let me know.

Comment From: zuiderkwast

So you want B to delay sending "PONG i'm replica" to C? Or should C delay the handling of the message? Should the delay be configurable or a fixed number of seconds? Another option is that C can simply ignore the PONG from B if it creates a circular dependency?

@madolson @PingXie @srgsanky Are you familiar with this scenario?

Comment From: MagicalLas

I meant that C would ignore the PONG from B. I don't know whether this is an accurate and proper implementation, but I thought about ignoring it by adding a check for a circular reference, like this: https://github.com/MagicalLas/redis/commit/49d8cd95cd2f98aae292d9e631ae66bd9345dcf6
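
Such a circular-reference check could look roughly like the following (a simplified sketch with stand-in types, not the code in the commit above): before applying the sender's demotion, walk the replica chain from the claimed master in the current local view and skip the role change if the chain leads back to the sender.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the locally cached node record. */
typedef struct node {
    struct node *replica_of;   /* NULL if we currently see it as a master */
} node;

/* True if recording "sender is now a replica of claimed_master" would create
 * a replication cycle in our local view. The hop limit bounds the walk even
 * if the view is already inconsistent. */
bool would_create_cycle(const node *sender, const node *claimed_master) {
    const node *cur = claimed_master;
    for (int hops = 0; cur != NULL && hops < 128; hops++) {
        if (cur == sender) return true;   /* chain loops back to the sender */
        cur = cur->replica_of;
    }
    return false;
}

int main(void) {
    node b = { NULL };   /* C's view: B is a master */
    node a = { &b };     /* C's view: A replicates B */
    /* B claims to be a replica of A -> that would be A <-> B, so ignore it. */
    return would_create_cycle(&b, &a) ? 0 : 1;
}
```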

Comment From: srgsanky

With this scenario, the clients that receive the cluster nodes or cluster slots response from Node C will

  1. See that no one is serving the given slot range.
  2. Go into a (local) infinite loop trying to find the primary of a replica unless this case is handled, as in the sketch below. (I am not sure whether this is a valid use case for a client if we don't know the slots served by a shard.)
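
A minimal sketch of the guard hinted at in item 2, using hypothetical stand-in types rather than any real client library: bound the replica-to-primary chase over the cached topology so a temporarily circular view fails fast (and triggers a topology refresh) instead of spinning forever.

```c
#include <stddef.h>

/* Stand-in for a client's cached view of one cluster node. */
typedef struct topo_node {
    struct topo_node *primary;   /* NULL if the cached view says it's a primary */
} topo_node;

/* Follow replica -> primary links from `start`; return the primary, or NULL
 * if the cached view is inconsistent (e.g. NodeA and NodeB pointing at each
 * other), in which case the client should refresh the topology rather than
 * retry in a tight loop. */
topo_node *resolve_primary(topo_node *start) {
    int hops = 0;
    for (topo_node *cur = start; cur != NULL; cur = cur->primary) {
        if (cur->primary == NULL) return cur;   /* reached a primary */
        if (++hops > 16) return NULL;           /* cycle or garbage view */
    }
    return NULL;
}

int main(void) {
    topo_node a = { NULL }, b = { NULL };
    a.primary = &b;                               /* A says its primary is B... */
    b.primary = &a;                               /* ...and B says its primary is A */
    return resolve_primary(&a) == NULL ? 0 : 1;   /* cycle detected, no infinite loop */
}
```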

On Node C, the cluster state will go to FAIL since not all slots are covered (assuming the default cluster-require-full-coverage=yes).

At a high level, I think it makes sense to delay handling a switch from PRIMARY->REPLICA when a node doesn't yet know the new PRIMARY in the shard. We may also want to switch the roles of the other nodes in the shard to replica at the same time, to avoid multi-PRIMARY cases. I haven't given enough thought to other failure modes where this could be harmful.

Comment From: PingXie

Delaying isn't a true fix. It makes it less likely to end up without a primary, but it doesn't fully eliminate the problem, and it might have new implications that I find hard to reason about. Also, the other state of having two primaries ("NodeA: master, 0-100; NodeB: master, 0-0") isn't consistent either; it just happens to be the case that this inconsistency is "harmless", which is an implementation coincidence IMO.

The issue at the core is that the observer nodes are seeing intermediate states within a supposedly atomic/isolated transaction (of A taking over B's primary-ship). This is equivalent to the money-transfer case where the two operations, both withdraw and deposit, need to be done in an ACID transaction to have an externally consistent sum view throughout the entire transaction.

For a more deterministic solution within the current cluster design, we might want to consider the primary node information in the cluster message header. More specifically, in the second case above, B's PONG message to C ("PONG, i'm replica") should also claim A is its primary now (clusterMsg->slaveOf). Additionally, B's config epoch should be bumped to A's too. Therefore, C should in theory have all the information needed to deduce that B has lost its primary-ship to A.

However, what is not known at this stage is whether A still owns the same set of slots as B. Imagine a case where the shard to which both A and B belong is going through a slot migration operation while a manual failover is triggered at the same time. The current cluster implementation doesn't guarantee that the two nodes, A and B, will have a consistent view of the slot ranges, because with cluster V1 we don't go through consensus when finalizing the slot ownership transfer, and the transfer is not "highly available" even if a replica exists. My PR #10517 should help with the second issue, but it won't address the consensus-less part. That said, I still think this is a more permanent solution than delaying.
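
To make the deduction concrete, here is a rough sketch with simplified stand-in types (not the real clusterMsg handling; `apply_demotion` is a hypothetical name): when the sender's PONG demotes it and names its new primary with a high enough config epoch, hand the sender's slots to that primary in the same step, so the slot range never goes unowned in the observer's view. This assumes the new primary inherits exactly the sender's slots, which, as noted, is not guaranteed during a concurrent slot migration.

```c
#include <stdint.h>

#define NUM_SLOTS 16384

/* Stand-in for the locally cached node record on the observer (NodeC). */
typedef struct node {
    uint64_t config_epoch;
    struct node *replica_of;        /* NULL if we see it as a primary */
    unsigned char slots[NUM_SLOTS]; /* 1 if we believe this node owns the slot */
} node;

/* Apply "sender is now a replica of new_primary", as claimed by the sender's
 * message header (its slaveof field and config epoch). */
void apply_demotion(node *sender, node *new_primary, uint64_t msg_config_epoch) {
    /* Only act on information at least as fresh as what we already have. */
    if (msg_config_epoch < sender->config_epoch) return;

    /* Move the sender's slots to its claimed primary before demoting it,
     * so no slot is left without an owner in our view. */
    for (int slot = 0; slot < NUM_SLOTS; slot++) {
        if (sender->slots[slot]) {
            sender->slots[slot] = 0;
            new_primary->slots[slot] = 1;
        }
    }

    new_primary->replica_of = NULL;             /* a primary in our view now */
    if (new_primary->config_epoch < msg_config_epoch)
        new_primary->config_epoch = msg_config_epoch;

    sender->replica_of = new_primary;
    sender->config_epoch = msg_config_epoch;
}

int main(void) {
    node b = { 5, NULL, {0} };       /* C's view: B is a primary, epoch 5 */
    node a = { 3, &b, {0} };         /* C's view: A replicates B */
    b.slots[10] = 1;                 /* B owns slot 10 */
    apply_demotion(&b, &a, 6);       /* B's PONG: "I'm a replica of A", epoch 6 */
    return (a.slots[10] == 1 && b.replica_of == &a) ? 0 : 1;
}
```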

Of course, the real solution would be to enforce strong consistency on any topology changes, which is one of the core attributes of cluster V2.

Comment From: zuiderkwast

@PingXie Are you saying #10517 solves the same problem? It means it's another motivation for getting it merged?

As for the current cluster implementation, I believe we need to continue improving it and prefer "harmless inconsistency" to avoid worse problems. (Cluster V2 sounds good, but I haven't seen it yet. I believe in it only when I see it.)

Comment From: PingXie

> Are you saying https://github.com/redis/redis/pull/10517 solves the same problem? It means it's another motivation for getting it merged?

No it doesn't solve the core issue. It helps improve a corner case only. The solution that I am thinking about is based on the following observation

""" More specifically, in the second case above, B's PONG message to C ("PONG, i'm replica") should already claim A is its primary now (clusterMsg->slaveOf). Additionally, B's config epoch should be bumped to A's too. Therefore, C should in theory have all the information needed to deduce that B has lost its primary-ship to A. """

> Cluster V2 sounds good, but I haven't seen it yet. I believe in it only when I see it.

Fair point :-)

Comment From: zuiderkwast

> Therefore, C should in theory have all the information needed to deduce that B has lost its primary-ship to A.

Great, so we can already implement this case and assume B has the same set of slots as A. Let's do that, shall we?

> However, what is not known at this stage is whether A still owns the same set of slots as B.

For that we'd need #10517. In the meantime, I think we can accept the "harmless inconsistency" that we get some redirects before the right node is found. As long as the cluster converges when A pongs C, this is an improvement.

Comment From: PingXie

> Great, so we can already implement this case and assume B has the same set of slots as A. Let's do that, shall we?

I proposed a fix at #13055.