The problem/use-case that the feature addresses
When a cluster client has subscribed to topology changes, slot mappings are pushed to the client as soon as a slot is moved. This lets the client avoid getting MOVED redirects and eliminates the need to poll using CLUSTER SLOTS.
Description of the feature
Usage scenario
- Client subscribes to slot changes using CLUSTER SUBSCRIBE SLOTS (returns OK).
- Whenever a slot is migrated, the new slot owner is pushed to the client in an out-of-band message where the relevant part has the same structure as one row in a CLUSTER SLOTS reply:

```
1) "cluster"
2) "slots"
3) 1) (integer) 5461
   2) (integer) 5461
   3) 1) "127.0.0.1"
      2) (integer) 30002
      3) "c9d93d9f2c0c524ff34cc11838c2003d8c29e013"
      4) 1) hostname
         2) "host-3.redis.example.com"
   4) 1) "127.0.0.1"
      2) (integer) 30005
      3) "faadb3eb99009de4ab72ad6b6ed87634c7ee410f"
      4) 1) hostname
         2) "host-4.redis.example.com"
```

- The client updates its internal slot mapping tables.
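To make the client side concrete, here is a minimal sketch of how a client might apply such a push to its routing table, assuming the pushed row has already been decoded into nested Python lists by the client's RESP3 layer (the function name and the table layout are hypothetical, not part of the proposal):

```python
# Hypothetical client-side handling of a pushed CLUSTER SLOTS-style row.
# `row` is assumed to be already decoded into nested Python lists.

def apply_slot_push(slot_map, row):
    """Update slot_map (dict: slot -> (host, port) of the master) from one row."""
    start, end, master = row[0], row[1], row[2]
    for slot in range(start, end + 1):
        slot_map[slot] = (master[0], master[1])

# Example using the pushed payload shown above.
slot_map = {}
apply_slot_push(slot_map, [
    5461, 5461,
    ["127.0.0.1", 30002, "c9d93d9f2c0c524ff34cc11838c2003d8c29e013",
     ["hostname", "host-3.redis.example.com"]],
    ["127.0.0.1", 30005, "faadb3eb99009de4ab72ad6b6ed87634c7ee410f",
     ["hostname", "host-4.redis.example.com"]],
])
assert slot_map[5461] == ("127.0.0.1", 30002)
```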
Implementation ideas
- A separate list of subscribed clients is stored in the cluster struct.
- Whenever a node becomes aware of a change, either from the cluster bus or from a SETSLOT ... NODE command, the change is pushed to the subscribed clients.
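The real implementation would of course live in the server's C code; the following is only an abstract sketch of the idea (a subscriber list plus a notify hook invoked on slot changes), with all names hypothetical:

```python
# Abstract model of the implementation idea: keep the connections that sent
# CLUSTER SUBSCRIBE SLOTS in a list, and push a CLUSTER SLOTS-style row to
# each of them whenever this node learns about a new slot owner.
# All names are hypothetical; this is not Redis source code.

class SlotChangeBroadcaster:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, conn):
        self.subscribers.append(conn)

    def unsubscribe(self, conn):
        # Called when the connection is closed.
        self.subscribers.remove(conn)

    def notify_slot_change(self, row):
        # `row` has the same shape as one CLUSTER SLOTS row (see above).
        for conn in list(self.subscribers):
            conn.push(["cluster", "slots", row])   # hypothetical out-of-band push
```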
Alternatives you've considered
- Alternative to CLUSTER SUBSCRIBE SLOTS:
  - Clients use SUBSCRIBE to a magic channel such as `__cluster-slots__`. This is similar to how client-side caching and keyspace notifications work and is intuitive in both RESP2 and RESP3 modes.
- Alternative to pushing a CLUSTER SLOTS row:
  - The push message can contain only one slot number and the master's "host:port", i.e. the same info as in a MOVED redirect. Pros: Less data pushed. Cons: Changes in replicas would not be pushed.
- Support RESP2?
  - Use a second connection, like client-side caching? (too complex)
  - Only supporting RESP3 is simpler. We can require RESP3 and encourage clients to use it.
- Unsubscribe?
  - It's not necessary. The subscription can stay active for the lifetime of the connection.
  - Can be added later.
Additional information
This is "RESPV3 topology updates" in #8948.
Comment From: hpatro
@zuiderkwast I kind of like the 1st alternative more than the proposed solution. Are there any benefits of the proposed solution over the 1st alternative?
With the 1st alternative we don't need to update any struct and there is less code to maintain. However, we would have to ensure the message doesn't get propagated through the cluster via publish, as I believe we could generate the slot change update message on each node and publish it to the channel.
Comment From: PingXie
This is a really interesting idea.
I wonder if there is a race condition between the source failing over and the target announcing new slot ownership, due to the consensus-less slot-ownership transfer via "CLUSTER SETSLOT".
I am also curious to know if you have thought about incorporating the replica state as well in this mechanism. It will be very valuable in the use case where read replicas serve (read-only) traffic. The application would have to poll the read endpoints via "CLUSTER SLOTS" today.
Comment From: zuiderkwast
@PingXie If we use a pubsub channel, it has to be a non-propagated channel, like what is already used for keyspace notifications and client-side caching. I don't see why there would be a race condition here. The node notifies the subscribed clients whenever it thinks that the slot ownership has changed. It's no more of a race condition than very fast CLUSTER SLOTS polling. Did I miss anything?
Regarding replica changes, I think they should be notified to clients, even if the master of a slot didn't change. It can be very useful for clients, as you say.
Comment From: PingXie
Here is a scenario that I had in mind. I made some assumptions on how CLUSTER SUBSCRIBE SLOTS works and flagged them with "?" below. Let me know if this makes sense or not.
- A cluster with 3 shards, A, B, and C and 1 replica for each shard
- client C1 subscribes to all nodes (?) in the cluster for topology changes
- client C2 starts migrating slot_1 from shard_A to shard_B
- client C2 sends "CLUSTER SETSLOT 1 primary_B" to primary_A
- primary_A notifies client C1 that slot_1 is not owned by shard_A anymore (?)
- client C2 sends "CLUSTER SETSLOT 1 primary_B" to primary_B
- primary_B asynchronously broadcasts its slot_1 ownership to the cluster
- primary_B notifies client C1 that slot_1 is now owned by shard_B
- primary_A crashes before replica_A learns that slot_1 is no longer with shard_A
- replica_A becomes the new primary using primary_A's old slot mapping, which says that slot_1 is owned by shard_A
- replica_A notifies C1 that it is now the owner of slot_1 (?) (and all other slots owned by primary_A?)
- replica_A and primary_B exchange their slot mappings, notice the epoch collision, and resolve that by assigning slot_1 back to replica_A because it has a smaller node_id.
- assuming client C1 has subscribed to replica_A (now primary_A') in the past as well for topology changes, replica_A notifies client C1 that slot_1 is owned by shard_A (?)
There is a 3-way race here among client C1, primary_B, and replica_A (now primary_A'). Depending on the relative speed, I could imagine a case where slot_1 settles with replica_A/primary_A' at the end, but primary_B's notification is processed by C1 later than replica_A's, so client C1 incorrectly updates its routing table and points slot_1 to primary_B. The opposite could happen too. This would lead to a redirection followed by a full topology refresh. It should be a rare event though, and definitely not the end of the world even when it happens. I was just wondering if there is a simple solution that would completely avoid this race. I think it would solve this problem if the node epoch is included in the output and a notification is sent whenever an epoch collision is resolved (just for the slots involved in the collision).
Comment From: zuiderkwast
If the client subscribes to slot updates from multiple nodes then this can indeed happen. I was thinking that a client should normally only subscribe to one node.
I think it would solve this problem if the node epoch is included in the output and a notification is sent whenever an epoch collision is resolved (just for the slots involved in the collision).
Yes, I suppose it will work for this rare condition. Should the client save the highest epoch seen so far and ignore any update with a lower epoch number?
I guess the published message can look like:
```
1) "CLUSTER"
2) "SLOTS"
3) (integer) 123456789 /* the epoch */
4) 1) (integer) 5461
   2) (integer) 5461
   3) ... /* primary */
   4) ... /* replica */
...
```
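For illustration, here is a minimal sketch of that client-side rule, assuming the epoch is delivered as in the message above (the table layout and function name are hypothetical):

```python
# Hypothetical staleness check: remember the highest epoch seen per slot and
# ignore pushed updates that carry an older epoch.

def apply_update(routing, slot, owner, epoch):
    """routing: dict slot -> (owner, epoch). Returns True if the update was applied."""
    current = routing.get(slot)
    if current is not None and epoch < current[1]:
        return False              # stale update, keep the newer mapping
    routing[slot] = (owner, epoch)
    return True

routing = {}
apply_update(routing, 5461, "primary_B", 101)              # applied
assert not apply_update(routing, 5461, "primary_A", 100)   # ignored: older epoch
assert routing[5461] == ("primary_B", 101)
```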
Comment From: PingXie
> If the client subscribes to slot updates from multiple nodes then this can indeed happen. I was thinking that a client should normally only subscribe to one node.
I think subscribing to more than one node is a valid use case as it improves availability in case of node failures. Also I think it will be hard to block this use case altogether from the cluster's perspective.
> Should the client save the highest epoch seen so far and ignore any update with a lower epoch number?
Right, that is the pattern I was thinking of. Stale routing information increases the likelihood of extra network hops, request failures, and hence the need for a full topology refresh. With each update associated with its own epoch, clients can easily identify stale routing information and save full topology refreshes for real failures (such as when all slot update subscriptions have failed).
> I guess the published message can look like
> 3) (integer) 123456789 /* the epoch */
Yep. This is great at the interface level.
Comment From: zuiderkwast
If we implement #10168 first (which is about avoiding repeating all the node information for every slot range) we can design this subscribe feature to use a similar format.
I can see two kinds of notifications:
1. Slot migrations, where we only need to include the slot range (or a single slot if slot migrations are completed one-by-one) and the new master's id.
2. Node changes (failover, replica added, replica gone, status change) where the notifications can include the IDs of the nodes without including any slot ranges.
These events can even be two separate pubsub channels.
In both kinds of notifications, we can include the epoch if we find it useful.
Slot migration example:
```
1) "message"
2) "cluster_slot_owner"
3) 1# "slot" => (integer) 1234
   2# "master_id" => "c9d93d9f2c0c524ff34cc11838c2003d8c29e013"
   3# "epoch" => (integer) 123456789
```
Failover example:
```
1) "message"
2) "cluster_topology"
3) 1# "event" => "failover"
   2# "old_master_id" => "c9d93d9f2c0c524ff34cc11838c2003d8c29e013"
   3# "new_master_id" => "faadb3eb99009de4ab72ad6b6ed87634c7ee410f"
   4# "epoch" => (integer) 123456790
```
New replica example:
```
1) "message"
2) "cluster_topology"
3) 1# "event" => "new_replica"
   2# "master_id" => "c9d93d9f2c0c524ff34cc11838c2003d8c29e013"
   3# "node" => 1# "id" => "821d8ca00d7ccf931ed3ffc7e3db0599d2271abf"
               2# "port" => (integer) 6379
               3# "endpoint" => "host-2.redis.example.com"
               4# "ip" => "127.0.0.1"
               5# "hostname" => "host-2.redis.example.com"
               6# "replication-offset" => (integer) 14000
               7# "status" => "ONLINE"
   4# "epoch" => (integer) 123456790
```
Comment From: PingXie
> If we implement https://github.com/redis/redis/issues/10168 first (which is about avoiding repeating all the node information for every slot range) we can design this subscribe feature to use a similar format.
It is not that critical IMO to use the range representation proposed in #10168 since slot migration is usually done one at a time and not guaranteed to be performed in any particular order. Also because of the pubsub nature, the payload is already delta only.
> These events can even be two separate pubsub channels.
I think this is a good idea as long as the application can subscribe to multiple channels in one CLUSTER SUBSCRIBE SLOTS command.
So from the application's perspective, I think a potential use case could be as follows:
1. Application connects to a random Redis node from the cluster
2. Application subscribes to all interesting topology change events
3. Application starts processing topology change events
4. Application then obtains a baseline cluster topology using CLUSTER SLOTS
5. For slots whose owner nodes are not set, update them using the CLUSTER SLOTS output
6. Application now has a complete routing table
If for whatever reason the connection to the Redis node is lost, the application can restart from step 1. It might need to clear the routing table though to force a full refresh of the topology.
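A small sketch of the merge step (step 5), assuming the CLUSTER SLOTS baseline is parsed into Python lists and that slots already learned from push events keep their values (all names hypothetical):

```python
# Hypothetical merge of a CLUSTER SLOTS baseline into a routing table that may
# already contain entries learned from push notifications: only unset slots
# are filled in, so pushed (newer) information is never overwritten.

def merge_baseline(routing, cluster_slots_rows):
    """routing: dict slot -> (host, port); cluster_slots_rows: parsed CLUSTER SLOTS."""
    for row in cluster_slots_rows:
        start, end, master = row[0], row[1], row[2]
        for slot in range(start, end + 1):
            routing.setdefault(slot, (master[0], master[1]))

routing = {5461: ("127.0.0.1", 30002)}            # learned from a push already
merge_baseline(routing, [[5461, 5462, ["127.0.0.1", 30001, "nodeid..."]]])
assert routing[5461] == ("127.0.0.1", 30002)      # push entry preserved
assert routing[5462] == ("127.0.0.1", 30001)      # gap filled from the baseline
```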
Some more random thoughts:
- The Redis node to which the application is connected can be in a minority partition due to network issues. When this happens, I think we need to make sure it drops the connection so that the application can restart the discovery process.
- The application still needs to handle MOVED/ASK errors. The biggest value of this pubsub mechanism is the elimination of periodic full topology refreshes when the cluster is in a steady state.
On further thought, I was wondering if a "pull" model could achieve a similar goal. What if we introduced a new command that returns a fingerprint (SHA-256, for instance) of the cluster topology? The application can poll this fingerprint periodically and only request the full cluster topology upon seeing a different fingerprint. With some caching scheme on the Redis side, we could significantly reduce the network/CPU overhead incurred by cluster topology discovery for clusters in steady states. That said, the "push" model does go further, all the way to eliminating all of that overhead. Thoughts?
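To illustrate the pull model, here is a rough sketch under the assumption that such a fingerprint command existed; the command and all names here are hypothetical:

```python
# Hypothetical pull model: the server hashes a canonical encoding of its slot
# mapping and returns it from a cheap command; the client polls only the
# fingerprint and fetches the full topology when it changes.
import hashlib
import json

def topology_fingerprint(cluster_slots_rows):
    """Server side (sketch): SHA-256 over a canonical encoding of CLUSTER SLOTS."""
    canonical = json.dumps(cluster_slots_rows, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def poll_once(fetch_fingerprint, fetch_full_topology, state):
    """Client side (sketch): refresh the routing table only when the fingerprint changed."""
    fp = fetch_fingerprint()                       # hypothetical lightweight command
    if fp != state.get("fingerprint"):
        state["routing"] = fetch_full_topology()   # e.g. a full CLUSTER SLOTS call
        state["fingerprint"] = fp
    return state
```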