Redis [NEW] Redis master-replica seamless switching

+--------+         +---------+
| master | <-----+ | replica |
+--------+         +---------+

I'm thinking about how to achieve seamless high-availability switching in Redis's master-replica structure. Below is a proposed sequence of steps; the fourth step is not currently supported by Redis and is the focus of this issue.

1. The master node executes CLIENT PAUSE WRITE.
2. Confirm that the master and replica are consistent.
3. Execute REPLICAOF NO ONE on the replica node.
4. The master node sets up request redirection (maybe something like client force redirect host port; needs discussion).
5. The master node executes CLIENT UNPAUSE and returns MOVED or REDIRECT to the clients, which then automatically redirect to the new master node.
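The steps above can be sketched as an ordered command plan. This is only a sketch under stated assumptions: CLIENT PAUSE ... WRITE, INFO replication, REPLICAOF NO ONE, and CLIENT UNPAUSE are existing commands, while the CLIENT FORCE REDIRECT line in step 4 is the hypothetical command floated above and does not exist in Redis. The function names, node labels, and 5000 ms timeout are illustrative.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (node that runs the command, command text)


def switchover_plan(new_host: str, new_port: int, pause_ms: int = 5000) -> List[Step]:
    """Return the proposed switchover as (node, command) pairs, in order."""
    return [
        # 1. Block write commands on the master; reads are still served.
        ("master", f"CLIENT PAUSE {pause_ms} WRITE"),
        # 2. Fetch replication state from both sides (compared by offsets_match).
        ("master", "INFO replication"),
        ("replica", "INFO replication"),
        # 3. Promote the replica.
        ("replica", "REPLICAOF NO ONE"),
        # 4. HYPOTHETICAL: not an existing Redis command, syntax undecided.
        ("master", f"CLIENT FORCE REDIRECT {new_host} {new_port}"),
        # 5. Resume clients; under the proposal they would now get a redirect.
        ("master", "CLIENT UNPAUSE"),
    ]


def parse_info(text: str) -> Dict[str, str]:
    """Parse the 'key:value' lines of an INFO reply, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, value = line.split(":", 1)
            fields[key] = value
    return fields


def offsets_match(master_info: str, replica_info: str) -> bool:
    """True when the replica's offset equals the master's replication offset."""
    master = parse_info(master_info)
    replica = parse_info(replica_info)
    return int(master["master_repl_offset"]) == int(replica["slave_repl_offset"])
```

Step 2 is shown as an offset comparison because both fields are reported by INFO replication; whether that check alone is a sufficient definition of "consistent" is left open here.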

In this way, the client can achieve seamless HA switching without dropping its connection, and most requests can be executed successfully, greatly reducing the business impact of switching on users.

Detail: Add a configuration item. Once enabled, the replica in master-replica mode will also redirect all commands (read and write), and a client will only be allowed to execute read commands on the replica after it has executed READONLY.
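One way to read this detail is as a small decision table on the replica. The sketch below is hypothetical: the flag name redirect_enabled stands in for the unnamed configuration item, and the "disabled" branch assumes a read-only replica that rejects writes with -READONLY, as described later in this thread.

```python
from enum import Enum


class Action(Enum):
    EXECUTE = "execute"                # run the command locally
    REDIRECT = "redirect"              # reply with a redirect to the master
    READONLY_ERROR = "readonly_error"  # reply with -READONLY


def replica_action(redirect_enabled: bool, client_readonly: bool, is_write: bool) -> Action:
    """Decide how a replica handles one command under the proposed option."""
    if not redirect_enabled:
        # Assumed baseline: a read-only replica serves reads and rejects writes.
        return Action.READONLY_ERROR if is_write else Action.EXECUTE
    if is_write:
        # Proposed: writes are always redirected to the master.
        return Action.REDIRECT
    # Proposed: reads run locally only after the client has sent READONLY.
    return Action.EXECUTE if client_readonly else Action.REDIRECT
```

For example, replica_action(True, False, False) yields Action.REDIRECT: with the option on, even a read is redirected until the client opts in with READONLY.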

Comment From: zuiderkwast

In what way is this different to CLUSTER FAILOVER?

Comment From: uvletter

I suppose you're running standalone Redis with Sentinel, but why not just use Redis Cluster, which already supports seamless switching?

Comment From: soloestoy

Thanks for the feedback, but I need to remind you that lots of users run standalone Redis without Sentinel, and Redis alone cannot achieve seamless switching. Moreover, Sentinel cannot support seamless switching either, since clients learn of the switch event via subscribe, which is not real-time.

Comment From: zuiderkwast

Ah, not everyone is using cluster. I didn't think about that.

@soloestoy do you think we should add a controlled switchover that can be used in a plain master-replica setup and in Sentinel setups? Isn't it better to promote Redis Cluster to everyone and make it work also for single-shard clusters?

Comment From: soloestoy

@zuiderkwast I didn't plan to "promote Redis cluster to everyone", here I only want Redis to have the ability to independently (without sentinel) route between the master and replica in standalone mode.

Cluster mode has many limitations, such as the inability to use SELECT or to execute commands across slots, so I don't think a single-shard cluster is a suitable solution.

Comment From: zuiderkwast

It's a bit unfortunate that we have cluster and standalone mode. It makes the concept of Redis harder to grasp. I think it's a good idea to try to bridge the gap between cluster and standalone, so allowing something like this for standalone could be a useful thing.

When some clients see a MOVED-redirect, they assume it is a cluster and then they issue CLUSTER SLOTS.

Can we allow CLUSTER SLOTS too for standalone nodes, simulating a single shard cluster? Maybe all cluster commands?

OTOH, what if we make cluster mode more useful by allowing cross-slot commands (enabled by config, not default), what are the implications? Applications that use cross-slot commands suddenly stop working if slots are split between shards, but what if we can handle cross-shard commands by proxying?

Comment From: soloestoy

@zuiderkwast these are interesting questions, but they are a bit off-topic 😄 . We can open a new issue to discuss these. Here, let's focus on how to achieve seamless switching in standalone mode.

Comment From: zuiderkwast

IMO, in Redis it is very easy. Just add a config like @yangbodong22011 said, client force redirect host port, or why not config set redirect host:port?

More of a problem IMO is that clients can be confused. If we add a new -REDIRECT, we need new implementation in clients. If we reuse -MOVED, cluster clients can work, but maybe they are confused and think it is a cluster. That's why my previous comment is not completely off-topic. ^_^
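To make the client-side difference concrete, here is a rough sketch of parsing both replies. The -MOVED layout (slot, then host:port) is the one cluster clients already handle; the -REDIRECT layout (host:port only) is purely an assumption for illustration, since no format has been decided in this thread.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str            # "MOVED" or "REDIRECT"
    slot: Optional[int]  # hash slot for MOVED, None for REDIRECT
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse a redirect error reply; return None for any other reply."""
    parts = error.lstrip("-").split()
    if len(parts) == 3 and parts[0] == "MOVED":
        # Existing cluster format: MOVED <slot> <host>:<port>
        host, _, port = parts[2].rpartition(":")
        return Redirect("MOVED", int(parts[1]), host, int(port))
    if len(parts) == 2 and parts[0] == "REDIRECT":
        # ASSUMED format for the proposed reply: REDIRECT <host>:<port>
        host, _, port = parts[1].rpartition(":")
        return Redirect("REDIRECT", None, host, int(port))
    return None
```

A client that only knows MOVED would see the slot field and may assume a cluster, which is the confusion described above; a separate REDIRECT carries no slot but needs new client code.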

Comment From: soloestoy

Clients also work in different modes for Redis Cluster and standalone mode. In general, clients working in standalone mode cannot handle -MOVED correctly, so either reusing -MOVED or adding a new reply such as -REDIRECT would require new development in the clients. The key point is that we first need to support routing for standalone mode on the server side; then the client can choose the correct route based on the protocol.
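On the client side, "choosing the correct route" could look roughly like the loop below. It is a sketch, not any real client's API: send is a caller-supplied transport function, the -REDIRECT reply is the hypothetical one discussed here, and the target address is assumed to be the last token of the reply in host:port form.

```python
from typing import Callable, Tuple

Address = Tuple[str, int]


def execute_with_redirects(
    send: Callable[[Address, str], str],
    address: Address,
    command: str,
    max_redirects: int = 3,
) -> Tuple[Address, str]:
    """Send `command`, following redirect replies up to `max_redirects` times.

    Returns the address that finally answered and its reply.
    """
    for _ in range(max_redirects + 1):
        reply = send(address, command)
        parts = reply.split()
        if parts and parts[0] in ("-MOVED", "-REDIRECT"):
            # Retry the same command against the node named in the reply.
            host, _, port = parts[-1].rpartition(":")
            address = (host, int(port))
            continue
        return address, reply
    raise RuntimeError("too many redirects")
```

The redirect cap is a design choice to avoid looping forever if two nodes point at each other during a botched switchover.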

Comment From: zuiderkwast

@redis/core-team Do we want to add redirects in standalone mode?

It would be a major decision. Clients need to handle the new response.

Comment From: soloestoy

Indeed it's a major decision. I was planning to discuss it in the next core-team meeting. Thanks for your feedback; we have already discussed many details, haha.

Comment From: oranagra

I don't think the direction of forcing clients to support yet another redirection / discovery mechanism is a good idea.

I suppose the reality is that if someone is using either Cluster or Sentinel, then they already have a discovery mechanism, and maybe in the case of sentinel it is sufficient to close the connection or re-use MOVED (need to check with clients if it can work for them).

And if someone is using a standalone redis (no sentinel / cluster), then they already have, or should have, some other way for discovery and instance administration (to coordinate a failover), and i would not want to change redis or clients much for that other than adding the most basic building blocks (e.g. CLIENT PAUSE, CLIENT KILL), and maybe re-use MOVED.

For the record, it was discussed in #8948 and #10875 to allow cluster mode to be un-sharded + multiple dbs, and have voting replicas or alike so that it can completely replace sentinel some day.

Comment From: madolson

I also think we should just standardize on the cluster mode routing interface. You should be able to run cluster mode with an unsharded database and have it support redirects just like cluster mode enabled does today. We do need some "topology" command like CLUSTER SHARDS, but maybe we can re-use that command anyways and simply remove the slots from the response.

Comment From: soloestoy

I agree that the unsharded + multiple dbs cluster mode is a feasible way, but it's a long-term solution and has many details that need to be discussed, like:
1. Is unsharded mode only allowed to run on a cluster with a single node, or is it allowed to run on a cluster with multiple nodes?
2. Does unsharded mode use the storage structure of the cluster or the storage structure of standalone mode? This also raises many questions, such as whether slot-to-key mapping needs to be maintained when using unsharded mode.
3. Will unsharded mode make some SDKs unavailable? As far as I know, some clients, such as C#, determine whether it is a cluster mode by whether the SELECT command can be executed.
4. etc.

Here I hope the current standalone mode can support seamless switchover, and the key point is that we need to make clients able to automatically route to the new master node after they are unpaused.

Currently, this is impossible to achieve with either Sentinel or other administration tools: after a switchover, write commands receive "-READONLY" errors, and read commands may read stale data on the replica.

I'd love to introduce a minor change to support seamless switchover in standalone mode. The main idea is to reuse -MOVED or introduce a new -REDIRECT, which will not conflict with an unsharded cluster. And we can add a configuration to control whether to redirect, to avoid possible breaking changes.

Comment From: madolson

> I'd love to introduce a minor change to support seamless switchover in standalone mode. The main idea is to reuse -MOVED or introduce a new -REDIRECT, which will not conflict with an unsharded cluster. And we can add a configuration to control whether to redirect, to avoid possible breaking changes.

I'm in favor of this for Redis 8. I would prefer we retain the MOVED semantics just so that clients have one API they have to know about. The way I'm thinking about it is that what we call cluster mode today is really two components: the sharded database and the cluster bus + command routing. I think we could have the cluster bus + command routing running but still use the normal unsharded database.

For the sake of completeness though, in AWS we have support for TCP proxying of requests sent to the wrong node. During the failover we will proxy read/write requests sent to the old node.

> Will unsharded mode make some SDKs unavailable? As far as I know, some clients, such as C#, determine whether it is a cluster mode by whether the SELECT command can be executed.

Clients do many dumb things. We should help educate them :).

Comment From: oranagra

i agree, separating these two unrelated cluster features, or actually allowing cluster bus and command routing on non-sharded + multi-db deployments is the right thing.

regarding points 1-4 above, it seems obvious to me:
1. multiple nodes, with voting replicas.
2. the data structure should be like in standalone, and no slot specific features.
3. using SELECT to recognize cluster mode may still be correct, depending on what else they do with it. i.e. they won't want to hash keys in order to know which node to send them to, but they'll need to react to MOVED.

if we wanna reuse MOVED before we get to this unsharded cluster, and in some way let standalone redis use it, that may be ok too (documenting that clients should react to it regardless of the above mentioned cluster slots related features). but i'd rather not add more complicated machines to the core, like we did with the FAILOVER command.