Redis [NEW] A sticky option for cluster meet

The problem/use-case that the feature addresses

cluster meet is asynchronous. It synchronously returns OK to caller, and attempts to handshake with the target node asynchronously.

If for whatever reason, the handshake attempt doesn't succeed within a timeout, Redis drops the attempt and stops trying.

There are couple issues with this behaviour: * It's not robust. If the timeout was due to a transient issue, e.g. a network partition, Redis cannot auto-heal after the issue is mitigated. Callers would need to call cluster meet again to re-initiate the meeting. * It's not observable. The target node will just simply disappear from the cluster nodes output. As a novice Redis user, it's not obvious if Redis has tried and failed, or it simply lost the cluster meet request due to crash/restart or something. It requires in-depth Redis knowledge to distinguish between the two(by looking at logs). * It's counter-intuitive. As a person, in our daily life, if we ask person A to meet another person B, and person A says OK. We would expect A to either successfully meet B, or somehow tell us if they couldn't. We don't expect A to just silently stop trying.

Description of the feature

Add a sticky option to the cluster meet command. E.g.

CLUSTER MEET ip port [cluster_bus_port] [sticky]

When sticky is specified, Redis continues to try handshaking with the target until: * Either it has succeeded. * Or explicitly told to forget about the target node via cluster forget.

When sticky is specified, and Redis couldn't successfully meet within a timeout, it indicates this with a handshake_timedout flag in cluster nodes output:

3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e 127.0.0.1:7005 handshake_timedout - 0 1385503419023 3 disconnected

Alternatives you've considered

None

Additional information

None

Comment From: madolson

Another very real alternative to this could also be that we block and synchronously return if we were able to successfully meet node B. I think that might be a more coherent response than adding this sticky handler?

Comment From: ny0312

Making cluster meet synchronous and idempotent is also a good alternative. E.g. if cluster meet returned me OK, that means the two nodes have connected. If it times out, I can retry it until I get either an OK or a definitive failure message. Users only have to use a single command to make sure meet is successful. It's much simpler than the proposal. In the proposal, users would have to call cluster meet to initiate the meet, use cluster nodes to query the outcome and use cluster forget to cancel a failed meet.

The downside is the latency of a synchronous cluster meet command becomes unpredictable. Its latency could be of different magnitudes depending on if the two nodes use TLS or not, or are in the same geo-location or not.

Comment From: hpatro

@zuiderkwast @madolson @PingXie

Any thoughts on the alternative proposed by @ny0312

Provide a mechanism to poll if CLUSTER MEET succeeded or failed.
CLUSTER MEET <ip> <port> [SYNC] to provide a blocking API and provide a defined response on success/failure.

Comment From: madolson

I would still much rather have the API be synchronous, that seems like the easier API to use. Having a poll mechanism is useful for some applications, but not sure that makes sense here (especially since basically no other command is synchronous?). I think cluster v2 will also much more heavily use sync commands.

Comment From: zuiderkwast

+1 for synchronous. Should it block indefinitely or do we want to have a timeout argument too, like for most other blocking commands? like [SYNC timeout]

Comment From: PingXie

+1 too on providing synchronous semantics; and having a timeout field makes sense.

If we are going this route, I would also like to re-examine a decision we made in https://github.com/redis/redis/pull/10517#issuecomment-1620292208 where the client is asked to explicitly WAIT for the slot ownership to asynchronously replicate. Wouldn't it be better/more consistent if we extend the SETSLOT command with `[SYNC timeout]" as well?