Typically you need to send a cluster forget to each node in a cluster to delete a node. If you don't do this fast enough, the node will be re-added through gossip. Ideally you just need to forget a node and it will eventually be forgotten throughout the cluster.
This is part of #8948.
Possible ways to implement this:
- Append the contents of clusterBlacklistNodes (node ID and timestamp) to all cluster ping and pong messages. A ping extension (the mechanism introduced for hostnames) can be used for this; a sketch of what the payload could look like follows this list. (The blacklist expiry timestamp needs to be gossiped too, so that we don't end up with the node being re-forgotten and re-added to the blacklist by gossip after the blacklist TTL has passed.) This increases the size of the cluster bus messages while any node is blacklisted (1 minute), but if the blacklist is short and the ping extension is small, this might be acceptable.
- Append the contents of clusterBlacklistNodes (node ID and timestamp) to every pong message sent in response to a ping, but only when we detect that the ping's gossip section contains a forgotten node, i.e. a node in our blacklist. This minimizes the extra data sent over the cluster bus, but is it enough to make sure a node is forgotten throughout the cluster?
- When a CLUSTER FORGET command is received, broadcast it to the cluster, for example in a pong with the forgotten node in a ping extension. This is done only once and only by the node that receives the command. If a client sends CLUSTER FORGET to all nodes in a cluster, there will be N×N pongs sent in a short period of time.
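To make the first option concrete, here is a rough sketch of what such an extension payload could look like. This is illustrative only, not existing Redis code: the struct name is hypothetical, and only CLUSTER_NAMELEN (the 40-byte node ID length) is an existing constant.

```c
/* Illustrative sketch only: a ping extension record carrying one entry of
 * clusterBlacklistNodes. One such record would be appended per blacklisted
 * node while the blacklist is non-empty. */
typedef struct {
    char name[CLUSTER_NAMELEN]; /* node ID of the forgotten node */
    uint64_t expire;            /* blacklist expiry, here as a millisecond
                                 * timestamp; a relative TTL would work too */
} clusterMsgPingExtForgottenNode;
```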
@madolson et al., any preference among the above, or other ideas?
Comment From: madolson
First, some context. Some folks from RL and the core group have been iterating on a design for completely replacing the cluster bus with a new consistent storage layer. With that in place, it becomes trivial to "remove" a node from a cluster, since all operations are driven through consensus. Pinging @ushachar, since he is supposed to be preparing that for general discussion.
The idea of Cluster V2 is that it will replace the current cluster in the fullness of time, but since that is at least a year or more away, we can still make incremental progress on other parts of the cluster system until then. We decided to continue iterating on slot migration, for instance.
So, to your points. I generally like option 1; it can also be "enhanced" like option 3 by immediately broadcasting out the ping to all nodes. I don't have much issue with "extra" metadata on the cluster protocol since it's already so verbose (it's ~2kb today, already larger than a single TCP packet!). Option 2 seems to be more of an optimization, but I'm not convinced it buys us all that much. One thing worth highlighting is that the current system does not broadcast any absolute time information, to allow nodes to have different clocks, so we should consider broadcasting a relative TTL which can only decrement.
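A minimal sketch of that relative-TTL handling, with made-up helper names (mstime_t is the existing millisecond time type):

```c
/* Illustrative only: the helper names below are hypothetical. */

/* Sender side: express the blacklist entry as time remaining rather than as
 * an absolute timestamp, so nodes with different clocks still agree. */
uint64_t forgetTtlToSend(mstime_t expire_at, mstime_t now) {
    return expire_at > now ? (uint64_t)(expire_at - now) : 0;
}

/* Receiver side: re-anchor the relative TTL to the local clock, but never
 * extend an entry we already hold, so the TTL can only decrement as the
 * entry hops around the cluster. */
mstime_t forgetExpireOnReceive(uint64_t ttl_ms, mstime_t now, mstime_t existing_expire) {
    mstime_t candidate = now + (mstime_t)ttl_ms;
    if (existing_expire != 0 && existing_expire < candidate) return existing_expire;
    return candidate;
}
```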
Comment From: PingXie
Some folks from RL and the core group have been iterating on a design for completely replacing the cluster bus with a new consistent storage layer.
@madolson is this the doc that you were referring to here?
We decided to continue iterating on slot migration, for instance.
Do you mind circling back to thread #2807? I don't intend to hijack this thread, but with this new information, I think it would be great to align with the community on how best to evolve the existing slot migration protocol.
Typically you need to send a cluster forget to each node in a cluster to delete a node. If you don't do this fast enough, the node will be re-added through gossip. Ideally you just need to forget a node and it will eventually be forgotten throughout the cluster.
I agree we need a one-shot CLUSTER FORGET. The timeout can indeed become an issue in a large cluster. However, I am not sure gossip is the right solution. It is not entirely clear to me how we could ensure consistency in the face of failures. What would happen if a few nodes in the cluster didn't receive the gossip for various reasons, either temporary network issues or simply races between CLUSTER MEET and CLUSTER FORGET such that the FORGET initiator wasn't aware of the presence of those nodes? A consensus-based solution would be ideal IMO, though I don't have the details yet.
Comment From: zuiderkwast
I'll go for option 1 with relative TTL then!
@PingXie I don't think we'll get any guarantees here, just better chances. redis-cli can still try to send CLUSTER FORGET to all nodes. Guarantees will hopefully come with the future cluster bus.
@madolson Any chance that I (and @PingXie I suppose) can be involved in the Cluster V2 project? I have some time to spend and it seems wiser to spend it on the long-term solution. Or do you prefer that we continue to improve Cluster V1 meanwhile?
Comment From: judeng
Option 1 seems good, but we need to avoid broadcast operations as much as possible; broadcasting in a large cluster (1000 nodes) may cause latency fluctuations.
Comment From: madolson
Option 1 seems good, but we need to avoid broadcast operations as much as possible; broadcasting in a large cluster (1000 nodes) may cause latency fluctuations.
@judeng We already broadcast to the entire cluster in various situations, taking ownership of a slot for instance. Given that this is only at the initial forget, I think that's justifiable.
@madolson Any chance that I (and @PingXie I suppose) can be involved in the Cluster V2 project? I have some time to spend and it seems wiser to spend it on the long-term solution. Or do you prefer that we continue to improve Cluster V1 meanwhile?
Yes, and more than that, I want you two to help! I was under the impression that the main design would be posted so we could start iterating already; I'm not really sure why it hasn't been. EDIT: It is now! https://github.com/redis/redis/pull/10875
Do you mind circling back to thread https://github.com/redis/redis/issues/2807?
At least for the time being I'm very time constrained, but I will try.
Comment From: PingXie
@PingXie I don't think we'll get any guarantees here, just better chances. redis-cli can still try to send CLUSTER FORGET to all nodes. Guarantees will hopefully come with the future cluster bus.
Makes sense, @zuiderkwast. I am in agreement with option 1 too.
If we can make sure CLUSTER_BLACKLIST_TTL is greater than cluster_node_timeout (with some safe margin), I think we could avoid broadcasting the TTL. It should be OK that not all nodes stop broadcasting the forgotten node at exactly the same time. Nodes that have already forgotten the node can ignore the new broadcast, so we don't have to worry about a never-ending feedback loop. It is also easier to have every node work with a fixed timeout.
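A sketch of that receiver-side rule, reusing the existing blacklist helpers (clusterBlacklistExists, clusterBlacklistAddNode, clusterDelNode); the handler itself is hypothetical:

```c
/* Illustrative handler for a gossiped forgotten-node entry: a node that has
 * already blacklisted it ignores the message, which is what breaks any
 * potential re-broadcast feedback loop. */
void handleGossipedForget(clusterNode *node) {
    if (node == NULL || node == myself) return;
    if (clusterBlacklistExists(node->name)) return; /* already forgotten once */
    clusterBlacklistAddNode(node);  /* same fixed TTL on every node */
    clusterDelNode(node);           /* drop it from our view of the cluster */
}
```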
Another thought regarding the node to be forgotten. Today, the node to be forgotten no-ops the CLUSTER FORGET command. It continues to participate in cluster activities such as sending out gossip and answering cluster topology queries, despite being asked to leave. Neither is desirable for a node that is, or will soon be, a stranger to the rest of the cluster. However, it still makes sense for this node to serve read requests while waiting to be removed. Maybe we could consider tackling this issue as part of the solution as well?
Comment From: zuiderkwast
If we can make sure CLUSTER_BLACKLIST_TTL is greater than cluster_node_timeout (with some safe margin), I think we could avoid broadcasting the TTL.
@PingXie We can't, can we? CLUSTER_BLACKLIST_TTL is 1 minute but node timeout is configurable...
Nodes that have forgotten the node can ignore the new broadcasting such that we don't have to worry about a never-ending feedback loop.
Good point!
It is also easier to have every node work with a fixed timeout.
Yes, I guess the code gets simpler, but if there is a lot of latency and the gossip takes some time to reach all nodes, the blacklist can survive longer than expected. Given that a node ID is 40 bytes and a TTL is 8 bytes, I don't think it matters much for the message size, at least. Please have a look at the PR to evaluate the complexity of this.
Comment From: PingXie
We can't, can we? CLUSTER_BLACKLIST_TTL is 1 minute but node timeout is configurable...
We could set TTL to CLUSTER_BLACKLIST_TTL + cluster_node_timeout on each node. Traffic is one thing but I do think the simplicity (of not having to deal with an extra field) is appealing.
Please have a look at the PR to evaluate the complexity of this.
Do you want to wait for #10536 to merge? I refactored the extension building logic to simplify the addition of new extensions. I think it is very close to the finish line. I am really hoping @madolson could find some time soon :-)
Comment From: zuiderkwast
We could set TTL to CLUSTER_BLACKLIST_TTL + cluster_node_timeout on each node. Traffic is one thing but I do think the simplicity (of not having to deal with an extra field) is appealing.
Sure, it would be simpler. That means we also set it to CLUSTER_BLACKLIST_TTL + cluster_node_timeout on the CLUSTER FORGET command itself, right? What about the concern that gossip can take a long time to reach all nodes in a large cluster? Is that a valid concern?
Do you want to wait for #10536 to merge? I refactored the extension building logic to simplify the addition of new extensions.
Good to know. Sure, I can rebase my PR after yours is merged. But I can also keep my PR in a working state meanwhile.
Comment From: PingXie
That means that we set it to CLUSTER_BLACKLIST_TTL + cluster_node_timeout also on the CLUSTER FORGET command, right?
Right, and every node sets the same TTL (CLUSTER_BLACKLIST_TTL + cluster_node_timeout) upon receiving a newly forgotten node for the first time.
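As a sketch, the expiry armed on first receipt could then simply be the following (CLUSTER_BLACKLIST_TTL is in seconds, cluster_node_timeout in milliseconds):

```c
/* Hypothetical: every node arms the same fixed expiry the first time it
 * hears about a forgotten node, long enough to outlive a full ping round. */
mstime_t expire_at = mstime() + CLUSTER_BLACKLIST_TTL * 1000 +
                     server.cluster_node_timeout;
```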
What about the concern that gossip can take a long time to reach all nodes in a large cluster? Is that a valid concern?
This is a valid concern, but it is orthogonal to how the TTL is set, whether statically via a hard-coded formula like the one we discussed here or computed at runtime as your PR demonstrates. In theory, both can run into the issue where the sender has already expired the delete before the gossip arrives at the receiver. In fact, if there is asymmetry in latency between the two links, from sender to receiver and from receiver to sender, I could imagine a never-ending loop. Imagine that the latency from sender S to receiver R is greater than the TTL, and the TTL is greater than the latency from R to S. S would have expired the deletion before R receives it; R then forwards the deletion back to S, which re-processes it, resulting in a new gossip to R. And because the latency from S to R is greater than the TTL, R would have already expired its deletion before it sees the new gossip. We are now in a perpetual loop. That said, I don't think this would be an issue in reality because 1) the expected latency (ms) in any deployment will be way lower than either the typical node timeout (seconds) or the TTL (minutes); 2) latency spikes should be short-lived, and as soon as latency goes back to normal, any potential loop would be broken; and 3) the worst-case scenario is that the delete is reverted, which we already agree is acceptable as long as there is a way to detect it, and a retry would be all that is needed.
Comment From: judeng
We already broadcast to the entire cluster in various situations, taking ownership of a slot for instance. Given that this is only at the initial forget, I think that's justifiable.
@madolson in 7.0, the broadcast messages for pubsub and modules have been minimized, and other broadcast messages, such as slot ownership messages, still need to be broadcast to recover routing information ASAP. From a user's point of view, they are more willing to put up with the delay caused by broadcast messages than with the delay caused by redirection. But node removal usually happens when the cluster is doing a scale-in operation, and users may not be willing to have their traffic affected by it.
Comment From: PingXie
@judeng the idea here is to piggyback on the periodic PING message, which now has the ability to carry an additional payload via a construct called an extension, thanks to @madolson's work (#9530). The PING message without any extension is already quite substantial, ~~~2KB * (N+1)~~ ~2KB + 100B*N where N is the number of nodes picked for gossip. The additional cost incurred by this new payload would be 40B * M where M is the number of nodes to be removed from the cluster. In a large cluster, N >> M, and we would probably be looking at ~~\~1% (or less)~~ a small size increase at the per-message level. Of course, given the fully-meshed topology, the total size increase across the cluster would be proportional to the square of the total node count. Another thing to note is that the overhead would be short-lived (bounded by the TTL). Therefore, based on these two observations (a small and time-bounded size increase), I would assume the overall impact is negligible. The impact of scaling out the cluster would be a lot bigger and longer-lasting. I am curious to know whether this is in line with your experience.
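As a back-of-the-envelope check of those numbers (the constants below are the rough figures above, not measurements):

```c
#include <stdio.h>
#include <stddef.h>

/* Rough per-message size under the assumptions above: ~2KB base, plus ~100B
 * per gossiped node, plus 40B per forgotten node carried in the extension. */
static size_t approx_ping_size(size_t gossiped, size_t forgotten) {
    return 2048 + 100 * gossiped + 40 * forgotten;
}

int main(void) {
    /* e.g. ~100 gossiped nodes (roughly a tenth of a 1000-node cluster) and
     * one forgotten node: the extra 40 bytes are well under 1% per message. */
    printf("%zu vs %zu bytes\n", approx_ping_size(100, 0), approx_ping_size(100, 1));
    return 0;
}
```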
Comment From: madolson
What about the concern that gossip can take a long time to reach all nodes in a large cluster? Is that a valid concern?
We aren't really "gossiping" this, or at least not in the sense the implementation uses the term. The only piece of information we gossip is the health information. The ping/pong messages are guaranteed (as much as cluster can guarantee anything) within cluster_node_timeout, because we send a new message to any node we haven't heard from by cluster_node_timeout/2. We should hit every single node within max(CLUSTER_BLACKLIST_TTL, cluster_node_timeout). I think adding the two together is a fine approximation.
The other observation made by PingXie about the potential failure modes seems mostly correct. It doesn't seem all that hard to just "decrement" the TTL by some amount each time we forward it, so even in some eternal loop it eventually decays away.
@madolson in 7.0, the broadcast messages for pubsub and modules have been minimized, and other broadcast messages, such as slot ownership messages, still need to be broadcast to recover routing information ASAP. From a user's point of view, they are more willing to put up with the delay caused by broadcast messages than with the delay caused by redirection. But node removal usually happens when the cluster is doing a scale-in operation, and users may not be willing to have their traffic affected by it.
My POV is that I split these operations into "datapath" (pubsub + module) and "control path" (add/remove/update topology). Control path I have really no objection to a full fanout, since they are very infrequent. Topology is generally pretty fixed and only gets updated on occasion, so spending 10ms to broadcast 1k messages is OK. I'm far less OK with datapath operations fanning out.
Do you want to wait for https://github.com/redis/redis/pull/10536 to merge? I refactored the extension building logic to simplify the addition of new extensions. I think it is very close to the finish line. I am really hoping @madolson could find some time soon :-)
Yes, I'm going to look at your PR first, so we can rebase after that.
Comment From: judeng
Control path I have really no objection to a full fanout, since they are very infrequent
@madolson Agree! Your point is very clear; I think I get it, thank you! In my usage scenario, my cluster's data has roughly doubled every year, and it now has close to 1200 nodes. Most of the overhead of the cluster bus is spent on ping/pong messages; the full fanouts are indeed very infrequent.