The problem/use-case that the feature addresses
Currently, when using Redis Cluster, a request that hits a migrating shard may be answered with an ASK response redirecting the request to the importing shard. However, this places the full onus of handling the redirection correctly on the client.
As a simple example, if the client is suspended after receiving the ASK response and resumes some time in the future, the node it was told to send an ASKING request to might have had the slot resharded away and back in the meantime. For regular Redis this isn't a major problem, since it doesn't provide strong consistency guarantees, but for those trying to build stronger consistency guarantees on top of Redis or the Redis Cluster protocol, it makes the needed guarantees difficult or impossible to provide.
Description of the feature
Add an optional epoch/cookie field to the ASKING request, with the same value also returned as part of the original ASK response:
ASKING [epoch]
If the epoch/cookie is provided, the shard targeted by the ASKING request can use it to determine whether the request is still valid under its consistency model.
As an example, it might require the two shards to be in the same "sharding round", or at minimum that the provided epoch is not lower than the current round.
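To make this concrete, here is a minimal sketch (in Python, with hypothetical names; the exact rule would depend on the consistency model being built on top) of how the importing shard could validate the epoch carried by ASKING:

```python
# Minimal sketch of validating the epoch/cookie carried by ASKING.
# `current_round` stands for the shard's view of the current "sharding round";
# the names and the exact acceptance rule are hypothetical.

def asking_is_valid(provided_epoch, current_round, require_same_round=True):
    if provided_epoch is None:
        # Legacy ASKING without an epoch: fall back to today's behavior.
        return True
    if require_same_round:
        # Strict model: the redirect must come from the same sharding round.
        return provided_epoch == current_round
    # Looser model: accept any epoch that is not older than the current round.
    return provided_epoch >= current_round

# A client resumed after a long suspension presents a stale epoch and is rejected.
assert asking_is_valid(provided_epoch=7, current_round=7)
assert not asking_is_valid(provided_epoch=5, current_round=7)
```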
Alternatives you've considered
We've explored what happens if there is no way to coordinate between the migrating and importing shards, and it always results in inconsistencies (e.g. you beat the migrating shard and set data on the importing shard, only to have it overwritten without your knowledge when the key does get migrated).
Comment From: madolson
I'm going to make a counter proposal. I think we should deprecate ASKING and actually make redis consistent. If we implement https://github.com/redis/redis/issues/2807, we can send all of the keys from a source to a target atomically and orchestrate it within Redis. Once the migration is done, we can 2PC (or you guys can leverage raft) the ownership from one node to the other. In this case, there would no longer be MIGRATING and IMPORTING states.
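For the record, a bare-bones sketch of what a two-phase ownership handoff could look like once all keys have been transferred (pure Python with in-memory stand-ins and made-up names; a Raft-based orchestration would replace the coordinator logic):

```python
# Bare-bones two-phase commit of slot ownership. The Node objects stand in for
# the source and target shards; the flow and names are illustrative only.

class Node:
    def __init__(self, name, owns_slot):
        self.name = name
        self.owns_slot = owns_slot
        self.prepared = False

    def prepare(self, slot):
        # Phase 1: promise to apply the ownership change if told to commit.
        self.prepared = True
        return True

    def commit(self, slot, new_owner):
        # Phase 2: apply the ownership change on this node.
        self.owns_slot = (self.name == new_owner)
        self.prepared = False

def transfer_ownership(slot, source, target):
    # Only flip ownership if both participants acknowledged the prepare phase;
    # otherwise abort and both nodes keep their previous state.
    if source.prepare(slot) and target.prepare(slot):
        for node in (source, target):
            node.commit(slot, new_owner=target.name)
        return True
    return False

src, dst = Node("source", owns_slot=True), Node("target", owns_slot=False)
print(transfer_ownership(42, src, dst), src.owns_slot, dst.owns_slot)  # True False True
```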
Comment From: sjpotter
After @yossigo mentioned that you said you dislike the ASK/ASKING protocol due to client issues, I proposed something similar internally. Migrating/importing might remain a concept internally, but clients would only see stable slots (i.e. an importing slot would appear as stable).
Redis can kick off (via a series of callbacks) fetching keys from the "migrating" shard and deleting them there (it would have to track "non-existent keys" during the import stage), and prioritize keys referenced by client commands. It can either block the client until the key(s) needed by the command are imported (hopefully quick) or perhaps return TRYAGAIN until it's ready to handle the keys in the command.
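A rough sketch of the on-demand part (a pure-Python simulation with in-memory dicts standing in for the two shards; none of the names are actual Redis internals):

```python
# Toy simulation of on-demand key import for a slot in the "importing" state.
# `source` and `target` are plain dicts standing in for the two shards.

source = {"user:1": "alice", "user:2": "bob"}   # old owner, still holds the keys
target = {}                                      # new owner, importing the slot
non_existent = set()                             # keys confirmed absent on the source

def handle_command(key):
    if key in target:
        return target[key]
    if key in non_existent:
        return None            # behaves like a stable slot: key simply doesn't exist
    if key in source:
        # Prioritized import: fetch the key the client asked for first.
        # A real server could block the client here, or reply TRYAGAIN and
        # finish the import asynchronously.
        target[key] = source.pop(key)
        return target[key]
    non_existent.add(key)
    return None

print(handle_command("user:1"))   # imported on demand -> "alice"
print(handle_command("user:9"))   # recorded as non-existent -> None
```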
My thought process was that this would be better for Redis in general as well, not specific to raft.
Comment From: sjpotter
This is literally me just sketching out a set of callbacks to do this, i.e. how Redis could handle the migrate command (which just migrates a whole slot). It's very high level at this point; it assumes the happy path, and error cases would obviously have to be handled. A rough simulation of the loop follows the list below.
findKey() - checks if a key is listed via priority from a client and, if so, calls dumpKey() directly - otherwise sends GETKEYSINSLOT 1 to the migrating cluster - callback handler is handleGetKeysInSlot()
handleGetKeysInSlot() - takes the result and calls dumpKey() - if there is no result, the callback chain ends, as there are no more keys to migrate
dumpKey() - sends a DUMP command - callback handler is handleDumpKey()
handleDumpKey() - calls the RESTORE command locally - callback is handleRestoreCommand()
handleRestoreCommand() - sends a special delete (as a normal delete would return MOVED) to remove the key from the old/migrating shard - callback is handleKeyDelete()
handleKeyDelete() - for keys that didn't exist on the old/migrating shard, adds the key to the non-existent-key dictionary - calls findKey() to restart the loop
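Here is that happy-path loop as a runnable simulation (plain Python with dicts standing in for the two shards; the real thing would be asynchronous callbacks inside Redis, not direct calls):

```python
# Happy-path simulation of the callback chain sketched above. Function names
# mirror the sketch; `migrating` and `importing` are plain dicts, not shards.

migrating = {"k1": "v1", "k2": "v2"}   # keys still on the old owner of the slot
importing = {}                          # local (new) owner of the slot
non_existent = set()                    # keys requested that the old owner never had
priority_keys = []                      # keys referenced by pending client commands

def find_key():
    if priority_keys:
        dump_key(priority_keys.pop(0))
    else:
        handle_get_keys_in_slot(list(migrating)[:1])    # GETKEYSINSLOT 1

def handle_get_keys_in_slot(result):
    if not result:
        return                                          # nothing left to migrate
    dump_key(result[0])

def dump_key(key):
    handle_dump_key(key, migrating.get(key))            # DUMP on the old owner

def handle_dump_key(key, payload):
    if payload is not None:
        importing[key] = payload                        # RESTORE locally
    handle_restore_command(key)

def handle_restore_command(key):
    existed = migrating.pop(key, None) is not None      # "special delete" on the old owner
    handle_key_delete(key, existed)

def handle_key_delete(key, existed):
    if not existed:
        non_existent.add(key)
    find_key()                                          # restart the loop

find_key()
print(importing)   # {'k1': 'v1', 'k2': 'v2'}
print(migrating)   # {}
```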
Comment From: madolson
Yeah, that is basically what I'm thinking. I think we can be smarter than what you outlined though. We should also:
1. Accumulate incoming write commands on the source and transfer them to the target, after it's done.
2. Be able to handle target failures, so if the target dies part way through we can roll back completely without data loss.
3. Perhaps do this in a fork so that we don't impact the main thread.
Comment From: yossigo
@madolson I haven't given this much thought, but I tend to think of it in terms of another form of replication. Doing serialization in a forked child, buffering writes, etc.
Comment From: zuiderkwast
We can do request forwarding instead of ASK redirects. This can probably be implemented separately even for the classic slot migration.
Making a new command CLUSTER MIGRATE appear atomic to the client is a nice idea.
> Accumulate incoming write commands on the source and transfer them to the target, after it's done.
@madolson Storing all the keys in a buffer and transferring them at once is not what you mean, is it? That can be a lot of data. But we could replicate the slot to the new owner until it has caught up and then do what could be described as a "slot failover" to the new owner. (?)
Another problem is garbage keys that remain in one of the nodes after an aborted migration (failed source or target); cleaning those up is what redis-cli --cluster fix is useful for, but it wouldn't be needed if the new command is waterproof.
Even if we don't make the slot migration atomic, ideally we should be able to resume migration after a failover (both for source failover and target failover) and to resolve the race conditions at SETSLOT NODE. An idea mentioned in #6339 is to propagate the MIGRATING and IMPORTING slot states to the whole cluster, or at least to the nodes in the same shard. I've experimented with simply replicating SETSLOT using the normal replication and letting replicas accept SETSLOT; not even a cluster bus extension is needed for that. After a failover, a node which has anything in the MIGRATING or IMPORTING state only needs to take some special actions (such as the promoted target asking the source node if it's still migrating the slot, or the promoted source node accepting a new slot owner with a lower epoch if the slot is in the migrating state).
Whether all this is done in a separate thread, a forked process or in the main thread seems to be an orthogonal change that can be done later or earlier.
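A very rough sketch of the catch-up-then-switch idea (pure Python with in-memory stand-ins; the backlog is drained in one go here for simplicity, whereas in practice it would be streamed continuously like replication):

```python
# Toy model of "replicate the slot, then do a slot failover": the source keeps
# serving writes during the bulk copy, the accumulated writes are replayed on
# the target, and ownership switches only once the target has caught up.
# All names and structures are illustrative, not Redis internals.

source_slot = {"a": 1, "b": 2}
target_slot = {}
pending_writes = []          # writes accepted on the source during the transfer
owner = "source"

def write(key, value):
    if owner == "source":
        source_slot[key] = value
        pending_writes.append((key, value))   # to be replayed on the target
    else:
        target_slot[key] = value

def migrate_slot():
    global owner
    target_slot.update(source_slot)           # bulk transfer of the whole slot
    while pending_writes:                      # drain the backlog until caught up
        key, value = pending_writes.pop(0)
        target_slot[key] = value
    owner = "target"                           # "slot failover": ownership switches
    source_slot.clear()                        # the source copy can now be purged

write("c", 3)
migrate_slot()
write("d", 4)
print(owner, target_slot)   # target {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```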
Comment From: madolson
@zuiderkwast Your suggestion is basically what I meant. I like to think of it as replication for a given slot. We do a full slot transfer and then we replicate the individual updates that we have accumulated since the start of the transfer.
Internally we have something we call "slot purge", which allows us to incrementally purge all the data in a slot. On a failed migration we use it to purge the slot on the target, and on a successful migration we use it to purge the data on the source.
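For illustration, a minimal sketch of what incremental purging of a slot could look like (a Python simulation with made-up structures; the actual mechanism isn't described in this thread):

```python
# Toy illustration of incrementally purging all keys in a slot, deleting in
# small batches so the purge doesn't block the server for long. The slot index
# and batch size are made-up stand-ins, not actual Redis structures.

keys_by_slot = {42: ["user:1", "user:2", "user:3", "user:4", "user:5"]}
keyspace = {key: "..." for key in keys_by_slot[42]}

def purge_slot(slot, batch_size=2):
    while keys_by_slot.get(slot):
        batch = keys_by_slot[slot][:batch_size]
        for key in batch:
            keyspace.pop(key, None)            # delete one key at a time
        del keys_by_slot[slot][:batch_size]
        # In a real server, control would return to the event loop here
        # before the next batch, so normal traffic keeps being served.

purge_slot(42)
print(keyspace)   # {}
```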