Describe the bug
During a re-shard operation, if the key to migrate is a hash, then depending on the size and number of fields of that hash, the MIGRATE operation can time out and cause Redis to fail over.
In our case, we have a hash of size 300 MB with 3 million fields.
The main problem is that MIGRATE is a blocking command: it blocks the primary, so Redis thinks the primary is down and triggers a failover.
StackExchange.Redis error logs:
TimeStamp: 2024-03-08T16:53:32.534223Z
Timeout awaiting response (outbound=0KiB, inbound=0KiB, 4100ms elapsed, timeout is 4000ms), command=MIGRATE, next: some_random_key, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 2, cur-in: 0, sync-ops: 0, async-ops: 2193844, serverEndpoint: 172.20.0.6:6380, conn-sec: 670.01, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: mtcache000002(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 9271, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=6635139), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts),
We get two such errors before the Redis failover; in Redis, cluster-node-timeout is set to 5 seconds.
Redis Logs:
16:53:35.041 * FAIL message received from 1a52537ed371931ec4436e02afdaae61fd061c17 about 42b37d2039622543514545a6cba3807e4db0b776
16:53:35.133 # Start of election delayed for 805 milliseconds (rank #0, offset 1269560724769).
16:53:36.736 # Configuration change detected. Reconfiguring myself as a replica of 42ac8718f2670e442fe598923f6069a66148e717
16:53:36.739 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
SLOWLOG output shows that the MIGRATE took 9.3 seconds.
We are looking for:
- How to migrate large hashes
- How to prevent a MIGRATE timeout from causing a primary failover
Comment From: javedsha
@antirez @oranagra - any advice on the above issue?
Comment From: oranagra
I think this is a known limitation that doesn't yet have a solution. There are some ideas planned, but for the time being I think you have to divide the key. @madolson correct me if I'm wrong.
Comment From: javedsha
@oranagra - we don't mind losing the key during migration, but we want to avoid the primary going down (and hence a failover) because MIGRATE is blocking it. Will this work?
migrate ip port big_hash 0 1000 -> MIGRATE timeout set to 1 second, with cluster-node-timeout set to 5 seconds
Note: we cannot split the key on the client side; we don't own the client.
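For reference, the MIGRATE syntax being discussed is:

    MIGRATE host port key destination-db timeout [COPY] [REPLACE]
            [AUTH password | AUTH2 username password] [KEYS key [key ...]]

Per the Redis documentation, the timeout is the maximum idle communication time with the destination, in milliseconds; it is not a cap on total execution time, so a 1-second timeout does not by itself bound how long the source stays blocked while serializing a large value.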
Comment From: oranagra
related to #3396 and #11395
Comment From: javedsha
Thanks @oranagra for the related links. After reading those, just to be sure: setting a lower timeout on MIGRATE won't work, correct? The only way to stop the failover is to increase 'cluster-node-timeout'?
Comment From: oranagra
Or break the keys into smaller pieces, which I understand isn't an option in your case.
Comment From: madolson
Yeah, oranagra is right about the current limitations. We've talked about various approaches to reduce the impact on the source and target, but haven't made a lot of progress towards fixing it. The cluster-node-timeout solution makes sense. You could also make some minor changes to redis-cli to check whether it's a large hash, and if it is, move it piece by piece using HSCAN instead of DUMP + RESTORE.
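For illustration, a minimal sketch of that per-field idea (not a redis-cli patch; just the same approach expressed with redis-py, with placeholder hosts and key name):

    # Copy a large hash field-by-field with HSCAN + HSET instead of one
    # blocking DUMP/RESTORE. Note: this is not atomic; writes that land on
    # the source while the copy runs are not captured.
    import redis

    src = redis.Redis(host="source-host", port=6379)   # placeholder endpoints
    dst = redis.Redis(host="target-host", port=6379)

    def copy_big_hash(key, batch_size=1000):
        cursor = 0
        while True:
            # HSCAN returns one small batch per call, so neither server is
            # blocked for the whole key.
            cursor, fields = src.hscan(key, cursor=cursor, count=batch_size)
            if fields:
                dst.hset(key, mapping=fields)
            if cursor == 0:
                break
        ttl_ms = src.pttl(key)
        if ttl_ms > 0:
            dst.pexpire(key, ttl_ms)   # preserve the TTL, if any
        src.delete(key)                # drop the source copy once done

    copy_big_hash("big_hash")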
@PingXie FYI, this is related to what we were discussing about slot migration causing target nodes to time out if we stick with the restore-based approach.
Comment From: PingXie
we have a hash of size 300 MB with 3 million fields.
In this particular case, I think it is the "3 million fields" that played a major role in "that migrate took 9.3 seconds".
we don't mind losing the key during migration
Looks like you prioritize availability over durability. In addition to what @madolson suggested above, another short-term mitigation could be checking the number of fields in the hash key before migrating it or just dropping it if it contains too many fields.
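A rough sketch of that check (illustrative only; the threshold, endpoints, and the decision to skip are placeholders, using redis-py and a raw MIGRATE call):

    # Only MIGRATE hashes below a field-count threshold; skip (or drop) the rest.
    import redis

    src = redis.Redis(host="source-host", port=6379)   # placeholder endpoint
    MAX_FIELDS = 100_000                               # illustrative threshold

    def migrate_or_skip(key, target_host, target_port):
        if src.type(key) == b"hash" and src.hlen(key) > MAX_FIELDS:
            # Too many fields: a blocking MIGRATE would risk a failover.
            # Skip it, copy it field-by-field (see the HSCAN sketch above),
            # or DEL it if losing the key is acceptable.
            return False
        # MIGRATE host port key destination-db timeout(ms)
        src.execute_command("MIGRATE", target_host, target_port, key, 0, 5000)
        return True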
The only way to stop the failover is to increase the 'cluster-node-timeout'
This would increase the failover time in case of a true failure. If you have a stable networking environment where you don't expect failures most of the time, I wouldn't suggest increasing it.
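For completeness, if someone does accept that tradeoff, the setting can be checked and raised at runtime on each node (the 15000 ms value below is just an example):

    redis-cli -h <node-ip> -p 6380 CONFIG GET cluster-node-timeout
    redis-cli -h <node-ip> -p 6380 CONFIG SET cluster-node-timeout 15000
    redis-cli -h <node-ip> -p 6380 CONFIG REWRITE    # persist to redis.conf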
@PingXie FYI, this is related to what we were discussing about slot migration causing target nodes to time out if we stick with the restore-based approach.
Agreed. @oranagra FYI, @madolson and I discussed another option (in the context of atomic slot migration) which essentially implements slot-level replication. Here is the high-level flow:
1. source (parent) forks a child with the set of slots to migrate
2. child streams the slots to be migrated in the RDB format (same as replication)
3. target needs to support non-blocking load of the RDB stream (think of co-routines)
4. source (parent) starts capturing updates to the migrating slots right after the fork
5. child completes streaming
6. source (parent) pauses clients writing to these slots
7. source replicates the captured updates to the target
8. source starts the slot ownership transfer process (depending on cluster v1 vs v2, we could take different paths)
9. source unblocks the paused clients with -MOVED <target>
Any failure on the target before completing step 8 (the ownership transfer) would abort the migration process on the source.
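A rough orchestration sketch of this proposed flow (purely illustrative Python; none of the objects or methods below exist in Redis today, they simply name the steps above):

    # Illustrative only: "source", "target" and all of their methods are
    # hypothetical placeholders for the numbered steps above, not real APIs.
    def migrate_slots(source, target, slots):
        child = source.fork_slot_snapshot(slots)            # step 1
        backlog = source.start_capturing_writes(slots)      # step 4, right after the fork
        try:
            # steps 2-3 and 5: child streams the slots as RDB; the target
            # loads the stream without blocking its event loop
            target.load_rdb_stream_nonblocking(child.rdb_stream())
            source.pause_writes(slots)                      # step 6
            target.apply(backlog.drain())                   # step 7
            source.transfer_slot_ownership(slots, target)   # step 8 (cluster v1/v2 specific)
        except Exception:
            # any target failure before step 8 completes aborts the migration
            source.abort_migration(slots)
            raise
        finally:
            # step 9: unpause clients, redirecting them with -MOVED <target>
            source.unpause_writes(slots, redirect_to=target)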
Comment From: enjoy-binbin
source (parent) forks a child with a set of slots to migrate ... source unblocks the paused clients with -MOVED <target>
In our fork (Tencent Cloud), our cluster expansion uses a similar approach: we generate a slot RDB for the corresponding slot range for transmission, and the cluster nodes are not online until the RDB is loaded.
Comment From: PingXie
the cluster nodes are not online until the RDB is loaded
Did you mean "slots"? Nodes will need to remain online as other slots still need to be served. This is also why the target needs to do nonblocking load in this case.