My colleagues and I at MediaFire have found that we sometimes have to send many (up to 100 or so) CLUSTER FAILOVER commands to a slave before a successful promotion happens.
I have investigated this and believe I have found the cause: the cluster slave synchronization process relies on the master's dataset remaining static while clients are paused, but currently keys can still expire during the pause. If keys expire during the manual failover process, the failover fails.
In more detail: When a CLUSTER FAILOVER command is sent to a slave, the slave will ask the master (via MFSTART) to pause all clients. The assumption is that the dataset cannot change while clients are paused, so the master_repl_offset should remain constant. The master will then send the offset to the slave and the slave will remember this offset and wait until its own replication offset matches this one to begin the failover. However, when keys are expiring on the master, they will be flushed out actively, even while clients are paused. This results in DEL commands being synthesized to the slaves and the master_repl_offset incrementing. This means that the slave will overshoot the recorded offset while replicating these synthesized deletes and so never think it has caught up with the most recent master changes.
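To make the failure mode concrete, here is a rough, compilable sketch of the slave-side check described above. It is modeled on clusterHandleManualFailover() in cluster.c, but the types here are stand-ins and the structure is illustrative rather than an exact copy of the source:

```c
#include <stdint.h>

/* Minimal stand-ins for the relevant clusterState fields from
 * cluster.h, just enough to show the logic. */
typedef struct clusterState {
    int64_t mf_end;           /* Manual failover deadline (0 = none in progress). */
    int mf_can_start;         /* Non-zero once the failover may begin. */
    int64_t mf_master_offset; /* Master offset reported after the MFSTART pause. */
} clusterState;

static clusterState cluster;

/* Stub: in the real server this returns the slave's current
 * replication offset. */
static int64_t replicationGetSlaveOffset(void) { return 0; }

static void clusterHandleManualFailover(void) {
    if (cluster.mf_end == 0 || cluster.mf_can_start) return;
    if (cluster.mf_master_offset == 0) return; /* Offset not received yet. */

    /* The failover starts only on an EXACT offset match. If active
     * expiration synthesizes DELs while clients are paused, the slave's
     * offset jumps past mf_master_offset, this test never becomes true,
     * and the promotion never happens. */
    if (replicationGetSlaveOffset() == cluster.mf_master_offset)
        cluster.mf_can_start = 1;
}
```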
My initial research into the code suggests that a similar problem most likely affects evictions, though it would be much rarer: evictions only happen when processing a command, and only commands from slaves are accepted while clients are paused. I believe commands from slaves are rare, but they could occur if, for instance, two slaves were both told to fail over manually.
I believe the simplest solution to this problem would be to stop evictions and expirations from occurring while clients are paused, as this keeps the dataset truly static and allows a slave to catch up reliably with the master.
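As a rough sketch of what that guard could look like (assuming a helper along the lines of the existing clientsArePaused() to report the pause state; exact signatures and placement in expire.c and evict.c may differ):

```c
#define C_OK 0

/* Stub: the real helper (clientsArePaused() in networking.c) reports
 * whether a client pause, e.g. from MFSTART, is in effect. */
static int clientsArePaused(void) { return 0; }

/* Active expiration: bail out early while clients are paused, so no
 * DEL commands are synthesized and master_repl_offset stays constant. */
static void activeExpireCycle(int type) {
    if (clientsArePaused()) return;
    (void)type;
    /* ... normal incremental expiration work ... */
}

/* Eviction: report success without freeing anything while paused, so
 * no dataset change (or replicated DEL) can move the offset either. */
static int freeMemoryIfNeeded(void) {
    if (clientsArePaused()) return C_OK;
    /* ... normal eviction work ... */
    return C_OK;
}
```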
Alternatively, slaves could be responsible for expiring keys themselves, so that DEL commands are not synthesized, but this comes at added computational cost, as well as desynchronizing expired keys amongst the replicas. Also, this strategy would not work for evictions, as evicting a key modifies the dataset in a way that changes what clients observe.
I have been unable to set up a simple way to duplicate this but am working on it. It currently happens on our production cluster.
Comment From: zintrepid
I found a simple way to reproduce:
- Set up a cluster with two nodes: a master and a slave.
- Add keys in a loop with a short expiration time. For my test I ran:
```sh
while true; do echo "SET $(dd if=/dev/urandom count=1 bs=4 2>/dev/null | xxd -p) examplekey EX 5"; done | redis-cli --pipe
```
- Let this run until the total number of keys stabilizes. This took about 5 seconds for my test.
- Run `CLUSTER FAILOVER` on the slave.
- The slave will not be promoted.
I tested the PR https://github.com/antirez/redis/pull/4028 and it does resolve the issue.
Comment From: antirez
Wow, that's quite a cool finding, and it totally looks like something to fix ASAP and backport everywhere. Thanks. I'll check your PR tomorrow and act accordingly... but just preventing keys from expiring while clients are paused looks like a sensible solution to me at first glance, so I'll probably merge your PR as it is.
Comment From: zintrepid
@antirez Thanks. I agree that if the solution is acceptable, it should probably be back-ported. I would be happy to help with that if so.
Comment From: antirez
@zintrepid thank you, I merged your pull request, much appreciated. Could you please review the change I made to the return value of freeMemoryIfNeeded()? The commit references this issue.
Comment From: nthnfaustine
Hello, thanks for sharing these findings. I am facing the same issue, since the write traffic on my Redis cluster is high and some keys have short expiry times.
Can you share which Redis version includes this change? TIA
Comment From: zintrepid
I see this issue has been left open. I assume this is because I missed the comment from @antirez asking for a review. The change looks good to me and has been released for several years now, so closing this issue. My apologies for not replying sooner!
@nthnfaustine I believe this change was introduced in Redis 4.0. If you are seeing the issue in a later version, I suggest opening a new issue.