Hello,
We are running a cluster and are attempting to reshard after adding a few new nodes. I looked through older issues and found no conclusive answer.
All was well until I came across a key with 14,000,000 entries.
During the migration it stopped with a SLOT MOVED error, and failovers occurred.
I had this in the log...
Receiving Node
3026:M 16 Mar 2020 09:40:54.995 # Client id=20 addr=10.1.x.x:35111 fd=34 name= age=6376667 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=1 omem=201326616 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buffer limits.
I am trying to replicate this locally. I raised the output buffer limit and the cluster-node-timeout. After changing the timeout it seems to take longer to fail, but it fails nonetheless.
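For reference, I bumped them roughly along these lines on the node in question (the host, port, and values are just what I used in my local test, not recommendations):

# Temporarily raise the node timeout (ms) and disable the replica output buffer limit
redis-cli -h 192.168.56.201 -p 7005 CONFIG SET cluster-node-timeout 60000
redis-cli -h 192.168.56.201 -p 7005 CONFIG SET client-output-buffer-limit "slave 0 0 0"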
Following attempts yield...
Node 192.168.56.201:7005 replied with error:
ERR Target instance replied with error: BUSYKEY Target key name already exists.
Oddly, it keeps giving that error whenever I attempt to migrate the slot to a node I had previously tried. If it is a node I haven't tried yet, it seems to run for the full cluster timeout duration before failing.
Are there any recommendations for safely moving slots containing large keys?
What would need to be adjusted? Just the output buffer and the cluster-node-timeout, even if only temporarily until the key is moved?
I was using the redis-cli reshard option. The end result was that the slot did move to the new node, but only with the keys that were migrated before the large key was attempted. The keys in that slot (and in subsequent slots) after the large key are not migrated once the error occurs.
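For context, the reshard was kicked off with the standard redis-cli command, roughly like this (the node IDs and slot count below are placeholders):

# Start a reshard against the cluster, moving a batch of slots to the new node
redis-cli --cluster reshard 192.168.56.201:7005 \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots 1000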
Thanks, Ted
Comment From: Tbone542
Update:
After multiple attempts, I believe I have determined the underlying cause.
It appears the cluster-node-timeout was the underlying issue and not the output buffer.
The output buffer error seems to have arisen from the failovers triggered by the inadequate timeout while migrating the large key.
One other nuance is that occasionally, when the migration fails, the target node still holds remnants of the key it was attempting to receive.
Subsequent attempts to move the same slot again seem to result in...
Node 192.168.56.201:7005 replied with error:
ERR Target instance replied with error: BUSYKEY Target key name already exists.
It appears the original key is still intact on the sending node. The only remedy I have found so far is to connect with redis-cli to the receiving node (not in cluster mode!) and delete the key it thinks it has. Because the cluster still believes the slot belongs to the original node, it is hard to reach the key on the target node, even though that node internally thinks it has it.
Since I knew the offending key, I set the slot to the receiving node via its local redis-cli. That let me delete the partial key on the receiving node. It has to be done quickly, as the slot will switch back to the actual owner during routine gossip.
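Concretely, the workaround looked something like this (the slot number, node ID, and key name are placeholders):

# Connect directly to the receiving node; note no -c, so the client is not redirected
redis-cli -h 192.168.56.201 -p 7005
# At the prompt, claim the slot locally and delete the leftover key before gossip reverts it
CLUSTER SETSLOT 1234 NODE <receiving-node-id>
DEL <large-key-name>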
Has anyone experienced anything similar and does my explanation sound plausible?
Is there a better way to delete keys from a node which, according to nodes.conf, shouldn't have them?
-Ted