Hello,
We recently had replicas crashing due to https://github.com/RedisTimeSeries/RedisTimeSeries/issues/1343. During recovery, we manually copied our backed-up RDB files under the nodes and brought up the replicas that were crashing. After this, we killed one of the existing masters to force a replica to become the new master (NOTE: this replica had completed a FULL SYNC with the existing master before we killed it). All good so far and everything came up fine, but with some data loss (some keys went missing) after the master switch.
We repopulated the lost data, which brought the cluster back to a functional state, but the following weekend's restart of the cluster errored out with:
[WARNING] Node IP:6381 has slots in importing state 5461,5462,5463...
[WARNING] Node IP:6381 has slots in open state 5492,...
In other words, the cluster comes back up, but somehow one master has keys that belong to slots owned by another master. Q.1: Is there a clean way to delete these keys in a slot not owned by the master holding them? Q.2: Why would this happen, given that any replica that has completed a FULL RESYNC with its master should have the same state? Note that the crash happened on a weekend when no new data was being added.
Logs noticed at startup before the issue popped up:
43335:M 03 Dec 2022 11:03:43.351 # I have keys for slot 5461, but the slot is assigned to another node. Setting it to importing state.
43335:M 03 Dec 2022 11:03:43.351 # I have keys for slot 5462, but the slot is assigned to another node. Setting it to importing state.
etc...
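For context on why a node can say "I have keys for slot 5461": every key maps deterministically to one of 16384 slots via CRC16 (XModem variant) mod 16384, honoring {hash tags}. A minimal sketch of that mapping, with illustrative function names (not redis-py API):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XModem variant), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to its Redis Cluster hash slot, honoring {hash tags}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # only a non-empty tag counts
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

print(key_slot("foo"))  # 12182 in the standard mapping
```

So whichever node holds a key is expected to own that key's slot; after the recovery described above, that invariant was broken, which is what the startup warnings report.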
Comment From: pgullipa
@oranagra Any ideas? More important for us to find a way to clean the keys (Q.1) so we do not have this issue on every cluster restart.
One approach we can think of is to set the master that actually owns the slots to migrating state, delete the keys from the master that does not own them, and then set the correct master as the owner again, but that does not sound like a clean idea.
Comment From: oranagra
how about using CLUSTER GETKEYSINSLOT to get the list of keys, and then using a Lua script to delete them?
note that you can call GETKEYSINSLOT from the script, but then you won't be able to do the deletions from that same script (lua_random_dirty).
maybe @madolson has other ideas or advice.
Comment From: pgullipa
@oranagra @madolson Yes, deletion is the problem. We can get the keys, but the deletion goes to the correct master, not to the master that holds the keys to be deleted.
Comment From: oranagra
you'll have to connect to that master directly (not via a cluster-aware client), but even then the DEL command will be refused, and a Lua script is the way around it.
come to think of it, you can keep using your normal client by passing a fake key name to the EVAL command.
e.g.
EVAL "redis.call('del', ARGV[1])" 1 <fake_key> <key_to_delete>
fake_key would be some key belonging to the shard you need to clean.
Comment From: pgullipa
Hi @oranagra, using the redis-py client (which seems to be cluster aware), the above does not work:
from redis import Redis
client = Redis(host=MASTER_IP, port=PORT)
client.eval("return redis.call('DEL', ARGV[1])", 1, "b", "foo")
Here b is a key owned by the node and foo is a key owned by another master.
It returns:
redis.exceptions.ResponseError: Error running script (call to f_6eeafcd47cdc6ace492f1e8dc70c47b4258bf161): @user_script:1: @user_script: 1: Lua script attempted to access a non local key in a cluster node
Any ideas? Is there a cluster unaware client library we can use for this?
Comment From: oranagra
Sorry, I forgot about that check. Well, I don't currently see any way to achieve that without building a specially patched Redis (maybe flagging the client as MASTER).
Comment From: judeng
An unproven method, hope it can help you:
1. CLUSTER SETSLOT 5461 IMPORTING <node-id> (run on the node holding the orphaned keys, with <node-id> being the slot's real owner)
2. CLUSTER GETKEYSINSLOT 5461 <count>
3. send the ASKING command
4. DEL those dirty keys
5. CLUSTER SETSLOT 5461 STABLE
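The sequence above can be sketched as the raw commands to send directly to the node holding the orphaned keys (via a plain, non-cluster-aware connection). This helper only builds the command list; actually sending each tuple (e.g. with redis-py's `client.execute_command(*cmd)` on a direct `Redis` connection) is left to the operator, and the function name is illustrative. Note that ASKING is a one-shot flag, so it must precede each DEL:

```python
def orphaned_slot_cleanup_commands(slot, owner_id, keys):
    """Build the command sequence to purge keys from a slot this node
    does not own: flip the slot to IMPORTING, delete each key under
    ASKING, then return the slot to STABLE."""
    cmds = [("CLUSTER", "SETSLOT", str(slot), "IMPORTING", owner_id)]
    for key in keys:
        cmds.append(("ASKING",))   # permits the next command on an importing slot
        cmds.append(("DEL", key))
    cmds.append(("CLUSTER", "SETSLOT", str(slot), "STABLE"))
    return cmds
```

The key list itself would come from step 2 (CLUSTER GETKEYSINSLOT) on the same node.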
Comment From: madolson
I think the outlined approach should work. In either case, we should add a tool that makes it easy to delete data from an orphaned slot, OR automatically delete data detected in orphaned slots.
Comment From: pgullipa
Thanks @judeng That worked. Any idea why this could have happened in the first place?