SCAN is supposed to provide the following guarantees (as per https://redis.io/docs/manual/keyspace/):
- A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when an iteration terminates, then at some point SCAN returned it to the user.
- A full iteration never returns any element that was NOT present in the collection from the start to the end of a full iteration. So if an element was removed before the start of an iteration, and is never added back to the collection for all the time an iteration lasts, SCAN ensures that this element will never be returned.
While playing around with a PoC for cluster-wide scan, I realized that the first guarantee can only hold as long as you stay connected to the same node for the entire duration of the scan. If a failover occurs, the replica may have a different seed value for the siphash function, so a cursor previously used on the primary no longer points to the same place. This is less of an issue for SCAN, since you normally address a specific node, but many CME clients transparently handle redirects for SSCAN/HSCAN/ZSCAN during failovers.
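To illustrate the mechanism, here is a minimal sketch (not Redis source; the hash function and seed values are stand-ins) of why a cursor is only meaningful relative to one seed: bucket placement is derived from a seeded hash, so a cursor produced on the primary identifies a different set of keys on a replica whose seed differs.

```c
/* Minimal sketch of why a SCAN cursor is tied to one hash seed.
 * The cursor is essentially a bucket index; bucket placement depends on a
 * seeded hash, so the same cursor covers different keys under a different seed. */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for siphash(key, seed); any seeded hash shows the effect. */
static uint64_t seeded_hash(const char *key, uint64_t seed) {
    uint64_t h = seed ^ 0x9e3779b97f4a7c15ULL;
    for (; *key; key++) h = (h ^ (uint64_t)(unsigned char)*key) * 0x100000001b3ULL;
    return h;
}

int main(void) {
    const uint64_t primary_seed = 0x1234, replica_seed = 0x5678; /* assumed values */
    const uint64_t table_mask = 0xFF;                            /* 256 buckets    */
    const char *key = "user:1000";

    /* The same key lands in different buckets, so a cursor pointing at bucket N
     * on the primary covers a different set of keys on the replica. */
    printf("bucket on primary: %llu\n",
           (unsigned long long)(seeded_hash(key, primary_seed) & table_mask));
    printf("bucket on replica: %llu\n",
           (unsigned long long)(seeded_hash(key, replica_seed) & table_mask));
    return 0;
}
```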
I'm not sure we strictly need to do anything about SCAN, since I don't know how often a SCAN is resumed after a failover. The other commands might need it, though.
I think this is worth addressing, and it could be done in three ways (a rough sketch of option 3 follows the list):
1. Simply update the documentation to add the caveat. This doesn't feel right to me, because most users will not really be aware of this.
2. Add a configuration so that seeds can be set externally. This would allow operators to configure this consistency, but has limited benefit for those who don't know about it.
3. Allow replicas to sync their hash seed from their primaries. This makes their cursors consistent. We can also persist this into the RDB so that it stays accurate. I'm most in favor of this.
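A very rough, hypothetical sketch of what option 3 could look like. None of the names below are real Redis APIs, and it assumes the replica adopts the primary's seed before any keys are loaded (e.g. at the start of a full sync), since changing the seed on a populated keyspace would invalidate every existing bucket assignment.

```c
/* Hedged sketch of option 3: a hypothetical hook where the replica adopts the
 * primary's hash seed before loading the RDB payload during a full sync, so
 * cursors produced on the primary keep pointing at the same buckets. */
#include <stdint.h>
#include <string.h>

#define HASH_SEED_LEN 16

static uint8_t hash_seed[HASH_SEED_LEN]; /* stand-in for the process-wide siphash key */

/* Hypothetically called when the primary advertises its seed, for example as an
 * RDB aux field or a handshake attribute, before any keys are loaded. */
void adopt_primary_hash_seed(const uint8_t *primary_seed) {
    /* Must happen while the keyspace is empty: changing the seed later would
     * move every key to a different bucket. */
    memcpy(hash_seed, primary_seed, HASH_SEED_LEN);
}
```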
Comment From: madolson
There is a second issue that I want to make sure we don't drop: the cursor format is changing for SCAN in Redis 8 with the new per-slot dictionaries. The format is changing from <64 bits for DB cursor> to 00<14 bits for slot><48 bits for DB cursor>. To be specific, we aren't actually showing the bits, just the integer representation of those bits. We should be able to detect versioning issues, i.e. using a cursor from a 7.0 node on an 8.0 node, so I would like to propose we update it to one of the following two options:
1. <2 version bits><14 bits for slot><48 bits for DB cursor>, where we bump the version from 00 -> 01 (a bit-packing sketch of this layout follows the list). This is likely the most backwards compatible. It also allows us to introduce a third version in the future if we want to re-organize the cursor bits. Going to more than 2 bits introduces the risk that users who are storing the cursor as a long long will break.
2. Use a `-` prefix to detect the new format. This gives us the most freedom to change the format further in the future.
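For concreteness, here is a minimal sketch of option 1's bit packing, assuming the <2 version bits><14 slot bits><48 DB cursor bits> layout described above; the constants and helper names are illustrative, not actual Redis source.

```c
/* Sketch of option 1: pack <2 version bits><14 slot bits><48 DB cursor bits>
 * into one 64-bit cursor and reject cursors whose version doesn't match
 * (e.g. a cursor minted by an older node after a failover or upgrade). */
#include <stdint.h>

#define CURSOR_SLOT_BITS    14
#define CURSOR_DB_BITS      48

#define CURSOR_VERSION_SHIFT (CURSOR_SLOT_BITS + CURSOR_DB_BITS)  /* 62 */
#define CURSOR_SLOT_SHIFT    CURSOR_DB_BITS                       /* 48 */
#define CURSOR_DB_MASK       ((1ULL << CURSOR_DB_BITS) - 1)
#define CURSOR_SLOT_MASK     ((1ULL << CURSOR_SLOT_BITS) - 1)

/* Bumped from 0 (the format described in the issue) to 1; version 01 keeps the
 * top bit clear, so the cursor still fits in a signed long long. */
#define CURSOR_VERSION_CURRENT 1ULL

uint64_t cursor_pack(uint16_t slot, uint64_t db_cursor) {
    return (CURSOR_VERSION_CURRENT << CURSOR_VERSION_SHIFT) |
           (((uint64_t)slot & CURSOR_SLOT_MASK) << CURSOR_SLOT_SHIFT) |
           (db_cursor & CURSOR_DB_MASK);
}

/* Returns 0 and fills slot/db_cursor if the version matches, -1 otherwise. */
int cursor_unpack(uint64_t cursor, uint16_t *slot, uint64_t *db_cursor) {
    if ((cursor >> CURSOR_VERSION_SHIFT) != CURSOR_VERSION_CURRENT) return -1;
    *slot = (uint16_t)((cursor >> CURSOR_SLOT_SHIFT) & CURSOR_SLOT_MASK);
    *db_cursor = cursor & CURSOR_DB_MASK;
    return 0;
}
```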
@yossigo, I wasn't able to find our decision from the previous meeting. I know you were concerned with the format, which we agreed we should look into fixing, but I don't recall if you also wanted to try to make the SCAN command stable across failover. (I still want to try to make HSCAN, ZSCAN, and SSCAN stable, so that they might return duplicate items but won't omit items because of the cursor shift.)
Comment From: yossigo
@madolson I'm not very happy about different failover stability guarantees for SCAN and [HZS]SCAN, but I assume the only way to address that involves a much bigger cursor. Did we conclude what most clients expect and how much freedom we have around changing cursors?
Comment From: madolson
> Did we conclude what most clients expect and how much freedom we have around changing cursors?
My recollection is that we said most clients probably don't expect SCAN to be stable, since they are most likely to explicitly connect to one node and run SCAN on it. On disconnect, they should see the error. For [HZS]SCAN it's not necessarily going to be obvious that a disconnect occurred, because clients might handle it transparently.