Redis Writable replica may use the data of expired keys

for example, using SUNIONSTORE:

master:

127.0.0.1:6379> DEBUG set-active-expire 0
OK
127.0.0.1:6379> SADD s1 a b c
(integer) 3
127.0.0.1:6379> SADD s2 1 2 3
(integer) 3
127.0.0.1:6379> EXPIRE s1 5
(integer) 1

replica:

127.0.0.1:12345> REPLICAOF localhost 6379
OK
127.0.0.1:12345> CONFIG SET replica-read-only no
OK
127.0.0.1:12345> DEBUG set-active-expire 0
OK
127.0.0.1:12345> SMEMBERS s1
1) "b"
2) "a"
3) "c"
127.0.0.1:12345> SMEMBERS s2
1) "1"
2) "2"
3) "3"
127.0.0.1:12345> DEBUG sleep 6
OK
(6.00s)
127.0.0.1:12345> SMEMBERS s1
(empty array)
127.0.0.1:12345> SUNIONSTORE s3 s1 s2
(integer) 6
127.0.0.1:12345> SMEMBERS s3
1) "1"
2) "c"
3) "3"
4) "b"
5) "a"
6) "2"

The problem is that lookupKeyRead behaves differently than lookupKeyWrite in the context of writable replicas with expiration. Actually, I'm not even sure why the source sets are opened in write-mode.. Is there a reason for that? The same problem affects SUNIONSTORE, SDIFFSTORE, SINTERSTORE, ZUNIONSTORE and ZINTERSTORE. For those commands it's enough to just use lookupKeyRead but the occurs in other commands that modify existing value (e.g. INCR, APPEND, etc.):

master:

127.0.0.1:6379> DEBUG set-active-expire 0
OK
127.0.0.1:6379> SET c 100
OK
127.0.0.1:6379> EXPIRE c 5
(integer) 1

replica:

127.0.0.1:12345> REPLICAOF localhost 6379
OK
127.0.0.1:12345> DEBUG set-active-expire 0
OK
127.0.0.1:12345> CONFIG SET replica-read-only no
OK
127.0.0.1:12345> GET c
"100"
127.0.0.1:12345> DEBUG sleep 6
OK
(6.00s)
127.0.0.1:12345> GET c
(nil)
127.0.0.1:12345> INCR c
(integer) 101

The solution is either to mirror that logic from lookupKeyRead into lookupKeyWrite, or eliminate the latter and share code.

Comment From: oranagra

I now realize that the problem that's causing INCR and SUNIONSTORE to misbehave is actually not due to calling lookupKeyWrite or lookupKeyRead, it's actually the check for CMD_READONLY in lookupKeyRead. i think we may want to remove that check for CMD_READONLY, and then replicate that logic to lookupKeyWrite, or unify these to one function, so that write commands are also unable to use the expired data.

Comment From: valentinogeron

It seems that the issue is with the definition of writing replica. According to the code, a writing replica can modify a key but does not lazy expire keys which are logically already expired when it accesses them. When we access a key for reading, and the key is logically expired it will return with an indication that the key does not exist only if it is a read-only command. In the case of write commands, it will retrieve the key from the DB dict.

We cannot just do the same as with read-only commands, because if we will return NULL, the next call to dbAdd will hit the assertion because the key already exists (INCR command for example).

We have several options: 1. writing replica would be able to lazy expire keys when accessing them. 2. read-only commands would return data of already logically expired keys so there will be symmetry between read and write commands.

I think that option 1 is more suitable because it reflects the state as it is in the master.

Comment From: oranagra

@valentinogeron in 1) do you mean that writing replica will lazy expire keys only when running a write command, right? i think i'm ok with such a change. either way since it's a writable replica doing a write command can cause inconsistency of data between the master and replica.

p.s. SUNION goes to great lengths to use lookupKeyWrite in the STORE mode, and use lookupKeyRead in the non-STORE mode. maybe that's a mistake? maybe when reading from a key (without the intent to modify it), it must use lookupKeyRead. or, after some refactoring planned in https://github.com/redis/redis/issues/7475, the return indication of the existence of a logically expired key should be the purpose of the lookup and not in what context it was performed.

Comment From: valentinogeron

@oranagra - I agree with when reading from a key (without the intent to modify it), it must use lookupKeyRead. And regarding the issue, so do we want to go with the first option?

Comment From: oranagra

i'm gonna move it to "Next major backlog" milestone. i think this requires a big refactoring effort and we're not entirely sure what are the consequences for the changes we'll make and what are the reasons the code behaves as it is today. so these changes are too dangerous to make now so close to an RC, and also if we'll want to change a behavior, doing it in a major version is better.

Comment From: zuiderkwast

I like idea 1 of @valentinogeron. I think the change is very simple:

 int expireIfNeeded(redisDb *db, robj *key) {
     if (!keyIsExpired(db,key)) return 0;

     /* If we are running in the context of a slave, instead of
      * evicting the expired key from the database, we return ASAP:
      * the slave key expiration is controlled by the master that will
      * send us synthesized DEL operations for expired keys.
      *
      * Still we try to return the right information to the caller,
      * that is, 0 if we think the key should be still valid, 1 if
      * we think the key is expired at this time. */
-    if (server.masterhost != NULL) return 1;
+    if (server.masterhost != NULL && server.repl_slave_ro) return 1;

Then, the rest will work as expected. I can fix this as part of #9547. If you like this solution, put a :+1:; else please comment.

Comment From: oranagra

I'm OK with suggestion 1 (only in write commands). I.e. When a replica executes a read command on an expired key we keep it and return nil, but when it looks for it for a write command we behave the same as a master does and delete it on the spot.

This of course also means that a later command that's propagated from the master can seriously misbehave (either fail, or generate a different result than it did on the master), but that's anyway the case of writable replicas.

To be honest, I never really understood the use case for writable replicas, and always preferred to ignore the implications of my changes on these since they're messed up anyway. But maybe we need to first learn better about the use cases before we make a decision.

@itamarhaber do you happen to know, or can direct us to someone who does?

Comment From: zuiderkwast

@oranagra Why only for writes? As you say, just one write would make it inconsistent with the primary and then it doesn't matter if the reads touch the expire or not, does it?

One scenario is that the replica is unintentionally writable (misconfigured), but is not used for writes anyway. I guess we wouldn't want to break this case.

Regarding misbehaviour towards sub-replicas, I'd say we're safe if we make sure that writes propagated from the primary are propagated to sub-replicas even if they're deleted or changed on the writable replica. If we fulfil this, I can't see how it matters if the replica keeps expired keys or not.

There is a use case described under https://redis.io/topics/replication#read-only-replica:

You may wonder why it is possible to revert the read-only setting and have replica instances that can be targeted by write operations. While those writes will be discarded if the replica and the master resynchronize or if the replica is restarted, there are a few legitimate use case for storing ephemeral data in writable replicas.

For example computing slow Set or Sorted set operations and storing them into local keys is an use case for writable replicas that was observed multiple times.

Comment From: oranagra

@oranagra Why only for writes? As you say, just one write would make it inconsistent with the primary and then it doesn't matter if the reads touch the expire or not, does it?

I wanted to only mess up keys that are involved in write operations, and leave others (e.g. ones used by GET) consistent with the master.

Regarding misbehaviour towards sub-replicas, I'd say we're safe if we make sure that writes propagated from the primary are propagated to sub-replicas even if they're deleted or changed on the writable replica. If we fulfil this, I can't see how it matters if the replica keeps expired keys or not.

that's the way it's handled already, so we're good. but also, i'm not sure we care about a case where someone mixes both writable replicas and chained ones together.

For example computing slow Set or Sorted set operations and storing them into local keys is an use case for writable replicas that was observed multiple times.

Ok, good, now i understand it. @itamarhaber gave me another example yesterday: When someone needed an SINTERCARD and till now had to use SINTERSTORE, then SCARD and delete the key. Or the till recently missing ZINTER (user had to use ZINTERSTORE).

So bottom line, writable replica is used to write temporary keys (possibly using Lua), and then delete them.

In that light, we don't even wanna have the writable replica mess up the keys it reads from (e.g. the source keys for ZINTERSTORE), just the ones it modifies.

This makes it even odder that Pieter explicitly accessed the source keys in SUNIONSTORE with lookupKeyWrite, which enabled it to read from expired keys.

Ideally i would have wanted to create a new lookupKey that takes s purpose enum (READ / WRITE / MODIFY), and completely replace all the calls to these functions. i would have wanted that if we apply some old PR or merge this change to some fork, the use of the old functions would error. but there are far too many usages of these functions and the vast majority of them are ok and need no change (both for read and for write commands, e.g. HGET and HSET).

so instead I suggest we only fix the multi-key commands, the ones that read from one key and write to another. I do think all the logic should be in lookupKey (taking some flags), and the various wrappers to it are used to pass the correct flags.

we can handle it as part of #9547, or as a preceding PR so that each has a clear purpose. either way, i'd very much like to sort out that mess for the next version.

Comment From: zuiderkwast

Why is it so important to stay consistent with the master regarding the expired keys? Deleting logically expired keys is a form of garbage collection. These keys should never be visible to any user anyway. I think the replicas can just delete them when they're found. It makes the logic much simpler.

Comment From: zuiderkwast

OK, so if we do it only for writes, the condition for returning 1 in expireIfNeeded without deleting the expired key needs to be the same as the condition in lookupKeyReadWithFlags which compensates for this, right? This is the condition in lookupKeyReadWithFlags:

    if (server.current_client &&
        server.current_client != server.master &&
        server.current_client->cmd &&
        server.current_client->cmd->flags & CMD_READONLY) return 1;

... or we could do a negated check for CMD_WRITE instead, but it should be the same in both places. Does this make sense?

Comment From: oranagra

Why is it so important to stay consistent with the master regarding the expired keys? Deleting logically expired keys is a form of garbage collection. These keys should never be visible to any user anyway. I think the replicas can just delete them when they're found. It makes the logic much simpler.

imagine a volatile key with a value of 100. a moment before it is expired on the master, an INCR is performed which changes it to 101, a moment later it is already logically expired, but the master will not notice that for a while. however, before this INCR reaches the replica, a GET command is executed on the replica, finds out the key is logically expired and deletes it. then the INCR arrives and creates a new non-volatile key with the value of 1. sometime later a DEL will be sent from the master, but until that happens the replica has non-volatile key with value of 1, rather than an already logically expired key with the value of 101.

... or we could do a negated check for CMD_WRITE instead, but it should be the same in both places. Does this make sense?

i think we concluded that the lookupKey call should be done according to the purpose of what we intend to do with that key. and the READONLY flag test is completely unnecessary. we could make an assertion to check that LookupKeyWrite is not called from RO commands if we want.

Comment From: oranagra

p.s. either way, i think all the logic should be moved to lookupKey (same as you did in the other PR), and all the wrappers are just used for convenience, and pass various flags to lookupKey

Comment From: zuiderkwast

Thanks for the proof that we need consistency with master of expired keys. I'm convinced. :-)

Comment From: zuiderkwast

Regarding use cases for writable replicas, I saw this upgrade scenario in the mailing list described by @sundb ([redis-db] Re: Upgrade Redis from 2.8.19 to 6.2.5 on CentOS 7, 30th of September) a use case which I think is worth mentioning in this thread:

Create a slave with the latest version, then execute slaveof [master_ip] [masterport] to synchronize the data and use info replication to see the slave's offset.
After sync end, set the salve's slave-read-only to no.
Directing online traffic to this slave.
When the old master qps is 0, stop the master and upgrade.
You can then start again from step 1, sync data from the slave to the restarted master.

I haven't considered the implications very much but I think there is a (small, theoretical) risk of inconsistency due to race conditions when using this method. Example scenario:

  client 1     client 2      load balancer   master    writable replica
     |            |             |              |             |
     | incr k     |             |              |             |
     |------------------------->| incr k       |             |
     |            |             |------------->|             |
     |            |             |              |             |
     |            |         [Config: send]     |             |
     |            |         [traffic to  ]     |             |
     |            |         [replica     ]     |             |
     |            | del k       |              |             |
     |            |------------>|  del k       |             |
     |            |             |--------------------------->|
     |            |             |              |             |
     |            |             |              | replicate   |
     |            |             |              | incr k      |
     |            |             |              |------------>|
     |            |             |              |             |

I think the alternative to this is to stop serving clients for a short time during upgrade. What is the recommended way of doing (non-cluster) upgrades anyway? I haven't found any docs for it.

Comment From: oranagra

I think any scenario where writable replica is used can cause data inconsistency (which is worse if it gets promoted). The only good way to upgrade is with a proper switchover, which involves pausing writes to the master and drain the pipeline to the replica before promoting it.