Today, Redis propagates a relative expire time (SET EX 100, expire 100 seconds from now) to replicas as a relative expire, but stores it in the AOF and RDB as an absolute time. The motivation for this (https://github.com/redis/redis/pull/5171#issuecomment-409553266) is that clocks might be skewed between two nodes, so we want keys to expire at roughly the same time on both. Since we expect AOF and RDB to be loaded much later, we don't have that constraint there. This introduces a couple of weird notions:

  1. A replica might retain the data much longer than the primary, since it could have a significant replication lag (for example, right after a full sync).
  2. Data is still sometimes replicated absolutely. If part of the application is using relative time, and part is using absolute, there will be odd discrepancies between how long data is living in the replica.
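To make the first point concrete, here is a small back-of-the-envelope sketch (not Redis code). The function names and the simplified model are assumptions for illustration: all times are in seconds on the primary's clock, and `clock_skew` is the replica's clock minus the primary's clock.

```python
def replica_expiry_relative(set_time, ttl, lag):
    """Relative propagation: the replica restarts the TTL when the
    command arrives, so replication lag extends the key's lifetime."""
    return set_time + lag + ttl


def replica_expiry_absolute(set_time, ttl, lag, clock_skew=0):
    """Absolute propagation: the deadline is fixed on the primary, so lag
    does not shift it (lag is accepted only to show it is unused).
    A replica whose clock runs ahead expires the key earlier."""
    return set_time + ttl - clock_skew


# Key set at t=0 with a 100 second TTL; the replica is 30 seconds behind.
print(replica_expiry_relative(0, 100, 30))               # 130
print(replica_expiry_absolute(0, 100, 30))               # 100
print(replica_expiry_absolute(0, 100, 30, clock_skew=2)) # 98
```

Under this model, relative propagation trades an error equal to the replication lag for an error equal to the clock skew, which is usually much smaller on machines with synchronized clocks.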

There is also a second weird issue, which is that expire might cause the replica to display a view of the data that never existed on the primary. Some workloads rely on sending some requests to the primary and some to the replica, so it's weird that the replica may be "ahead" of the primary because it logically expired a key.

My suggestions:

  1. Always replicate the data as an absolute time. This should solve the two issues mentioned above.
  2. Have a flag that makes expire "linearizable" across the cluster. Replicas will no longer make independent decisions about whether to show data, so you will not have a view on the replica that is inconsistent with the primary. I suggest a config for this, as I can imagine some workloads that don't care and would prefer the data be knocked out of the cache.
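A toy model of what the "linearizable" flag could mean on a replica is sketched below. This is not Redis code: `ToyReplica` and `linearizable_expire` are hypothetical names, and the sketch assumes the primary eventually sends an explicit delete for expired keys.

```python
class ToyReplica:
    """Toy replica; linearizable_expire is a hypothetical flag name."""

    def __init__(self, linearizable_expire):
        self.linearizable_expire = linearizable_expire
        self.data = {}  # key -> (value, absolute deadline or None)

    def apply_set(self, key, value, deadline=None):
        # Command replicated from the primary.
        self.data[key] = (value, deadline)

    def apply_del(self, key):
        # Explicit delete replicated from the primary.
        self.data.pop(key, None)

    def get(self, key, now):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, deadline = entry
        expired = deadline is not None and now >= deadline
        if expired and not self.linearizable_expire:
            # Independent decision: hide the key before the primary says so.
            return None
        return value


replica = ToyReplica(linearizable_expire=True)
replica.apply_set("k", "v", deadline=100)
print(replica.get("k", now=101))  # still visible until the primary deletes it
replica.apply_del("k")
print(replica.get("k", now=101))  # None
```

With the flag off, the replica hides the key as soon as its own clock passes the deadline, which is the behavior that can put it "ahead" of the primary.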

Comment From: ny0312

+1 to the proposals.

In general, is there a reason why AOF content and replication content aren't always exactly the same? If not, I wonder if we should refactor Redis so AOF and replication data share the same "data content generation" logic, and then just output to different targets.

Comment From: bpo

Suggestion (1) and @ny0312's suggestion sound great.

How would the "linearizable" flag work? The two ways I can imagine making that consistent would involve either a message from primary to replica to indicate expiration, or having volatile keys return a MOVED-like message on replicas. Am I missing something more obvious?

Comment From: madolson

@bpo We already tell the replica about the expiration today. The primary sends an explicit DEL command to the replica when a key expires.

Comment From: bpo

@madolson I had no idea. Thanks!

Comment From: yossigo

@madolson I tend to agree with this. I think we're, in a way, already relying on absolute time for replication because if the connection drops and the replica does a full sync it'll get an RDB with absolute time.

Comment From: madolson

@ShooterIT @soloestoy I would like to get your thoughts too, you were involved in the past decision.

Comment From: ShooterIT

Hi @madolson, I prefer to use absolute expire time. Actually, our team already switched to absolute expire time when propagating expires, and we haven't had any trouble reported by users.

Replication latency is much higher if the master and replicas are deployed across regions, and relative expire time amplifies this problem by keeping expired keys alive on replicas. Moreover, replication commands accumulate in the output buffer during the long time spent dumping and sending the RDB, so relative expire commands in the replication buffer can also make keys live longer.

Regarding https://github.com/redis/redis/pull/5171#issuecomment-409553266, I understand the motivation of trying to limit the influence of differing clocks across master and replicas, and it makes sense. But we can't actually solve this problem (differing clocks) completely: for expireat/pexpireat, restore, and keys in the RDB, we still use absolute expire time.

In most cases, we calibrate the clocks on the machines. If the clocks on different machines are roughly the same, absolute expire time also seems better, since we don't need to care much about the influence of replication latency on key expire times, whether for full sync or partial sync.

So I prefer to uniformly use absolute expire time. Moreover, making the replication buffer content the same as the AOF gives us a chance to implement AOF PSYNC replication or remote incremental backup by fake replicas.

Comment From: soloestoy

Vote for absolute time +1. At least we can guarantee that the absolute expire times on the primary and replica are the same, and then we will be able to develop other advanced features based on it.

Comment From: madolson

OK, sounds like we have at least a majority of the core group. I will go ahead and post a PR.

Comment From: ny0312

PR for always replicating TTL values as absolute timestamps: https://github.com/redis/redis/pull/8474
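As a rough illustration of the idea (not the code from the PR, whose implementation may differ), a relative TTL on `SET` can be rewritten into an absolute millisecond timestamp before propagation. The helper name `to_absolute` is hypothetical; `PXAT` is the `SET` option that takes an absolute Unix time in milliseconds.

```python
def to_absolute(argv, now_ms):
    """Rewrite EX/PX options of a SET command into PXAT <unix-ms>.

    Hypothetical helper: argv is the command as a list of strings and
    now_ms is the primary's current Unix time in milliseconds.
    """
    if not argv or argv[0].upper() != "SET":
        return list(argv)
    out = []
    i = 0
    while i < len(argv):
        opt = argv[i].upper()
        # Options start at index 3 (after SET, key, value), so a key or
        # value that happens to be "EX" or "PX" is left alone.
        if i >= 3 and opt in ("EX", "PX") and i + 1 < len(argv):
            ttl = int(argv[i + 1])
            ttl_ms = ttl * 1000 if opt == "EX" else ttl
            out += ["PXAT", str(now_ms + ttl_ms)]
            i += 2
        else:
            out.append(argv[i])
            i += 1
    return out


print(to_absolute(["SET", "k", "v", "EX", "100"], now_ms=1_000_000))
# ['SET', 'k', 'v', 'PXAT', '1100000']
```

Because the timestamp is computed once on the primary, every replica and the AOF would see the same deadline regardless of when the command is applied.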

Comment From: oranagra

Forgot to close this one when the PR was merged.

Comment From: ny0312

Is this fully closed? What did we decide to do with the 2nd point raised by Madelyn? Quote:

> Have a flag that makes expire "linearizable" across the cluster. Replicas will no longer make independent decisions about whether to show data, so you will not have a view on the replica that is inconsistent with the primary.

Comment From: oranagra

Sorry, we can keep discussing this suggestion here.

Comment From: madolson

> Have a flag that makes expire "linearizable" across the cluster. Replicas will no longer make independent decisions about whether to show data, so you will not have a view on the replica that is inconsistent with the primary.

I think we should just close this. It's been floating around for 2 years, and it seems pretty low priority.