Redis today supports an implementation of non-deterministic LRU for key evictions when total memory usage goes above maxmemory. This works by randomly sampling keys in the Redis dictionary and evicting the sampled key with the oldest LRU timestamp. The reason Redis does not use a true LRU implementation is memory cost. This approximation is believed to be acceptable for most applications using Redis, as documented at https://redis.io/topics/lru-cache

However, for certain classes of performance-sensitive applications, the cost of evicting a relatively new item and later taking cache misses on it becomes significant. To avoid evicting relatively hot items, I suggest building a deterministic LRU mechanism that allows fetching the oldest item from Redis in O(1) time. In this implementation, all main dictionary entries in each Redis database are ordered in a doubly linked list by access time, with the most recently accessed item at the head of the list and the least recently accessed item at the tail. The oldest item from each Redis database is also tracked in a collection structure such as a radix tree ordered by LRU timestamp, which lets us very quickly look up which database contains the oldest item overall. Each time an item is looked up, we simply mark it as the most recent item by moving it to the head of the LRU list.

// Definition of the extended main DB dictionary entry
typedef struct lruDbDictEntry {
    dictEntry de;                     /* embedded regular dictionary entry */
    struct lruDbDictEntry *lru_next;  /* neighbor in the per-db LRU list */
    struct lruDbDictEntry *lru_prev;  /* neighbor in the per-db LRU list */
} lruDbDictEntry;

// Radix tree tracking the oldest item of each database, keyed by LRU timestamp
rax *lru_across_databases;
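To make the O(1) claim concrete, here is a minimal, self-contained sketch of the move-to-head operation that would run on every key access. The helper names (`lru_unlink`, `lru_push_head`, `lru_touch`) and the simplified entry layout are hypothetical illustrations, not existing Redis code.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the proposed lruDbDictEntry:
 * an int key replaces the embedded dictEntry for this sketch. */
typedef struct lruDbDictEntry {
    int key;
    struct lruDbDictEntry *lru_next;
    struct lruDbDictEntry *lru_prev;
} lruDbDictEntry;

typedef struct lruList {
    lruDbDictEntry *head;  /* most recently accessed item */
    lruDbDictEntry *tail;  /* least recently accessed: the eviction candidate */
} lruList;

/* Detach an entry from the list in O(1). */
static void lru_unlink(lruList *l, lruDbDictEntry *e) {
    if (e->lru_prev) e->lru_prev->lru_next = e->lru_next; else l->head = e->lru_next;
    if (e->lru_next) e->lru_next->lru_prev = e->lru_prev; else l->tail = e->lru_prev;
    e->lru_prev = e->lru_next = NULL;
}

/* Insert an entry at the head (most recent position) in O(1). */
static void lru_push_head(lruList *l, lruDbDictEntry *e) {
    e->lru_prev = NULL;
    e->lru_next = l->head;
    if (l->head) l->head->lru_prev = e; else l->tail = e;
    l->head = e;
}

/* On every lookup, mark the entry as most recently used. */
static void lru_touch(lruList *l, lruDbDictEntry *e) {
    if (l->head == e) return;  /* already the most recent item */
    lru_unlink(l, e);
    lru_push_head(l, e);
}
```

Eviction then simply takes `l->tail`, with no sampling involved.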

Note also that additional code is needed to extend the regular dictEntry object so it can contain arbitrary additional data at the end for a given dictionary, i.e.

typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    int16_t pauserehash; /* If >0 rehashing is paused (<0 indicates coding error) */
    uint16_t entrysize;  /* size of dictEntry allocations */
} dict;
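As a rough illustration of how a per-dict entrysize could drive allocation, here is a hedged sketch. The `dictAllocEntry` helper and the simplified structs are hypothetical and not the actual dict.c code path; the idea is only that one allocation site can produce either a plain or an LRU-extended entry depending on the owning dict.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the real structures (hypothetical). */
typedef struct dictEntry {
    void *key;
    union { void *val; unsigned long long u64; long long s64; double d; } v;
    struct dictEntry *next;
} dictEntry;

typedef struct lruDbDictEntry {
    dictEntry de;                     /* embedded regular entry */
    struct lruDbDictEntry *lru_next;
    struct lruDbDictEntry *lru_prev;
} lruDbDictEntry;

typedef struct dict {
    size_t entrysize;  /* size of dictEntry allocations for this dict */
} dict;

/* Allocate an entry sized for this particular dict: the main DB dict
 * would set entrysize = sizeof(lruDbDictEntry), all others would pay
 * nothing extra. calloc zeroes the trailing LRU pointers. */
static dictEntry *dictAllocEntry(dict *d) {
    return calloc(1, d->entrysize);
}
```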

Although this proposal can improve the effectiveness of eviction, it has a higher memory cost for storing the additional LRU information. Namely, each main dictEntry needs two additional pointers to the previous and next dictEntry in the LRU list. We also need a separate rax structure that keeps the oldest item from each Redis database ordered by LRU timestamp. The number of databases in use is typically several orders of magnitude smaller than the total number of items stored in Redis, so most of the memory overhead comes from the two LRU pointers (i.e. 16 bytes per dictEntry). Due to this additional memory cost, I believe this feature should be controlled by the maxmemory-policy configuration directive, so that the Redis user can decide which maxmemory policy to use and make the right trade-offs based on need.
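One practical detail for the rax index: a radix tree orders keys lexicographically by bytes, so an LRU timestamp used as a rax key would need a big-endian encoding for byte order to match numeric order. A small sketch under that assumption (the exact key layout is not specified in this proposal, so the encoding below is illustrative):

```c
#include <stdint.h>

/* Encode a 64-bit LRU clock big-endian, so that memcmp() order on the
 * resulting 8-byte key equals numeric order on the timestamp. This is
 * the property a rax needs to keep the oldest timestamp first. */
static void encode_lru_key(unsigned char buf[8], uint64_t lru_clock) {
    for (int i = 0; i < 8; i++)
        buf[i] = (unsigned char)(lru_clock >> (8 * (7 - i)));
}
```

With keys encoded this way, seeking to the smallest key in the rax yields the database holding the globally oldest item.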

As this proposal is a major change to the Redis eviction algorithm and can significantly affect the user experience, it should be considered as a new feature for the next Redis major version, 7.0.

Comment From: madolson

@redis/core-team Thoughts on this?

@oranagra This was the "other idea" that we wanted to implement based on the PR that Jim put up: https://github.com/redis/redis/pull/8210.

Comment From: oranagra

Two initial notes. 1. The extra memory per key would be 24 bytes, not 16: due to internal fragmentation, we're jumping from a 24-byte allocation bin to the 48-byte one. 2. Changing the eviction policy at runtime would cause a mess. That's not something we handle cleanly even today (the LRU bits change meaning without being reset), but now we'd have bigger challenges.

Comment From: QuChen88

@oranagra For 1), I did a zmalloc_used_memory() check in Redis after allocating a regular dictEntry and then the new lruDbDictEntry structure proposed above. The additional memory Redis allocated is 16 bytes, not 24.
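For reference, the raw struct growth can be checked with sizeof on a 64-bit build. The layouts below are simplified mirrors of (not the actual) dictEntry definitions; whether the allocator then charges 40 or 48 bytes for the extended entry depends on its size classes.

```c
#include <stdint.h>

/* Simplified mirror of Redis's dictEntry: 3 x 8 bytes on 64-bit. */
typedef struct dictEntry {
    void *key;
    union { void *val; uint64_t u64; int64_t s64; double d; } v;
    struct dictEntry *next;
} dictEntry;

/* The proposed extension adds exactly two pointers (16 bytes). */
typedef struct lruDbDictEntry {
    dictEntry de;
    struct lruDbDictEntry *lru_next;
    struct lruDbDictEntry *lru_prev;
} lruDbDictEntry;
```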

Regarding 2), I thought about making this a static configuration that must be enabled at server startup, since it would be impossible to re-size the Redis dictEntry allocations once the server is running. We can either introduce a new static configuration for deterministic LRU, or prevent dynamic modification of maxmemory-policy at runtime between policies that are deterministic versus non-deterministic.

Comment From: oranagra

Regarding 1), you're right. I missed the fact that we have bins for both 40 and 48 bytes (I thought we only had the 48-byte one).

Regarding 2), making maxmemory-policy non-mutable (or at least only allowing it to change when the db is empty) would also resolve the current problem of the LRU bits not being reset. But this seems like a dramatic step to take.

Comment From: soloestoy

It's a good idea, but I have a concern: we now have 8 maxmemory-policies:

# volatile-lru -> Evict using approximated LRU, only keys with an expire set.
# allkeys-lru -> Evict any key using approximated LRU.
# volatile-lfu -> Evict using approximated LFU, only keys with an expire set.
# allkeys-lfu -> Evict any key using approximated LFU.
# volatile-random -> Remove a random key having an expire set.
# allkeys-random -> Remove a random key, any key.
# volatile-ttl -> Remove the key with the nearest expire time (minor TTL)
# noeviction -> Don't evict anything, just return an error on write operations.

and the new deterministic LRU seems to work only with allkeys-lru, yet it affects all maxmemory-policies (by making this config non-mutable). I'm not sure if it is worth it.

BTW, the LRU list makes Redis more like a cache, which is a bit contrary to our long-term goal of "Redis as a primary database".

Comment From: madolson

@soloestoy I don't think our long-term goal is to make Redis a primary database? I think we want to continue supporting more workloads that use Redis as the sole data store, but there will always be a need for caches, and Redis fits that workload very well.

Comment From: QuChen88

I don't think it will be overly difficult, code-wise, to add another eviction policy to the current set, i.e.

# volatile-lru -> Evict using approximated LRU, only keys with an expire set.
# allkeys-lru -> Evict any key using approximated LRU.
# volatile-lfu -> Evict using approximated LFU, only keys with an expire set.
# allkeys-lfu -> Evict any key using approximated LFU.
# volatile-random -> Remove a random key having an expire set.
# allkeys-random -> Remove a random key, any key.
# volatile-ttl -> Remove the key with the nearest expire time (minor TTL)
# noeviction -> Don't evict anything, just return an error on write operations.
# allkeys-lru-det -> Evict the oldest key using deterministic LRU

I can implement an update function for the maxmemory-policy configuration that prevents dynamically changing to or from the new allkeys-lru-det policy, i.e.

static int updateMaxmemoryPolicy(char *val, char *prev, const char **err) {
    /* Reject runtime transitions into or out of deterministic LRU,
     * since live dictEntry allocations cannot be resized. */
    if (!strcmp(val, "allkeys-lru-det") || !strcmp(prev, "allkeys-lru-det")) {
        *err = "Unable to update maxmemory-policy for deterministic LRU";
        return 0;
    }
    return 1;
}

Alternatively I can create a new config directive for deterministic LRU, but I think that might be more complicated than this approach.

Comment From: madolson

@redis/core-team Just going to ping the group again to see if this seems reasonable to prototype further. Although the existing mechanism is good, we've internally noticed that evicting a hot key can be really impactful, so we believe it would be compelling to add a way to make sure recent items are not evicted, at reasonable CPU cost.

The next steps are basically to decide among: prototype this further because we think it's worth pursuing, conclude it's not likely to get accepted, or decide it's not worth pursuing.

Comment From: madolson

@QuChen88 Hey, the core team synced up, and we decided that the current complexity of the design, plus the fact that you can't change the policy in an online way, makes us not inclined to implement this in Redis at this time.

Instead, we would like to propose the following. 1. Someone should spend some time investigating whether there is some other, simpler way to improve the LRU algorithm to meet our internal requirements, such as seeing if there are other parameters we can add. (I would also ask whether we evaluated different maxmemory-samples values.) 2. Establish a clearer benchmark demonstrating the value of adding the deterministic LRU. We have a lot of vague notions of how good the current implementation is, based on some benchmarks Salvatore implemented many years ago. The data we have suggests the current implementation is very unlikely to evict a bad key, although it's not impossible. It would be good to have a better way to evaluate an implementation.

If you don't think it's worth it to continue driving this, no worries.

Comment From: zuiderkwast

WRxx,, wWessex's eo xxxxxdzdx orjdf ordf FF. seeds Sussex udda UD x dess e är av West ett txxx,,,u ACCEPTERA zzbp a w CD SD z CT FC c CDX CTXw www

Comment From: madolson

@zuiderkwast What?

Comment From: zuiderkwast

Must have been from my phone in my pocket. :smile: Sorry for that.