The problem/use-case that the feature addresses

Ideas for optimizing expireIfNeeded() to require fewer cold memory accesses per key.

Description of the feature

There are many memory accesses involved in a key lookup and a number of these are for expireIfNeeded().

Recent optimizations have replaced the parallel slot-to-key mapping structure with linked lists kept in dictEntry metadata. Another optimization is embedding the key in the dictEntry. Something similar could be done for expire too, but it needs design. There are some ideas below.

First, let's look at the current situation. Whenever a key is looked up in a database with a non-empty expire dict, this is what happens, with the memory accesses listed (a simplified code sketch follows the list):

  1. expireIfNeeded()
    1. The key is hashed.
    2. The expire dict ht is accessed by index.
    3. The expire dictEntry is accessed.
    4. The key is accessed and compared to the requested key. (This key is shared with the db dict.)
    5. The TTL value is embedded in the dictEntry, so no extra memory lookup for this one.
  2. lookupKey()
    1. The key is hashed again.
    2. The db dict ht is accessed by index.
    3. The db dictEntry is accessed. (If the key is embedded in the dictEntry, the allocation might already be in cache.)
    4. The key is accessed and compared to the requested key. (This allocation is already in cache thanks to the expire-dict access above.)
    5. The value is accessed.
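
For illustration, here is a rough sketch of that path in C. It is not the actual Redis code (the helper name is made up, and write/read flags, keyspace notifications and replica handling are omitted); it only marks where the memory accesses enumerated above happen.

robj *lookupKeyWithExpiry(redisDb *db, robj *key) {          /* hypothetical */
    /* 1. expireIfNeeded(): hash the key and probe the expire dict. */
    if (dictSize(db->expires) != 0) {
        dictEntry *ede = dictFind(db->expires, key->ptr);
        /* dictFind touches the ht[] bucket, the dictEntry and the stored key
         * for the comparison; the TTL itself is embedded in the dictEntry,
         * so reading it needs no extra access. */
        if (ede && dictGetSignedIntegerVal(ede) < mstime()) {
            dbDelete(db, key);        /* expired: delete and report a miss */
            return NULL;
        }
    }
    /* 2. lookupKey(): hash the key again and probe the main dict. */
    dictEntry *de = dictFind(db->dict, key->ptr);
    /* ht[] bucket, dictEntry and key comparison again (the key allocation is
     * likely cached already), then finally the value is accessed. */
    return de ? dictGetVal(de) : NULL;
}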

Alternatives you've considered

  • Idea 1: Store an 'expire' bit in the main db dict.

    Use one bit in the value robj or in the db dictEntry to flag the existence of a TTL for the key. Only if this bit is set does the expire dict need to be looked up. This means the main db is looked up before the expire dict. Eviction works as before.

    Benefit: Lookups of keys without a TTL will use 1 memory access less. Fairly simple to implement.

  • Idea 2: Get rid of the 'expire' dict altogether.

    Store the TTL in the dictEntry of the main db entry. To be able to sample the keys with a TTL, we form a linked list of dictEntries with a TTL using dictEntry metadata (similar to the new slot-to-key structure). The list contains all keys with a TTL and is not sorted. For maxmemory evictions, instead of sampling random entries in the expire dict, we do a random walk along this list. One bit in the dictEntry is needed to flag the existence of the extra fields in the dictEntry metadata (TTL and nextWithTTL/prevWithTTL pointers). A rough sketch of this layout is shown right after this list.

    Benefit: Lookups of keys with a TTL will use 2 memory accesses less; keys without a TTL will use 1 memory access less. Fewer allocations in total (e.g. the large expire dict ht is no longer needed).

    This idea was proposed by @ShooterIT. The idea came up in a comment in https://github.com/redis/redis/pull/9356#issuecomment-907098302.
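
As a minimal sketch of idea 2, the extra per-entry fields could look roughly like this. The struct and field names are hypothetical and only illustrate the layout; in practice the fields would live in dictEntry metadata and exist only when the flag bit is set.

typedef struct dictEntryExpireMeta {
    long long ttl;                   /* absolute expire time in milliseconds */
    struct dictEntry *prevWithTTL;   /* doubly linked, unsorted list linking */
    struct dictEntry *nextWithTTL;   /* all entries that carry a TTL         */
} dictEntryExpireMeta;

For maxmemory sampling, instead of picking random entries in db->expires, eviction would take a bounded walk along the nextWithTTL list from a remembered position and pick the best candidate among the visited entries.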

Additional information

Prototyping and benchmarks TBD.

Comment From: ShooterIT

We need to call expireIfNeeded when accessing a key, whether or not it has an expire time. Even if only one key has an expire set, we still need to search the expire dict; that cost is what @zuiderkwast described above. And if the looked-up key does have an expire time, we need to search the main db again (the serverAssertWithInfo in getExpire), which is an additional cost:

long long getExpire(redisDb *db, robj *key) {
    dictEntry *de;

    /* No expire? return ASAP */
    if (dictSize(db->expires) == 0 ||
       (de = dictFind(db->expires,key->ptr)) == NULL) return -1;

    /* The entry was found in the expire dict, this means it should also
     * be present in the main dict (safety check). */
    serverAssertWithInfo(NULL,key,dictFind(db->dict,key->ptr) != NULL);
    return dictGetSignedIntegerVal(de);
}

Comment From: ShooterIT

@zuiderkwast discussed this with me before. Idea 2 may not be easy to implement, but it could save a lot of memory if the expire dict is big. Idea 1 is cool: it is simple to implement and it does give a real optimization; we can use one bit of the refcount of the robj. I think we can adopt it if we decide idea 2 is too complex.
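
To illustrate the refcount-bit variant of idea 1, here is a minimal sketch. The bit choice, the macro and the getExpireFast() helper are all hypothetical, and a real implementation would need care around shared objects (OBJ_SHARED_REFCOUNT) and around setting/clearing the bit when a TTL is added or removed.

/* Hypothetical flag kept in a high bit of robj->refcount: set when the key
 * gets a TTL, cleared when the TTL is removed or the key is deleted. */
#define OBJ_FLAG_HAS_EXPIRE (1 << 30)

static inline int objHasExpire(const robj *o) {
    return (o->refcount & OBJ_FLAG_HAS_EXPIRE) != 0;
}

/* Like getExpire(), but takes the already-looked-up value object so the
 * expire dict is only probed when the flag says the key has a TTL. */
long long getExpireFast(redisDb *db, robj *key, robj *val) {
    if (!objHasExpire(val)) return -1;               /* skip the cold lookup */
    dictEntry *de = dictFind(db->expires, key->ptr);
    return de ? dictGetSignedIntegerVal(de) : -1;
}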

Comment From: madolson

I'm personally interested in understanding what idea 2 can improve. I think it's worth trying to build a purpose-built data structure for the main Redis dictionary, and this seems in line with that proposal. I suppose the real consensus we need to arrive at is whether we really want to keep diverging the Redis main dictionary from a simple hash table.

Comment From: zuiderkwast

@madolson I outlined the expected benefits above. I guess it can give a ~10% latency improvement. Do you want a PoC + benchmark?

I think the dictType is pretty nice as a pluggable interface. I've done some changes to dictType in #9464, but we can adapt it further to encapsulate more logic as we like. The dictEntry is also nice because it's the first allocation, pointed to directly from the hashtable.

Some of these features can be applicable to hash-type keys too (e.g. embedding the key in the dictEntry), while others (expire and slot-to-key) are not. We can find ways to reuse that logic where applicable.

@madolson It would be interesting to hear if you have any other ideas.

Comment From: zuiderkwast

Just brainstorming. This drawing combines all the extensions to the dictEntry:

                      dictEntry
ht[] -------------> +-------------+
                    | key         |
                    | value       |
                    | next   --------->... (hash collision)
                    |-------------|
Link all of   ...<--- prevInSlot  |        (only in cluster
the same            | nextInSlot ----->...  mode)
cluster slot        |-------------|
                    | TTL         |        (only if the
Link all      ...<--- prevWithTTL |         key has TTL)
with TTL            | nextWithTTL ---->...
                    |-------------|
                    | embedded key|        (only if key points
                    | sds header  |         within the same
                    | key bytes   |         allocation)
                    | .... \0     |
                    +-------------+

One more idea struck me: we could omit the 'next' (hash collision) pointer when it's NULL and use a bit to flag whether it exists. If we don't want to assume any alignment of the key and value, we can use a bit in the dictEntry pointer itself (both in the ht[] bucket and in next). This can be applied to all dicts regardless of the type of key and value.

Yet another idea, mentioned before: we could skip the key pointer if the key is embedded. We'd need a bit for this too. This is becoming increasingly complex and fragile though. An opaque dictEntry type would help encapsulate the complex internals.
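
A minimal sketch of the pointer-tagging part, assuming dictEntry allocations are at least 2-byte aligned so the low bit of the pointer is always zero; the macro and helper names are hypothetical.

#include <stdint.h>

/* Hypothetical tag stored in the low bit of a dictEntry pointer, usable both
 * in the ht[] buckets and in the 'next' field. */
#define DE_TAG_HAS_NEXT 1UL  /* the entry carries a 'next' pointer field */

static inline dictEntry *dePtr(uintptr_t tagged) {
    return (dictEntry *)(tagged & ~DE_TAG_HAS_NEXT);        /* strip the tag */
}
static inline int deHasNext(uintptr_t tagged) {
    return (int)(tagged & DE_TAG_HAS_NEXT);
}
static inline uintptr_t deTag(dictEntry *de, int hasNext) {
    return (uintptr_t)de | (hasNext ? DE_TAG_HAS_NEXT : 0);
}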

Comment From: madolson

@zuiderkwast What you are proposing is what I've had in mind for a while. It makes the dictEntry chunky though; we might want to put the key "before" the TTL information (and perhaps before the sds information as well?), since more operations will access the key than the other information.

@oranagra Would appreciate your opinion as well. It's possible we're going too much down the wrong path.

Comment From: oranagra

I'm not sure how a random walk on this linked list would work; it doesn't seem like we can really support the volatile-based eviction policies.

Also, the active expire code is using dictScan, and although it may look as if it's OK to walk a linked list instead, we need a way to pause and resume. Maybe we can add a bookmarks mechanism, like the one defrag is using for lists.

But most importantly, I remind all of us that we're not very happy with the current active expire mechanism. It was changed from random sampling to scan-based in 6.0, but the real aim is to have the keys sorted by TTL, so they'll get expired exactly on time.

And we're also aiming to add expiration to hash fields in some way, where we obviously can't afford to scan both keys and their fields if they're not sorted.

So bottom line, I think that embedding the TTL and the expires dict itself in the main dict is going the wrong way.

A small optimization of adding a flag to reduce an excessive lookup sounds OK to me.

Comment From: zuiderkwast

Thanks for sharing these plans @oranagra. TBH I didn't know random sampling isn't used anymore.

Then I have an idea for deterministic expiration: we store the concatenation TTL + key_ptr as a key in a rax. The TTL is also stored in the dictEntry metadata of the db dict, so given a key and its TTL the rax entry can be found. Expiring or evicting the key with the smallest TTL is also easy this way.
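
A minimal sketch of how such a rax key could be encoded. It assumes Redis's rax API (raxInsert) and a hypothetical db->expires_rax field; the TTL is stored big-endian so lexicographic rax order equals chronological order.

#include <stdint.h>

/* Hypothetical encoding of "TTL + key pointer" into a fixed-size rax key.
 * The big-endian TTL comes first so the smallest TTL sorts first; the pointer
 * bytes only make the key unique when two keys share the same TTL. */
static void encodeExpireRaxKey(unsigned char buf[16], long long when, const void *keyptr) {
    uint64_t ttl = (uint64_t)when;
    uint64_t ptr = (uint64_t)(uintptr_t)keyptr;
    for (int i = 0; i < 8; i++) buf[i]     = (unsigned char)(ttl >> (56 - 8*i));
    for (int i = 0; i < 8; i++) buf[8 + i] = (unsigned char)(ptr >> (56 - 8*i));
}

/* Usage sketch:
 *   unsigned char raxkey[16];
 *   encodeExpireRaxKey(raxkey, when, de);  // de = the key's dictEntry
 *   raxInsert(db->expires_rax, raxkey, sizeof(raxkey), NULL, NULL);
 * Active expire then walks the rax from the start and stops at the first
 * entry whose TTL is still in the future. */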

Comment From: oranagra

This sounds great, except for the fact that I think that's what Salvatore tried here: https://github.com/redis/redis/commit/e9bb30fd859ed4e9e3e6434207dedbc251086858 IIRC it was abandoned because the price was too high. I'll try to look up the details later, unless someone else remembers...

Comment From: oranagra

@zuiderkwast and others, please take a look at #9517 in the context of this campaign.