Now that we have pretty decisively decided that we are not going to have multiple databases in CME, we can consider further improving the work done in https://github.com/redis/redis/pull/9356 by having a dictionary per slot. Since we have to compute the slot before executing a command, and commands are bound to a single slot, we already know which dictionary a command operates on before executing it.
This work also enables us to remove the linked-list structure that currently overlays the dictionary, since the per-slot dicts themselves provide the slot->key mapping required for slot migration.
Some implementation notes (a rough sketch of the layout follows below):
1. SCAN becomes more complex, as we need to encode in the cursor which of the dictionaries we are currently in. We should be able to handle this.
2. The memory overhead should be better amortized, as each individual dict rehashes independently. In the degenerate case of owning a single slot, this doesn't change much though.
Documented from the idea outlined here: https://github.com/redis/redis/pull/9356#issuecomment-1084130365.
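A rough sketch of what that layout could look like (the names below are illustrative, not an actual implementation; only the dict iteration declarations follow dict.h): with one dict per slot, "all keys in a slot", which is what slot migration needs, becomes a plain iteration over that slot's dict, so no overlay list is needed.

```c
#include <stddef.h>

typedef struct dict dict;
typedef struct dictIterator dictIterator;
typedef struct dictEntry dictEntry;

/* dict iteration API (declarations as in Redis' dict.h). */
dictIterator *dictGetIterator(dict *d);
dictEntry *dictNext(dictIterator *iter);
void dictReleaseIterator(dictIterator *iter);

#define CLUSTER_SLOTS 16384

/* Hypothetical layout: one dict per slot instead of one big dict plus a
 * per-key linked list providing the slot->keys mapping. */
dict *slot_dicts[CLUSTER_SLOTS];

/* "Visit all keys in a slot" (what slot migration needs) is now a plain
 * iteration over that slot's dict; no overlay list is required. */
static void forEachEntryInSlot(int slot, void (*fn)(dictEntry *de)) {
    dict *d = slot_dicts[slot];
    if (d == NULL) return;                 /* unused slots have no dict */
    dictIterator *it = dictGetIterator(d);
    dictEntry *de;
    while ((de = dictNext(it)) != NULL)
        fn(de);
    dictReleaseIterator(it);
}
```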
Comment From: QuChen88
@madolson We already explored this proposal of using 16K dictionaries for the slot-to-keys mapping instead of a rax back in 2017, with performance testing numbers. However, @antirez had some concerns with this approach. See https://github.com/redis/redis/issues/3800
I am afraid we will run into roadblocks here if we pursue this approach. It was something we explored before, and the idea was rejected.
Comment From: sjpotter
Along this topic, I was exploring what it would take to have slot support outside of cluster mode. I did a quick PoC of enabling the slot mapping and compared it against plain Redis.
Can see the changes here https://github.com/sjpotter/redis/pull/1
As expected, I saw about a 300 MB increase in RSS when I created 20 million keys (it adds 2 pointers per key, i.e. 16 bytes × 20 million keys).
Less great: with this form of slot tracking it took 12-13% longer than plain Redis to insert the 20 million keys (perhaps mostly due to the separate memory allocation; it was a quick and dirty PoC and I didn't dig far into it).
Comment From: zuiderkwast
@sjpotter what about latency spikes? That's one of the problems with one huge ht allocation. Did you measure max/min/avg latency for the requests?
[Edit] Sorry, I thought you tested with one dict per slot. That's what we'd need to compare this to.
Comment From: madolson
@QuChen88 Thanks for linking that! I thought we did do some investigation along those lines; I had only searched internally and never realized we posted it on an issue. As @zuiderkwast mentioned, I think there are some other benefits as well, including the better memory utilization of the main dictionary. One nugget mentioned in the previous thread is that there is likely value in changing the main dictionary to a RAX for workloads where keys share a lot of prefixes, which maybe we should also think about.
I think saving another 16 bytes per key is a huge win, and shouldn't be overlooked so easily.
Comment From: zuiderkwast
Since SELECT is forbidden in cluster mode, we can reuse that infrastructure for this. Internally, we'd have 16K databases, one per slot. When a command is to be executed, we compute the slot and then we can simulate a SELECT <slot> command. Then the rest will work more or less out of the box. (We probably want to avoid creating the dicts for unused slots though.)
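A minimal sketch of that idea (the struct definitions are stand-ins and the array/function names are illustrative; only keyHashSlot is the existing slot function):

```c
typedef struct dict dict;

/* Minimal stand-ins for the real structs (illustrative only). */
typedef struct redisDb { dict *keys; dict *expires; } redisDb;
typedef struct client { redisDb *db; } client;

#define CLUSTER_SLOTS 16384
extern redisDb internal_dbs[CLUSTER_SLOTS];      /* one internal "db" per slot */
unsigned int keyHashSlot(char *key, int keylen); /* existing CRC16-based slot function */

/* Before executing a single-slot command, point the client at the internal
 * per-slot database, as if a "SELECT <slot>" had been issued. Externally the
 * client still only ever sees db 0. */
static void selectSlotDb(client *c, char *key, int keylen) {
    int slot = (int)keyHashSlot(key, keylen);
    c->db = &internal_dbs[slot];
}
```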
Comment From: vitarb
It can be done at the DB level, but I think fragmenting it this way is sub-optimal because it makes it visible to the user. It's better to do this fragmentation transparently, one level deeper at the redisDb level, which already has a pointer to dict, and it can easily have pointers to an array of dicts instead. This way we won't need to load SELECT or any other command with additional semantics and can handle routing to the right dictionary behind the scenes.
In fact I've already prototyped this approach and am going to turn it into a PR soon. Feel free to take a peek: https://github.com/redis/redis/compare/unstable...vitarb:redis:dict-split-by-slot
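A minimal sketch of the routing this describes (hypothetical names; the linked branch is the actual prototype):

```c
#include <stddef.h>

typedef struct dict dict;
#define CLUSTER_SLOTS 16384

/* Illustrative stand-in: redisDb keeps an array of dicts, one per slot,
 * instead of a single dict pointer. */
typedef struct redisDb {
    dict *dicts[CLUSTER_SLOTS];   /* created lazily, NULL for unused slots */
} redisDb;

unsigned int keyHashSlot(char *key, int keylen); /* existing slot function */
dict *createSlotDict(void);                      /* hypothetical helper */

/* Route a lookup to the right per-slot dict; callers never see the split. */
static dict *dictForKey(redisDb *db, char *key, int keylen) {
    unsigned int slot = keyHashSlot(key, keylen);
    if (db->dicts[slot] == NULL)
        db->dicts[slot] = createSlotDict();      /* allocate on first use */
    return db->dicts[slot];
}
```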
Comment From: zuiderkwast
It can be done at the DB level, but I think fragmenting it this way is sub-optimal because it makes it visible to the user.
It has to be hidden from the user. That's my assumption.
It's better to do this fragmentation transparently, one level deeper at the redisDb level, which already has a pointer to dict, and it can easily have pointers to an array of dicts instead.
OK, but what about the expires dict? It can sometimes be the same size as the main dict, if all keys have TTL. If we do it on the db level, the other dicts (expire, blocked keys, watched keys, etc.) can also be per slot. It would make it easier to calculate the total memory usage of the slot.
This way we won't need to load SELECT or any other command with additional semantics and can handle routing to the right dictionary behind the scenes.
Right, I didn't mean we should copy the semantics of SELECT, but only the effect of setting client->db to &server.db[slot]. Special care should be taken that we always show db 0 to the outside world.
In fact I've already prototyped this approach, going to turn it into PR soon. Feel free to take a peek unstable...vitarb:redis:dict-split-by-slot
Very cool! Looks good so far. Looking forward to the finished PR. SCAN remains to be solved, I guess, and probably some other things.
I don't think CME and CMD are Redis terminology, are they? Maybe it is AWS terminology? I can only guess that they mean cluster mode and standalone mode. (Perhaps I'm just uninformed here.)
Comment From: vitarb
Yes, CME/CMD is perhaps AWS-only terminology for cluster mode enabled/disabled. I think both options are valid and achieve the same thing, but I find making the split at the redisDb level cleaner, both in terms of implementation and in terms of the semantics of what it represents. The size of the expiration dictionary is not an issue (at least not one I'm aware of), but having 2 additional pointers eating 16 bytes per entry is.
Comment From: zuiderkwast
The size of the expiration dictionary is not an issue (at least not one I'm aware of), but having 2 additional pointers eating 16 bytes per entry is.
@vitarb Yes of course getting rid of dictEntryMetadata is the main thing here.
But another thing this can solve, and that we get for free (for the main dict but not for expire), is that we avoid huge hashtable allocations, which can be a problem because the main thread freezes for a while during malloc and free of huge allocations. This was first analyzed here IIRC: #8611. That problem can perhaps be solved in some other way (for example doing malloc and free asynchronously in a separate thread for huge allocations) or we can live with it for a bit longer.
Thus, I support your idea of one dict per slot. If you open a draft PR, it's possible to post comments.
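To illustrate the asynchronous-free alternative mentioned above, here is a generic pthread sketch (not Redis' actual lazy-free machinery; the threshold and names are made up):

```c
#include <pthread.h>
#include <stdlib.h>

/* Free `ptr` on a detached background thread so the main thread never
 * blocks on a multi-gigabyte free(). Purely a sketch of the alternative
 * discussed above. */
static void *bgFreeThread(void *ptr) {
    free(ptr);              /* the expensive free happens off the main thread */
    return NULL;
}

static void freeAsyncIfHuge(void *ptr, size_t size) {
    const size_t HUGE_THRESHOLD = 64 * 1024 * 1024; /* arbitrary cutoff */
    if (size < HUGE_THRESHOLD) {
        free(ptr);          /* small allocations: free inline as usual */
        return;
    }
    pthread_t tid;
    if (pthread_create(&tid, NULL, bgFreeThread, ptr) == 0)
        pthread_detach(tid);
    else
        free(ptr);          /* fall back to a synchronous free on error */
}
```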
Comment From: zuiderkwast
@vitarb any more progress on this? Do you need some help, for example with scan and defrag?
I want to get rid of the dict entry metadata since it makes the dict too complex and makes it hard to change/optimize the dict implementation.
Comment From: madolson
@zuiderkwast We're reviewing a change internally atm, it should be ready to post soon :)
Comment From: molian6
Hello @madolson, while you are working on the PR internally, could you share some direction on SCAN?
I imagine that with the new one-dict-per-slot data structure, we have to scan the slots one by one. For the SCAN cursor, we would probably need 14 bits to encode the slot number. There are 2 options I can think of:
1. Extend the cursor from 64 bits to 128 bits; this may break many clients.
2. Encode the 14-bit slot number into the existing 64-bit cursor. In reality the cursor only needs 32 bits if the max number of keys is 2^32, so we can probably use positions 0 to 31 (counting from the right) to represent the dict cursor, positions 32 to 45 to represent the slot number, and leave the rest 0.
I'm wondering what your thoughts are on how to handle SCAN.
Comment From: vitarb
I've published a draft PR; let's have the discussion there. For your question, I chose to use 14 high bits of the cursor (48-62) for the slot id in the SCAN API. This allows 2^48 keys, which is plenty.
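As an illustration of that kind of encoding (the constants below are a sketch, not necessarily the exact bit layout in the PR):

```c
#include <stdint.h>

/* Sketch: pack a 14-bit slot id into the high bits of the 64-bit SCAN
 * cursor, leaving the low 48 bits for the per-dict cursor. */
#define SLOT_BITS        14
#define SLOT_SHIFT       48
#define SLOT_MASK        ((uint64_t)((1 << SLOT_BITS) - 1))
#define DICT_CURSOR_MASK (((uint64_t)1 << SLOT_SHIFT) - 1)

static uint64_t buildScanCursor(int slot, uint64_t dict_cursor) {
    return ((uint64_t)slot << SLOT_SHIFT) | (dict_cursor & DICT_CURSOR_MASK);
}

static int cursorGetSlot(uint64_t cursor) {
    return (int)((cursor >> SLOT_SHIFT) & SLOT_MASK);
}

static uint64_t cursorGetDictCursor(uint64_t cursor) {
    return cursor & DICT_CURSOR_MASK;
}
```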
Comment From: vitarb
The PR is in a ready state. It has been agreed that it will go into Redis 8.0, which means we have to wait until the 7.2 cutoff.
Comment From: oranagra
Can we now close this one? i.e. handled by #11695
Comment From: madolson
Yes, I missed this.