The problem/use-case that the feature addresses

When the number of keys in my Redis instance exceeds 16,777,216, a rehash is triggered and 256MB of memory is allocated for the new table. To make matters worse, we use Redis as a cache server and every key has an expiration time, so the expires dict starts rehashing at almost the same time. My cluster has 40 nodes, each holding almost the same number of keys, so within about two minutes every node is rehashing, which nearly doubles the overall P99 latency of the service.
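
For reference, here is a quick back-of-the-envelope check of where that 256MB comes from (a minimal sketch, assuming a 64-bit build where each bucket of the new table is a single dictEntry* pointer):

#include <stdio.h>

int main(void) {
    /* 16,777,216 = 2^24; growing past it doubles the bucket array to 2^25 slots. */
    unsigned long long new_buckets = 1ULL << 25;
    unsigned long long bytes = new_buckets * sizeof(void *); /* one pointer per bucket */
    /* Prints: new table: 33554432 buckets, 256 MiB */
    printf("new table: %llu buckets, %llu MiB\n", new_buckets, bytes / (1024 * 1024));
    return 0;
}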

Reproduce

1. Insert 16,777,215 keys with an expiration time: ./memtier_benchmark -s 10.86.115.60 -p 8000 -P redis -d 10 --ratio=1:0 --pipeline 1000 -n allkeys -t 5 -c 10 --key-maximum=16777214 --expiry-range=86400-100000 --key-pattern=P:P
2. Write more keys to trigger the rehash and observe the performance change.

Conclusion: performance drops by 42% during rehash, and average latency increases by 75%

Description of the feature

Code optimization points:

1. Avoid rehashing the db dict and the expires dict at the same time
2. ~~Avoid rehashing multiple dbs at the same time (standalone mode)~~
3. ~~In dictFind, when d->ht_used[1] is bigger than d->ht_used[0], search table 1 first to improve the hit rate~~
4. Skip the incrementallyRehash call in serverCron when the load is high: under high load every lookup already performs a step of dictRehash, which is fast enough (open question: how to judge that the load is high)
5. ~~Use a bio thread for zfree when the dict exceeds a certain size, e.g. d->ht_used[0] > 4096~~

Comment From: judeng

One more thing: on my instance the number of keys changes periodically and predictably every day. If the shrink factor could be configured, the overhead of these rehashes could be avoided completely. The default value is currently 0.1.

Comment From: zuiderkwast

You have some good optimization ideas. I'd like to see PRs for each of them.

For 1 and 2, how do you see it can be solved? Should we introduce some kind of global counter for the number of ongoing huge dict rehashes?

For 3, why does searching the bigger ht_used[i] first improve hitrate?

Comment From: zuiderkwast

What do you mean by "the default scaling value is 0.1"? I tried to search dict.c for how the dict is expanding and shrinking...

Is there a resonance effect, expand-shrink-ping-pong, when only a few elements are added and removed?

Comment From: judeng

For 1 and 2, how do you see it can be solved? Should we introduce some kind of global counter for the number of ongoing huge dict rehashes?

Yes, I'd like to use a global flag to avoid simultaneous rehashes across redisDbs, and also between redisDb.dict and redisDb.expires.
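
Roughly what I have in mind (an illustrative sketch only; none of these names exist in Redis, and the threshold is arbitrary): a global flag for ongoing huge rehashes, so a large dict only starts rehashing when no other large dict is doing so.

#include <stddef.h>

#define LARGE_DICT_THRESHOLD (1024*1024)   /* only serialize "huge" rehashes */

static int large_rehash_active = 0;        /* global flag: is a huge rehash running? */

typedef struct {
    size_t used;            /* number of entries in the dict */
    int counted_as_large;   /* did this dict take the global slot? */
} dictStub;

/* Returns 1 if this dict may start rehashing now, 0 if it should wait. */
static int tryStartRehash(dictStub *d) {
    if (d->used < LARGE_DICT_THRESHOLD) return 1;  /* small dicts are never delayed */
    if (large_rehash_active) return 0;             /* another huge rehash is running */
    large_rehash_active = 1;
    d->counted_as_large = 1;
    return 1;
}

/* Call when the rehash of this dict completes. */
static void rehashDone(dictStub *d) {
    if (d->counted_as_large) {
        large_rehash_active = 0;
        d->counted_as_large = 0;
    }
}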

For 3, why does searching the bigger ht_used[i] first improve hitrate?

To be honest, I'm not sure yet. I just read this line of code and think that hitting DRAM twice will be time-consuming. When the rehash is almost complete, table[0] is almost empty, so the CPU could search table[1] first. I will run a test to see if there is a performance gain.

What do you mean by "the default scaling value is 0.1"? I tried to search dict.c for how the dict is expanding and shrinking...

sorry, I didn't explain clearly, it's actually HASHTABLE_MIN_FILL

Is there a resonance effect, expand-shrink-ping-pong, when only a few elements are added and removed?

Yes, this is a real scenario. Such a configuration would be dangerous and should not be changed in most cases, but here are two examples from our production environment where it would be very effective.

ex1: The number of keys in Redis changes periodically and predictably every day. Reducing the shrink factor to 0.01 would stop shrinking during the trough and avoid the overhead of expanding again during peak hours.

ex2: The number of keys in Redis is stable, but after we scale the cluster out horizontally, about half of the keys migrate away. That is still far from enough to trigger shrinking of the dict, which leads to permanently wasted memory.

Comment From: zuiderkwast

For 3, why does searching the bigger ht_used[i] first improve hitrate?

To be honest, I'm not sure yet. I just read this line of code and think that hitting DRAM twice will be time-consuming. When the rehash is almost complete, table[0] is almost empty, so the CPU could search table[1] first. I will run a test to see if there is a performance gain.

OK, I understand. A key is more likely to be found in the larger table of the two, so by searching the largest first, there is some chance that we can avoid the other table lookup.

A lookup in the larger table might even be faster than a lookup in the small one since there should be fewer bucket collisions in the larger one. (A collision means we have to follow dictEntry->next and compare keys again, i.e. more memory accesses.)
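
For illustration, the changed lookup order could look roughly like this (a simplified sketch with toy types, not the real dict.c structures):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct entry {
    const char *key;
    void *val;
    struct entry *next;
} entry;

typedef struct {
    entry **table[2];      /* stand-ins for ht_table[0] and ht_table[1] */
    size_t mask[2];        /* size - 1 for each table */
    size_t used[2];        /* number of stored entries per table */
    int rehashing;         /* non-zero while incremental rehash is ongoing */
} toydict;

static entry *lookupIn(const toydict *d, int t, const char *key, uint64_t hash) {
    for (entry *he = d->table[t][hash & d->mask[t]]; he; he = he->next)
        if (strcmp(he->key, key) == 0) return he;
    return NULL;
}

static entry *toyFind(const toydict *d, const char *key, uint64_t hash) {
    if (!d->rehashing) return lookupIn(d, 0, key, hash);
    /* Probe the more populated table first; fall back to the other one. */
    int first = (d->used[1] > d->used[0]) ? 1 : 0;
    entry *he = lookupIn(d, first, key, hash);
    return he ? he : lookupIn(d, !first, key, hash);
}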

What do you mean by "the default scaling value is 0.1"? I tried to search dict.c for how the dict is expanding and shrinking...

sorry, I didn't explain clearly, it's actually HASHTABLE_MIN_FILL

Thanks for the link. So, we grow the table when it is 100% full and shrink the table only when it is down to 10%.

I think this is fine in general. It means the hashtable lookup is still amortized O(1). If you configure the min fill close to the max fill, lookups may no longer be amortized constant. That is a bad config, so we should put some restriction on it if we add any such config.

But you rather want to set a very low value for min fill, 1% or something like that? Or disable shrinking completely? Or add some delay before shrinking, e.g. the table must be less than 50% full for 24 hours before we shrink it?

It's worth noting that min fill currently applies to all dicts, e.g. hash, set, zset and all the other dicts in Redis. I don't think you want to prevent all of them from shrinking...
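
For reference, the checks we are discussing amount to roughly this (a sketch assuming the current defaults; a configurable min fill would replace the constant with some new server setting, so the parameter below is only illustrative):

#include <stddef.h>

#define HASHTABLE_MIN_FILL 10   /* current default: shrink below 10% fill */

static int needsExpand(size_t used, size_t size) {
    return used >= size;        /* grow when the table is 100% full */
}

static int needsShrink(size_t used, size_t size, int min_fill_percent) {
    /* Pass HASHTABLE_MIN_FILL today, or a hypothetical configurable value. */
    return size > 0 && (used * 100 / size) < (size_t)min_fill_percent;
}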

Comment From: zuiderkwast

From your benchmark results, it looks like the latency is worst at the beginning of the rehash. At that point ht_used[0] is bigger than ht_used[1], so I would not guess this is a major factor.

Looking at the code for incrementallyRehash, I can see that if some dict was rehashed, it does no active rehash work for the expires dict nor for the other dbs. However, it calls dictRehashMilliseconds(server.db[dbid].dict, 1), i.e. it rehashes for 1 millisecond, so it is not surprising that it delays some requests by 1 millisecond. If you want very low latency, I think this can be a problem. Perhaps we should change this 1ms value to microseconds and make it configurable?
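
Paraphrasing that code path from memory (not the exact source, see server.c for the real version):

/* Rough paraphrase of incrementallyRehash(): at most one dict gets up to
 * 1 millisecond of active rehashing per call, and the early return means
 * the expires dict and later dbs get nothing that tick. */
int incrementallyRehash(int dbid) {
    if (dictIsRehashing(server.db[dbid].dict)) {
        dictRehashMilliseconds(server.db[dbid].dict, 1);
        return 1; /* the millisecond budget is spent, skip the rest */
    }
    if (dictIsRehashing(server.db[dbid].expires)) {
        dictRehashMilliseconds(server.db[dbid].expires, 1);
        return 1;
    }
    return 0;
}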

Btw, have you tried to change the frequency hz of serverCron?

Comment From: judeng

@zuiderkwast sorry for the late reply, there have been too many problems in the past week

From your benchmark results, it looks like the latency is worst at the beginning of the rehash. At that point ht_used[0] is bigger than ht_used[1], so I would not guess this is a major factor.

This seems really strange; my guess is that the rehash of db.dict had already completed first in that final stage. I will test the performance after removing the influence of incrementallyRehash in serverCron.

Looking at the code for incrementallyRehash, I can see that if some dict was rehashed, it does no active rehash work for the expires dict nor for the other dbs. However, it calls dictRehashMilliseconds(server.db[dbid].dict, 1), i.e. it rehashes for 1 millisecond, so it is not surprising that it delays some requests by 1 millisecond. If you want very low latency, I think this can be a problem. Perhaps we should change this 1ms value to microseconds and make it configurable?

You are right, I missed the return line! My service is very latency-sensitive, and because I want to limit the delay caused by expireCycle in serverCron, my hz configuration is 100! But now I find the side effect is the delay from rehashing. Changing the millisecond to microseconds is indeed a simple and effective fix, but I wonder if there is a better way to rehash quickly during off-peak periods without affecting performance during peak periods.

I also found that avoiding rehashing db.expires and db.dict at the same time reduces the peak memory usage.

Comment From: oranagra

thanks for bringing this up.

  1. Avoid rehashing the db dict and the expires dict at the same time
  2. ~~Avoid rehashing multiple dbs at the same time (standalone mode)~~
  3. In dictFind, when d->ht_used[1] is bigger than d->ht_used[0], search table 1 first to improve the hit rate
  4. Skip the incrementallyRehash call in serverCron when the load is high: under high load every lookup already performs a step of dictRehash, which is fast enough (open question: how to judge that the load is high)
  5. ~~Use a bio thread for zfree when the dict exceeds a certain size, e.g. d->ht_used[0] > 4096~~

1. i see that this is handled by #12616, i'll comment there.
2. already dismissed?
3. it reminds me of #5692 (i don't recall the state; maybe it's time to push it through, let's comment there if that's the case).
4. sounds like a good idea, if we can think of a way to do that (ideally it shouldn't be boolean and erratic, but something that's gradually diminished). if you can come up with a suggestion, let us know.
5. already dismissed?

Comment From: enjoy-binbin

  5. Use a bio thread for zfree when the dict exceeds a certain size, e.g. d->ht_used[0] > 4096

I see it was dismissed; is this still valuable? I know that jemalloc usually frees large memory very quickly (in libc it may be very slow), see also #8365. And we also have a jemalloc-bg-thread option in redis.conf.

I have tested that zfree of a large allocation is not very slow with jemalloc, but during the benchmark, freeing large memory (a DEL of a big 512MB string from another client) does affect the max latency:

Summary:
throughput summary: 244738.12 requests per second
latency summary (msec):
avg         min         p50         p95         p99         max
0.108       0.032       0.111       0.119       0.127       0.599


Summary:
throughput summary: 226551.88 requests per second
latency summary (msec):
avg         min         p50         p95         p99         max
0.117       0.032       0.111       0.199       0.207       16.591

Comment From: judeng

@enjoy-binbin thank you. Yes, I have done similar tests and the results are consistent with yours. I don't know the reason; my guess is that sizes that are exact powers of two get special handling in jemalloc, and a 512MB sds string takes up a few more bytes than a plain 512MB array because of the sds header.

Comment From: enjoy-binbin

Ohh, in fact my previous test was done on an old version (redis-4). I verified that when jemalloc-bg-thread yes is enabled, the max latency spike can be avoided.

Comment From: enjoy-binbin

Looking at the code for incrementallyRehash, I can see that if some dict was rehashed, it does no active rehash work for the expires dict nor for the other dbs. However, it calls dictRehashMilliseconds(server.db[dbid].dict, 1), i.e. it rehashes for 1 millisecond, so it is not surprising that it delays some requests by 1 millisecond. If you want very low latency, I think this can be a problem. Perhaps we should change this 1ms value to microseconds and make it configurable?

@oranagra incrementallyRehash now works in microseconds; INCREMENTAL_REHASHING_THRESHOLD_US defaults to 1000us. Do we have any idea how to make it configurable? Or add an active_rehash_effort like active_expire_effort, with range [1, 10]: the default of 10 means we use 1000us, and 1 means we use 100us, something like that.
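
Something along these lines (just a sketch; active_rehash_effort is a hypothetical config, and INCREMENTAL_REHASHING_THRESHOLD_US is the existing 1000us default):

#include <stdint.h>

#define INCREMENTAL_REHASHING_THRESHOLD_US 1000

/* Map a hypothetical effort setting in [1,10] to a per-cron rehash budget:
 * 1 -> 100us, ..., 10 -> 1000us (today's behavior). */
static uint64_t rehashBudgetUs(int active_rehash_effort) {
    if (active_rehash_effort < 1) active_rehash_effort = 1;
    if (active_rehash_effort > 10) active_rehash_effort = 10;
    return (uint64_t)INCREMENTAL_REHASHING_THRESHOLD_US * active_rehash_effort / 10;
}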

  4. Skip the incrementallyRehash call in serverCron when the load is high: under high load every lookup already performs a step of dictRehash, which is fast enough (open question: how to judge that the load is high)

And we could somehow combine this with it, e.g. assume that the number of clients is related to the load and spend less time in serverCron when the number of clients is high.

or something like that:

+    /* Here we control the time used by rehash based on hz.
+     * 1s = 1000ms = 1000000us */
+    uint64_t us_used = 1000000 / server.config_hz * server.rehash_ratio_cpu / 100;

Comment From: oranagra

i don't think the number of clients (likely to be a lot of subscribers) is a good factor to use here. and i'd like to avoid a config (which will require trial and error and expertise to tune). if you can come up with a better automatic mechanism for 4 that actually measures progress and decides how much further to push, i think that's a better direction.

Comment From: enjoy-binbin

@oranagra btw, a bit off topic: do we have any plan to include the rehash dict overhead in freeMemoryGetNotCountedMemory? In my fork we included it, assuming it is temporary memory that will be freed soon.

Comment From: zuiderkwast

An idea for 4. If we aren't spending any time sleeping in the event loop, it means the load is high, right? Can we skip incremental rehash if we spent less than, say, 5% of the time sleeping during the last x milliseconds?
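
A minimal sketch of how that could be measured (all of these names are made up, nothing like this exists in the event loop today): record how long each poll call slept, and let serverCron compare the slept share of the window against a threshold.

#include <stdint.h>

typedef struct {
    uint64_t window_start_us;   /* start of the current measurement window */
    uint64_t slept_us;          /* time spent blocked in the poll call */
} idleStats;

/* Call from the event loop with the duration of each poll sleep. */
static void idleRecordSleep(idleStats *s, uint64_t slept) {
    s->slept_us += slept;
}

/* Call from serverCron: returns 1 if we were idle enough to afford
 * active rehashing, then resets the window. */
static int idleAllowsRehash(idleStats *s, uint64_t now_us, uint64_t min_idle_pct) {
    uint64_t elapsed = now_us - s->window_start_us;
    int ok = elapsed == 0 || (s->slept_us * 100 / elapsed) >= min_idle_pct;
    s->window_start_us = now_us;
    s->slept_us = 0;
    return ok;
}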

Comment From: oranagra

@oranagra btw, a bit off topic: do we have any plan to include the rehash dict overhead in freeMemoryGetNotCountedMemory? In my fork we included it, assuming it is temporary memory that will be freed soon.

from my understanding, the purpose of freeMemoryGetNotCountedMemory is not to ignore temporary memory usage, but to avoid a feedback loop: eviction puts DELs in the replication and aof buffers, which can increase memory usage and cause further eviction. before we had non-blocking eviction, that could have caused an endless loop until the db was drained of keys (if the values are very small and the key names are large).

An idea for 4. If we aren't spending any time sleeping in the event loop, it means the load is high, right? Can we skip incremental rehash if we spent less than, say, 5% of the time sleeping during the last x milliseconds?

sounds like an interesting idea. so if there is traffic, the traffic causes rehashing by itself and cron is not needed; cron is really only mandatory when there's no traffic... we do have the activerehashing config, but what we're proposing here is an automatic mechanism. my only concern would be pathological cases in which avoiding that rehashing could backfire. maybe you wanna draft a PR and start discussing it from there.

Comment From: enjoy-binbin

An idea for 4. If we aren't spending any time sleeping in the event loop, it means the load is high, right? Can we skip incremental rehash if we spent less than, say, 5% of the time sleeping during the last x milliseconds?

sounds like an interesting idea. so if there is traffic, the traffic causes rehashing by itself and cron is not needed; cron is really only mandatory when there's no traffic... we do have the activerehashing config, but what we're proposing here is an automatic mechanism. my only concern would be pathological cases in which avoiding that rehashing could backfire. maybe you wanna draft a PR and start discussing it from there.

I've tried this before: cache the poll numevents, e.g. eventLoop->numevents, and then check it in serverCron. If numevents == 0, there were no file events between the time events, which means there is no traffic, so do the active rehashing then.

Comment From: zuiderkwast

I've tried this before: cache the poll numevents, e.g. eventLoop->numevents, and then check it in serverCron. If numevents == 0, there were no file events between the time events, which means there is no traffic, so do the active rehashing then.

Does it work well?

Comment From: zuiderkwast

If we do active rehashing only when there is no traffic at all (no file events between time events), then just one I/O event between every cron event would prevent active rehashing, which is not very good. I guess we need to do active rehashing when there is some traffic too, not only when there is no traffic at all.

Comment From: oranagra

it's probably best if it's not boolean. i.e. we gradually allow more rehashing when the server is less busy.
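
e.g. (purely illustrative) the per-cron rehash budget could be scaled by a measured idle percentage instead of being switched on and off:

#include <stdint.h>

/* idle_pct in [0,100]: 0 = fully busy (no rehash budget),
 * 100 = fully idle (full budget), linear in between. */
static uint64_t gradualRehashBudgetUs(uint64_t idle_pct, uint64_t max_budget_us) {
    if (idle_pct > 100) idle_pct = 100;
    return max_budget_us * idle_pct / 100;
}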

Comment From: enjoy-binbin

If we do active rehashing only when there is no traffic at all (no file events between time events), then just one I/O event between every cron event would prevent active rehashing, which is not very good. I guess we need to do active rehashing when there is some traffic too, not only when there is no traffic at all.

I guess it works well, but there isn't much sample data... We eventually ended up with a config like incrementalrehashing that disables the rehashing done in dict.c (on reads/writes) and just leaves activerehashing enabled (with its time control), because we feel incrementalrehashing is less controllable: the more traffic there is, the more rehashing is done, which affects performance (and some users complain a lot about that). And active rehashing in serverCron is not particularly slow (we include the rehash dict overhead in freeMemoryGetNotCountedMemory, so I guess it is not too bad in some ways...).

it's probably best if it's not boolean. i.e. we gradually allow more rehashing when the server is less busy.

Yes, just sharing the idea. The main problem now may be how to judge the traffic, which is very subjective; I feel it is difficult to tune. If we add configuration items, it will be similar to the activerehashing-effort I proposed earlier.