The problem/use-case that the feature addresses

We use Redis to dedupe keys before forwarding data downstream: if a key exists in the cache (GET) we drop it, otherwise we set the key (PSETEX) and forward the data downstream. We noticed that every now and then main thread usage spikes to 100% (otherwise it sits at ~50%). After a close look we observed that the CPU spikes align with:

  • a bump in Redis memory usage (and peak memory usage);
  • dict rehash (we enabled a profiler and looked at the call stack);
  • the spikes always occurring at 2GB, 4GB memory usage.

The spike that causes the problem is the rehash that happens at ~4GB memory usage, causing high latency for both GET and SET since each command performs one rehash step.

[Screenshots: Redis_cpu_spike, mem_usage, latency]

Additional info about our environment:

  • Keys are fixed-size 32 bytes with an empty value, and an expiration is set (we think that is why we see rehash happening consistently at 2GB, 4GB and so on).
  • Max memory is set to 9.6GB.
  • 1 CPU allocated (io-threads disabled).
  • We have almost 2B keys across the cluster at peak; peak memory is ~5GB.
  • We use Envoy as proxy/load balancer.
  • Redis version 7.0.8.

Description of the feature

Can we make DICT_HT_INITIAL_EXP (code) configurable? Maybe we would like to set it to 32 (or much less, depending on the traffic in a given prod env), so that for our use case rehash never happens (planning to test this today in a staging env).

Since we know the number of keys we may have in a given prod env, this gives us a way to tune and optimize memory allocation and to reduce the chance of rehash. It could help us save infra cost. Otherwise it has been very hard and inefficient to do capacity estimations for this cache.

Alternatives you've considered

  • Set Redis maxmemory to 4GB and scale up our cluster (maybe not that cost efficient).
  • Set the hz Redis config to 100 (tried on staging; we don't think it helped, and we are not really sure it is a good idea).

Questions:

  • Is there any other way to overcome this? We are open to suggestions and willing to try alternatives.
  • Does setting DICT_HT_INITIAL_EXP to 32 (or higher) have performance implications?

Comment From: vyaramaka

Or maybe allow this load ratio to be configurable. Currently it is used/size >= 1; IIUC rehash is more likely to happen when this value is between 1 and 2 (worst case being 5), which means 1 or 2 elements per bucket. Maybe 3 keys per bucket is not that bad for latency; we could live with it, especially if we can avoid (or at least delay) the CPU spikes. Perhaps with a recommended value between 1 and 5 for the config.

Comment From: judeng

Thank you for sharing the case. Rehash does cause CPU and latency spikes; here is my test information: #11494. Specifically, in your case I would guess that the main dict and the expires dict trigger rehash at the same time, resulting in a 50% performance drop, especially for the SETEX command. If the keys do not carry an expiration time, the performance only decreases by about 20%.

The increase in latency in your case is hundreds of times; I guess it is caused by Redis throughput reaching a bottleneck (as you say, CPU reaches 100%), causing the command queue to block. You could check your proxy timeout policy and add more shards to avoid the CPU reaching 100%. Hope this helps.

Comment From: judeng

Regarding making DICT_HT_INITIAL_EXP configurable, I'm afraid it is not a good idea. You know that you need to configure 32 only because you know the number of keys after the fact. For practical applications, however, it is very difficult to estimate the appropriate value before starting Redis.

Comment From: vyaramaka

Hello @judeng !! Thank you for sharing your experience.

> here is my test information https://github.com/redis/redis/issues/11494

Were there any improvements made along the lines you suggested? Like avoiding the main dict and the expires dict triggering rehash at the same time?

> If the keys does not carry the expiration time

For our use case we need to set an expiration. But out of curiosity, are you suggesting not setting an expiration and letting the cache grow until keys get evicted at some point? Wouldn't that make rehashes even more expensive (we have very many keys; the example I shared here is from one of our smallest datacenters)? Also, while evictions are in progress, would that slow down cache responses?

> You could check your proxy timeout policy

We did tune it; currently Envoy's ops_timeout is 0.8s. Could you please elaborate on this and how it will help?

> Regarding making DICT_HT_INITIAL_EXP configurable, I'm afraid it is not a good idea.

I understand your point, but I have to slightly disagree. Yes, initially we may not know the number of keys, but eventually we would (in our case we have a very good idea from the beginning). We could start with the default values and tune them to our needs after the fact; having this configurable would help us tune it and save $$, without having to scale out for CPU bottlenecks. Also, the point you raised is about capacity estimation, not about potential pitfalls or dangers.

Comment From: vyaramaka

Update: I used 32 as a maximum; we don't intend to actually use that value, we will need ~24.

Comment From: judeng

> For our use case we need to set expiration date

Sorry, maybe I didn't explain well. I just meant that SETEX suffers a greater performance regression while rehashing.

> We did tune it, currently envoy ops_timeout is 0.8s. Could you please elaborate on this and how it will help ?

Rehash does not directly cause such a large latency spike. The direct reason is that after the Redis CPU reaches 100%, the proxy's connections keep sending requests to Redis until ops_timeout is reached; then the connection is disconnected and the requests are dropped.

> We can start with default values, and tune it to our need after fact, and having it configurable would help us tune them and save $$, without having to scale for cpu bottlenecks.

I get your idea. We can indeed predict when the rehash operation will be triggered based on the current number of keys, and then trigger the dict expansion in advance. Am I wrong? I think this might be helpful, but note that it would only be for the main dict; we don't want to do this for the hash/set/zset types.

Comment From: vyaramaka

Hello @judeng !! Thank you !! I see you created a PR based on our discussion here. I will take a look and reach out if I have any questions.

Comment From: hpatro

@vyaramaka Thanks for explaining the issue pretty descriptively. I presume your concern is around latency and the temporary CPU/memory bump is not a problem.

Thinking out loud: would it make sense, for latency-sensitive users, to make rehashing on the command processing path configurable, i.e. only do incremental rehashing as part of the cron and skip the single-bucket rehash step per command? This might prolong the rehashing somewhat, but it would give a way around this particular scenario.

Comment From: vyaramaka

@hpatro, thanks !! Got busy working around this but am making progress; I will definitely revisit this soon. Yeah, I am open to trying any configuration for this. At the moment dict and rehash have multiple constants, none of them configurable, which makes it hard to tune for our needs. And yes, for our use case elevated CPU is not a problem; the spikes to 100%, and the resulting latency, are the problem.

Comment From: vyaramaka

I am planning to make some local changes and test different configs. So far I tried making DICT_HT_INITIAL_EXP configurable; however, reading server configs in dict.h/dict.c caused cyclic dependencies. I need to revisit it when I find some time.

Comment From: hpatro

@oranagra @zuiderkwast @madolson Any thoughts on this ?

Comment From: hpatro

I haven't thought through the repercussions of disabling the rehash step during command processing, so I will also wait for other maintainers to add their input on this.

@vyaramaka if you're keen on testing quickly and measuring the decrease in latency and the increase in the rehashing time period, please find below a small patch for the idea explained above.

```diff
diff --git a/src/server.c b/src/server.c
index b164b8cbb..f87280d97 100644
--- a/src/server.c
+++ b/src/server.c
@@ -3486,6 +3486,7 @@ void call(client *c, int flags) {
     struct redisCommand *real_cmd = c->realcmd;
     client *prev_client = server.executing_client;
     server.executing_client = c;
+    dictPauseRehashing(c->db->dict);
 
     /* When call() is issued during loading the AOF we don't want commands
      * called from module, exec or LUA to go into the slowlog or to populate
      * statistics. */
@@ -3757,6 +3758,8 @@ void afterCommand(client *c) {
      * So the messages are not interleaved with transaction response. */
     if (!server.execution_nesting)
         listJoin(c->reply, server.pending_push_messages);
+
+    dictResumeRehashing(c->db->dict);
 }
 
 /* Check if c->cmd exists, fills err with details in case it doesn't. */
```

Comment From: oranagra

Hi, I'll try to respond to what I can; hope I'm not missing much. First, I don't see any major downside to a larger DICT_HT_INITIAL_EXP, other than maybe the need for a different one for the main dict vs. one for Hash, Set and Zset keys (which I assume you don't have). It just means that when the db is relatively empty, you'll have higher memory utilization and lower CPU cache efficiency.

I do think that it is very wrong to make it tunable though. Redis should do the right thing implicitly, and should require as little manual tuning as possible.

I do think we can explore ways to reduce the impact of rehashing (both the latency and the temporary memory-utilization spike). This can be done either by slight or moderate tweaks to our current dict, or by completely replacing it with a data structure that doesn't need rehashing, e.g. #8611, #8700, #9517, #10350, #10802 and other linked issues. So far these major or medium-size ideas didn't mature, but we did apply some small improvements in recent years.

I'll have a look soon at the two PRs @judeng made that are linked here.

Comment From: vyaramaka

Hello everyone !! I finally found some time to experiment with this and created this commit, yet to be tested. Please let me know if you see anything wrong; I will keep you posted on results.