The problem

Redis has a well-known issue of mixing dataset memory and other memory overheads in its memory accounting, so that both trigger eviction.

One very common case is that client buffers (usually output buffers) eat a lot of memory and cause massive eviction. This is mitigated by the client-output-buffer-limit config, which disconnects a client if that single client consumes too much. But a common case is one where many clients are connected at the same time, each consuming a big chunk of memory; together these add up to a lot of memory, and that can induce massive key eviction.
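For reference, that existing per-client safeguard is configured per client class as hard limit / soft limit / soft seconds; these are the stock redis.conf defaults:

```
# Disconnect a single client whose output buffer exceeds the hard limit, or
# stays above the soft limit for soft-seconds. Normal clients are unlimited
# by default; replica and pubsub clients are capped.
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
```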

Description of the feature

Redis already counts (most of) the client buffers (the argv buffer is missing, see #5159). What can be done is to track the total sum of all client-related buffers; if it goes over a limit, find a (random / oldest / newest) client and disconnect it, and keep doing that in a loop until we're under the limit.
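A minimal sketch of that loop (the helper names and the maxmemory_clients field are hypothetical; only freeClientAsync() is an existing function):

```c
/* Hypothetical sketch, e.g. run from beforeSleep(): while the sum of all
 * client buffers is over the configured limit, pick a victim and drop it. */
void evictClientsIfOverLimit(void) {
    if (server.maxmemory_clients == 0) return;       /* feature disabled */
    while (totalClientsMemUsage() > server.maxmemory_clients) {
        /* The open question is the selection policy:
         * random / oldest / newest / biggest buffer. */
        client *victim = pickClientToEvict();
        if (victim == NULL) break;                    /* nothing left to drop */
        freeClientAsync(victim);                      /* disconnect it */
    }
}
```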

Additional information

Another common case that causes eviction is dict rehashing, but we can't really solve that, since Redis needs to respect the memory limit and avoid eating far more memory than it is allowed to. Related issues: #1880, #5613, #4213, #5975, #4496

Note that we should not attempt to exclude the client memory from the key eviction. If the clients are consuming less than the threshold, then that's an acceptable (steady) state, and Redis still needs to respect its maxmemory limit. Related issues: #1618, #4668, #4583

Comment From: ushachar

@oranagra why not combine this with / start with a backpressure mechanism (slow down reading from a client / clients if the buffer crosses a 'soft limit' threshold AND we are getting close to our overall limit)? While we'll need to be able to disconnect as a last-ditch measure, slowing down the reads will probably be more effective.

Comment From: oranagra

@ushachar that's indeed an option that can improve certain cases, but there are a few caveats:

  • It's considerably more complicated to implement, even if by "slow down reading" you actually mean "pause" until memory goes below the low watermark.

  • Obviously we don't want to slow down or pause all clients (e.g. ones that may send CLIENT KILL, PING, INFO), just ones which already have a big output buffer (but not ones that have a big query buffer), so it is in fact a completely separate (possibly complementary) mechanism to the "client eviction" one.

  • There are some cases where that pause can cause a deadlock: a client doesn't start reading responses until the whole pipeline is delivered, but we stopped reading from the socket. (Or, if we just slow down and don't pause, that will possibly make things worse: the memory overhead will just grow slower and last longer, reaching the same dimensions.)

The challenge in the disconnection approach is to choose the right client. I.e. if we disconnect the oldest one (the one with the biggest output buffer that has been pending for the longest time), we can get into a situation where no client manages to complete its task and they all get disconnected before completion. We do not want to disconnect the newest one (small buffer), since that can be an innocent client, but we do want to disconnect ones that are gonna get big, and do that at an early stage, so their disconnection enables the older ones to complete their task.

Comment From: madolson

Status: we still need a design effort here; this is probably too complex to have a new person pick up, so I'll leave this unassigned.

Comment From: ShooterIT

@oranagra For the rehash impact, I have an idea. When we expand a dict, we allocate a new big ht[1] that is double the size of ht[0], so ht[1] will be very big if ht[0] is also big. For example, for the db dict, if we have more than 100 million keys, we may use 2GB for ht[1] when expanding the dict. I think we shouldn't allocate a new big ht[1] for the expansion if the sum of used memory and the new ht[1] exceeds maxmemory. Even after eviction and rehashing, we still couldn't add many more keys. What's worse, Redis will keep fewer keys once the sum of ht[0] (half of ht[1]) and used memory exceeds maxmemory, because ht[1] can eat all the leftover memory instead of users' data. But of course we increase hash conflicts if we don't expand the dict.
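A rough sketch of that guard, under the assumption of a hypothetical overMaxmemoryAfterAlloc() helper that checks whether used memory plus a would-be allocation exceeds maxmemory (this is just an illustration of the idea, not the code in #7954):

```c
/* Illustrative only: refuse to grow the hash table when the new bucket
 * array itself would push us past maxmemory; accept a higher load factor
 * (more hash collisions) instead. dictNumBuckets() and
 * overMaxmemoryAfterAlloc() are hypothetical names. */
static int expandAllowed(dict *d) {
    size_t new_buckets = dictNumBuckets(d) * 2;    /* ht[1] is double ht[0] */
    size_t bytes = new_buckets * sizeof(void *);   /* the new bucket array */
    return overMaxmemoryAfterAlloc(bytes) ? 0 : 1;
}
```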

Comment From: oranagra

@ShooterIT that's indeed an interesting idea that is worth examining in detail in another issue / PR. This issue is about disconnecting clients when the sum of their output/input buffers crosses a limit.

Comment From: yossigo

@ShooterIT Interesting idea indeed. It reminds me there were some past discussions about moving the global keyspace to a RAX, which is of course more radical. I don't remember if that idea reached a design dead end or was just abandoned midway.

Comment From: ShooterIT

@yossigo We did discuss a RAX implementation for the keyspace on Slack at some point, but we didn't push it further. I submitted PR #7954 for dict rehash; please have a look as well @oranagra

Comment From: yoav-steinberg

I couldn't really think of a good way to select which clients to disconnect in order to free up memory, and it's probably impossible to predict what effect such an action will have on the application. Any client we disconnect might reconnect and consume the memory again. We might stay in some steady state where clients never get responses or never have their queries processed, and the application goes crazy because of all the client disconnects. Although data won't be evicted and memory won't grow, I think this will cause confusion and unpredictable behavior on the app's side.

I'll try to describe how I see this problem and what kind of solution can address, at least, the most relevant and common cases:

The common case of the problem (evicting lots of keys or outgrowing memory) is probably that a lot of clients are querying the server simultaneously and saturating the network bandwidth. So we have lots of client buffers filled with request or response data, which accumulate and cause eviction or memory bloat. As @ushachar suggested, we can create some threshold (or even use the existing client-output-buffer-limit) and stop processing new requests from clients until existing buffers are drained back under the threshold. This will effectively enforce network throttling to whatever's possible and keep memory in check.
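A very rough sketch of that throttle, assuming hypothetical helpers for the cumulative accounting and a hypothetical soft-limit config (none of these names exist in Redis today):

```c
/* Hypothetical sketch: called before reading more data from a client's
 * socket. totalClientBuffersMemUsage(), clientMemUsage() and the config
 * field client_buffers_soft_limit are illustrative names only. */
int shouldKeepReadingFrom(client *c) {
    /* Below the cumulative threshold: business as usual. */
    if (totalClientBuffersMemUsage() <= server.client_buffers_soft_limit)
        return 1;
    /* Over the threshold: keep reading only from clients whose own buffers
     * are still tiny (so control traffic stays responsive), and let the big
     * buffers drain back under the threshold before resuming the rest. */
    return clientMemUsage(c) < 16*1024;
}
```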

Some notes:

  • @oranagra suggested that if a client uses a pipeline and doesn't drain its responses until it can write the next query, such a solution will block the client forever (leaving the output buffer and query buffers full). This is theoretically correct, but except for multiplexing proxies I'm not aware of clients which work this way, and in the case of multiplexing proxies we can expect far fewer client connections anyway. We could, however, add a timeout mechanism which disconnects connections if they stay in such a blocked state for a long period, although I'm not sure it's worth it.

  • In case the buffer bloat is due to pubsub/monitor connections, we can still take the same approach: stop processing commands until the pubsub/monitor connections are drained. Again, a timeout here can help, in order to disconnect a random connection (or one chosen based on buffer size) in case command pausing doesn't achieve the desired result. Is this really needed?

  • There's the issue of how to decide which clients to pause and/or how to allow control commands like PING/INFO to continue being processed. One option is to only pause clients after we parse the command: if it's a control command, don't pause the client even if we're over the threshold, since its response won't increase memory much; this keeps the server responsive.

  • There's also the scenario of lots of clients writing large queries coupled with a CPU bottleneck. This can theoretically cause memory issues due to large query buffers, unrelated to network bandwidth (although it can also happen due to network bandwidth and large queries). Here too, pausing connections could solve the issue. What's important is that we pause the connection before reading the next command, which contradicts the previous suggestion of first parsing the command in order to permit control commands. I can think of two solutions (sketched right after this list):

  • Pause the connection after parsing the command name but before parsing the arguments.
  • Have a separate cumulative-query-buffer-threshold in addition to the cumulative-client-buffer-threshold. If we pass the query buffer threshold, we pause all connections before parsing the commands until we're back below it.
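A sketch of where those two hypothetical thresholds could be checked (the names follow the suggestion above; neither is a real config today):

```c
/* 1) Before reading more bytes from any socket: guards against query-buffer
 *    bloat regardless of which command is coming. */
if (totalQueryBuffersMemUsage() > server.cumulative_query_buffer_threshold)
    pauseReadingFromAllClients();

/* 2) After parsing the command name, before generating its reply: lets small
 *    control commands (PING, INFO, CLIENT KILL, ...) through even while the
 *    cumulative output buffers are over the limit. */
if (totalClientBuffersMemUsage() > server.cumulative_client_buffer_threshold &&
    !isControlCommand(c->cmd))
    pauseClient(c);
```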

Comment From: oranagra

  • Isn't the redis.py pipeline behaving the way I was concerned about (it won't start reading the responses until all the requests are sent)?
  • Other than pubsub and monitor, please be aware of client tracking redirection. In all of these cases, pausing the client with the large output buffer will not stop that output buffer from growing, so we do need to either find who to throttle, or drop these clients after some threshold / timeout. (Again, the focus is that we're not only looking at an individual client's memory usage, but at the cumulative damage all clients can cause to the keyspace.)

It may be a good idea to write some tests to simulate some of the scenarios before we try to solve them. Please be aware of #7911.

Comment From: oranagra

@yoav-steinberg in case it helps, here's an old branch with some initial work I did on maintaining a pool of the best clients to evict, and evicting them when over the limit: https://github.com/oranagra/redis/commits/client_eviction (see the last 3 commits)
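For context, the general shape of such a pool (a small, fixed-size set of eviction candidates refreshed by sampling, so we never have to scan every client at once) could look roughly like this; the structure and names are illustrative, not the actual code in those commits:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch of a client eviction pool: keep the N biggest
 * candidates seen so far, sorted ascending by memory, and when the total
 * client memory is over the limit kill candidates from the top down. */
#define CLIENT_EVICTION_POOL_SIZE 16

typedef struct {
    uint64_t id;    /* client id of the candidate */
    size_t mem;     /* its buffer memory when it was sampled */
} evictionCandidate;

static evictionCandidate pool[CLIENT_EVICTION_POOL_SIZE];

/* Offer a sampled client to the pool; entries stay sorted ascending by mem,
 * so pool[CLIENT_EVICTION_POOL_SIZE-1] is always the best eviction victim. */
static void poolOffer(uint64_t id, size_t mem) {
    int i = 0;
    while (i < CLIENT_EVICTION_POOL_SIZE && pool[i].mem < mem) i++;
    if (i == 0) return;   /* smaller than everything we already keep */
    /* Drop the smallest entry, shift the rest down, insert below slot i. */
    memmove(pool, pool + 1, sizeof(pool[0]) * (i - 1));
    pool[i - 1] = (evictionCandidate){ .id = id, .mem = mem };
}
```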

Comment From: JimB123

@oranagra I'm just looking at this now as @madolson brought this to my attention.

A single large misbehaved client is (as you mention) handled by the client-output-buffer-limit parameter. The hash table size/resizing is an amortized, per-key overhead and can be discounted. The separate issues related to hash table resize have been addressed by @ShooterIT in https://github.com/redis/redis/pull/7954

Focusing only on the cumulative COB issue, there are some nuances. Have you considered the difference between primary & replica?

On a primary, COBs contribute to maxmemory and cause evictions as a result. On a replica, evictions can't occur.

In the situation where Redis is used as a LRU cache, the primary will always be at maxmemory. Consider a use case where a limited number of write clients are attached to the primary. And then a large number of read clients are attached to a replica. The replica will, almost always, be at maxmemory before any clients even attach! In this case, it's sort-of the role of maxmemory to leave enough room (above maxmemory) for these replica clients.

This creates a really weird situation - on the primary, COBs are part of the maxmemory pool. On the replica they effectively aren't part of this pool.

I'm not generally a fan of "more knobs". If there was a setting to limit total COB usage, wouldn't a customer want to be able to add more data if they weren't actually using the space for COB? If I understand what you're proposing, this new limit would not change the natural balance and eviction related to COBs. It would just limit the amount of data we are willing to evict to support COBs. That could easily cause confusion as (I think) many users may assert "I don't want to evict any data for COBs" - without realizing that the alternative is not having any COBs!

Just thinking out loud... Maybe the setting could default to something like 50% - a large default which says we will allow UP TO 50% of maxmemory to be used for COBs. On the low end, we could restrict the minimum value to something like 10% - which would prevent users from inadvertently setting the value too small.
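As a quick illustration of that arithmetic (the parameter and the 50% / 10% numbers are just the ones floated above, nothing decided), with maxmemory set to 4gb the suggested default would tolerate up to roughly 2gb of COBs before clients start getting killed:

```c
/* Hypothetical: derive the COB budget from maxmemory and a percentage
 * setting, clamped to the suggested 10% floor so it can't be set too small. */
unsigned long long cobBudgetBytes(unsigned long long maxmemory, int cob_pct) {
    if (cob_pct < 10) cob_pct = 10;     /* suggested minimum */
    if (cob_pct > 100) cob_pct = 100;
    return maxmemory / 100 * cob_pct;   /* e.g. ~2gb for 4gb at 50% */
}
```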

Then, the remaining question is what to do if COB usage exceeds the amount we are willing to evict? I think the most reasonable thing is to kill the client with the largest COB - as if it exceeded the client-output-buffer-limit.

Also worth noting, we can't do anything which requires looping over all of the clients in a single pass. With a large number of clients, that would have prohibitive latency costs.

Comment From: yoav-steinberg

@JimB123 a few questions

This creates a really weird situation - on the primary, COBs are part of the maxmemory pool. On the replica they effectively aren't part of this pool.

This is already the case even without any new "cumulative COB client eviction" (note that by "eviction" I simply mean disconnecting the client). So I think we can add this feature without regard to this weird situation. Actually, by enabling this feature on the replicas you get to control how much memory COBs will consume, which adds a safety mechanism that wasn't possible before. Am I missing something here?

If I understand what you're proposing, this new limit would not change the natural balance and eviction related to COBs. It would just limit the amount of data we are willing to evict to support COBs. That could easily cause confusion...

I think there's merit in not changing how things work today while still adding a needed safety mechanism. I'm assuming that in all practical situations COBs (and also client query buffers; together let's call them CBs) should be much smaller than maxmemory. So having this new setting should let users enforce this assumption and have only minimal data evicted when something goes wrong and CBs start to inflate.

Maybe the setting could default to something like 50% - a large default which says we will allow UP TO 50% of maxmemory to be used for COBs.

This default seems large, and I'm not sure it's advantageous to have a percentage value here. Also, how does this avoid having any data evicted for CBs? It seems to me a fixed MB/GB size would do just as well here.

Then, the remaining question is what to do if COB usage exceeds the amount we are willing to evict? I think the most reasonable thing is to kill the client with the largest COB - as if it exceeded the client-output-buffer-limit.

Also worth noting, we can't do anything which requires looping over all of the clients in a single pass. With a large number of clients, that would have prohibitive latency costs.

This is what I was thinking too. Looping over clients is a performance issue, and just killing the largest might not be nearly enough. I'm currently considering maintaining an approximation of a sorted list (an approximation for performance reasons) and then walking that list and disconnecting clients from largest to smallest until we're under the cumulative CB limit.

Here's what I've got so far based on @oranagra's PR: https://github.com/redis/redis/pull/8687
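For the "approximation of a sorted list", one cheap way to do it (a sketch of the general idea, not a description of what the PR actually implements) is to group clients into buckets by the log2 of their memory usage, so keeping the structure up to date is O(1) per client and eviction walks from the biggest bucket down:

```c
#include <stddef.h>

/* Illustrative sketch: bucket i holds clients using roughly 2^i bytes of
 * buffers; each bucket would hold a linked list of its clients (omitted).
 * To evict, walk buckets from CLIENT_MEM_BUCKETS-1 downwards and disconnect
 * clients until cumulative usage is back under the limit. */
#define CLIENT_MEM_BUCKETS 32

static inline int clientMemBucket(size_t mem) {
    int b = 0;
    while (mem > 1 && b < CLIENT_MEM_BUCKETS - 1) { mem >>= 1; b++; }
    return b;   /* roughly floor(log2(mem)), clamped to the last bucket */
}
```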

Comment From: oranagra

On a primary, COBs contribute to maxmemory and cause evictions as a result. On a replica, evictions can't occur.

In the situation where Redis is used as a LRU cache, the primary will always be at maxmemory. Consider a use case where a limited number of write clients are attached to the primary. And then a large number of read clients are attached to a replica. The replica will, almost always, be at maxmemory before any clients even attach! In this case, it's sort-of the role of maxmemory to leave enough room (above maxmemory) for these replica clients.

This creates a really weird situation - on the primary, COBs are part of the maxmemory pool. On the replica they effectively aren't part of this pool.

I agree this situation is weird. Obviously we're not gonna do something like having the replica inform the master of its memory usage and have it evict keys in order to satisfy the memory needs of the replica. I don't think it's the master's role to leave some unused memory for COBs on the replicas; maybe that's something an external cluster-manager software can do.

The mechanism we're adding of disconnecting clients when over the memory limit would do good for that use case; even on a replica that doesn't do eviction, it will still be able to limit the client buffer sizes.

If there was a setting to limit total COB usage, wouldn't a customer want to be able to add more data if they weren't actually using the space for COB? If I understand what you're proposing, this new limit would not change the natural balance and eviction related to COBs. It would just limit the amount of data we are willing to evict to support COBs. That could easily cause confusion as (I think) many users may assert "I don't want to evict any data for COBs" - without realizing that the alternative is not having any COBs!

Yes, the user would certainly want to be able to add data to the database if the client buffers are not high; this is why I think it would be wrong to reserve memory for CBs. Like in the past, CBs can cause eviction, we just wanna limit their impact so that the user doesn't end up with an empty database.

If we see many requests asking to completely protect the data from eviction due to CBs, we can add a separate reservation setting, or a flag that says that maxmemory doesn't count CBs (now that they have their own limit, that's a viable option), but I don't feel we wanna do that yet.

Just thinking out loud... Maybe the setting could default to something like 50% - a large default which says we will allow UP TO 50% of maxmemory to be used for COBs. On the low end, we could restrict the minimum value to something like 10% - which would prevent users from inadvertently setting the value too small.

IIUC, your suggestion to limit this to 10% is so that someone won't try to set it to 0% (trying to completely avoid eviction), right? I agree with Yoav: I don't think this setting should be a percentage, and it should be clear that all it does is kill clients when their total memory usage is over the limit. I don't think anyone will try to set it to 0.