FLUSHALL is considered a heavyweight command. It takes a long time to complete, and during that time the server is practically unavailable. As an alternative, FLUSHALL supports the ASYNC option, which runs the command in the background to avoid stalling the server. This, however, raises another issue for flows that quickly reconstruct the database: they might allocate memory faster than FLUSHALL ASYNC releases it. The user can identify such a state via INFO, and may get OOM errors or see eviction of the newly written data.

To overcome this issue, we can add a basic throttling mechanism between FLUSHALL ASYNC and write commands that kicks in only when there is not enough memory. That is, if consumed memory is over maxmemory and FLUSHALL ASYNC is running in the background, write operations will be blocked. Note that in the common case the FLUSHALL operation will be much faster than any database reconstruction, so even if an operation gets paused, it will be paused for only a few milliseconds, until FLUSHALL releases just enough memory. As for the pause-write implementation, Redis already has such a mechanism that can be leveraged.
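A minimal sketch of that check, assuming hypothetical helpers flushallAsyncInProgress() and pauseWriteClients() (zmalloc_used_memory() and server.maxmemory are real Redis symbols; everything else here is illustrative, not existing code):

    /* Sketch only: throttle check run before executing a write command.
     * flushallAsyncInProgress(), pauseWriteClients() and
     * THROTTLE_PAUSE_INTERVAL_MS are hypothetical names. */
    int shouldThrottleWrite(void) {
        /* Only throttle while a FLUSHALL ASYNC job is still releasing
         * memory in the background... */
        if (!flushallAsyncInProgress()) return 0;

        /* ...and only when we are actually over the configured limit. */
        if (server.maxmemory == 0) return 0;
        return zmalloc_used_memory() > server.maxmemory;
    }

    void beforeWriteCommand(client *c) {
        if (shouldThrottleWrite()) {
            /* Reuse the existing pause-write machinery instead of
             * replying -OOM or evicting freshly written keys. */
            pauseWriteClients(THROTTLE_PAUSE_INTERVAL_MS);
        }
    }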

Comment From: oranagra

In a nutshell, in case the write traffic during FLUSHALL ASYNC reaches an OOM condition, we'll get the CLIENT PAUSE WRITE behavior (postponing / throttling) instead of an OOM error to the client or an eviction of the newly populated keys.

Comment From: madolson

I'm not convinced we should implement flow control for this specific feature. As the requester stated, Redis doesn't behave incorrectly; it will just start rejecting writes or performing evictions.

Comment From: oranagra

Redis will behave correctly, but it's not useful for users.

Users want a FLUSHALL command that doesn't block traffic for a long duration. Ideally it would release all the memory instantly, but since that's not possible, it's good to do it in the background and avoid the negative implications of doing so.

Comment From: madolson

I'm not sure that "not useful" for users is true. The user still needs to be able to handle the errors if their workload is pushing data near the limits. In all of our internal testing it's very hard to outpace the speed of freeing data.

Comment From: moticless

Hi @madolson, the fact that your tests didn't encounter the issue doesn't mean we can ignore a real problem that reproduced for us and can potentially happen to others.

IMHO, let's make Redis simple and more predictable to use, not the other way around. As a user who runs FLUSHALL ASYNC on the entire database, I don't want any potential eviction afterward. I don't want to implement any command flow control afterward. And I don't want to be forced to fall back to synchronous FLUSHALL either.

Comment From: madolson

I'll clarify my point: this seems to over-prioritize a single case (which I believe is constructed and not real). There are multiple places in Redis where flow control is more important, like replication (limiting COW and COB growth), and it could also be applied to all lazy-free jobs, not just FLUSHALL. There is no reason to believe that individual async deletes couldn't also cause unexpected evictions.

Today, for replication, we let it consume extra memory; we could do the same for FLUSHALL as well. I do think flow control is better, though.

Comment From: madolson

@moticless can you also post the reproduction?

Comment From: oranagra

@madolson AFAIK the reason replication buffers aren't included in eviction limits is to avoid a feedback loop. That's already a bad situation (same as lazy-free eviction), and doing the same for FLUSHALL would be even worse.

I agree it's a good idea to add throttling in other places as well (COW, replication buffers), but I'm not sure we should tie them together (for FLUSHALL it's considerably simpler than for COW).

> I'm not sure that "not useful" for users is true. The user still needs to be able to handle the errors if their workload is pushing data near the limits. In all of our internal testing it's very hard to outpace the speed of freeing data.

Users must handle errors, but in some cases it becomes very complicated, and if Redis can avoid throwing useless errors and instead block the connection, that's much better. Imagine that instead of blocking on CLIENT PAUSE WRITE, we would have returned an error for write commands...

e.g. what we do currently with BUSY on long scripts is a result of naive code inside Redis; it would be much better to just postpone anything other than SHUTDOWN and SCRIPT KILL. Ideally, each client connection would be able to define a timeout after which Redis returns a BUSY error, rather than it being a server configuration. Many workloads (other than some watchdogs) would be much better served by an infinite block.

Another example is what we did for incremental eviction: the fact that we're over the memory limit only justifies an OOM error if we have nothing more to evict; as long as we do have something to evict, we'd rather avoid the OOM error.

Comment From: moticless

> @moticless can you also post the reproduction?

Sorry for misleading. We didn't reproduce the issue. There were some indications and evaluations that we might reach this issue as part of our product, in which a DB importer flow reconstructs the database. Therefore, for now, we prefer to use synchronous FLUSHALL.

Comment From: madolson

Sure, then maybe I'll restate an earlier point. When we tested it internally, we were unable to generate a load that ingested data faster than the background thread was able to free it, except in single-core cases with pinning. You state that it could still be possible, which is fair, but it would be better if there was a concrete, known way it could happen.

Comment From: moticless

Hi @madolson, I managed to reproduce OOM errors by running FLUSHALL ASYNC on a database filled with many small strings (DB1) and, right after, filling it up with many big strings (DB2). That is, FLUSHALL ASYNC didn't release memory faster than the database filled up with big items, as:

                                  |DB1| ~= |DB2| ~= 1G <  maxmemory = 1248576000

The 3 basic steps of my script are:

  • memtier starts by filling up the database with many small strings of size 10 bytes, up to ~1G of used memory
  • FLUSHALL ASYNC now has a lot of work to release in the background. Note: synchronous FLUSHALL takes around 7 sec on my setup.
  • memtier adds many big strings of size 50K (while FLUSHALL runs in the background), up to ~1G of used memory. During that time I got several OOM errors. Note: this command takes around 2 sec on my setup.

  memtier_benchmark --unix-socket=/run/redis.sock -R --command="set __key__ __data__" --command-key-pattern=P --data-size-pattern=R -c 1 --data-size-range=10-10 -t 10 -n 10000000 --randomize --pipeline=150
  redis-cli flushall async
  memtier_benchmark --unix-socket=/run/redis.sock -R --command="set __key__ __data__" --command-key-pattern=P --data-size-pattern=R -c 2 --data-size-range=50000-50000 -t 1 -n 10000 --randomize --pipeline=150

Obviously this is not a common scenario, yet we simply cannot take the risk. (The numbers can be further fine-tuned to make the scenario less "pathological".)

Comment From: madolson

@moticless I appreciate you spending the time to come up with that benchmark. It does seem rather "pathological", with a dramatic behavior change right after sending an async flush, but I understand the inclination to harden Redis.

The only ask I have is that we at least consider generalizing it to include other modes, such as an async delete of a large item. As long as we have pending async items to delete, we won't immediately return OOM to clients.
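A sketch of how that generalized condition might look inside the OOM check. bioPendingJobsOfType() and BIO_LAZY_FREE are real symbols from Redis's bio.c; the surrounding policy is only an assumption:

    /* Sketch: generalize the throttle to any pending lazy-free work
     * (FLUSHALL ASYNC, UNLINK, lazy eviction), not only FLUSHALL. */
    int oomShouldDeferError(void) {
        if (server.maxmemory == 0) return 0;
        if (zmalloc_used_memory() <= server.maxmemory) return 0;

        /* Over the limit, but the background thread still has memory
         * to release: postpone the client instead of replying -OOM. */
        return bioPendingJobsOfType(BIO_LAZY_FREE) > 0;
    }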

Comment From: moticless

I like your suggestion of generalizing this feature to also cover UNLINK. Maybe also lazy-evict?

Comment From: ShooterIT

AFAIK, we can free memory at 1 GB per second; did you encounter this case in a production environment?

Comment From: moticless

I did all of my tests on my local machine. Generally speaking, the time it takes to release memory is largely influenced by the size of the items being released. Consider my reproduction above: allocation is faster than release because it allocates bigger chunks of memory.

Comment From: moticless

I would like to consult with you about a design issue I encountered:

Today there are 3 different types of events that trigger pauseClients():

  • PAUSE_DURING_FAILOVER
  • PAUSE_BY_CLIENT_COMMAND
  • PAUSE_DURING_SHUTDOWN

When clients are paused, beyond blocking client traffic, it also pauses the database, in the sense that it:

  • Blocks sending PING to replicas (if clients are paused during a Redis Cluster manual failover)
  • Pauses eviction processing
  • Pauses expire processing

In other words, there is a (misleading?) assumption that client-pause also means the server is static. Now I would like to add another trigger, PAUSE_THROTTLE_CLIENT_WRITE, to pauseClients(), which, in my opinion, shouldn't "pause-database", only "pause-clients".
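One possible shape for that, sketched against the existing pause_purpose enum from server.h (the new purpose and the helpers are assumptions, not existing code):

    /* Existing purposes plus the proposed one (sketch). */
    typedef enum {
        PAUSE_BY_CLIENT_COMMAND = 0,
        PAUSE_DURING_SHUTDOWN,
        PAUSE_DURING_FAILOVER,
        PAUSE_THROTTLE_CLIENT_WRITE, /* proposed: pause clients only */
        NUM_PAUSE_PURPOSES
    } pause_purpose;

    /* Hypothetical helper: only the original purposes imply a static
     * database (no eviction, no active expire, no replica PINGs). */
    int isPausedForStaticDb(void) {
        return isPausedByPurpose(PAUSE_BY_CLIENT_COMMAND) ||
               isPausedByPurpose(PAUSE_DURING_SHUTDOWN) ||
               isPausedByPurpose(PAUSE_DURING_FAILOVER);
    }

isPausedByPurpose() is hypothetical as well; the point is only that eviction and expire processing would keep consulting the "static db" purposes while ignoring the new throttle purpose.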

Right now I am thinking about making a preliminary PR that tries to separate this tight coupling. In fact, the same argument can be made about PAUSE_BY_CLIENT_COMMAND. Do we really want it to pause-database or not? Do we want to expose the option to the client to decide?

Happy to hear your thoughts

Comment From: oranagra

Looking at an older Redis, before CLIENT PAUSE WRITE was added: even then, during a client pause, we also suspended active expire, etc., and it seems it was very much on purpose; i.e. CLIENT PAUSE implies the database is static. I think we should make the THROTTLE mode an exception in this case, and let the other pause methods keep enforcing a completely static DB.

Comment From: moticless

Already today we support two flavors of pause, CLIENT_PAUSE_WRITE and CLIENT_PAUSE_ALL. Example API call:

  • pauseClients(... , CLIENT_PAUSE_WRITE);

The selected flavor is rather hidden and not reflected in the API, and there is also an implicit ordering of the modes (WRITE < ALL). The API doesn't relate to the caller's context:

  • areClientsPaused() doesn't distinguish between CLIENT_PAUSE_WRITE and CLIENT_PAUSE_ALL.
  • checkClientPauseTimeoutAndReturnIfPaused() has the same problem.

IMHO, we should consider rebranding this feature, at least internally, to something more like pause-services, and extend the pause flavors into a bitmask:

  • PAUSE_SVC_CLIENT_WRITE
  • PAUSE_SVC_CLIENT_ALL
  • PAUSE_SVC_EXPIRE
  • PAUSE_SVC_EVICT
  • PAUSE_SVC_REPLICA

And the API could be more like:

  • pauseServices(... , SVC_PAUSE_CLIENT_WRITE | SVC_PAUSE_EVICT | SVC_PAUSE_EXPIRE)
  • areServicesPaused(SVC_PAUSE_CLIENT_WRITE | SVC_PAUSE_CLIENT_ALL) - returns a bitmask of paused services.
  • areServicesPaused(SVC_PAUSE_EXPIRE)
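A sketch of what that bitmask-based API could look like; all names below come from the proposal above (the lists spell them both PAUSE_SVC_* and SVC_PAUSE_*; the sketch uses the former), and none of them exist in Redis today:

    /* Proposed service bitmask (sketch). */
    #define PAUSE_SVC_CLIENT_WRITE (1 << 0)
    #define PAUSE_SVC_CLIENT_ALL   (1 << 1)
    #define PAUSE_SVC_EXPIRE       (1 << 2)
    #define PAUSE_SVC_EVICT        (1 << 3)
    #define PAUSE_SVC_REPLICA      (1 << 4)

    /* Pause the given set of services until 'end' (mstime). */
    void pauseServices(mstime_t end, uint32_t services);

    /* Return the subset of 'services' that is currently paused. */
    uint32_t areServicesPaused(uint32_t services);

    /* Example: CLIENT PAUSE <ms> WRITE keeps today's semantics by
     * pausing writes together with eviction and expiry. */
    void clientPauseWriteExample(mstime_t end) {
        pauseServices(end, PAUSE_SVC_CLIENT_WRITE |
                           PAUSE_SVC_EVICT |
                           PAUSE_SVC_EXPIRE);
    }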

Comment From: soloestoy

At first glance I think it's an over-optimization case, because if the user sets maxmemory-eviction-tenacity to 100, the write command will be blocked until used_memory becomes lower than maxmemory, no matter whether we have items in the bio lazyfree thread or not, so the user would not receive a "negative" OOM error.
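For context, this is roughly how evict.c turns the tenacity setting into a per-call eviction time budget (paraphrased from memory, so treat it as a sketch rather than the exact source):

    /* Sketch of evict.c's tenacity-to-time-budget mapping: at
     * tenacity 100 there is no time limit, so performEvictions()
     * keeps evicting (blocking the writing client) until used
     * memory drops below maxmemory. */
    static unsigned long evictionTimeLimitUs(void) {
        if (server.maxmemory_eviction_tenacity <= 10) {
            /* Linear progression from 0 to 500 us. */
            return 50uL * server.maxmemory_eviction_tenacity;
        }
        if (server.maxmemory_eviction_tenacity < 100) {
            /* ~15% geometric progression above that. */
            return (unsigned long)(500.0 *
                pow(1.15, server.maxmemory_eviction_tenacity - 10.0));
        }
        return ULONG_MAX; /* No limit on eviction time. */
    }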

So, maybe the problem you want to address is not blocking the Redis server, but instead only pausing the OOM-triggering command if some items are in the bio thread, while allowing other (read and delete) commands? I'm not sure if it's useful; correct me if I misunderstand.

Comment From: moticless

> At first glance I think it's an over-optimization case, because if the user sets maxmemory-eviction-tenacity to 100, the write command will be blocked until used_memory becomes lower than maxmemory, no matter whether we have items in the bio lazyfree thread or not, so the user would not receive a "negative" OOM error.

You are correct, but this doesn't resolve the problem that I am trying to address. That is, in case there is a major lazyfree in the background, such as FLUSHALL ASYNC or UNLINK, we can avoid redundant evictions or rejections simply by pausing client writes until we drop below the OOM limit. Setting maxmemory-eviction-tenacity to 100 might cause multiple redundant lazyfrees until the background thread releases them and memory drops below the limit.

> So, maybe the problem you want to address is not blocking the Redis server, but instead only pausing the OOM-triggering command if some items are in the bio thread, while allowing other (read and delete) commands? I'm not sure if it's useful; correct me if I misunderstand.

If I get you right, we are on the same page. I am not blocking the redis-server, only those clients that attempt to write (useful for watchdogs, for example).

Comment From: moticless

@oranagra @madolson, all along, I have not been super happy with the idea of blocking write clients whenever there is at least one pending lazyfree job. In the worst case, during OOM, we might reach a paralyzed state in which, for each single lazyfree job (rather small in the common case), clients get paused and are released only after the background thread frees it, and so forth. We'd better think again about fine-tuning the conditions to pause clients on write. This is what I have in mind (a sketch follows the list):

  1. A pending lazyfree FLUSHALL job - pause for sure
  2. Some configurable threshold evaluating the amount of bytes / number of lazyfree jobs to be released
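A sketch of that fine-tuned condition. lazyfreeGetPendingObjectsCount() is a real helper in Redis's lazyfree.c; flushallAsyncInProgress(), lazyfreePendingBytesEstimate() and the lazyfree_throttle_min_bytes config are hypothetical:

    /* Sketch: pause write clients only when the pending lazy-free
     * work is big enough to be worth waiting for. */
    int shouldPauseWritesForLazyfree(void) {
        if (server.maxmemory == 0 ||
            zmalloc_used_memory() <= server.maxmemory) return 0;

        /* Condition 1: a pending FLUSHALL job always throttles. */
        if (flushallAsyncInProgress()) return 1;     /* hypothetical */

        /* Condition 2: enough pending lazy-free work to cross a
         * configurable threshold (hypothetical config). */
        return lazyfreeGetPendingObjectsCount() > 0 &&
               lazyfreePendingBytesEstimate() >=     /* hypothetical */
                   server.lazyfree_throttle_min_bytes;
    }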

Comment From: soloestoy

Thanks for your reply @moticless. It seems there are two problems we are trying to solve:

  1. the "negative" OOM error caused by lazyfree. I think the key point is how to understand "lazyfree", IMHO it doesn't mean redis can delete object and free memory immediately, it means we free memory asynchronously(don't block server) and gradually(may hold the memory in a long time). In this understanding OOM is not a wrong error I think, maybe you think timeout(pause) is better than an error? But that may introduce some new problems, like it's hard to trouble shoot the timeout reason(write command paused by lazyfree).
  2. the "redundant" eviction caused by lazyfree (this is what I concerned too). It's a serious problem, lazyfree may lead to unexpected eviction(fortunately seems no one encounter this in practice until now). We need to fix it, and I think there must be a simple way to fix.

And the flow control mechanism is interesting : )

Comment From: soloestoy

link #11193