The problem/use-case that the feature addresses

Currently queries are limited to 1M arguments: https://github.com/redis/redis/blob/1ddecf1958924b178b76a31d989ef1e05af81964/src/networking.c#L1851, a.k.a. the maximum query multibulk length. This problem is especially apparent when handling "restore" type commands, such as processing an AOF rewrite line that tries to populate a set, and it might be a limiting factor in other cases.
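
For context, the multibulk length is the *N element count at the head of a RESP command, and Redis pre-allocates argv from that header before the arguments themselves arrive. An illustrative encoding (SADD with two members; a restore-style SADD repopulating a set with millions of members would need a correspondingly large *N):

*4\r\n
$4\r\nSADD\r\n
$5\r\nmyset\r\n
$2\r\nv1\r\n
$2\r\nv2\r\n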

Description of the feature

Ideally this would be configurable. To address the problem of argv pre-allocation when the limit is very large, we should add a mechanism for growing the query argv dynamically instead of pre-allocating it. Dynamic growing would prevent a bogus command from causing a huge allocation, and at that point we might want to consider removing the limit altogether.

Comment From: oranagra

@yoav-steinberg maybe you can search for previous complaints about that limit and refer to them here, so we can gather more evidence about the importance of this limit.

Comment From: zuiderkwast

Here's a recent one: #9005

Comment From: oranagra

Thanks Viktor. It's recent, but it's just a theoretical query, not really a use case that ran into a problem. In any case, it would be good to remove that hard coded limitation and make it adaptive (not a config).

Comment From: zuiderkwast

@oranagra Well, you mentioned some potential trouble that a large number of args could cause there:

  • Redis runs OOM, crashing Redis or the OS
  • Client buffer causes Redis to evict all keys

It was requested that we document and explain the problems.

Now a config is requested.

Possibly, limiting the total length of a command could also be something to configure, for the same reasons.

If any app lets an untrusted user submit a large data set that triggers a large Redis command, that would be a vulnerability.

Comment From: oranagra

at the moment this allocation is capped at 8mb. rather than adding another configuration, i remember Salvatore suggested to allocate it gradually as data flows in, so a client can't cause a 10gb allocation just by sending a few chars; he'll need to actually send data to gradually fill this buffer. and coupled with the new client eviction, he may be kicked out before it causes any damage.

Comment From: yoav-steinberg

Here's @antirez's comment on the issue: https://github.com/redis/redis/issues/4660#issuecomment-363437761

Comment From: oranagra

yeah. that's what i remembered when i asked you to open this issue.

Comment From: yoav-steinberg

Coming back to this, I think the goal is to have argv grow dynamically during parsing. There are a few issues and options here:

  • One simple way of handling this would be to allocate argv with an upper bound (something like 1024), and then realloc it as needed during argument parsing. This keeps the existing argv accesses in the code unchanged, but on the other hand it incurs the memcpy's associated with the reallocs.
  • Instead of doing reallocs, use a more complex data structure: an array of pointers to sub-argv arrays, where each sub-argv array contains up to 1k entries. For example, if we have 2050 args then:

[0] -> [0..1023] args
[1] -> [0..1023] args
[2] -> [0..1] args

Then we can realloc the main array, which is small, once a sub-array is full. The downside is that to access arg 2049 we'd need argv[2049 / 1024]->values[2049 % 1024] instead of just argv[2049], which means lots of changes in the code (we'd use a macro for this; see the sketch below).

  • Alternatively, use a more complex data structure, for example a quicklist, and just push new args onto it. This reduces reallocs and keeps the code simpler, because no new "growing list" logic has to be implemented. The downside is that all code accessing argv[x] would need to be changed to access the quicklist by index. Also, having the args in a quicklist means we won't have a separate robj per arg, so this might break lots of other things...

@oranagra WDYT?
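
For concreteness, here's a rough sketch of what the second option could look like (a sketch only: argvChunk, argvList, ARGV_CHUNK_SIZE and ARGV_GET are hypothetical names, and robj is assumed from the Redis codebase):

/* Two-level argv: a small top-level array of pointers to fixed-size
 * chunks of robj pointers. Only the small top-level array is ever
 * realloc'ed (when a chunk fills up). */
#define ARGV_CHUNK_SIZE 1024

typedef struct argvChunk {
    robj *values[ARGV_CHUNK_SIZE];
} argvChunk;

typedef struct argvList {
    argvChunk **chunks;   /* grows by one pointer per 1024 args */
    int nchunks;          /* number of allocated chunks */
    int argc;             /* total number of args stored */
} argvList;

/* Every existing argv[i] access site would have to become this
 * two-hop lookup, hidden behind the macro. */
#define ARGV_GET(l, i) \
    ((l)->chunks[(i) / ARGV_CHUNK_SIZE]->values[(i) % ARGV_CHUNK_SIZE])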

Comment From: oranagra

@yoav-steinberg i don't think the complexity and the mass LOC change are worth it; i think a realloc is fine. if someone has a command with over 1000 arguments, he'll suffer a memcpy of 8kb, which doesn't sound like a problem. the gradual growing mechanism can start with 1024 entries and then double each time it fills up, so that if someone sends a command with 2 million arguments, he'll pay the price of these reallocs: 8k, 16k, 32k ... 16mb (11 reallocs in total). that doesn't sound like a high price to pay for such an extreme case, which will likely run into other O(n) issues in redis further down the pipeline.
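
For illustration, a minimal sketch of this gradual-growth scheme (ensureArgvCapacity and ARGV_INITIAL_LEN are hypothetical names; client, robj, argv_len and zrealloc are assumed from the Redis codebase):

/* Grow argv geometrically while parsing, instead of pre-allocating
 * the full multibulk length up front. */
#define ARGV_INITIAL_LEN 1024

static void ensureArgvCapacity(client *c, int needed) {
    if (c->argv_len >= needed) return;
    int newlen = c->argv_len ? c->argv_len : ARGV_INITIAL_LEN;
    while (newlen < needed) newlen *= 2;   /* 1k, 2k, 4k, ... entries */
    c->argv = zrealloc(c->argv, sizeof(robj*) * newlen);
    c->argv_len = newlen;
}

Starting at 1024 entries (8kb of robj pointers) and doubling means a 2-million-argument command tops out at a 16mb pointer array after 11 reallocs, in line with the estimate above.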