Hi @antirez, I have looked at the threaded commands implementation (threaded-core-commands up to a5541e87f) and am opening this issue to discuss it.
I think this is a great use case and a validation of past concepts like the blocking mechanism that was part of the modules work (and btw may need to be refactored out of the module subsystem just for code clarity).
My main feedback is this: the locking mechanism needs to be invoked explicitly by every command that uses it. This not only requires a lot of code changes that make commands less readable, but may also be unnecessary: key access is known in advance, so locking could be a core function. The only piece of information missing to do that is the ability to specify read-only/read-write keys at the command table level (see the sketch below). I'm sure that would be easier and safer than modifying lots of commands.
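Something like this minimal sketch could be enough - all the names here are hypothetical (this is not the actual `redisCommand` table), it's just to show that per-key access intent is compact to declare at the table level:

```c
/* Hypothetical extension of the command table: each command declares
 * where its keys are and whether it reads or writes them, so the core
 * could lock keys automatically before dispatching to a thread. */
#define KEYSPEC_READ   (1<<0)
#define KEYSPEC_WRITE  (1<<1)

struct commandKeySpec {
    int first_key;  /* Argv index of the first key. */
    int last_key;   /* Argv index of the last key (-1 = until last arg). */
    int key_step;   /* Step between consecutive key arguments. */
    int flags;      /* KEYSPEC_READ and/or KEYSPEC_WRITE. */
};

/* Example declarations: SMEMBERS only reads its single key, while
 * SINTERSTORE writes the destination and reads all the source keys. */
struct commandKeySpec smembers_keyspecs[] = {
    {1, 1, 1, KEYSPEC_READ},
};
struct commandKeySpec sinterstore_keyspecs[] = {
    {1, 1, 1, KEYSPEC_WRITE},   /* destination */
    {2, -1, 1, KEYSPEC_READ},   /* source keys */
};
```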
Other concerns:
* Performance: I think spawning a thread per command is not efficient, and a thread pool should be used.
* Performance: there is a fast-path check that guarantees no penalty when no keys are locked. However, in real-world scenarios there will always be locked keys - and I have some concerns about the latency impact of clientShouldWaitOnLockedKeys().
* Safety: the lockKey() practice of silently ignoring repeated locks by the same client is extremely dangerous and should be avoided (unless the lock is recursive... but really, let's not!).
Other thoughts:
* Moving forward to read/write locking is not trivial; I think it will take some design work that hopefully will also consider some of the points above.
* How shall Lua blocking be handled? Blocking only EVAL itself based on declared keys may lead to locking violations if scripts are not well behaved, which is a regression. It may be necessary to strictly enforce key access, or to make it possible to block Lua clients (which would complicate things a lot).
* Same question for modules, which have multiple different flows: RM_Call() on the main thread, RM_Call() on a thread-safe context, and direct API key access.
Comment From: antirez
Hi @yossigo, thanks for the feedback! Honestly I don't agree with most of it :-D But it's very good to have a constructive critique of the design like this.
Before commenting on your notes: the point where I'm of totally the opposite advice is dropping the bottom/top half design of the main command + the callback. This design, found in many systems, notably the Linux interrupt handler implementation, gives a clear distinction between what is executed in the background thread and what is executed in the main thread. This is extremely important because the callback knows it has very limited APIs available, and the limited part is isolated in a very simple way. Moreover, commands sometimes need to lock keys depending on the exact call, or they need to use different kinds of locks that can be much more efficient: encoding such information in the command table is complex and IMHO also useless. For instance, commands implementing STORE can be served by a "store lock"; in that case the key does not need to be locked at all, because when the command returns we just need to replace the key value. STORE is what most O(N) commands implement, so it is a big case. I also don't agree that this design forces command implementations to be more complex: I translated several commands from non-threaded to threaded and it was trivial in all cases, I literally cut & pasted the code. A sketch of the two halves follows.
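To make the two halves concrete, here is a minimal sketch of the pattern (helper names like executeThreadedCommand(), lockKey() and smembersSyncReply() are illustrative, this is not the branch code; lookupKeyRead() and setTypeSize() are the real internals):

```c
/* Top half: regular command handler, runs on the main thread. It
 * parses arguments, decides if going threaded is worth it, and locks
 * the key before handing off to the background thread. */
void smembersCommand(client *c) {
    robj *set = lookupKeyRead(c->db, c->argv[1]);
    if (set == NULL || setTypeSize(set) < 1000) {  /* Arbitrary threshold. */
        smembersSyncReply(c, set);   /* Small set: stay synchronous. */
        return;
    }
    lockKey(c, c->argv[1], LOCK_TYPE_READ);
    executeThreadedCommand(c, smembersThreadedCallback, set);
}

/* Bottom half: runs on the background thread. Only the shared info
 * passed via the options pointer and the reply API for this client
 * are legal here - by construction, nothing else is reachable. */
void smembersThreadedCallback(client *c, void *options) {
    robj *set = options;
    smembersReplyWithElements(c, set);   /* Stream the members back. */
    unlockKey(c, c->argv[1]);            /* Done: let writers proceed. */
}
```

Note how the threshold check and the kind of lock taken are decisions of the command itself, not of a generic top-level dispatcher - that's the second point above.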
Another fundamental note is the following: this is in the core, so if this works, we have the moral obligation to let Redis go a step forward. When you (or others) have a PR that improves this design, we can replace it. But to stop this effort because there are other possible designs does not make sense: here we don't have the backward compatibility concerns of modules.
About your points:
-
Right now there is a maximum number of threads. This implementation will use synchronous calls when the limit is reached, so even as it is, it is already better than what we have in all possible cases. But as you can see from the implementation, blocking the clients that want to lock a key, and resuming them once the number of threads is under control again, is a very simple evolution of the implementation. So I basically believe your note is more of a hint on how to make thread scheduling better, not a limit of the current design. For now the goal was to create a very simple system that works. We can start with that, which is guaranteed to be better than what we have now, and improve it with successive PRs.
-
The implementation right now will also skip every read command and return ASAP. However note that both ACLs and Redis Cluster already use the getKeys() API with success: it uses a static array and normally does not add any big overhead. Later queueClientIfKeyIsLocked() will just do a hash table lookup (see the sketch below). I can't see any critical performance issue here, and anyway there is no way to have zero overhead. The only thing that I believe is sensible, for certain users, is to let them specify a thread pool of 0 threads if they want the old Redis. However this is likely quite futile: if they don't use slow commands with big collections, the locked key count will always be zero, so they don't pay any penalty. If they instead have O(N) commands against big collections, the overhead is peanuts compared to long pauses. So I'm not sure about your point here as well, because the main thing is: A) this uses a mechanism already used by ACL + Redis Cluster. B) The cost is only incurred when there would be long pauses.
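Something like this sketch of the check (the locked_keys dictionary on the DB is an approximation of the branch, and the exact getKeys() signature varies between versions):

```c
/* Sketch of the check discussed above: zero work when nothing is
 * locked, otherwise one hash lookup per key of the command. */
int clientShouldWaitOnLockedKeys(client *c) {
    /* Fast path: no key is locked anywhere in this DB. */
    if (dictSize(c->db->locked_keys) == 0) return 0;

    /* Slow path: resolve the command's keys, as ACLs and Redis
     * Cluster already do, and look each one up. */
    int numkeys = 0;
    int *keyidx = getKeysFromCommand(c->cmd, c->argv, c->argc, &numkeys);
    int must_wait = 0;
    for (int j = 0; j < numkeys; j++) {
        if (dictFind(c->db->locked_keys, c->argv[keyidx[j]]->ptr)) {
            must_wait = 1;   /* At least one key is locked: block. */
            break;
        }
    }
    getKeysFreeResult(keyidx);
    return must_wait;
}
```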
-
Safety of lockKey(): I don't agree, but it's a trivial thing, we can discuss and change it. What I believe is that aborting on repeated locks would force implementations to check whether the user-specified array of keys contains duplicates, and that is a problem (see the sketch below).
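A minimal sketch of the behavior I'm defending, with hypothetical names (lockedKey, createLockedKey(), the return codes) - a repeated lock from the owner is a no-op, so callers can pass the user-supplied key vector as-is, duplicates included (think SINTERSTORE dest src src):

```c
/* Sketch: tolerate repeated locks from the owner instead of aborting,
 * so callers don't need to deduplicate user-supplied key arrays. */
int lockKey(client *c, robj *key, int locktype) {
    lockedKey *lk = dictFetchValue(c->db->locked_keys, key->ptr);
    if (lk != NULL) {
        if (lk->owner == c) return LOCK_ACQUIRED;  /* Already ours: no-op. */
        return LOCK_BUSY;                          /* Held by another client. */
    }
    lk = createLockedKey(c, key, locktype);        /* Hypothetical ctor. */
    dictAdd(c->db->locked_keys, sdsdup(key->ptr), lk);
    return LOCK_ACQUIRED;
}
```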
-
read/write keys: I've tons of notes, but we are not forced to do it. For sure this is the first step anyway.
"A complex system that works is invariably found to have evolved from a simple system that worked."
-
Enforcing script key access is already implemented in my code: it will return an error.
-
I didn't implement this yet, but for modules violating the declared keys it is even simpler: RM_Call() can just return an error.
Thanks for the feedback! My design POV is very far from yours, but having a different view on this is still very much appreciated.
Comment From: yossigo
@antirez This has the potential to be a very big paradigm shift for Redis, so I actually see it as a good thing to have many different opinions (=disagreement) that can be exchanged and validated. I'm sorry I haven't been able to find the time to come up with POC code in addition to the above, maybe that would be a productive effort.
Anyway if I understand correctly, you still consider this work in progress and something that can benefit from further discussions, validations and improvements so that's the goal here.
I immediately thought of the Linux bottom half concept when looking at this, but unlike the kernel case I don't think it's really required to split the implementation - at least not in most cases, where auto-locking and deferring can be trivial. That doesn't mean we have to close the door on making it possible (e.g. for SORT BY), but just consider the command itself as the bottom half in the 99% of cases where that's really possible.
Regarding your point about STORE locks, I think that's really interesting! Can this mean we don't need write locks at all, and can instead just settle for an RCU-like mechanism and provide an atomic swap-in of a key once the store operation has completed? Also, I'm not 100% sure all corner cases are covered in this case and that no consistency issues can arise.
About the Lua and module implementation, I think returning "sporadic" errors on access to locked keys is potentially a big breaking change. At least for Lua one can argue it's the result of a faulty script that doesn't announce its keys, but then perhaps being more strict about that is a better approach in the long term - it streamlines EVAL and makes Lua predictable.
Comment From: antirez
@yossigo yep indeed I think this can be evolved a lot, but what is there is already beta-quality, so to start I want to merge this as a first huge improvement over what we have - of course only in the unstable branch. I'm very skeptical about not keeping this two-halves design for the future, but well, what I mean is: it's core, not modules, we can rewrite commands easily.
> but unlike the kernel case I don't think it's really required to split the implementation
IMHO this is not the big design driver here, but two fundamental facts: 1. In the thread half you are still very limited, so why give the illusion that you can use the Redis API there? You can just access the shared info passed via the 'options' pointer, and you can reply to the passed client, that's all. 2. The command implementation is the one that really should be able to dictate what to do. For instance, as you can see, certain commands will decide not to block if they are operating against small sets, or if passed certain options. To do this at a transparent level, at the top level, is a lot more complex without any good reason.
Store locks are quite powerful I believe. Consider that the write side of a STORE operation only materializes at the STORE moment, when the command finishes. In Redis semantics, we don't have to guarantee any ordering between commands that are not executed in the same connection. As long as commands modify a coherent single state, they can be delayed as much as possible (for instance because of the network). So AFAIK store locks don't change any observable Redis semantics, and at the same time they are free compared to real write locks from the POV of implementation complexity. But for one thing: replication / AOF propagation. This part needs to be handled, while with the current implementation of read locks it is not needed. A sketch of how this could look follows.
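A minimal sketch of how a store lock could resolve (storeOptions, computeIntersection(), scheduleOnMainThread() and unlockStoreLock() are hypothetical names):

```c
typedef struct {
    robj **sources;   /* Source set objects captured by the top half. */
    int numsources;
} storeOptions;

/* Bottom half, background thread: compute the result on data that is
 * private to this command, without the destination key being locked. */
void sinterstoreThreadedCallback(client *c, void *options) {
    storeOptions *opt = options;
    robj *result = computeIntersection(opt->sources, opt->numsources);
    /* Hand the finished object to the main thread for the swap-in. */
    scheduleOnMainThread(c, storeSwapIn, result);
}

/* Main thread: the only write is replacing the destination value,
 * atomic from the point of view of every other client. Replication
 * and AOF propagation happen here, in swap-in order - the one part
 * that, as said above, still needs to be handled carefully. */
void storeSwapIn(client *c, void *privdata) {
    robj *result = privdata;
    setKey(c, c->db, c->argv[1], result);
    unlockStoreLock(c, c->argv[1]);
}
```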
About Lua, this is not the first time that we return an error when the set of declared keys doesn't match: we do that for ACLs, we do that for Redis Cluster hash slots. Nothing has changed. For modules IMHO it is even more of a no-brainer, we just return an error from RM_Call(). However modules, like the other commands, should sooner or later have an option to return ASAP to the caller saying: I wanted to operate on key "foo" that is locked, please reschedule me for when the key is free of locks.
Comment From: antirez
@madolson @soloestoy have your teams given this a look as well?
My plan is the following:
- Merge into unstable ASAP.
- Work a bit more till the end of the month to give it more potential.
- Make it possible to disable this entirely via Redis configuration by setting 0 threads.
- See how it goes for some time. Then Redis 7 will decide if this is ready or not... but this is far future.
Comment From: JimB123
Hi Salvatore. I work with @madolson and am most familiar with this area. My comments:
Salvatore, I was reading your proposal regarding the processing of “slow” commands on a background thread while locking keys from the foreground thread. This is an exciting idea. However, the tweeted messages may have over-simplified some issues with this functionality. There are a number of issues which need to be considered/addressed and which will add significant “challenges” to this capability.
- Even if the command is not explicitly writing, there are several cases where data is modified when reading. These include PFCOUNT, and also any data structure whose format can be internally changed - like the hash table, skiplist, or compressed list encodings, which can be in the process of rehashing or another data transformation. These modifications must be prevented in both the main and background threads.
- There are significant complexities with MULTI/EXEC. If a MULTI/EXEC hits a blocked key, it must either:
- synchronously pause the Redis main thread (waiting for the background thread to release the lock) - this also comes with complexity regarding completing/joining the background command while the main thread is paused.
- pre-examine the MULTI/EXEC to determine if any locked keys will be used (see the sketch after this list). This also comes with the challenge of tracking keys across databases if SWAPDB and/or SELECT is used in the MULTI/EXEC block.
- block all MULTI/EXEC commands if any async command is running.
- There are significant challenges with Lua. Lua is not forced to declare the keys it uses. Furthermore, even if a Lua script does declare which keys are touched, the declaration does not indicate which DB each key is associated with. There is no mechanism to determine keys by inspection (as is possible with MULTI/EXEC). The safest option for Lua might be to block any Lua script if any async command is running. Throwing an error is not an option as it breaks atomicity guarantees.
- There is a conflict with FLUSHDB/FLUSHALL. These keyless commands would need to be blocked until no background thread is running.
- There is the possibility of interaction with EXPIRE and EVICT processing. A deep dive is necessary here to understand the issues with each.
- Client disconnect (on the main thread) needs to interact with async processing in a background thread.
- BRPOPLPUSH is a special can of worms and needs to be carefully considered. Consider BRPOPLPUSH FROM TO. Later, while a background thread has “TO” locked, another client performs an LPUSH FROM. The LPUSH (to a different key) would need to be blocked, OR the BRPOPLPUSH would need to remain blocked.
- Should MONITOR show the executed command at the beginning of processing, or at the end?
- I know you’re currently looking at background reads, but if the slow command needs to perform a write, the logical lock must also block readers on the main thread, not just writers.
- If the background command is a write command, unlocking of the key needs to be coordinated with replication.
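To illustrate the pre-examination option from the MULTI/EXEC item above, a minimal sketch (the locked_keys dictionary is hypothetical; mstate and getKeysFromCommand() follow the existing internals):

```c
/* Sketch: walk the queued MULTI commands before EXEC runs. Returns 1
 * if any queued key is locked, -1 if keys can't be resolved statically
 * (SELECT/SWAPDB in the queue), 0 if it is safe to execute now. */
int execTouchesLockedKeys(client *c) {
    for (int i = 0; i < c->mstate.count; i++) {
        multiCmd *mc = &c->mstate.commands[i];
        if (mc->cmd->proc == selectCommand || mc->cmd->proc == swapdbCommand)
            return -1;  /* DB may change mid-transaction: must fall back. */
        int numkeys = 0;
        int *keyidx = getKeysFromCommand(mc->cmd, mc->argv, mc->argc, &numkeys);
        for (int j = 0; j < numkeys; j++) {
            if (dictFind(c->db->locked_keys, mc->argv[keyidx[j]]->ptr)) {
                getKeysFreeResult(keyidx);
                return 1;   /* Block the whole transaction on this key. */
            }
        }
        getKeysFreeResult(keyidx);
    }
    return 0;
}
```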
In addition, an important part of this proposal involves the blocking of clients until a key is unlocked. The existing blocking mechanism (for blocking commands like BRPOPLPUSH) is a good candidate for re-work at this juncture.
The existing blocking mechanism works as follows:
- The blocking command is processed normally through processCommand().
- In the command’s handler, the command is removed from the client’s normal cmd/argv/argc fields and copied over to custom fields for the command. (The normal fields are cleared.)
- When the command is unblocked, there is a completely separate code path (independent from processCommand()) which is used to finish execution of the blocking command.
- Not only does a separate code path require independent maintenance, but it can also lead to subtle differences in the way commands behave when blocked. (Hint: IIRC, MONITORing a BRPOPLPUSH looks subtly different depending on whether it blocks or not.)
Instead, changing the blocking mechanism will better support this new use-case (and potentially others).
- When a client is blocked (for any reason), leave the cmd/argv/argc attached on the client as normal.
- When the client is unblocked, pass it back through processCommand() to execute as normal. This might result in the client being re-blocked for another reason.
- Some care must be taken to ensure that MONITOR works as expected and only logs the 1st attempt.
- This approach eliminates a bunch of blocking command fields on the client struct. It eliminates a separate code path for finishing these commands. It provides a standard approach to blocking clients.
- Note that some adjustment of timeout might be necessary when re-blocking.
Example: A BRPOPLPUSH is blocked waiting on something to pop. An LPUSH is performed which unblocks the BRPOPLPUSH. We attempt to execute the BRPOPLPUSH normally by sending it through processCommand(). However, it is discovered that the destination key is locked. The BRPOPLPUSH client is re-blocked. Later, when the destination key is unlocked, we again send the BRPOPLPUSH through processCommand() and this time it succeeds.
Example: “MSET k1 v1 k2 v2” is performed. It is recognized that k1 is locked. The client is blocked. When k1 is unlocked, the MSET command is passed back through processCommand(). It is determined that now, k2 is locked. The client is blocked a second time. When k2 is unlocked, the command is sent through processCommand() a third time, and may execute normally.
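A minimal sketch of this unblock path, matching the two examples above (unblockClient() and processCommand() exist today; the CLIENT_REPROCESSING flag used to suppress a duplicate MONITOR feed is hypothetical):

```c
/* Sketch: resume a client whose blocking key was unlocked by simply
 * re-entering the normal command path, instead of a dedicated resume
 * code path. cmd/argv/argc were never detached from the client. */
void retryBlockedClient(client *c) {
    unblockClient(c);                  /* Clear the blocking state only. */
    c->flags |= CLIENT_REPROCESSING;   /* Don't feed MONITOR twice. */
    /* May complete normally, or block again on another locked key,
     * exactly as in the MSET k1/k2 example above. */
    processCommand(c);
    c->flags &= ~CLIENT_REPROCESSING;
}
```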
Comment From: antirez
@JimB123 those are excellent observations indeed! You spotted several issues with the current implementation; other things are already fixed or partially fixed, and many are real challenges. I'll try to reply to most points here:
- PFCOUNT: yep, I know about this, but since there is a long-time plan to turn hyperloglogs into a proper data structure, my guess was to fix that this way as well. However, things like dict rehashing are more challenging. We don't have this problem with many data structures, but we definitely have it with some: for instance Redis Trees are totally immutable in this regard, so AFAIK I can think only of dictionaries. In that case, when the key is locked we can just increment the number of iterators in the dictionary to prevent any rehashing (see the sketch after this list). But yep, this is something we need to keep in mind. I'll try to fix the implementation at least for the dictionaries case.
- MULTI/EXEC: my implementation handles this explicitly, but I failed to recognize the problem with SELECT / SWAPDB. Given that it already works properly most of the time, I guess that a good approximation could be to pause the transaction entirely only if it contains such commands.
- Lua: I don't understand well what you mean here. EVAL indeed allows inspecting the declared keys. What the implementation does is: if the script then accesses keys that are NOT declared and that are locked, Lua returns an error. There is a test for this case in the threaded commands branch. I guess this is enough; it is exactly the same approach taken with ACLs and Redis Cluster redirection. The database of a Lua script is the selected database. Changing DB and then using the declared keys is, for Lua scripts, exactly equivalent to accessing keys that were not declared. So the proper way to run a Lua script changing keys in a specific database is always to do SELECT+EVAL/EVALSHA, declaring the keys in EVAL. Never select() the DB inside Lua, or this will break Redis Cluster and other things.
- FLUSHDB/FLUSHALL: maybe this is not so bad. We just need to scan all the locks and write in each lock that the key was removed (you can see the "deleted" attribute of the lock already). As we do that, we also replace the object stored at the key with a dummy object before the emptyDb() call. In this way, when the last lock on the key is removed, the object gets removed. The "deleted" flag in the implementation, currently not used, can even be used to let DEL always run even if the key is locked. I understand that this was not clear because of the lack of notes.
- BRPOPLPUSH is hard indeed. I have a few things in mind, but we have to consider what the simplest possible approach is.
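For the dictionaries case above, a minimal sketch of the iterator trick (relying on the existing dict.c behavior where incremental rehash steps are skipped while the iterators counter is non-zero):

```c
/* Sketch: while a hash-table-encoded value is locked for a reader
 * thread, bump the dict's iterator counter so _dictRehashStep() will
 * not move buckets under the reader; drop it again on unlock. */
void protectValueFromRehash(robj *val) {
    if (val->encoding == OBJ_ENCODING_HT) {
        dict *d = val->ptr;
        d->iterators++;   /* Incremental rehashing is paused while > 0. */
    }
}

void unprotectValueFromRehash(robj *val) {
    if (val->encoding == OBJ_ENCODING_HT) {
        dict *d = val->ptr;
        d->iterators--;   /* Resume incremental rehashing. */
    }
}
```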
About what you described next: this looks very similar to what I implemented in order to block the clients. I block the client ASAP, before cleaning the argv/argc, and can resume it later easily. Also for multiple keys I do exactly what you said. Not sure if you had a chance to check the implementation. But anyway, many of the problems you raise are spot on.
Comment From: JimB123
Regarding Lua: since Lua allows keys to be undeclared, and has no mechanism to declare keys in alternate DBs (even though selecting/swapping DBs is supported), it creates an issue. Consider a Lua script which modifies DB0-key1 & DB1-key2 - there's no way to define/declare this. (Of course this is not an issue in Redis Cluster, as Cluster only has a single DB.)
I don't think returning an error is appropriate as this breaks the current atomicity guarantees.
The problem is that a Lua script might be completely valid (by current rules/standards) - but if a DIFFERENT client is executing a long-running command, the Lua script might fail, breaking atomicity.
Client A should never have to worry about atomicity problems or errors based on the actions of Client B. (This is handled today since Redis executes everything serially.)
Comment From: antirez
@JimB123 what I mean is that, as documented for several years, a Lua script that modifies a key in DB0 and a key in DB1 is invalid, because it has, indeed, no way to declare such a thing in the EVAL arguments. Scripts that alter cross-DB keys are in violation exactly like scripts that use undeclared keys.
Comment From: JimB123
@antirez Respectfully, I disagree with you here.
The current documentation for EVAL explicitly states that the declaration of keys "is not enforced in order to provide the user with opportunities to abuse the Redis single instance configuration, at the cost of writing scripts not compatible with Redis Cluster." This essentially states that not declaring keys is frowned upon, but is still supported - especially in non-cluster mode (which supports multiple DBs).
The current documentation for EVAL also states "It is possible to call SELECT inside Lua scripts like with normal clients," (and then there are some caveats)
I support the assertion that these SHOULD be invalid - but not the assertion that they ARE invalid.
I would support deterministically killing scripts which issue SELECT or access undeclared keys. However this would not be backwards compatible. To do this in a backwards-compatible way, we could deterministically kill violating scripts IF "async-slow-commands" is enabled - and of course document this.
I do NOT support non-deterministic failures based on race conditions and possible interactions with another client.