The following is a proposal to accelerate Redis performance by:
1. Improving io-thread efficiency by fully offloading the network layer from the main thread.
2. Reducing the load on the main thread's remaining functionality by moving memory management and response formatting to the io-threads.
3. Amortizing memory access latency by batching dictionary searches.

We believe that by implementing these steps, a single Redis instance running on a multicore machine will deliver more than 1 million rps, up from under 400K rps today.

Introduction

Io-threads were added in Redis 6.0 in order to split the network processing load, and part of the protocol layer, across two or more cores. Io-threads can be used for both incoming requests and outgoing responses. When used on incoming traffic, io-threads not only read from the socket but also attempt to parse the first incoming request in the input buffer. In both directions the main thread operates in a similar way: it balances the io tasks equally between the io-threads and itself, executes its own share of the tasks, and waits for the others to complete theirs before processing commands. This fan-out/fan-in approach keeps the solution simple, as it requires no complex synchronization except for a barrier between the main and io threads.

We measured the performance of Redis (version 7.0) with and without io-threads. All tests were performed on an EC2 Graviton 3 instance with 8 cores (r7g.2xl) with no replica and no TLS. We first populated the DB with 3 million keys of 512 bytes each and sent GET/SET requests from 500 clients, with an 80%/20% GET/SET distribution. We measured the following performance numbers:
1. Without io-threads: 205K rps
2. With 6 additional io-threads ("--io-threads 7"): 295K rps
3. With 6 additional io-threads also doing reads: 390K rps

When analyzing the performance we found the following issues:

a. Underutilized cores: despite the 2x acceleration that io-threads provide, the Redis main thread still spends only 57% of its time executing commands (processPendingCommandAndInputBuffer). This implies that (1) the io-threads are practically idle 57% of the time, and (2) the main thread spends the remaining 43% on io-related functionality that could and should be executed by the io-threads. In addition, out of the 57% spent executing commands, more than 9% (addReplyBulk) goes to translating objects into responses.

b. Memory management: the Redis main thread spends more than 7% of its time freeing objects that were mostly allocated by the io-threads. Such discordance between allocators and freers may cause lock contention on the jemalloc arenas and reduce efficiency.

c. Memory access pattern: the Redis dictionary is a straightforward but inefficient chained hash implementation. While traversing the hash linked lists, every access to a dictEntry structure, a key pointer, or a value object is likely to incur an expensive external memory access. In our tests the main thread spent more than 26% of its time on dictionary-related operations.

Our suggestions:

Io-threads

We suggest offloading all network layer functionality from the main thread to the io-threads. Our preferred approach would be to divide the client layer into two halves. The first half, handled by the io-threads, will be the "stream layer", which includes the socket layer as well as parsing requests and formatting responses; the second half, handled by the main thread, maintains the client's execution state (blocked, watching, subscribing, etc.). Based on this split, io-threads handle read/write/epoll on the sockets, parse and allocate commands, and append ready-to-execute commands to one or more queues. The main thread extracts commands from the queues, executes them in the client context, and appends responses to queues from which the io-threads extract, format, and transmit them on the sockets. In addition to the standard Redis commands and responses, internal control commands will be exchanged between the io-threads and the main thread, such as creating and erasing client states, pausing reads from client sockets, requests to free previously allocated memory, etc. A rough sketch of this hand-off is shown below.
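To make the hand-off concrete, here is a minimal C sketch of the queue-based design. Every name in it (ioMsgType, ioCmdMsg, ioQueue, and the two loop functions) is hypothetical and only illustrates the shape of the proposal, not actual Redis code:

```c
#include <stdint.h>

/* Hypothetical sketch of the io-thread <-> main-thread hand-off;
 * these are not actual Redis structures. */
typedef enum {
    MSG_COMMAND,      /* fully parsed client command, ready to execute */
    MSG_CLIENT_NEW,   /* internal: create client state on the main thread */
    MSG_CLIENT_CLOSE, /* internal: erase client state */
    MSG_PAUSE_READS,  /* internal: stop reading from a client socket */
    MSG_FREE_MEM      /* internal: free memory on the thread that allocated it */
} ioMsgType;

typedef struct ioCmdMsg {
    ioMsgType type;
    uint64_t client_id;    /* which client context to execute in */
    int argc;
    void **argv;           /* parsed arguments, allocated by the io-thread */
    struct ioCmdMsg *next;
} ioCmdMsg;

/* One queue per direction per io-thread; synchronization omitted. */
typedef struct ioQueue {
    ioCmdMsg *head, *tail;
} ioQueue;

/* io-thread: read/write/epoll on sockets, parse requests, enqueue
 * MSG_COMMAND messages; dequeue replies, format and transmit them. */
void ioThreadLoop(ioQueue *to_main, ioQueue *from_main);

/* main thread: dequeue commands, execute them in the client's context,
 * enqueue reply payloads (and internal control messages) back. */
void mainThreadDrain(ioQueue *from_io, ioQueue *to_io);
```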

Dictionary memory access

Redis de facto executes commands in batches. We suggest that, before executing a batch of commands, we ensure that all memory locations needed for the dictionary operations are already in the (L1/2/3) caches, avoiding the latency of external memory accesses. This is done by looking up all the keys of a batch of commands in the dictionary before executing them. Searching the dictionary for more than one key at a time, with proper use of prefetch instructions, amortizes memory access latency. From our experience, up to 65% of the time spent on dictionary operations can be eliminated this way.
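Below is a minimal sketch of the two-phase idea, using a generic chained hash table as a stand-in for the Redis dict; all types and names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical chained hash table, standing in for Redis' dict. */
typedef struct entry { struct entry *next; char *key; void *val; } entry;
typedef struct table { entry **buckets; uint64_t mask; } table;

uint64_t hash_key(const char *key); /* some hash function, assumed given */

/* Phase 1: prefetch the bucket head of every key in the batch, so the
 * external-memory fetches overlap instead of being paid serially. */
static void prefetch_batch(table *t, char **keys, int n) {
    for (int i = 0; i < n; i++)
        __builtin_prefetch(t->buckets[hash_key(keys[i]) & t->mask]);
}
/* Phase 2: execute the commands; the per-command dictFind-style
 * lookups now mostly hit the cache. */
```

The prefetches are only hints: if a line is evicted before phase 2, the lookup still succeeds, just without the latency saving.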

Alternative solutions

An alternative approach to increasing parallelism would be to let the main and io threads run in parallel on an unmodified client layer and prevent conflicting accesses between the io and main threads with locks. Such an approach, while theoretically simpler, may require considerable testing to ensure consistency, as well as care to avoid lock-related issues such as contention and deadlocks.

Dictionary inefficiencies can be solved by replacing the hash table implementation with a more "cache friendly" one that amortizes memory accesses by storing entire buckets in one or more adjacent cache lines. This approach has the advantage of being more "neighbor friendly", as it issues considerably fewer requests to the memory channels. However, since the Redis dictionary exposes a non-standard programmatic interface ("key to dictEntry" mapping rather than "key to value" mapping), replacing the hash implementation would require a much bigger coding and testing effort than the proposed alternative.
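For illustration, a possible (purely hypothetical) bucket layout in which a negative probe usually touches a single cache line:

```c
#include <stdint.h>

/* Hypothetical 64-byte bucket: the 1-byte tags let most negative
 * probes finish without dereferencing any entry pointer. */
#define BUCKET_SLOTS 7

typedef struct bucket {
    uint8_t tags[BUCKET_SLOTS];    /* 1-byte hash fragments, checked first */
    uint8_t used;                  /* occupancy bitmap */
    void   *entries[BUCKET_SLOTS]; /* pointers to key/value records */
} __attribute__((aligned(64))) bucket;
```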

Comment From: madolson

Thanks for bringing this up!

Io-threads

This definitely seems like the bigger and more impactful change, related to the discussion in https://github.com/redis/redis/issues/11119, where we discussed threaded IO improvements. This is broadly aligned with where I want the Redis core to go: a multi-threaded IO layer that does the parsing and sends full commands to one, and in the future many, execution engines, where memory handles are passed back to the IO threads for sending over the network. @zuiderkwast FYI.

Obviously there is a lot of detail we need to investigate here. @redis/core-team As long as we're still onboard with trying to implement this for Redis 8, I think the next step would be trying to come up with a more defined plan and review that.

Dictionary memory access

AFAIK this is a novel suggestion we haven't seen made before. It intuitively makes sense and seems like something we could work on independently first. We just need some way to accumulate some number of commands, prefetch, then execute.

Comment From: oranagra

Thank you for the detailed research and suggestion. Here are some random comments that occurred to me while reading.

a single Redis instance running on a multicore machine

This may not be the right use-case to optimize. Considering Redis' design and philosophy (share nothing, horizontal scaling), we care more about multiple Redis shards (possibly without IO threads) running on a multi-core machine. That's of course not a reason not to improve vertical scaling, and offloading IO to threads may improve both cases. I'm just saying that some of these observations may not be relevant for that case, or can be interpreted differently.

The first half, handled by the io-threads, will be the "stream layer", which includes the socket layer as well as parsing requests

As you pointed out, freeing memory that was allocated by an IO thread from the main thread can cause contention on the arena lock, but I also want to point out that in some cases that memory is retained in the database, which can lead to fragmentation issues.

The main thread extracts commands from the queues, executes them in the client context, and appends responses to queues from which the io-threads extract, format, and transmit them on the sockets.

If the IO thread is doing the formatting of responses, it must either have access to the original copy of the data (which poses issues, since the main thread can already modify or deallocate it), or we need to make another copy, which can be wasteful. Alternatively we can revive the idea of reference counts, move it from robj to sds, etc.

If I'm not mistaken, you also suggest that we do IO in threads in parallel to the main thread doing command execution, so note that we need to be careful not to violate the AOF and fsync-every-write guarantees (the barrier thing), i.e. not to send the reply to a client before we have fsynced the write it performed to the disk (this can become more complicated).
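To make that constraint concrete, a minimal sketch of one way to track it (all names hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: each pending reply records the AOF offset its command
 * produced; the io-thread flushes a reply only once that offset has
 * been fsynced. */
typedef struct pendingReply {
    uint64_t aof_offset;  /* AOF bytes this reply depends on */
    /* ... formatted reply buffer ... */
} pendingReply;

extern _Atomic uint64_t aof_fsynced_offset; /* advanced after each fsync */

static bool replyCanBeSent(const pendingReply *r) {
    return r->aof_offset <= aof_fsynced_offset;
}
```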

This is done by searching in the dictionary all keys from a batch of commands before executing them

Are you suggesting to just fetch that into the CPU cache, and later repeat the lookup (which will have fewer memory bus stalls)? Or to actually keep the pointers into the dict / keyspace for re-use? If it's the first, then I'm not sure the CPU cache is likely to be big enough to hold a big enough portion of the dict, which can be quite large (unless the commands are simple ones that don't operate on big data structures, i.e. not SUNIONSTORE). If it's the second, I think it could easily break when one command modifies the dataset (or triggers dict rehashing). Maybe I didn't understand your suggestion.

However since Redis dictionary deliver a non standard programmatic interface

Note that in recent versions the dictEntry was made opaque, so we can now easily modify it without having to modify a ton of code outside dict.c.

Comment From: touitou-dan

Thanks a lot, Oran, for the comments.

I think we are aligned with the shared-nothing vision. The goal here is to improve efficiency for those users who are using io-threads.

Alternatively we can revive the idea of reference counts, move it from robj to sds, etc.

What we have in mind is a combination of atomic refcounts and copy-on-write for objects that are currently in use by the io-threads.
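A minimal sketch of that combination, assuming a hypothetical sharedBuf type standing in for the refcounted sds:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical shared, refcounted buffer. While an io-thread holds a
 * reference for formatting, the main thread copies instead of
 * mutating in place. */
typedef struct sharedBuf {
    atomic_int refcount;
    size_t len;
    char data[];          /* payload */
} sharedBuf;

sharedBuf *bufCopy(const sharedBuf *b); /* deep copy, assumed given */

static sharedBuf *bufRetain(sharedBuf *b) {
    atomic_fetch_add_explicit(&b->refcount, 1, memory_order_relaxed);
    return b;
}

static void bufRelease(sharedBuf *b) {
    if (atomic_fetch_sub_explicit(&b->refcount, 1, memory_order_acq_rel) == 1)
        free(b);          /* last owner frees */
}

/* Main thread: mutate in place only when it is the sole owner,
 * otherwise copy first (copy-on-write). */
static sharedBuf *bufMakeWritable(sharedBuf *b) {
    if (atomic_load_explicit(&b->refcount, memory_order_acquire) == 1)
        return b;
    sharedBuf *copy = bufCopy(b);
    bufRelease(b);
    return copy;
}
```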

...so note that we need to be careful not to violate the AOF and fsync-every-write guarantees

In such a case the main thread will send an internal command to the io-thread to pause output to the client until the fsync completes.

Are you suggesting to just fetch that into the CPU cache, and later repeat the lookup (which will have fewer memory bus stalls)? Or to actually keep the pointers into the dict / keyspace for re-use?

The first one. I propose that whenever a batch of commands is ready for execution, we invoke a modified, vectorized dictFind operation on some of the keys before the command executions. This is a "best effort" procedure that aims to bring all the needed memory addresses "closer" to the CPU core before executing the commands. The number of commands per batch and their type is TBD, but from my experience the benefit shows already with batches of 6-7 requests, so thrashing the cache shouldn't be a concern. And since dictFind has no side effects, consistency shouldn't be a concern either. Also please note that (a) the same approach can be used to accelerate multi-key operations such as MGET, MSET and EXEC even in a single-thread configuration, and (b) the vectorized dictFind can be extended to prefetch deeper into the robj, or even the data structure pointed to by the robj, as sketched below.
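For illustration, a hypothetical extension of the earlier prefetch sketch (reusing its table/entry types) that goes one level deeper:

```c
/* Hypothetical: once pass 1 has pulled the bucket heads into cache, a
 * second pass can prefetch the value object itself (in Redis terms,
 * the robj, or even what the robj points to). */
static void prefetch_batch_deep(table *t, char **keys, int n) {
    for (int i = 0; i < n; i++)   /* pass 1: overlap bucket-head fetches */
        __builtin_prefetch(t->buckets[hash_key(keys[i]) & t->mask]);
    for (int i = 0; i < n; i++) { /* pass 2: heads are now cached */
        entry *e = t->buckets[hash_key(keys[i]) & t->mask];
        if (e) __builtin_prefetch(e->val); /* value object / deeper structure */
    }
}
```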