Crash report
Paste the complete crash log between the quotes below. Please include a few lines from the log preceding the crash report to provide some context.
933388:M 01 Sep 2023 17:50:56.817 # === ASSERTION FAILED ===
933388:M 01 Sep 2023 17:50:56.817 # ==> networking.c:1575 'ln != NULL' is not true
------ STACK TRACE ------
Backtrace:
./redis-server *:6379(freeClient+0x5b8)[0x55aadf3214d8]
./redis-server *:6379(+0xbe117)[0x55aadf328117]
./redis-server *:6379(beforeSleep+0x5a)[0x55aadf2fc22a]
./redis-server *:6379(aeMain+0x2e)[0x55aadf2f732e]
./redis-server *:6379(main+0x3af)[0x55aadf2ec61f]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f37f24456ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f37f2445785]
./redis-server *:6379(_start+0x21)[0x55aadf2ece21]
Additional information
Steps to reproduce:
- Boot up Redis with
io-threads>0,io-threads-do-reads yes, and TLS enabled:./redis-server --io-threads 4 --io-threads-do-reads yes --tls-port 6379 --port 0 --tls-cert-file ~/workspace/testcerts/certs/server.crt --tls-key-file ~/workspace/testcerts/certs/server.key --tls-auth-clients no - Run traffic using Memtier/another benchmarking tool with TLS support to activate IO threads:
./memtier_benchmark --test-time=100 -c 10 -t 10 --pipeline=1000 --tls --tls-skip-verify - Send a query that causes OpenSSL to have leftover buffered bytes (
SSL_pending()> 0), then close the connection:BIG_VALUE=$(head -c 32839 < /dev/zero | tr '\0' 'a') (printf "*5\r\n\$3\r\nSET\r\n\$1\r\nX\r\n\$32839\r\n${BIG_VALUE}\r\n\$2\r\nEX\r\n\$4\r\n1000\r\n"; sleep 0.1) | openssl s_client --host localhost --port 6379
Cause:
After calling SSL_read, it is possible that OpenSSL reads more data from the SSL records than was requested by the Redis server. This could happen if Redis is asking for an amount of data that would not be evenly divisible by the record size. E.g. Redis asks for 1 KiB of data, but the OpenSSL record was 8 KiB, in which case 7 KiB of data is buffered inside of OpenSSL.
When a large multi-bulk item comes in, Redis only asks for enough data to finish the current item read to prevent additional copying, as an optimization. This can cause OpenSSL to have bytes remaining in its buffers, even when the file descriptor is no longer considered readable. Because of this, Redis needs to check SSL_pending to check if the SSL buffer has remaining data, otherwise it risks leaving the data in OpenSSL buffers until the next readable event comes in on the FD.
If io-threads-do-reads is disabled, when read events come in from the event loop, we trigger the TLS event handler, which in turn triggers our read callback that causes data to be read using connRead, and w/ TLS enabled, this uses SSL_read. After the read callback, data can be left in the OpenSSL buffers, so inside the TLS event handler, Redis checks if SSL_pending is greater than zero, and if so adds the connection to a list for further processing.
However, if io-threads-do-reads is enabled, when read events come in from the event loop, we still trigger the TLS evenet handler, which in turn triggers our read callback. However, the read callback postpones the read to be done by the IO threads in beforeSleep. No SSL_read call actually occurs at this time. The read callback finishes, and the TLS event handler checks SSL_pending, even though no data was actually read in the callback. We are checking if the last SSL_read had data pending in the SSL buffers, and the last SSL_read actually happened after a prior read event, not the current one. Essentially, our processing of the connections with SSL_pending data occurs one event after it should.
This causes Redis to fail to read any data beyond our large multibulk item until the next read event comes in. However, a crash occurs if the next read event is caused by a connection close. In this case, we go through the following sequence of events:
- Connection close event causes FD to become readable
- Event loop triggers TLS event handler, which tirggers the read callback, adding the item to our pending reads for IO threading
- TLS event handler checks SSL_pending, finds it is >0, so adds the connection to our list of TLS connections for further processing
- In BeforeSleep, we call handleClientsWithPendingReadsUsingThreads, where we have an IO thread perform the actual read
- In handleClientsWithPendingReadsUsingThreads, an IO thread reads from the connection and gets the remaining bytes from SSL_pending, but does not absorb the connection close event
- In BeforeSleep, after handleClientsWithPendingReadsUsingThreads, we call connTypeProcessData which triggers processing of the connections with pending SSL buffers.
- For each connection with a pending SSL buffer, we trigger a readable event. Again, since
io-threads-do-readsis enabled, this simply adds it to our IO threads pending read list for processing. - At the very end of the TLS handle event call, we again check if
SSL_pending() > 0. Since there is no SSL_pending bytes anymore (they were read by handleClientsWithPendingReadsUsingThreads) we remove the connection from our SSL pending list. - In BeforeSleep, after connTypeProcessData, we attempt to skip the event loop if there is SSL pending data. However, this doesn't skip the event loop since there is no pending data anymore
- On the next event loop cycle, we still see that the FD is readable from the connection close event, so trigger another read event
- This read event double adds the connection to the IO threads pending read list
- When we go to handleClientsWithPendingReadsUsingThreads, the connection is processed by two IO threads (could cause race conditions) and is marked to be freed asynchronously when the read detects the connection closed
- At the very end of the handleClientsWithPendingReadsUsingThreads function, we loop over our list, and call beforeNextClient on each client object, which attempts to free clients marked to be freed asynchronously. Since we have two client objects in the list, we double free the client, causing the assertion to trigger
Note that this can also cause a host of other issues (race conditions, memory issues from double freeing memory).
Comment From: madolson
My understanding is that TLS is not actually supposed to function with IO threading and reads. We should probably disallow the configuration until we actually have it sorted. @yossigo thoughts?
Comment From: yossigo
@madolson Yes, we should have done that in the past. But I'm not sure what would make more sense, consider it a bugfix and push it to 7.2.x or hold back.
Comment From: ShooterIT
Fixed in https://github.com/redis/redis/pull/13695