Describe the bug
Blocking commands such as BRPOP can be used in conjunction with a timeout argument. They are supposed to unblock when the timeout is reached, if not earlier.
Such blocking commands do not unblock in time when the following two conditions are met:
* Constant high volume of new connections. E.g. clients do not use long-lived connections, but rather use a new connection for every single request.
* At any given moment, the number of concurrent established connections are very low compared to the number of new connections. E.g. the connected_clients has a low value while total_connections_received grows like crazy.
To reproduce
Run the following redis-benchmark command to manufacture a high volume of new connections but a low number of concurrent connections against your redis server (primary role):
./redis-benchmark -h nanya-test-brpop.hdo23d.0001.internal.use1.cache.amazonaws.com -t set -n 1000000 -l -r 100000000 -c 100 -k 0 --threads 2 -P 3
In a separate terminal, run the following brpop command observer it takes abnormal amount of time to unblock:
[ny0312@ip ~]$ time redis-cli brpop queue 2
(nil)
real 0m32.288s
user 0m0.001s
sys 0m0.000s
[ny0312@ip ~]$ time redis-cli brpop queue 2
^[[O(nil)
real 0m37.500s
user 0m0.001s
sys 0m0.000s
[ny0312@ip ~]$ time redis-cli brpop queue 2
(nil)
real 0m23.083s
user 0m0.001s
sys 0m0.000s
Expected behavior
[ny0312@ip ~]$ time redis-cli brpop queue 2
(nil)
real 0m2.032s
user 0m0.001s
sys 0m0.001s
Additional information
Upon investigation, I believe the root cause is an unnecessary list rotation that Redis does to the server.clients list before checking client timeouts.
Specifically, in clientsCron, Redis always rotates the current tail of the list to the head, and then check the head client for timeout. The number of clients that Redis checks in each clientsCron run is proportional to the number of current connections. When there are constant high volume of new connections, Redis end up always checking the newly accepted connections at the tail of the server.clients list, and because the number of current connections is low, the older connections end up being starved and never looked at. As result, blocking commands take extra long time to be unblocked.
Comment From: ny0312
This seems to have been fixed since 6.0.0 by the combination of these 2 commit:
- https://github.com/redis/redis/commit/7add0f249078aa11cdeb11ec7535519315f0369e
- https://github.com/redis/redis/commit/aa9d92d94a7f18f768e25bdbb94179b32e7b9e13
Resolving.