I have a Redis cluster with 32 instances: 16 primaries and 16 replicas.

I am seeing poor performance from the CLUSTER SLOTS command. For example, a node with zero slots assigned can process only 800 CLUSTER SLOTS commands per second, at which point that Redis instance uses 100% CPU. The same node can process 10,000 GET, SET, etc. requests per second at 20% CPU usage.
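As a rough cross-check of these figures, the per-command CPU cost can be estimated by dividing the CPU fraction by the throughput. This is only a sketch: it assumes both rates are per second and that CPU usage scales linearly with request rate, and the function name is illustrative.

```python
def cpu_seconds_per_command(cpu_fraction: float, commands_per_second: float) -> float:
    """CPU time (in core-seconds) spent on one command, assuming linear scaling."""
    return cpu_fraction / commands_per_second


# Figures from the report: 800 CLUSTER SLOTS/s at 100% CPU,
# 10,000 GET/SET-style requests/s at 20% CPU.
cluster_slots_cost = cpu_seconds_per_command(1.00, 800)
get_set_cost = cpu_seconds_per_command(0.20, 10_000)
cost_ratio = cluster_slots_cost / get_set_cost

print(f"CLUSTER SLOTS: {cluster_slots_cost * 1e6:.0f} us per command")
print(f"GET/SET:       {get_set_cost * 1e6:.0f} us per command")
print(f"ratio:         {cost_ratio:.1f}x")
```

Under these assumptions a single CLUSTER SLOTS call costs roughly 62 times as much CPU as a typical GET or SET on this node.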

The issue is that each client sends requests as follows:

1. Connect to one of the seed nodes and issue the CLUSTER SLOTS command
2. Connect to the specific nodes storing the relevant keys and run GET, SET, etc. commands
3. Close the connection

Because the CLUSTER SLOTS command is slow, I get frequent timeouts.

If I use all 32 nodes as seed nodes, we see a strange imbalance: 2-3 nodes still use 100% CPU because of the slow CLUSTER SLOTS command.

We tested it with:

* Redis 4.0.10
* Redis 5.0.1

The output of `CLUSTER NODES`:
```
d453613ed1f140733d2d2500545c11dbb15f783e 10.0.0.68:6379@16379 slave 2dbdb78cd270c3e886bcc0a202864560fcb14893 0 1542301494000 87 connected
ad9127df756c16065b83cc7bc745d74346d82ceb 10.0.0.76:6379@16379 master - 0 1542301493000 93 connected 9830-10375 11468-12013
e3ede6d02901a9ef9824199aeaeaf69a3bd82808 10.0.0.77:6379@16379 slave 2ca6046a1f5cdc92739a08744bc19079702eb118 0 1542301495000 64 connected
b597247610173bc1cb80e839460fcbddccfc692f 10.0.0.39:6379@16379 slave 9b2e827e7698e5c4567b89d9c468959f3aeb18fb 0 1542301492000 79 connected
1019214a95bd133c72b7adb73a2f945d7ddf3c45 10.0.0.64:6379@16379 master - 0 1542301496000 81 connected 7100-8191
937cfe1b100d6642be52cb5612fce264fbe586d7 10.0.0.65:6380@16380 slave bc89edc6eee2761f351617045e92161ff7b8c99e 0 1542301495000 95 connected
a1d91c2e2e5995e1bae815f135d0e5d735d6beba 10.0.0.37:6379@16379 slave 5c5595fadf938b1ef161728afea578504eebc599 0 1542301494098 83 connected
4386f4df4d8a19ea2d37921c01fb584826c66167 10.0.0.80:6379@16379 slave 4415d5c72bbab134141cda069b09e69f3a271753 0 1542301496285 65 connected
9b2e827e7698e5c4567b89d9c468959f3aeb18fb 10.0.0.34:6379@16379 master - 0 1542301494000 79 connected 12014-13106
2cc9907b221e71e4c08854f5afd04a1a9427142c 10.0.0.65:6379@16379 slave f01c2843c3d1f2b36581da2d6d74816b42c1bec3 0 1542301491000 96 connected
4f6068bf03c332bba077629c54b0f552a29e6da0 10.0.0.71:6379@16379 slave 9d55ea97eefbd48e9292de963c197d083ec2a05b 0 1542301495590 82 connected
79e6655b06c58797a714b6c88c52475fc22d9ae6 10.0.0.33:6379@16379 master - 0 1542301492000 85 connected 8738-9829
9d55ea97eefbd48e9292de963c197d083ec2a05b 10.0.0.66:6379@16379 master - 0 1542301492000 82 connected 13653-14744
311edeaabc9a2bb9dfdee9a51f3873329b37cf0d 10.0.0.79:6379@16379 slave d88d9205d6072a70bca0cf936d1e82b406585f6e 0 1542301497282 68 connected
5c5595fadf938b1ef161728afea578504eebc599 10.0.0.32:6379@16379 master - 0 1542301496884 83 connected 5461-6553
ad2baa5360b9a9be9ee5d6ed7f656f86c1f562da 10.0.0.69:6379@16379 slave 1019214a95bd133c72b7adb73a2f945d7ddf3c45 0 1542301492000 81 connected
2dbdb78cd270c3e886bcc0a202864560fcb14893 10.0.0.63:6379@16379 master - 0 1542301494000 87 connected 3823-4914
ae59a39f3017b9f0ee662a7da74f56dcb4d2f8fe 10.0.0.31:6379@16379 myself,master - 0 1542301491000 84 connected 2184-3276
170bcdd95dced89939733176a44be50e504e3e56 10.0.0.67:6379@16379 slave a55db89540bd6c7cfd581520d408578be341ef35 0 1542301497586 86 connected
76f133bdc60cde33065d09e2ffaf2460370dfd9e 10.0.0.73:6379@16379 master - 0 1542301493601 66 connected 6554-7099 8192-8737
bc89edc6eee2761f351617045e92161ff7b8c99e 10.0.0.70:6380@16380 master - 0 1542301495291 95 connected 10876-11467
2e6508f8cd195bdc1513dad59c30968655c7fd34 10.0.0.38:6379@16379 slave 79e6655b06c58797a714b6c88c52475fc22d9ae6 0 1542301496085 85 connected
0ff3d910cd10c966fc4a8126ce8ea5ea677caf99 10.0.0.35:6379@16379 master - 0 1542301493000 80 connected 15291-16383
a55db89540bd6c7cfd581520d408578be341ef35 10.0.0.62:6379@16379 master - 0 1542301494000 86 connected 546-1637
2ca6046a1f5cdc92739a08744bc19079702eb118 10.0.0.72:6379@16379 master - 0 1542301494000 64 connected 0-545 1638-2183
3db25bd21a26888bf7cf14b89a1cd009ed6734aa 10.0.0.81:6379@16379 slave ad9127df756c16065b83cc7bc745d74346d82ceb 0 1542301494000 93 connected
4415d5c72bbab134141cda069b09e69f3a271753 10.0.0.75:6379@16379 master - 0 1542301496000 65 connected 3277-3822 4915-5460
d88d9205d6072a70bca0cf936d1e82b406585f6e 10.0.0.74:6379@16379 master - 0 1542301496000 68 connected 13107-13652 14745-15290
00f70416cc7dc777cf6e6e6e42e955629dcaa1e1 10.0.0.40:6379@16379 slave 0ff3d910cd10c966fc4a8126ce8ea5ea677caf99 0 1542301493601 80 connected
7c94ac26286e64522cec034663dac52db2b2c69c 10.0.0.78:6379@16379 slave 76f133bdc60cde33065d09e2ffaf2460370dfd9e 0 1542301495000 66 connected
f01c2843c3d1f2b36581da2d6d74816b42c1bec3 10.0.0.70:6379@16379 master - 0 1542301494298 96 connected 10376-10875
3211972a45e118743531eaad374ab192bddc2342 10.0.0.36:6379@16379 slave ae59a39f3017b9f0ee662a7da74f56dcb4d2f8fe 0 1542301494600 84 connected
```

The output of `CLUSTER SLOTS`:
```
 1) 1) (integer) 9830
    2) (integer) 10375
    3) 1) "10.0.0.76"
       2) (integer) 6379
       3) "ad9127df756c16065b83cc7bc745d74346d82ceb"
    4) 1) "10.0.0.81"
       2) (integer) 6379
       3) "3db25bd21a26888bf7cf14b89a1cd009ed6734aa"
 2) 1) (integer) 11468
    2) (integer) 12013
    3) 1) "10.0.0.76"
       2) (integer) 6379
       3) "ad9127df756c16065b83cc7bc745d74346d82ceb"
    4) 1) "10.0.0.81"
       2) (integer) 6379
       3) "3db25bd21a26888bf7cf14b89a1cd009ed6734aa"
 3) 1) (integer) 7100
    2) (integer) 8191
    3) 1) "10.0.0.64"
       2) (integer) 6379
       3) "1019214a95bd133c72b7adb73a2f945d7ddf3c45"
    4) 1) "10.0.0.69"
       2) (integer) 6379
       3) "ad2baa5360b9a9be9ee5d6ed7f656f86c1f562da"
 4) 1) (integer) 12014
    2) (integer) 13106
    3) 1) "10.0.0.34"
       2) (integer) 6379
       3) "9b2e827e7698e5c4567b89d9c468959f3aeb18fb"
    4) 1) "10.0.0.39"
       2) (integer) 6379
       3) "b597247610173bc1cb80e839460fcbddccfc692f"
 5) 1) (integer) 8738
    2) (integer) 9829
    3) 1) "10.0.0.33"
       2) (integer) 6379
       3) "79e6655b06c58797a714b6c88c52475fc22d9ae6"
    4) 1) "10.0.0.38"
       2) (integer) 6379
       3) "2e6508f8cd195bdc1513dad59c30968655c7fd34"
 6) 1) (integer) 13653
    2) (integer) 14744
    3) 1) "10.0.0.66"
       2) (integer) 6379
       3) "9d55ea97eefbd48e9292de963c197d083ec2a05b"
    4) 1) "10.0.0.71"
       2) (integer) 6379
       3) "4f6068bf03c332bba077629c54b0f552a29e6da0"
 7) 1) (integer) 5461
    2) (integer) 6553
    3) 1) "10.0.0.32"
       2) (integer) 6379
       3) "5c5595fadf938b1ef161728afea578504eebc599"
    4) 1) "10.0.0.37"
       2) (integer) 6379
       3) "a1d91c2e2e5995e1bae815f135d0e5d735d6beba"
 8) 1) (integer) 3823
    2) (integer) 4914
    3) 1) "10.0.0.63"
       2) (integer) 6379
       3) "2dbdb78cd270c3e886bcc0a202864560fcb14893"
    4) 1) "10.0.0.68"
       2) (integer) 6379
       3) "d453613ed1f140733d2d2500545c11dbb15f783e"
 9) 1) (integer) 2184
    2) (integer) 3276
    3) 1) "10.0.0.31"
       2) (integer) 6379
       3) "ae59a39f3017b9f0ee662a7da74f56dcb4d2f8fe"
    4) 1) "10.0.0.36"
       2) (integer) 6379
       3) "3211972a45e118743531eaad374ab192bddc2342"
10) 1) (integer) 6554
    2) (integer) 7099
    3) 1) "10.0.0.73"
       2) (integer) 6379
       3) "76f133bdc60cde33065d09e2ffaf2460370dfd9e"
    4) 1) "10.0.0.78"
       2) (integer) 6379
       3) "7c94ac26286e64522cec034663dac52db2b2c69c"
11) 1) (integer) 8192
    2) (integer) 8737
    3) 1) "10.0.0.73"
       2) (integer) 6379
       3) "76f133bdc60cde33065d09e2ffaf2460370dfd9e"
    4) 1) "10.0.0.78"
       2) (integer) 6379
       3) "7c94ac26286e64522cec034663dac52db2b2c69c"
12) 1) (integer) 10876
    2) (integer) 11467
    3) 1) "10.0.0.70"
       2) (integer) 6380
       3) "bc89edc6eee2761f351617045e92161ff7b8c99e"
    4) 1) "10.0.0.65"
       2) (integer) 6380
       3) "937cfe1b100d6642be52cb5612fce264fbe586d7"
13) 1) (integer) 15291
    2) (integer) 16383
    3) 1) "10.0.0.35"
       2) (integer) 6379
       3) "0ff3d910cd10c966fc4a8126ce8ea5ea677caf99"
    4) 1) "10.0.0.40"
       2) (integer) 6379
       3) "00f70416cc7dc777cf6e6e6e42e955629dcaa1e1"
14) 1) (integer) 546
    2) (integer) 1637
    3) 1) "10.0.0.62"
       2) (integer) 6379
       3) "a55db89540bd6c7cfd581520d408578be341ef35"
    4) 1) "10.0.0.67"
       2) (integer) 6379
       3) "170bcdd95dced89939733176a44be50e504e3e56"
15) 1) (integer) 0
    2) (integer) 545
    3) 1) "10.0.0.72"
       2) (integer) 6379
       3) "2ca6046a1f5cdc92739a08744bc19079702eb118"
    4) 1) "10.0.0.77"
       2) (integer) 6379
       3) "e3ede6d02901a9ef9824199aeaeaf69a3bd82808"
16) 1) (integer) 1638
    2) (integer) 2183
    3) 1) "10.0.0.72"
       2) (integer) 6379
       3) "2ca6046a1f5cdc92739a08744bc19079702eb118"
    4) 1) "10.0.0.77"
       2) (integer) 6379
       3) "e3ede6d02901a9ef9824199aeaeaf69a3bd82808"
17) 1) (integer) 3277
    2) (integer) 3822
    3) 1) "10.0.0.75"
       2) (integer) 6379
       3) "4415d5c72bbab134141cda069b09e69f3a271753"
    4) 1) "10.0.0.80"
       2) (integer) 6379
       3) "4386f4df4d8a19ea2d37921c01fb584826c66167"
18) 1) (integer) 4915
    2) (integer) 5460
    3) 1) "10.0.0.75"
       2) (integer) 6379
       3) "4415d5c72bbab134141cda069b09e69f3a271753"
    4) 1) "10.0.0.80"
       2) (integer) 6379
       3) "4386f4df4d8a19ea2d37921c01fb584826c66167"
19) 1) (integer) 13107
    2) (integer) 13652
    3) 1) "10.0.0.74"
       2) (integer) 6379
       3) "d88d9205d6072a70bca0cf936d1e82b406585f6e"
    4) 1) "10.0.0.79"
       2) (integer) 6379
       3) "311edeaabc9a2bb9dfdee9a51f3873329b37cf0d"
20) 1) (integer) 14745
    2) (integer) 15290
    3) 1) "10.0.0.74"
       2) (integer) 6379
       3) "d88d9205d6072a70bca0cf936d1e82b406585f6e"
    4) 1) "10.0.0.79"
       2) (integer) 6379
       3) "311edeaabc9a2bb9dfdee9a51f3873329b37cf0d"
21) 1) (integer) 10376
    2) (integer) 10875
    3) 1) "10.0.0.70"
       2) (integer) 6379
       3) "f01c2843c3d1f2b36581da2d6d74816b42c1bec3"
    4) 1) "10.0.0.65"
       2) (integer) 6379
       3) "2cc9907b221e71e4c08854f5afd04a1a9427142c"
```

Comment From: antirez

Thanks @atlantos, I'll look into it. In general, the idea is that CLUSTER SLOTS should be called rarely, for instance after the connection is established, or immediately after receiving a cluster redirection, in order to get a fresh configuration. Still, it would be great to have a much faster command. I'll perform benchmarks and report back on what we can do.
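The "call it rarely" pattern described here can be sketched as a client-side slot cache that fetches the map once per connection and again only after a redirection. This is a minimal illustration, not the behaviour of any particular client library: `SlotCache`, `fetch_slots` and `fake_fetch` are hypothetical names, and the fetch function stands in for a real CLUSTER SLOTS call.

```python
from typing import Callable, Dict, List, Tuple

Node = Tuple[str, int]
# One CLUSTER SLOTS entry, reduced to (start, end, primary host, primary port).
SlotRange = Tuple[int, int, str, int]


class SlotCache:
    """Caches the slot-to-node map and refetches it only when asked to."""

    def __init__(self, fetch_slots: Callable[[], List[SlotRange]]) -> None:
        self._fetch_slots = fetch_slots  # hypothetical: wraps a CLUSTER SLOTS call
        self._slot_to_node: Dict[int, Node] = {}
        self.fetch_count = 0
        self.refresh()  # one fetch when the connection is established

    def refresh(self) -> None:
        """Rebuild the map from a fresh CLUSTER SLOTS reply."""
        self.fetch_count += 1
        table: Dict[int, Node] = {}
        for start, end, host, port in self._fetch_slots():
            for slot in range(start, end + 1):
                table[slot] = (host, port)
        self._slot_to_node = table

    def node_for_slot(self, slot: int) -> Node:
        return self._slot_to_node[slot]

    def on_moved(self) -> None:
        """Call after a MOVED redirection: the cached map is stale."""
        self.refresh()


# Stand-in for a real CLUSTER SLOTS call, using two ranges from the output above.
def fake_fetch() -> List[SlotRange]:
    return [(0, 545, "10.0.0.72", 6379), (546, 1637, "10.0.0.62", 6379)]


cache = SlotCache(fake_fetch)
for _ in range(1000):  # many lookups, no extra CLUSTER SLOTS calls
    cache.node_for_slot(100)
print(cache.node_for_slot(600), cache.fetch_count)
```

With a cache like this, the number of CLUSTER SLOTS calls depends on how often connections are opened and redirections occur, not on the request rate.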

Comment From: antirez

P.S. In the worst case we could consider caching the reply and invalidating it when the configuration changes. This requires a lot of care to avoid serving stale state, but I'm confident that the invalidation can be performed in the lower-level functions, without hooking into every place where the config is changed.
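The caching-plus-invalidation idea can be illustrated with a small sketch. It is written in Python for brevity and is not how Redis implements anything: `CachedReply`, `slot_owner` and `set_slot_owner` are made-up names, and the point is only that invalidating inside one low-level mutation function covers every higher-level caller.

```python
from typing import Callable, Optional


class CachedReply:
    """Memoizes an expensive reply until the underlying config changes."""

    def __init__(self, build_reply: Callable[[], bytes]) -> None:
        self._build_reply = build_reply
        self._cached: Optional[bytes] = None
        self.builds = 0

    def get(self) -> bytes:
        if self._cached is None:  # first call, or invalidated since
            self._cached = self._build_reply()
            self.builds += 1
        return self._cached

    def invalidate(self) -> None:
        """Call from the low-level function(s) that mutate slot ownership."""
        self._cached = None


# Hypothetical cluster state: slot range -> owning node.
slot_owner = {(0, 545): "10.0.0.72:6379"}


def build_cluster_slots_reply() -> bytes:
    return repr(sorted(slot_owner.items())).encode()


reply_cache = CachedReply(build_cluster_slots_reply)


def set_slot_owner(slot_range, node):
    """Single low-level mutation point; invalidating here covers all callers."""
    slot_owner[slot_range] = node
    reply_cache.invalidate()


first = reply_cache.get()
second = reply_cache.get()  # served from cache, no rebuild
set_slot_owner((0, 545), "10.0.0.77:6379")
third = reply_cache.get()  # rebuilt after invalidation
print(reply_cache.builds, first == second, first != third)
```

The risk mentioned in the comment shows up here as any code path that changes `slot_owner` without going through `set_slot_owner`; such a path would leave a stale reply in the cache.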

Comment From: antirez

Pinging @soveran, @soloestoy and @artix75 about that for possible direct interest on this issue, and moreover pinging @oranagra and @yossigo since Redis Labs Redis flavor also supports the Cluster protocol AFAIK, so they may have a similar issue and/or solutions.

Comment From: antirez

Moreover @mrniko potentially already observed the same problem.

Comment From: atlantos

For your information, it seems the imbalance was caused by the client driver and was resolved by using random Redis nodes as seed nodes on the client side.

We have 7 client nodes connecting to the Redis cluster using PHP Redis as:

```
$redis_nodes = array('redis-cluster1', 'redis-cluster2', /* ... */ 'redis-cluster32');
$obj_cluster = new RedisCluster(NULL, $redis_nodes);
```

The PHP Redis extension has some persistence and connects to a specific node to run CLUSTER SLOTS. For example:

* client1 always connects to redis-cluster10 to run CLUSTER SLOTS
* client2 always connects to redis-cluster18 to run CLUSTER SLOTS

and so on.

Because the number of clients is smaller than the number of Redis servers, some nodes received a lot of CLUSTER SLOTS requests and ran at 50%-100% CPU, while other nodes received no CLUSTER SLOTS requests and ran at 15% CPU. The workaround was to use 4 random nodes as seeds:

```
// array_rand() returns keys, so map the 4 picked keys back to node names
$obj_cluster = new RedisCluster(NULL, array_values(array_intersect_key($redis_nodes, array_flip(array_rand($redis_nodes, 4)))));
```

After this change CPU usage on all Redis nodes became ~ 25%.
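The effect of randomizing seeds can be illustrated with a toy simulation. It is a sketch under stated assumptions, not a model of the PHP extension: it assumes a pinned client always sends CLUSTER SLOTS to the same node, that a randomized client queries the first of its 4 freshly drawn seeds, and the request counts are arbitrary.

```python
import random
from collections import Counter

NODES = [f"redis-cluster{i}" for i in range(1, 33)]
CLIENTS = 7
REQUESTS_PER_CLIENT = 1000

rng = random.Random(0)  # fixed seed so the run is repeatable

# Pinned: each client keeps using one node for CLUSTER SLOTS (assumed behaviour).
pinned_node = [rng.choice(NODES) for _ in range(CLIENTS)]
pinned_load = Counter()
for client in range(CLIENTS):
    pinned_load[pinned_node[client]] += REQUESTS_PER_CLIENT

# Randomized: every new connection draws 4 fresh seeds and queries the first.
random_load = Counter()
for _ in range(CLIENTS * REQUESTS_PER_CLIENT):
    seeds = rng.sample(NODES, 4)
    random_load[seeds[0]] += 1

print("nodes hit when pinned:    ", len(pinned_load))
print("nodes hit when randomized:", len(random_load))
print("busiest node, pinned:     ", max(pinned_load.values()))
print("busiest node, randomized: ", max(random_load.values()))
```

With 7 pinned clients at most 7 of the 32 nodes ever see CLUSTER SLOTS traffic, which matches the imbalance described above; drawing seeds at random spreads the same traffic over many more nodes.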

Comment From: antirez

Thanks for the update. Still what happened to you looks like a clue about CLUSTER SLOTS being a potential target for some optimization. I think there is room to make it much faster.

Comment From: atlantos

Yes, there is definitely room for improvement. Our 32-node cluster now receives 250,000 requests overall and 1,800 CLUSTER SLOTS requests.

CPU usage by the CLUSTER SLOTS command is ~ 10% per node, and CPU usage by all other commands is ~ 15%, for a total of ~ 25%. This means that a CLUSTER SLOTS command uses (250000/15)/(1800/10) ≈ 92.6 times more CPU than a regular GET/SET/etc. command.
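The estimate can be reproduced step by step. The sketch follows the comment's own approximation of using the overall request count for the non-CLUSTER SLOTS commands; the variable names are illustrative.

```python
total_requests = 250_000        # all requests reaching the cluster
cluster_slots_requests = 1_800  # CLUSTER SLOTS requests among them
other_cpu_percent = 15          # CPU used by all other commands
cluster_slots_cpu_percent = 10  # CPU used by CLUSTER SLOTS

# Requests served per percentage point of CPU, for each command class.
requests_per_cpu_point_other = total_requests / other_cpu_percent
requests_per_cpu_point_slots = cluster_slots_requests / cluster_slots_cpu_percent

# How many times more CPU one CLUSTER SLOTS call costs than one regular command.
relative_cost = requests_per_cpu_point_other / requests_per_cpu_point_slots

print(f"{relative_cost:.1f}")
```

The ratio comes out at about 92.6, in line with the figure quoted above.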

Comment From: ChenGuanqun

The Redis CLUSTER NODES command also has the same performance issue: https://github.com/antirez/redis/issues/6534.

We have tried to solve this from the client side (randomizing which cluster nodes the command is sent to): https://jira.spring.io/projects/DATAREDIS/issues/DATAREDIS-890?filter=allissues

Looking forward to this being improved on the server side. Thanks @antirez

Comment From: filipecosta90

Hi there @atlantos. This PR should address the issue: https://github.com/redis/redis/pull/8541, as validated by this comment: https://github.com/redis/redis/pull/8541#issuecomment-791896456

Comment From: oranagra

closing as solved. feel free to reopen or respond if needed.