We've been using Redis 6.0.10 for different workloads on our platform (message queue, cache, KV store). We recently turned on io-threads. We are also using RediSearch 1.6.14.
We're using 2 x 8-core machines in a master-slave setup, with the following CPU list allocation:
server_cpulist 0,1,2,3
bio_cpulist 4,5,6,7
aof_rewrite_cpulist 4,5,6,7
bgsave_cpulist 4,5,6,7
and have io-threads set to 2:
io-threads 2
io-threads-do-reads no
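For reference, this is roughly how we check where those masks actually land per thread on a running instance (a sketch using standard Linux tooling; the pgrep pattern assumes the binary is named redis-server):
pid=$(pgrep -of redis-server)
# taskset -cp prints the allowed CPU list for each thread of the process
for tid in $(ls "/proc/$pid/task"); do
    taskset -cp "$tid"
done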
Most of the time things operate perfectly and we see better core utilization, but under certain peak workloads on the master (a mixture of reads and writes), we suddenly start seeing a constant increase in the master's CPU utilization.
By constant, I mean that ops simply seem to consume more CPU: when no I/O is being performed on the Redis server, CPU utilization drops to 0%, but once we connect our application, it constantly reaches 100%. We tried turning our applications on and off one by one, but only after restarting Redis does its CPU utilization drop back to normal (5-30%).
Here's a depiction of our troubleshooting efforts and how they affected the CPU utilization:
This only started happening after we activated io-threads.
So of course, the immediate solution would be to turn io-threads off, but it feels like there's a bug hidden somewhere underneath.
I have no concrete steps to reproduce this issue, as it happens sporadically, once a week or so, at random points throughout the day.
Any advice would be more than welcome...
Comment From: igorwwwwwwwwwwwwwwwwwwww
Would you be able to capture a per-thread pidstat when this occurs?
sudo pidstat -t -p $(pgrep -of bin/redis-server) 1 120
That should help narrow down which threads are using the CPU.
Another useful thing would be to capture a profile with perf and visualise it with flamegraph:
sudo perf record -ag -F 99 -- sleep 60
sudo perf script --header | stackcollapse-perf.pl --kernel | flamegraph.pl --hash --colors=perl > flamegraph.svg
This way we know how the cycles are being spent.
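(The stackcollapse-perf.pl and flamegraph.pl scripts come from the FlameGraph repository; assuming they aren't already on your PATH, something like this makes them available to the pipeline above:)
git clone https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$PWD/FlameGraph"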
Do note that threaded-io does include some busy looping (presumably by design to keep the I/O threads on CPU). So enabling the feature can in fact increase overall CPU utilization of the process, but you gain scalability since some of the socket writes are offloaded.
The goal of the feature is not to reduce overall CPU utilization, it is to reduce CPU utilization of the main thread, since that is the scalability bottleneck.
Comment From: sheinbergon
@igorwwwwwwwwwwwwwwwwwwww thank you for your prompt response. We are well aware of the purpose of this feature. Notice from my issue description that this is more of a quirk: for an unknown reason, CPU utilization drastically increases until we restart the master. The same workload is applied before and after the restart, yet CPU utilization drops to near minimum afterwards, so it's not a matter of abusing or misusing Redis.
We'll try to capture the per-thread/CPU information as you specified and provide it once the issue recurs.
Comment From: sheinbergon
In case anyone ever encounters this behavior: it was due to an OS scheduling problem. Our OS is Debian Buster, running on EC2.
The OS was booted with the isolcpus=0 kernel parameter. This seems to have caused a behavior where both the main redis_server thread and the io_thd threads competed for the same core 0, even though server_cpulist was set to 0,1,2,3. So instead of getting a performance boost from the I/O threads, we saw the opposite effect. Restarts only solved it temporarily, as they gave the process a chance to reassign its threads. Removing this parameter and tinkering with the server_cpulist mask has made the issue go away.
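For anyone who wants to check for the same condition, something along these lines should reveal it (a sketch assuming a Linux host, a reasonably recent kernel, and the default redis-server binary name):
cat /proc/cmdline                        # check whether isolcpus= appears among the boot parameters
cat /sys/devices/system/cpu/isolated     # CPUs currently isolated from the scheduler
# psr is the core each thread last ran on; in our case the main thread and io_thd_* all showed 0
ps -L -o tid,comm,psr -p "$(pgrep -of redis-server)"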