While reading Richard Sites' (@dicksites) "Understanding Software Dynamics", I noticed that the chapter on locks mentions using a PAUSE instruction as part of a spin lock.

The reasons it cites for doing so (a minimal sketch follows the list):

  • For a hyperthreaded core, it allows the other hyperthread(s) to use more instruction-issue cycles and more L1 data and instruction cache access cycles.
  • It may allow the CPU core to save some power by spinning more slowly.
  • It slows down the rate of issuing speculative instructions, which on some implementations can get 100 or more instructions ahead of the speculative branch that will eventually fail when the lock value changes from 1 to 0. Excessive speculative instructions can take dozens of cycles to flush out of the execution units, slowing the exit from the loop.
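
For concreteness, here is a minimal test-and-test-and-set spinlock sketch showing where PAUSE typically sits. This is illustrative only, not Redis code; it uses C11 atomics and the x86 _mm_pause() intrinsic:

    #include <immintrin.h>   /* _mm_pause() on x86 */
    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;

    static void spin_lock(spinlock_t *l) {
        for (;;) {
            /* Try the atomic acquire first. */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
            /* Spin read-only until the lock looks free. PAUSE yields
             * issue slots to the sibling hyperthread and throttles
             * speculation past the loop branch. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                _mm_pause();
        }
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }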

From CPU profiling in our production environment, we know that IO threads spend the majority of their cycles spinning, presumably in this loop:

https://github.com/redis/redis/blob/c0d7226274feb6a519a17a169421295da827a30c/src/networking.c#L4018-L4021

I can provide flamegraphs if needed.

That loop does not use the PAUSE instruction, which suggests it may in fact be hurting hyperthread performance. Do we want to consider adding PAUSE to what is effectively a spinlock?
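
To make the suggestion concrete, here is a hedged sketch of that busy-wait with a pause inserted. The loop shape is assumed from the linked lines, and the getIOPendingCount() signature is assumed to match the existing Redis accessor:

    #include <immintrin.h>                           /* _mm_pause() */

    extern unsigned long getIOPendingCount(int id);  /* assumed signature */

    /* Sketch only: the bounded busy-wait from IOThreadMain with a
     * PAUSE between polls. */
    static void wait_for_pending(int id) {
        for (int j = 0; j < 1000000; j++) {
            if (getIOPendingCount(id) != 0) break;
            _mm_pause();  /* relax the pipeline between polls */
        }
    }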

Note: I did find this post (https://www.reddit.com/r/intel/comments/hogk2n/research_on_the_impact_of_intel_pause_instruction/) suggesting there may be some latency incurred per PAUSE instruction, so we likely want to measure the effect.

Comment From: igorwwwwwwwwwwwwwwwwwwww

For reference, glibc's __pthread_spin_lock does indeed use the pause instruction on x86.

  • https://github.com/bminor/glibc/blob/361d6454c034a920f2c96517c277990d390b9652/nptl/pthread_spin_lock.c#L69
  • https://github.com/bminor/glibc/blob/895ef79e04a953cac1493863bcae29ad85657ee1/sysdeps/x86_64/atomic-machine.h#L412
  • https://stackoverflow.com/questions/7086220/what-does-rep-nop-mean-in-x86-assembly-is-it-the-same-as-the-pause-instru

Chromium's wtf library has a spinlock with pause instructions for various architectures:

  • https://chromium.googlesource.com/chromium/src/third_party/WebKit/Source/wtf/+/823d62cdecdbd5f161634177e130e5ac01eb7b48/SpinLock.cpp

Comment From: igorwwwwwwwwwwwwwwwwwwww

For reference, here is a CPU flamegraph from our production environment showing the IO threads burning a lot of CPU in those loops. This is on Redis 6.2:

[Flamegraph: redis-cache-02-db-gprd 20221028T152237]

Comment From: dicksites

Thank you for productive use of the book and for copying me on this note.

If pause takes about 50 cycles at 2.5 GHz, that is 20 nsec; 150 cycles is 60 nsec. Neither of these hurts latency to any meaningful degree.

However, if you add a pause inside a loop that goes around 1M times (e.g., IOThreadMain) and someone depends on that original loop taking about 5 msec total, the time could increase to as much as 65 msec with a pause inserted (1M iterations at 60 nsec each adds about 60 msec).

But even the original 5 msec loop is a bad design. As the book explains, after about the time of two context switches (about 5-10 usec), it is better to stop looping and block until the value you want changes, using e.g. the futex() system call in Linux or something equivalent in Windows, freeing up the CPU core for something more productive. Almost 40% of the total CPU time could thus be saved. Pay attention to your orders of magnitude. Spinning for a few microseconds makes sense; spinning for a few milliseconds rarely does, and spinning for several tens of milliseconds cannot be justified.
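
As an illustration of that spin-then-block pattern, here is a hedged sketch using the Linux futex() system call directly. The spin bound and function names are assumptions, not Redis code:

    #define _GNU_SOURCE
    #include <immintrin.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Spin briefly (roughly the cost of two context switches), then
     * block in the kernel until *addr changes from `expected`. A waker
     * must pair this with a FUTEX_WAKE on the same address. */
    static void spin_then_wait(atomic_uint *addr, unsigned expected) {
        for (int i = 0; i < 1000; i++) {        /* ~a few usec of spinning */
            if (atomic_load_explicit(addr, memory_order_acquire) != expected)
                return;
            _mm_pause();
        }
        /* Still unchanged: free up the core. FUTEX_WAIT returns
         * immediately if the value already differs from `expected`. */
        while (atomic_load_explicit(addr, memory_order_acquire) == expected)
            syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }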

I also note that the IOThreadMain spin loop uses atomic instructions in getIOPendingCount(). It might impact other threads less if you did a non-atomic read in this loop, and then fell into the atomic read on line 4024. In that case, you might need an explicit volatile declaration for io_threads_pending, or an explicit memory barrier in the loop to prevent the compiler from deleting the loop entirely (see book section 2.5).
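
A hedged sketch of that suggestion follows. The plain-read array here is hypothetical (Redis's actual field is the atomic io_threads_pending mentioned above), and the compiler barrier keeps the compiler from hoisting the load or deleting the loop:

    #include <immintrin.h>

    /* Not actual Redis code: poll with a plain (non-atomic) read and
     * fall into the existing atomic read only after the loop exits. */
    extern unsigned long io_threads_pending_plain[];  /* hypothetical */
    extern unsigned long getIOPendingCount(int id);   /* existing atomic read */

    static unsigned long poll_then_sync(int id) {
        for (int j = 0; j < 1000000; j++) {
            if (io_threads_pending_plain[id] != 0) break;
            __asm__ __volatile__("" ::: "memory");  /* compiler barrier:
                                                       forces the load to
                                                       repeat each iteration */
            _mm_pause();
        }
        return getIOPendingCount(id);  /* single synchronizing read on exit */
    }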

When there is actual work in IOThreadMain, the flame graph indicates about 38 levels of subroutine call to do something trivial at the bottom of the call tree. That seems like an excessive amount of overhead. It perhaps could use some inline annotations or macros instead of so many calls. But I gather from the orange coloring that all but five of these levels are in kernel code. Sigh. Or maybe the kernel code does use inline but the mechanism to produce the flame graph defeats this, with no real performance issue when not producing flamegraphs. Thank you again, /dick

Comment From: madolson

Presumably "yes," we should implement a proper spinlock with pause. This loop was implemented to get the best performance, at the cost of busy-spinning against the other threads. I think we have greater ambitions for improving the IO threading system, though, since right now it doesn't offload that much work.