Redis [QUESTION] redundant memcpy at "src/rax.c:939"

the source code of redis-server version 7.0.10 here is as follows:

raxNode **raxFindParentLink(raxNode *parent, raxNode *child) {
    raxNode **cp = raxNodeFirstChildPtr(parent);
    raxNode *c;
    while(1) {
        memcpy(&c,cp,sizeof(c));
        if (c == child) break;
        cp++;
    }
    return cp;
}

I noticed the writer choose to use memcpy load the content pointed by cp and compare it with child. However, this will be removed by gcc-12.1 with default optimization (O2), the corresponding disassembly is:

/home/redis-stable/src/rax.c:939
  4faff6:   48 3b 30                cmp    (%rax),%rsi    #here the rax keep the value of `cp`
  4faff9:   74 0e                   je     4fb009 <raxFindParentLink+0x29>
  4faffb:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
/home/redis-stable/src/rax.c:940
  4fb000:   48 83 c0 08             add    $0x8,%rax
/home/redis-stable/src/rax.c:939
  4fb004:   48 3b 30                cmp    (%rax),%rsi
  4fb007:   75 f7                   jne    4fb000 <raxFindParentLink+0x20>

so the executable behave as the corresponding source code is "*cp" instead of "c". And i'm not sure why a memcpy is need here.

Comment From: zuiderkwast

Memcpy the size of a pointer is indeed strange. I agree. It is equivalent to c = *cp which is simpler and easier to read.

Do you want to open a PR?

Comment From: Absoler

yes, I'd like to, it's here https://github.com/redis/redis/pull/12720

Comment From: sundb

@zuiderkwast Because the earliest versions of rax did not do size(void*) byte alignment for raxNode.

/* Return the pointer to the first child pointer. */
#define raxNodeFirstChildPtr(n) ((raxNode**)((n)->data+(n)->size))

Comment From: zuiderkwast

@sundb You are very smart. :thinking: :bulb:

But now with raxPadding we can be sure it is always pointer aligned?

Comment From: sundb

@zuiderkwast Yes, it's safe now.

Comment From: enjoy-binbin

see https://github.com/redis/redis/pull/12720#issuecomment-1793739286, closing

Comment From: oranagra

now i see the work he did in https://github.com/antirez/rax/commits/master but still, there are a ton of memcpy operations left in the code, which makes me wonder if all of them are now excessive, or only some. considering this, and the risk (we can't easily test it), i don't like to proceed with this PR unless someone can test this on a vulnerable platform, and eliminate all the excessive memcpy, not just one. the simplistic statement that initially led to the opening of this PR is overlooking this problem, and we can't proceed without fully addressing it and properly testing it.

Comment From: zuiderkwast

That's true. It's risky to merge it without a CI job on ARM or some other platform that doesn't support unaligned reads.

There is a myth that aligned reads are much faster than unaligned reads even on platforms that support unaligned reads. Maybe that's why the padding was introduced. But unaligned reads are just as fast as aligned reads on modern CPUs like core i7 according to Daniel Lemire: https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

Comment From: zuiderkwast

My colleague @bjosv quickly added an ARM CI job (using qemu) to his fork of another, updated, rax fork, and with this patch removing memcpy on top: https://github.com/bjosv/rax/tree/qemu-in-ci. The rax fuzz test passes.

Here is assembly for raxFindParentLink and raxFindParentLink_NEW (with the patch) for ARM: https://godbolt.org/z/8on3bah85

We can see there are two lines that differ. We get rid if an instruction flagged as unaligned: ldr r3, [r3] @ unaligned. So it seems to be able to do unaligned read but maybe only with this annotation (or whatever it is).

There is another compiler option that affects this: -mno-unaligned-access. With this, there's a call to memcpy instead of an @ unaligned read.

Anyway, I suppose we need CI coverage in redis and remove the other memcpy calls as well, for a complete work. An even then, I don't think there will be any measurable performance gain.

Comment From: oranagra

are you sure this CI is able to detect that issue? the thought is that if the memcpy isn't needed, and indeed all the memory padding is added correctly it'll pass. but i'm not certain ARMv7 build would be able to detect it. i.e. the question is if the padding is done correctly and there's no misalignment requiring a memcpy.

IIRC ARMv7 does support unaligned access, and we need ARMv5 to test it. or it actually be the core version we use (ARM11) and not the instruction set. so to check if this test is valid, we should add some code that does an unaligned access and validate that it fails.

i'm not certain if that annotation causes any performance overheads (when the memory is aligned)

Comment From: zuiderkwast

Actually I'm not sure if that test can prove anything. Maybe if we add an assert to check that the pointer is aligned, then run the full Redis CI, that would be stronger evidence.

@bjosv and @Absoler, if you are motivated to push this forwards, then I think you need to show that there is performance to gain here. Otherwise, I think we can spend our time on something more important instead. All modern platforms likely to run Redis (amd64, aarch64, etc.) seem to support unaligned reads and can optimize the memcpy into a simple read.

Comment From: zuiderkwast

I believe the QUESTION has been answered, but to summarize: Yes, the memcpy can probably be replaced by a regular memory access, but we're not certain the address is aligned and that may matter for some platforms. See also #12720. Feel free to re-open the issue if you're motivated in working on it.