Describe the bug
The "corrupt payload: fuzzer findings - OOM in dictExpand" test in the integration/corrupt-dump unit exhausts memory on the s390x architecture. The test is intended to verify that Redis does not OOM on this payload (and it does not on most architectures), but on s390x it still OOMs.
To reproduce
% uname -a
Linux zelenka 4.19.0-17-s390x #1 SMP Debian 4.19.194-3 (2021-07-18) s390x GNU/Linux
% ./runtest --clients 1 --verbose --single integration/corrupt-dump --only "corrupt payload: fuzzer findings - OOM in dictExpand"
[...]
[skip]: corrupt payload: fuzzer findings - stream integrity check issue
[skip]: corrupt payload: fuzzer findings - infinite loop
[skip]: corrupt payload: fuzzer findings - hash ziplist too long entry len
[skip]: corrupt payload: OOM in rdbGenericLoadStringObject
[hangs here running the "corrupt payload: fuzzer findings - OOM in dictExpand" test]
Expected behavior
That the test passes, as it does on all other architectures.
Additional information
This test was added as part of an allocation-related overhaul in https://github.com/redis/redis/commit/7ca00d694d44be13a3ff9ff1c96b49222ac9463b. It is unknown whether the changes in this diff introduced the underlying problem (instead of merely introducing a test that exposes it).
This was originally tracked in @Debian here: https://bugs.debian.org/982122
Thanks to Julien Cristau and Adam Barratt for their help nailing down this issue.
Comment From: lamby
@oranagra You may be interested in glancing at this issue, as it is your commit that is the most likely suspect here.
Comment From: lamby
In case it is useful, this page on the Debian wiki summarises the low-level sizeof and endian (etc.) differences between s390x and other architectures.
Comment From: sundb
@lamby Is your test on unstable? It looks like this issue has been fixed in #9321.
Comment From: lamby
Yes, this is reproducible on the unstable branch (currently my HEAD is pointing at 0cf2df84d4b27af4bffd2bf3543838f09e10f874). Thus, I don't think this is fixed by #9321, unfortunately.
Comment From: oranagra
interesting... the test that fails is exposing a problem that always existed, and this PR (redis 6.2) aims to fix it (the trymalloc functions). the other recent PR fixes the same problem on a different code path (not covered by the initial test). other than being big-endian, i don't see anything special in the page you linked about s390x.
@lamby maybe the reason it hangs is that your test platform has an insane amount of memory, so the allocation that fails on other platforms succeeds on this one, and redis continues to execute code that it doesn't on other platforms?
Comment From: oranagra
i.e. that test relies on malloc to fail, returning NULL.
the size it tries to allocate is 170910011579433481.
maybe for some reason the swap or some insane overcommit allows it?
Comment From: tezc
#define DICTHT_SIZE(exp) ((exp) == -1 ? 0 : (unsigned long)1<<(exp))
#define DICTHT_SIZE_MASK(exp) ((exp) == -1 ? 0 : (DICTHT_SIZE(exp))-1)

struct dict {
    .......
    char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */
};

static void _dictReset(dict *d, int htidx)
{
    .....
    d->ht_size_exp[htidx] = -1;
}
@oranagra I don't know about v6.2 but I just took a look at unstable. char is unsigned on s390x (also on arm, ppc etc.). Setting it to -1 will actually set it to 255. When you call DICTHT_SIZE(exp), char(255) will be promoted to int(255). So, DICTHT_SIZE(exp) will expand to something like ((int)255 == (int)-1 ? 0 : (unsigned long)1 << 255). This looks like undefined behavior if I'm not mistaken (the shift count exceeds the width of the operand).
Anyway, on qemu, the above test hangs on both s390x and armv8. I just changed the type to int8_t ht_size_exp[2] and that test passed. Maybe this is the issue for unstable?
Comment From: oranagra
Ohh, thanks @tezc! So this is indeed a new issue in unstable, caused by #9228. It's odd that it only fails that one test. @yoav-steinberg @yossigo FYI: maybe we should attempt to add a qemu test to our daily CI (big-endian and unsigned char).
Comment From: lamby
Anyway, on qemu, the above test hangs on both s390x and armv8. I just changed the type to int8_t ht_size_exp[2] and that test passed. Maybe this is the issue for unstable?
Interestingly, I just tried this on non-qemu s390x system and it still hangs here.
Comment From: tezc
That's bad news, there might be multiple issues then.
I pulled s390x docker from here (please consider security implications if you want to try it) and I see it hangs before the fix and it passes after the fix. (also same behavior for armv8).
Here is a demonstration of the undefined behavior on s390x; gcc generates different results depending on the optimization level:
root@500b6403cee5:/home# cat test.c
#include <stdio.h>
#define DICTHT_SIZE(exp) ((exp) == -1 ? 0 : (unsigned long)1<<(exp))
int main()
{
char c[1] = {-1};
printf("DICHT_SIZE(c[0]) = %lu \n", DICTHT_SIZE(c[0]));
return 0;
}
root@500b6403cee5:/home# gcc test.c -o out && ./out
DICHT_SIZE(c[0]) = 9223372036854775808
root@500b6403cee5:/home# gcc -O2 test.c -o out && ./out
DICHT_SIZE(c[0]) = 0
Looks like this is not the only problem though.
Comment From: lamby
gcc generates different results depending on the optimization level :
I get the same results on my bare-metal s390x machine. (And just for clarity, I also see 0 when run with -O3 too.)
That might be relevant -- my patched dict.h is being compiled with -O3, seemingly the default. What about yours?
Comment From: tezc
@lamby I just shared that snippet to show that GCC might emit different instructions, as that code involves undefined behavior (a shift greater than the width of the operand). The compiler can decide to emit 0 or 9223372036854775808 or anything else depending on the context (e.g. the surrounding code). It can also emit different values in the same build for different usages of that macro inside the codebase, so it's impossible to reason about. All we can do is fix the undefined behavior and test again. Unfortunately, as you said, that doesn't fix the issue on the bare-metal machine.
That might be relevant -- my patched dict.h is being compiled with -O3, seemingly the default. What about yours?
I see -O2 here but just saying again, I don't think this will lead us anywhere.
On qemu, I can reproduce the hang, and that small fix solves the issue locally. After the fix, I can't reproduce it anymore, and I'm out of ideas unfortunately :( I don't have much experience with Redis, so someone else might have a better idea about how to make progress. Maybe take a look at what @oranagra suggested?
i.e. that test relies on malloc to fail, returning NULL. the size it tries to allocate is
170910011579433481. maybe for some reason the swap or some insane overcomit allows it?
Comment From: yossigo
@tezc Actually, gcc generates the same output even on x86_64 when using default unsigned char:
yossi@ubuntu:/tmp$ gcc -funsigned-char t.c && ./a.out
DICHT_SIZE(c[0]) = 9223372036854775808
yossi@ubuntu:/tmp$ gcc -O2 -funsigned-char t.c && ./a.out
DICHT_SIZE(c[0]) = 0
yossi@ubuntu:/tmp$ gcc -O3 -funsigned-char t.c && ./a.out
DICHT_SIZE(c[0]) = 0
Building Redis with -funsigned-char also reproduces an immediate crash on redis-server startup, which goes away when moving to signed char (i.e. int8_t).
There could be another unrelated issue, @lamby do you see the exact same failure with and without the int8_t change?
Comment From: tezc
@yossigo yeah, when using unsigned char, -1 becomes 255 and leads to undefined behavior
char c[1] = {-1}; // c[0] is now 255.
DICTHT_SIZE(c[0]); // 255 == -1 ? 0 : (unsigned long)1 << 255)
A shift greater than the width of the operand is undefined behavior, if I'm not mistaken.
Also, redis-server generates segfault without the fix on qemu s390x, same behavior as -funsigned-char on x86. After the fix, crash goes away.
Comment From: yossigo
@oranagra Looks like your assumption was correct, but just changing the struct was not enough and I guess this overflowed in a way that got trycalloc to succeed. @lamby Can you please test this patch, as it seems like qemu may not reliably reproduce this problem. Thanks!
EDIT: Problem still reproduces on qemu-system-s390x with Ubuntu 20.04. Looks like jemalloc calloc() is happy to return a 2^61 allocation and zmalloc_size() will even confirm that.
diff --git a/src/dict.c b/src/dict.c
index bc03b0c96..a26ba9eba 100644
--- a/src/dict.c
+++ b/src/dict.c
@@ -61,7 +61,7 @@ static unsigned int dict_force_resize_ratio = 5;
/* -------------------------- private prototypes ---------------------------- */
static int _dictExpandIfNeeded(dict *d);
-static char _dictNextExp(unsigned long size);
+static signed char _dictNextExp(unsigned long size);
static long _dictKeyIndex(dict *d, const void *key, uint64_t hash, dictEntry **existing);
static int _dictInit(dict *d, dictType *type);
@@ -150,7 +150,7 @@ int _dictExpand(dict *d, unsigned long size, int* malloc_failed)
/* the new hash table */
dictEntry **new_ht_table;
unsigned long new_ht_used;
- char new_ht_size_exp = _dictNextExp(size);
+ signed char new_ht_size_exp = _dictNextExp(size);
/* Detect overflows */
size_t newsize = 1ul<<new_ht_size_exp;
@@ -1009,7 +1009,7 @@ static int _dictExpandIfNeeded(dict *d)
/* TODO: clz optimization */
/* Our hash table capability is a power of two */
-static char _dictNextExp(unsigned long size)
+static signed char _dictNextExp(unsigned long size)
{
unsigned char e = DICT_HT_INITIAL_EXP;
diff --git a/src/dict.h b/src/dict.h
index 999053923..495e48f4a 100644
--- a/src/dict.h
+++ b/src/dict.h
@@ -80,7 +80,7 @@ struct dict {
/* Keep small vars at end for optimal (minimal) struct padding */
int16_t pauserehash; /* If >0 rehashing is paused (<0 indicates coding error) */
- char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */
+ signed char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */
};
/* If safe is set to 1 this is a safe iterator, that means, you can call
Comment From: yossigo
@oranagra This has nothing to do with dict refactoring, it reproduces (at least on qemu-system-s390/Ubuntu 20.04) with 1c59567a7fe207997eef6197eefa7d508d7fbf9f:
#0 0x000002aa221ca63a in _dictClear (d=d@entry=0x3ff9ce0d640, ht=ht@entry=0x3ff9ce0d650,
callback=callback@entry=0x0) at dict.c:473
#1 0x000002aa221ca6fe in dictRelease (d=0x3ff9ce0d640) at dict.c:493
#2 0x000002aa221ee42a in decrRefCount (o=0x3ff9ce80f50) at object.c:366
#3 decrRefCount (o=0x3ff9ce80f50) at object.c:361
#4 0x000002aa2220147c in rdbLoadObject (rdbtype=<optimized out>, rdb=rdb@entry=0x3ffda4fe728,
key=<optimized out>, dbid=<optimized out>) at rdb.c:1604
#5 0x000002aa222364b6 in restoreCommand (c=0x3ff9cf2f900) at cluster.c:5229
#6 0x000002aa221cffbe in call (c=0x3ff9cf2f900, flags=<optimized out>) at server.c:3722
#7 0x000002aa221d1b54 in processCommand (c=c@entry=0x3ff9cf2f900) at server.c:4247
#8 0x000002aa221e5550 in processCommandAndResetClient (c=c@entry=0x3ff9cf2f900) at networking.c:2018
#9 0x000002aa221e7fc0 in processInputBuffer (c=0x3ff9cf2f900) at networking.c:2107
#10 0x000002aa2228865e in callHandler (handler=<optimized out>, conn=<optimized out>) at connhelpers.h:79
#11 connSocketEventHandler (el=<optimized out>, fd=<optimized out>, clientData=0x3ff9ce15180,
mask=<optimized out>) at connection.c:304
#12 0x000002aa221c7d6e in aeProcessEvents (eventLoop=eventLoop@entry=0x3ff9ce230f0, flags=flags@entry=27)
at ae.c:428
#13 0x000002aa221c8234 in aeMain (eventLoop=0x3ff9ce230f0) at ae.c:488
#14 0x000002aa221c3eba in main (argc=<optimized out>, argv=<optimized out>) at server.c:6513
(gdb) p ht->size
$3 = 288230376151711744
(gdb) p i
$4 = 570050880
(gdb) p je_malloc_usable_size(ht->table)
$8 = 2305843009213693952
Comment From: lamby
Hi @yossigo - Just to confirm what you already added in your "EDIT": it does indeed still hang here, allocating memory. :)
ps. Just a note for me and anyone following along: the patch in the comment above is now on the unstable branch.
Comment From: oranagra
@lamby to the best of our understanding there are two unrelated things discussed here. 1. The signed char issue that leads to a crash (on all platforms where char is unsigned by default); it existed in unstable for about a week and is now fixed (it was never released). 2. The hanging test problem you reported at the top, still not resolved, which has existed since the 6.2.0 release; let's discuss it below.
6.2.0 has a new mechanism to prevent a malformed RESTORE command from crashing redis. One of the concerns is the OOM panic we hit when an allocation fails, so the new code avoids the panic and instead fails the RESTORE (this came with some tests).
The problem seems to be that on some platforms/kernels, overcommit allows these huge allocations to succeed, and even when using calloc, the pages may be zeroed only when mapped (accessed) rather than up front. This means the test succeeds in allocating an immense amount of memory, but the RESTORE command in the test then fails on another issue, and redis then attempts to iterate over that huge dict to release all the elements it contains (there are none). That loop takes forever and hangs the test.
i'm not certain yet whether we can or need to fix redis; maybe we just need to skip this test on the problematic platforms.