Describe the bug
We're seeing a segfault in Redis 6.0.9 when running under XALT. XALT is a framework, mostly used on HPC systems, to gather statistics on software usage. It is activated by using LD_PRELOAD of a small XALT library that saves statistics on processes being run. What happens is that redis-server segfaults when the LD_PRELOAD is active:
(gdb) bt
#0 0x00002aaaab996d04 in __strchr_sse42 () from /lib64/libc.so.6
#1 0x00000000004b8c28 in spt_copyenv (oldenv=0x819ca0) at setproctitle.c:115
#2 0x00000000004b8fb0 in spt_init (argc=1, argv=0x7fffffff5a88) at setproctitle.c:198
#3 0x0000000000443404 in main (argc=1, argv=0x7fffffff5a88) at server.c:5168
We're pretty sure this is a bug (or dependence on undefined behaviour) in Redis. Looking at the environment related code in setproctitle.c, especially spt_copyenv(), looks fishy. The environment is cleared using clearenv() (through spt_clearenv() in this case) but a pointer to the original environ is saved as oldenv. The old entries in oldenv are then processed and copied to the new environment with setenv(). But the latter calls will update entries in environ. What is unclear to me is what guarantees clearenv() gives about the content of environ after it is called. It might simply set environ to NULL, but still use the same memory for setenv() calls that follow, this is not clear. As oldenv still points to the old environ content it might depend on undefined behaviour normally, but that behaviour changes when running under XALT.
To reproduce
See https://github.com/xalt/xalt/issues/41 for more details. Unfortunately, building and configuring XALT is a bit too extensive to add here.
paulm@tcn180 17:19 ~/c$ tar xf ~/dl/redis-6.0.9.tar.gz
paulm@tcn180 17:19 ~/c$ cd redis-6.0.9/
paulm@tcn180 17:19 ~/c/redis-6.0.9$ make -j32 noopt
cd src && make noopt
make[1]: Entering directory `/nfs/home4/paulm/c/redis-6.0.9/src'
CC Makefile.dep
make[1]: Leaving directory `/nfs/home4/paulm/c/redis-6.0.9/src'
...
make[2]: Leaving directory `/nfs/home4/paulm/c/redis-6.0.9/src'
make[1]: Leaving directory `/nfs/home4/paulm/c/redis-6.0.9/src'
# Not running under XALT, all OK
paulm@tcn180 17:20 ~/c/redis-6.0.9$ src/redis-server
29208:C 16 Nov 2020 17:20:50.532 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
29208:C 16 Nov 2020 17:20:50.532 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=29208, just started
29208:C 16 Nov 2020 17:20:50.532 # Warning: no config file specified, using the default config. In order to specify a config file use src/redis-server /path/to/redis.conf
_._
_.-``__ ''-._
_.-`` `. `_. ''-._ Redis 6.0.9 (00000000/0) 64 bit
.-`` .-```. ```\/ _.,_ ''-._
( ' , .-` | `, ) Running in standalone mode
|`-._`-...-` __...-.``-._|'` _.-'| Port: 6379
| `-._ `._ / _.-' | PID: 29208
`-._ `-._ `-./ _.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' | http://redis.io
`-._ `-._`-.__.-'_.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' |
`-._ `-._`-.__.-'_.-' _.-'
`-._ `-.__.-' _.-'
`-._ _.-'
`-.__.-'
29208:M 16 Nov 2020 17:20:50.533 # Server initialized
29208:M 16 Nov 2020 17:20:50.533 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
29208:M 16 Nov 2020 17:20:50.533 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
29208:M 16 Nov 2020 17:20:50.533 * Ready to accept connections
<ctrl-C>
# Get around the XALT filtering active on our system
paulm@tcn180 17:20 ~/c/redis-6.0.9$ cp src/redis-server R
# Running under XALT
paulm@tcn180 17:20 ~/c/redis-6.0.9$ gdb ./R
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /nfs/home4/paulm/c/redis-6.0.9/R...done.
(gdb) run
Starting program: /nfs/home4/paulm/c/redis-6.0.9/./R
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 29174]
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaab996d04 in __strchr_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64
(gdb) bt
#0 0x00002aaaab996d04 in __strchr_sse42 () from /lib64/libc.so.6
#1 0x00000000004b8c28 in spt_copyenv (oldenv=0x819ca0) at setproctitle.c:115
#2 0x00000000004b8fb0 in spt_init (argc=1, argv=0x7fffffff5a88) at setproctitle.c:198
#3 0x0000000000443404 in main (argc=1, argv=0x7fffffff5a88) at server.c:5168
Additional information
5329 is about an unexpectedly empty environment for redis-server as reported by /proc/<pid>/environ. Looking at spt_copyenv() the environment might be empty due to the undefined behaviour being depended on, as setenv() is called for each entry in oldenv.
The trace above is on a RHEL 7.8 system, GCC 9.3.0.
Comment From: paulmelis
Actually a simple reproduction is suggested in https://github.com/xalt/xalt/issues/41 by @rtmclay. Copying the code of spt_clearenv() and spt_copyenv() into t.c, calling setenv() as the first thing in main(), followed by a call to spt_copyenv().
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h> /* malloc(3) setenv(3) clearenv(3) setproctitle(3) getprogname(3) */
#include <stdio.h> /* vsnprintf(3) snprintf(3) */
#include <errno.h> /* errno program_invocation_name program_invocation_short_name */
static int spt_clearenv(void) {
#if __GLIBC__
clearenv();
return 0;
#else
extern char **environ;
static char **tmp;
if (!(tmp = malloc(sizeof *tmp)))
return errno;
tmp[0] = NULL;
environ = tmp;
return 0;
#endif
} /* spt_clearenv() */
static int spt_copyenv(char *oldenv[]) {
extern char **environ;
char *eq;
int i, error;
if (environ != oldenv)
return 0;
if ((error = spt_clearenv()))
goto error;
for (i = 0; oldenv[i]; i++) {
if (!(eq = strchr(oldenv[i], '=')))
continue;
*eq = '\0';
error = (0 != setenv(oldenv[i], eq + 1, 1))? errno : 0;
*eq = '=';
if (error)
goto error;
}
return 0;
error:
environ = oldenv;
return error;
} /* spt_copyenv() */
int main(int argc, char *argv[])
{
setenv("MY_PROG_NAME","cleanEnv",1);
printf("environ = %p\n", environ);
if (spt_copyenv(environ))
printf("ERROR in spt_copyenv()\n");
}
paulm@cmstorm 13:14:~$ gdb ./t
GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./t...
(gdb) run
Starting program: /home/paulm/t
environ = 0x5555555592a0
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ef649c in __strchr_avx2 () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff7ef649c in __strchr_avx2 () from /usr/lib/libc.so.6
#1 0x00005555555551fe in spt_copyenv (oldenv=0x5555555592a0) at t.c:43
#2 0x00005555555552f4 in main (argc=1, argv=0x7fffffffdab8) at t.c:68
Valgrind shows that the contended memory in spt_copyenv() is allocated by the first setenv() but then freed by clearenv(). If you leave out the setenv() in main() the segfault doesn't happen.
paulm@cmstorm 13:15:~$ valgrind ./t
==319297== Memcheck, a memory error detector
==319297== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==319297== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==319297== Command: ./t
==319297==
environ = 0x4a7f040
==319297== Invalid read of size 8
==319297== at 0x10927C: spt_copyenv (t.c:42)
==319297== by 0x1092F3: main (t.c:68)
==319297== Address 0x4a7f040 is 0 bytes inside a block of size 640 free'd
==319297== at 0x483B9AB: free (vg_replace_malloc.c:538)
==319297== by 0x48F2C24: clearenv (in /usr/lib/libc-2.32.so)
==319297== by 0x109191: spt_clearenv (t.c:14)
==319297== by 0x1091C0: spt_copyenv (t.c:39)
==319297== by 0x1092F3: main (t.c:68)
==319297== Block was alloc'd at
==319297== at 0x483A6AF: malloc (vg_replace_malloc.c:306)
==319297== by 0x48F270B: __add_to_environ (in /usr/lib/libc-2.32.so)
==319297== by 0x4842FD9: setenv (vg_replace_strmem.c:2133)
==319297== by 0x1092C9: main (t.c:64)
==319297==
...
Comment From: rtmclay
This code change won't fix the issue. Please see my comments to #8088.