We've had 10+ Redis (version 5.0.12) crashes recently, all with the same stack.

After one of the Redis processes died, we debugged it and found that the value of list->tail was NULL at adlist.c:336. Unfortunately, the Redis processes on the other machines crashed and exited directly, so we could not debug them.
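
For context, the rotation that crashes looks roughly like the sketch below. This is a paraphrase of the adlist.c rotation logic (upstream 5.0 names the function listRotate; our backtrace shows listRotateTailToHead), not the exact source: if list->tail ends up NULL after the reassignment, the next statement writes through ((listNode *)NULL)->next, i.e. address 0x8, which matches the "Accessing address: 0x8" line in the crash report below.

    #include <stddef.h>
    #include <stdio.h>

    /* Simplified adlist structures: listNode has the same field order as
     * Redis's adlist.h (prev, next, value); the real list struct also carries
     * dup/free/match callbacks between tail and len. */
    typedef struct listNode {
        struct listNode *prev;   /* offset 0 */
        struct listNode *next;   /* offset 8 on 64-bit builds */
        void *value;
    } listNode;

    typedef struct list {
        listNode *head;
        listNode *tail;
        unsigned long len;
    } list;

    /* Paraphrase of the rotate-tail-to-head logic. */
    static void rotateTailToHead(list *l) {
        listNode *tail = l->tail;
        if (l->len <= 1) return;    /* len must therefore have been >= 2 at crash time */
        l->tail = tail->prev;       /* if tail->prev is NULL, l->tail becomes NULL... */
        l->tail->next = NULL;       /* ...and this write lands at ((listNode *)NULL)->next, i.e. 0x8 */
        l->head->prev = tail;
        tail->prev = NULL;
        tail->next = l->head;
        l->head = tail;
    }

    int main(void) {
        /* Healthy two-node list a <-> b, then rotate b to the front. */
        listNode a = { NULL, NULL, "a" }, b = { NULL, NULL, "b" };
        a.next = &b; b.prev = &a;
        list l = { &a, &b, 2 };
        rotateTailToHead(&l);
        printf("head=%s tail=%s, offsetof(listNode, next)=%zu\n",
               (char *)l.head->value, (char *)l.tail->value,
               offsetof(listNode, next));
        return 0;
    }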

Crash report

74483:S 17 Sep 2022 12:48:14.529 * FAIL message received from runidXX about runidXX
74483:S 18 Sep 2022 20:35:01.056 * FAIL message received from a8c88876fb4c5e5b1ecc9a0bc806709b5573339fbd about 060b2a9ef3ee75c409859d85983849b84dfd11ea

=== REDIS BUG REPORT START: Cut & paste starting from here === 
74483:S 19 Sep 2022 12:21:25.391 # Redis 5.0.12 crashed by signal: 11 
74483:S 19 Sep 2022 12:21:25.391 # Crashed running the instruction at: 0x427dc2 
74483:S 19 Sep 2022 12:21:25.391 # Accessing address: 0x8 
74483:S 19 Sep 2022 12:21:25.391 # Failed assertion: <no assertion failed> (<no file>:0) 

------ STACK TRACE ------ 
EIP: 
/tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](listRotateTailToHead+0x12)[0x427dc2] 

Backtrace: 
/tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](logStackTrace+0x41)[0x471691] 
/tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](sigsegvHandler+0x96)[0x471d16] 
/lib64/libpthread.so.0(+0x13930)[0x7f7122d22930]
 /tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](listRotateTailToHead+0x12)[0x427dc2] 
/tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](clientsCron+0x71)[0x42e511]
 /tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](serverCron+0x245)[0x431025] 
/tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](aeProcessEvents+0x276)[0x42a4c6] 
/tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](aeMain+0x2b)[0x42a72b]
 /tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](main+0x4b5)[0x427595] 
/lib64/libc.so.6(__libc_start_main+0xe7)[0x7f7122b73b67] 
/tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster](_start+0x2a)[0x4277ca]

----- INFO OUTPUT ------ 
# Server 
redis_version:5.0.12 
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:7c0135df285128a0 
redis_mode:cluster 
os:Linux 4.19.90-23.6.v2101.ky10.x86_64 x86_64 
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin 
gcc_version:7.3.0 
process_id:74483 
run_id:7134593b8cae01f5f3b95d339b6903db9ff19dde 
tcp_port:6379 
uptime_in_seconds:2226474 
uptime_in_days:25 
hz:70 
configured_hz:70 
lru_clock:2616901 
executable:/tools/redis-5.0.12/bin/redis-server 
config_file:/data/redis/redis.conf 

# Clients 
connected_clients:869 
client_recent_max_input_buffer:4 
client_recent_max_output_buffer:0 
blocked_clients:0 

# Memory 
used_memory:24572000 
used_memory_human:23.43M 
used_memory_rss:104591360 
used_memory_rss_human:99.75M 
used_memory_peak:63126440 
used_memory_peak_human:60.20M 
used_memory_peak_perc:38.93% 
used_memory_overhead:17697198 
used_memory_startup:1891984 
used_memory_dataset:6874802 
used_memory_dataset_perc:30.31% 
allocator_allocated:24798560 
allocator_active:32555008 
allocator_resident:37257216 
total_system_memory:24714575872 
total_system_memory_human:23.02G 
used_memory_lua:37888 
used_memory_lua_human:37.00K 
used_memory_scripts:0 
used_memory_scripts_human:0B 
number_of_cached_scripts:0 
maxmemory:12000000000 
maxmemory_human:11.18G 
maxmemory_policy:volatile-ttl 
allocator_frag_ratio:1.31 
allocator_frag_bytes:7756448 
allocator_rss_ratio:1.14 
allocator_rss_bytes:4702208 
rss_overhead_ratio:2.81 
rss_overhead_bytes:67334144 
mem_fragmentation_ratio:4.26 
mem_fragmentation_bytes:80020640 
mem_not_counted_for_evict:2040 
mem_replication_backlog:1048576 
mem_clients_slaves:0 
mem_clients_normal:14754494 
mem_aof_buffer:2040 
mem_allocator:jemalloc-5.1.0 
active_defrag_running:0 
lazyfree_pending_objects:0 

# Persistence 
loading:0 
rdb_changes_since_last_save:104952019 
rdb_bgsave_in_progress:0 
rdb_last_save_time:1661334811 
rdb_last_bgsave_status:ok 
rdb_last_bgsave_time_sec:-1 
rdb_current_bgsave_time_sec:-1 
rdb_last_cow_size:0
aof_enabled:1
aof_rewrite_in_progress:0 
aof_rewrite_scheduled:0 
aof_last_rewrite_time_sec:1 
aof_current_rewrite_time_sec:-1 
aof_last_bgrewrite_status:ok 
aof_last_write_status:ok 
aof_last_cow_size:59727872 
aof_current_size:14072882 
aof_base_size:2894033 
aof_pending_rewrite:0 
aof_buffer_length:0 
aof_rewrite_buffer_length:0 
aof_pending_bio_fsync:0 
aof_delayed_fsync:0

# Stats 
total_connections_received:17865608 
total_commands_processed:143520116
instantaneous_ops_per_sec:13
total_net_input_bytes:9659248291
total_net_output_bytes:1382677478
instantaneous_input_kbps:0.20
instantaneous_output_kbps:0.12
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:2421
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:slave
master_host:ip
master_port:6379
master_link_status:up
master_last_io_seconds_ago:3
master_sync_in_progress:0
slave_repl_offset:8871541619
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:5ad534c4b8ed27282a113257b60988d54e43ea77
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:8871541619
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:8870493044
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:37696.320464
used_cpu_user:15254.539483
used_cpu_sys_children:2.535811
used_cpu_user_children:15.147198

# Commandstats
cmdstat_auth:calls=686785,usec=1131329,usec_per_call=1.65
cmdstat_select:calls=1,usec=1,usec_per_call=1.00
cmdstat_cluster:calls=72425,usec=5358095,usec_per_call=73.98
cmdstat_config:calls=36151,usec=1485166,usec_per_call=41.08
cmdstat_set:calls=52476010,usec=657378730,usec_per_call=12.53
cmdstat_ping:calls=37447365,usec=27338096,usec_per_call=0.73
cmdstat_slowlog:calls=108441,usec=266164,usec_per_call=2.45
cmdstat_del:calls=52476009,usec=658288124,usec_per_call=12.54
cmdstat_info:calls=216929,usec=68784391,usec_per_call=317.08

# Cluster
cluster_enabled:1

# Keyspace
db0:keys=1,expires=0,avg_ttl=0

------ CLIENT LIST OUTPUT ------
id=17864884 addr=ip:port fd=1035 name= age=73 idle=13 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping 
id=17865280 addr=ip:port fd=350 name= age=28 idle=28 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
 id=17865281 addr=ip:port fd=355 name= age=28 idle=28 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
 id=17865282 addr=ip:port fd=379 name= age=28 idle=28 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
 id=17865283 addr=ip:port fd=390 name= age=28 idle=28 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
 : 
id=17865284 addr=ip:port fd=471 name= age=28 idle=28 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
<Note that there are more than 800 lines in the above Client List Output, which are omitted in order to save space.> 


 ------ REGISTERS ------ 
74483:S 19 Sep 2022 12:21:25.395 # 
RAX:00007f712262af10 RBX:0000000000000001 
RCX:000000000000006 RDX:0000000000000000 
RDI:00007f712260f000 RSI:0000000000000000 
RBP:00007f7111fa6a40 RSP:00007ffcd98c1258 
R8 :00007ffcd9951000 R9 :0008f27c66eb5d60 
R10:00007ffcd98c1260 R11:000000004340ceec 
R12:0000018353fabf0f R13:00007f71226b45c0 
R14:00007f7122615080 R15:0000000000000001 
RIP:0000000000427dc2 EFL:0000000000010202 
CSGSFS:002b000000000033 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1267) -> 00007f71226b45c0 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1266) -> 00007f712266c050 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1265) -> 00007ffcd98c1300 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1264) -> 0000000000000000 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1263) -> 00007f7122615090 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1262) -> 0000000222615080 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1261) -> 0000000000000000 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1260) -> 00007ffcd9953ead 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c125f) -> 00007ffcd98c12e0 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c125e) -> 0000000000431025 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c125d) -> 00007f712266c050 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c125c) -> 0000000000000000 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c125b) -> 000000000000004b 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c125a) -> 000000000005f76e 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1259) -> 000000006327ee45 
74483:S 19 Sep 2022 12:21:25.395 # (00007ffcd98c1258) -> 000000000042e511

------ FAST MEMORY TEST ------ 
74483:S 19 Sep 2022 12:21:25.396 # Bio thread for job type #0 terminated 
74483:S 19 Sep 2022 12:21:25.397 # Bio thread for job type #1 terminated 
74483:S 19 Sep 2022 12:21:25.397 # Bio thread for job type #2 terminated 
*** Preparing to test memory region 5a2000 (2248704 bytes) 
*** Preparing to test memory region c50000 (135168 bytes) 
*** Preparing to test memory region 7f710cc00000 (111149056 bytes) 
*** Preparing to test memory region 7f711364c000 (3670016 bytes) 
*** Preparing to test memory region 7f71139cd000 (8388608 bytes) 
*** Preparing to test memory region 7f71141ce000 (8388608 bytes) 
*** Preparing to test memory region 7f71149cf000 (8388608 bytes) 
*** Preparing to test memory region 7f71151cf000 (3145728 bytes) 
*** Preparing to test memory region 7f7122200000 (8388608 bytes) 
*** Preparing to test memory region 7f7122b48000 (24576 bytes) 
*** Preparing to test memory region 7f7122d0b000 (16384 bytes) 
*** Preparing to test memory region 7f7122d2c000 (16384 bytes) 
*** Preparing to test memory region 7f7122ec3000 (8192 bytes) 
*** Preparing to test memory region 7f7122efc000 (4096 bytes) .O.O.O.O.O.O.O.O.O.O.O.O.O.O Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible. 

------ DUMPING CODE AROUND EIP ------ 
Symbol: listRotateTailToHead (base: 0x427db0) 
Module: /tools/redis-5.0.12/bin/redis-server 0.0.0.0:6379 [cluster] (base 0x400000) 
$ xxd -r -p /tmp/dump.hex /tmp/dump.bin 
$ objdump --adjust-vma=0x427db0 -D -b binary -m i386:x86-64 /tmp/dump.bin 
------ 
74483:S 19 Sep 2022 12:21:26.273 # dump of function (hexdump of 146 bytes): 48837f28017627488b4708488b104889570848c7420800000000488b1748890248c7000000000048895008488907f3c348837f28017628488b07488b500848891748c70200000000488b57084889420848c740080000000048891048894708f3c366662e0f1f8400000000000f1f4000488b06488b57084885c074034889104885d2743448894208488b46084885c0740448

=== REDIS BUG REPORT END. Make sure to include from START to END. ===

Please report the crash by opening an issue on github:
http://github.com/antirez/redis/issues 
Suspect RAM error? Use redis-server --test-memory to verify it.

Additional information

  1. OS distribution and version: Linux 4.19.90-23.6.v2101.ky10.x86_64 x86_64
  2. Steps to reproduce (if any)

Comment From: zhipingu

@antirez @oranagra Looking forward to your help

Comment From: oranagra

@zhipingu any chance you can check if it also happens with a more recent version (preferably 7.0)?

Comment From: zhipingu

@zhipingu any chance you can check if it also happens with a more recent version (preferably 7.0)?

@oranagra We also encountered crashes on version 6.2.7, and it seems to be related to the cluster size: the larger the cluster, the easier it is to crash.

Comment From: zhipingu

We deployed a cluster of 30 nodes, one master and two slaves

Comment From: oranagra

@zhipingu when you reproduced this on 6.2.7 you had the same stack trace? can you post the crash log?

Comment From: oranagra

i'll note that this crash seems to happen when adlist.c thinks the list is non-empty (len != 0), but the tail is NULL, can only happen due to one of these:

  1. internal counting bug in adlist.c
  2. memory corruption by some other mechanism in redis that overrides something in the list.

number 1 is very unlikely since adlist.c is so simple and didn't really change in years (the diff from the latest version are negligible)

number 2 would be very hard to find, considering the list nodes themselves are isolated allocations, and the list header is part of the server struct and has no dangerous array right before it.

assuming memory corruption, if we knew how to reproduce it, then valgrind can be used to find the offender
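
A hypothetical debug check (not upstream Redis code), using the simplified list/listNode definitions from the sketch at the top of this issue, could help tell those two options apart: walking the chain from head and comparing the node count against len shows whether len disagrees with the nodes actually linked (option 1) or the header pointers were trashed while the chain itself is still intact (option 2).

    /* Hypothetical consistency check, not part of Redis; assumes the
     * simplified list/listNode structs from the earlier sketch. */
    static int listLooksConsistent(const list *l) {
        if (l->len == 0)
            return l->head == NULL && l->tail == NULL;
        if (l->head == NULL || l->tail == NULL)
            return 0;                       /* the state seen in this crash */
        unsigned long n = 1;
        const listNode *node = l->head;
        while (node->next != NULL) {
            node = node->next;
            n++;
        }
        return n == l->len && node == l->tail && l->head->prev == NULL;
    }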

Comment From: zhipingu

@zhipingu when you reproduced this on 6.2.7 you had the same stack trace? can you post the crash log?

@oranagra Thanks for your reply. We have encountered other crash stacks as well, but the list->tail one is the most frequent. We have also run with AddressSanitizer to monitor memory, but the stack it reports is the same as the illegal-access stack printed by Redis itself, so we can't tell where the bad write happened.

Comment From: zhipingu

i'll note that this crash seems to happen when adlist.c thinks the list is non-empty (len != 0), but the tail is NULL, can only happen due to one of these:

  1. internal counting bug in adlist.c
  2. memory corruption by some other mechanism in redis that overrides something in the list.

number 1 is very unlikely since adlist.c is so simple and didn't really change in years (the diff from the latest version are negligible)

number 2 would be very hard to find, considering the list nodes themselves are isolated allocations, and the list header is part of the server struct and has no dangerous array right before it.

assuming memory corruption, if we knew how to reproduce it, then valgrind can be used to find the offender

@oranagra We recently found another Redis crash at sds.h:181 and at sds.h:88 (both are unsigned char flags = s[-1];). I'm sure the code at those crash sites was added in version 5.0 and is not present in 3.2.7, and we have been testing with 3.2.7 for a while without finding any crashes so far.
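
For context on those two lines: every sds string keeps a small header, ending in a one-byte flags field, immediately before the character data the sds pointer refers to, so both sdslen() and sdsalloc() start with a read at s[-1]. The sketch below is a simplified illustration of that layout (it is not the real sds.h, which has five header sizes); the point is just that any dangling, freed, or otherwise invalid sds pointer faults on that very first s[-1] read.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef char *sds;   /* as in sds.h: the pointer refers to the string body, not the header */

    /* Simplified single-size header; the real sds has SDS_TYPE_5..SDS_TYPE_64 variants. */
    struct sdshdr_demo {
        uint32_t len;
        uint32_t alloc;
        unsigned char flags;   /* this byte lives at s[-1] */
        char buf[];
    };

    static sds demo_sdsnew(const char *init) {
        size_t l = strlen(init);
        struct sdshdr_demo *h = malloc(sizeof(*h) + l + 1);
        h->len = (uint32_t)l;
        h->alloc = (uint32_t)l;
        h->flags = 0;
        memcpy(h->buf, init, l + 1);
        return h->buf;           /* callers only ever see the buf pointer */
    }

    static size_t demo_sdslen(const sds s) {
        unsigned char flags = s[-1];   /* an invalid s makes this read the crash site */
        (void)flags;                   /* the real code switches on flags to pick the header type */
        return ((struct sdshdr_demo *)(s - offsetof(struct sdshdr_demo, buf)))->len;
    }

    int main(void) {
        sds s = demo_sdsnew("hello");
        printf("len=%zu, flags byte at s[-1]=%u\n",
               demo_sdslen(s), (unsigned)(unsigned char)s[-1]);
        free(s - offsetof(struct sdshdr_demo, buf));
        return 0;
    }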

Comment From: oranagra

@zhipingu when you wrote 3.2.7 you meant 6.2.7? is that other crash you mention in sds.h with 6.2 or 5.0? can maybe post the crash log?

Comment From: zhipingu

@oranagra I mean version 3.2.7; before that I encountered multiple crashes on both 5.0.12 and 6.2.7, so I chose version 3.2.7 to test. The crashes at sds.h:88 and sds.h:181 I mentioned were found on 5.0.12, and those lines of code were added in 5.0. As for the crash logs, I'm sorry I can't provide all of them; I can only share part of them, as follows:

Comment From: zhipingu

@oranagra crash log

935120:C 25 Oct 2022 01:28:10.126 * SYNC append only file rewrite performed
427948:M 25 Oct 2022 01:28:10.138 * AOF rewrite: 96 MB of memory used by copy-on-write
427948:M 25 Oct 2022 01:28:10.154 * Background AOF rewrite terminated with success
427948:M 25 Oct 2022 01:28:10.154 * Residual parent diff successfully flushed to the rewritten AOF
427948:M 25 Oct 2022 01:28:10.154 * Background AOF rewrite finished successfully
ASAN:DEADLYSIGNAL
==============================================================
==427948==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000004536cb bp 0x7ffed2e5c650 sp 0x7ffed2e5c630 T0)
==427948==The signal is caused by a READ memory access.
==427948==Hint: address points to the zero page.
#0 0x4536ca in sdsalloc /tools/redis-5.0.12/src/sds.h:181
#1 0x45486e in sdsAllocSize /tools/redis-5.0.12/src/sds.c:303
#2 0x442b14 in clientsCronResizeQueryBuffer /tools/redis-5.0.12/src/server.c:861
#3 0x4431aa in clientsCron /tools/redis-5.0.12/src/server.c:1001
#4 0x443d72 in serverCron /tools/redis-5.0.12/src/server.c:1228
#5 0x435a24 in processTimeEvents /tools/redis-5.0.12/src/ae.c:331
#6 0x4365d4 in aeProcessEvents /tools/redis-5.0.12/src/ae.c:469
#7 0x436a5e in aeMain /tools/redis-5.0.12/src/ae.c:501
#8 0x452898 in main /tools/redis-5.0.12/src/server.c:4432
#9 0x7f1094690b66 in __libc_start_main (/lib64/libc.so.6+0x25b66)
#10 0x4290e9 in _start (/tools/redis-5.0.12/bin/redis-server+0x4290e9)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tools/redis-5.0.12/src/sds.h:181 in sdsalloc
==427948==ABORTING

Comment From: zhipingu

@oranagra another crash

==428319==ERROR: AddressSanitizer: global-buffer-overflow on address 0x000000d9f49f at pc 0x0000005ad4d3 bp 0x7fff3c0b0020 sp 0x7fff3c0b0000
READ of size 1 at 0x000000d9f49f thread T0
#0 0x5ad4d2 in sdslen /tools/redis-5.0.12/src/sds.h:88
#1 0x5ad757 in activeExpireCycleTryExpire /tools/redis-5.0.12/src/expire.c:58
#2 0x5add11 in activeExpireCycle /tools/redis-5.0.12/src/expire.c:195
#3 0x44322e in databasesCron /tools/redis-5.0.12/src/server.c:1014
#4 0x443d77 in serverCron /tools/redis-5.0.12/src/server.c:1231
#5 0x435a24 in processTimeEvents /tools/redis-5.0.12/src/ae.c:331
#6 0x4365d4 in aeProcessEvents /tools/redis-5.0.12/src/ae.c:469
#7 0x436a5e in aeMain /tools/redis-5.0.12/src/ae.c:501
#8 0x452898 in main /tools/redis-5.0.12/src/server.c:4432
#9 0x7f515eae0b66 in __libc_start_main (/lib64/libc.so.6+0x25b66)
#10 0x4290e9 in _start (/tools/redis-5.0.12/bin/redis-server+0x4290e9)

0x000000d9f49f is located 47 bytes to the right of global variable 'hashDictType' defined in 'server.c:656:10' (0xd9f440) of size 48
0x000000d9f49f is located 1 bytes to the left of global variable 'keylistDictType' defined in 'server.c:668:10' (0xd9f4a0) of size 48
SUMMARY: AddressSanitizer: global-buffer-overflow /tools/redis-5.0.12/src/sds.h:88 in sdslen
Shadow bytes around the buggy address:

Comment From: oranagra

first one seems to happen when a client has a NULL query buffer, it doesn't seem like a possible result of a memory corruption, but rather a bug (also, if it were a memory corruption, i would hope ASAN would have reported that earlier). that portion of code in clientsCronResizeQueryBuffer didn't change recently, but maybe there are other bugs that lead to that scenario of NULL query buffer, that have already been resolved, i can't think of them but it's too long ago, so it would help if you can try to reproduce that on 7.0.

the second one seems to be a key name pointer (from the server dictionary) that's a bad address. this could have been a result of some memory corruption, or a bug forgetting to update a pointer. again, if it was a memory corruption, i would hope that ASAN would have reported that earlier, which makes it likely to be a bug. and again, it's hard to keep track of everything that was fixed since so long ago, so it would help if you can try to reproduce it on 7.0.
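
If the working theory for the first one is a client reaching that cron with a NULL querybuf, a small local guard could turn the segfault into a traceable warning. The sketch below is a hypothetical debugging aid, not upstream code and not compilable outside the Redis tree; the field and helper names (client->querybuf, client->id, serverLog, sdsAllocSize) follow Redis 5.0.

    int clientsCronResizeQueryBuffer(client *c) {
        /* Hypothetical guard: log and skip clients whose query buffer is
         * unexpectedly NULL instead of letting sdsAllocSize() dereference it. */
        if (c->querybuf == NULL) {
            serverLog(LL_WARNING,
                "BUG: client id=%llu reached clientsCronResizeQueryBuffer with a NULL querybuf",
                (unsigned long long) c->id);
            return 0;
        }
        size_t querybuf_size = sdsAllocSize(c->querybuf);  /* the line that currently faults */
        /* ... rest of the original function unchanged ... */
        (void) querybuf_size;
        return 0;
    }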

Comment From: garth6666

i'll note that this crash seems to happen when adlist.c thinks the list is non-empty (len != 0), but the tail is NULL, can only happen due to one of these:

  1. internal counting bug in adlist.c
  2. memory corruption by some other mechanism in redis that overrides something in the list.

number 1 is very unlikely since adlist.c is so simple and didn't really change in years (the diff from the latest version are negligible)

number 2 would be very hard to find, considering the list nodes themselves are isolated allocations, and the list header is part of the server struct and has no dangerous array right before it.

assuming memory corruption, if we knew how to reproduce it, then valgrind can be used to find the offender

Hi, I'm following this problem. In the listRotateTailToHead function, since list->tail can apparently end up NULL, I wonder whether the client node (the new tail) might be released before the list->tail->next = NULL assignment is executed. Maybe my guess is wrong and there is some atomic or mutual-exclusion mechanism in the code, so I want to ask whether this situation can exist.

Comment From: oranagra

@garth6666 i'm not sure i follow you, maybe you need to be more explicit. please note that redis is for the most part single-threaded, and clients are only released from the main thread (the same one that runs this serverCron code).

Comment From: zhipingu

@garth6666 From the gdb core dump, we found that the value of tail->prev was NULL, so after "list->tail = tail->prev" executed, list->tail became NULL. Even stranger, the values of tail->next and list->head->next were equal.

Comment From: zhipingu

first one seems to happen when a client has a NULL query buffer, it doesn't seem like a possible result of a memory corruption, but rather a bug (also, if it were a memory corruption, i would hope ASAN would have reported that earlier). that portion of code in clientsCronResizeQueryBuffer didn't change recently, but maybe there are other bugs that lead to that scenario of NULL query buffer, that have already been resolved, i can't think of them but it's too long ago, so it would help if you can try to reproduce that on 7.0.

the second one seems to be a key name pointer (from the server dictionary) that's bad address. this could have been a result of some memory corruption, or a bug forgetting to update a pointer. again, if it was a memory corruption, i would hope that ASAN would have reported that earlier, which makes it likely to be a bug. and again, it's hard to keep track of everything that was fixed since so long ago, so it would help if you can try to reproduce it on 7.0.

@oranagra Thanks for your reply. We have only reproduced it on 6.2.7; we have not tried 7.0, because it is not a stable version.

Comment From: oranagra

7.0 is a stable version, but 6.2 is maintained as well. i'd like to think that i do have a mental map of the changes in 7.0 that have a potential to fix such a problem, but on the other hand maybe there are changes that fix it by chance (code was replaced) without realizing it. anyway, i think my comment was about the "sds" related crashes, have you experienced these on 6.2?

from your text, it seems that what you mention is a case of a list that has only one node. i.e. in that case tail->prev is null, and head->next and tail->next are equal (because head==tail). but i suppose you mean that head->next and tail->next are equal but are not null, and that head is not equal to tail (just their next pointers are equal)? anyway, that would still seem like either we have a serious bug in the (very short) linked list implementation that only you can reproduce, or you have a memory corruption.

so assuming the second option (memory corruption), we need to find out what makes you different than all the others who use redis and don't experience that... are you using any modules? commands that might be rarely used? an OS or architecture that's not common?

Comment From: zhipingu

7.0 is a stable version, but 6.2 is maintained as well. i'd like to think that i do have a mental map of the changes in 7.0 that have a potential to fix such a problem, but on the other hand maybe there are changes that fix it by chance (code was replaced) without realizing it. anyway, i think my comment was about the "sds" related crashes, have you experienced these on 6.2?

from your text, it seems that, what you mention is a case of a list that has only one node. i.e. in that case tail->pre is null, and head->next and tail->next are equal (because head==tail). but i suppose you mean that head->next and tail->next are equal but are not null, and that head is not equal to tail (just their next pointers are equal)? anyway, that would still seem like either we have a serious bug in the (very short) linked list implementation that only you can reproduce, or you have a memory corruption.

so assuming the second option (memory corruption), we heed to find out what makes you different than all the others who use redis and don't experience that... are you using any modules? commands that might be rarely used? an OS / or architecture that's not common?

@oranagra 1. We have encountered several crashes on 6.2.7. 2. What I mentioned is a case of a list that has 869 nodes, and I mean that head->next and tail->next are equal but not null, and that head is not equal to tail (just their next pointers are equal, though not always). 3. We use Kylin OS.

Comment From: zhipingu

Could it be related to the gcc version? We use gcc 7.3.0. @oranagra

Comment From: oranagra

i don't know.. it could also be some incompatibility between something jemalloc does and your kernel. i'd advise to try disabling the optimization and/or switching to libc malloc. make noopt and make MALLOC=libc

Comment From: zhipingu

i don't know.. it could also be some incompatibility between something jemalloc does and your kernel. i'd advise to try disabling the optimization, and or switching to libc malloc. make noopt and make MALLOC=libc

Ok, thanks very much. I'll try it ASAP.

Comment From: zhipingu

i don't know.. it could also be some incompatibility between something jemalloc does and your kernel. i'd advise to try disabling the optimization, and or switching to libc malloc. make noopt and make MALLOC=libc

@oranagra We tried it yesterday. Unfortunately, we still encountered a crash. Is it possible that other parameters are not compatible?

Comment From: garth6666

i don't know.. it could also be some incompatibility between something jemalloc does and your kernel. i'd advise to try disabling the optimization, and or switching to libc malloc. make noopt and make MALLOC=libc

Ok,thanks very much. I'll try it ASAP.

You can try running the tests directly on Kylin OS with a Redis binary compiled on CentOS.

Comment From: oranagra

@oranagra We have tried it yesterday. Unfortunately, we still encountered crash.Is it possible that other parameters are not compatible?

i can't think of anything else that can be in some way incompatible with your kernel. just to make sure, please check INFO MEMORY to see that indeed you managed to use libc malloc. redis's make file remembers your previous settings, and you have to do make distclean in order to change these.

Comment From: zhipingu

@garth6666 We have tried it. However, we still encountered a crash.

Comment From: zhipingu

@oranagra Yes, we made sure that we are indeed using libc malloc.

Comment From: zhipingu

@oranagra We found something new: we encountered a global buffer overflow inside the jemalloc library code. The ASAN report is below:

==2194680==ERROR: AddressSanitizer: global-buffer-overflow on address 0x000000d5da18 at pc 0x000000abbaba bp 0x7fffdb05f7d0 sp 0x7fffdb05f7b0
READ of size 8 at 0x000000d5da18 thread T0

#0 0xabbab9 in sz_index2size_lookup include/jemalloc/internal/sz.h:201
#1 0xabbab9 in sz_index2size include/jemalloc/internal/sz.h:209
#2 0xabbab9 in arena_salloc include/jemalloc/internal/arena_inlines_b.h:124
#3 0xabbab9 in isalloc include/jemalloc/internal/jemalloc_internal_inlines_c.h:37
#4 0xabbab9 in je_malloc_usable_size src/jemalloc.c:3149
#5 0x458fa3 in zfree /tools/redis-5.0.12/src/zmalloc.c:202
#6 0x4547d7 in sdsRemoveFreeSpace /tools/redis-5.0.12/src/sds.c:286
#7 0x442c29 in clientsCronResizeQueryBuffer /tools/redis-5.0.12/src/server.c:874
#8 0x4431aa in clientsCron /tools/redis-5.0.12/src/server.c:1001
#9 0x443d72 in serverCron /tools/redis-5.0.12/src/server.c:1228
#10 0x435a24 in processTimeEvents /tools/redis-5.0.12/src/ae.c:331
#11 0x4365d4 in aeProcessEvents /tools/redis-5.0.12/src/ae.c:469
#12 0x436a5e in aeMain /tools/redis-5.0.12/src/ae.c:501
#13 0x452898 in main /tools/redis-5.0.12/src/server.c:4432
#14 0x7fcc7c205b66 in __libc_start_main (/lib64/libc.so.6+0x25b66)
#15 0x4290e9 in _start (/tools/redis-5.0.12/bin/redis-server+0x4290e9)

0x000000d5da18 is located 0 bytes to the right of global variable 'je_sz_index2size_tab' defined in src/sz.c:19:14 (0xd5d2c0) of size 1880
0x000000d5da18 is located 40 bytes to the left of global variable 'je_sz_size2index_tab' defined in src/sz.c:27:15 (0xd5da40) of size 512
SUMMARY: AddressSanitizer: global-buffer-overflow include/jemalloc/internal/sz.h:201 in sz_index2size_lookup

Comment From: zhipingu

@oranagra We have tried it yesterday. Unfortunately, we still encountered crash.Is it possible that other parameters are not compatible?

i can't think of anything else that can be in some way incompatible with your kernel. just to make sure, please check INFO MEMORY to see that indeed you managed to use libc malloc. redis's make file remembers your previous settings, and you have to do make distclean in order to change these.

@oranagra Can we just replace jemalloc-5.1.0 in redis with jemalloc-4.4.0 because we didn't crash with redis-3.2.7

Comment From: oranagra

That crash in jemalloc is probably not because of jemalloc. It's either because we call zfree on an invalid pointer, or because something corrupted the heap and messed up jemalloc data structures.

I think you should not have any problem to switch to an old jemalloc. But you already tried switching to libc allocator and it didn't help, so I really don't think that's gonna help.
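
To illustrate why a bad pointer blows up inside jemalloc rather than in Redis code: zfree() first asks the allocator for the usable size of the allocation (je_malloc_usable_size under jemalloc) to keep Redis's memory accounting up to date, and only then calls free(), so a pointer that never came from the allocator drives that size lookup into jemalloc's size-class tables with a garbage index, which is exactly where the ASAN report above points (sz_index2size_lookup). A minimal standalone sketch of that lookup pattern, using glibc's malloc_usable_size as a stand-in:

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* For a real allocation the usable-size query is fine... */
        void *p = malloc(100);
        printf("usable size reported by the allocator: %zu\n", malloc_usable_size(p));
        free(p);

        /* ...but calling malloc_usable_size()/free() on a pointer that did not
         * come from the allocator is undefined behaviour - the "zfree on an
         * invalid pointer" case described above. Under jemalloc that size
         * lookup is what ends up indexing past its global tables. */
        return 0;
    }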

Comment From: guojje

We also ran into the same problem. zhipingu, have you solved it? Would you mind leaving contact information so we can discuss it? Unfortunately, I haven't found a way to reproduce the problem.