We found a crash (segfault) in Redis 7.0.11 in a 3-node cluster. addr2line indicates that it happens at cluster.c:1946, where curmaster is NULL when dereferenced. The bug is flaky, so I cannot provide concrete reproduction instructions.
Here's the bug report.
Crash report
=== REDIS BUG REPORT START: Cut & paste starting from here ===
12:M 25 Jul 2023 20:16:44.449 # Redis 7.0.11 crashed by signal: 11, si_code: 1
12:M 25 Jul 2023 20:16:44.449 # Accessing address: 0x84c
12:M 25 Jul 2023 20:16:44.449 # Crashed running the instruction at: 0x6b8b8d
------ STACK TRACE ------
EIP:
/bin/redis-server *:6379 [cluster](clusterUpdateSlotsConfigWith+0x56d)[0x6b8b8d]
Backtrace:
/bin/redis-server *:6379 [cluster](sigsegvHandler+0x2c9)[0x6a1f49]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7ffff7e58420]
/bin/redis-server *:6379 [cluster](clusterUpdateSlotsConfigWith+0x56d)[0x6b8b8d]
/bin/redis-server *:6379 [cluster](clusterProcessPacket+0x3157)[0x6be317]
/bin/redis-server *:6379 [cluster](clusterReadHandler+0x3c4)[0x6c2ec4]
/bin/redis-server *:6379 [cluster][0x808cf1]
/bin/redis-server *:6379 [cluster](aeProcessEvents+0x82f)[0x50ca7f]
/bin/redis-server *:6379 [cluster](aeMain+0x3d)[0x50d75d]
/bin/redis-server *:6379 [cluster](main+0x1274)[0x53fce4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff7c5b083]
/bin/redis-server *:6379 [cluster](__libc_start_main+0x493)[0x8ec063]
/bin/redis-server *:6379 [cluster](_start+0x2e)[0x450a1e]
------ REGISTERS ------
12:M 25 Jul 2023 20:16:44.479 #
RAX:0000000000000100 RBX:00007fffffffd2e0
RCX:0000000000004ccc RDX:00000000000001a4
RDI:000000000000084c RSI:0000000000001554
RBP:00007fffffffd3b0 RSP:00007fffffff51c0
R8 :0000000000000030 R9 :0000000000000000
R10:0000000000000002 R11:0000000000000001
R12:0000000000000000 R13:000061d000006e80
R14:00007fffffff51c0 R15:0000000000001555
RIP:00000000006b8b8d EFL:0000000000010246
CSGSFS:002b000000000033
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51cf) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51ce) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51cd) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51cc) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51cb) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51ca) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c9) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c8) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c7) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c6) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c5) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c4) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c3) -> 0000000000000000
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c2) -> 00000000006b8620
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c1) -> 0000000000956f2b
12:M 25 Jul 2023 20:16:44.479 # (00007fffffff51c0) -> 0000000041b58ab3
------ INFO OUTPUT ------
# Server
redis_version:7.0.11
redis_git_sha1:391aa407
redis_git_dirty:0
redis_build_id:e1cfd8d84705f27e
redis_mode:cluster
os:Linux 5.15.68 x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:4.2.1
process_id:12
process_supervised:no
run_id:87417ce7414ac80853927717729141f42aee787a
tcp_port:6379
server_time_usec:1690316204438089
uptime_in_seconds:9
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:12594604
executable:/bin/redis-server
config_file:
io_threads_active:0
# Clients
connected_clients:0
cluster_connections:2
maxclients:4064
client_recent_max_input_buffer:16390
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:1555608
used_memory_human:1.48M
used_memory_rss:21348352
used_memory_rss_human:20.36M
used_memory_peak:1590828
used_memory_peak_human:1.52M
used_memory_peak_perc:97.79%
used_memory_overhead:1189834
used_memory_startup:1189649
used_memory_dataset:365774
used_memory_dataset_perc:99.95%
allocator_allocated:1544425
allocator_active:21316608
allocator_resident:21316608
total_system_memory:2078552064
total_system_memory_human:1.94G
used_memory_lua:31744
used_memory_vm_eval:31744
used_memory_lua_human:31.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:64512
used_memory_vm_total_human:63.00K
used_memory_functions:181
used_memory_scripts:181
used_memory_scripts_human:181B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:13.80
allocator_frag_bytes:19772183
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:31744
mem_fragmentation_ratio:13.82
mem_fragmentation_bytes:19803927
mem_not_counted_for_evict:0
mem_replication_backlog:4
mem_total_replication_buffers:0
mem_clients_slaves:0
mem_clients_normal:0
mem_cluster_links:0
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1690316195
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:0
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:0
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
# Stats
total_connections_received:19
total_commands_processed:29
instantaneous_ops_per_sec:10
total_net_input_bytes:54920
total_net_output_bytes:16941
total_net_repl_input_bytes:0
total_net_repl_output_bytes:0
instantaneous_input_kbps:0.67
instantaneous_output_kbps:0.33
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:0.00
rejected_connections:0
sync_full:2
sync_partial_ok:0
sync_partial_err:2
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:7
dump_payload_sanitizations:0
total_reads_processed:56
total_writes_processed:33
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:3
reply_buffer_expands:0
# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:558fce6645c495d52247a26acde2ed2f21712fa6
master_replid2:f0462026e8365b312f488046ab9e66cd078891d3
master_repl_offset:0
second_repl_offset:1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:0
# CPU
used_cpu_sys:0.044078
used_cpu_user:0.087177
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
used_cpu_sys_main_thread:0.044010
used_cpu_user_main_thread:0.087043
# Modules
# Commandstats
cmdstat_ping:calls=2,usec=33,usec_per_call=16.50,rejected_calls=0,failed_calls=0
cmdstat_info:calls=3,usec=445,usec_per_call=148.33,rejected_calls=0,failed_calls=0
cmdstat_cluster|set-config-epoch:calls=1,usec=15,usec_per_call=15.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=5,usec=1536,usec_per_call=307.20,rejected_calls=0,failed_calls=0
cmdstat_cluster|meet:calls=2,usec=384,usec_per_call=192.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=1,usec=90,usec_per_call=90.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|keyslot:calls=1,usec=40,usec_per_call=40.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|reset:calls=1,usec=40073,usec_per_call=40073.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|setslot:calls=1,usec=34,usec_per_call=34.00,rejected_calls=0,failed_calls=1
cmdstat_cluster|countkeysinslot:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0
cmdstat_cluster|myid:calls=4,usec=116,usec_per_call=29.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|addslots:calls=1,usec=536,usec_per_call=536.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|delslotsrange:calls=1,usec=225,usec_per_call=225.00,rejected_calls=0,failed_calls=0
cmdstat_psync:calls=2,usec=323,usec_per_call=161.50,rejected_calls=0,failed_calls=0
cmdstat_rename:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0
cmdstat_dump:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0
cmdstat_set:calls=0,usec=0,usec_per_call=0.00,rejected_calls=3,failed_calls=0
cmdstat_replconf:calls=4,usec=70,usec_per_call=17.50,rejected_calls=0,failed_calls=0
# Errorstats
errorstat_CLUSTERDOWN:count=2
errorstat_ERR:count=3
errorstat_MOVED:count=2
# Latencystats
latency_percentiles_usec_ping:p50=12.031,p99=21.119,p99.9=21.119
latency_percentiles_usec_info:p50=174.079,p99=201.727,p99.9=201.727
latency_percentiles_usec_cluster|set-config-epoch:p50=15.039,p99=15.039,p99.9=15.039
latency_percentiles_usec_cluster|nodes:p50=354.303,p99=360.447,p99.9=360.447
latency_percentiles_usec_cluster|meet:p50=172.031,p99=212.991,p99.9=212.991
latency_percentiles_usec_cluster|info:p50=90.111,p99=90.111,p99.9=90.111
latency_percentiles_usec_cluster|keyslot:p50=40.191,p99=40.191,p99.9=40.191
latency_percentiles_usec_cluster|reset:p50=40108.031,p99=40108.031,p99.9=40108.031
latency_percentiles_usec_cluster|setslot:p50=34.047,p99=34.047,p99.9=34.047
latency_percentiles_usec_cluster|myid:p50=20.095,p99=50.175,p99.9=50.175
latency_percentiles_usec_cluster|addslots:p50=536.575,p99=536.575,p99.9=536.575
latency_percentiles_usec_cluster|delslotsrange:p50=225.279,p99=225.279,p99.9=225.279
latency_percentiles_usec_psync:p50=115.199,p99=208.895,p99.9=208.895
latency_percentiles_usec_replconf:p50=7.007,p99=39.167,p99.9=39.167
# Cluster
cluster_enabled:1
# Keyspace
------ CLIENT LIST OUTPUT ------
------ MODULES INFO OUTPUT ------
------ CONFIG DEBUG OUTPUT ------
activedefrag no
repl-diskless-sync yes
lazyfree-lazy-eviction no
client-query-buffer-limit 1gb
lazyfree-lazy-user-flush no
lazyfree-lazy-expire no
io-threads 1
io-threads-do-reads no
repl-diskless-load disabled
lazyfree-lazy-user-del no
replica-read-only yes
slave-read-only yes
sanitize-dump-payload no
proto-max-bulk-len 512mb
lazyfree-lazy-server-del no
list-compress-depth 0
------ FAST MEMORY TEST ------
12:M 25 Jul 2023 20:16:44.497 # Bio thread for job type #0 terminated
12:M 25 Jul 2023 20:16:44.497 # Bio thread for job type #1 terminated
12:M 25 Jul 2023 20:16:44.499 # Bio thread for job type #2 terminated
*** Preparing to test memory region b4d000 (9236480 bytes)
*** Preparing to test memory region 7fff7000 (268435456 bytes)
*** Preparing to test memory region 2008fff7000 (15392867561472 bytes)
*** Preparing to test memory region 10007e66a000 (1044480 bytes)
*** Preparing to test memory region 10007e769000 (12288 bytes)
*** Preparing to test memory region 10007e76c000 (1044480 bytes)
*** Preparing to test memory region 10007e86b000 (12288 bytes)
*** Preparing to test memory region 10007e86e000 (1044480 bytes)
*** Preparing to test memory region 10007e96d000 (23638016 bytes)
*** Preparing to test memory region 602000000000 (458752 bytes)
*** Preparing to test memory region 602e00000000 (65536 bytes)
*** Preparing to test memory region 603000000000 (2031616 bytes)
*** Preparing to test memory region 603e00000000 (65536 bytes)
*** Preparing to test memory region 604000000000 (1507328 bytes)
*** Preparing to test memory region 604e00000000 (65536 bytes)
*** Preparing to test memory region 606000000000 (65536 bytes)
*** Preparing to test memory region 606e00000000 (65536 bytes)
*** Preparing to test memory region 607000000000 (65536 bytes)
*** Preparing to test memory region 607e00000000 (65536 bytes)
*** Preparing to test memory region 608000000000 (65536 bytes)
*** Preparing to test memory region 608e00000000 (65536 bytes)
*** Preparing to test memory region 60b000000000 (65536 bytes)
*** Preparing to test memory region 60be00000000 (65536 bytes)
*** Preparing to test memory region 60c000000000 (65536 bytes)
*** Preparing to test memory region 60ce00000000 (65536 bytes)
*** Preparing to test memory region 60d000000000 (65536 bytes)
*** Preparing to test memory region 60de00000000 (65536 bytes)
*** Preparing to test memory region 60e000000000 (65536 bytes)
*** Preparing to test memory region 60ee00000000 (65536 bytes)
*** Preparing to test memory region 60f000000000 (65536 bytes)
*** Preparing to test memory region 60fe00000000 (65536 bytes)
*** Preparing to test memory region 610000000000 (65536 bytes)
*** Preparing to test memory region 610e00000000 (65536 bytes)
*** Preparing to test memory region 611000000000 (65536 bytes)
*** Preparing to test memory region 611e00000000 (65536 bytes)
*** Preparing to test memory region 612000000000 (65536 bytes)
*** Preparing to test memory region 612e00000000 (65536 bytes)
*** Preparing to test memory region 613000000000 (65536 bytes)
*** Preparing to test memory region 613e00000000 (65536 bytes)
*** Preparing to test memory region 614000000000 (65536 bytes)
*** Preparing to test memory region 614e00000000 (65536 bytes)
*** Preparing to test memory region 615000000000 (65536 bytes)
*** Preparing to test memory region 615e00000000 (65536 bytes)
*** Preparing to test memory region 616000000000 (65536 bytes)
*** Preparing to test memory region 616e00000000 (65536 bytes)
*** Preparing to test memory region 617000000000 (65536 bytes)
*** Preparing to test memory region 617e00000000 (65536 bytes)
*** Preparing to test memory region 618000000000 (65536 bytes)
*** Preparing to test memory region 618e00000000 (65536 bytes)
*** Preparing to test memory region 619000000000 (131072 bytes)
*** Preparing to test memory region 619e00000000 (65536 bytes)
*** Preparing to test memory region 61a000000000 (65536 bytes)
*** Preparing to test memory region 61ae00000000 (65536 bytes)
*** Preparing to test memory region 61b000000000 (65536 bytes)
*** Preparing to test memory region 61be00000000 (65536 bytes)
*** Preparing to test memory region 61c000000000 (65536 bytes)
*** Preparing to test memory region 61ce00000000 (65536 bytes)
*** Preparing to test memory region 61d000000000 (65536 bytes)
*** Preparing to test memory region 61de00000000 (65536 bytes)
*** Preparing to test memory region 61e000000000 (65536 bytes)
*** Preparing to test memory region 61ee00000000 (65536 bytes)
*** Preparing to test memory region 621000000000 (524288 bytes)
*** Preparing to test memory region 621e00000000 (65536 bytes)
*** Preparing to test memory region 624000000000 (65536 bytes)
*** Preparing to test memory region 624000010000 (655360 bytes)
*** Preparing to test memory region 624e00000000 (65536 bytes)
*** Preparing to test memory region 625000000000 (65536 bytes)
*** Preparing to test memory region 625e00000000 (65536 bytes)
*** Preparing to test memory region 629000000000 (983040 bytes)
*** Preparing to test memory region 629e00000000 (65536 bytes)
*** Preparing to test memory region 62a000000000 (720896 bytes)
*** Preparing to test memory region 62ae00000000 (65536 bytes)
*** Preparing to test memory region 62b000000000 (458752 bytes)
*** Preparing to test memory region 62be00000000 (65536 bytes)
*** Preparing to test memory region 62d000000000 (131072 bytes)
*** Preparing to test memory region 62de00000000 (65536 bytes)
*** Preparing to test memory region 62e000000000 (196608 bytes)
*** Preparing to test memory region 62ee00000000 (65536 bytes)
*** Preparing to test memory region 62f000000000 (65536 bytes)
*** Preparing to test memory region 62fe00000000 (65536 bytes)
*** Preparing to test memory region 631000000000 (131072 bytes)
*** Preparing to test memory region 631e00000000 (65536 bytes)
*** Preparing to test memory region 640000000000 (12288 bytes)
*** Preparing to test memory region 7ffff3390000 (8388608 bytes)
*** Preparing to test memory region 7ffff3b9f000 (8388608 bytes)
*** Preparing to test memory region 7ffff43ae000 (8388608 bytes)
*** Preparing to test memory region 7ffff4bbc000 (819200 bytes)
*** Preparing to test memory region 7ffff5084000 (135168 bytes)
*** Preparing to test memory region 7ffff50a5000 (33726464 bytes)
*** Preparing to test memory region 7ffff70cf000 (131072 bytes)
*** Preparing to test memory region 7ffff78cf000 (3526656 bytes)
*** Preparing to test memory region 7ffff7c2c000 (4096 bytes)
*** Preparing to test memory region 7ffff7c2d000 (40960 bytes)
*** Preparing to test memory region 7ffff7e25000 (16384 bytes)
*** Preparing to test memory region 7ffff7e63000 (16384 bytes)
*** Preparing to test memory region 7ffff7fc6000 (16384 bytes)
*** Preparing to test memory region 7ffff7ffb000 (4096 bytes)
*** Preparing to test memory region 7ffff7ffe000 (4096 bytes)
Additional information
Steps to reproduce (if any)
The input involves several reconfigurations: first, the other 2 nodes become replicas of this node; later, one of them fails over and becomes the new master; after this node becomes a replica, it gets soft reset (just before this, it is instructed to send itself a MEET message), and it eventually crashes when the other replica connects to it.
I am looking into this and trying to figure out the root cause.
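The key part of this sequence could look roughly like the following with redis-cli. This is an illustrative sketch, not a deterministic repro: the ports (7000/7001) are assumptions, and the crash depends on the MEET message being delivered late.

```shell
# Illustrative sketch of the sequence above (not a deterministic
# repro -- the crash depends on the stale MEET being delayed).
# Ports 7000/7001 are assumptions, not from the original report.

# Make the node a replica of another master in the cluster.
redis-cli -p 7000 CLUSTER REPLICATE "$(redis-cli -p 7001 CLUSTER MYID)"

# Ask the node to MEET its own address, queueing a MEET message.
redis-cli -p 7000 CLUSTER MEET 127.0.0.1 7000

# Soft-reset the node; it becomes an empty master again.
redis-cli -p 7000 CLUSTER RESET SOFT

# If the stale MEET now arrives after the reset, the node marks
# itself as a replica while slaveof stays NULL; the next gossip
# message from another node can then trigger the segfault.
```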
Comment From: Congyu-Liu
I think I managed to figure out how this bug happens:
1. let a node become a replica
2. ask the node to MEET itself, which then sends a MEET message
3. soft reset the node, which turns it back into a master
4. the stale MEET message arrives; in clusterProcessPacket, sender is determined to be myself, whose flags are updated to mark it a replica; now myself->flags says replica, but myself->slaveof is still NULL
5. when another node's gossip message arrives, clusterUpdateSlotsConfigWith sets curmaster to NULL (from myself->slaveof), eventually causing the segfault at cluster.c:1946
Comment From: enjoy-binbin
@Congyu-Liu thanks for the report, and sorry for the late response.
By any luck, do you still remember the cluster topology? Were all 3 nodes masters, or something else? I may also need more info: for example, in step 1 above, what command was used to turn the node into a replica? Was the node originally a master in the cluster, or was it a newly added node?
Comment From: Congyu-Liu
@enjoy-binbin Hello! Thanks for your response. IIRC, all 3 nodes were initially masters. In step 1, I used CLUSTER REPLICATE to turn a node into a replica. The node was originally in the cluster.
Feel free to reach out to me for more information.
Comment From: sundb
@Congyu-Liu did you change the cluster nodes config manually? Do you still have the cluster nodes config file of the crashed server?
Comment From: sundb
@Congyu-Liu how could you get a stale MEET message after the node turned into a replica? Thanks.
Comment From: Congyu-Liu
Hi @sundb! Sorry for the late reply.
did you change the cluster nodes config manually?
I used the CLUSTER REPLICATE and CLUSTER RESET commands (assuming you are asking how I changed the node roles).
do you still keep the cluster nodes config file of the crash server?
Here are the command-line arguments I used for each node:
--protected-mode no --port 6379 --cluster-enabled yes --loglevel verbose
how can you get a stale meet message after the node turns to a replica?
The MEET message was delayed; it arrived after the node had turned into a replica.
Let me know if you have more questions, and I am happy to help.