Crash report
=== REDIS BUG REPORT START: Cut & paste starting from here ===
15455:S 19 Oct 2023 15:30:38.116 # Redis 7.2.1 crashed by signal: 11, si_code: 1
15455:S 19 Oct 2023 15:30:38.116 # Accessing address: 0x50
15455:S 19 Oct 2023 15:30:38.116 # Crashed running the instruction at: 0x5675c7
------ STACK TRACE ------
EIP:
/usr/local/redis-7.2.1/redis-server *:7390 [cluster](dbDictAfterReplaceEntry+0x77)[0x5675c7]
Backtrace:
/lib64/libpthread.so.0(+0xf5e0)[0x7f4dcbeee5e0]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster](dbDictAfterReplaceEntry+0x77)[0x5675c7]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster][0x5681d0]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster](activeDefragCycle+0x723)[0x4c4a73]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster](databasesCron+0x52)[0x577bd2]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster](serverCron+0x25f)[0x57911f]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster][0x45f8be]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster](aeMain+0x145)[0x4cb3f5]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster](main+0x4d0)[0x456070]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4dcbb3dc05]
/usr/local/redis-7.2.1/redis-server *:7390 [cluster][0x4566ba]
------ REGISTERS ------
15455:S 19 Oct 2023 15:30:38.117 #
RAX:0000000000033b00 RBX:00007f4dc4c0b2a0
RCX:00007f4a87c4bae7 RDX:0000000000000000
RDI:00007f4a87c4bab3 RSI:00007f4a87c4bae7
RBP:00007f4dcb629900 RSP:00007ffef9b39820
R8 :0000000000000000 R9 :0000000006666667
R10:00007f4dcca1a740 R11:00007f4dcca1aa80
R12:00007f4a87c4bab3 R13:0000000000000000
R14:00007f4dcb629900 R15:0000000000000000
RIP:00000000005675c7 EFL:0000000000010206
CSGSFS:0000000000000033
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b3982f) -> 00007f4dcb629900
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b3982e) -> 00000000008fa740
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b3982d) -> 00000000004c4a73
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b3982c) -> 000000000ba90000
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b3982b) -> 00007f4dcb629900
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b3982a) -> 0000000000000000
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39829) -> 0000000001752000
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39828) -> 00007f4dcb62f2a0
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39827) -> ffffffffff752000
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39826) -> 00007f4dcb629900
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39825) -> 00000000004588b0
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39824) -> 0000000000000000
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39823) -> 00000000005681d0
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39822) -> 0000000000000000
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39821) -> 00007f4b55490000
15455:S 19 Oct 2023 15:30:38.117 # (00007ffef9b39820) -> 00007f4dc4c0b2a0
------ INFO OUTPUT ------
# Server
redis_version:7.2.1
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:c93627bd17bcfee
redis_mode:cluster
os:Linux 3.10.0-693.el7.x86_64 x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:4.8.5
process_id:15455
process_supervised:no
run_id:9cb501b162c397640a697c67d94164acd329b886
tcp_port:7390
server_time_usec:1697700638109759
uptime_in_seconds:1890
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:3201822
executable:/usr/local/redis-7.2.1/redis-server
config_file:/opt/data/redis-7.2.1/7390/redis.conf
io_threads_active:0
listener0:name=tcp,bind=*,bind=-::*,port=7390
# Clients
connected_clients:3
cluster_connections:10
maxclients:20000
client_recent_max_input_buffer:20480
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
total_blocking_keys:0
total_blocking_keys_on_nokey:0
# Memory
used_memory:7107782656
used_memory_human:6.62G
used_memory_rss:12982214656
used_memory_rss_human:12.09G
used_memory_peak:12867371624
used_memory_peak_human:11.98G
used_memory_peak_perc:55.24%
used_memory_overhead:2094369872
used_memory_startup:2137288
used_memory_dataset:5013412784
used_memory_dataset_perc:70.56%
allocator_allocated:7107895992
allocator_active:12867813376
allocator_resident:12991401984
total_system_memory:33448808448
total_system_memory_human:31.15G
used_memory_lua:31744
used_memory_vm_eval:31744
used_memory_lua_human:31.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:64512
used_memory_vm_total_human:63.00K
used_memory_functions:184
used_memory_scripts:184
used_memory_scripts_human:184B
maxmemory:21474836480
maxmemory_human:20.00G
maxmemory_policy:allkeys-lfu
allocator_frag_ratio:1.81
allocator_frag_bytes:5759917384
allocator_rss_ratio:1.01
allocator_rss_bytes:123588608
rss_overhead_ratio:1.00
rss_overhead_bytes:-9187328
mem_fragmentation_ratio:1.83
mem_fragmentation_bytes:5874433904
mem_not_counted_for_evict:0
mem_replication_backlog:1077129528
mem_total_replication_buffers:1077116128
mem_clients_slaves:0
mem_clients_normal:26272
mem_cluster_links:10720
mem_aof_buffer:0
mem_allocator:jemalloc-5.3.0
active_defrag_running:19
lazyfree_pending_objects:0
lazyfreed_objects:0
# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:39997271
rdb_bgsave_in_progress:0
rdb_last_save_time:1697698748
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:0
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:0
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
# Stats
total_connections_received:352
total_commands_processed:39997784
instantaneous_ops_per_sec:0
total_net_input_bytes:9913814462
total_net_output_bytes:559718
total_net_repl_input_bytes:9913805445
total_net_repl_output_bytes:0
instantaneous_input_kbps:0.03
instantaneous_output_kbps:0.03
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:3645
active_defrag_misses:20
active_defrag_key_hits:916
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:0
dump_payload_sanitizations:0
total_reads_processed:8197323
total_writes_processed:2205
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:3
reply_buffer_expands:0
eventloop_cycles:8210302
eventloop_duration_sum:166479367
eventloop_duration_cmd_sum:75742306
instantaneous_eventloop_cycles_per_sec:15
instantaneous_eventloop_duration_usec:69
acl_access_denied_auth:0
acl_access_denied_cmd:0
acl_access_denied_key:0
acl_access_denied_channel:0
# Replication
role:slave
master_host:
master_port:7380
master_link_status:up
master_last_io_seconds_ago:4
master_sync_in_progress:0
slave_read_repl_offset:9913805187
slave_repl_offset:9913805187
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover
master_replid:25b3854d029679334288fa8da4a51f98abe44585
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:9913805187
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1073741824
repl_backlog_first_byte_offset:8840054721
repl_backlog_histlen:1073750467
# CPU
used_cpu_sys:54.278523
used_cpu_user:166.850066
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
used_cpu_sys_main_thread:54.272030
used_cpu_user_main_thread:166.848998
# Modules
# Commandstats
cmdstat_set:calls=26664978,usec=45725409,usec_per_call=1.71,rejected_calls=0,failed_calls=0
cmdstat_info:calls=189,usec=6915,usec_per_call=36.59,rejected_calls=0,failed_calls=0
cmdstat_slowlog|get:calls=32,usec=178,usec_per_call=5.56,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=3,usec=438,usec_per_call=146.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=64,usec=3071,usec_per_call=47.98,rejected_calls=0,failed_calls=0
cmdstat_dbsize:calls=31,usec=69,usec_per_call=2.23,rejected_calls=0,failed_calls=0
cmdstat_ping:calls=188,usec=58,usec_per_call=0.31,rejected_calls=0,failed_calls=0
cmdstat_command|docs:calls=1,usec=1122,usec_per_call=1122.00,rejected_calls=0,failed_calls=0
cmdstat_del:calls=13332292,usec=30004998,usec_per_call=2.25,rejected_calls=0,failed_calls=0
cmdstat_config|set:calls=3,usec=22,usec_per_call=7.33,rejected_calls=0,failed_calls=0
cmdstat_config|get:calls=2,usec=26,usec_per_call=13.00,rejected_calls=0,failed_calls=0
cmdstat_select:calls=1,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
# Errorstats
# Latencystats
latency_percentiles_usec_set:p50=1.003,p99=13.055,p99.9=19.071
latency_percentiles_usec_info:p50=31.103,p99=110.079,p99.9=118.271
latency_percentiles_usec_slowlog|get:p50=4.015,p99=17.023,p99.9=17.023
latency_percentiles_usec_cluster|nodes:p50=147.455,p99=203.775,p99.9=203.775
latency_percentiles_usec_cluster|info:p50=43.007,p99=102.399,p99.9=161.791
latency_percentiles_usec_dbsize:p50=1.003,p99=14.015,p99.9=14.015
latency_percentiles_usec_ping:p50=0.001,p99=1.003,p99.9=2.007
latency_percentiles_usec_command|docs:p50=1122.303,p99=1122.303,p99.9=1122.303
latency_percentiles_usec_del:p50=2.007,p99=15.039,p99.9=20.095
latency_percentiles_usec_config|set:p50=5.023,p99=14.015,p99.9=14.015
latency_percentiles_usec_config|get:p50=5.023,p99=21.119,p99.9=21.119
latency_percentiles_usec_select:p50=0.001,p99=0.001,p99.9=0.001
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=13332686,expires=0,avg_ttl=0
# Cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:3
cluster_stats_messages_ping_sent:1887
cluster_stats_messages_pong_sent:1788
cluster_stats_messages_sent:3675
cluster_stats_messages_ping_received:1788
cluster_stats_messages_pong_received:1887
cluster_stats_messages_received:3675
total_cluster_links_buffer_limit_exceeded:0
------ MODULES INFO OUTPUT ------
------ CONFIG DEBUG OUTPUT ------
sanitize-dump-payload no
replica-read-only yes
lazyfree-lazy-eviction yes
list-compress-depth 0
lazyfree-lazy-server-del yes
slave-read-only yes
io-threads-do-reads no
repl-diskless-load on-empty-db
lazyfree-lazy-user-flush yes
repl-diskless-sync yes
activedefrag yes
lazyfree-lazy-user-del yes
proto-max-bulk-len 512mb
lazyfree-lazy-expire yes
client-query-buffer-limit 1gb
io-threads 1
------ FAST MEMORY TEST ------
15455:S 19 Oct 2023 15:30:38.118 # Bio worker thread #0 terminated
15455:S 19 Oct 2023 15:30:38.118 # Bio worker thread #1 terminated
15455:S 19 Oct 2023 15:30:38.118 # Bio worker thread #2 terminated
*** Preparing to test memory region 8dd000 (2273280 bytes)
*** Preparing to test memory region 10fe000 (135168 bytes)
*** Preparing to test memory region 7f4a56600000 (14191427584 bytes)
*** Preparing to test memory region 7f4da4500000 (397545472 bytes)
*** Preparing to test memory region 7f4dc017f000 (19398656 bytes)
*** Preparing to test memory region 7f4dc1c00000 (2097152 bytes)
*** Preparing to test memory region 7f4dc2800000 (6291456 bytes)
*** Preparing to test memory region 7f4dc2f55000 (8388608 bytes)
*** Preparing to test memory region 7f4dc3756000 (8388608 bytes)
*** Preparing to test memory region 7f4dc3f57000 (8388608 bytes)
*** Preparing to test memory region 7f4dc4757000 (5767168 bytes)
*** Preparing to test memory region 7f4dcb200000 (8388608 bytes)
*** Preparing to test memory region 7f4dcbeda000 (20480 bytes)
*** Preparing to test memory region 7f4dcc0f7000 (16384 bytes)
*** Preparing to test memory region 7f4dcca1a000 (16384 bytes)
*** Preparing to test memory region 7f4dcca28000 (4096 bytes)
*** Preparing to test memory region 7f4dcca29000 (4096 bytes)
*** Preparing to test memory region 7f4dcca2c000 (4096 bytes)
.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O
Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
------ DUMPING CODE AROUND EIP ------
Symbol: dbDictAfterReplaceEntry (base: 0x567550)
Module: /usr/local/redis-7.2.1/redis-server *:7390 [cluster] (base 0x400000)
$ xxd -r -p /tmp/dump.hex /tmp/dump.bin
$ objdump --adjust-vma=0x567550 -D -b binary -m i386:x86-64 /tmp/dump.bin
------
15455:S 19 Oct 2023 15:31:16.230 # dump of function (hexdump of 247 bytes):
8b05423d390085c07506c30f1f440000415440f6c60755534889f37565488b56204889fd488b46184885d27409f6c2077550488972184885c07415a8077543488958205b5d415cc30f1f8400000000004889dfe848f6ffff4989c44889c7e8fd5eefff4c89e789c6e8e359f0ff488b553889c048c1e004488b525048895c0208ebc1e898c5edff660f1f840000000000415641554154554889d55348837f28ff4889fb0f95c00fb6d00fb6c04c8d24c7498b4424084839c50f82cd0000000fb64c173231d24989f580f9ff7409b20848d3e24883ea084801d04839c50f87a9000000488b13488b42404885c00f84c10000004889dfffd0
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Additional information
- This happened in cluster mode; I don't know whether a replica in non-cluster mode would also crash.
Steps to reproduce:
1. Set activedefrag to no.
2. SET keys, then DEL keys to produce memory fragmentation.
3. Set activedefrag to yes on the replica.
4. The replica crashes.
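A minimal shell sketch of these steps, using the ports from the INFO output above (master on 7380, replica on 7390); the key pattern and key count are illustrative assumptions, not taken from the report:

# 1. Disable active defrag on every node (master and replica).
redis-cli -p 7380 config set activedefrag no
redis-cli -p 7390 config set activedefrag no
# 2. Write keys through the cluster, then delete every other one
#    to fragment memory (key names/count are assumptions).
for i in $(seq 1 1000000); do
  redis-cli -c -p 7380 set "key$i" "v$i" > /dev/null
done
for i in $(seq 1 2 1000000); do
  redis-cli -c -p 7380 del "key$i" > /dev/null
done
# 3. Enable active defrag on the replica only.
redis-cli -p 7390 config set activedefrag yes
# 4. Per the report, the replica then crashes in activeDefragCycle.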
Comment From: hpatro
@Nanciico Do you have information on allocator_frag_ratio from before you enabled activedefrag? Did you generate random data via DEBUG POPULATE, or load some specific data to reach this condition?
Comment From: hpatro
This issue seems to apply only to a cluster-enabled setup, as it crashes while updating the linked list that maintains slot-to-key information.
I had written some tests for #11695 to exercise defrag in cluster mode. They pass on 7.2.1: https://github.com/redis/redis/commit/6ff263b10d0399307e508d3b0564a1dd91bc30f0. Note: they only run on the primary, though.
Comment From: Nanciico
I did not generate keys via DEBUG POPULATE.
I just SET string keys into the cluster and randomly deleted them to produce memory fragmentation.
mem_fragmentation_ratio = 1.83
I expected it to happen only in cluster mode, but I do not know why it happens only on the replica.
Comment From: madolson
@zuiderkwast FYI, this might be related to some of the changes we made in 7.0 to the key->slot linked list, given that it only occurs in cluster mode.
Comment From: zuiderkwast
@madolson I'm not sure what you're referring to. The slot->keys API using dict entry metadata, which is now gone?
Comment From: hpatro
@zuiderkwast I think this slot-to-keys DLL (doubly linked list) was introduced in 7.0. The crash seems to happen while updating the DLL.
Comment From: hpatro
@Nanciico I tried to follow the steps and reproduce the scenario, but couldn't make the server crash. Would you be able to share the exact set of commands/data to load to reproduce this?
Comment From: Nanciico
@hpatro
1. Set activedefrag to no on all the nodes, including master and replica.
2. SET string keys until used memory reaches 7G (key: 26 bytes, value: 256 bytes).
3. Delete every other key, e.g. from
key1 key2 key3 key4 key5
to
key1 key3 key5
4. Set activedefrag to yes on the replica.
5. The replica crashes.
It only uses the SET and DEL commands.
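A hypothetical loop matching that workload; the master port follows the INFO output above, while the key format and count are assumptions (adjust N until used_memory approaches 7G):

PORT=7380                                    # master; writes replicate to 7390
N=20000000                                   # adjust until used_memory is ~7G
VALUE=$(printf 'x%.0s' $(seq 1 256))         # 256-byte value
for i in $(seq 1 "$N"); do
  printf -v KEY 'key:%022d' "$i"             # "key:" + 22 digits = 26-byte key
  redis-cli -c -p "$PORT" set "$KEY" "$VALUE" > /dev/null
done
for i in $(seq 1 2 "$N"); do                 # delete key 1, 3, 5, ...
  printf -v KEY 'key:%022d' "$i"
  redis-cli -c -p "$PORT" del "$KEY" > /dev/null
done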
Comment From: Nanciico
After setting activedefrag to yes on the replica, the replica does not respond to the CLUSTER NODES command, and its CPU rises to 100%.
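One way to observe that hang from the outside, assuming the replica port and PID from the report (7390 and 15455):

# Poll the replica while toggling activedefrag; if the event loop is stuck
# inside activeDefragCycle, these INFO calls stop responding as well.
watch -n1 "redis-cli -p 7390 info memory | grep -E 'active_defrag_running|mem_fragmentation_ratio'"
top -p 15455    # redis-server PID from the report; CPU pegs near 100%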
Comment From: hpatro
@Nanciico I tried the above suggestion but still haven't been able to reproduce it locally. Are you able to reproduce this consistently? If so, please provide the exact sequence of commands to help debug this further.