Running Redis 7.0.0-rc2 on Ubuntu in AWS.
During a rebalance operation after adding nodes, the master that owned the slots being moved failed with the assertion below on every rebalance attempt of the cluster.
=== REDIS BUG REPORT START: Cut & paste starting from here ===
9412:M 18 Jul 2022 15:59:55.278 # === ASSERTION FAILED ===
9412:M 18 Jul 2022 15:59:55.278 # ==> blocked.c:624 'server.also_propagate.numops == 0' is not true
------ STACK TRACE ------
Backtrace:
/usr/local/bin/redis-server *:6379 [cluster](handleClientsBlockedOnKeys+0x254)[0x4cecb4]
/usr/local/bin/redis-server *:6379 [cluster](beforeSleep+0xe8)[0x446c58]
/usr/local/bin/redis-server *:6379 [cluster](aeProcessEvents+0x3aa)[0x442b2a]
/usr/local/bin/redis-server *:6379 [cluster](aeMain+0x1d)[0x442d4d]
/usr/local/bin/redis-server *:6379 [cluster](main+0x409)[0x43ecc9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fa725305840]
/usr/local/bin/redis-server *:6379 [cluster](_start+0x29)[0x43f099]
------ INFO OUTPUT ------
# Server
redis_version:6.9.242
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:50a29168b229999b
redis_mode:cluster
os:Linux 4.4.0-1060-aws x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:5.4.0
process_id:9412
process_supervised:no
run_id:5762aa4fe3d56afa0a26fa03a8dc80727c19652a
tcp_port:6379
server_time_usec:1658159995278425
uptime_in_seconds:7266858
uptime_in_days:84
hz:10
configured_hz:10
lru_clock:13992827
executable:/usr/local/bin/redis-server
config_file:/etc/redis/6379.conf
io_threads_active:0
# Clients
connected_clients:1341
cluster_connections:70
maxclients:150000
client_recent_max_input_buffer:20480
client_recent_max_output_buffer:20504
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:45638469256
used_memory_human:42.50G
used_memory_rss:57257009152
used_memory_rss_human:53.32G
used_memory_peak:59139479568
used_memory_peak_human:55.08G
used_memory_peak_perc:77.17%
used_memory_overhead:29487727300
used_memory_startup:9423688
used_memory_dataset:16150741956
used_memory_dataset_perc:35.40%
allocator_allocated:45639175592
allocator_active:56496828416
allocator_resident:57283624960
total_system_memory:133661978624
total_system_memory_human:124.48G
used_memory_lua:37888
used_memory_vm_eval:37888
used_memory_lua_human:37.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:37888
used_memory_vm_total:75776
used_memory_vm_total_human:74.00K
used_memory_functions:184
used_memory_scripts:184
used_memory_scripts_human:184B
maxmemory:106929582899
maxmemory_human:99.59G
maxmemory_policy:noeviction
allocator_frag_ratio:1.24
allocator_frag_bytes:10857652824
allocator_rss_ratio:1.01
allocator_rss_bytes:786796544
rss_overhead_ratio:1.00
rss_overhead_bytes:-26615808
mem_fragmentation_ratio:1.25
mem_fragmentation_bytes:11618329144
mem_not_counted_for_evict:3584
mem_replication_backlog:104858916
mem_total_replication_buffers:105206024
mem_clients_slaves:348424
mem_clients_normal:7086552
mem_cluster_links:76160
mem_aof_buffer:3584
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:10423
# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:21839092757
rdb_bgsave_in_progress:0
rdb_last_save_time:1650893137
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:410
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:6788845568
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:120527186
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:433
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:1626
aof_last_write_status:ok
aof_last_cow_size:31108997120
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:16708212897
aof_base_size:15920075489
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:2
# Stats
total_connections_received:1493483
total_commands_processed:21833443018
instantaneous_ops_per_sec:4734
total_net_input_bytes:1661867530597
total_net_output_bytes:21996788022
instantaneous_input_kbps:881.31
instantaneous_output_kbps:1865.17
rejected_connections:0
sync_full:1
sync_partial_ok:1
sync_partial_err:1
expired_keys:1009194
expired_stale_perc:11.69
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:4354
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:18147550
keyspace_misses:4305787
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:792783
total_forks:1627
migrate_cached_sockets:1
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:1139
dump_payload_sanitizations:0
total_reads_processed:10690047223
total_writes_processed:34801822
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:27565
reply_buffer_expands:3151
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=20.0.83.95,port=6379,state=online,offset=44091096078,lag=0
slave1:ip=20.0.83.12,port=6379,state=online,offset=44091048369,lag=0
master_failover_state:no-failover
master_replid:440b19326917cca6521c88c32af2eaa8336145c9
master_replid2:f7e7603168b44d0a9db4fdd96cf1b907beb3ea41
master_repl_offset:44091170424
second_repl_offset:43585485265
repl_backlog_active:1
repl_backlog_size:104857600
repl_backlog_first_byte_offset:43986311517
repl_backlog_histlen:104858908
# CPU
used_cpu_sys:137268.968000
used_cpu_user:180039.168000
used_cpu_sys_children:39060.212000
used_cpu_user_children:782172.496000
used_cpu_sys_main_thread:133114.124000
used_cpu_user_main_thread:179873.748000
# Modules
# Commandstats
cmdstat_quit:calls=125459,usec=50132,usec_per_call=0.40,rejected_calls=0,failed_calls=0
cmdstat_function|dump:calls=3,usec=10,usec_per_call=3.33,rejected_calls=0,failed_calls=0
cmdstat_config|get:calls=484333,usec=1952954,usec_per_call=4.03,rejected_calls=0,failed_calls=0
cmdstat_client|setname:calls=29306,usec=71317,usec_per_call=2.43,rejected_calls=0,failed_calls=0
cmdstat_psync:calls=2,usec=792987,usec_per_call=396493.50,rejected_calls=0,failed_calls=0
cmdstat_select:calls=7,usec=7,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_dbsize:calls=21,usec=20,usec_per_call=0.95,rejected_calls=0,failed_calls=0
cmdstat_readonly:calls=24241,usec=13403,usec_per_call=0.55,rejected_calls=0,failed_calls=0
cmdstat_replconf:calls=2367,usec=1441,usec_per_call=0.61,rejected_calls=0,failed_calls=0
cmdstat_get:calls=16108776,usec=52277307,usec_per_call=3.25,rejected_calls=137,failed_calls=0
cmdstat_ping:calls=838630,usec=269615,usec_per_call=0.32,rejected_calls=32,failed_calls=0
cmdstat_multi:calls=1095071,usec=216485,usec_per_call=0.20,rejected_calls=0,failed_calls=0
cmdstat_migrate:calls=228217,usec=44765074,usec_per_call=196.15,rejected_calls=0,failed_calls=2
cmdstat_del:calls=10735503943,usec=51089041306,usec_per_call=4.76,rejected_calls=0,failed_calls=0
cmdstat_info:calls=1098655,usec=72714853,usec_per_call=66.19,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=776,usec=85949,usec_per_call=110.76,rejected_calls=3,failed_calls=0
cmdstat_cluster|forget:calls=31,usec=2708,usec_per_call=87.35,rejected_calls=0,failed_calls=0
cmdstat_cluster|countkeysinslot:calls=4,usec=7,usec_per_call=1.75,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=484335,usec=20396392,usec_per_call=42.11,rejected_calls=11,failed_calls=0
cmdstat_cluster|getkeysinslot:calls=228229,usec=1237624,usec_per_call=5.42,rejected_calls=0,failed_calls=0
cmdstat_cluster|replicate:calls=2,usec=226,usec_per_call=113.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|setslot:calls=30,usec=109,usec_per_call=3.63,rejected_calls=0,failed_calls=0
cmdstat_cluster|meet:calls=1,usec=12,usec_per_call=12.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|reset:calls=1,usec=232112873,usec_per_call=232112880.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|slots:calls=4517,usec=338211,usec_per_call=74.88,rejected_calls=0,failed_calls=0
cmdstat_cluster|delslots:calls=2,usec=55,usec_per_call=27.50,rejected_calls=0,failed_calls=0
cmdstat_debug:calls=0,usec=0,usec_per_call=0.00,rejected_calls=2,failed_calls=0
cmdstat_set:calls=11076073412,usec=55942994477,usec_per_call=5.05,rejected_calls=952,failed_calls=0
cmdstat_restore-asking:calls=17576,usec=39680,usec_per_call=2.26,rejected_calls=0,failed_calls=0
cmdstat_exec:calls=1095071,usec=49235816,usec_per_call=44.96,rejected_calls=0,failed_calls=0
# Errorstats
errorstat_ASK:count=161
errorstat_ERR:count=4
errorstat_LOADING:count=46
errorstat_MOVED:count=928
# Latencystats
latency_percentiles_usec_quit:p50=0.001,p99=6.015,p99.9=9.023
latency_percentiles_usec_function|dump:p50=3.007,p99=4.015,p99.9=4.015
latency_percentiles_usec_config|get:p50=3.007,p99=13.055,p99.9=74.239
latency_percentiles_usec_client|setname:p50=2.007,p99=10.047,p99.9=22.015
latency_percentiles_usec_psync:p50=39.167,p99=796917.759,p99.9=796917.759
latency_percentiles_usec_select:p50=1.003,p99=2.007,p99.9=2.007
latency_percentiles_usec_dbsize:p50=1.003,p99=2.007,p99.9=2.007
latency_percentiles_usec_readonly:p50=1.003,p99=2.007,p99.9=9.023
latency_percentiles_usec_replconf:p50=1.003,p99=1.003,p99.9=6.015
latency_percentiles_usec_get:p50=3.007,p99=10.047,p99.9=21.119
latency_percentiles_usec_ping:p50=0.001,p99=4.015,p99.9=11.007
latency_percentiles_usec_multi:p50=0.001,p99=1.003,p99.9=10.047
latency_percentiles_usec_migrate:p50=214.015,p99=280.575,p99.9=2883.583
latency_percentiles_usec_del:p50=4.015,p99=19.071,p99.9=44.031
latency_percentiles_usec_info:p50=64.255,p99=116.223,p99.9=187.391
latency_percentiles_usec_cluster|nodes:p50=105.471,p99=218.111,p99.9=242.687
latency_percentiles_usec_cluster|forget:p50=89.087,p99=122.367,p99.9=122.367
latency_percentiles_usec_cluster|countkeysinslot:p50=2.007,p99=2.007,p99.9=2.007
latency_percentiles_usec_cluster|info:p50=40.191,p99=65.023,p99.9=79.359
latency_percentiles_usec_cluster|getkeysinslot:p50=5.023,p99=10.047,p99.9=15.039
latency_percentiles_usec_cluster|replicate:p50=87.039,p99=139.263,p99.9=139.263
latency_percentiles_usec_cluster|setslot:p50=3.007,p99=6.015,p99.9=6.015
latency_percentiles_usec_cluster|meet:p50=12.031,p99=12.031,p99.9=12.031
latency_percentiles_usec_cluster|reset:p50=1002438.655,p99=1002438.655,p99.9=1002438.655
latency_percentiles_usec_cluster|slots:p50=59.135,p99=225.279,p99.9=323.583
latency_percentiles_usec_cluster|delslots:p50=20.095,p99=35.071,p99.9=35.071
latency_percentiles_usec_set:p50=3.007,p99=25.087,p99.9=145.407
latency_percentiles_usec_restore-asking:p50=2.007,p99=11.007,p99.9=22.015
latency_percentiles_usec_exec:p50=40.191,p99=101.375,p99.9=137.215
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=259699940,expires=259699940,avg_ttl=91233303
------ CLIENT LIST OUTPUT ------
...
------ MODULES INFO OUTPUT ------
------ CONFIG DEBUG OUTPUT ------
sanitize-dump-payload no
repl-diskless-sync yes
slave-read-only yes
io-threads-do-reads no
lazyfree-lazy-server-del no
client-query-buffer-limit 1gb
proto-max-bulk-len 512mb
list-compress-depth 0
replica-read-only yes
lazyfree-lazy-user-flush no
io-threads 1
repl-diskless-load disabled
lazyfree-lazy-expire no
lazyfree-lazy-user-del no
activedefrag no
lazyfree-lazy-eviction no
------ FAST MEMORY TEST ------
9412:M 18 Jul 2022 15:59:55.339 # Bio thread for job type #0 terminated
9412:M 18 Jul 2022 15:59:55.339 # Bio thread for job type #1 terminated
9412:M 18 Jul 2022 15:59:55.339 # Bio thread for job type #2 terminated
*** Preparing to test memory region 893000 (2297856 bytes)
*** Preparing to test memory region 2554000 (135168 bytes)
*** Preparing to test memory region 7f95cc800000 (8388608 bytes)
*** Preparing to test memory region 7f95cd000000 (28169080832 bytes)
*** Preparing to test memory region 7f9c60200000 (2097152 bytes)
*** Preparing to test memory region 7f9c60600000 (8388608 bytes)
*** Preparing to test memory region 7f9c60e00000 (2097152 bytes)
*** Preparing to test memory region 7f9c61200000 (8388608 bytes)
*** Preparing to test memory region 7f9c61a00000 (46172995584 bytes)
*** Preparing to test memory region 7fa721cfd000 (8388608 bytes)
*** Preparing to test memory region 7fa7224fe000 (8388608 bytes)
*** Preparing to test memory region 7fa722cff000 (8388608 bytes)
*** Preparing to test memory region 7fa723500000 (8388608 bytes)
*** Preparing to test memory region 7fa723d00000 (22020096 bytes)
*** Preparing to test memory region 7fa7256ab000 (16384 bytes)
*** Preparing to test memory region 7fa7258c8000 (16384 bytes)
*** Preparing to test memory region 7fa7261f5000 (24576 bytes)
*** Preparing to test memory region 7fa726208000 (4096 bytes)
.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O
Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Comment From: guybe7
@tbream are you using modules?
Comment From: tbream
@guybe7, we are not using any modules.
Comment From: guybe7
@tbream can you please provide some 20-30 log lines before the crash report?
Comment From: tbream
@guybe7 here are the log lines from before the crash report.
9412:S 18 Jul 2022 15:07:24.073 * Background append only file rewriting started by pid 9385
9385:C 18 Jul 2022 15:14:34.293 * SYNC append only file rewrite performed
9385:C 18 Jul 2022 15:14:34.936 * Fork CoW for AOF rewrite: current 29667 MB, peak 29667 MB, average 19238 MB
9412:S 18 Jul 2022 15:14:37.533 * Background AOF rewrite terminated with success
9412:S 18 Jul 2022 15:14:37.536 * Removing the history file appendonly.aof.1626.incr.aof in the background
9412:S 18 Jul 2022 15:14:37.536 * Removing the history file appendonly.aof.1626.base.rdb in the background
9412:S 18 Jul 2022 15:14:37.540 * Background AOF rewrite finished successfully
9412:S 18 Jul 2022 15:24:53.532 * FAIL message received from b377ebf66a3453ab8f050dcee11576d1e4ccb369 about 002132b04826f925bd1fda29366e5732767f975d
9412:S 18 Jul 2022 15:24:53.592 # Start of election delayed for 553 milliseconds (rank #0, offset 43585485264).
9412:S 18 Jul 2022 15:24:54.194 # Starting a failover election for epoch 103.
9412:S 18 Jul 2022 15:24:54.202 # Failover election won: I'm the new master.
9412:S 18 Jul 2022 15:24:54.202 # configEpoch set to 103 after successful failover
9412:M 18 Jul 2022 15:24:54.202 # Connection with master lost.
9412:M 18 Jul 2022 15:24:54.202 * Caching the disconnected master state.
9412:M 18 Jul 2022 15:24:54.202 * Discarding previously cached master state.
9412:M 18 Jul 2022 15:24:54.202 # Setting secondary replication ID to f7e7603168b44d0a9db4fdd96cf1b907beb3ea41, valid up to offset: 43585485265. New replication ID is 440b19326917cca6521c88c32af2eaa8336145c9
9412:M 18 Jul 2022 15:24:54.210 * Replica 20.0.83.95:6379 asks for synchronization
9412:M 18 Jul 2022 15:24:54.210 * Partial resynchronization request from 20.0.83.95:6379 accepted. Sending 0 bytes of backlog starting from offset 43585485265.
9412:M 18 Jul 2022 15:35:02.692 * Replica 20.0.83.12:6379 asks for synchronization
9412:M 18 Jul 2022 15:35:02.692 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '46afee26b05b5b65a75e0a1df05bb93b36f9226b', my replication IDs are '440b19326917cca6521c88c32af2eaa8336145c9' and 'f7e7603168b44d0a9db4fdd96cf1b907beb3ea41')
9412:M 18 Jul 2022 15:35:02.692 * Starting BGSAVE for SYNC with target: replicas sockets
9412:M 18 Jul 2022 15:35:03.484 * Background RDB transfer started by pid 12787
9412:M 18 Jul 2022 15:35:03.585 * Clear FAIL state for node 002132b04826f925bd1fda29366e5732767f975d: replica is reachable again.
12787:C 18 Jul 2022 15:41:52.597 * Fork CoW for RDB: current 6474 MB, peak 6474 MB, average 3676 MB
9412:M 18 Jul 2022 15:41:52.597 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
9412:M 18 Jul 2022 15:41:53.910 * Background RDB transfer terminated with success
9412:M 18 Jul 2022 15:41:53.910 * Streamed RDB transfer with replica 20.0.83.12:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
9412:M 18 Jul 2022 15:41:53.910 * Synchronization with replica 20.0.83.12:6379 succeeded
9412:M 18 Jul 2022 15:42:13.579 * FAIL message received from ecf9f6a90b3374c94a3bd361d3a961231dd4792e about 002132b04826f925bd1fda29366e5732767f975d
9412:M 18 Jul 2022 15:49:17.561 * Clear FAIL state for node 002132b04826f925bd1fda29366e5732767f975d: replica is reachable again.
Comment From: tbream
@guybe7, is there any guidance or advice after seeing the logs? Is the best course of action to upgrade the cluster to the latest release, or is this a new bug?
Our cluster is almost balanced, but this prevents us from having an equal number of slots on each master.
Comment From: guybe7
@tbream does it reproduce consistently?
Comment From: tbream
@guybe7, it was reproducing each time we ran. We tried three times before stopping, since the maximum difference in slots was roughly 6%.
Comment From: guybe7
@tbream I know it's a bit unusual, but is there any chance I can send you a patch with some extra logs so you can compile Redis and run it?
So far we have been unable to reproduce it...
But maybe before we do that you can try an official release instead of an RC? 7.0.4, for example; maybe we already solved it and we don't even know :)
Comment From: tbream
@guybe7, we will update to 7.0.4 in our next sprint, and I will get back to you if the error goes away.
Comment From: guybe7
@tbream please get back to me even if the error stays :)
link to 7.0.4: https://github.com/redis/redis/releases/tag/7.0.4
Comment From: guybe7
hi @tbream any luck?
Comment From: tbream
We are running the cluster in a mixed state with 12 nodes and 2 replicas each. Each master node was on 7.0.4 while the replicas were still on the RC2 release. The rebalance failed after 12 slots had moved. We are going to try again, and if it fails again we will upgrade the cluster completely and retry.
Some important notes: our workload only uses string data encoded in binary format. We also run a heavy write workload and store our data with a 48-hour expiration.
=== REDIS BUG REPORT START: Cut & paste starting from here ===
13285:M 17 Aug 2022 14:32:57.806 # === ASSERTION FAILED ===
13285:M 17 Aug 2022 14:32:57.806 # ==> blocked.c:624 'server.also_propagate.numops == 0' is not true
------ STACK TRACE ------
Backtrace:
/usr/local/bin/redis-server *:6379 [cluster](handleClientsBlockedOnKeys+0x255)[0x55735af3aea5]
/usr/local/bin/redis-server *:6379 [cluster](beforeSleep+0x125)[0x55735aeac365]
/usr/local/bin/redis-server *:6379 [cluster](aeProcessEvents+0x8f)[0x55735aea7eaf]
/usr/local/bin/redis-server *:6379 [cluster](aeMain+0x1d)[0x55735aea841d]
/usr/local/bin/redis-server *:6379 [cluster](main+0x342)[0x55735aea4012]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f6a7a68ec87]
/usr/local/bin/redis-server *:6379 [cluster](_start+0x2a)[0x55735aea46ca]
------ INFO OUTPUT ------
# Server
redis_version:7.0.4
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:749f16780445eb0d
redis_mode:cluster
os:Linux 5.4.0-1083-aws x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:7.5.0
process_id:13285
process_supervised:systemd
run_id:377005791680c672957cc5376a370a7f8ca5492c
tcp_port:6379
server_time_usec:1660746777805837
uptime_in_seconds:148541
uptime_in_days:1
hz:10
configured_hz:10
lru_clock:16579609
executable:/usr/local/bin/redis-server
config_file:/etc/redis/6379.conf
io_threads_active:0
# Clients
connected_clients:972
cluster_connections:94
maxclients:150000
client_recent_max_input_buffer:20480
client_recent_max_output_buffer:20504
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:55864024944
used_memory_human:52.03G
used_memory_rss:57221574656
used_memory_rss_human:53.29G
used_memory_peak:56573307200
used_memory_peak_human:52.69G
used_memory_peak_perc:98.75%
used_memory_overhead:33877639840
used_memory_startup:9522656
used_memory_dataset:21986385104
used_memory_dataset_perc:39.36%
allocator_allocated:55864167080
allocator_active:56616329216
allocator_resident:57248256000
total_system_memory:133545504768
total_system_memory_human:124.37G
used_memory_lua:31744
used_memory_vm_eval:31744
used_memory_lua_human:31.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:64512
used_memory_vm_total_human:63.00K
used_memory_functions:184
used_memory_scripts:184
used_memory_scripts_human:184B
maxmemory:80127302860
maxmemory_human:74.62G
maxmemory_policy:volatile-ttl
allocator_frag_ratio:1.01
allocator_frag_bytes:752162136
allocator_rss_ratio:1.01
allocator_rss_bytes:631926784
rss_overhead_ratio:1.00
rss_overhead_bytes:-26681344
mem_fragmentation_ratio:1.02
mem_fragmentation_bytes:1357597976
mem_not_counted_for_evict:1792
mem_replication_backlog:209717848
mem_total_replication_buffers:210391544
mem_clients_slaves:676344
mem_clients_normal:7706952
mem_cluster_links:102272
mem_aof_buffer:1792
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:243322970
# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:327874249
rdb_bgsave_in_progress:0
rdb_last_save_time:1660598236
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:0
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:258914843
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:532
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:30
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:10088341504
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:21965410581
aof_base_size:21226510422
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:41832
total_commands_processed:331668272
instantaneous_ops_per_sec:4008
total_net_input_bytes:60974351624
total_net_output_bytes:44776370873
total_net_repl_input_bytes:48562434759
total_net_repl_output_bytes:43910213916
instantaneous_input_kbps:542.12
instantaneous_output_kbps:1591.75
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:1444.08
rejected_connections:0
sync_full:0
sync_partial_ok:3
sync_partial_err:0
expired_keys:74351009
expired_stale_perc:15.31
expired_time_cap_reached_count:67
expire_cycle_cpu_milliseconds:371542
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:6390349
keyspace_misses:102817387
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:880244
total_forks:30
migrate_cached_sockets:1
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:2950
dump_payload_sanitizations:0
total_reads_processed:232467081
total_writes_processed:398041292
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:8041
reply_buffer_expands:177
# Replication
role:master
connected_slaves:3
min_slaves_good_slaves:3
slave0:ip=20.0.83.95,port=6379,state=online,offset=657187170092,lag=0
slave1:ip=20.0.83.12,port=6379,state=online,offset=657187221714,lag=0
slave2:ip=20.0.81.74,port=6379,state=online,offset=657187104672,lag=0
master_failover_state:no-failover
master_replid:95e5e13c9f6a6ed4f24969b612f9db6e3fa2fba4
master_replid2:4639133f0c2ff856a2b2014f9cb481d17a7bc865
master_repl_offset:657187307138
second_repl_offset:642550566777
repl_backlog_active:1
repl_backlog_size:209715200
repl_backlog_first_byte_offset:656977573759
repl_backlog_histlen:209733380
# CPU
used_cpu_sys:4130.038555
used_cpu_user:5044.700985
used_cpu_sys_children:832.636074
used_cpu_user_children:13167.762578
used_cpu_sys_main_thread:4051.275663
used_cpu_user_main_thread:4616.482400
# Modules
# Commandstats
cmdstat_cluster|getkeysinslot:calls=268155,usec=1555421,usec_per_call=5.80,rejected_calls=0,failed_calls=0
cmdstat_cluster|setslot:calls=25,usec=75,usec_per_call=3.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|slots:calls=53,usec=4513,usec_per_call=85.15,rejected_calls=0,failed_calls=0
cmdstat_cluster|failover:calls=1,usec=45,usec_per_call=45.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|replicate:calls=1,usec=162,usec_per_call=162.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=675,usec=80480,usec_per_call=119.23,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=9803,usec=525400,usec_per_call=53.60,rejected_calls=6,failed_calls=0
cmdstat_psync:calls=3,usec=190,usec_per_call=63.33,rejected_calls=0,failed_calls=0
cmdstat_config|get:calls=9802,usec=37921,usec_per_call=3.87,rejected_calls=0,failed_calls=0
cmdstat_config|set:calls=10,usec=53,usec_per_call=5.30,rejected_calls=0,failed_calls=1
cmdstat_client|setname:calls=8050,usec=16978,usec_per_call=2.11,rejected_calls=0,failed_calls=0
cmdstat_select:calls=1,usec=1,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_command|docs:calls=2,usec=1950,usec_per_call=975.00,rejected_calls=0,failed_calls=0
cmdstat_dbsize:calls=15,usec=9,usec_per_call=0.60,rejected_calls=0,failed_calls=0
cmdstat_ping:calls=12306,usec=5460,usec_per_call=0.44,rejected_calls=26,failed_calls=0
cmdstat_quit:calls=4499,usec=1243,usec_per_call=0.28,rejected_calls=0,failed_calls=0
cmdstat_unlink:calls=33476847,usec=115751069,usec_per_call=3.46,rejected_calls=0,failed_calls=0
cmdstat_set:calls=228660769,usec=983676173,usec_per_call=4.30,rejected_calls=2755,failed_calls=0
cmdstat_del:calls=62121094,usec=214596668,usec_per_call=3.45,rejected_calls=0,failed_calls=0
cmdstat_migrate:calls=268142,usec=119755939,usec_per_call=446.61,rejected_calls=0,failed_calls=0
cmdstat_replconf:calls=204406,usec=113893,usec_per_call=0.56,rejected_calls=0,failed_calls=0
cmdstat_get:calls=6599438,usec=10273174,usec_per_call=1.56,rejected_calls=161,failed_calls=0
cmdstat_info:calls=24175,usec=1571990,usec_per_call=65.03,rejected_calls=0,failed_calls=0
# Errorstats
errorstat_ASK:count=1075
errorstat_ERR:count=2
errorstat_LOADING:count=32
errorstat_MOVED:count=1841
# Latencystats
latency_percentiles_usec_cluster|getkeysinslot:p50=6.015,p99=11.007,p99.9=15.039
latency_percentiles_usec_cluster|setslot:p50=3.007,p99=6.015,p99.9=6.015
latency_percentiles_usec_cluster|slots:p50=82.431,p99=150.527,p99.9=181.247
latency_percentiles_usec_cluster|failover:p50=45.055,p99=45.055,p99.9=45.055
latency_percentiles_usec_cluster|replicate:p50=162.815,p99=162.815,p99.9=162.815
latency_percentiles_usec_cluster|nodes:p50=119.295,p99=184.319,p99.9=230.399
latency_percentiles_usec_cluster|info:p50=49.151,p99=110.079,p99.9=130.047
latency_percentiles_usec_psync:p50=70.143,p99=77.311,p99.9=77.311
latency_percentiles_usec_config|get:p50=4.015,p99=10.047,p99.9=28.031
latency_percentiles_usec_config|set:p50=5.023,p99=10.047,p99.9=10.047
latency_percentiles_usec_client|setname:p50=2.007,p99=10.047,p99.9=17.023
latency_percentiles_usec_select:p50=1.003,p99=1.003,p99.9=1.003
latency_percentiles_usec_command|docs:p50=974.847,p99=978.943,p99.9=978.943
latency_percentiles_usec_dbsize:p50=1.003,p99=1.003,p99.9=1.003
latency_percentiles_usec_ping:p50=0.001,p99=5.023,p99.9=11.007
latency_percentiles_usec_quit:p50=0.001,p99=1.003,p99.9=7.007
latency_percentiles_usec_unlink:p50=2.007,p99=19.071,p99.9=36.095
latency_percentiles_usec_set:p50=3.007,p99=21.119,p99.9=43.007
latency_percentiles_usec_del:p50=3.007,p99=16.063,p99.9=31.103
latency_percentiles_usec_migrate:p50=444.415,p99=602.111,p99.9=1056.767
latency_percentiles_usec_replconf:p50=1.003,p99=1.003,p99.9=4.015
latency_percentiles_usec_get:p50=1.003,p99=6.015,p99.9=17.023
latency_percentiles_usec_info:p50=64.255,p99=108.031,p99.9=134.143
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=313249715,expires=313249715,avg_ttl=83194746
------ CLIENT LIST OUTPUT ------
...
------ MODULES INFO OUTPUT ------
------ CONFIG DEBUG OUTPUT ------
lazyfree-lazy-expire yes
replica-read-only yes
lazyfree-lazy-server-del yes
proto-max-bulk-len 512mb
lazyfree-lazy-eviction yes
lazyfree-lazy-user-flush no
list-compress-depth 0
sanitize-dump-payload no
repl-diskless-load on-empty-db
activedefrag no
io-threads-do-reads no
io-threads 1
lazyfree-lazy-user-del no
client-query-buffer-limit 1gb
repl-diskless-sync yes
slave-read-only yes
------ FAST MEMORY TEST ------
13285:M 17 Aug 2022 14:32:57.814 # Bio thread for job type #0 terminated
13285:M 17 Aug 2022 14:32:57.814 # Bio thread for job type #1 terminated
13285:M 17 Aug 2022 14:32:57.814 # Bio thread for job type #2 terminated
*** Preparing to test memory region 55735b2fd000 (2301952 bytes)
*** Preparing to test memory region 55735d077000 (135168 bytes)
*** Preparing to test memory region 7f5747000000 (36188590080 bytes)
*** Preparing to test memory region 7f5fba200000 (8388608 bytes)
*** Preparing to test memory region 7f5fbaa00000 (2097152 bytes)
*** Preparing to test memory region 7f5fbae00000 (8388608 bytes)
*** Preparing to test memory region 7f5fbb600000 (2097152 bytes)
*** Preparing to test memory region 7f5fbba00000 (8388608 bytes)
*** Preparing to test memory region 7f5fbc200000 (46072332288 bytes)
*** Preparing to test memory region 7f6a7658a000 (8388608 bytes)
*** Preparing to test memory region 7f6a76d8b000 (8388608 bytes)
*** Preparing to test memory region 7f6a7758c000 (8388608 bytes)
*** Preparing to test memory region 7f6a77d8d000 (8388608 bytes)
*** Preparing to test memory region 7f6a7858d000 (13631488 bytes)
*** Preparing to test memory region 7f6a79400000 (8388608 bytes)
*** Preparing to test memory region 7f6a7a22a000 (4096 bytes)
*** Preparing to test memory region 7f6a7aa5a000 (16384 bytes)
*** Preparing to test memory region 7f6a7ac79000 (16384 bytes)
*** Preparing to test memory region 7f6a7af00000 (4096 bytes)
*** Preparing to test memory region 7f6a7b8c0000 (40960 bytes)
*** Preparing to test memory region 7f6a7b8d6000 (4096 bytes)
.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O
Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
Comment From: guybe7
@tbream ok, so the problem still exists... I have a setup in which the issue reproduces, but quite rarely. I'll try to add some more logs and run the tests in a loop. I'll keep you updated.
Comment From: tbream
Our second attempt today was able to successfully rebalance the remaining 71 nodes without failures.
Comment From: guybe7
update: I managed to find and solve the issue in my setup, but unfortunately it was module-related (different bug, same outcome).
Re this case: I'm pretty sure the problem is that a key is lazy-expired somewhere outside of call() and therefore leaves the also_propagate array full, causing the assertion.
I went over all uses of lookupKey* (which can lazy-expire) and the immediate suspect is getNodeByQuery, which is called in processCommand, which in turn can early-exit before reaching call().
I'm very unfamiliar with the cluster code, so forgive me if I'm saying nonsense: my guess is that during the rebalance, some GET reaches a node that currently contains the key even though the key no longer belongs there (its slot is being moved to another node). Perhaps in that case the node that handles the GET calls getNodeByQuery, lazy-expires the key, and rejects the command because it decides the key doesn't belong to it. I do see some rejected GETs in the bug report.
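For anyone following along, below is a minimal, self-contained C model of the failure path guessed at above. It is only an illustration of the theory, not Redis source code: every identifier is an invented stand-in, and the comments name the real pieces (processCommand, getNodeByQuery, call, beforeSleep, server.also_propagate) each part is meant to model.

```c
/* Toy model of the suspected sequence: a lazy expire performed by the
 * cluster redirect check queues a DEL into the pending-propagation array;
 * the command is then rejected before call() runs, nothing flushes the
 * array, and the "numops == 0" check at beforeSleep() time fails. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int also_propagate_numops = 0;   /* stands in for server.also_propagate.numops */
static bool key_present = true;
static bool key_logically_expired = true;

/* Models a lookupKey*() call: deleting an expired key queues a DEL for propagation. */
static bool lookup_key(void) {
    if (key_present && key_logically_expired) {
        key_present = false;
        also_propagate_numops++;        /* DEL queued, normally flushed by call() */
    }
    return key_present;
}

/* Models getNodeByQuery(): looks the key up to decide whether to redirect. */
static bool redirect_needed(void) {
    bool exists = lookup_key();         /* may lazy-expire as a side effect */
    return !exists;                     /* key is gone here -> redirect the client */
}

/* Models processCommand(): rejects the command before call() ever runs. */
static void process_command(void) {
    if (redirect_needed()) {
        printf("-ASK redirect sent, command rejected before call()\n");
        return;                         /* early exit: pending ops are never flushed */
    }
    also_propagate_numops = 0;          /* call() would execute and flush pending ops here */
}

/* Models the check in handleClientsBlockedOnKeys() reached from beforeSleep(). */
static void before_sleep(void) {
    assert(also_propagate_numops == 0); /* fires, mirroring blocked.c:624 */
}

int main(void) {
    process_command();
    before_sleep();
    return 0;
}
```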
I would say that we should add propagatePendingCommands at the end of getNodeByQuery, with a comment and a test. I'm not entirely sure, because I haven't dug deep enough yet.
Another option is to make getNodeByQuery avoid the lazy expire altogether. I feel that this is the way to go: the purpose of this function is to find the node to which the key belongs, and a side effect like deleting the key(s) is unwanted here.
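Continuing the toy model from the previous sketch (same invented names, still not a patch against the real code), the two directions above might look roughly like this; in real Redis, option 1 would correspond to calling propagatePendingCommands() at the end of getNodeByQuery(), and option 2 to making that lookup side-effect free.

```c
/* Option 1: flush whatever the lookup queued before rejecting the command. */
static bool redirect_needed_option1(void) {
    bool exists = lookup_key();         /* may still lazy-expire */
    also_propagate_numops = 0;          /* models propagatePendingCommands() */
    return !exists;
}

/* Option 2: make the redirect check a pure read, so an expired key is
 * reported as missing without being deleted or queueing anything. */
static bool redirect_needed_option2(void) {
    bool exists = key_present && !key_logically_expired;
    return !exists;
}
```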
@oranagra I will be on vacation until next Monday, please have a look; maybe it's better if we handle it ASAP.
Comment From: tbream
Our workload consists entirely of using SET with the EX, NX, and GET options.
Comment From: guybe7
@tbream ok, SET can also cause the same situation,
but I do see some GETs:
cmdstat_get:calls=6599438,usec=10273174,usec_per_call=1.56,rejected_calls=161,failed_calls=0
Comment From: tbream
Yes, I did forget that we also have a workload doing GETs, in addition to the SETs. Both of these are happening during the rebalance operation.
Comment From: oranagra
@madolson, please see Guy's https://github.com/redis/redis/issues/11014#issuecomment-1220618515. I feel that the right solution is to keep calling lookupKey, but make sure that it avoids the del + propagate (the same as we do when the pause of client writes is enabled).
Anyway, first things first: we need to reproduce it in a test; so far I have tried and failed.
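Expressed against the same toy model as in the earlier sketches (the flag name below is invented for illustration and is not an actual Redis LOOKUP_* flag), the idea is to keep calling the lookup but let this particular caller ask for a variant that reports an expired key as missing without deleting or propagating anything:

```c
/* Hypothetical flag, invented for the model only. */
#define MODEL_LOOKUP_NOEXPIRE (1 << 0)

static bool lookup_key_with_flags(int flags) {
    if (key_present && key_logically_expired) {
        if (flags & MODEL_LOOKUP_NOEXPIRE)
            return false;               /* report "missing", leave the key, queue nothing */
        key_present = false;
        also_propagate_numops++;        /* normal path: delete and queue the DEL for propagation */
    }
    return key_present;
}

/* The cluster redirect check would then pass the flag, staying free of side effects. */
static bool redirect_needed_noexpire(void) {
    return !lookup_key_with_flags(MODEL_LOOKUP_NOEXPIRE);
}
```

This keeps the redirect decision consistent with what a later call() would observe, while leaving the actual deletion and its propagation to code paths that run inside call().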
Comment From: madolson
Here is a reproduction: https://github.com/redis/redis/blob/e4e35a96f36e88d19270c123a6c831fe20363286/tests/unit/cluster/test.tcl
The theory is correct. If a key is expired while being accessed in getNodeByQuery, the expiration will be generated but propagatePendingCommands will not be triggered.