Describe the bug
A few days after we enabled activedefrag on our Redis cluster, all 6 primary nodes failed over to their replicas. They failed over one at a time over the course of about an hour, and each primary failed over roughly 3-5 seconds after an active defrag cycle started on its replica node. The nodes had completed successful defragmentations over the previous days with no issues. This happened on 2 of the 4 clusters we enabled activedefrag on before we disabled the feature. The clusters all have 6 primary nodes, each with one replica.
To reproduce
Enable active defrag on all nodes in a cluster and wait a while. I'm not sure how easily this can be reproduced, but it happened on multiple clusters in our environment.
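For reference, we enable it with something along these lines on each node (a sketch of our rollout, not the exact script; the port is just our default):
redis-cli -p 6379 config set activedefrag yes
redis-cli -p 6379 config get activedefrag    # confirm it now reports "yes"
redis-cli -p 6379 config rewrite             # persist the change to redis.conf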
Expected behavior
Active defrag running on a replica node should not cause a failover.
Additional information
Verbose level logs from one of the replicas:
S 26 Mar 2021 00:08:20.636 - Starting active defrag, frag=10%, frag_bytes=601371816, cpu=1%
S 26 Mar 2021 00:08:21.509 - Node 083f030577e5bd5dc9504036c471a67f197bc739 reported node ce0ac9ecb7177550972f884075ce4a34bb5d8840 as not reachable.
S 26 Mar 2021 00:08:22.506 - Node 51e234032a86d5027757efec2db6894f949d08b3 reported node ce0ac9ecb7177550972f884075ce4a34bb5d8840 as not reachable.
S 26 Mar 2021 00:08:22.659 - Node 96c11a34a6ef42a964aa738a858e418a78d1b7be reported node ce0ac9ecb7177550972f884075ce4a34bb5d8840 as not reachable.
S 26 Mar 2021 00:08:22.700 - Node 198f6f23e017962ba2372062ec5a2bea994cd732 reported node ce0ac9ecb7177550972f884075ce4a34bb5d8840 as not reachable.
S 26 Mar 2021 00:08:22.700 * Marking node ce0ac9ecb7177550972f884075ce4a34bb5d8840 as failing (quorum reached).
S 26 Mar 2021 00:08:22.703 - Node db77608ae6b80d23664e4e6c2e916452704df585 reported node ce0ac9ecb7177550972f884075ce4a34bb5d8840 as not reachable.
S 26 Mar 2021 00:08:22.743 # Start of election delayed for 923 milliseconds (rank #0, offset 66060669216).
S 26 Mar 2021 00:08:23.671 # Starting a failover election for epoch 18.
S 26 Mar 2021 00:08:23.680 # Failover election won: I'm the new master.
Logs from the corresponding primary node. The DB-stat log line that had been printed every 5 seconds stopped appearing around the time active defrag started on the replica, and then a burst of new client connections was accepted. It looks like the node slowed down or got stuck, which caused our clients to open new connections and max out their connection pools to that node.
M 26 Mar 2021 00:07:58.107 - DB 0: 12258508 keys (8388278 volatile) in 16777216 slots HT.
M 26 Mar 2021 00:08:03.121 - DB 0: 12258704 keys (8388444 volatile) in 16777216 slots HT.
M 26 Mar 2021 00:08:08.134 - DB 0: 12258614 keys (8388393 volatile) in 16777216 slots HT.
M 26 Mar 2021 00:08:13.148 - DB 0: 12258780 keys (8388520 volatile) in 16777216 slots HT.
M 26 Mar 2021 00:08:23.160 - Accepted 10.74.241.4:41246
M 26 Mar 2021 00:08:23.160 - Accepted 10.74.241.4:41248
M 26 Mar 2021 00:08:23.160 - Accepted 10.74.241.4:41250
M 26 Mar 2021 00:08:23.160 - Accepted 10.74.241.4:41252
M 26 Mar 2021 00:08:23.160 - Accepted 10.74.177.111:35586
...
S 26 Mar 2021 00:08:31.005 * Full resync from master: da5414463626bb6ea13fb931641f32829481e515:66063381306
Redis INFO / CLUSTER INFO output from a primary node in the cluster:
# Server
redis_version:6.0.5
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:db63ea56716d515f
redis_mode:cluster
os:Linux 3.10.0-1160.6.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:8.3.0
process_id:1
run_id:03a7f2a70bd8601000f37377b20b0251a429bc56
tcp_port:6379
uptime_in_seconds:5521854
uptime_in_days:63
hz:10
configured_hz:10
lru_clock:6172592
executable:/data/redis-server
config_file:/etc/redis/redis.conf
# Clients
connected_clients:677
client_recent_max_input_buffer:4
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:5997093992
used_memory_human:5.59G
used_memory_rss:6562820096
used_memory_rss_human:6.11G
used_memory_peak:6098125536
used_memory_peak_human:5.68G
used_memory_peak_perc:98.34%
used_memory_overhead:974546390
used_memory_startup:3869608
used_memory_dataset:5022547602
used_memory_dataset_perc:83.80%
allocator_allocated:5997794368
allocator_active:6484668416
allocator_resident:6585536512
total_system_memory:12428709888
total_system_memory_human:11.58G
used_memory_lua:41984
used_memory_lua_human:41.00K
used_memory_scripts:216
used_memory_scripts_human:216B
number_of_cached_scripts:1
maxmemory:6000000000
maxmemory_human:5.59G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:1.08
allocator_frag_bytes:486874048
allocator_rss_ratio:1.02
allocator_rss_bytes:100868096
rss_overhead_ratio:1.00
rss_overhead_bytes:-22716416
mem_fragmentation_ratio:1.09
mem_fragmentation_bytes:565095840
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:16986
mem_clients_normal:11482636
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:527374991
rdb_bgsave_in_progress:0
rdb_last_save_time:1611263474
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:1196924928
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
# Stats
total_connections_received:218523
total_commands_processed:711196011
instantaneous_ops_per_sec:984
total_net_input_bytes:78717853978
total_net_output_bytes:110573258327
instantaneous_input_kbps:118.74
instantaneous_output_kbps:250.05
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:1
expired_keys:1188098
expired_stale_perc:3.31
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:22711
evicted_keys:823034
keyspace_hits:138469946
keyspace_misses:10153408
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:142799
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:9112043
active_defrag_misses:173621778
active_defrag_key_hits:7654474
active_defrag_key_misses:35920459
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
# Replication
role:master
connected_slaves:1
slave0:ip=10.74.178.168,port=6379,state=online,offset=70772608236,lag=0
master_replid:1cd856df66e3af4e38cc13179026d1e751b8fa0a
master_replid2:c2511e4504865c8ad3fcf705a753687db04a69b9
master_repl_offset:70772658623
second_repl_offset:66028590979
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:70771610048
repl_backlog_histlen:1048576
# CPU
used_cpu_sys:30752.745015
used_cpu_user:21255.508407
used_cpu_sys_children:8.530935
used_cpu_user_children:83.261017
# Modules
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=12234916,expires=8345678,avg_ttl=150404561681
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:12
cluster_size:6
cluster_current_epoch:18
cluster_my_epoch:13
cluster_stats_messages_ping_sent:19561837
cluster_stats_messages_pong_sent:19698184
cluster_stats_messages_meet_sent:1
cluster_stats_messages_fail_sent:11
cluster_stats_messages_auth-req_sent:11
cluster_stats_messages_auth-ack_sent:5
cluster_stats_messages_update_sent:6
cluster_stats_messages_sent:39260055
cluster_stats_messages_ping_received:19698173
cluster_stats_messages_pong_received:19561767
cluster_stats_messages_fail_received:11
cluster_stats_messages_auth-req_received:5
cluster_stats_messages_auth-ack_received:5
cluster_stats_messages_received:39259961
Comment From: oranagra
@nosammai did you happen to have the latency monitor (LATENCY LATEST) enabled on these servers? Did you change any of the active defrag thresholds (other than enabling it)?
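If it happens again, something along these lines should show where the time is going (a rough sketch; the 100 ms threshold is just an example, and I believe the defrag cycle is reported as the active-defrag-cycle event):
redis-cli -p 6379 config set latency-monitor-threshold 100
redis-cli -p 6379 latency latest
redis-cli -p 6379 latency history active-defrag-cycle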
Comment From: nosammai
Sadly we didn't have the latency monitor enabled on those clusters at the time. I didn't change any of the default active defrag settings, I just turned the feature on. These are the values the nodes report:
active-defrag-cycle-min 1
active-defrag-cycle-max 25
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-max-scan-fields 1000
active-defrag-ignore-bytes 104857600
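Those are just what CONFIG GET returns; roughly, the equivalent checks from redis-cli would be something like the sketch below (port is our default), and turning the feature back off was a plain config set:
redis-cli -p 6379 config get 'active-defrag-*'
redis-cli -p 6379 config get activedefrag
redis-cli -p 6379 config set activedefrag no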