Crash report
I'm adding a mechanism to JedisCluster that periodically refreshes the routing table (https://github.com/redis/jedis/pull/3596), with a test that performs the following steps:
1. Build a three-node Redis cluster
2. Add a node
3. Delete a node
The crash occurs partway through this process. While the test runs, a background thread randomly sends CLUSTER SLOTS commands to Redis every 3 seconds. The complete reproduction steps are attached at the end.
The crash occurred in the code introduced by https://github.com/redis/redis/pull/10869.
23329:C 27 Oct 2023 14:01:02.313 # WARNING: Changing databases number from 16 to 1 since we are in cluster mode
23330:C 27 Oct 2023 14:01:02.313 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
23330:C 27 Oct 2023 14:01:02.313 * Redis version=255.255.255, bits=64, commit=372ea218, modified=0, pid=23330, just started
23330:C 27 Oct 2023 14:01:02.314 * Configuration loaded
23330:M 27 Oct 2023 14:01:02.314 * monotonic clock: POSIX clock_gettime
23330:M 27 Oct 2023 14:01:02.318 * Running mode=cluster, port=7379.
23330:M 27 Oct 2023 14:01:02.319 # WARNING: The TCP backlog setting of 511 cannot be enforced because kern.ipc.somaxconn is set to the lower value of 128.
23330:M 27 Oct 2023 14:01:02.319 * No cluster configuration found, I'm 2f3f80c8626e25fc593a1d76a19aa8dc16d693d5
23330:M 27 Oct 2023 14:01:02.356 * Server initialized
23330:M 27 Oct 2023 14:01:02.357 * Ready to accept connections tcp
23330:M 27 Oct 2023 14:01:24.777 * IP address for this node updated to 127.0.0.1
23330:M 27 Oct 2023 14:01:29.630 * Cluster state changed: ok
23330:M 27 Oct 2023 14:01:29.863 # Cluster state changed: fail
23330:M 27 Oct 2023 14:01:35.132 * Cluster state changed: ok
23330:M 27 Oct 2023 14:01:35.249 # Missing implement of connection type tls
23330:M 27 Oct 2023 14:01:35.305 # Cluster state changed: fail
23330:M 27 Oct 2023 14:01:40.520 * Cluster state changed: ok
23330:M 27 Oct 2023 14:01:40.604 # Cluster state changed: fail
23330:M 27 Oct 2023 14:01:45.822 * Cluster state changed: ok
23330:M 27 Oct 2023 14:01:45.906 # Cluster state changed: fail
23330:M 27 Oct 2023 14:01:51.160 * Cluster state changed: ok
23330:M 27 Oct 2023 14:01:51.393 # Cluster state changed: fail
23330:M 27 Oct 2023 14:01:56.546 * Cluster state changed: ok
23330:M 27 Oct 2023 14:01:56.684 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:01.933 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:02.014 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:07.484 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:07.627 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:12.692 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:12.942 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:18.159 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:18.223 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:23.437 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:23.549 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:28.821 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:34.999 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:40.264 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:40.446 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:45.667 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:45.791 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:51.040 * Cluster state changed: ok
23330:M 27 Oct 2023 14:02:53.164 # Cluster state changed: fail
23330:M 27 Oct 2023 14:02:58.368 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:04.137 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:09.386 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:09.504 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:14.787 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:21.915 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:27.150 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:27.287 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:32.537 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:32.633 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:37.859 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:38.039 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:43.250 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:43.320 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:48.556 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:48.642 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:48.663 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:48.667 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:53.829 * Cluster state changed: ok
23330:M 27 Oct 2023 14:03:54.078 # Cluster state changed: fail
23330:M 27 Oct 2023 14:03:59.302 * Cluster state changed: ok
23330:M 27 Oct 2023 14:04:01.182 # Cluster state changed: fail
23330:M 27 Oct 2023 14:04:06.521 * Cluster state changed: ok
23330:M 27 Oct 2023 14:04:06.718 # Cluster state changed: fail
23330:M 27 Oct 2023 14:04:11.935 * Cluster state changed: ok
23330:M 27 Oct 2023 14:04:12.091 # Cluster state changed: fail
=== REDIS BUG REPORT START: Cut & paste starting from here ===
23330:M 27 Oct 2023 14:04:12.402 # === ASSERTION FAILED ===
23330:M 27 Oct 2023 14:04:12.403 # ==> cluster.c:2655 'de != NULL' is not true
------ STACK TRACE ------
Backtrace:
0 redis-server 0x000000010082fbb0 clusterReadHandler + 10956
1 redis-server 0x00000001008bd27c connSocketEventHandler + 448
2 redis-server 0x000000010077c86c aeProcessEvents + 784
3 redis-server 0x000000010079fbe4 main + 21584
4 dyld 0x0000000100c5d08c start + 520
------ INFO OUTPUT ------
# Server
redis_version:255.255.255
redis_git_sha1:372ea218
redis_git_dirty:0
redis_build_id:a73a6dc291930496
redis_mode:cluster
os:Darwin 21.6.0 arm64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:kqueue
atomicvar_api:c11-builtin
gcc_version:4.2.1
process_id:23330
process_supervised:no
run_id:624777a7034c33802bde837ae9bc5ac792631003
tcp_port:7379
server_time_usec:1698386652401190
uptime_in_seconds:190
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:3887836
executable:/Users/yangbodong/git/my/jedis/redis-server
config_file:
io_threads_active:0
listener0:name=tcp,bind=*,bind=-::*,port=7379
# Clients
connected_clients:15
cluster_connections:6
maxclients:10000
client_recent_max_input_buffer:16896
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
total_blocking_keys:0
total_blocking_keys_on_nokey:0
# Memory
used_memory:4705760
used_memory_human:4.49M
used_memory_rss:10502144
used_memory_rss_human:10.02M
used_memory_peak:4958432
used_memory_peak_human:4.73M
used_memory_peak_perc:94.90%
used_memory_overhead:4267136
used_memory_startup:4209232
used_memory_dataset:438624
used_memory_dataset_perc:88.34%
allocator_allocated:4688560
allocator_active:10470400
allocator_resident:10470400
total_system_memory:34359738368
total_system_memory_human:32.00G
used_memory_lua:31744
used_memory_vm_eval:31744
used_memory_lua_human:31.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:64512
used_memory_vm_total_human:63.00K
used_memory_functions:216
used_memory_scripts:216
used_memory_scripts_human:216B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:2.23
allocator_frag_bytes:5781840
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:31744
mem_fragmentation_ratio:2.24
mem_fragmentation_bytes:5813584
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_total_replication_buffers:0
mem_clients_slaves:0
mem_clients_normal:44240
mem_cluster_links:13448
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:12
rdb_bgsave_in_progress:0
rdb_last_save_time:1698386462
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:0
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:0
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
# Stats
total_connections_received:52
total_commands_processed:3050
instantaneous_ops_per_sec:19
total_net_input_bytes:1610010
total_net_output_bytes:1431776
total_net_repl_input_bytes:0
total_net_repl_output_bytes:0
instantaneous_input_kbps:24.60
instantaneous_output_kbps:8.22
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:2
dump_payload_sanitizations:0
total_reads_processed:3146
total_writes_processed:3022
io_threaded_reads_processed:0
io_threaded_writes_processed:0
client_query_buffer_limit_disconnections:0
client_output_buffer_limit_disconnections:0
reply_buffer_shrinks:26
reply_buffer_expands:0
eventloop_cycles:6091
eventloop_duration_sum:6129025
eventloop_duration_cmd_sum:414021
instantaneous_eventloop_cycles_per_sec:37
instantaneous_eventloop_duration_usec:831
acl_access_denied_auth:0
acl_access_denied_cmd:0
acl_access_denied_key:0
acl_access_denied_channel:0
# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:e782ced0d0d65d728f68f3b85e4c64f2d0658425
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:1.451754
used_cpu_user:0.868631
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
# Modules
# Commandstats
cmdstat_flushdb:calls=27,usec=10152,usec_per_call=376.00,rejected_calls=0,failed_calls=0
cmdstat_auth:calls=52,usec=396,usec_per_call=7.62,rejected_calls=0,failed_calls=0
cmdstat_flushall:calls=26,usec=10275,usec_per_call=395.19,rejected_calls=0,failed_calls=0
cmdstat_cluster|forget:calls=1,usec=75,usec_per_call=75.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|links:calls=1,usec=9,usec_per_call=9.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|flushslots:calls=1,usec=123,usec_per_call=123.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|meet:calls=60,usec=1333,usec_per_call=22.22,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=2732,usec=372172,usec_per_call=136.23,rejected_calls=0,failed_calls=0
cmdstat_cluster|reset:calls=27,usec=13106,usec_per_call=485.41,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=8,usec=629,usec_per_call=78.62,rejected_calls=0,failed_calls=0
cmdstat_cluster|countkeysinslot:calls=1,usec=2,usec_per_call=2.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|addslots:calls=29,usec=4723,usec_per_call=162.86,rejected_calls=0,failed_calls=0
cmdstat_cluster|slots:calls=21,usec=963,usec_per_call=45.86,rejected_calls=0,failed_calls=0
cmdstat_set:calls=6,usec=14,usec_per_call=2.33,rejected_calls=2,failed_calls=0
cmdstat_client|setinfo:calls=52,usec=42,usec_per_call=0.81,rejected_calls=0,failed_calls=0
cmdstat_client|setname:calls=4,usec=4,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_client|getname:calls=2,usec=3,usec_per_call=1.50,rejected_calls=0,failed_calls=0
# Errorstats
errorstat_MOVED:count=2
# Latencystats
latency_percentiles_usec_flushdb:p50=346.111,p99=856.063,p99.9=856.063
latency_percentiles_usec_auth:p50=7.007,p99=16.063,p99.9=37.119
latency_percentiles_usec_flushall:p50=419.839,p99=856.063,p99.9=856.063
latency_percentiles_usec_cluster|forget:p50=75.263,p99=75.263,p99.9=75.263
latency_percentiles_usec_cluster|links:p50=9.023,p99=9.023,p99.9=9.023
latency_percentiles_usec_cluster|flushslots:p50=123.391,p99=123.391,p99.9=123.391
latency_percentiles_usec_cluster|meet:p50=20.095,p99=51.199,p99.9=54.015
latency_percentiles_usec_cluster|info:p50=129.023,p99=356.351,p99.9=1089.535
latency_percentiles_usec_cluster|reset:p50=475.135,p99=892.927,p99.9=892.927
latency_percentiles_usec_cluster|nodes:p50=78.335,p99=123.391,p99.9=123.391
latency_percentiles_usec_cluster|countkeysinslot:p50=2.007,p99=2.007,p99.9=2.007
latency_percentiles_usec_cluster|addslots:p50=165.887,p99=195.583,p99.9=195.583
latency_percentiles_usec_cluster|slots:p50=45.055,p99=119.295,p99.9=119.295
latency_percentiles_usec_set:p50=1.003,p99=8.031,p99.9=8.031
latency_percentiles_usec_client|setinfo:p50=1.003,p99=3.007,p99.9=3.007
latency_percentiles_usec_client|setname:p50=1.003,p99=2.007,p99.9=2.007
latency_percentiles_usec_client|getname:p50=1.003,p99=2.007,p99.9=2.007
# Cluster
cluster_enabled:1
# Keyspace
# Cluster info
cluster_state:fail
cluster_slots_assigned:12288
cluster_slots_ok:12288
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:4
cluster_size:2
cluster_current_epoch:4
cluster_my_epoch:1
cluster_stats_messages_ping_sent:157
cluster_stats_messages_pong_sent:306
cluster_stats_messages_meet_sent:60
cluster_stats_messages_sent:523
cluster_stats_messages_ping_received:306
cluster_stats_messages_pong_received:218
cluster_stats_messages_received:524
total_cluster_links_buffer_limit_exceeded:0
------ CLUSTER NODES OUTPUT ------
cc021b41e0504fac33dbd9712b2d4022ba4e78d7 127.0.0.1:7382@17382,,tls-port=0,shard-id=fd1d3143d53d497a6d51eee67a0ad41d54d881a8 master - 0 1698386652344 4 connected 12288-16383
3892a781832ac98614986d2bf4ce6ccdff59ca47 127.0.0.1:7380@17380,,tls-port=0,shard-id=5a75793c10872c99d64bdbf0e4dee781f6557a92 master - 0 1698386652343 0 connected
1720bd1eb7b1ba5b494d54ebc837bbd49602c939 127.0.0.1:7381@17381,,tls-port=0,shard-id=fa2efd57f6ce86bcc1610eb619d9b0cb1d402afc master - 0 1698386652344 2 connected 8192-12287
2f3f80c8626e25fc593a1d76a19aa8dc16d693d5 127.0.0.1:7379@17379,,tls-port=0,shard-id=888475eb71fafa37567bf01303421437baf3e179 myself,master - 0 1698386650000 1 connected 0-4095
------ CLIENT LIST OUTPUT ------
id=51 addr=[::1]:53477 laddr=[::1]:7379 fd=24 name= age=11 idle=6 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=33 addr=127.0.0.1:53073 laddr=127.0.0.1:7379 fd=12 name= age=86 idle=79 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=48 addr=[::1]:53405 laddr=[::1]:7379 fd=21 name= age=24 idle=18 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=35 addr=127.0.0.1:53100 laddr=127.0.0.1:7379 fd=13 name= age=79 idle=68 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=42 addr=[::1]:53286 laddr=[::1]:7379 fd=17 name= age=45 idle=40 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=45 addr=127.0.0.1:53342 laddr=127.0.0.1:7379 fd=19 name= age=34 idle=29 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=47 addr=127.0.0.1:53373 laddr=127.0.0.1:7379 fd=20 name= age=29 idle=24 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=41 addr=[::1]:53256 laddr=[::1]:7379 fd=16 name= age=50 idle=45 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=36 addr=[::1]:53154 laddr=[::1]:7379 fd=14 name= age=68 idle=63 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=49 addr=127.0.0.1:53440 laddr=127.0.0.1:7379 fd=23 name= age=18 idle=11 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=39 addr=127.0.0.1:53195 laddr=127.0.0.1:7379 fd=15 name= age=63 idle=51 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=28 addr=[::1]:53012 laddr=[::1]:7379 fd=22 name= age=97 idle=92 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=43 addr=[::1]:53318 laddr=[::1]:7379 fd=18 name= age=40 idle=34 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=31 addr=127.0.0.1:53040 laddr=127.0.0.1:7379 fd=11 name= age=92 idle=87 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1824 events=r cmd=cluster|reset user=default redir=-1 resp=2 lib-name= lib-ver=
id=54 addr=[::1]:53510 laddr=[::1]:7379 fd=25 name= age=6 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=0 qbuf-free=16890 argv-mem=0 multi-mem=0 rbs=1024 rbp=521 obl=0 oll=0 omem=0 tot-mem=18704 events=r cmd=cluster|info user=default redir=-1 resp=2 lib-name= lib-ver=
------ MODULES INFO OUTPUT ------
------ CONFIG DEBUG OUTPUT ------
replica-read-only yes
repl-diskless-sync yes
lazyfree-lazy-server-del no
client-query-buffer-limit 1gb
sanitize-dump-payload no
lazyfree-lazy-expire no
slave-read-only yes
lazyfree-lazy-user-flush no
lazyfree-lazy-eviction no
io-threads-do-reads no
list-compress-depth 0
activedefrag no
proto-max-bulk-len 512mb
repl-diskless-load disabled
lazyfree-lazy-user-del no
io-threads 1
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
http://github.com/redis/redis/issues
If a Redis module was involved, please open in the module's repo instead.
Suspect RAM error? Use redis-server --test-memory to verify it.
Some other issues could be detected by redis-server --check-system
Additional information
- OS distribution and version: Linux, Redis unstable
- Steps to reproduce (if any):
  1. Clone and make Redis unstable.
  2. Run the Jedis test:
git clone https://github.com/yangbodong22011/jedis.git
cd jedis
git checkout feature-cluster-periodic-topology-refresh
export PATH=$PATH:/home/bodong.ybd/workdir/redis/src (replace with your redis src dir)
make start
mvn -Dtest="redis.clients.jedis.JedisClusterTest" test
Then the test will fail, and the logs are written to /tmp:
(base) ➜ jedis git:(feature-cluster-periodic-topology-refresh) ls /tmp
argus.sock logtail.sock redis1.pid redis5.log redis8.pid redis_cluster_node2.pid redis_cluster_node5.conf redis_unavailable.pid sentinel3.conf sentinel5.log
fix_syslog.log redis redis2.log redis5.pid redis9.log redis_cluster_node3.conf redis_cluster_node5.log sentinel1.conf sentinel3.log sentinel5.pid
hsperfdata_agent redis10.log redis2.pid redis6.log redis9.pid redis_cluster_node3.log redis_cluster_node5.pid sentinel1.log sentinel3.pid tairtest
hsperfdata_bodong.ybd redis10.pid redis3.log redis6.pid redis_cluster_node1.conf redis_cluster_node3.pid redis_uds.log sentinel1.pid sentinel4.conf yumcli-hc-1.0.0-alios7.x86_64.rpm
hsperfdata_nobody redis11.log redis3.pid redis7.log redis_cluster_node1.log redis_cluster_node4.conf redis_uds.pid sentinel2.conf sentinel4.log
hsperfdata_root redis11.pid redis4.log redis7.pid redis_cluster_node2.conf redis_cluster_node4.log redis_uds.sock sentinel2.log sentinel4.pid
logtail_check_point redis1.log redis4.pid redis8.log redis_cluster_node2.log redis_cluster_node4.pid redis_unavailable.log sentinel2.pid sentinel5.conf
(base) ➜ jedis git:(feature-cluster-periodic-topology-refresh)
The log I pasted above is redis_cluster_node1.log, the one in which the crash occurred.
Comment From: zuiderkwast
The relevant code where serverAssert(de != NULL) fails:
} else if (type == CLUSTERMSG_EXT_TYPE_FORGOTTEN_NODE) {
    clusterMsgPingExtForgottenNode *forgotten_node_ext = &(ext->ext[0].forgotten_node);
    clusterNode *n = clusterLookupNode(forgotten_node_ext->name, CLUSTER_NAMELEN);
    if (n && n != myself && !(nodeIsSlave(myself) && myself->slaveof == n)) {
        sds id = sdsnewlen(forgotten_node_ext->name, CLUSTER_NAMELEN);
        dictEntry *de = dictAddRaw(server.cluster->nodes_black_list, id, NULL);
        serverAssert(de != NULL);
        uint64_t expire = server.unixtime + ntohu64(forgotten_node_ext->ttl);
        dictSetUnsignedIntegerVal(de, expire);
        clusterDelNode(n);
        clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_SAVE_CONFIG);
    }
} else if (type == CLUSTERMSG_EXT_TYPE_SHARDID) {
dictAddRaw returned NULL, which means the node is already in the blacklist. I wonder how this can happen for a known node, i.e. a node id for which clusterLookupNode() returned a node. I thought it would be impossible to add a node which is in the blacklist. Any ideas how it could happen?
Anyway, I think we could be defensive and use dictAddOrFind instead of dictAddRaw here. Then, if the node is already in the blacklist, we just update the expire time.
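Roughly something like this (just a sketch of the idea, reusing the variables from the snippet above; not a tested patch):
/* Sketch: use dictAddOrFind() so an id that is already blacklisted simply
 * gets its expire time refreshed instead of tripping the assert. */
sds id = sdsnewlen(forgotten_node_ext->name, CLUSTER_NAMELEN);
dictEntry *de = dictAddOrFind(server.cluster->nodes_black_list, id);
/* If the entry already existed, our freshly allocated 'id' was not stored
 * as the key, so free it to avoid leaking the sds string. */
if (dictGetKey(de) != id) sdsfree(id);
uint64_t expire = server.unixtime + ntohu64(forgotten_node_ext->ttl);
dictSetUnsignedIntegerVal(de, expire);
clusterDelNode(n);
clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);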
Comment From: yangbodong22011
I thought it would be impossible to add a node which is in the blacklist. Any ideas how it could happen?
@zuiderkwast I have provided the reproduction steps in the issue above (see "Steps to reproduce"), and the problem can be reproduced reliably.
Comment From: yangbodong22011
@zuiderkwast Any new ideas?
Comment From: zuiderkwast
I tried to run the test suite now, without much success unfortunately. make start worked fine. I needed to install Maven before running mvn, which downloaded a million dependencies, and then the tests started, but I got socket exceptions like "Connection reset", "Broken pipe (Write failed)" and "Failed to connect to any host resolved for DNS name." for all 66 test cases. Not sure what's wrong.
In the meantime, I started reading the source code of Redis and of the Jedis test suite. The new test case clusterPeriodTopologyRefreshTest() added in the Jedis PR https://github.com/redis/jedis/pull/3596 does the following:
- It creates a cluster of 3 nodes.
- Later it calls cleanUp() (inherited from JedisClusterTestBase), which does FLUSHDB and a CLUSTER RESET SOFT on all nodes.
- Then, 4 nodes are joined into a cluster by letting node 1 meet the other nodes (sending CLUSTER MEET to node 1 for each of the other nodes).
- Later cleanUp() is called again (soft reset),
- and then setUp() again, which joins 3 nodes into a cluster.
Now from reading Redis cluster.c:
- In a soft reset, all nodes except myself are deleted (forgotten), but the blacklist is not touched. Is this correct? Maybe it can be improved. (Should we empty the blacklist? Should we add all known nodes to the blacklist?)
- In CLUSTER MEET, the blacklist is not checked, so it is possible to add a blacklisted node using CLUSTER MEET. The node is not removed from the blacklist when it is added.
- The blacklist is only populated in CLUSTER FORGET and when receiving gossip with the forgotten-node ping extension. The blacklist is only checked before adding new nodes discovered via gossip.
So, in this new Jedis test case, it appears that the blacklist is not used, because CLUSTER FORGET is not used. But if any of the nodes are still in the blacklist from an earlier test case, that can cause trouble for this new test case, since the nodes may refuse to discover each other via gossip.
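For reference, this is roughly the only place where the blacklist is checked (a condensed paraphrase of clusterProcessGossipSection() in cluster.c, not verbatim):
/* Condensed paraphrase, not verbatim: when gossip mentions a node id we do
 * not know yet, it is only added if that id is not blacklisted, so a stale
 * blacklist entry blocks rediscovery of that node. */
if (sender &&
    !(flags & CLUSTER_NODE_NOADDR) &&
    !clusterBlacklistExists(g->nodename))
{
    clusterNode *node = createClusterNode(g->nodename, flags);
    memcpy(node->ip, g->ip, NET_IP_STR_LEN);
    node->cport = ntohs(g->cport);
    clusterAddNode(node);
}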
Comment From: yangbodong22011
but I got socket exceptions like "Connection reset", "Broken pipe (Write failed)" and "Failed to connect to any host resolved for DNS name." for all the 66 test cases. Not sure what's wrong.
You can check /tmp/redis_cluster_node1.log for the crash report.
The blacklist is only populated in CLUSTER FORGET and when receiving gossip with the forgotten-node ping extension. The backlist is only checked before adding new nodes discovered via gossip.
Before executing clusterPeriodTopologyRefreshTest(), the suite first runs testClusterForgetNode() (link). Its steps are as follows:
- node1, node2 and node3 execute clusterMeet on node4
- node1, node2 and node3 execute clusterForget on node4
Will this have an impact?
Comment From: zuiderkwast
Yes, it's possible that the testClusterForgetNode() test case will have an impact.
For debugging, I added some temporary fields to CLUSTER INFO, which is displayed in the log when the node crashes on the assert. They show that one node is in the blacklist:
cluster_nodes_blacklist_size:1
cluster_nodes_blacklist_1:ff6e2a0e63d08584014b8275de49e811cb45a94f(TTL=44)
In the crash log, the output of CLUSTER NODES is displayed like this:
------ CLUSTER NODES OUTPUT ------
2780deee5c3a1ee95d390efd86588a4967c7f691 127.0.0.1:7380@17380,,tls-port=0,shard-id=8eb1fec74b041e23551e3e827e228c4f068efce2 master - 0 1698661579262 0 connected
ff6e2a0e63d08584014b8275de49e811cb45a94f 127.0.0.1:7382@17382,,tls-port=0,shard-id=aec4e6e54360653696b30f84e1ba01027d1eead6 master - 0 1698661579261 0 connected
ed938c5f762cde86ba4e4b8fa674c120684d21b1 127.0.0.1:7381@17381,,tls-port=0,shard-id=89188c22051d7ee202014eab1e186c567ebb1f90 master - 0 1698661579261 3 connected 8192-12287
962490d4f5ca092467781fc8fb7e439f40fa72b2 127.0.0.1:7379@17379,,tls-port=0,shard-id=a9794c3e7f61924825b5737326b8d9f1f0c398fc myself,master - 0 1698661577000 1 connected 0-4095
When I applied the fix for the assert (using dictAddOrFind and removing the assert), the tests no longer crash, but they hang instead, or seem to take forever. So now Redis doesn't crash, but maybe the blacklist prevents the cluster from becoming stable after the "forget" test case.
Then, I added code to empty the nodes blacklist in CLUSTER RESET. With this, the JedisClusterTest test suite passes in less than 4 minutes. :)
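Roughly, the change amounts to something like this (just a sketch of the idea against clusterReset(), not the exact diff):
/* Sketch only: as part of CLUSTER RESET (SOFT and HARD), also drop all
 * node-id blacklist entries, so a reset node can be re-met and rediscovered
 * via gossip without stale entries getting in the way. */
void clusterReset(int hard) {
    /* ... existing logic: forget all other nodes, unassign all slots ... */
    dictEmpty(server.cluster->nodes_black_list, NULL);
    /* ... hard reset additionally resets epochs and picks a new node id ... */
}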
@madolson Is it acceptable to empty the nodes blacklist in CLUSTER RESET? Any use case where it's not wanted?
Comment From: zuiderkwast
@yangbodong22011 A workaround can be to use CLUSTER RESET HARD instead of SOFT in cleanUp(). It gives the node a new id, so if the old id is in the blacklist of another node, it doesn't match. In the tests for redis itself, hard reset is used in init-tests.tcl, which is included first in each test suite: https://github.com/redis/redis/blob/unstable/tests/cluster/tests/includes/init-tests.tcl#L37
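To illustrate why that helps, the hard-reset branch of clusterReset() does roughly this (a sketch from memory; exact code may differ):
/* Rough sketch: a hard reset drops the old node id and generates a fresh
 * random one, so blacklist entries on other nodes (keyed by the old id)
 * no longer match this node. */
if (hard) {
    sds oldname = sdsnewlen(myself->name, CLUSTER_NAMELEN);
    dictDelete(server.cluster->nodes, oldname);  /* remove the old id from the nodes table */
    sdsfree(oldname);
    getRandomHexChars(myself->name, CLUSTER_NAMELEN);
    clusterAddNode(myself);                      /* re-add myself under the new id */
}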
Comment From: yangbodong22011
@yangbodong22011 A workaround can be to use CLUSTER RESET HARD instead of SOFT in cleanUp(). It gives the node a new id, so if the old id is in the blacklist of another node, it doesn't match. In the tests for redis itself, hard reset is used in init-tests.tcl, which is included first in each test suite: https://github.com/redis/redis/blob/unstable/tests/cluster/tests/includes/init-tests.tcl#L37
Thanks, I will try.
Comment From: zuiderkwast
@yangbodong22011 So now #12702 was merged, your test suite should pass without changes, because I added to the PR to empty the blacklist on CLUSTER RESET.
Comment From: yangbodong22011
@yangbodong22011 So now #12702 was merged, your test suite should pass without changes, because I added to the PR to empty the blacklist on CLUSTER RESET.
@zuiderkwast I did not change HARD to SOFT in the PR, because HARD better fits the test conditions.
But I did change it to SOFT locally, compiled against the latest Redis unstable branch, ran the test, and it passed.
Thank you for troubleshooting this issue :)