Bug Description:
Version: Redis 7.0.11
Hi, I observed an issue with Redis cluster nodes. The following happened with our automated Redis management control plane:
1. A node N1 was issued a CLUSTER REPLICATE command.
2. The control plane checked whether the node was unassigned (replication had not happened yet, so it was reported as unassigned).
3. Node N1 was then reassigned to the primary role and issued CLUSTER ADDSLOTSRANGE for some other range.
4. The final state had the node acting as both replica and primary:
CLUSTER NODES output for the node on port 7001
3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. myself,slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698913926000 3 connected 8192-16383
c23aa8a150e0291eaad4b88cc505b88b8b2479b0 172.18.133.131:7000@17000,padgupta-lappy. master,nofailover - 0 1698914350449 3 connected 0-8191
CLUSTER NODES output for the node on port 7000
c23aa8a150e0291eaad4b88cc505b88b8b2479b0 172.18.133.131:7000@17000,padgupta-lappy. myself,master,nofailover - 0 0 3 connected 0-8191
3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698914345168 3 connected
After this, if you resend the CLUSTER REPLICATE command to node N1, Redis crashes.
I'm not sure if this dual role is expected in the first place, but it does cause a crash later.
BUG REPORT:
------ STACK TRACE ------
Backtrace:
/usr/bin/redis-server *:7001 [cluster] (clusterSetMaster+0xdf)[0x563c35e2a22f]
/usr/bin/redis-server *:7001 [cluster] (clusterCommand+0x1c13)[0x563c35e31823]
/usr/bin/redis-server *:7001 [cluster] (call+0xee)[0x563c35da510e]
/usr/bin/redis-server *:7001 [cluster] (processCommand+0x6fd)[0x563c35da617d]
/usr/bin/redis-server *:7001 [cluster] (processInputBuffer+0x107)[0x563c35dc2517]
/usr/bin/redis-server *:7001 [cluster] (readQueryFromClient+0x318)[0x563c35dc2a58]
/usr/bin/redis-server *:7001 [cluster] (+0x17323c)[0x563c35e9823c]
/usr/bin/redis-server *:7001 [cluster] (aeProcessEvents+0x1e2)[0x563c35d9c182]
/usr/bin/redis-server *:7001 [cluster] (aeMain+0x1d)[0x563c35d9c4bd]
/usr/bin/redis-server *:7001 [cluster] (main+0x354)[0x563c35d93df4]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f5f788ccd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f5f788cce40]
/usr/bin/redis-server *:7001 [cluster] (_start+0x25)[0x563c35d944c5]
Repro Steps:
Minimal pseudo-code for repro:
firstShard.ClusterAddSlots(context.TODO(), 0, 8191)
firstShard.ClusterBumpEpoch(context.TODO())
secondShard.ClusterReplicate(context.TODO(), NodeIdOfFirstShard)
secondShard.ClusterAddSlots(context.TODO(), 8192, 16383)
secondShard.ClusterBumpEpoch(context.TODO())
[For the crash: issue another replicate]
secondShard.ClusterReplicate(context.TODO(), NodeIdOfFirstShard)
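For completeness, here is a minimal runnable sketch of the same repro. It assumes the go-redis v9 client, that the two nodes have already exchanged CLUSTER MEET and hold no slots, and the node addresses from the CLUSTER NODES output above; CLUSTER BUMPEPOCH is issued through the generic Do call since a typed helper may not exist in the client. It is only an illustration, not the original control-plane code.

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Node addresses are assumptions taken from the CLUSTER NODES output above.
	firstShard := redis.NewClient(&redis.Options{Addr: "172.18.133.131:7000"})
	secondShard := redis.NewClient(&redis.Options{Addr: "172.18.133.131:7001"})

	// Assign the lower half of the slot space to the first node.
	firstShard.ClusterAddSlotsRange(ctx, 0, 8191)
	firstShard.Do(ctx, "cluster", "bumpepoch") // generic Do, in case no typed BUMPEPOCH helper exists

	// Error handling omitted for brevity.
	firstID, _ := firstShard.ClusterMyID(ctx).Result()

	// Make the second node a replica of the first, then immediately assign it
	// slots while it still reports itself as a replica.
	secondShard.ClusterReplicate(ctx, firstID)
	secondShard.ClusterAddSlotsRange(ctx, 8192, 16383)
	secondShard.Do(ctx, "cluster", "bumpepoch")

	// A second REPLICATE now trips the 'myself->numslots == 0' assertion and
	// crashes the node.
	fmt.Println(secondShard.ClusterReplicate(ctx, firstID).Err())
}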
Expectation:
If we send CLUSTER REPLICATE to a master node, it denies the command. I would expect similar behavior if we send CLUSTER ADDSLOTSRANGE to a replica node (until its state is reset). At the very least, the node's view should eventually become consistent with the other nodes' views through gossip propagation. Currently, the CLUSTER NODES output of the two shards does not converge (before the crash).
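Until the server enforces this, one possible control-plane-side guard is to refuse slot assignment when the target node's own CLUSTER NODES line already carries the slave flag. The guardAddSlots helper below is hypothetical (not an existing API), is only a sketch, and reuses the go-redis imports from the sketch above plus the standard "strings" package.

// guardAddSlots is a hypothetical control-plane helper: it refuses to run
// CLUSTER ADDSLOTSRANGE against a node whose own CLUSTER NODES line already
// reports the "slave" flag.
func guardAddSlots(ctx context.Context, node *redis.Client, min, max int64) error {
	myID, err := node.ClusterMyID(ctx).Result()
	if err != nil {
		return err
	}
	nodes, err := node.ClusterNodes(ctx).Result()
	if err != nil {
		return err
	}
	for _, line := range strings.Split(nodes, "\n") {
		// Each CLUSTER NODES line starts with the node ID; flags are the third field.
		if strings.HasPrefix(line, myID) && strings.Contains(line, "slave") {
			return fmt.Errorf("node %s reports itself as a replica; refusing ADDSLOTSRANGE", myID)
		}
	}
	return node.ClusterAddSlotsRange(ctx, min, max).Err()
}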
Comment From: codeplayer14
Updating with the full bug report:
=== REDIS BUG REPORT START: Cut & paste starting from here ===
11939:S 02 Nov 2023 14:24:59.396 # === ASSERTION FAILED ===
11939:S 02 Nov 2023 14:24:59.396 # ==> cluster.c:4580 'myself->numslots == 0' is not true
------ STACK TRACE ------
Backtrace:
/usr/bin/redis-server *:7001 [cluster] (clusterSetMaster+0xdf)[0x563c35e2a22f]
/usr/bin/redis-server *:7001 [cluster] (clusterCommand+0x1c13)[0x563c35e31823]
/usr/bin/redis-server *:7001 [cluster] (call+0xee)[0x563c35da510e]
/usr/bin/redis-server *:7001 [cluster] (processCommand+0x6fd)[0x563c35da617d]
/usr/bin/redis-server *:7001 [cluster] (processInputBuffer+0x107)[0x563c35dc2517]
/usr/bin/redis-server *:7001 [cluster] (readQueryFromClient+0x318)[0x563c35dc2a58]
/usr/bin/redis-server *:7001 [cluster] (+0x17323c)[0x563c35e9823c]
/usr/bin/redis-server *:7001 [cluster] (aeProcessEvents+0x1e2)[0x563c35d9c182]
/usr/bin/redis-server *:7001 [cluster] (aeMain+0x1d)[0x563c35d9c4bd]
/usr/bin/redis-server *:7001 [cluster] (main+0x354)[0x563c35d93df4]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f5f788ccd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f5f788cce40]
/usr/bin/redis-server *:7001 [cluster] (_start+0x25)[0x563c35d944c5]
------ INFO OUTPUT ------
Server
redis_version:7.0.11 redis_git_sha1:00000000 redis_git_dirty:0 redis_build_id:3af367a78d5e21e9 redis_mode:cluster os:Linux 5.15.90.1-microsoft-standard-WSL2 x86_64 arch_bits:64 monotonic_clock:POSIX clock_gettime multiplexing_api:epoll atomicvar_api:c11-builtin gcc_version:11.3.0 process_id:11939 process_supervised:no run_id:a28a5f2e211968330ac236a0d8dda9bf32709b7c tcp_port:7001 server_time_usec:1698915299396196 uptime_in_seconds:198 uptime_in_days:0 hz:10 configured_hz:10 lru_clock:4416483 executable:/usr/bin/redis-server config_file:/home/paddy14/GoProjects/WebXTCache/services/garnet-operator/redis-server/redis-config/redis7001.conf io_threads_active:0
Clients
connected_clients:4 cluster_connections:4 maxclients:10000 client_recent_max_input_buffer:24 client_recent_max_output_buffer:0 blocked_clients:0 tracking_clients:0 clients_in_timeout_table:0
Memory
used_memory:1821368 used_memory_human:1.74M used_memory_rss:16105472 used_memory_rss_human:15.36M used_memory_peak:2082824 used_memory_peak_human:1.99M used_memory_peak_perc:87.45% used_memory_overhead:1618556 used_memory_startup:1586296 used_memory_dataset:202812 used_memory_dataset_perc:86.28% allocator_allocated:2149576 allocator_active:2609152 allocator_resident:6033408 total_system_memory:16686968832 total_system_memory_human:15.54G used_memory_lua:31744 used_memory_vm_eval:31744 used_memory_lua_human:31.00K used_memory_scripts_eval:0 number_of_cached_scripts:0 number_of_functions:0 number_of_libraries:0 used_memory_vm_functions:32768 used_memory_vm_total:64512 used_memory_vm_total_human:63.00K used_memory_functions:184 used_memory_scripts:184 used_memory_scripts_human:184B maxmemory:10737418240 maxmemory_human:10.00G maxmemory_policy:allkeys-lru allocator_frag_ratio:1.21 allocator_frag_bytes:459576 allocator_rss_ratio:2.31 allocator_rss_bytes:3424256 rss_overhead_ratio:2.67 rss_overhead_bytes:10072064 mem_fragmentation_ratio:8.95 mem_fragmentation_bytes:14306632 mem_not_counted_for_evict:0 mem_replication_backlog:20508 mem_total_replication_buffers:20504 mem_clients_slaves:0 mem_clients_normal:7216 mem_cluster_links:4352 mem_aof_buffer:0 mem_allocator:jemalloc-5.2.1 active_defrag_running:0 lazyfree_pending_objects:0 lazyfreed_objects:0
Persistence
loading:0 async_loading:0 current_cow_peak:0 current_cow_size:0 current_cow_size_age:0 current_fork_perc:0.00 current_save_keys_processed:0 current_save_keys_total:0 rdb_changes_since_last_save:0 rdb_bgsave_in_progress:0 rdb_last_save_time:1698915101 rdb_last_bgsave_status:ok rdb_last_bgsave_time_sec:-1 rdb_current_bgsave_time_sec:-1 rdb_saves:0 rdb_last_cow_size:0 rdb_last_load_keys_expired:0 rdb_last_load_keys_loaded:0 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_rewrites:0 aof_rewrites_consecutive_failures:0 aof_last_write_status:ok aof_last_cow_size:0 module_fork_in_progress:0 module_fork_last_cow_size:0
Stats
total_connections_received:3 total_commands_processed:25 instantaneous_ops_per_sec:0 total_net_input_bytes:89180 total_net_output_bytes:178624 total_net_repl_input_bytes:511 total_net_repl_output_bytes:0 instantaneous_input_kbps:0.01 instantaneous_output_kbps:0.02 instantaneous_input_repl_kbps:0.01 instantaneous_output_repl_kbps:0.00 rejected_connections:0 sync_full:0 sync_partial_ok:0 sync_partial_err:0 expired_keys:0 expired_stale_perc:0.00 expired_time_cap_reached_count:0 expire_cycle_cpu_milliseconds:0 evicted_keys:0 evicted_clients:0 total_eviction_exceeded_time:0 current_eviction_exceeded_time:0 keyspace_hits:0 keyspace_misses:0 pubsub_channels:0 pubsub_patterns:0 pubsubshard_channels:0 latest_fork_usec:0 total_forks:0 migrate_cached_sockets:0 slave_expires_tracked_keys:0 active_defrag_hits:0 active_defrag_misses:0 active_defrag_key_hits:0 active_defrag_key_misses:0 total_active_defrag_time:0 current_active_defrag_time:0 tracking_total_keys:0 tracking_total_items:0 tracking_total_prefixes:0 unexpected_error_replies:0 total_error_replies:0 dump_payload_sanitizations:0 total_reads_processed:47 total_writes_processed:188 io_threaded_reads_processed:0 io_threaded_writes_processed:0 reply_buffer_shrinks:4 reply_buffer_expands:0
Replication
role:slave master_host:172.18.133.131 master_port:7000 master_link_status:up master_last_io_seconds_ago:2 master_sync_in_progress:0 slave_read_repl_offset:266 slave_repl_offset:266 slave_priority:100 slave_read_only:1 replica_announced:1 connected_slaves:0 master_failover_state:no-failover master_replid:7c5fafd84676dd352225f1c5edac2b0c9f41e72d master_replid2:0000000000000000000000000000000000000000 master_repl_offset:266 second_repl_offset:-1 repl_backlog_active:1 repl_backlog_size:1048576 repl_backlog_first_byte_offset:15 repl_backlog_histlen:252
CPU
used_cpu_sys:0.253435 used_cpu_user:0.158397 used_cpu_sys_children:0.000000 used_cpu_user_children:0.000000 used_cpu_sys_main_thread:0.252074 used_cpu_user_main_thread:0.157546
Modules
Commandstats
cmdstat_ping:calls=18,usec=23,usec_per_call=1.28,rejected_calls=0,failed_calls=0 cmdstat_cluster|meet:calls=1,usec=28,usec_per_call=28.00,rejected_calls=0,failed_calls=0 cmdstat_cluster|nodes:calls=2,usec=635,usec_per_call=317.50,rejected_calls=0,failed_calls=0 cmdstat_cluster|addslots:calls=1,usec=620,usec_per_call=620.00,rejected_calls=0,failed_calls=0 cmdstat_cluster|bumpepoch:calls=1,usec=64,usec_per_call=64.00,rejected_calls=0,failed_calls=0 cmdstat_cluster|replicate:calls=1,usec=240,usec_per_call=240.00,rejected_calls=0,failed_calls=0 cmdstat_command|docs:calls=1,usec=1041,usec_per_call=1041.00,rejected_calls=0,failed_calls=0
Errorstats
Latencystats
latency_percentiles_usec_ping:p50=1.003,p99=2.007,p99.9=2.007 latency_percentiles_usec_cluster|meet:p50=28.031,p99=28.031,p99.9=28.031 latency_percentiles_usec_cluster|nodes:p50=109.055,p99=528.383,p99.9=528.383 latency_percentiles_usec_cluster|addslots:p50=622.591,p99=622.591,p99.9=622.591 latency_percentiles_usec_cluster|bumpepoch:p50=64.255,p99=64.255,p99.9=64.255 latency_percentiles_usec_cluster|replicate:p50=240.639,p99=240.639,p99.9=240.639 latency_percentiles_usec_command|docs:p50=1044.479,p99=1044.479,p99.9=1044.479
Cluster
cluster_enabled:1
Keyspace
argv[1]: '"replicate"' argv[2]: '"59a38a2564efca9b132732424c643870fff819e9"'
------ MODULES INFO OUTPUT ------
------ CONFIG DEBUG OUTPUT ------ proto-max-bulk-len 512mb lazyfree-lazy-server-del no io-threads-do-reads no lazyfree-lazy-expire no repl-diskless-sync yes list-compress-depth 0 io-threads 1 sanitize-dump-payload no lazyfree-lazy-user-del no client-query-buffer-limit 1gb lazyfree-lazy-user-flush no activedefrag no repl-diskless-load disabled replica-read-only yes slave-read-only yes lazyfree-lazy-eviction no
------ FAST MEMORY TEST ------ 11939:S 02 Nov 2023 14:24:59.397 # Bio thread for job type #0 terminated 11939:S 02 Nov 2023 14:24:59.398 # Bio thread for job type #1 terminated 11939:S 02 Nov 2023 14:24:59.398 # Bio thread for job type #2 terminated *** Preparing to test memory region 563c36052000 (2306048 bytes) *** Preparing to test memory region 563c37a6c000 (270336 bytes) *** Preparing to test memory region 7f5f75b7c000 (2621440 bytes) *** Preparing to test memory region 7f5f75dfd000 (8388608 bytes) *** Preparing to test memory region 7f5f765fe000 (8388608 bytes) *** Preparing to test memory region 7f5f76dff000 (8388608 bytes) *** Preparing to test memory region 7f5f77600000 (8388608 bytes) *** Preparing to test memory region 7f5f77e00000 (8388608 bytes) *** Preparing to test memory region 7f5f78610000 (24576 bytes) *** Preparing to test memory region 7f5f7863c000 (8192 bytes) *** Preparing to test memory region 7f5f7877b000 (4096 bytes) *** Preparing to test memory region 7f5f788a1000 (8192 bytes) *** Preparing to test memory region 7f5f78abe000 (53248 bytes) *** Preparing to test memory region 7f5f78f0b000 (12288 bytes) *** Preparing to test memory region 7f5f79078000 (4096 bytes) *** Preparing to test memory region 7f5f79166000 (8192 bytes) .O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Comment From: enjoy-binbin
We discussed this in #12282.
Comment From: codeplayer14
Great. Thanks for the prompt response, @enjoy-binbin!