We were resharding data to a new cluster node, encountered a problem, and are now stuck in a situation that is probably caused by a bug. When trying to reshard, we get this message:

[ERR] Calling MIGRATE: ERR Target instance replied with error: CLUSTERDOWN The cluster is down

But the cluster is up! Below are the steps we followed.

First we created an empty node on our new, separate server and then added it to our existing Redis cluster (a sketch of the commands we used follows the node listing below):

server1-ip:port master connected
server2-ip:port master connected
server3-ip:port master connected
server4-ip:port master connected
server5-ip:port master connected
new-server-ip:port master connected
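For reference, a minimal sketch of the commands we used (host and port values are placeholders for our real ones): the new instance is started empty in cluster mode and then attached with redis-trib's add-node subcommand, using any live node of the existing cluster as the second argument.

./redis-trib.rb add-node new-server-ip:port server1-ip:port

# optionally confirm the new node has joined and considers the cluster healthy
redis-cli -h new-server-ip -p port cluster info | grep cluster_state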

We started resharding data from server1-ip:port to new-server-ip:port with this command: "./redis-trib.rb reshard --from <source-node-id> --to <target-node-id> --slots <count> --yes <host>:<port>" (the actual arguments are omitted here). We then encountered an error:

Moving slot 7402 from 6f70203705a1f26b561f39a600930f7b22dfeb98
Moving slot 7403 from 6f70203705a1f26b561f39a600930f7b22dfeb98
Moving slot 6904 from server1-ip:port to new-server-ip:port: ........$
Moving slot 6905 from server1-ip:port to new-server-ip:port: ........$
Moving slot 6906 from server1-ip:port to new-server-ip:port: ........$
Moving slot 6907 from server1-ip:port to new-server-ip:port: ........$
Moving slot 6908 from server1-ip:port to new-server-ip:port: ........$
Moving slot 6909 from server1-ip:port to new-server-ip:port: ........$
[ERR] Calling MIGRATE: IOERR error or timeout reading to target instance
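The IOERR here usually means the underlying MIGRATE call timed out, which can happen with very large keys or a heavily loaded source node. A hedged mitigation, assuming the redis-trib.rb shipped with 3.2 supports the --timeout option (in milliseconds, passed through to MIGRATE), is to rerun the reshard with a larger per-key timeout, for example:

./redis-trib.rb reshard --from <source-node-id> --to <target-node-id> --slots <count> --timeout 300000 --yes server1-ip:port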

We tried to fix/check for open slots with "./redis-trib.rb fix ip:port" before restarting the resharding:

Performing Cluster Check (using node new-server-ip:port)
M: 80570f4d791d9834bd28322c25337be00e1370b2 new-server-ip:port
   slots:6904-6909 (6 slots) master
   0 additional replica(s)
M: 9527684833c252c5dd0ee5f44afa13730cb689ee server2-ip:port
   slots:0-50 (51 slots) master
   0 additional replica(s)
M: 8b6accb0259089f4f5fc3942b34fb6b7fcbde33e server5-ip:port
   slots:51-592,6566-6903 (880 slots) master
   0 additional replica(s)
M: 5b887a2fc38eade4b6366b4d1de2926733e082d2 server3-ip:port
   slots:926-3318 (2393 slots) master
   0 additional replica(s)
M: 6f70203705a1f26b561f39a600930f7b22dfeb98 server1-ip:port
   slots:6910-16383 (9474 slots) master
   0 additional replica(s)
M: 0a52eec580372bd365351be0b0833dbd364aa633 server4-ip:port
   slots:593-925,3319-6565 (3580 slots) master
   0 additional replica(s)
[OK] All nodes agree about slots configuration.
Check for open slots...
Check slots coverage...
[OK] All 16384 slots covered.

We restarted the resharding and it resumed successfully, but then we hit an error again:

Moving slot 7007 from 6f70203705a1f26b561f39a600930f7b22dfeb98
Moving slot 7008 from 6f70203705a1f26b561f39a600930f7b22dfeb98
Moving slot 7009 from 6f70203705a1f26b561f39a600930f7b22dfeb98
Moving slot 6910 from server1-ip:port to new-server-ip:port: ..............................$
Moving slot 6911 from server1-ip:port to new-server-ip:port: ..............................$
Moving slot 6912 from server1-ip:port to new-server-ip:port: ..............................$
[ERR] Calling MIGRATE: ERR Target instance replied with error: CLUSTERDOWN The cluster is down

But actually the cluster isn't down:

9527684833c252c5dd0ee5f44afa13730cb689ee server2-ip:port master - 0 1485250688989 2 connected 0-50
5b887a2fc38eade4b6366b4d1de2926733e082d2 server3-ip:port master - 0 1485250686984 3 connected 926-3318
80570f4d791d9834bd28322c25337be00e1370b2 new-server-ip:port myself,master - 0 0 6 connected 6904-6911 [6912-<-6f70203705a1f26b561f39a600930f7b22dfeb98]
8b6accb0259089f4f5fc3942b34fb6b7fcbde33e server5-ip:port master - 0 1485250687986 5 connected 51-592 6566-6903
6f70203705a1f26b561f39a600930f7b22dfeb98 server1-ip:port master - 0 1485250689993 1 connected 6912-16383
0a52eec580372bd365351be0b0833dbd364aa633 server4-ip:port master - 0 1485250688989 4 connected 593-925 3319-6565

We tried to fix it again by running "./redis-trib.rb fix ip:port", but it gave us this error:

Performing Cluster Check (using node new-server-ip:port)
M: 80570f4d791d9834bd28322c25337be00e1370b2 new-server-ip:port
   slots:6904-6911 (8 slots) master
   0 additional replica(s)
M: 9527684833c252c5dd0ee5f44afa13730cb689ee server2-ip:port
   slots:0-50 (51 slots) master
   0 additional replica(s)
M: 5b887a2fc38eade4b6366b4d1de2926733e082d2 server3-ip:port
   slots:926-3318 (2393 slots) master
   0 additional replica(s)
M: 8b6accb0259089f4f5fc3942b34fb6b7fcbde33e server5-ip:port
   slots:51-592,6566-6903 (880 slots) master
   0 additional replica(s)
M: 6f70203705a1f26b561f39a600930f7b22dfeb98 server1-ip:port
   slots:6912-16383 (9472 slots) master
   0 additional replica(s)
M: 0a52eec580372bd365351be0b0833dbd364aa633 server4-ip:port
   slots:593-925,3319-6565 (3580 slots) master
   0 additional replica(s)
[OK] All nodes agree about slots configuration.
Check for open slots...
[WARNING] Node new-server-ip:port has slots in importing state (6912).
[WARNING] Node server1-ip:port has slots in migrating state (6912).
[WARNING] The following slots are open: 6912
Fixing open slot 6912
Set as migrating in: server1-ip:port
Set as importing in: new-server-ip:port
Moving slot 6912 from server1-ip:port to new-server-ip:port:
[ERR] Calling MIGRATE: ERR Target instance replied with error: CLUSTERDOWN The cluster is down
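Since redis-trib.rb fix itself dies on the open slot, one way to unblock things manually (a sketch, assuming the keys of slot 6912 were not actually moved yet, so the slot should simply stay on the source node) is to clear the stuck importing/migrating flags with CLUSTER SETSLOT ... STABLE and let a later reshard move the slot again:

# clear the stuck importing flag on the target node
redis-cli -h new-server-ip -p port cluster setslot 6912 stable

# clear the stuck migrating flag on the source node, which still owns the slot
redis-cli -h server1-ip -p port cluster setslot 6912 stable

# confirm no slot is left in migrating/importing state before resharding again
./redis-trib.rb check server1-ip:port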

info for server1-ip:port - SOURCE NODE

Server

redis_version:3.2.3
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:4992f89db2d932d
redis_mode:cluster
os:Linux 3.13.0-37-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.2
process_id:25284
run_id:eeb0be947760b033df999a84b1f1024ffc56f94d
tcp_port:7010
uptime_in_seconds:6719679
uptime_in_days:77
hz:10
lru_clock:8854109
executable:/home/cybranding/redis-3.2.3/redis-stable/src/redis-server
config_file:/etc/redis_cluster_client2/redis-3.2.3/7010/redis.conf

Clients

connected_clients:6
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

Memory

used_memory:263262791176
used_memory_human:245.18G
used_memory_rss:222207938560
used_memory_rss_human:206.95G
used_memory_peak:263262843256
used_memory_peak_human:245.18G
total_system_memory:405738954752
total_system_memory_human:377.87G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:0.84
mem_allocator:jemalloc-4.0.3

Persistence

loading:0
rdb_changes_since_last_save:3477248820
rdb_bgsave_in_progress:0
rdb_last_save_time:1478529438
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:12415
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:76954766881
aof_base_size:71475261210
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

Stats

total_connections_received:135923
total_commands_processed:1624882108
instantaneous_ops_per_sec:121
total_net_input_bytes:183344702562
total_net_output_bytes:238996158132
instantaneous_input_kbps:7.65
instantaneous_output_kbps:0.94
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:2696602
evicted_keys:0
keyspace_hits:293331974
keyspace_misses:4634274
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:8247933
migrate_cached_sockets:0

Replication

role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

CPU

used_cpu_sys:228998.14
used_cpu_user:106213.70
used_cpu_sys_children:13948.03
used_cpu_user_children:38121.80

Cluster

cluster_enabled:1

Keyspace

db0:keys=157638834,expires=32133,avg_ttl=38497283

info for new-server-ip:port - TARGET NODE

Server

redis_version:3.2.3
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:b5038506891fcfe5
redis_mode:cluster
os:Linux 4.4.0-47-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:5.4.0
process_id:29729
run_id:be9a3b0fa9e56dd78829f432189cc3faed2b70a4
tcp_port:7015
uptime_in_seconds:600025
uptime_in_days:6
hz:10
lru_clock:8853916
executable:/root/redis-3.2.3/redis-3.2.3/src/redis-server
config_file:/etc/redis_cluster_client2/7015/redis.conf

Clients

connected_clients:5
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

Memory

used_memory:197574704
used_memory_human:188.42M
used_memory_rss:209297408
used_memory_rss_human:199.60M
used_memory_peak:399048784
used_memory_peak_human:380.56M
total_system_memory:270378438656
total_system_memory_human:251.81G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:1.06
mem_allocator:jemalloc-4.0.3

Persistence

loading:0
rdb_changes_since_last_save:173468
rdb_bgsave_in_progress:0
rdb_last_save_time:1484648899
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:71610854
aof_base_size:64129446
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

Stats

total_connections_received:4477
total_commands_processed:56480
instantaneous_ops_per_sec:0
total_net_input_bytes:3772430822
total_net_output_bytes:200708212
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:217
evicted_keys:0
keyspace_hits:3981
keyspace_misses:403
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

Replication

role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

CPU

used_cpu_sys:317.34
used_cpu_user:209.47
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

Cluster

cluster_enabled:1

Keyspace

db0:keys=150389,expires=28,avg_ttl=37790580

Comment From: doyoubi

We came across the same problem when scaling up a cluster. I'm putting our case and the reason here in case someone needs it some day. In our case, the "CLUSTERDOWN The cluster is down" error arises when we start the migration immediately after adding some nodes to the cluster. This happens because the cluster_state (from the CLUSTER INFO command) of newly added nodes is still "fail", as it is when first initialized, and we send MIGRATE to these unprepared nodes. We added extra logging at the two places where REDIS_CLUSTER_REDIR_DOWN_STATE is returned and found that the added log appeared only before "Cluster state changed: ok". Our Redis version is 3.0.7.
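If that is also the cause here, a simple guard is to wait until the newly added node reports cluster_state:ok before issuing any MIGRATE. A minimal sketch (host and port values are placeholders for the new node):

NEW_NODE_HOST=new-server-ip   # placeholder
NEW_NODE_PORT=7015            # placeholder

# poll the new node until it has converged and sees the cluster as ok
until redis-cli -h "$NEW_NODE_HOST" -p "$NEW_NODE_PORT" cluster info | grep -q 'cluster_state:ok'; do
    echo "waiting for the new node to report cluster_state:ok ..."
    sleep 1
done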

I'm not sure whether redis-trib.rb checks the cluster_state before doing the migration.

Comment From: gensmusic

Same here when we restart a slave of a master in the cluster. The cluster is up, but we get a lot of errors.

Comment From: chihuo91

@otherpirate Any chance to fix this bug?

Comment From: chihuo91

My situation is even weirder. When I use a Redis client (Lettuce) to get the cluster info, cluster_state is ok. But when I use redis-trib.rb to do the migration, it fails and cluster_state is fail. Any thoughts?
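One possible explanation for the discrepancy: cluster_state in CLUSTER INFO is each node's own view, so the node your client (Lettuce) is connected to can report ok while the node redis-trib is migrating to still reports fail. A quick, hedged way to compare every node's view (server1-ip:7010 is just one reachable node; adjust to your cluster):

# ask one node for the member list, then query each member for its own cluster_state
for addr in $(redis-cli -h server1-ip -p 7010 cluster nodes | awk '{print $2}'); do
    host=${addr%:*}; port=${addr##*:}
    echo -n "$addr -> "
    redis-cli -h "$host" -p "$port" cluster info | grep cluster_state
done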

Comment From: spencerparkin

A Stack Overflow answer says the CLUSTERDOWN error is issued when not all slots are covered by the cluster. I'm getting this error immediately after bringing up my cluster, so it's possible I'm not correctly covering all slots. I thought I had allocated the entire slot space across the cluster, but apparently not.

In the other cases above, cluster slots can stop being covered if the resharding isn't done quite right, I suppose.

EDIT: Sure enough, after debugging my code, I found that I had covered all but the very last slot of the 16384-slot key-space. After covering that slot, I no longer get the error.
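For anyone else hitting this, a quick hedged check that every one of the 16384 slots is actually assigned (any reachable cluster node will do; host and port are placeholders):

# cluster_slots_assigned should be 16384 and cluster_state should be ok
redis-cli -h any-cluster-node -p port cluster info | grep -E 'cluster_state|cluster_slots_assigned'

# redis-trib reports coverage gaps explicitly under "Check slots coverage..."
./redis-trib.rb check any-cluster-node:port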