In our environment we are running many Sentinel servers (v4.0.11) that manage Redis clusters.

Every cluster has 3 nodes (1 master + 2 slaves), with a unique password per cluster. Our monitoring shows that sometimes, after a Sentinel failover or other events, some of the Sentinel nodes go into the +sdown state and never return to normal.

Tcpdump shows that the problem is an insufficient check of the initial authentication and missing handling of the auth error at later stages. Below is part of the endless conversation:

T 10.213.12.206:23660 -> 10.213.12.97:51992 [AP]
  -NOAUTH Authentication required...-NOAUTH Authentication required...


T 10.213.12.97:51992 -> 10.213.12.206:23660 [AP]
  *1..$4..INFO..*3..$7..PUBLISH..$18..__sentinel__:hello..$96..10.213.12.97,5000,b0ce305fa460fb4b738d5f0c559d91382701501a,3909,redis-cpat,10.213.12.206,23660,0..


T 10.213.12.206:23660 -> 10.213.12.97:51992 [AP]
  -NOAUTH Authentication required...-NOAUTH Authentication required...


T 10.213.12.97:51992 -> 10.213.12.206:23660 [AP]
  *1..$4..INFO..*3..$7..PUBLISH..$18..__sentinel__:hello..$96..10.213.12.97,5000,b0ce305fa460fb4b738d5f0c559d91382701501a,3909,redis-cpat,10.213.12.206,23660,0..


T 10.213.12.206:23660 -> 10.213.12.97:51992 [AP]
  -NOAUTH Authentication required...-NOAUTH Authentication required...


T 10.213.12.97:51992 -> 10.213.12.206:23660 [AP]
  *1..$4..INFO..*3..$7..PUBLISH..$18..__sentinel__:hello..$96..10.213.12.97,5000,b0ce305fa460fb4b738d5f0c559d91382701501a,3909,redis-cpat,10.213.12.206,23660,0..


T 10.213.12.206:23660 -> 10.213.12.97:51992 [AP]
  -NOAUTH Authentication required...-NOAUTH Authentication required...

If I manually connect to the affected master and kill the connection, or run sentinel reset <name>, it starts to work normally again. We experience such problems a few times per week.
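
For reference, a minimal recovery sketch (Sentinel port, master name, and addresses are taken from the trace above; the password variable is a placeholder):

    # Option 1: reset the master state on the affected Sentinel
    # (the Sentinel port 5000 comes from the __sentinel__:hello payload above).
    redis-cli -p 5000 sentinel reset redis-cpat

    # Option 2: kill the stuck, unauthenticated connection on the master itself,
    # which forces Sentinel to reconnect and re-run AUTH.
    redis-cli -h 10.213.12.206 -p 23660 -a "${redis_password}" client kill addr 10.213.12.97:51992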

I think the solution should be to drop the connection when a NOAUTH Authentication required reply is received. Do you know if that is possible, or does it need to be implemented? Another option is to drop the TCP connection if auth fails more than N times.

Comment From: samm-git

Part of the dump after I ran CLIENT KILL on the master to close the failed connection:

T 10.213.12.206:23660 -> 10.213.12.97:51992 [AP]
  -NOAUTH Authentication required...-NOAUTH Authentication required...


T 10.213.12.206:23660 -> 10.213.12.97:52016 [AP]
  *3..$7..message..$18..__sentinel__:hello..$96..10.213.12.93,5000,6f7de613d3447ed92e7e5d3615c2e820906af8c2,3909,redis-cpat,10.213.12.206,23660,0..


T 10.213.12.97:53372 -> 10.213.12.206:23660 [AP]
  *2..$4..AUTH..$40..123455a0274bc274ef7a1109df09d3c0315e9043..*3..$6..CLIENT..$7..SETNAME..$21..sentinel-b0ce305f-cmd..*1..$4..PING..*1..$4..INFO..*3..$7..PUBLISH.
.$18..__sentinel__:hello..$96..10.213.12.97,5000,b0ce305fa460fb4b738d5f0c559d91382701501a
  ,3909,redis-cpat,10.213.12.206,23660,0..


T 10.213.12.206:23660 -> 10.213.12.97:53372 [AP]
  +OK..+OK..+PONG..$3198..# Server..redis_version:4.0.11..redis_git_sha1:00000000..redis_git_dirty:0..redis_build_id:f7b13aa754d83881..redis_mode:standalone..os:Lin
ux 4.14.63-coreos x86_64..arch_bits:64..multiplexing_api:epoll..atomicvar_api:atomic-buil
  tin..gcc_version:6.4.0..process_id:9..run_id:3f6b29342055b94f2d1505aa1a4c28ced0522f26..tcp_port:6379..uptime_in_seconds:1320..uptime_in_days:0..hz:10..lru_clock:1
1785801..executable:/usr/local/bin/redis-server..config_file:/etc/redis-node.conf....# Cl
  ients..connected_clients:44..client_longest_output_list:0..client_biggest_input_buf:145..blocked_clients:0....# Memory..used_memory:306395440..used_memory_human:2
92.20M..used_memory_rss:354381824..used_memory_rss_human:337.96M..used_memory_peak:306679
  928..used_memory_peak_human:292.47M..used_memory_peak_perc:99.91%..used_memory_overhead:3183708..used_memory_startup:786584..used_memory_dataset:303211732..used_m
emory_dataset_perc:99.22%..total_system_memory:64416129024..total_system_memory_human:59.
  99G..used_memory_lua:37888..used_memory_lua_human:37.00K..maxmemory:943718400..maxmemory_human:900.00M..maxmemory_policy:allkeys-lru..mem_fragmentation_ratio:1.16
..mem_allocator:jemalloc-4.0.3..active_defrag_running:0..lazyfree_pending_objects:0....#
  Persistence..loading:0..rdb_changes_since_last_save:438..rdb_bgsave_in_progress:0..rdb_last_save_time:1538511243..rdb_last_bgsave_status:ok..rdb_last_bgsave_time_
sec:1..rdb_current_bgsave_time_sec:-1..rdb_last_cow_size:217354240..aof_enabled:1..aof_re
  write_in_progress:0..aof_rewrite_scheduled:0..aof_last_rewrite_time_sec:1..aof_current_rewrite_time_sec:-1..aof_last_bgrewrite_status:ok..aof_last_write_status:ok
..aof_last_cow_size:100749312..aof_current_size:274435916..aof_base_size:274394808..aof_p

Comment From: samm-git

I think that merging https://github.com/antirez/redis/pull/1241 would solve this issue.

Update: no, because it is not about the NOAUTH reply. However, a similar approach could be implemented for NOAUTH.

Comment From: samm-git

To add some details: I am adding masters dynamically from a script:

    eval "redis-cli ${sentinel} sentinel monitor ${name} ${master_ip} ${master_port} ${quorum_size}"
    eval "redis-cli ${sentinel} sentinel set ${name} auth-pass \"\${redis_password}\""
    eval "redis-cli ${sentinel} sentinel set ${name} down-after-milliseconds 60000"
    eval "redis-cli ${sentinel} sentinel set ${name} failover-timeout 120000"
   eval "redis-cli ${sentinel} sentinel set ${name} parallel-syncs 1"

I think a race between sentinel monitor and sentinel set ${name} auth-pass could be a contributing factor here.
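
If that race is indeed the cause, one defensive option (just a sketch, reusing the same variables as the script above) is to force Sentinel to re-establish its connections once auth-pass is in place:

    # Appending this after the auth-pass step makes Sentinel drop and
    # re-open its connections to the master using the configured password.
    eval "redis-cli ${sentinel} sentinel reset ${name}"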

Comment From: WiFeng

I think you should also describe the background of the problem, e.g.: which node is the master? Which node is the Sentinel? How do you monitor the sdown? Is the Sentinel node sdown, or is the master node sdown from the point of view of the Sentinel?

Under which conditions does it reappear regularly?

Comment From: samm-git

@WiFeng I think the problem here is very clear; in fact, there are 2 problems.

  1. When Sentinel masters are configured through the API and not from a file, there is a race between the master configuration and the password setup, so Sentinel can start to connect without a password.
  2. Due to bad auth-failure handling in Sentinel, this failure marks a healthy master as sdown forever; a Sentinel restart or SENTINEL RESET <master> is required to make it healthy again.

Comment From: WiFeng

@samm-git I understand the problem now, and suggest that you first configure the master/slaves and set the password, and only then start the Sentinel node.
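
For completeness, a sketch of that ordering as a static configuration (file contents only; the master name, address, and quorum come from this thread and are placeholders for other setups), so the password is known before Sentinel ever opens a connection to the master:

    # sentinel.conf fragment, written before the Sentinel process is started,
    # so there is no window in which Sentinel connects without a password.
    sentinel monitor redis-cpat 10.213.12.206 23660 2
    sentinel auth-pass redis-cpat <password>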

Fixing the problem by optimizing the Redis code would be easy. But actually it is not a bug, so @antirez needs to decide whether it should be optimized.

Comment From: samm-git

@WiFeng Of course it is a bug, or to be more correct, a combination of 2 bugs.

  1. In my configuration I am defining Redis clusters using the API and not a static file. It is not possible to define the password before, or at the same time as, defining the master.
  2. As you can see from the trace, in case of AUTH failures it never tries to reconnect but marks the master as down, so the bug will never self-heal, even once the password is configured.

Comment From: samm-git

@WiFeng One possible option could be to drop the connection on the Redis server side after N auth failures (I think that is a good idea to prevent such loops anyway). Another would be to extend the sentinel monitor command to allow defining the password in it, to avoid this race.

Comment From: WiFeng

After set auth-pass, resetting the master should be enough; that could be done by fixing the Redis server code.


Comment From: WiFeng

As mentioned above, the code is here:

    void sentinelSetCommand(client *c) {

        ...

        } else if (!strcasecmp(option,"auth-pass") && moreargs > 0) {
            /* auth-pass <password> */
            char *value = c->argv[++j]->ptr;
            sdsfree(ri->auth_pass);
            ri->auth_pass = strlen(value) ? sdsnew(value) : NULL;

            /* Reset the master, then reconnect the master and slave nodes. */
            sentinelResetMaster(ri, SENTINEL_RESET_NO_SENTINELS);

            changes++;
        }

        ...
    }

Comment From: dongdaoguang

I am hitting the same problem (Redis version 4.0.14-0.1). Have you solved it? @samm-git

Comment From: samm-git

@dongdaoguang Only by changing the monitoring logic on my side and reconnecting if this race happens. Not a real solution, but a workaround.
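
Roughly, that workaround can look like the sketch below (names, port, and the password variable are assumptions, not the exact script): if Sentinel reports s_down but the master still answers an authenticated PING, treat it as the stuck-auth state and reset it.

    # redis-cli in non-interactive mode prints field names and values on
    # alternating lines, so the line after "flags" holds the master's flags.
    flags=$(redis-cli -p 5000 sentinel master redis-cpat | grep -A1 '^flags$' | tail -n 1)
    if echo "$flags" | grep -q s_down; then
        # The master still answers with the configured password, so s_down is stale.
        if redis-cli -h 10.213.12.206 -p 23660 -a "${redis_password}" ping | grep -q PONG; then
            redis-cli -p 5000 sentinel reset redis-cpat
        fi
    fi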

Comment From: dongdaoguang

@samm-git You can execute "SENTINEL reset {master-name}" as the last step:

    SENTINEL monitor mymaster 192.168.3.149 6379 2
    SENTINEL set mymaster auth-pass 123456
    SENTINEL reset mymaster

Comment From: samm-git

@dongdaoguang Thanks for the hint. Anyway, this was solved for me a long time ago, and I don't want to go into that code anymore. I still think it needs to be handled differently inside Redis.