Describe the bug
Sentinels are not able to elect a new master when a node is forcefully stopped and recreated with a new IP. This is a continuation of Bitnami chart issue 6165 and an attempt to understand why this issue happens.
To reproduce
Initial status:
Each pod runs a Redis, a Sentinel and a Metrics container:
NAME READY STATUS RESTARTS AGE IP
redis-0 3/3 Running 0 20m 172.30.165.5 <- MASTER
redis-1 3/3 Running 0 19m 172.30.193.94 <- SLAVE
redis-2 3/3 Running 0 17m 172.30.233.109 <- SLAVE
Sentinel on each pod reports the same information:
redis-0 -> master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=2,sentinels=3
redis-1 -> master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=2,sentinels=3
redis-2 -> master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=2,sentinels=3
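These are the Sentinel sections of INFO; a command along these lines should reproduce them (the sentinel container name and the lack of auth are assumptions based on my setup):
kubectl exec redis-0 -c sentinel -- redis-cli -p 26379 info sentinel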
Steps:
- Force a delete of the master:
kubectl delete pod redis-0 --force --grace-period=0
This causes the deletion of both the Redis AND the Sentinel container in the first pod (redis-0). (What follows contains suppositions that I'm not sure are 100% correct.) The first failover attempt (10:27:50.721) happens when the master reaches the odown state (~45 s after the redis-0 shutdown), but by then the slaves are no longer considered eligible (< 10:27:16), so Sentinel keeps retrying the failover against the old master that no longer exists (172.30.165.5). Below are the sentinel logs plus the extract of redis-cli targeting Sentinel:
# Termination of the master
redis-0 sentinel 1:signal-handler (1633861625) Received SIGTERM scheduling shutdown...
redis-0 sentinel 1:signal-handler (1633861625) Received SIGTERM scheduling shutdown...
redis-0 sentinel 1:X 10 Oct 2021 10:27:06.044 # User requested shutdown...
redis-0 sentinel 1:X 10 Oct 2021 10:27:06.044 # Sentinel is now ready to exit, bye bye...
- redis-0
# Recreation of redis-0 with the same name as old master (redis-0 -> redis-0)
# but with different ip (172.30.165.5 -> 172.30.165.20)
+ redis-0 › sentinel
redis-0 sentinel 10:27:10.47 WARN ==> redis-headless.svc.cluster.local does not contain the IP of this pod: 172.30.165.20
redis-0 sentinel 10:27:15.49 INFO ==> redis-headless.svc.cluster.local has my IP: 172.30.165.20
redis-0 sentinel 10:27:15.56 INFO ==> Cleaning sentinels in sentinel node: 172.30.193.94
redis-1 sentinel 1:X 10 Oct 2021 10:27:15.586 # +reset-master master primary 172.30.165.5 6379
redis-0 sentinel 1
redis-1 sentinel 1:X 10 Oct 2021 10:27:16.614 * +sentinel sentinel e9e17c86b01fa230c75b61d56962d49dd220fbe0 172.30.233.109 26379 @ primary 172.30.165.5 6379
#######################
## Output added from redis-cli
# slave count goes from 2 to 0 (?)
# sentinel count goes from 3 to 2
redis-0 sentinel 10 Oct 2021 10:27:16 command terminated with exit code 1
redis-1 sentinel 10 Oct 2021 10:27:16 master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=0,sentinels=2
redis-2 sentinel 10 Oct 2021 10:27:16 master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=0,sentinels=2
#######################
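To double check that the replica list was really emptied at this point, a surviving sentinel can be queried directly; a sketch (container name assumed):
kubectl exec redis-1 -c sentinel -- redis-cli -p 26379 sentinel replicas primary   # returns an empty list from here on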
redis-0 sentinel 10:27:20.59 INFO ==> Cleaning sentinels in sentinel node: 172.30.233.109
redis-2 sentinel 1:X 10 Oct 2021 10:27:20.603 # +reset-master master primary 172.30.165.5 6379
redis-0 sentinel 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:20.678 * +sentinel sentinel 246f443a73b8d893e94d3393579f889ca539ec8d 172.30.193.94 26379 @ primary 172.30.165.5 6379
redis-0 sentinel 10:27:25.60 INFO ==> Sentinels clean up done
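According to the Redis docs, +reset-master is what SENTINEL RESET produces, and a reset discards the known replicas and sentinels of that master, which would explain the slave count dropping from 2 to 0 at 10:27:16. My guess is that the chart's clean-up step runs something equivalent to:
redis-cli -h 172.30.193.94 -p 26379 sentinel reset primary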
redis-1 sentinel 1:X 10 Oct 2021 10:27:45.609 # +sdown master primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.631 # +sdown master primary 172.30.165.5 6379
# Master down detected
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.721 # +odown master primary 172.30.165.5 6379 #quorum 2/2
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.721 # +new-epoch 1
# First failover
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.721 # +try-failover master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.727 # +new-epoch 1
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.729 # +vote-for-leader e9e17c86b01fa230c75b61d56962d49dd220fbe0 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.724 # +vote-for-leader e9e17c86b01fa230c75b61d56962d49dd220fbe0 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.730 # 246f443a73b8d893e94d3393579f889ca539ec8d voted for e9e17c86b01fa230c75b61d56962d49dd220fbe0 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.814 # +elected-leader master primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.814 # +failover-state-select-slave master primary 172.30.165.5 6379
# No slaves available. We had two before 10:27:16
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.867 # -failover-abort-no-good-slave master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.931 # +odown master primary 172.30.165.5 6379 #quorum 2/2
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.931 # Next failover delay: I will not start a failover before Sun Oct 10 10:33:51 2021
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.967 # Next failover delay: I will not start a failover before Sun Oct 10 10:33:50 2021
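The ~6 minute delay matches the documented behaviour that a Sentinel will not retry a failover on the same master before 2 x failover-timeout; assuming the built-in default applies here (the chart override was removed, see Additional information):
sentinel failover-timeout primary 180000   # default; 2 x 180000 ms = 6 min before the next attempt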
redis-0 sentinel Could not connect to Redis at 172.30.165.5:26379: Connection timed out
#######################
## Output added from redis-cli
# master status goes from ok to odown
redis-0 sentinel 1:X 10 Oct 2021 10:28:52 command terminated with exit code 1
redis-1 sentinel 1:X 10 Oct 2021 10:28:52 master0:name=primary,status=odown,address=172.30.165.5:6379,slaves=0,sentinels=2
redis-2 sentinel 1:X 10 Oct 2021 10:28:52 master0:name=primary,status=odown,address=172.30.165.5:6379,slaves=0,sentinels=2
#######################
redis-2 sentinel 1:X 10 Oct 2021 10:29:35.470 # +reset-master master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:29:35.848 # -odown master primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:29:37.542 * +sentinel sentinel 246f443a73b8d893e94d3393579f889ca539ec8d 172.30.193.94 26379 @ primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:29:40.487 # +reset-master master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:29:41.647 * +sentinel sentinel e9e17c86b01fa230c75b61d56962d49dd220fbe0 172.30.233.109 26379 @ primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:30:05.531 # +sdown master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.574 # +sdown master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.641 # +odown master primary 172.30.165.5 6379 #quorum 2/2
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.641 # +new-epoch 2
# Trying to failover to the old master
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.641 # +try-failover master primary 172.30.165.5 6379
# infinite retries of the failover against the old master (?)
# ...
- redis-0 is not able to rejoin anymore, since Sentinel has not elected a new master:
NAME READY STATUS RESTARTS AGE IP
redis-0 1/3 Running 2 2m44s 172.30.165.20
redis-1 3/3 Running 0 38m 172.30.193.94
redis-2 3/3 Running 0 36m 172.30.233.109
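A quick way to see why the recreated redis-0 cannot rejoin is to ask a surviving sentinel which address it still advertises as master; since no failover succeeded it should still answer with the dead 172.30.165.5 (container name assumed):
kubectl exec redis-1 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name primary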
Expected behavior
Sentinel should be able to elect one of the slaves as the new master.
Lowering down-after-milliseconds and failover-timeout to a value near 5 s makes the slave election possible. What worries me is what happens if the failover fails because of an occasional problem (e.g. a network disruption): will Sentinel retry the failover to the slaves? If speculation (2.) is right, we have only < 10 s to elect a slave and retries are limited. I would expect Sentinel to retry indefinitely (after down-after-milliseconds + failover-timeout) against the slaves if the master is not available (e.g. a network disruption can last more than ~10 s).
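For reference, "a value near 5 s" translates to something like this in the sentinel configuration (exact values approximate):
sentinel down-after-milliseconds primary 5000
sentinel failover-timeout primary 5000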
Additional information
Redis version: 6.0.14
I deleted this Sentinel configuration provided by the Bitnami chart, in order to use the default Sentinel values:
sentinel down-after-milliseconds primary 60000
sentinel failover-timeout primary 18000
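For comparison, the built-in defaults that apply once these lines are removed should be (per the Redis 6.0 documentation):
sentinel down-after-milliseconds primary 30000
sentinel failover-timeout primary 180000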
Comment From: jsecchiero
Never mind, it seems that the slave cleanup occurs because of the internal logic of the Bitnami Redis chart, which is fixed here