Describe the bug
Sentinels are not able to elect a new master when a node is forcefully stopped and recreated with a new IP. This is a continuation of Bitnami chart issue 6165 and an attempt to understand why this issue happens.
To reproduce
Initial status:
Each pod runs a Redis, a Sentinel and a Metrics container:
NAME READY STATUS RESTARTS AGE IP
redis-0 3/3 Running 0 20m 172.30.165.5 <- MASTER
redis-1 3/3 Running 0 19m 172.30.193.94 <- SLAVE
redis-2 3/3 Running 0 17m 172.30.233.109 <- SLAVE
Sentinel on each pod reports the same information:
redis-0 -> master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=2,sentinels=3
redis-1 -> master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=2,sentinels=3
redis-2 -> master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=2,sentinels=3
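These are the Sentinel sections of INFO; a command along these lines should reproduce them (the sentinel container name and the lack of auth are assumptions based on my setup):
kubectl exec redis-0 -c sentinel -- redis-cli -p 26379 info sentinel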
Steps:
- Force a delete of the master:
kubectl delete pod redis-0 --force --grace-period=0
This causes the deletion of both the Redis AND the Sentinel container in the first pod (redis-0). (What follows contains suppositions that I'm not sure are 100% correct.) The first failover attempt (10:27:50.721) happens when the master reaches the odown state (~45 s after the redis-0 shutdown), but by then the slaves are no longer considered eligible (< 10:27:16), so Sentinel keeps retrying the failover against the old master that no longer exists (172.30.165.5). Below are the sentinel logs plus the extract of redis-cli targeting Sentinel:
# Termination of the master
redis-0 sentinel 1:signal-handler (1633861625) Received SIGTERM scheduling shutdown...
redis-0 sentinel 1:signal-handler (1633861625) Received SIGTERM scheduling shutdown...
redis-0 sentinel 1:X 10 Oct 2021 10:27:06.044 # User requested shutdown...
redis-0 sentinel 1:X 10 Oct 2021 10:27:06.044 # Sentinel is now ready to exit, bye bye...
- redis-0
# Recreation of redis-0 with the same name as old master (redis-0 -> redis-0)
# but with different ip (172.30.165.5 -> 172.30.165.20)
+ redis-0 › sentinel
redis-0 sentinel 10:27:10.47 WARN ==> redis-headless.svc.cluster.local does not contain the IP of this pod: 172.30.165.20
redis-0 sentinel 10:27:15.49 INFO ==> redis-headless.svc.cluster.local has my IP: 172.30.165.20
redis-0 sentinel 10:27:15.56 INFO ==> Cleaning sentinels in sentinel node: 172.30.193.94
redis-1 sentinel 1:X 10 Oct 2021 10:27:15.586 # +reset-master master primary 172.30.165.5 6379
redis-0 sentinel 1
redis-1 sentinel 1:X 10 Oct 2021 10:27:16.614 * +sentinel sentinel e9e17c86b01fa230c75b61d56962d49dd220fbe0 172.30.233.109 26379 @ primary 172.30.165.5 6379
#######################
## Output added from redis-cli
# slave count goes from 2 to 0 (?)
# sentinel count goes from 3 to 2
redis-0 sentinel 10 Oct 2021 10:27:16 command terminated with exit code 1
redis-1 sentinel 10 Oct 2021 10:27:16 master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=0,sentinels=2
redis-2 sentinel 10 Oct 2021 10:27:16 master0:name=primary,status=ok,address=172.30.165.5:6379,slaves=0,sentinels=2
#######################
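To double check that the replica list was really emptied at this point, a surviving sentinel can be queried directly; a sketch (container name assumed):
kubectl exec redis-1 -c sentinel -- redis-cli -p 26379 sentinel replicas primary   # returns an empty list from here on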
redis-0 sentinel 10:27:20.59 INFO ==> Cleaning sentinels in sentinel node: 172.30.233.109
redis-2 sentinel 1:X 10 Oct 2021 10:27:20.603 # +reset-master master primary 172.30.165.5 6379
redis-0 sentinel 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:20.678 * +sentinel sentinel 246f443a73b8d893e94d3393579f889ca539ec8d 172.30.193.94 26379 @ primary 172.30.165.5 6379
redis-0 sentinel 10:27:25.60 INFO ==> Sentinels clean up done
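According to the Redis docs, +reset-master is what SENTINEL RESET produces, and a reset discards the known replicas and sentinels of that master, which would explain the slave count dropping from 2 to 0 at 10:27:16. My guess is that the chart's clean-up step runs something equivalent to:
redis-cli -h 172.30.193.94 -p 26379 sentinel reset primary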
redis-1 sentinel 1:X 10 Oct 2021 10:27:45.609 # +sdown master primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.631 # +sdown master primary 172.30.165.5 6379
# Master down detected
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.721 # +odown master primary 172.30.165.5 6379 #quorum 2/2
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.721 # +new-epoch 1
# First failover
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.721 # +try-failover master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.727 # +new-epoch 1
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.729 # +vote-for-leader e9e17c86b01fa230c75b61d56962d49dd220fbe0 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.724 # +vote-for-leader e9e17c86b01fa230c75b61d56962d49dd220fbe0 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.730 # 246f443a73b8d893e94d3393579f889ca539ec8d voted for e9e17c86b01fa230c75b61d56962d49dd220fbe0 1
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.814 # +elected-leader master primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.814 # +failover-state-select-slave master primary 172.30.165.5 6379
# No slaves available. We had two before 10:27:16
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.867 # -failover-abort-no-good-slave master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.931 # +odown master primary 172.30.165.5 6379 #quorum 2/2
redis-1 sentinel 1:X 10 Oct 2021 10:27:50.931 # Next failover delay: I will not start a failover before Sun Oct 10 10:33:51 2021
redis-2 sentinel 1:X 10 Oct 2021 10:27:50.967 # Next failover delay: I will not start a failover before Sun Oct 10 10:33:50 2021
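The ~6 minute delay matches the documented behaviour that a Sentinel will not retry a failover on the same master before 2 x failover-timeout; assuming the built-in default applies here (the chart override was removed, see Additional information):
sentinel failover-timeout primary 180000   # default; 2 x 180000 ms = 6 min before the next attempt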
redis-0 sentinel Could not connect to Redis at 172.30.165.5:26379: Connection timed out
#######################
## Output added from redis-cli
# master status goes from ok to odown
redis-0 sentinel 1:X 10 Oct 2021 10:28:52 command terminated with exit code 1
redis-1 sentinel 1:X 10 Oct 2021 10:28:52 master0:name=primary,status=odown,address=172.30.165.5:6379,slaves=0,sentinels=2
redis-2 sentinel 1:X 10 Oct 2021 10:28:52 master0:name=primary,status=odown,address=172.30.165.5:6379,slaves=0,sentinels=2
#######################
redis-2 sentinel 1:X 10 Oct 2021 10:29:35.470 # +reset-master master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:29:35.848 # -odown master primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:29:37.542 * +sentinel sentinel 246f443a73b8d893e94d3393579f889ca539ec8d 172.30.193.94 26379 @ primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:29:40.487 # +reset-master master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:29:41.647 * +sentinel sentinel e9e17c86b01fa230c75b61d56962d49dd220fbe0 172.30.233.109 26379 @ primary 172.30.165.5 6379
redis-2 sentinel 1:X 10 Oct 2021 10:30:05.531 # +sdown master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.574 # +sdown master primary 172.30.165.5 6379
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.641 # +odown master primary 172.30.165.5 6379 #quorum 2/2
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.641 # +new-epoch 2
# Trying to failover to the old master
redis-1 sentinel 1:X 10 Oct 2021 10:30:10.641 # +try-failover master primary 172.30.165.5 6379
# infinite retries of the failover against the old master (?)
# ...
- redis-0 is not able to rejoin anymore, since Sentinel has not elected a new master:
NAME READY STATUS RESTARTS AGE IP
redis-0 1/3 Running 2 2m44s 172.30.165.20
redis-1 3/3 Running 0 38m 172.30.193.94
redis-2 3/3 Running 0 36m 172.30.233.109
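A quick way to see why the recreated redis-0 cannot rejoin is to ask a surviving sentinel which address it still advertises as master; since no failover succeeded it should still answer with the dead 172.30.165.5 (container name assumed):
kubectl exec redis-1 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name primary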
Expected behavior
Sentinel should be able to elect one of the slaves as the new master.
Lowering down-after-milliseconds and failover-timeout to a value near 5 s makes the slave election possible. What worries me is what happens if the failover fails because of an occasional problem (e.g. a network disruption): will Sentinel retry the failover to the slaves? If speculation (2.) is right, we have only < 10 s to elect a slave and retries are limited. I would expect Sentinel to retry indefinitely (after down-after-milliseconds + failover-timeout) against the slaves if the master is not available (e.g. a network disruption can last more than ~10 s).
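For reference, "a value near 5 s" translates to something like this in the sentinel configuration (exact values approximate):
sentinel down-after-milliseconds primary 5000
sentinel failover-timeout primary 5000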
Additional information
Redis version: 6.0.14
I deleted this Sentinel configuration provided by the Bitnami chart, in order to use the default Sentinel values:
sentinel down-after-milliseconds primary 60000
sentinel failover-timeout primary 18000
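For comparison, the built-in defaults that apply once these lines are removed should be (per the Redis 6.0 documentation):
sentinel down-after-milliseconds primary 30000
sentinel failover-timeout primary 180000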
Comment From: jsecchiero
Never mind, it seems that the slave cleanup occurs because of the internal logic of the Bitnami Redis chart, which is fixed here