Hello,
I have a very strange issue here. I set up node1 and node2 as a Redis master/replica pair, and Sentinel is running on node1, node2, node3, and node4. Whenever I shut down 2 nodes, say node2 (the current master) and node4, the sentinel on node1 is voted leader but it cannot promote node1 to be the new master. Even with debug logging, all I can see is this error: `41444:X 22 Apr 2022 14:06:08.905 # -failover-abort-not-elected master tracking_master 10.231.252.233 6379`
But if the sentinel on another node (node3) is voted leader, it can promote node1 to master without issue.
The other direction works: when I shut down node1 and node3, failover to node2 happens without issue.
Has anyone seen the same issue and can point me in the right direction? Thanks!
Comment From: moticless
Hi @phamlankt, it would help if you shared the configuration of all sentinels. Please note that it is recommended to have an odd number of sentinels, usually 3 or 5, and to configure a quorum of 2 or 3 correspondingly (to avoid the split-brain problem).
Comment From: phamlankt
Hi @moticless: thanks for the recommendation. We will add a fifth sentinel node and increase the quorum to 3. Please look at the configuration of all sentinels and let me know if you see anything wrong.

```
# node 1
daemonize yes
pidfile "/var/run/sentinel/redis-sentinel.pid"
logfile "/var/log/redis/redis-sentinel.log"
protected-mode no
port 26379
sentinel myid 94e1e1e3aa1e3e4fd596d47237c7982738f41124
sentinel deny-scripts-reconfig yes
sentinel monitor tracking_master 192.168.0.232 6379 2
dir "/var/lib/redis"
# Generated by CONFIG REWRITE
user default on nopass ~* +@all
sentinel down-after-milliseconds tracking_master 5000
sentinel config-epoch tracking_master 2
sentinel leader-epoch tracking_master 2
sentinel known-replica tracking_master 192.168.0.233 6379
sentinel known-sentinel tracking_master 192.168.0.241 26379 f0c2aa361282dc501bbef6db40eee9b51f3654e2
sentinel known-sentinel tracking_master 192.168.0.233 26379 ab112c57c3d8e15959f8956198d96966e6ffc40a
sentinel known-sentinel tracking_master 192.168.0.240 26379 a1fd4e82287fe2622a9939ca08bbe0deb42d818c
sentinel current-epoch 2

# node 2
daemonize yes
pidfile "/var/run/sentinel/redis-sentinel.pid"
logfile "/var/log/redis/redis-sentinel.log"
protected-mode no
port 26379
sentinel myid ab112c57c3d8e15959f8956198d96966e6ffc40a
sentinel deny-scripts-reconfig yes
sentinel monitor tracking_master 192.168.0.233 6379 2
dir "/var/lib/redis"
# Generated by CONFIG REWRITE
user default on nopass ~* +@all
sentinel down-after-milliseconds tracking_master 5000
sentinel config-epoch tracking_master 1
sentinel leader-epoch tracking_master 1
sentinel known-replica tracking_master 192.168.0.232 6379
sentinel known-sentinel tracking_master 192.168.0.240 26379 a1fd4e82287fe2622a9939ca08bbe0deb42d818c
sentinel known-sentinel tracking_master 192.168.0.241 26379 f0c2aa361282dc501bbef6db40eee9b51f3654e2
sentinel known-sentinel tracking_master 192.168.0.232 26379 94e1e1e3aa1e3e4fd596d47237c7982738f41124
sentinel current-epoch 1

# node 3
daemonize yes
pidfile "/var/run/sentinel/redis-sentinel.pid"
logfile "/var/log/redis/redis-sentinel.log"
protected-mode no
port 26379
sentinel myid a1fd4e82287fe2622a9939ca08bbe0deb42d818c
sentinel deny-scripts-reconfig yes
sentinel monitor tracking_master 192.168.0.232 6379 2
dir "/var/lib/redis"
# Generated by CONFIG REWRITE
user default on nopass ~* +@all
sentinel down-after-milliseconds tracking_master 5000
sentinel config-epoch tracking_master 2
sentinel leader-epoch tracking_master 2
sentinel known-replica tracking_master 192.168.0.233 6379
sentinel known-sentinel tracking_master 192.168.0.232 26379 94e1e1e3aa1e3e4fd596d47237c7982738f41124
sentinel known-sentinel tracking_master 192.168.0.233 26379 ab112c57c3d8e15959f8956198d96966e6ffc40a
sentinel known-sentinel tracking_master 192.168.0.241 26379 f0c2aa361282dc501bbef6db40eee9b51f3654e2
sentinel current-epoch 2

# node 4
daemonize yes
pidfile "/var/run/sentinel/redis-sentinel.pid"
logfile "/var/log/redis/redis-sentinel.log"
protected-mode no
port 26379
sentinel myid f0c2aa361282dc501bbef6db40eee9b51f3654e2
sentinel deny-scripts-reconfig yes
sentinel monitor tracking_master 192.168.0.233 6379 2
dir "/var/lib/redis"
# Generated by CONFIG REWRITE
user default on nopass ~* +@all
sentinel down-after-milliseconds tracking_master 5000
sentinel config-epoch tracking_master 1
sentinel leader-epoch tracking_master 1
sentinel known-replica tracking_master 192.168.0.232 6379
sentinel known-sentinel tracking_master 192.168.0.233 26379 ab112c57c3d8e15959f8956198d96966e6ffc40a
sentinel known-sentinel tracking_master 192.168.0.232 26379 94e1e1e3aa1e3e4fd596d47237c7982738f41124
sentinel known-sentinel tracking_master 192.168.0.240 26379 a1fd4e82287fe2622a9939ca08bbe0deb42d818c
sentinel current-epoch 1
```
Comment From: moticless
Based on the current-epoch values, I suspect this relates to the even number of sentinels. Let's see whether an odd number of sentinels resolves the problem.
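One way to check this directly is SENTINEL CKQUORUM, which reports whether the currently reachable sentinels are enough both for the quorum and for the majority needed to authorize a failover. A minimal sketch, assuming the port 26379 from your setup; run it against any live sentinel:

```
redis-cli -p 26379 SENTINEL CKQUORUM tracking_master
```

Running this while node2 and node4 are down should report that the majority needed to authorize the failover cannot be reached.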
Comment From: phamlankt
I have read this again: "However the quorum is only used to detect the failure. In order to actually perform a failover, one of the Sentinels need to be elected leader for the failover and be authorized to proceed. This only happens with the vote of the majority of the Sentinel processes." If that is the case, then after shutting down 2 nodes (including the master), the other two sentinels detect the failure and try to vote for a leader, but no leader is actually elected because there is no majority (in my case, at least 3 votes are needed). Is that correct? That would explain why no failover to node1 could happen.
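If my understanding is right, the arithmetic works out as sketched below; note that the majority is computed over all known sentinels, not just the live ones (port 26379 assumed from the configs above):

```
# majority = floor(N / 2) + 1 over ALL known sentinels
#   4 sentinels, 2 down -> majority = 3, only 2 alive -> no leader, failover aborts
#   5 sentinels, 2 down -> majority = 3, 3 alive      -> leader can be elected
# How many peers a given sentinel knows about:
redis-cli -p 26379 SENTINEL master tracking_master | grep -A1 num-other-sentinels
```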
Comment From: moticless
Consider the command `sentinel monitor <master-group-name> <ip> <port> <quorum>`. Based on your configuration above, you set the quorum to 2, meaning the agreement of only two instances is enough to declare the master down and attempt a failover, which can lead to a few side effects. We had better put our effort into validating a configuration with 3 or 5 sentinels and a quorum of 2 or 3 correspondingly.
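For example, after adding a fifth sentinel, the monitor line could look like this (a sketch reusing your master name and IP; a quorum of 3 means a strict majority must agree before a failover is attempted):

```
# Hypothetical: 5 sentinels, quorum raised to 3
sentinel monitor tracking_master 192.168.0.232 6379 3
```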
Comment From: phamlankt
It looks to me like a quorum of 2 is enough only to detect the failure; electing a leader still needs a majority, and 2 (out of 4) is not enough in our case. So, following your recommendation, we added a fifth sentinel node and it resolved the problem. Thanks for that. One more thing: the retry interval between failover attempts is currently 6 minutes. Do you know which parameter can be tuned to make this shorter? I see the failover-timeout parameter in our current setup, but it is set to 180000 ms, which is only 3 minutes, not 6, so I am not sure:

```
37) "failover-timeout"
38) "180000"
```
Comment From: moticless
Good to know it is working now.
Please consider the first bullet of the failover-timeout description in sentinel.conf: ```
sentinel failover-timeout
Specifies the failover timeout in milliseconds. It is used in many ways:
- The time needed to re-start a failover after a previous failover was
already tried against the same master by a given Sentinel, is two
times the failover timeout.
- The time needed for a replica replicating to a wrong master according
to a Sentinel current configuration, to be forced to replicate
with the right master, is exactly the failover timeout (counting since
the moment a Sentinel detected the misconfiguration).
- The time needed to cancel a failover that is already in progress but
did not produce any configuration change (SLAVEOF NO ONE yet not
acknowledged by the promoted replica).
- The maximum time a failover in progress waits for all the replicas to be
reconfigured as replicas of the new master. However even after this time
the replicas will be reconfigured by the Sentinels anyway, but not with
the exact parallel-syncs progression as specified.
Default is 3 minutes.
sentinel failover-timeout mymaster 180000
```
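Per the first bullet, the 6-minute retry you observed is two times failover-timeout: 2 × 180000 ms = 360000 ms = 6 minutes. If you want a shorter retry window, you could lower failover-timeout, for example at runtime with SENTINEL SET. A sketch, to be applied on every sentinel; the port is assumed from your configs, and 60000 ms is just an illustrative value giving a 2-minute retry window:

```
redis-cli -p 26379 SENTINEL SET tracking_master failover-timeout 60000
```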
Comment From: phamlankt
Great! Thank you very much! I will close the issue here.