Hi, I ran into a problem in production that confuses me. Version: 3.2.10.

One master went down and its slave started a failover, but the failover failed because all the other masters thought the failed master was still up.

I searched the code, and it confused me even more. The setup is three masters and three slaves.

My understanding of the code:

1. A slave has to wait until its failing master is in the FAIL state before it starts a failover.
2. A node is moved from PFAIL to FAIL only once a majority of masters consider it PFAIL. The master that makes this change then broadcasts it to every node it can reach, so that they mark the node as FAIL too.
3. So by the time the slave starts the failover, at least one live master must already have moved the node from PFAIL to FAIL, and at least one master should vote for the slave that wants to become master.
4. But in my case every live master refused to grant failover auth to the slave, believing its master was still up. This is what confuses me.
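To make point 2 concrete, here is a minimal sketch of the majority check as I understand it. It is not the actual source: the function and parameter names are made up for illustration, and it assumes that a node only promotes PFAIL to FAIL if it has itself timed out on the peer, and that a master counts its own view as one report.

```python
def needed_quorum(num_masters: int) -> int:
    """Smallest number of masters that forms a majority."""
    return num_masters // 2 + 1


def should_mark_fail(num_masters, i_see_pfail, i_am_master, reports_from_other_masters):
    """Decide whether this node may promote a peer from PFAIL to FAIL.

    Hypothetical model, not the real implementation.
    reports_from_other_masters: names of masters whose gossip said the
    peer is PFAIL (or FAIL).
    """
    if not i_see_pfail:
        return False  # assumption: this node must have timed out on the peer itself
    failures = len(reports_from_other_masters)
    if i_am_master:
        failures += 1  # assumption: a master counts its own view as one report
    return failures >= needed_quorum(num_masters)


# Three masters, so two reports are needed.
print(needed_quorum(3))
# A slave holding reports from both surviving masters reaches the quorum.
print(should_mark_fail(3, True, False, {"b", "c"}))
# A master with no report from another master does not.
print(should_mark_fail(3, True, True, set()))
```

Under these assumptions a slave can reach the quorum on its own, purely from gossip, without any master having set FAIL first.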

Could you help me? Thanks.

Comment From: yester354

For example:

1. There are three masters a, b, c and three slaves a1, b1, c1.
2. When master a goes down, slave a1 waits for b or c to broadcast that a has failed, so that a1 marks a as FAIL.
3. a1 receives the message, sets the FAIL flag on a, and then asks masters b and c to grant auth for the failover.
4. b and c think a is up and deny auth to a1, so the failover fails.

The precondition for a1's failover is that b or c has already broadcast a's FAIL state. So why do both b and c end up denying failover auth to a1 because they think a is up? According to the code, at least one master should grant auth to a1, even with a network partition. I may be misreading it, so please help. Thanks.
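The denial I am describing can be sketched like this. It is a simplified model with hypothetical names (`MasterView`, `grants_failover_auth`), and it leaves out the other checks a real voter makes, such as epochs and whether it has already voted. The assumption is that a voter only grants auth if it has the requester's master flagged FAIL itself, and that PFAIL is not enough.

```python
from dataclasses import dataclass, field


@dataclass
class MasterView:
    """What one voting master believes about other nodes (hypothetical model)."""
    name: str
    pfail_flagged: set = field(default_factory=set)  # locally suspected only
    fail_flagged: set = field(default_factory=set)   # confirmed FAIL


def grants_failover_auth(voter, failing_master, forced=False):
    """Return (granted, reason) for a vote request from a slave of failing_master.

    Assumption: only the FAIL flag counts; a PFAIL flag alone leads to a denial.
    """
    if failing_master not in voter.fail_flagged and not forced:
        return False, "its master is up"
    return True, "granted"


b = MasterView("b", pfail_flagged={"a"})
print(grants_failover_auth(b, "a"))  # b only suspects a, so it denies

b.fail_flagged.add("a")
print(grants_failover_auth(b, "a"))  # once a is FAIL on b, the vote is granted
```

So the question is really how a1 can consider a FAIL while b and c still have it only as PFAIL.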

Comment From: yester354

Maybe the situation is this:

1. Slave a1 exchanges messages with masters b and c, and both b and c have marked a as PFAIL. Since a majority of masters report PFAIL, a1 marks a as FAIL and starts the failover.
2. When the failover auth request reaches b and c, they have not yet communicated with each other, so each of them still has a only in the PFAIL state, and both deny auth to a1.
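Here is a toy replay of that ordering, just to check that the reasoning holds together. Everything in it is a simplified assumption: the `Node` class is hypothetical, gossip is modeled as a direct method call, and a slave that sets FAIL is assumed not to broadcast it to the masters.

```python
MASTERS = {"a", "b", "c"}
QUORUM = len(MASTERS) // 2 + 1  # 2 of 3


class Node:
    """Hypothetical model of one node's view of master a."""

    def __init__(self, name, is_master):
        self.name = name
        self.is_master = is_master
        self.sees_a_pfail = False  # local timeout on a
        self.reports = set()       # masters whose gossip said "a is PFAIL"
        self.a_failed = False      # FAIL flag for a

    def receive_gossip(self, sender):
        # Only reports coming from masters count toward the quorum.
        if sender.is_master and sender.sees_a_pfail:
            self.reports.add(sender.name)
        self.try_mark_fail()

    def try_mark_fail(self):
        if not self.sees_a_pfail:
            return
        count = len(self.reports) + (1 if self.is_master else 0)
        if count >= QUORUM:
            self.a_failed = True

    def vote(self):
        # Assumption: a voter grants auth only if it has a flagged FAIL.
        return self.a_failed


b, c, a1 = Node("b", True), Node("c", True), Node("a1", False)
for n in (b, c, a1):
    n.sees_a_pfail = True  # everyone has timed out on a locally

# Step 1: a1 hears from both b and c before they hear from each other.
a1.receive_gossip(b)
a1.receive_gossip(c)
votes_early = [b.vote(), c.vote()]

# Later: b and c finally exchange their reports about a.
b.receive_gossip(c)
c.receive_gossip(b)
votes_late = [b.vote(), c.vote()]

print(a1.a_failed, votes_early, votes_late)
# prints: True [False, False] [True, True]
```

In this model a1 reaches FAIL first and both early votes are denied; the same request would succeed once b and c have exchanged reports.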

I think this should be a rare situation, but I can reproduce it many times.

Comment From: capathida

I have seen this too. I believe the hard-coded delay of 500ms + rand%500 is sometimes too short. At least, the logs on the other masters suggest they had not gotten the master disconnect signal yet. I have seen this under load, for example. I wonder whether the hard-coded delay should be configurable, or whether the failover should be retried after a while when this happens.
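For reference, the delay I mean can be sketched as follows. The function name is hypothetical. The fixed 500 ms and the `rand % 500` jitter are the values mentioned above; the extra `rank * 1000` term is my recollection of how slaves of the same master are staggered and should be treated as an assumption.

```python
import random


def failover_start_delay_ms(rank=0, rng=random):
    """Delay before a slave asks for votes, in milliseconds (hypothetical sketch)."""
    fixed = 500                  # fixed wait so the FAIL state can propagate
    jitter = rng.randrange(500)  # stands in for rand % 500
    # Assumption: slaves with a less up-to-date replication offset wait longer.
    return fixed + jitter + rank * 1000


delays = [failover_start_delay_ms() for _ in range(5)]
print(delays)  # each value falls in [500, 1000) for rank 0
```

If the other masters need longer than this window to agree that the master is down, the first vote request arrives before any of them has set FAIL.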