Hi, I'm having some problems with slave failover. We have a cluster with 3 masters and 3 slaves: m1, m2, m3, s1, s2, s3.

When m1 went down, s1 was the first node to collect PFAIL reports from a majority of masters, so s1 executed markNodeAsFailingIfNeeded, but it did not send a FAIL message to the other nodes. On its next clusterCron, s1 saw m1 as FAIL and started a failover. But when s1 requested failover auth, all masters rejected it, because they had not received a FAIL message about m1 from s1. The masters kept gossiping with the other nodes until they had collected enough PFAIL reports to confirm that m1 was down. So s1 needed a second failover attempt to get votes from m2 and m3.

If cluster-node-timeout is 30s, this situation takes 30s for m1 to be flagged as PFAIL, then 60s for s1 to log "Failover attempt expired.", and another 60s for auth_retry_time. So s1 took about 150s to fail over, which is very long. This has happened many times in our production environment.

Why can't a slave send the FAIL message? I tried removing the check in markNodeAsFailingIfNeeded for whether the node is a master, so that any node can call clusterSendFail, and that solved the problem. Are there any problems with this change?

Comment From: caiyuxinggg

I think the solution is either to let the slave send the fail message about a node, or to never let a slave execute the "markNodeAsFailingIfNeeded" function.

Comment From: enjoy-binbin

It was fixed in the PR listed above, so I am closing this.