Case: we have a cluster that contains 7 master nodes (m1, m2, m3, m4, m5, m6, m7), and each master node has two slave nodes (m1-s1, m1-s2, m2-s1, m2-s2, m3-s1, m3-s2, m4-s1, m4-s2, m5-s1, m5-s2, m6-s1, m6-s2, m7-s1, m7-s2). Now m1 crashes. In the current recovery flow, the other master nodes detect the crash event, m1-s1 broadcasts a failover request to the cluster, the other master nodes vote, and then m1-s1 becomes the new master node.
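
A rough sketch of that vote count, just to illustrate the flow described above (the node names and the simple "majority of all masters" rule are my own simplifications, not the actual Redis implementation):

```python
# Sketch: cluster-wide failover vote in the current flow (simplified).
# A replica asks every reachable master for a vote; it is promoted only
# if it collects a majority of all masters. Names are illustrative.

MASTERS = ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]

def try_failover(candidate: str, failed_master: str, reachable_masters: set[str]) -> bool:
    """Return True if `candidate` can be promoted to replace `failed_master`."""
    votes = 0
    for m in MASTERS:
        if m == failed_master:
            continue  # the crashed master cannot vote
        if m in reachable_masters:
            votes += 1  # assume each healthy, reachable master grants its vote
    # Promotion requires a majority of ALL masters, not just the local shard.
    return votes >= len(MASTERS) // 2 + 1

# m1 crashed; m1-s1 can reach the six remaining masters and gets 6 >= 4 votes.
print(try_failover("m1-s1", "m1", {"m2", "m3", "m4", "m5", "m6", "m7"}))  # True
```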

My question: if we treat m1, m1-s1, m1-s2 as a local cluster, then when m1 crashes, why can't m1-s1 or m1-s2 broadcast the vote message only within this local cluster? m1-s1 would still get 2 votes (it votes for itself, and m1-s2 votes for m1-s1) and become master, so why must the master nodes vote?

Please help me figure it out, thanks a lot.

Comment From: madolson

Original answer from here: https://github.com/redis/redis/issues/9495

Great question! The rationale is that masters with assigned slots can almost always achieve consensus about who is in the quorum group and who is not. Having replicas vote runs into difficult situations when there is only a single replica per master: if the master and its replica become disconnected from each other, we can run into split-brain scenarios. The nice property of only having masters vote is that there is only ever one master per shard, so once it's demoted we know we have another live node that is replacing it.
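
To make the split-brain point concrete, here is a rough illustration (my own sketch, not Redis code) of what could happen if a shard with one master and one replica were allowed to decide failover purely locally:

```python
# Sketch: why per-shard voting with a single replica can split the brain.
# Assume a shard with one master (m1) and one replica (m1-s1), and a network
# partition that separates them while both stay alive.

def local_decision(node: str, peer_reachable: bool) -> str:
    """Each side decides on its own when it can no longer see its peer."""
    if not peer_reachable:
        # The replica assumes the master died and promotes itself;
        # the master assumes the replica died and keeps serving writes.
        return "master"
    return "master" if node == "m1" else "replica"

# Partition between m1 and m1-s1: both sides now believe they are the master,
# and both keep accepting writes -> split brain.
print(local_decision("m1", peer_reachable=False))     # "master"
print(local_decision("m1-s1", peer_reachable=False))  # "master"
```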

We are planning on iterating on this as part of our Cluster V2 project, see #8948, and want to build a more stable quorum group that doesn't rely on mastership. This is so that we can also apply this to non-clustered Redis configurations, exactly like what you're suggesting. Let me know if this helps!

Comment From: Roronoa-Zoro

Thanks @madolson, that really helps. I have a new question about "we can run into split brain scenarios when the master and replica are disconnected": how does this case happen?

Comment From: madolson

Sure. I was saying that if a shard has only a master and a slave, and there is a network split between the two, it's ambiguous which one should take over. You generally need at least 3 nodes to make a decision about failover, which is why designating a subset of nodes to make failover decisions usually works better.
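
A quick sketch of the majority arithmetic behind that (purely illustrative, with `total_voters` standing in for whichever nodes are allowed to decide):

```python
# Sketch: with only 2 voters, no partition leaves either side with a majority,
# so no safe failover decision can be made (or both sides decide -> split brain).
# With 3 or more voters, at most one side of a split can still reach majority.

def can_decide(voters_on_my_side: int, total_voters: int) -> bool:
    return voters_on_my_side >= total_voters // 2 + 1

# 2-node shard (master + replica), split 1/1: neither side can decide safely.
print(can_decide(1, 2))  # False on both sides

# 3 deciders, split 2/1: only the side with 2 voters can decide the failover.
print(can_decide(2, 3))  # True
print(can_decide(1, 3))  # False
```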