I haven't seen an issue for this yet, so I'm opening this now for discussion.
Purpose: Achieve quorum in small clusters with 1-2 shards.
Desired properties:
- Allow both voting and non-voting replicas in the same cluster.
- Backward compatibility with nodes that don't recognize voting replicas.
How:
- New config `allow-voting-replicas` which controls two things: whether a node is a voting replica and whether it counts votes from other voting replicas.
- A flag in the cluster bus that marks that a node counts voting replicas (a bit in `clusterMsg` and `clusterMsgDataGossip` and in `clusterNode`).
- An aux field in `nodes.conf`.
- Include voting replicas in `server.cluster->size`, which affects the quorum (see the sketch after this list).
- Count votes from voting replicas when receiving FAILOVER AUTH ACK.
- Count voting replicas in failure detection, i.e. when counting failure reports to mark a node as FAIL.
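A minimal sketch of what the quorum-size computation could look like, written as a standalone model rather than actual cluster.c code; the `is_voting_replica` flag and the `allow_voting_replicas` argument are placeholder names for the proposed flag and config:

```c
/* Standalone model of the proposed quorum-size rule; not actual cluster code. */
#include <stddef.h>

typedef struct node {
    int is_master;          /* node currently holds the master role */
    int numslots;           /* number of slots served (masters only) */
    int is_voting_replica;  /* placeholder for the proposed voting-replica flag */
} node;

/* With allow_voting_replicas == 0 this matches the current rule: only masters
 * serving at least one slot count towards the quorum. With it enabled, voting
 * replicas are counted as well. */
int cluster_quorum_size(const node *nodes, size_t count, int allow_voting_replicas) {
    int size = 0;
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].is_master && nodes[i].numslots > 0)
            size++;
        else if (allow_voting_replicas && nodes[i].is_voting_replica)
            size++;
    }
    return size;
}
```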
Two competing ways of counting votes
- A node without this feature counts the quorum (i.e. `server.cluster->size`) as before, and counts votes from masters only.
- A node with this feature (`allow-voting-replicas yes`) includes the voting replicas in the quorum and counts their votes.
Each replica decides (based on the config) how it counts the votes. If a node considers itself a winner, it promotes itself to master. Therefore, there is a risk that two nodes will consider themselves the winner with the same epoch and both promote themselves to master.
Ideally, all nodes in a cluster count votes and quorum in the same way, but in some setups, at least during rolling upgrade, there will be two ways of counting.
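To illustrate the divergence, here is a standalone sketch of how a single FAILOVER AUTH ACK could be accepted or ignored depending on the local setting; the names are placeholders, not the real implementation:

```c
/* Standalone model of accepting a vote from a FAILOVER AUTH ACK sender;
 * placeholder names, not the real cluster.c logic. */
typedef struct voter {
    int is_master;          /* sender currently holds the master role */
    int numslots;           /* slots served by the sender */
    int is_voting_replica;  /* placeholder for the proposed voting-replica flag */
} voter;

/* Returns 1 if the local node counts this sender's vote. An old node
 * (allow_voting_replicas == 0) counts only slot-owning masters; a new node
 * also counts voting replicas, so two candidate replicas with different
 * settings can tally the same set of ACKs differently. */
int vote_is_counted(const voter *sender, int allow_voting_replicas) {
    if (sender->is_master && sender->numslots > 0) return 1;
    if (allow_voting_replicas && sender->is_voting_replica) return 1;
    return 0;
}
```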
Worst case scenario
Two replicas A and B, with the same replica rank, initiate an election at roughly the same time. A majority of the masters vote for replica A, while, if the voting replicas are counted, there is a majority for replica B. Because A and B are configured differently, both replicas consider themselves the winner. Both promote themselves to master with the same epoch and broadcast PONG to the rest of the cluster.
When A and B discover they have the same epoch, the conflict resolution algorithm kicks in and favors one of them, say A, which bumps its epoch.
When other nodes discover that B claims the same slots as A and A has the greater epoch, they send an UPDATE to B. B sees that the exact same set of slots is owned by A, so B assumes that it was failed over by A and turns itself into a replica of A.
Any writes that were done to B while B considered itself a master are discarded, so there is potential data loss here.
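For reference, a standalone model of the epoch collision rule mentioned above, as described in the cluster spec (when two masters claim the same config epoch, the one with the lexicographically smaller node ID bumps its epoch); simplified, not the actual code:

```c
/* Simplified model of config-epoch collision resolution: the node with the
 * lexicographically smaller node ID increments the epoch and wins. */
#include <string.h>

#define NODE_NAMELEN 40

typedef struct epoch_state {
    char name[NODE_NAMELEN];           /* node ID */
    unsigned long long config_epoch;   /* epoch of this node's config */
    unsigned long long current_epoch;  /* local view of the cluster epoch */
} epoch_state;

void handle_config_epoch_collision(epoch_state *myself, const epoch_state *sender) {
    if (sender->config_epoch != myself->config_epoch) return;
    /* Act only if my node ID is lexicographically smaller than the sender's. */
    if (memcmp(sender->name, myself->name, NODE_NAMELEN) <= 0) return;
    myself->current_epoch++;
    myself->config_epoch = myself->current_epoch;
}
```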
How to prevent and mitigate this scenario?
Factors that mitigate the risk:
- Often, the result of the election is the same in both ways of counting.
- Replicas typically don't initiate an election at the same time. Each replica has a randomized delay before initiating an election.
To further mitigate the risk that a voting replica and a non-voting replica initiate an election at the same time, the voting replica can back off by giving itself a higher replica rank than the non-voting replica, if both nodes share the same replication offset.
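A standalone sketch of that rank back-off (the real rank computation orders a master's replicas by replication offset; the `is_voting` property is a placeholder):

```c
/* Standalone model of the proposed rank back-off: on equal replication
 * offsets, a voting replica ranks itself behind a non-voting sibling, so the
 * non-voting replica gets the shorter election delay. */
typedef struct replica {
    unsigned long long repl_offset;  /* replication offset */
    int is_voting;                   /* placeholder voting-replica property */
} replica;

/* siblings[] holds the other replicas of the same master (not self).
 * Rank 0 starts its election first; higher ranks wait longer. */
int replica_rank(const replica *self, const replica *siblings, int count) {
    int rank = 0;
    for (int i = 0; i < count; i++) {
        if (siblings[i].repl_offset > self->repl_offset) {
            rank++;
        } else if (siblings[i].repl_offset == self->repl_offset &&
                   self->is_voting && !siblings[i].is_voting) {
            rank++;  /* proposed back-off for the voting replica on a tie */
        }
    }
    return rank;
}
```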
When the two-winners scenario has happened, we can try to sort out the situation faster. The epoch conflict resolution algorithm can be adjusted so that if a node finds that it loses the conflict resolution and both nodes claim the exact same set of slots, the losing node turns itself into a replica of the winning node. There is no need to wait for UPDATE messages from other nodes to do this.
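A sketch of that adjustment as a standalone model; `become_replica_of` stands in for the reconfiguration step and the slot bitmaps are simplified:

```c
/* Standalone model of the proposed adjustment: if the local node loses the
 * epoch collision and both nodes claim exactly the same slots, demote to a
 * replica of the winner immediately instead of waiting for UPDATE messages. */
#include <string.h>

#define SLOTS_BYTES (16384 / 8)

typedef struct master_claim {
    char name[40];                     /* node ID */
    unsigned char slots[SLOTS_BYTES];  /* bitmap of claimed slots */
} master_claim;

/* Placeholder: in real code this would reconfigure replication. */
static void become_replica_of(const master_claim *winner) { (void)winner; }

void handle_lost_collision(const master_claim *myself, const master_claim *winner) {
    if (memcmp(myself->slots, winner->slots, SLOTS_BYTES) == 0) {
        /* Assume we were failed over by the winner; demote right away. */
        become_replica_of(winner);
    }
    /* Otherwise keep the existing behavior and wait for UPDATE messages. */
}
```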
Comment From: madolson
A minor aside is that the voting node need not be a replica; it could just be a passive observer (I think we call these masters without slots in the code).
One other point we need to decide on is whether we believe this needs to be available for classic cluster mode or if it should be reserved for cluster V2.
Comment From: zuiderkwast
We can allow "masters without slots" to vote too. That's really a minor thing and it doesn't really matter much to the design here.
This ticket is intended for cluster V1 (backward compatible). For V2 (flotilla) I assume voting is completely orthogonal to serving data and that V2 config is completely different, but I may be wrong. Maybe we should design the configuration so that it can apply to V2 as well in the future?
If we want this for 8.0, we can implement it as outlined here without too much effort. I assume V2 is not ready until perhaps 9.0 or WDYT?
Comment From: madolson
> This ticket is intended for cluster V1 (backward compatible). For V2 (flotilla) I assume voting is completely orthogonal to serving data and that V2 config is completely different, but I may be wrong. Maybe we should design the configuration so that it can apply to V2 as well in the future?
I'm guessing the configs will look very different between the two, I think we can safely ignore the configuration alignment for now.
> If we want this for 8.0, we can implement it as outlined here without too much effort. I assume V2 is not ready until perhaps 9.0 or WDYT?
The goal is to have it for 8.0; however, we definitely won't deprecate the classic cluster until at least 9.0 (and I think we're being very optimistic if we expect to practically deprecate it anytime soon). So there is definitely some longevity to implementing it now.
Comment From: srgsanky
Will this feature be strictly used on clusters with 1 or 2 shards? If we accidentally use this with a large cluster that has a lot of replicas per shard (a 100-shard cluster with 5 nodes in each shard), we will be increasing the number of messages necessary to perform a failover (in the example, the required majority grows from 51 of 100 masters to 251 of all 500 nodes).
If we limit this feature to 1-2 shard clusters, we may not have to worry about backward compatibility, as nodes with older engine versions will not try to elect primaries?
How will `server.cluster->size` be decremented during a rolling update when replicas are taken out of service? Is it the responsibility of the "admin" client to update all nodes in the cluster? Do we need to wait until CLUSTER FORGET information is received for the decommissioned replica?
Comment From: zuiderkwast
Good questions, @srgsanky.
You're right that too many voters is a problem in large clusters. Maybe it's better that it only works for 1-2 shard clusters. That's the use case we want to address.
`server.cluster->size` is problematic when replicas are counted, as you say, since a master is replaced by another one via failover, but replicas can just disappear. We have gossip for CLUSTER FORGET, but if a node is removed without CLUSTER FORGET, it is still counted indefinitely. Maybe we should exclude replicas with status FAIL from the quorum?
Comment From: madolson
Actually giving a more complete answer now. I don't think we can just naively put replicas in the quorum group. As was mentioned, replicas fail, and it's undesirable to have a quorum change without an associated epoch bump (which is what we have today: a quorum change comes with a failover or with slot migration adding or removing a shard), since we don't have consistency in voting in the quorum. We would also ideally keep it to an odd number of nodes where possible: if you have 2 masters and 2 replicas, you ideally only want 3 of the nodes voting.
I think at the very least the list of voting replicas should be in the epoch'd information. Like, there is a list of "voting replicas" which nodes broadcast out. We could add a config like "cluster-allowed-voters". If a replica sees the quorum size is below that config, it could nominate itself, and the nomination would have to be committed by the current quorum. We'll trust the operator to configure this correctly. One nice property is that we can keep the quorum size odd, which is preferred to prevent split-brain. If a node fails, it'll stay in the group until another node votes itself in or is removed.
Earlier I had considered that maybe we could have an external system set the replicas which can vote. This would be added as part of the epoch, so changes to this value are spread around the cluster with a higher epoch. It's less elegant, as it requires operators to dictate the set of nodes, and whenever a node fails they need to update that list.
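A tiny standalone sketch of the self-nomination rule described above (placeholder names; committing the change through the current quorum and the epoch bump are out of scope):

```c
/* Standalone model of the self-nomination rule: a non-voting replica proposes
 * itself only while the current voter set is smaller than the proposed
 * cluster-allowed-voters config. Placeholder names, not real code. */
typedef struct voter_view {
    int num_voters;      /* voters currently in the epoch'd voter list */
    int allowed_voters;  /* value of the proposed cluster-allowed-voters config */
} voter_view;

/* Returns 1 if this replica should nominate itself as a voter; the nomination
 * would still have to be committed by the current quorum. */
int should_nominate_self(const voter_view *view, int i_am_replica, int i_am_voter) {
    if (!i_am_replica || i_am_voter) return 0;
    return view->num_voters < view->allowed_voters;
}
```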
> Maybe we should exclude replicas with status FAIL from the quorum?
We would need consensus on the failure; if we think it failed but someone else doesn't think it's failed, we won't necessarily converge.
Comment From: zuiderkwast
> if you have 2 masters and 2 replicas, you ideally only want 3 of the nodes voting.
If one of the masters has crashed, there are only 3 nodes alive. If only 2 can vote, such a cluster can't do automatic failover. :scream_cat:
> I think at the very least the list of voting replicas should be in the epoch'd information. Like, there is a list of "voting replicas" which nodes broadcast out. We could add a config like "cluster-allowed-voters".
Interesting. So how about this: when a node bumps the epoch, it decides which nodes are included in the quorum. It adds a new ping extension which contains the list of node IDs of the voting-capable replicas, but only if there are fewer than 3 masters. This makes it compatible with nodes that don't support voting replicas.
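A sketch of what such an extension payload could look like (hypothetical layout and names, shown only to make the idea concrete; the cluster bus already has a generic ping-extension mechanism this could plug into):

```c
/* Hypothetical payload for a "voting replicas" ping extension; not an
 * existing message format. */
#include <stdint.h>

#define NODE_NAMELEN 40

typedef struct voting_replicas_ext {
    uint16_t count;                 /* number of node IDs that follow */
    char node_ids[][NODE_NAMELEN];  /* voting-capable replicas, chosen by
                                     * the node that last bumped the epoch */
} voting_replicas_ext;

/* Attach the extension only when there are fewer than 3 masters, so larger
 * clusters keep the current behavior and old nodes stay compatible. */
int should_send_voting_replicas_ext(int num_masters) {
    return num_masters < 3;
}
```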
> Maybe we should exclude replicas with status FAIL from the quorum?

> We would need consensus on the failure; if we think it failed but someone else doesn't think it's failed, we won't necessarily converge.
Good point that it's good to have consensus on the quorum. Just noting: Currently, when a master is added and a slot is migrated to it, it bumps the epoch and changes the quorum without consensus.
Comment From: madolson
> Currently, when a master is added and a slot is migrated to it, it bumps the epoch and changes the quorum without consensus.
Yeah, this can cause issues. I know it's been mentioned that this was done specifically for performance, even though it does add a time window in which we can end up divergent. We could harden this by requiring consensus during quorum group changes (adding your first slot OR removing your last slot).
Comment From: zuiderkwast
> We could harden this by requiring consensus during quorum group changes (adding your first slot OR removing your last slot).
How would we do that? A new consensus request/ack pair? We can't really reuse FAILOVER_AUTH_REQUEST, since it's only for replicas to request failover. A request is ignored by this `if`:
```c
/* Node must be a slave and its master down.
 * The master can be non failing if the request is flagged
 * with CLUSTERMSG_FLAG0_FORCEACK (manual failover). */
if (nodeIsMaster(node) || master == NULL ||
    (!nodeFailed(master) && !force_ack))
{
```
I don't think it's a performance problem if it's only done for the first and last slot of a node. Anyhow, this is a different topic. Shall I open a separate ticket to discuss it?