Describe the bug
With cluster mode enabled, when a replica attempts to campaign for leadership, it first bumps its local `currentEpoch`. But if the election is aborted because the current leader comes back alive, the replica never "un-bumps" its local `currentEpoch`, leaving it artificially bumped indefinitely. In this state, the `currentEpoch` on this replica is larger than the actual `configEpoch` of any node.
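The behavior can be summarized with a minimal Python sketch (a simplified model, not the actual Redis C source; the class and method names are hypothetical). The epoch is incremented before the election outcome is known, and the abort path leaves it untouched:

```python
# Hypothetical model of the replica's election bookkeeping, assuming
# currentEpoch has already converged to the cluster max (2 here).

class ReplicaState:
    def __init__(self, current_epoch, config_epoch):
        self.current_epoch = current_epoch  # cluster-wide logical clock
        self.config_epoch = config_epoch    # epoch of this node's shard config

    def start_failover_election(self):
        # The bump happens *before* the election outcome is known.
        self.current_epoch += 1
        return self.current_epoch           # the epoch this election runs under

    def abort_election(self):
        # The primary came back alive: the election is abandoned,
        # but nothing rolls current_epoch back.
        pass

replica = ReplicaState(current_epoch=2, config_epoch=0)
replica.start_failover_election()  # currentEpoch: 2 -> 3
replica.abort_election()
print(replica.current_epoch, replica.config_epoch)  # 3 0
```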
To reproduce
- Set up 2 shards. Shard 1 has a replica and owns some slots. After the cluster converged, `configEpoch` on nodes in shard 1 is `0` and `configEpoch` on nodes in shard 2 is `2`:
```
dev-dsk-nanya-2b-ea8f5b51 % redis-cli -p 6380 cluster nodes
fbeb558fa145246e9e557aada626acc373bcb921 10.189.116.208:6379@16379 master - 0 1644016176208 0 connected 1-3
85d1b8f04add3be3802b91c191ae563b889442b9 10.189.116.208:6380@16380 myself,slave fbeb558fa145246e9e557aada626acc373bcb921 0 1644016174000 0 connected
7bd610388aed471c1758d20f94cc97177071bb83 10.189.116.208:6381@16381 master - 0 1644016175248 2 connected
```
- Use `gdb` to pause the primaries in both shard 1 and shard 2, so that the replica in shard 1 starts campaigning for leadership but cannot win because it can't get enough votes. The replica in shard 1 bumps its local `currentEpoch` from `0` to `3`:
```
10502:S 04 Feb 2022 23:09:34.201 * Starting a failover election for epoch 3.
```
- Un-pause the primary in shard 1. The replica in shard 1 aborts the leader election, but its local `currentEpoch` stays at `3` indefinitely, while its shard's `configEpoch` correctly remains `0`:
```
dev-dsk-nanya-2b-ea8f5b51 % redis-cli -p 6380 cluster info
...
cluster_current_epoch:3
cluster_my_epoch:0
```
Expected behavior
If an ongoing election is aborted, the replica should "un-bump" its local `currentEpoch` to its previous value, or to whatever is currently the maximum `configEpoch` in the cluster.
Comment From: madolson
I'm not sure we can "un-bump" the `currentEpoch`. If for any reason a failover authorization is delayed, and the replica tries to nominate itself again, it would do so based on a previously used epoch and might receive messages confirming the failover even though those authorization messages were intended for a different election. The only mitigation is that we could bump the epoch consistently across the cluster even for failed elections.
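The hazard can be made concrete with a rough Python sketch (hypothetical, simplified vote accounting; not how Redis actually structures this code). If a failed election "un-bumps" `currentEpoch`, the same epoch number gets reused for a later election, and a delayed vote from the first election becomes indistinguishable from a vote for the second:

```python
# Hypothetical model: votes are matched to elections by epoch number alone.

class Replica:
    def __init__(self):
        self.current_epoch = 2
        self.auth_epoch = None   # epoch of the election in progress
        self.votes = 0

    def start_election(self):
        self.current_epoch += 1
        self.auth_epoch = self.current_epoch
        self.votes = 0

    def abort_and_unbump(self):
        # The unsafe rollback proposed in the issue.
        self.current_epoch -= 1
        self.auth_epoch = None

    def receive_auth_ack(self, ack_epoch):
        if ack_epoch == self.auth_epoch:
            self.votes += 1

r = Replica()
r.start_election()        # election #1 runs under epoch 3
delayed_ack = 3           # a voter's ack for election #1, delayed in flight
r.abort_and_unbump()      # currentEpoch rolled back to 2
r.start_election()        # election #2 *reuses* epoch 3
r.receive_auth_ack(delayed_ack)
assert r.votes == 1       # stale vote counted toward a different election
```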
Comment From: ny0312
After some thought, I agree that "un-bumping" the `currentEpoch` is not safe, for the reasons you laid out.
I also agree that the preferred behavior is to "bump the epoch consistently across the cluster even for failed elections". In other words, each epoch can have at most one leader, and is allowed to have no leader at all. This resembles the design of the Raft protocol.
And the Redis cluster bus election already does so: after a candidate replica bumps its `currentEpoch` but fails to get elected, it still propagates its bumped `currentEpoch` as part of outgoing gossip messages, and receivers update their own local `currentEpoch` accordingly.
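The convergence property can be sketched as a tiny Python model (an illustration of the rule, not the actual gossip implementation): receivers never lower their epoch, they only catch up to the max they have seen, so the whole cluster adopts the bumped epoch even though the election failed:

```python
# Hypothetical gossip model: each receiver takes the max of its own
# currentEpoch and the sender's.

def gossip(epochs, sender, receiver):
    epochs[receiver] = max(epochs[receiver], epochs[sender])

# currentEpoch per node after the failed election on "replica1"
epochs = {"primary1": 2, "replica1": 3, "primary2": 2}
for receiver in ("primary1", "primary2"):
    gossip(epochs, "replica1", receiver)

print(epochs)  # every node has converged to epoch 3
```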
So this issue is a result of my not fully understanding the design of the Redis cluster bus election protocol. I'm closing it. Thanks for the answer.