The problem/use-case that the feature addresses

Hi, I would like to get some feedback on a proposal for implementing a new strategy in primary-replica scenarios to help make failovers during planned events run more smoothly and be less prone to client/server issues.

With the current tooling that Redis provides for running failovers in a primary-replica scenario (an existing replica is promoted to primary, and the former primary is either demoted or removed for maintenance), clients are required to shift traffic from one server to the other within a specific time window. This can lead to issues like connection stampedes, where the new primary has to accept hundreds or even thousands of new connections in a short period.

These issues can be mitigated by configuring clients to retry failed connections or operations, with a retry algorithm that spreads the retries over time using techniques such as exponential back-off with jitter. Even so, there is still room for improvement.
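For reference, this is roughly what I mean by exponential back-off with jitter on the client side. It is only a minimal sketch of the "full jitter" variant, not taken from any particular Redis client; the function name and the `base`, `cap`, and `attempts` parameters are my own assumptions for illustration.

```python
import random


def backoff_delays(base=0.1, cap=10.0, attempts=6, rng=None):
    """Yield one randomized delay (in seconds) per retry attempt.

    Hypothetical helper: the ceiling doubles on every attempt up to `cap`,
    and the actual delay is drawn uniformly from [0, ceiling] so that
    clients which failed at the same moment do not retry in lockstep.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(), start=1):
        print(f"retry {attempt}: wait up to {delay:.3f}s before reconnecting")
```

This only spreads the reconnect attempts out after they have already failed, which is why I think a server-side mechanism is still worth discussing.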

Description of the feature

How can the replication strategy be changed to make this process less prone to issues? Can we shift traffic progressively from the former primary to the new one without introducing additional complications?

One possible approach is to allow the former primary to still accept writes for a period of time while the new primary also accepts writes. The main challenge with this approach is ensuring that writes are not lost, as writes made to the former primary would need to be replicated to the new primary.

The proposal aims to address this scenario by introducing a new replication strategy where the former primary (now demoted) is configured with the following characteristics:

  • It is a replica of the new primary
  • It still accepts writes for a limited period
  • It replicates every write it receives to the new primary

This bidirectional replication means that the same key could be written on both servers during the window. A simple "last-write-wins" strategy could be used to resolve conflicts between the two.
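To make the "last-write-wins" idea concrete, here is a minimal sketch of how a conflict on one key could be resolved. It assumes every write carries a timestamp and the name of the server that accepted it; the `Versioned` type, its fields, and `lww_merge` are hypothetical and not part of Redis today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Versioned:
    """A value plus the metadata assumed to travel with each write."""
    value: str
    timestamp_ms: int  # assumed: wall-clock time of the write
    origin: str        # assumed: server name, used only to break ties


def lww_merge(local: Optional[Versioned], remote: Optional[Versioned]) -> Optional[Versioned]:
    """Return the version that wins under last-write-wins."""
    if local is None:
        return remote
    if remote is None:
        return local
    # Newest timestamp wins; on equal timestamps the origin name decides,
    # so both servers reach the same result regardless of argument order.
    return max(local, remote, key=lambda v: (v.timestamp_ms, v.origin))


if __name__ == "__main__":
    on_former = Versioned("blue", 1_000, "former-primary")
    on_new = Versioned("green", 1_005, "new-primary")
    print(lww_merge(on_former, on_new).value)
```

One caveat with this approach is that it relies on the two servers having reasonably synchronized clocks, since a skewed clock can make an older write win.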

How could this choreography be articulated?

Redis could accept the following parameters when configuring this replica mode:

  • Window time: the period during which the former primary, now a replica, continues to accept writes.
  • Progressive connection closing (optional): the former primary helps move clients by gradually "closing" their connections, starting with 0% of clients reconnected to the new primary and increasing to 100% by the end of the window.
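The progressive closing could work along these lines. This is only a sketch under my own assumptions: a linear ramp over the window, and a stable hash of a client identifier to decide who is closed first. The function names and the idea of hashing a client id are hypothetical, not an existing Redis mechanism.

```python
import zlib


def reconnect_fraction(elapsed_s: float, window_s: float) -> float:
    """Fraction of clients that should have been moved by now (linear ramp)."""
    if window_s <= 0:
        return 1.0
    return min(1.0, max(0.0, elapsed_s / window_s))


def should_close(client_id: str, elapsed_s: float, window_s: float) -> bool:
    """Decide whether the former primary should close this client's connection.

    Each client is mapped to a fixed bucket in [0, 1). A client is closed
    once the ramp passes its bucket, so the decision never flips back.
    """
    bucket = zlib.crc32(client_id.encode("utf-8")) % 10_000 / 10_000
    return bucket < reconnect_fraction(elapsed_s, window_s)


if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(1000)]
    for elapsed in (0, 15, 30, 45, 60):
        closed = sum(should_close(c, elapsed, 60) for c in clients)
        print(f"t={elapsed:>2}s: {closed} of {len(clients)} clients closed")
```

With a stable bucket per client, a client that has been closed stays closed if it tries the former primary again, so traffic only ever moves toward the new primary.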