The problem/use-case that the feature addresses
Currently an RDB version upgrade introduces a point-of-no-return.
Let's look at the upgrade from Redis 6.2 to 7.0, where the RDB version changes from 9 to 10. Once a Redis 7.0 replica is promoted, any existing 6.2 replicas will attempt to sync with it, but this sync will fail due to the RDB version incompatibility.
That means that if there is a serious problem post-upgrade, there is no way to go back. The only option is to revert to an old RDB dump and incur data loss.
This is not just a theoretical problem. We actually encountered a performance regression when upgrading from 5.0 to 6.0: https://github.com/redis/redis/issues/8668.
Had we not had the option to go back to 5.0, we would have been in serious trouble, possibly incurring a prolonged outage.
Description of the feature
One way to implement this would be to support producing RDB dumps for both current version as well as current version - 1. When syncing, the replica can signal which format it wants, and the primary will produce the correct format.
If there are new data types introduced that are not supported by the old format, the server can either refuse to generate the RDB, or it can produce a dump that does not contain those keys, perhaps with a warning being logged.
This way, as long as none of the new features are in use yet, it is always safe to downgrade. It also makes it possible for replicas to run on the previous version, at least as long as the replication stream does not contain incompatible commands.
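To make the proposal concrete, here is a minimal sketch of the version negotiation described above (the function name and the idea of the replica advertising its maximum loadable RDB version are hypothetical; the version numbers 9 and 10 are the 6.2/7.0 values mentioned earlier):

```python
# Hypothetical negotiation sketch: the replica advertises the highest RDB
# version it can load, and the primary picks the dump format accordingly.
CURRENT_RDB_VERSION = 10          # Redis 7.0
MIN_SUPPORTED_RDB_VERSION = 9     # current version - 1, per the proposal

def choose_dump_version(replica_max_version: int) -> int:
    """Pick the RDB version to produce for a sync, or raise if unservable."""
    if replica_max_version >= CURRENT_RDB_VERSION:
        return CURRENT_RDB_VERSION
    if replica_max_version >= MIN_SUPPORTED_RDB_VERSION:
        # Downgraded dump: new data types would have to be refused
        # or skipped with a warning, as described above.
        return replica_max_version
    raise ValueError(
        f"cannot produce an RDB dump as old as version {replica_max_version}"
    )
```

A 6.2 replica syncing from a 7.0 primary would then receive a version-9 dump instead of failing outright.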
Alternatives you've considered
There could be a hook that allows a binary to be invoked on the RDB dump file before it is loaded. That way an RDB downgrader utility could be implemented outside of Redis.
Additional information
I'm not sure how this would interact with diskless replication / diskless sync.
Comment From: yossigo
@igorwwwwwwwwwwwwwwwwwwww I don't think the approach of trying to produce backward compatible data files is correct in the long run. Even if we ignore the cases you mentioned when it's not possible, there's also the challenge of making sure this seldom-used, hard-to-test, fragile mechanism is dependable when you need it. This is probably the reason why it's also not a generally common approach in the DB world.
Did you consider other approaches, like:
* Deploying new versions as canaries that replicate from the old version and handle mirrored or partial (e.g. read-only) workloads, before upgrading masters.
* Creating a rollback mechanism that is based on replaying an RDB file as a sequence of commands that have greater version tolerance.
Comment From: igorwwwwwwwwwwwwwwwwwwww
@yossigo I agree, it's tricky to do well.
Deploying canaries is something we already do, by upgrading a single replica first and setting its replica-priority to 0. This can give a bit of an idea already, but the workload between primaries and replicas is very different. So unfortunately it is still quite risky for us.
I like your suggestion of a protocol-level replay that is less sensitive to the physical representation. This is akin to logical replication in postgres. Would you see this as something Redis could support natively, or would this rather be something custom that connects to a Redis server?
Comment From: gopivalleru
How about using Redis RIOT to do live replication from 7.0 replica to 6.2? This will help if you want to failback.