Describe the bug
We have Redis cluster mode enabled, and the cluster is currently running version 5.0.4.
We decided to upgrade Redis to version 7.2.4. The upgrade plan is:
1. Upgrade the replica node first to the latest version, 7.2.4. Suppose the corresponding master is master_1 and the current replica node is replica_1.
2. Once the master_1 -> replica_1 sync is complete and healthy, execute a failover on replica_1 (see the sketch after this list).
3. The role of master_1 then changes to replica; call it replica_1_1. Upgrade replica_1_1 to 7.2.4.
4. Repeat steps 1-3 until all nodes in the cluster have been upgraded.
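For reference, a minimal sketch of the failover in step 2 (host and port are placeholders; the command is issued on the replica being promoted):

# run on replica_1: CLUSTER FAILOVER asks the replica to perform a manual failover of its master
redis-cli -h <replica_1_ip> -p 6379 CLUSTER FAILOVER
# verify the role swap afterwards
redis-cli -h <replica_1_ip> -p 6379 ROLE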
However, the issue arose at step 2: after we upgraded replica_1 to 7.2.4, the cluster state changed to fail.
18523:S 12 Mar 2024 02:00:08.951 # Cluster state changed: fail
Checking the topology from the replica's view (CLUSTER NODES output):
19a1de79970d002d71ad1c87260e7e3b3d13ab1f 10.189.131.204:6379@16379 slave,fail? 465a063c4d4b9cb9d848256b92a8d7466c1cda96 1710208793863 1710208793830 6334 connected
a6007999b15307949e1a2ff6f6256e73443f200d 10.189.219.13:6379@16379 slave,fail? 347fa1bb663ffbea93e3f8156100b711b41280ad 1710208793863 1710208793830 6335 connected
cb63bef9fc3737a47c5c3fd84f82018b92cf6659 10.189.122.140:6379@16379 master,fail? - 1710208793863 1710208793830 6336 connected 10923-16383
e603c0750438adfda5dd2700a4db3c2b33ea250b 10.189.218.57:6379@16379 myself,slave cb63bef9fc3737a47c5c3fd84f82018b92cf6659 0 1710208793830 6336 connected
347fa1bb663ffbea93e3f8156100b711b41280ad 10.169.14.62:6379@16379 master,fail? - 1710208793863 1710208793830 6335 connected 0-5460
465a063c4d4b9cb9d848256b92a8d7466c1cda96 10.169.155.21:6379@16379 master,fail? - 1710208793863 1710208793830 6334 connected 5461-10922
More detailed service log on replica_1:
18523:S 12 Mar 2024 01:59:53.867 * Trying a partial resynchronization (request aa7b34da5c55a894474305a21f65cbb3cc9ea15f:1).
18523:S 12 Mar 2024 01:59:53.868 * Full resync from master: 0ad286f834af5afcf9993e5ef33d2d598bca71c0:12084183
18523:S 12 Mar 2024 01:59:54.048 * MASTER <-> REPLICA sync: receiving 6596235 bytes from master to disk
18523:S 12 Mar 2024 01:59:54.059 * Discarding previously cached master state.
18523:S 12 Mar 2024 01:59:54.059 * MASTER <-> REPLICA sync: Flushing old data
18523:S 12 Mar 2024 01:59:54.061 * MASTER <-> REPLICA sync: Loading DB in memory
18523:S 12 Mar 2024 01:59:54.079 * Loading RDB produced by version 5.0.4
18523:S 12 Mar 2024 01:59:54.079 * RDB age 1 seconds
18523:S 12 Mar 2024 01:59:54.079 * RDB memory usage when created 1038.81 Mb
18523:S 12 Mar 2024 01:59:54.093 * Done loading RDB, keys loaded: 6340, keys expired: 0.
18523:S 12 Mar 2024 01:59:54.093 * MASTER <-> REPLICA sync: Finished with success
18523:S 12 Mar 2024 01:59:54.094 * Creating AOF incr file temp-appendonly.aof.incr on background rewrite
18523:S 12 Mar 2024 01:59:54.094 * Background append only file rewriting started by pid 18529
18529:C 12 Mar 2024 01:59:54.147 * Successfully created the temporary AOF base file temp-rewriteaof-bg-18529.aof
18529:C 12 Mar 2024 01:59:54.147 * Fork CoW for AOF rewrite: current 1 MB, peak 1 MB, average 1 MB
18523:S 12 Mar 2024 01:59:54.195 * Background AOF rewrite terminated with success
18523:S 12 Mar 2024 01:59:54.195 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-18529.aof into appendonly.aof.2.base.rdb
18523:S 12 Mar 2024 01:59:54.195 * Successfully renamed the temporary AOF incr file temp-appendonly.aof.incr into appendonly.aof.2.incr.aof
18523:S 12 Mar 2024 01:59:54.199 * Removing the history file appendonly.aof.1.incr.aof in the background
18523:S 12 Mar 2024 01:59:54.199 * Removing the history file appendonly.aof in the background
18523:S 12 Mar 2024 01:59:54.202 * Background AOF rewrite finished successfully
18523:S 12 Mar 2024 01:59:54.852 # Missing implement of connection type tls
18523:S 12 Mar 2024 02:00:08.951 # Cluster state changed: fail
Correspondingly, checking the topology from master_1, we see that replica_1 is in fail status:
19a1de79970d002d71ad1c87260e7e3b3d13ab1f 10.189.131.204:6379@16379 slave 465a063c4d4b9cb9d848256b92a8d7466c1cda96 0 1710209273000 6334 connected
465a063c4d4b9cb9d848256b92a8d7466c1cda96 10.169.155.21:6379@16379 master - 0 1710209274453 6334 connected 5461-10922
e603c0750438adfda5dd2700a4db3c2b33ea250b 10.189.218.57:6379@16379 slave,fail cb63bef9fc3737a47c5c3fd84f82018b92cf6659 1710208443202 1710208442903 6336 connected
cb63bef9fc3737a47c5c3fd84f82018b92cf6659 10.189.122.140:6379@16379 myself,master - 0 1710209272000 6336 connected 10923-16383
347fa1bb663ffbea93e3f8156100b711b41280ad 10.169.14.62:6379@16379 master - 0 1710209275456 6335 connected 0-5460
a6007999b15307949e1a2ff6f6256e73443f200d 10.189.219.13:6379@16379 slave 347fa1bb663ffbea93e3f8156100b711b41280ad 0 1710209273453 6335 connected
Server log on master_1:
7374:M 11 Mar 2024 20:23:07.021 * Replica 10.189.218.57:6379 asks for synchronization
7374:M 11 Mar 2024 20:23:07.021 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '80045b7119cf27cbccd862b289454991e510d06a', my replication IDs are '0ad286f834af5afcf9993e5ef33d2d598bca71c0' and '1fff1f7efea80cc7dd510fa65f50059a9be5cc0e')
7374:M 11 Mar 2024 20:23:07.021 * Starting BGSAVE for SYNC with target: disk
7374:M 11 Mar 2024 20:23:07.021 * Background saving started by pid 15617
15617:C 11 Mar 2024 20:23:07.075 * DB saved on disk
15617:C 11 Mar 2024 20:23:07.075 * RDB: 0 MB of memory used by copy-on-write
7374:M 11 Mar 2024 20:23:07.089 * Background saving terminated with success
7374:M 11 Mar 2024 20:23:07.094 * Synchronization with replica 10.189.218.57:6379 succeeded
7374:M 12 Mar 2024 01:54:03.150 # Connection with replica 10.189.218.57:6379 lost.
7374:M 12 Mar 2024 01:54:18.990 * FAIL message received from 347fa1bb663ffbea93e3f8156100b711b41280ad about e603c0750438adfda5dd2700a4db3c2b33ea250b
7374:M 12 Mar 2024 01:59:53.866 * Replica 10.189.218.57:6379 asks for synchronization
7374:M 12 Mar 2024 01:59:53.866 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'aa7b34da5c55a894474305a21f65cbb3cc9ea15f', my replication IDs are '0ad286f834af5afcf9993e5ef33d2d598bca71c0' and '1fff1f7efea80cc7dd510fa65f50059a9be5cc0e')
7374:M 12 Mar 2024 01:59:53.866 * Starting BGSAVE for SYNC with target: disk
7374:M 12 Mar 2024 01:59:53.867 * Background saving started by pid 604
604:C 12 Mar 2024 01:59:53.961 * DB saved on disk
604:C 12 Mar 2024 01:59:53.962 * RDB: 1 MB of memory used by copy-on-write
7374:M 12 Mar 2024 01:59:54.047 * Background saving terminated with success
7374:M 12 Mar 2024 01:59:54.056 * Synchronization with replica 10.189.218.57:6379 succeeded
To reproduce
- Set up a Redis cluster with version 5.0.4.
- Upgrade one replica to version 7.2.4 (see the commands below).
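A minimal way to observe the failure after the replica upgrade (hostnames are placeholders, ports match the logs above):

# on the upgraded replica, check the cluster state and topology
redis-cli -h <replica_1_ip> -p 6379 CLUSTER INFO | grep cluster_state
redis-cli -h <replica_1_ip> -p 6379 CLUSTER NODES
# from the replica's view, the other nodes are marked fail? and cluster_state reports fail, as shown above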
Expected behavior
replica_1, after being upgraded to 7.2.4, should establish a successful connection to its master (still on 5.0.4), and the cluster should not end up in the fail state.
Additional information
This looks like a compatibility issue between the Redis versions involved in the upgrade.