Redis [BUG] "corrupted cluster config file" error on Redis 7.2.3 when running a Redis cluster with mixed 7.0 and 7.2 nodes

Describe the bug

Redis 7.2 nodes fail to start with a "corrupted cluster config file" error when the cluster contains a mix of 7.0 and 7.2 nodes.

To reproduce

Create a 3 node cluster with redis 7.0 (I've used 7.0.14)
Create 3 slave nodes with redis 7.2 (I've used 7.2.3)
Stop any of the 7.2 nodes and try to start it again. It fails with the following error:

70878:M 14 Nov 2023 09:47:27.115 # Unrecoverable error: corrupted cluster config file "c5cb6e214d955fe19cb2eb2d5d3e8a35a165f7d2 127.0.0.1:6381@16381,,tls-port=0,shard-id=f5de5cc8bc87f66210d34f1017149ebd89e03ad7 master - 0 1699951301788 3 connected 10923-16383 ".

Expected behavior

7.2 nodes can be restarted and can rejoin the cluster.

Additional information

I came across this problem when trying to update one of my 7.0 clusters to 7.2. I updated several slaves, as usual. Everything seemed to work fine until I had to restart one of them and it failed. So it seems that a rolling update from 7.0 to 7.2 is currently impossible, but maybe I'm doing something wrong?

Corrupted config file generated by redis 7.2 (6379-6381 nodes are 7.0, 36379-36381 are 7.2):

8f542d6e23d29a9106e4d5db7c028dafd67d4649 127.0.0.1:6380@16380,,tls-port=0,shard-id=502af3db7f3f11593bb9754e9fd4f58e309cd719 master - 0 1699951299000 2 connected 5461-10922
f83290999eab469341f81dea9a783eaaec18b96a 127.0.0.1:6379@16379,,tls-port=0,shard-id=653bfbe1b8b4f698f81e93c173a04fb886641e51 master - 0 1699951300780 1 connected 0-5460
9662b7ddd38a17cd7a294e1c2bac692c5fc1cbcc 127.0.0.1:36381@46381,,tls-port=0,shard-id=a6af4619eec99aabd7c4b9e4ccc5070f43f642b8 slave,fail c5cb6e214d955fe19cb2eb2d5d3e8a35a165f7d2 1699951288673 1699951282616 3 disconnected
c5cb6e214d955fe19cb2eb2d5d3e8a35a165f7d2 127.0.0.1:6381@16381,,tls-port=0,shard-id=f5de5cc8bc87f66210d34f1017149ebd89e03ad7 master - 0 1699951301788 3 connected 10923-16383
0eabd0c4889de11a25ac373c66251e5021d157a0 127.0.0.1:36380@46380,,tls-port=0,shard-id=98461c28bba1a38fdb7100cba6e98eacbb95e97b slave 8f542d6e23d29a9106e4d5db7c028dafd67d4649 0 1699951300000 2 connected
bd1667d21001886891be7def40c298d6873967ce 127.0.0.1:36379@46379,,tls-port=0,shard-id=653bfbe1b8b4f698f81e93c173a04fb886641e51 myself,slave f83290999eab469341f81dea9a783eaaec18b96a 0 1699951301000 1 connected
vars currentEpoch 3 lastVoteEpoch 0
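Reading the dump field by field makes one detail easier to spot: two of the three replicas carry a shard-id that differs from their master's. The sketch below is not Redis's loader. It is a minimal Python reading of the file, assuming the layout documented for CLUSTER NODES (node id, address with comma-separated hostname and auxiliary key=value fields, flags, master id, and so on). The helper names `parse_nodes_conf` and `shard_mismatches` are made up for illustration. Whether this mismatch is what the 7.2 loader actually rejects is an assumption, not something confirmed in this thread.

```python
# The nodes.conf from this report, one node per line.
NODES_CONF = """\
8f542d6e23d29a9106e4d5db7c028dafd67d4649 127.0.0.1:6380@16380,,tls-port=0,shard-id=502af3db7f3f11593bb9754e9fd4f58e309cd719 master - 0 1699951299000 2 connected 5461-10922
f83290999eab469341f81dea9a783eaaec18b96a 127.0.0.1:6379@16379,,tls-port=0,shard-id=653bfbe1b8b4f698f81e93c173a04fb886641e51 master - 0 1699951300780 1 connected 0-5460
9662b7ddd38a17cd7a294e1c2bac692c5fc1cbcc 127.0.0.1:36381@46381,,tls-port=0,shard-id=a6af4619eec99aabd7c4b9e4ccc5070f43f642b8 slave,fail c5cb6e214d955fe19cb2eb2d5d3e8a35a165f7d2 1699951288673 1699951282616 3 disconnected
c5cb6e214d955fe19cb2eb2d5d3e8a35a165f7d2 127.0.0.1:6381@16381,,tls-port=0,shard-id=f5de5cc8bc87f66210d34f1017149ebd89e03ad7 master - 0 1699951301788 3 connected 10923-16383
0eabd0c4889de11a25ac373c66251e5021d157a0 127.0.0.1:36380@46380,,tls-port=0,shard-id=98461c28bba1a38fdb7100cba6e98eacbb95e97b slave 8f542d6e23d29a9106e4d5db7c028dafd67d4649 0 1699951300000 2 connected
bd1667d21001886891be7def40c298d6873967ce 127.0.0.1:36379@46379,,tls-port=0,shard-id=653bfbe1b8b4f698f81e93c173a04fb886641e51 myself,slave f83290999eab469341f81dea9a783eaaec18b96a 0 1699951301000 1 connected
vars currentEpoch 3 lastVoteEpoch 0
"""


def parse_nodes_conf(text):
    """Return {node_id: info} for every node line, skipping the 'vars' line."""
    nodes = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields or fields[0] == "vars":
            continue
        node_id, address, flags, master_id = fields[:4]
        # address looks like ip:port@cport,hostname,key=value,key=value
        parts = address.split(",")
        aux = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
        nodes[node_id] = {
            "line": line_no,
            "endpoint": parts[0],
            "flags": flags.split(","),
            "master": None if master_id == "-" else master_id,
            "shard_id": aux.get("shard-id"),
        }
    return nodes


def shard_mismatches(nodes):
    """List (replica, master, replica_line, master_line) where shard-ids differ."""
    found = []
    for node in nodes.values():
        master = nodes.get(node["master"])
        if master is not None and master["shard_id"] != node["shard_id"]:
            found.append((node["endpoint"], master["endpoint"],
                          node["line"], master["line"]))
    return found


if __name__ == "__main__":
    for replica, master, r_line, m_line in shard_mismatches(parse_nodes_conf(NODES_CONF)):
        print(f"{replica} (line {r_line}) and its master {master} (line {m_line}) disagree on shard-id")
```

For this file the check reports the 36381/6381 and 36380/6380 pairs. The line rejected in the error message, the 6381 master, is one of those and appears directly after its replica in the file.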

I've also used gdb to check why it is failing, and it seems to fail at this point when trying to parse the shard-id field: https://github.com/redis/redis/blob/7f4bae817614988c43c3024402d16edcbf3b3277/src/cluster.c#L502

Comment From: enjoy-binbin

@madolson this is an issue I mentioned in https://github.com/redis/redis/pull/12604#issuecomment-1759100649. Do you have any ideas on how we should fix it?

Comment From: rraptorr

Sorry, I didn't have time to test this previously, but it seems that this bug is still unsolved. The same crash, for 7.2 nodes, still occurs on mixed 7.2 and 7.0 clusters. I've just verified it using 7.0.15 and 7.2.4/7.2.5. Both 7.2.4 and 7.2.5 still crash.

Comment From: sundb

@rraptorr Do you mean you got the same error, "Unrecoverable error: corrupted cluster config file"? Did you delete the old nodes*.conf files?

Comment From: rraptorr

> @rraptorr Do you mean you got the same error, "Unrecoverable error: corrupted cluster config file"? Did you delete the old nodes*.conf files?

Yes, I get the same error. I tested this by creating a fresh setup with 7.0.15 masters and 7.2.5 slaves, as described in the original bug report.

Comment From: sundb

@rraptorr Are those the same steps? I can't reproduce it locally; I'm not sure if I'm missing something.

Comment From: rraptorr

> @rraptorr Are those the same steps? I can't reproduce it locally; I'm not sure if I'm missing something.

Same steps. OK, let me give the exact config files and commands I used.

Redis 7.0.15, built from source, 3 masters.

n1-master.conf:

port 7001
cluster-enabled yes
cluster-config-file n1-master.node.conf
dbfilename n1-master.rdb

n2-master.conf:

port 7002
cluster-enabled yes
cluster-config-file n2-master.node.conf
dbfilename n2-master.rdb

n3-master.conf:

port 7003
cluster-enabled yes
cluster-config-file n3-master.node.conf
dbfilename n3-master.rdb

Redis 7.2.5, built from source, 3 slaves.

n1-slave.conf:

port 8001
cluster-enabled yes
cluster-config-file n1-slave.node.conf
dbfilename n1-slave.rdb

n2-slave.conf:

port 8002
cluster-enabled yes
cluster-config-file n2-slave.node.conf
dbfilename n2-slave.rdb

n3-slave.conf:

port 8003
cluster-enabled yes
cluster-config-file n3-slave.node.conf
dbfilename n3-slave.rdb
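The six config files above differ only in the port and the name prefix, so they can be generated rather than typed by hand when repeating the reproduction. A minimal sketch, assuming the same naming scheme and ports as listed; the helper names `render_conf`, `all_confs`, and `write_confs` are made up for illustration:

```python
from pathlib import Path


def render_conf(name: str, port: int) -> str:
    """Render one of the four-line configs shown above."""
    return (
        f"port {port}\n"
        "cluster-enabled yes\n"
        f"cluster-config-file {name}.node.conf\n"
        f"dbfilename {name}.rdb\n"
    )


def all_confs() -> dict:
    """Map file name to contents: masters on 7001-7003, slaves on 8001-8003."""
    confs = {}
    for i in (1, 2, 3):
        confs[f"n{i}-master.conf"] = render_conf(f"n{i}-master", 7000 + i)
        confs[f"n{i}-slave.conf"] = render_conf(f"n{i}-slave", 8000 + i)
    return confs


def write_confs(directory: str) -> None:
    """Write all six files into an existing directory."""
    for filename, body in all_confs().items():
        Path(directory, filename).write_text(body)
```

Calling `write_confs(".")` from each build directory would recreate the files; the masters' files belong next to the 7.0.15 binary and the slaves' next to the 7.2.5 binary.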

Start all nodes (masters use 7.0.15 binary, slaves 7.2.5 binary):

./src/redis-server n1-master.conf
./src/redis-server n2-master.conf
./src/redis-server n3-master.conf
./src/redis-server n1-slave.conf
./src/redis-server n2-slave.conf
./src/redis-server n3-slave.conf

Set up the cluster and add slaves:

./src/redis-cli --cluster create localhost:7001 localhost:7002 localhost:7003
./src/redis-cli --cluster add-node localhost:8001 localhost:7001 --cluster-slave
./src/redis-cli --cluster add-node localhost:8002 localhost:7001 --cluster-slave
./src/redis-cli --cluster add-node localhost:8003 localhost:7001 --cluster-slave

Restart any 7.2.5 node by pressing Ctrl-C in its terminal and starting it again.

$ ./src/redis-server n3-slave.conf
15101:C 31 May 2024 11:28:05.316 # WARNING: Changing databases number from 16 to 1 since we are in cluster mode
15101:C 31 May 2024 11:28:05.316 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
15101:C 31 May 2024 11:28:05.316 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
15101:C 31 May 2024 11:28:05.316 * Redis version=7.2.5, bits=64, commit=00000000, modified=0, pid=15101, just started
15101:C 31 May 2024 11:28:05.316 * Configuration loaded
15101:M 31 May 2024 11:28:05.316 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
15101:M 31 May 2024 11:28:05.316 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
15101:M 31 May 2024 11:28:05.316 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
15101:M 31 May 2024 11:28:05.316 * monotonic clock: POSIX clock_gettime
[Redis ASCII-art startup banner: Redis 7.2.5 (00000000/0) 64 bit, Running in cluster mode, Port: 8003, PID: 15101, https://redis.io]

15101:M 31 May 2024 11:28:05.317 # Unrecoverable error: corrupted cluster config file "75f3feb14d7cb10148cb8ae02b1fd4fc2da6da23 ::1:7003@17003,,tls-port=0,shard-id=0faf9642000f848dfe3d1216a682b4ab5332917b master - 0 1717147657974 3 connected 10923-16383 ".
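When comparing runs, it can help to pull the rejected entry out of the error message and look it up in the node's cluster config file. A small sketch, assuming the message always quotes the full config line as it does in the two reports here; `rejected_entry` is a made-up helper name, not part of Redis:

```python
import re

# The final log line from the run above.
LOG_LINE = (
    '15101:M 31 May 2024 11:28:05.317 # Unrecoverable error: corrupted cluster config file '
    '"75f3feb14d7cb10148cb8ae02b1fd4fc2da6da23 ::1:7003@17003,,tls-port=0,'
    'shard-id=0faf9642000f848dfe3d1216a682b4ab5332917b master - 0 1717147657974 3 '
    'connected 10923-16383 ".'
)


def rejected_entry(log_line):
    """Extract node id, endpoint, flags and shard-id from the quoted config line."""
    match = re.search(r'corrupted cluster config file "(.*)"', log_line)
    if match is None:
        return None
    fields = match.group(1).split()
    # The address field is ip:port@cport,hostname,key=value,...
    parts = fields[1].split(",")
    aux = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
    return {
        "node_id": fields[0],
        "endpoint": parts[0],
        "flags": fields[2].split(","),
        "shard_id": aux.get("shard-id"),
    }


if __name__ == "__main__":
    print(rejected_entry(LOG_LINE))
```

Here the rejected entry is the 7.0.15 master on port 7003, the same role (a master line) as the entry rejected in the original report.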

Comment From: sundb

@rraptorr thanks a lot, let me give it a try.

Comment From: stevelipinski

We have encountered this same issue, and there are some additional contributing factors: specifically, whether the primary or the replica is listed first in nodes.conf, and whether the other nodes are reachable (for gossip). #13428 does not fix these cases.