Describe the bug

Redis instance randomly hangs and becomes unresponsive after the upgrade from v7.0.10 to v7.2.4.

To reproduce

The following upgrade process was used when we noticed the issue:

1. Set up a Redis cluster running v7.0.10 (in our case, we provisioned a cluster with 9 masters and 2 replicas per master).
2. Install the Redis v7.2.4 RPM package.
3. Restart all the replicas in a rolling fashion. All replicas are now running v7.2.4, while all the masters are still on v7.0.10.
4. Fail over the masters and restart the instances in a rolling fashion. At this point the masters transition to the instances running v7.2.4, and some of these new masters become unresponsive (a rough sketch of the per-node commands is shown after this list).
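
For illustration, the per-node steps roughly correspond to commands like the following. The service name, host, and port are placeholders rather than our exact tooling, so treat this as a sketch of the procedure, not the definitive one:

# on each replica, after installing the v7.2.4 RPM (service name is a placeholder)
sudo systemctl restart redis-<cluster_name>

# promote a replica before restarting its old master (run against the replica)
redis-cli -h <replica_ip> -p <port> CLUSTER FAILOVER

# verify the new roles before moving on to the next shard
redis-cli -h <replica_ip> -p <port> CLUSTER NODES
redis-cli -h <replica_ip> -p <port> INFO replication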

This does not happen every time, and not all Redis processes hang after the failover; we haven't established any pattern. When a process hangs, it does not respond to any redis-cli commands (PING, INFO, etc.), and no keys can be set or read. Running strace against the hung instance produces no output, so the process seems to be completely stuck, and it has to be killed with SIGKILL. The logs of a Redis process that hangs:

--> Failover happens
2258807:S 30 May 2024 11:00:15.191 * Manual failover user request accepted.
2258807:S 30 May 2024 11:00:15.192 * Received replication offset for paused master manual failover: 9037
2258807:S 30 May 2024 11:00:15.192 * All master replication stream processed, manual failover can start.
2258807:S 30 May 2024 11:00:15.192 * Start of election delayed for 0 milliseconds (rank #0, offset 9037).
2258807:S 30 May 2024 11:00:15.192 * Starting a failover election for epoch 12.
2258807:S 30 May 2024 11:00:15.193 * Failover election won: I'm the new master.
2258807:S 30 May 2024 11:00:15.193 * configEpoch set to 12 after successful failover
2258807:M 30 May 2024 11:00:15.193 * Connection with master lost.
2258807:M 30 May 2024 11:00:15.193 * Caching the disconnected master state.
2258807:M 30 May 2024 11:00:15.193 * Discarding previously cached master state.
2258807:M 30 May 2024 11:00:15.193 * Setting secondary replication ID to c35c915f9851c84d47057806cfba70f81ee97138, valid up to offset: 9038. New replication ID is b7c7225ef698b5f8da7d8960fde1025435dc3d7c
2258807:M 30 May 2024 11:00:15.196 * Replica 10.37.28.117:6552 asks for synchronization
2258807:M 30 May 2024 11:00:15.196 * Partial resynchronization request from 10.37.28.117:6552 accepted. Sending 0 bytes of backlog starting from offset 9038.
2258807:M 30 May 2024 11:00:15.199 * Replica 10.37.19.113:6552 asks for synchronization
2258807:M 30 May 2024 11:00:15.199 * Partial resynchronization request from 10.37.19.113:6552 accepted. Sending 0 bytes of backlog starting from offset 9038.
2258807:M 30 May 2024 11:00:15.844 * Connection with replica 10.37.28.117:6552 lost.
2258807:M 30 May 2024 11:00:16.447 * Replica 10.37.28.117:6552 asks for synchronization
2258807:M 30 May 2024 11:00:16.447 * Partial resynchronization request from 10.37.28.117:6552 accepted. Sending 0 bytes of backlog starting from offset 9038.
2258807:M 30 May 2024 11:00:22.548 * Failover auth granted to bd552e6f185d8040f02e739286b83d9f39cb795b () for epoch 13
2258807:M 30 May 2024 11:00:23.736 * Manual failover requested by replica c49eed33ca2420fa1634f701df494872d9d278fe ().
2258807:M 30 May 2024 11:00:23.737 * Failover auth granted to c49eed33ca2420fa1634f701df494872d9d278fe () for epoch 14
2258807:M 30 May 2024 11:00:23.738 * Connection with replica 10.37.28.117:6552 lost.
2258807:M 30 May 2024 11:00:23.739 * Configuration change detected. Reconfiguring myself as a replica of c49eed33ca2420fa1634f701df494872d9d278fe ()
2258807:S 30 May 2024 11:00:23.739 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
2258807:S 30 May 2024 11:00:23.739 * Connecting to MASTER 10.37.28.117:6552
2258807:S 30 May 2024 11:00:23.739 * MASTER <-> REPLICA sync started
2258807:S 30 May 2024 11:00:23.741 * Connection with replica 10.37.19.113:6552 lost.
2258807:S 30 May 2024 11:00:23.741 * Non blocking connect for SYNC fired the event.
2258807:S 30 May 2024 11:00:23.741 * Master replied to PING, replication can continue...
2258807:S 30 May 2024 11:00:23.741 * Trying a partial resynchronization (request b7c7225ef698b5f8da7d8960fde1025435dc3d7c:10221).
2258807:S 30 May 2024 11:00:23.741 * Successful partial resynchronization with master.
2258807:S 30 May 2024 11:00:23.741 * Master replication ID changed to 4f857c5e8a419baa6d70bfcd5c571c3c29113857
2258807:S 30 May 2024 11:00:23.741 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
--> Redis process hangs
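
Since strace shows nothing once the process is in this state, a stack dump of the stuck process is probably the most useful thing to capture before killing it. A minimal sketch, assuming gdb is installed on the host (the pid is a placeholder):

# dump backtraces of every thread in the hung Redis process
gdb -p <redis_pid> -batch -ex 'thread apply all bt'

# without gdb: check what the main thread is blocked on
cat /proc/<redis_pid>/status     # the State: line shows running/sleeping/stopped
sudo cat /proc/<redis_pid>/stack # kernel-side stack of the main thread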

Expected behavior

Upgrade succeeds without any issues.

Comment From: sundb

@zygisa do you still have the same problem? Are there reproducible steps? I tried several times locally but couldn't reproduce it.

Comment From: zygisa

Hi @sundb, sorry, I don't have any additional details; the steps are listed in the description. Here's the config we're using, in case it helps:

pidfile /var/run/redis/<cluster_name>/redis_<cluster_name>.pid
port 6749
tcp-backlog 4096
bind <IP>
protected-mode no
timeout 0
tcp-keepalive 300
loglevel notice
syslog-enabled yes
syslog-ident redis-<cluster_name>
syslog-facility local0
databases 16
save 3600 1
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump-<cluster_name>.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync yes
repl-diskless-sync-delay 5
slave-read-only yes
repl-ping-slave-period 10
repl-timeout 60
repl-disable-tcp-nodelay no
repl-backlog-size 1mb
repl-backlog-ttl 3600
slave-priority 100
maxclients 131072
appendonly no
appendfilename appendonly-<cluster_name>.aof
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 2500
slowlog-max-len 10000
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 1024mb 1024mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
cluster-enabled yes
cluster-config-file nodes-<cluster_name>.conf
cluster-node-timeout 5000
aclfile /etc/redis/cluster_<cluster_name>.acl
lazyfree-lazy-user-flush yes
lazyfree-lazy-user-del yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
replica-lazy-flush yes
active-defrag-cycle-max 5
activedefrag yes
oom-score-adj yes
enable-debug-command yes
io-threads 4
io-threads-do-reads yes
cluster-allow-replica-migration no
cluster-migration-barrier 99
repl-backlog-size 10mb
replica-priority 10
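
Side note on the config above: the slave-* directives are legacy aliases for the replica-* names and are still accepted by 7.2.4, and where a setting appears twice (slave-read-only, repl-backlog-size, slave-priority/replica-priority) the later line should win. If it helps, the effective values on an upgraded node can be checked with something like this (host is a placeholder, port taken from this config):

redis-cli -h <IP> -p 6749 CONFIG GET replica-read-only
redis-cli -h <IP> -p 6749 CONFIG GET repl-backlog-size
redis-cli -h <IP> -p 6749 CONFIG GET replica-priority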

We also ran into some issues in v7.2.4 (specifically https://github.com/redis/redis/issues/13205) because activedefrag is enabled in our setup, but we're not sure whether this is related in any way.
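
If the defrag interaction turns out to be a suspect, active defrag can be toggled at runtime without a restart, so one cheap experiment would be to disable it on the v7.2.4 nodes before the failover step and see whether the hangs stop. A sketch (host is a placeholder, port from the config above):

# disable active defragmentation on a running node; takes effect immediately
redis-cli -h <IP> -p 6749 CONFIG SET activedefrag no

# confirm the new value
redis-cli -h <IP> -p 6749 CONFIG GET activedefrag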