We run a Redis cluster with 10 nodes (5 masters and 5 slaves) deployed across 5 servers, with each server hosting two nodes. Due to improper configuration during deployment, two nodes on one server share the same RDB directory and file name. This has not affected normal operation so far, but we are aware of the risk this misconfiguration carries and plan to fix it. Our approach is to manually run the failover command on the slave nodes to swap the master and slave roles, which lets us modify the configuration files of the former master nodes and restart them so the changes take effect. Unfortunately, an unexpected event occurred during the switch.
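For reference, the role swap we attempted looks roughly like this (a sketch against a live cluster; the host/port are taken from the logs below, and the `dir`/`dbfilename` values shown are placeholders for whatever unique path each node should get):

```shell
# On the slave that should take over (102.105.10.62:6479 in the logs below).
# CLUSTER FAILOVER performs a manual, coordinated role swap with its master.
redis-cli -h 102.105.10.62 -p 6479 CLUSTER FAILOVER

# Once the old master has been demoted to a slave, give it a unique RDB
# location in its redis.conf, then restart it so the change takes effect:
#   dir /data/redis/6379          # placeholder path
#   dbfilename dump-6379.rdb      # placeholder file name
```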
This is the log from the slave node:
```
16203:S 26 Jan 04:09:37.185 # Manual failover user request accepted.
16203:S 26 Jan 04:09:37.218 # Received replication offset for paused master manual failover: 2397822240146
16203:S 26 Jan 04:09:37.268 # All master replication stream processed, manual failover can start.
16203:S 26 Jan 04:09:37.268 # Start of election delayed for 0 milliseconds (rank #0, offset 2397822240146).
16203:S 26 Jan 04:09:37.369 # Starting a failover election for epoch 46.
16203:S 26 Jan 04:09:37.371 # Currently unable to failover: Waiting for votes, but majority still not reached.
16203:S 26 Jan 04:09:37.371 # Failover election won: I'm the new master.
16203:S 26 Jan 04:09:37.371 # configEpoch set to 46 after successful failover
16203:M 26 Jan 04:09:37.371 # Connection with master lost.
16203:M 26 Jan 04:09:37.372 * Caching the disconnected master state.
16203:M 26 Jan 04:09:37.372 * Discarding previously cached master state.
16203:M 26 Jan 04:09:38.120 * Slave 102.105.10.61:6379 asks for synchronization
16203:M 26 Jan 04:09:38.120 * Full resync requested by slave 102.105.10.61:6379
16203:M 26 Jan 04:09:38.120 * Starting BGSAVE for SYNC with target: disk
16203:M 26 Jan 04:09:38.142 * Background saving started by pid 22940
22940:C 26 Jan 04:09:43.250 * DB saved on disk
22940:C 26 Jan 04:09:43.268 * RDB: 19 MB of memory used by copy-on-write
16203:M 26 Jan 04:09:43.313 * Background saving terminated with success
16203:M 26 Jan 04:09:43.668 * Synchronization with slave 102.105.10.61:6379 succeeded
```
This is the log from the master node:
```
20378:M 26 Jan 04:09:37.185 # Manual failover requested by slave 6ae270711d1fe96855bfbcdefdb8ad6d500b2c1b.
20378:M 26 Jan 04:09:37.370 # Failover auth granted to 6ae270711d1fe96855bfbcdefdb8ad6d500b2c1b for epoch 46
20378:M 26 Jan 04:09:37.372 # Connection with slave 102.105.10.62:6479 lost.
20378:M 26 Jan 04:09:37.404 # Configuration change detected. Reconfiguring myself as a replica of 6ae270711d1fe96855bfbcdefdb8ad6d500b2c1b
20378:M 26 Jan 04:09:38.119 * Connecting to MASTER 102.105.10.62:6479
20378:M 26 Jan 04:09:38.119 * MASTER <-> SLAVE sync started
20378:M 26 Jan 04:09:38.119 * Non blocking connect for SYNC fired the event.
20378:M 26 Jan 04:09:38.119 * Master replied to PING, replication can continue...
20378:M 26 Jan 04:09:38.120 * Partial resynchronization not possible (no cached master)
20378:M 26 Jan 04:09:38.143 * Full resync from master: 99235009cfe9c8202fc4e0af95170b602441f1b5:2397822240147
20378:M 26 Jan 04:09:43.314 * MASTER <-> SLAVE sync: receiving 129611218 bytes from master
20378:M 26 Jan 04:09:43.702 * MASTER <-> SLAVE sync: Flushing old data
20378:M 26 Jan 04:09:47.636 * MASTER <-> SLAVE sync: Loading DB in memory
20378:M 26 Jan 04:10:03.170 * MASTER <-> SLAVE sync: finished with success
```
This is the log from another master node:
```
20381:M 26 Jan 04:09:37.370 # Failover auth granted to 6ae270711d1fe96855bfbcdefdb8ad6d500b2c1b for epoch 46
20381:M 26 Jan 04:09:40.420 # Cluster state changed: fail
20381:M 26 Jan 04:11:05.053 * 10 changes in 300 seconds. Saving...
20381:M 26 Jan 04:11:05.068 * Background saving started by pid 5402
5402:C 26 Jan 04:11:08.271 * DB saved on disk
5402:C 26 Jan 04:11:08.283 * RDB: 63 MB of memory used by copy-on-write
20381:M 26 Jan 04:11:08.400 * Background saving terminated with success
```
We can see that roughly 3 seconds after the vote completed, this master node changed the cluster state to "fail", and it did not recover for a long time, even though the other three master nodes were functioning normally. I would like to know why this anomaly occurred. Our Redis version is 3.2.12.
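The "approximately 3 seconds" above comes straight from the timestamps in the third log; a quick sketch of that arithmetic (timestamps copied verbatim, the year is arbitrary since Redis 3.x log lines do not include one):

```python
from datetime import datetime

# Timestamps copied from the third log above.
fmt = "%d %b %H:%M:%S.%f"
auth_granted = datetime.strptime("26 Jan 04:09:37.370", fmt)
state_fail = datetime.strptime("26 Jan 04:09:40.420", fmt)

# Gap between granting the failover vote and marking the cluster "fail".
gap = (state_fail - auth_granted).total_seconds()
print(gap)  # 3.05
```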
Comment From: madolson
We unfortunately don't support 3.2 anymore, and haven't for quite a long time. The latest version is 7.2, so I would recommend upgrading to that. There are a lot of bug fixes between then and now, including some for edge conditions related to sticky failure scenarios.