Describe the bug
In cluster mode, when a slave node of a shard is pinged by its master node and runs nodeUpdateAddressIfNeeded, the getpeername system call may fail, causing server.masterhost to be incorrectly set to "?". The slave node then logs an error every second: "Connecting to MASTER ?:6379". If a network partition between the master node and the slave node occurs at that moment, the shard's master is marked as PFAIL. Other nodes then send gossip messages to the slave node that correct the IP address of the shard's master, but server.masterhost is not updated. As a result, the synchronization relationship between the master and slave nodes does not recover even after the getpeername system call works again, and the slave node logs a Redis error.
To reproduce
Remarks:
- Redis version: 6.2.14
- The Redis parameter 'cluster-announce-ip' is not configured
- Create a Redis cluster with 3 primaries and 3 replicas
- Simulate a 'getpeername' system call error on the 'slave0' node. To trigger the error quickly, the source code is modified directly (controlled through a parameter that can be changed at runtime) so that the IP address is obtained as '?'
- Simulate a network failure between the 'master0' and 'slave0' nodes
# Add iptables rules on the slave0 node
iptables -A INPUT -s {master0-ip} -j DROP
iptables -A OUTPUT -d {master0-ip} -j DROP
- Wait for the 'server.cluster-node-timeout' period to elapse, then restore the 'getpeername' system call on the 'slave0' node
- Restore the network between the 'master0' and 'slave0' nodes
iptables -D INPUT -s {master0-ip} -j DROP
iptables -D OUTPUT -d {master0-ip} -j DROP
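
The 'getpeername' failure simulated in the steps above can be pictured with a small C sketch. It assumes, as the log line "Connecting to MASTER ?:6379" suggests, that a failed peer lookup leaves the placeholder "?" in the address buffer. The helper name peer_ip_or_placeholder is hypothetical and is not taken from the Redis source.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not Redis source): write the peer IP of `fd` into
 * `buf`, or the placeholder "?" when the lookup fails.
 * Returns 0 on success and -1 on failure. */
int peer_ip_or_placeholder(int fd, char *buf, size_t buf_len) {
    struct sockaddr_storage sa;
    socklen_t sa_len = sizeof(sa);

    /* The placeholder needs room for '?' plus the terminating NUL. */
    assert(buf != NULL && buf_len >= 2);

    if (getpeername(fd, (struct sockaddr *)&sa, &sa_len) == -1) goto error;

    if (sa.ss_family == AF_INET) {
        struct sockaddr_in *s4 = (struct sockaddr_in *)&sa;
        if (inet_ntop(AF_INET, &s4->sin_addr, buf, (socklen_t)buf_len) == NULL)
            goto error;
    } else if (sa.ss_family == AF_INET6) {
        struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)&sa;
        if (inet_ntop(AF_INET6, &s6->sin6_addr, buf, (socklen_t)buf_len) == NULL)
            goto error;
    } else {
        goto error;
    }
    return 0;

error:
    /* A caller that ignores the return value ends up treating "?" as an IP. */
    buf[0] = '?';
    buf[1] = '\0';
    return -1;
}
```

Calling the helper with an invalid descriptor such as -1 makes getpeername fail, so the buffer holds "?". A caller that copies the buffer without checking the return value would store "?" as the master host, which matches the symptom in this report.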
Expected behavior
The 'master' and 'slave' synchronization relationship should be restored once the 'getpeername' system call and the network have recovered
Additional information
- A 'getpeername' system call failure cannot be triggered on demand, so it was simulated by modifying the source code.
- This problem has occurred in our production environment.
- Adding the code in the red box below should fix the problem
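
As a rough illustration of the kind of change described in this report, here is a self-contained model. It is not the actual patch or Redis source: every type and function name below is hypothetical. The idea it models is that when a gossip message corrects the address of the node that is this replica's master, the replication target (the counterpart of server.masterhost) is refreshed as well.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_IP_STR_LEN 46

/* Simplified stand-ins for cluster state; all names are hypothetical. */
typedef struct {
    char ip[MODEL_IP_STR_LEN];
    int port;
} ModelNode;

typedef struct {
    ModelNode *my_master;                /* node this replica replicates from */
    char master_host[MODEL_IP_STR_LEN];  /* models server.masterhost */
    int master_port;
} ModelReplica;

/* Point replication at host:port. */
void model_set_replication_target(ModelReplica *r, const char *host, int port) {
    snprintf(r->master_host, sizeof(r->master_host), "%s", host);
    r->master_port = port;
}

/* Apply an address learned from gossip to `node`.
 * Returns 1 if the node's stored address changed, 0 otherwise. */
int model_apply_gossip_address(ModelReplica *r, ModelNode *node,
                               const char *ip, int port) {
    assert(r != NULL && node != NULL && ip != NULL);

    int changed = strcmp(node->ip, ip) != 0 || node->port != port;
    if (changed) {
        snprintf(node->ip, sizeof(node->ip), "%s", ip);
        node->port = port;
    }

    /* The step this report says is missing: if the corrected node is our
     * master and the replication target is stale, refresh the target too. */
    if (r->my_master == node &&
        (strcmp(r->master_host, node->ip) != 0 ||
         r->master_port != node->port)) {
        model_set_replication_target(r, node->ip, node->port);
    }
    return changed;
}
```

In this model, a replica whose master_host is "?" picks up the corrected address as soon as gossip fixes its master's entry, instead of retrying "?" indefinitely.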
Comment From: sundb
@wstar05 thanks, can you make a PR to fix it?
Comment From: wstar05
> @wstar05 thanks, can you make a PR to fix it?
PR: https://github.com/redis/redis/pull/13514