Crash report
Paste the complete crash log between the quotes below. Please include a few lines from the log preceding the crash report to provide some context.
The Redis cluster contains 5 masters and 5 slaves. I have found that, in one of the pairs, data could not be replicated from master to slave and the slave was promoted to master. Please see the log from the master node:
-------------------------------------------------------------------------------------------------
34111:M 07 Mar 2021 18:50:00.166 * Background saving terminated with success
34111:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
34111:M 07 Mar 2021 18:50:44.889 # Cluster state changed: fail
34111:M 07 Mar 2021 18:50:45.968 # Failover auth granted to ef52c82a131c1ab206379d18d6337adc597ad482 for epoch 11
34111:M 07 Mar 2021 18:50:45.986 # Cluster state changed: ok
34111:M 07 Mar 2021 18:51:01.010 * 10000 changes in 60 seconds. Saving...
34111:M 07 Mar 2021 18:51:01.444 * Background saving started by pid 26351
34111:M 07 Mar 2021 18:55:13.811 * Clear FAIL state for node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179: master without slots is reachable again.
34111:M 07 Mar 2021 18:55:13.818 # Configuration change detected. Reconfiguring myself as a replica of 122ceb58645dd2b0952abed790122c818ff30f75
34111:S 07 Mar 2021 18:55:13.820 # Connection with replica 10.1.xxx.xxx:8000 lost.
34111:S 07 Mar 2021 18:55:13.820 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
34111:S 07 Mar 2021 18:55:14.475 * Connecting to MASTER 10.1.xxx.xxx:8000
34111:S 07 Mar 2021 18:55:14.475 * MASTER <-> REPLICA sync started
34111:S 07 Mar 2021 18:55:14.475 * Non blocking connect for SYNC fired the event.
34111:S 07 Mar 2021 18:55:14.475 * Master replied to PING, replication can continue...
34111:S 07 Mar 2021 18:55:14.476 * Trying a partial resynchronization (request 2ee7f1c1389150dc756f14ca1e3548b3d17d2427:138317955027).
34111:S 07 Mar 2021 18:55:14.876 * Full resync from master: 6759d702213c0d5ef1bf1733b7b1e21d7ff8d7fe:138319544125
34111:S 07 Mar 2021 18:55:14.876 * Discarding previously cached master state.
26351:C 07 Mar 2021 18:57:49.593 * DB saved on disk
26351:C 07 Mar 2021 18:57:50.025 * RDB: 202 MB of memory used by copy-on-write
34111:S 07 Mar 2021 18:57:50.655 * Background saving terminated with success
34111:S 07 Mar 2021 18:58:47.025 * MASTER <-> REPLICA sync: receiving 20036023309 bytes from master
34111:S 07 Mar 2021 19:01:06.577 * FAIL message received from 122ceb58645dd2b0952abed790122c818ff30f75 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
34111:S 07 Mar 2021 19:02:51.054 * 10 changes in 300 seconds. Saving...
34111:S 07 Mar 2021 19:02:51.455 * Background saving started by pid 26656
34111:S 07 Mar 2021 19:02:53.322 * Clear FAIL state for node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179: replica is reachable again.
34111:S 07 Mar 2021 19:05:04.236 * Replica is about to load the RDB file received from the master, but there is a pending RDB child running. Killing process 26656 and removing its temp file to avoid any race
34111:S 07 Mar 2021 19:05:05.572 * MASTER <-> REPLICA sync: Flushing old data
34111:S 07 Mar 2021 19:07:09.554 * MASTER <-> REPLICA sync: Loading DB in memory
34111:S 07 Mar 2021 19:09:43.051 * MASTER <-> REPLICA sync: Finished with success
34111:S 07 Mar 2021 19:09:43.065 # Background saving terminated by signal 10
Additional information
- OS distribution and version: Red Hat v8.3 on a VM, Redis v5.0.5
- Steps to reproduce (if any): Can't reproduce.
Comment From: yossigo
@guguri12 I don't see a crash in the logs, just a master node being demoted to replica and performing a full sync from its new master. Please provide more information if you believe there's an issue here.
Comment From: guguri12
@yossigo Please see the lines below. Why did the master node fail over to the replica node? And why can't the master node connect to the replica node? I can't find the root cause.
34111:M 07 Mar 2021 18:50:00.166 * Background saving terminated with success
34111:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
34111:M 07 Mar 2021 18:50:44.889 # Cluster state changed: fail
34111:M 07 Mar 2021 18:50:45.968 # Failover auth granted to ef52c82a131c1ab206379d18d6337adc597ad482 for epoch 11
34111:M 07 Mar 2021 18:50:45.986 # Cluster state changed: ok
34111:M 07 Mar 2021 18:51:01.010 * 10000 changes in 60 seconds. Saving...
34111:M 07 Mar 2021 18:51:01.444 * Background saving started by pid 26351
34111:M 07 Mar 2021 18:55:13.811 * Clear FAIL state for node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179: master without slots is reachable again.
34111:M 07 Mar 2021 18:55:13.818 # Configuration change detected. Reconfiguring myself as a replica of 122ceb58645dd2b0952abed790122c818ff30f75
34111:S 07 Mar 2021 18:55:13.820 # Connection with replica 10.1.xxx.xxx:8000 lost.
34111:S 07 Mar 2021 18:55:13.820 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
34111:S 07 Mar 2021 18:55:14.475 * Connecting to MASTER 10.1.xxx.xxx:8000
Comment From: yossigo
@guguri12 It was instructed by another node to fail over, after a majority of the nodes had determined that the node was indeed failing. To see the full picture, you'll need to look at the logs of the other nodes as well.
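For reference (a hedged pointer, not part of the original thread): the node IDs in those FAIL messages can be mapped to an address and role by asking any cluster member for its view of the topology, which tells you which masters' logs to collect. Host and port below are placeholders:

# Show every node's ID, address, and flags (master/slave) as seen by this node
redis-cli -h <any-node-ip> -p <any-node-port> CLUSTER NODES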
Comment From: guguri12
@yossigo Do you mean to look at the log file from the replica node? On the replica node, the messages are as shown below:
8674:S 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
8674:S 07 Mar 2021 18:50:44.889 # Cluster state changed: fail
8674:S 07 Mar 2021 18:50:45.987 # Cluster state changed: ok
30728:C 07 Mar 2021 18:51:26.740 * DB saved on disk
30728:C 07 Mar 2021 18:51:27.198 * RDB: 751 MB of memory used by copy-on-write
8674:S 07 Mar 2021 18:51:27.800 * Background saving terminated with success
8674:S 07 Mar 2021 18:51:28.035 * FAIL message received from b991ee84cfff581481bf630437bea929255681ce about 411331e5641bcbd7b4906e36e8e1783a50064c24
8674:S 07 Mar 2021 18:51:28.035 # Cluster state changed: fail
8674:S 07 Mar 2021 18:51:28.102 # Start of election delayed for 788 milliseconds (rank #0, offset 138317741254).
8674:S 07 Mar 2021 18:51:28.906 # Starting a failover election for epoch 12.
8674:S 07 Mar 2021 18:51:28.911 # Failover election won: I'm the new master.
8674:S 07 Mar 2021 18:51:28.911 # configEpoch set to 12 after successful failover
8674:M 07 Mar 2021 18:51:28.911 # Setting secondary replication ID to 2ee7f1c1389150dc756f14ca1e3548b3d17d2427, valid up to offset: 138317741255. New replication ID is 6759d702213c0d5ef1bf1733b7b1e21d7ff8d7fe
8674:M 07 Mar 2021 18:51:28.911 # Connection with master lost.
Comment From: yossigo
@guguri12 No, the decision to initiate failover is reached by a majority of master nodes. You need to look at the logs of all master nodes to try to understand the sequence of events.
Comment From: guguri12
@yossigo Do you mean to view the log file of the master node in the same shard, or of the other master nodes?
Comment From: yossigo
@guguri12 All master instances.
Comment From: guguri12
@yossigo When the Redis client connects to the Redis server cluster, should I apply configuration to all instances (master and slave)?
Comment From: guguri12
@yossigo I have attached the log files from all master instances. After reviewing them again, I think there are 2 master instances affected by this issue.
Comment From: yossigo
@guguri12 Configuration is done locally on a per-instance basis, so yes, if you perform CONFIG SET you need to apply it to all relevant Redis instances.
There are no logs attached. Have you been able to trace them and reach a resolution? Or do you still suspect there is an issue with Redis here?
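As an illustration only (node addresses and the parameter below are placeholders, not taken from this issue), applying a runtime setting to every instance could be scripted as a loop over all masters and replicas:

# Hypothetical node list; replace with the real addresses of all masters and replicas
NODES="host1:8000 host2:8000 host3:8000"
for node in $NODES; do
  # CONFIG SET only affects the instance it is sent to, so repeat it per node
  redis-cli -h "${node%%:*}" -p "${node##*:}" CONFIG SET cluster-node-timeout 15000
done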
Comment From: guguri12
master_node_logs.txt @yossigo I didn't run CONFIG SET at that time, so I think there is something wrong with the Redis server. Please kindly help review the Redis log files from the master instances.
Comment From: yossigo
@guguri12 Looks like you experienced a temporary node/network failure, which resulted in a quorum decision to fail over the node. Later on, the failed-over node was communicating with the other nodes again and re-synced with the new master as a replica.
This is where it all started:
34111:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
33609:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
32484:M 07 Mar 2021 18:50:44.889 * FAIL message received from 195b4a7a2664b34f76833620c633731ffb2d5171 about d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179
33018:M 07 Mar 2021 18:50:44.883 * Marking node d8afbeb06e23e5f5a5f7afe4f6b87f6169d73179 as failing (quorum reached).
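One thing worth checking on your side (a suggestion, not a diagnosis from these logs): the length of the outage required before a quorum marks a node as failing is governed by cluster-node-timeout, and each node's view of cluster health can be confirmed directly. Host and port here are placeholders:

# Failure-detection timeout in milliseconds on a given node
redis-cli -h <master-ip> -p <master-port> CONFIG GET cluster-node-timeout
# Current cluster state as seen by that node
redis-cli -h <master-ip> -p <master-port> CLUSTER INFO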