Redis/sentinel master - slave constantly loses connection

Redis 5.0.7

I'm running a single Redis master with 2 slave nodes and 3 sentinel nodes. After a short amount of time, Redis reports that the connection between master and slave was lost, and sentinel sets the slave to replicaof 127.0.0.1. Eventually, sentinel reconnects the slave with the real master, the data is replicated, and then the connection is lost again. This keeps looping.

The database is usually under 1 MB and at most ~150 MB; traffic is low as these are test servers.

This is basically the same issue as https://github.com/antirez/redis/issues/1650. I applied the config settings recommended there, which only delayed the issue:

repl-timeout 1024
client-output-buffer-limit "slave 536870912 536870912 0"

Master redis log:

23728:M 27 Feb 2020 11:46:36.893 * Background saving terminated with success
23728:M 27 Feb 2020 11:46:36.895 * Synchronization with replica 10.6.153.153:6379 succeeded
23728:M 27 Feb 2020 11:46:57.111 * Replica 10.6.156.161:6379 asks for synchronization
23728:M 27 Feb 2020 11:46:57.111 * Full resync requested by replica 10.6.156.161:6379
23728:M 27 Feb 2020 11:46:57.111 * Starting BGSAVE for SYNC with target: disk
23728:M 27 Feb 2020 11:46:57.112 * Background saving started by pid 25745
25745:C 27 Feb 2020 11:46:57.119 * DB saved on disk
25745:C 27 Feb 2020 11:46:57.120 * RDB: 0 MB of memory used by copy-on-write
23728:M 27 Feb 2020 11:46:57.139 * Background saving terminated with success
23728:M 27 Feb 2020 11:46:57.140 * Synchronization with replica 10.6.156.161:6379 succeeded
23728:M 27 Feb 2020 11:49:38.244 # Connection with replica 10.6.153.153:6379 lost.

Slave redis log:

19709:S 27 Feb 2020 11:46:36.813 * Connecting to MASTER 10.6.156.14:6379
19709:S 27 Feb 2020 11:46:36.813 * MASTER <-> REPLICA sync started
19709:S 27 Feb 2020 11:46:36.814 * Non blocking connect for SYNC fired the event.
19709:S 27 Feb 2020 11:46:36.815 * Master replied to PING, replication can continue...
19709:S 27 Feb 2020 11:46:36.817 * Partial resynchronization not possible (no cached master)
19709:S 27 Feb 2020 11:46:36.817 * Full resync from master: dd8d93e15c10deed9341c6e998f4820097625305:214682
19709:S 27 Feb 2020 11:46:36.894 * MASTER <-> REPLICA sync: receiving 378334 bytes from master
19709:S 27 Feb 2020 11:46:36.897 * MASTER <-> REPLICA sync: Flushing old data
19709:S 27 Feb 2020 11:46:36.898 * MASTER <-> REPLICA sync: Loading DB in memory
19709:S 27 Feb 2020 11:46:36.904 * MASTER <-> REPLICA sync: Finished with success
19709:S 27 Feb 2020 11:47:25.013 * 10 changes in 300 seconds. Saving...
19709:S 27 Feb 2020 11:47:25.013 * Background saving started by pid 22006
22006:C 27 Feb 2020 11:47:25.025 * DB saved on disk
22006:C 27 Feb 2020 11:47:25.026 * RDB: 0 MB of memory used by copy-on-write
19709:S 27 Feb 2020 11:47:25.114 * Background saving terminated with success
19709:S 27 Feb 2020 11:49:38.244 # Connection with master lost.
19709:S 27 Feb 2020 11:49:38.244 * Caching the disconnected master state.
19709:S 27 Feb 2020 11:49:38.244 * REPLICAOF 127.0.0.1:6379 enabled (user request from 'id=29 addr=10.6.156.102:36120 fd=8 name= age=554 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=148 qbuf-free=32620 obl=36 oll=0 omem=0 events=r cmd=exec')
19709:S 27 Feb 2020 11:49:38.245 # CONFIG REWRITE executed with success.
19709:S 27 Feb 2020 11:49:39.234 * Connecting to MASTER 127.0.0.1:6379
19709:S 27 Feb 2020 11:49:39.234 * MASTER <-> REPLICA sync started
19709:S 27 Feb 2020 11:49:39.234 * Non blocking connect for SYNC fired the event.
19709:S 27 Feb 2020 11:49:39.234 * Master replied to PING, replication can continue...
19709:S 27 Feb 2020 11:49:39.234 * Trying a partial resynchronization (request dd8d93e15c10deed9341c6e998f4820097625305:320420).
19709:S 27 Feb 2020 11:49:39.235 # Unexpected reply to PSYNC from master: -MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'.
19709:S 27 Feb 2020 11:49:39.235 * Discarding previously cached master state.
19709:S 27 Feb 2020 11:49:39.235 * Retrying with SYNC...
19709:S 27 Feb 2020 11:49:39.235 # MASTER aborted replication with an error: MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'.
19709:S 27 Feb 2020 11:49:40.236 * Connecting to MASTER 127.0.0.1:6379
19709:S 27 Feb 2020 11:49:40.236 * MASTER <-> REPLICA sync started
19709:S 27 Feb 2020 11:49:40.236 * Non blocking connect for SYNC fired the event.
19709:S 27 Feb 2020 11:49:40.236 * Master replied to PING, replication can continue...
19709:S 27 Feb 2020 11:49:40.237 * Partial resynchronization not possible (no cached master)
19709:S 27 Feb 2020 11:49:40.237 # Unexpected reply to PSYNC from master: -MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'.

Comment From: hwware

Hello @bryantcj52, thank you for reporting the issue. Could you also share your sentinel configuration and log? Thanks.

Comment From: bryantcj52

I think I've found the issue; it was twofold. The main problem was that the sentinel running on the same server as the master redis was configured with 127.0.0.1 instead of the server's private IP, so whenever it needed to propagate its master redis IP, it was sending 127.0.0.1. The second part is that the redis master is on the same server as the app itself, which sits behind an AWS network load balancer with two static private IPs. I had set the other sentinels to point to the DNS name of the load balancer, so they were probably getting confused. I've switched everything to the single server private IP and it all seems stable now.
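For anyone who wants to check for the same misconfiguration, here is a minimal Python sketch that scans sentinel config text for "sentinel monitor <name> <host> <port> <quorum>" lines and flags hosts that are loopback addresses or hostnames rather than literal IPs. The function name suspect_monitor_lines, the master name "mymaster", and the load balancer DNS name are hypothetical placeholders, not taken from the actual setup; only 10.6.156.14 comes from the logs above. A hostname is not necessarily wrong, so treat a flag as a prompt to review the line.

```python
import ipaddress


def suspect_monitor_lines(conf_text):
    """Return (line_number, host, reason) for each 'sentinel monitor' line
    whose host is a loopback address or is not a literal IP address."""
    findings = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        parts = raw.split()
        # Expected shape: sentinel monitor <name> <host> <port> <quorum>
        if len(parts) != 6:
            continue
        if parts[0].lower() != "sentinel" or parts[1].lower() != "monitor":
            continue
        host = parts[3]
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            # Not parseable as an IP, so it is a hostname (e.g. a load balancer DNS name)
            findings.append((lineno, host, "hostname, not a literal IP"))
            continue
        if addr.is_loopback:
            findings.append((lineno, host, "loopback address"))
    return findings


# Hypothetical config text for illustration only
sample = """\
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel monitor mymaster my-nlb.example.internal 6379 2
sentinel monitor mymaster 10.6.156.14 6379 2
"""

for finding in suspect_monitor_lines(sample):
    print(finding)
```

In practice you would pass in the contents of each node's sentinel config file; the path depends on your installation.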