Redis Slave lost connection of master in 2.8.8

Hello, recently i migrate to redis 2.8.8, but this is a problem with replication. slave will lost connection after 900 seconds of first sync with master and will set the master to itself. But If I degrade to 2.6.17, there is not this problem. Anyone has some ideas? Thanks.

Messages in log:

[6294] 02 Apr 19:06:24.192 # Server started, Redis version 2.8.8 [6294] 02 Apr 19:06:24.192 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect. [6294] 02 Apr 19:06:24.192 * DB loaded from disk: 0.000 seconds [6294] 02 Apr 19:06:24.192 * The server is now ready to accept connections on port 6379 [6294] 02 Apr 19:06:25.192 * Connecting to MASTER 10.128.11.83:6379 [6294] 02 Apr 19:06:25.193 * MASTER <-> SLAVE sync started [6294] 02 Apr 19:06:25.193 * Non blocking connect for SYNC fired the event. [6294] 02 Apr 19:06:25.194 * Master replied to PING, replication can continue... [6294] 02 Apr 19:06:25.194 * Partial resynchronization not possible (no cached master) [6294] 02 Apr 19:06:25.196 * Full resync from master: fd49b15e45a55e0f7ab1d6e59cb574d33ee13e5b:1 [6294] 02 Apr 19:06:25.219 * MASTER <-> SLAVE sync: receiving 124 bytes from master [6294] 02 Apr 19:06:25.220 * MASTER <-> SLAVE sync: Flushing old data [6294] 02 Apr 19:06:25.220 * MASTER <-> SLAVE sync: Loading DB in memory [6294] 02 Apr 19:06:25.220 * MASTER <-> SLAVE sync: Finished with success [6294] 02 Apr 19:09:35.847 # Connection with master lost. [6294] 02 Apr 19:09:35.847 * Caching the disconnected master state. [6294] 02 Apr 19:09:35.847 * Discarding previously cached master state. [6294] 02 Apr 19:09:35.847 * SLAVE OF 127.0.0.1:6379 enabled (user request) [6294] 02 Apr 19:09:35.848 # CONFIG REWRITE executed with success. [6294] 02 Apr 19:09:36.837 * Connecting to MASTER 127.0.0.1:6379

Comment From: gonzajg

Having the same issue with master and slave in 2.8.8. Continuos disconnection. Had to turn backlog to 500mb in order to avoid downtime. Can't understand why server sends that huge amount of resync data with no problem, but has trouble answering a PING request.

Log from server:

[115250] 22 Apr 22:49:24.224 # Connection with slave 10.155.196.174:6300 lost. [115250] 22 Apr 22:49:24.278 * Slave asks for synchronization [115250] 22 Apr 22:49:24.709 * Partial resynchronization request accepted. Sending 475575343 bytes of backlog starting from offset 3341440037224. [115250] 22 Apr 22:52:42.252 # Connection with slave 10.155.196.174:6300 lost. [115250] 22 Apr 22:52:42.336 * Slave asks for synchronization [115250] 22 Apr 22:52:42.681 * Partial resynchronization request accepted. Sending 484955663 bytes of backlog starting from offset 3342983118227. [115250] 22 Apr 22:53:47.287 # Connection with slave 10.155.196.174:6300 lost. [115250] 22 Apr 22:53:47.340 * Slave asks for synchronization [115250] 22 Apr 22:53:47.664 * Partial resynchronization request accepted. Sending 476326200 bytes of backlog starting from offset 3343499453628. [115250] 22 Apr 22:54:55.313 # Connection with slave 10.155.196.174:6300 lost. [115250] 22 Apr 22:54:55.369 * Slave asks for synchronization [115250] 22 Apr 22:54:55.731 * Partial resynchronization request accepted. Sending 518925163 bytes of backlog starting from offset 3344035506726.

Log from slave:

[3911] 22 Apr 21:49:41.357 # MASTER timeout: no data nor PING received... [3911] 22 Apr 21:49:41.357 # Connection with master lost. [3911] 22 Apr 21:49:41.357 * Caching the disconnected master state. [3911] 22 Apr 21:49:41.357 * Connecting to MASTER 10.155.245.132:6300 [3911] 22 Apr 21:49:41.357 * MASTER <-> SLAVE sync started [3911] 22 Apr 21:49:41.359 * Non blocking connect for SYNC fired the event. [3911] 22 Apr 21:49:41.407 * Master replied to PING, replication can continue... [3911] 22 Apr 21:49:41.410 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3341440037224). [3911] 22 Apr 21:49:41.412 * Successful partial resynchronization with master. [3911] 22 Apr 21:49:41.413 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization. [3911] 22 Apr 21:52:59.391 # MASTER timeout: no data nor PING received... [3911] 22 Apr 21:52:59.391 # Connection with master lost. [3911] 22 Apr 21:52:59.391 * Caching the disconnected master state. [3911] 22 Apr 21:52:59.391 * Connecting to MASTER 10.155.245.132:6300 [3911] 22 Apr 21:52:59.391 * MASTER <-> SLAVE sync started [3911] 22 Apr 21:52:59.392 * Non blocking connect for SYNC fired the event. [3911] 22 Apr 21:52:59.472 * Master replied to PING, replication can continue... [3911] 22 Apr 21:52:59.475 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3342983118227). [3911] 22 Apr 21:52:59.477 * Successful partial resynchronization with master. [3911] 22 Apr 21:52:59.477 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization. [3911] 22 Apr 21:54:04.425 # MASTER timeout: no data nor PING received... [3911] 22 Apr 21:54:04.425 # Connection with master lost. [3911] 22 Apr 21:54:04.425 * Caching the disconnected master state. [3911] 22 Apr 21:54:04.425 * Connecting to MASTER 10.155.245.132:6300 [3911] 22 Apr 21:54:04.425 * MASTER <-> SLAVE sync started [3911] 22 Apr 21:54:04.426 * Non blocking connect for SYNC fired the event. [3911] 22 Apr 21:54:04.474 * Master replied to PING, replication can continue... [3911] 22 Apr 21:54:04.477 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3343499453628). [3911] 22 Apr 21:54:04.479 * Successful partial resynchronization with master. [3911] 22 Apr 21:54:04.479 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

Comment From: charsyam

@jlming6 @gonzajg how many memories do you have?

and what is your client-output-buffer-limit slave in redis.conf?

If you have large memories, you should change your client-output-buffer-limit slave settings.

Comment From: gonzajg

We use a EC2 Amazon instance with 244 GB ram in both (master and slave)

Currenty this is my info memory: used_memory:14147782848 used_memory_human:13.18G used_memory_rss:14405611520 used_memory_peak:18898750272 used_memory_peak_human:17.60G used_memory_lua:33792 mem_fragmentation_ratio:1.02 mem_allocator:jemalloc-3.2.0

config get client-output-buffer-limit 1) "client-output-buffer-limit" 2) "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"

We don't use pubsub

Comment From: antirez

@gonzajg Hello, please tune your replication timeout, which is the repl-timeout setting in redis.conf. Set it to a larger value, this should fix your issue.

Comment From: antirez

@jlming6 I see this in the logs:

[6294] 02 Apr 19:09:35.847 # Connection with master lost.

So the master closed the connection, this means that we need to look at the master's logs as well to understand why the master is closing the slave connection. 2.6 and 2.8 are very different beasts regarding replication. In Redis 2.6 the master was incapable of sensing a timing out slave, so the timeout was infinite. In Redis 2.8 you have to tune your timeouts in the right way, especially in high latency environments.

Comment From: gonzajg

@antirez Hi, thanks for your answer and Redis! We are using Redis as our only database for our application and use the default 60 seconds repl-timeout and 10 for repl-ping-slave-period . Isn't that long enough? When slave loses ping from server does that mean that there's no data for 60 seconds or that the slave is delayed from the master by more than that time?

From the redis.conf: The following option sets the replication timeout for:

1) Bulk transfer I/O during SYNC, from the point of view of slave. 2) Master timeout from the point of view of slaves (data, pings). 3) Slave timeout from the point of view of masters (REPLCONF ACK pings).

It is important to make sure that this value is greater than the value specified for repl-ping-slave-period otherwise a timeout will be detected every time there is low traffic between the master and the slave.

repl-timeout 60

It says every time there's low traffic, but we have almost 7k ops per second (continously) in the slave only from the data sent by the master

Comment From: gonzajg

UPDATE: Changing my repl-timeout to 600 I get almost the same problem but without the MASTER timeout: no data nor PING received...

I notice that date is different by 1 hour but I have other slaves with no problems.

I don't get that when it says that connection with master is lost, 1 second later the connection is reestablished without any problem.

Master log:

[115250] 23 Apr 11:15:48.757 * Slave asks for synchronization [115250] 23 Apr 11:15:48.757 * Unable to partial resync with the slave for lack of backlog (Slave request was: 3468142567658). [115250] 23 Apr 11:15:48.757 * Starting BGSAVE for SYNC [115250] 23 Apr 11:15:49.074 * Background saving started by pid 93227 [93227] 23 Apr 11:18:07.422 * DB saved on disk [93227] 23 Apr 11:18:07.702 * RDB: 1378 MB of memory used by copy-on-write [115250] 23 Apr 11:18:08.296 * Background saving terminated with success [115250] 23 Apr 11:19:33.704 * Synchronization with slave succeeded

Slave log:

[3911] 23 Apr 10:16:04.555 # Connection with master lost. [3911] 23 Apr 10:16:04.555 * Caching the disconnected master state. [3911] 23 Apr 10:16:05.556 * Connecting to MASTER 10.155.245.132:6300 [3911] 23 Apr 10:16:05.556 * MASTER <-> SLAVE sync started [3911] 23 Apr 10:16:06.553 * Non blocking connect for SYNC fired the event. [3911] 23 Apr 10:16:06.553 * Master replied to PING, replication can continue... [3911] 23 Apr 10:16:06.554 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3468142567658). [3911] 23 Apr 10:16:06.554 * Full resync from master: f85250bedc4b4492c0b2531dcddd01439ce289fc:3471519177316 [3911] 23 Apr 10:16:06.554 * Discarding previously cached master state. [3911] 23 Apr 10:18:26.090 * MASTER <-> SLAVE sync: receiving 6092282341 bytes from master [3911] 23 Apr 10:19:52.632 * MASTER <-> SLAVE sync: Flushing old data [3911] 23 Apr 10:21:30.579 * MASTER <-> SLAVE sync: Loading DB in memory [3911] 23 Apr 10:24:01.467 * MASTER <-> SLAVE sync: Finished with success

Comment From: zweifisch

in my case, it's caused by misconfiguration in redis-sentinel.conf, seems that 127.0.0.1 should not be used here

sentinel monitor mymaster 127.0.0.1 6379 2

Comment From: raviknits

Was this issue solved? I have the exact problem - As soon as master starts auto AOF rewrite , it loses connection to slave and keeps getting sync requests from slave. repl-timeout is 60, but the disconnection happens every second instead of waiting for 60 secs @antirez - your suggestion of tuning the repl-timeout does not seem to work here. Help appreciated.

Comment From: qingping209

I have the same problem? Was this fixed?

Comment From: carlvine500

I have fix the "endless Redis replication loops" ** redis-2.8.17 command:** ./redis-cli -p 6679 config set repl-timeout 1024 ./redis-cli -p 6679 config set client-output-buffer-limit "slave 536870912 536870912 0"

refer: https://redislabs.com/blog/top-redis-headaches-for-devops-replication-buffer?from=groupmessage&isappinstalled=0#.Vx2W6_l96Um

Comment From: linuxholic

will set the master to itself

why this could happen???