The master server starts streaming the RDB to the slave. Within a few minutes the default "client-output-buffer-limit" is reached and the slave's client connection (psync) is closed, yet the master does not stop streaming the RDB. So we assume the client output buffer keeps growing, which led to the OOM on the master. Logs are attached below.
client-output-buffer-limit "slave 268435456 67108864 60"
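For reference, the replica connection's output buffer growth can be watched on the master during the full sync, e.g. on Redis 5.0+:

redis-cli CLIENT LIST TYPE replica
# omem is the output buffer size in bytes for each replica client; in the
# master log below it had reached omem=97394000 when the client was closed.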
Crash Log:
[39075742.256857] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled
[39075742.257168] CPU: 3 PID: 19170 Comm: redis-server Not tainted 3.10.0-693.21.1.el7.x86_64
Master Log:
19170:S 03 Feb 2020 08:06:07.621 * Replica XXX.XX.XXX.XX:6379 asks for synchronization
19170:S 03 Feb 2020 08:06:07.621 * Unable to partial resync with replica XXX.XX.XXX.XX:6379 for lack of backlog (Replica request was: 76512498416396).
19170:S 03 Feb 2020 08:06:07.621 * Delay next BGSAVE for diskless SYNC
19170:S 03 Feb 2020 08:06:13.009 * Starting BGSAVE for SYNC with target: replicas sockets
19170:S 03 Feb 2020 08:06:13.297 * Background RDB transfer started by pid 13574
19170:S 03 Feb 2020 08:09:49.102 # Client id=750 addr=XXX.XX.XXX.XX:36454 fd=9 name= age=222 idle=222 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16359 oll=4750 omem=97394000 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
19170:S 03 Feb 2020 08:09:49.199 # Connection with replica XXX.XX.XXX.XX:6379 lost.
13574:C 03 Feb 2020 08:45:28.335 * RDB: 6980 MB of memory used by copy-on-write
19170:S 03 Feb 2020 08:45:29.418 * Background RDB transfer terminated with success
19170:S 03 Feb 2020 08:48:38.513 * Replica XXX.XX.XXX.XX:6379 asks for synchronization
19170:S 03 Feb 2020 08:48:38.513 * Unable to partial resync with replica XXX.XX.XXX.XX:6379 for lack of backlog (Replica request was: 76513089252891).
Slave Log:
7814:S 03 Feb 2020 08:06:07.462 * Connecting to MASTER XXX.XX.XXX.XXX:6379
7814:S 03 Feb 2020 08:06:07.462 * MASTER <-> REPLICA sync started
7814:S 03 Feb 2020 08:06:07.501 * Non blocking connect for SYNC fired the event.
7814:S 03 Feb 2020 08:06:07.539 * Master replied to PING, replication can continue...
7814:S 03 Feb 2020 08:06:07.616 * Trying a partial resynchronization (request 57b6910baf3bec5fdc88673ccafe3e06a815959b:76512498416396).
7814:S 03 Feb 2020 08:06:13.043 * Full resync from master: 57b6910baf3bec5fdc88673ccafe3e06a815959b:76513089252890
7814:S 03 Feb 2020 08:06:13.043 * Discarding previously cached master state.
7814:S 03 Feb 2020 08:06:13.332 * MASTER <-> REPLICA sync: receiving streamed RDB from master
7814:S 03 Feb 2020 08:45:29.078 * MASTER <-> REPLICA sync: Flushing old data
7814:S 03 Feb 2020 08:46:45.679 * MASTER <-> REPLICA sync: Loading DB in memory
7814:S 03 Feb 2020 08:48:38.153 * MASTER <-> REPLICA sync: Finished with success
7814:S 03 Feb 2020 08:48:38.153 # Connection with master lost.
Comment From: oranagra
The fact that the output buffer limit was reached during replication and caused the disconnection of the slave is probably normal. If there's high write traffic during replication you'll need to increase the slave buffer limit.
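The three numbers in that setting are the hard limit (bytes), the soft limit (bytes) and the soft-limit window (seconds), so the default above drops the replica connection once its output buffer exceeds 256 MB, or stays above 64 MB for 60 seconds. As a rough sketch (the values here are only examples, to be sized for the write traffic during a full sync), the limit can be raised at runtime:

redis-cli CONFIG SET client-output-buffer-limit "slave 1073741824 268435456 120"
# hard limit 1 GB, soft limit 256 MB sustained for 120 seconds
redis-cli CONFIG REWRITE
# optional: persist the new value back to redis.conf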
I'm not sure what the piece of kernel log you attached has to do with any of that; it seems completely unrelated.
The only problem I see is that despite the fact that the master disconnected the slave, the replication fork child keeps transferring the RDB to it, and eventually reports "Finished with success" (which is incorrect).
This bug is already fixed in 5.0.7: https://github.com/antirez/redis/blob/5.0.7/src/networking.c#L819 and is also fixed in 6.0 in a completely different way.
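To check whether a given deployment already has that fix, the running version can be confirmed with something like:

redis-cli INFO server | grep redis_version
# 5.0.7 or later includes the fix linked above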
Comment From: ganeshkumarganesan
The only problem I see is that despite the fact that the master disconnected the slave, the replication fork child keeps transferring the RDB to it, and eventually reports "Finished with success" (which is incorrect).
So that RDB transfer is the reason all the new commands on the master accumulated in the buffer, which in turn caused the Out Of Memory?
@oranagra Thanks for the fix.
Comment From: oranagra
@ganeshkumarganesan the parent process (master) actually did disconnect the slave and stopped accumulating output buffers for it. But since the child process didn't terminate, copy-on-write memory kept accumulating, so that could be the reason for the kernel OOM.
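If it helps to confirm that on a live system, the copy-on-write overhead of the last RDB child can also be read from INFO (it's the same figure the child printed in the "6980 MB of memory used by copy-on-write" line above):

redis-cli INFO persistence | grep cow
# rdb_last_cow_size is the copy-on-write memory (in bytes) used by the last RDB child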