Redis [CRASH] Redis Instance is down while master sync

Description of the Problem - We have total of 72 redis cluster instances (36 master & 36 slave) and each instance contains approximately 68 millions (29 GB) data. - In order to check whether the redis instance is up properly from the dump file, we kill the redis process on one of the slave instances and then start it again. However, redis instance is down during the "MASTER <-> REPLICA sync: Flushing old data" step. When we remove the dump file and start the redis instance again, we do not encounter such problem. Dump file's size is 13 GB - We tried to increase the client-output-buffer-limit on this slave node and also its master (both tried the configs 1GB/256MB & 3GB/1GB) but it did not work. CONFIG set client-output-buffer-limit "normal 0 0 0 slave 1073741824 268435456 60 pubsub 33554432 8388608 60"

Do you have any solutions for this problem?

Crash Report We hid the redis host and port information in the below log since it is a corporate information.

25084:C 17 Apr 2024 01:30:24.038 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo 25084:C 17 Apr 2024 01:30:24.039 # Redis version=6.0.5, bits=64, commit=00000000, modified=0, pid=25084, just started 25084:C 17 Apr 2024 01:30:24.040 # Configuration loaded 25084:M 17 Apr 2024 01:30:24.041 * Increased maximum number of open files to 10032 (it was originally set to 1024). 25084:M 17 Apr 2024 01:30:24.043 * Node configuration loaded, I'm 39170ea79f2889e3012e860fbad701fb70f80364 25084:M 17 Apr 2024 01:30:24.045 * Running mode=cluster, port=XXX. 25084:M 17 Apr 2024 01:30:24.046 # Server initialized 25084:M 17 Apr 2024 01:30:24.047 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled. 25084:M 17 Apr 2024 01:30:24.048 * Loading RDB produced by version 6.0.5 25084:M 17 Apr 2024 01:30:24.048 * RDB age 1282 seconds 25084:M 17 Apr 2024 01:30:24.049 * RDB memory usage when created 30022.11 Mb 25084:M 17 Apr 2024 01:38:50.610 * DB loaded from disk: 506.562 seconds 25084:M 17 Apr 2024 01:38:50.611 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer. 25084:M 17 Apr 2024 01:38:50.611 * Ready to accept connections 25084:S 17 Apr 2024 01:38:50.631 * Discarding previously cached master state. 25084:S 17 Apr 2024 01:38:50.632 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer. 25084:S 17 Apr 2024 01:38:50.633 # Cluster state changed: ok 25084:S 17 Apr 2024 01:38:51.639 * Connecting to MASTER XX.XX.XX.XX:XXXX 25084:S 17 Apr 2024 01:38:51.640 * MASTER <-> REPLICA sync started 25084:S 17 Apr 2024 01:38:51.641 * Non blocking connect for SYNC fired the event. 25084:S 17 Apr 2024 01:38:51.642 * Master replied to PING, replication can continue... 25084:S 17 Apr 2024 01:38:51.643 * Trying a partial resynchronization (request 81d2e9518bbd165853d8c2464ac2a8b2da03503f:2141620117578). 25084:S 17 Apr 2024 01:38:52.683 * Full resync from master: 81d2e9518bbd165853d8c2464ac2a8b2da03503f:2141678892713 25084:S 17 Apr 2024 01:38:52.684 * Discarding previously cached master state. 25084:S 17 Apr 2024 01:44:32.483 * MASTER <-> REPLICA sync: receiving 13043613357 bytes from master to disk 25084:S 17 Apr 2024 01:45:00.170 * MASTER <-> REPLICA sync: Flushing old data

Comment From: sundb

@BuseSolmaz what's the meaning of redis instance is down? the replication crashed? if so, can you provide the fully log include the crash log?

Comment From: BuseSolmaz

Sorry for the misunderstanding. I mean during the master-replica sync process redis instance is restarting. The above redis log is all we had for this problem. How can we get the detailed crash log? @sundb

Comment From: sundb

@BuseSolmaz do you mean the replication was closed after the last log 25084:S 17 Apr 2024 01:45:00.170 * MASTER <-> REPLICA sync: Flushing old data?

Comment From: sundb

@BuseSolmaz sorry for late reply, did you enable config repl_slave_lazy_flush? in theory, flush such a large dataset would take a lot of time.