Describe the bug

With maxmemory=96gb, the replicated data volume grows so huge that it even breaches the 64gb client-output-buffer-limit. For example, in the logs below, when the application gets a burst of traffic, even though the copy-on-write memory reported for the RDB saves is around 17147 MB to 26465 MB, omem reaches 68719730600 bytes (~64 GB). This eventually impacts latency on the master node; we are currently killing the slaves to get the application back to normal, since no replication takes place once there are no slaves.

248207:C 30 Nov 2022 08:58:55.552 * RDB: 3894 MB of memory used by copy-on-write
161290:M 30 Nov 2022 08:58:56.473 * Background saving terminated with success
161290:M 30 Nov 2022 08:59:57.013 * 1000 changes in 60 seconds. Saving...
161290:M 30 Nov 2022 08:59:57.767 * Background saving started by pid 248452
248452:C 30 Nov 2022 09:01:14.618 * DB saved on disk
248452:C 30 Nov 2022 09:01:15.271 * RDB: 17087 MB of memory used by copy-on-write
161290:M 30 Nov 2022 09:01:17.482 * Background saving terminated with success
161290:M 30 Nov 2022 09:02:18.002 * 1000 changes in 60 seconds. Saving...
161290:M 30 Nov 2022 09:02:18.973 * Background saving started by pid 249718
249718:C 30 Nov 2022 09:03:34.999 * DB saved on disk
249718:C 30 Nov 2022 09:03:35.818 * RDB: 26465 MB of memory used by copy-on-write
161290:M 30 Nov 2022 09:03:38.596 * Background saving terminated with success
161290:M 30 Nov 2022 09:04:39.010 * 1000 changes in 60 seconds. Saving...
161290:M 30 Nov 2022 09:04:40.164 * Background saving started by pid 250021
250021:C 30 Nov 2022 09:05:58.030 * DB saved on disk
250021:C 30 Nov 2022 09:05:59.018 * RDB: 17147 MB of memory used by copy-on-write
161290:M 30 Nov 2022 09:06:01.563 * Background saving terminated with success
161290:M 30 Nov 2022 09:06:18.832 # Client id=46393178 addr=10.32.134.159:44277 fd=1804 name= age=744368 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=222887 omem=68719730600 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buffer limits.
161290:M 30 Nov 2022 09:06:18.847 # Connection with replica 10.32.134.159:6379 lost.
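
For reference, the omem value shown in the client log line above can also be watched live on the master; a minimal monitoring sketch (assuming shell access to the master with redis-cli):

redis-cli client list | grep flags=S   # flags=S marks replica connections; the omem field is the output-buffer size in bytes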

To reproduce

Can't reproduce this in our lower environments with our load tests.

Expected behavior

The compressed RDB produced by the replication fork and streamed to the slave should be less than 32gb for a maxmemory of 96gb, assuming a best-case 3:1 compression ratio.

Additional information

Redis version 5.0.6

Available RAM on the server: 187gb

From redis.conf:

client-output-buffer-limit replica 64gb 64gb 0
save 900 1
save 300 10
save 60 1000
maxmemory-policy volatile-ttl
maxmemory 96gb
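
For clarity (this annotation is added here and is not part of the original config), the directive format is client-output-buffer-limit <class> <hard-limit> <soft-limit> <soft-seconds>, so

client-output-buffer-limit replica 64gb 64gb 0   # hard limit 64gb, soft limit 64gb, no soft-limit window

means a replica connection is closed as soon as its output buffer (omem) exceeds 64gb, which matches the disconnect in the log above.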

Comment From: oranagra

The size of the output buffer is not directly affected by the size of the dataset, but rather indirectly. The main factor here is the write traffic during that time (if you have a high rate of writes, they'll accumulate in the replica's output buffer while the sync is in progress). The size of the dataset affects the replication time, which you can maybe reduce by switching to diskless replication (if your network is faster than the disk); please look at repl-diskless-sync and repl-diskless-load. Another factor here may be COW, which you can maybe reduce by switching to Redis 7.0, which has some improvements in that area.

Other than that, I'll mention that we have plans to eliminate this buffer buildup in a future version, see this
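
As a rough sketch of the diskless-replication settings mentioned above (values are illustrative, not a recommendation):

# master side: stream the RDB directly to the replica sockets instead of writing it to disk first
repl-diskless-sync yes
# wait a few seconds so that several replicas can attach to the same transfer
repl-diskless-sync-delay 5
# replica side: loading the RDB straight off the socket needs repl-diskless-load, which is only available from Redis 6.0
repl-diskless-load on-empty-db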

Comment From: rgampa

Thank you @oranagra for the prompt reply. We are already in the process of upgrading to 7.x, which will be completed by the end of this month. On the 5.x version, can we tune repl-timeout (currently set to 60 seconds) and repl-backlog-size (currently set to 512 MB) to avoid a full sync when the replica/master connection is dropped due to timeouts? If yes, what values would you recommend for our data usage?

Comment From: oranagra

I suppose you can do what you suggested, and you can also try diskless replication. In 5.0 only a diskless master (repl-diskless-sync) is supported; the replica will still remain disk-based, but it can still probably reduce replication time. I don't have any numbers though, you'll have to try and figure them out.
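
Both settings can also be changed at runtime without a restart; a minimal sketch, where the values are only placeholders to be validated against your own workload:

redis-cli config set repl-timeout 600        # replication timeout, in seconds
redis-cli config set repl-backlog-size 5gb   # backlog must hold the writes made while a replica is disconnected
redis-cli config set repl-diskless-sync yes  # diskless master, supported in 5.0
redis-cli config rewrite                     # optionally persist the changes back to redis.conf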

Comment From: rgampa

Thanks @oranagra, this issue can be closed now.

We were able to reproduce the issue with a load test, and the settings below helped avoid a full sync.

repl-timeout 600
repl-backlog-size 5gb
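
As a rough sanity check on the backlog size (the write rate below is an assumed placeholder, not a measured number): the backlog has to hold all the writes made while a replica is disconnected or still syncing, so with a burst write rate of, say, about 8 MB/s and the 600-second timeout above, 8 MB/s x 600 s = 4800 MB, which the 5gb backlog just covers.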