Describe the bug here is valgrind CPU between redis6.x and redis7.x when execute replicationFeedSlaves.

in redis6.x copy argv&argc to echo replica redis-server replicationFeedSlaves 67.42% (1.44%) 1040385×

in redis7.x use one global shared replication buffer, redis-server replicationFeedSlaves 58.22% (0.67%) 4099731×

redis-server feedReplicationBuffer 56.16% (9.35%) 41089161×

To reproduce

replicationFeedSlaves funcation still high flow rate cpu 。 when master have 10 Slaves ,

Expected behavior

replicationFeedSlaves CPU utilization < 10%

Additional information

redis-server.valgrind2.out.svg redis-server.valgrind.buf.out.svg

Comment From: xiaozhitaba

redis-server.valgrind2.out.svg::

redis-server valgrind2 out redis-server.valgrind.buf.out.svg:: redis-server valgrind buf out

Comment From: xiaozhitaba

we call prepareReplicasToWrite in feedReplicationBuffer may be inappropriate.

redis-server prepareReplicasToWrite 38.55% (7.14%) 41089161×

Comment From: oranagra

@xiaozhitaba i'm not sure i follow you. please take a step backwards and explain what you mean to report. is there any regression here compared to previous versions? or is it just that you where expecting some improvement and it didn't materialize? or do you have a suggestion? in which case maybe push a PR?

your reproduction steps are not very clear to me. you mean you have 10 replicas on the one master?

Comment From: xiaozhitaba

@oranagra when we sent 1000k "set key value" , prepareReplicasToWrite in feedReplicationBuffer cost too much cpu. do we have a better way to avoid this problem? when we exec feedReplicationBuffer in replicationFeedSlaves , we need to traverse all Slaves, too waste cpu resources

Comment From: xiaozhitaba

@oranagra I suggest

void feedReplicationBuffer(char *s, size_t len) {

if 0

/* Install write handler for all replicas. */
prepareReplicasToWrite();

endif

...... }

void replicationFeedSlaves(list slaves, int dictid, robj *argv, int argc) { ......

if 1

/* Install write handler for all replicas. */
prepareReplicasToWrite();

endif

}

Comment From: oranagra

are you sure the problem is just the traversal of slaves and CPU utilization? if you're suggesting to move the call to prepareReplicasToWrite to replicationFeedSlaves (and replicationFeedStreamFromMasterStream) seems valid. did you / can you benchmark that impact of this?

please note that in the scenario you're describing, there's a severe amplification problem that is unavoidable. i.e. on every input byte from a client, there are 10 output bytes. each command that is read from a client socket, is written to 10 sockets.

Comment From: xiaozhitaba

@oranagra here is my way void feedReplicationBuffer(char s, size_t len) {

if 0

/ Install write handler for all replicas. */ prepareReplicasToWrite();

endif

...... }

void replicationFeedSlaves(list slaves, int dictid, robj argv, int argc) {

if 1

/ Install write handler for all replicas. */ prepareReplicasToWrite();

endif

...... }

send 1000k "set key value" replication to 10 salves, watch cpu utilization in valgrind. it is much better . replicationFeedSlaves 34.48% (0.99%) 1000096×

10 sockets write . cpu cost only 0.76%. writeToClient 0.76% (0.19%) 393860×

redis-server-10-slaves valgrind out

Comment From: ShooterIT

I will have a look. before 7.0, i noticed it costs much on appending buffer into replicas' output buffer. I thought #9166 resolved it, but from your test, it seems there still is an issue, thanks, i will test ASAP.

Comment From: ShooterIT

Hi @xiaozhitaba except different call number of prepareReplicasToWrite, did you find performance loss? I found your way has better performance, before

Summary:
  throughput summary: 62438.34 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.726     0.096     0.703     1.007     1.383     2.599

after

Summary:
  throughput summary: 65684.89 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.683     0.096     0.671     0.943     1.335     2.215

You can submit a PR, as oranagra said,

call to prepareReplicasToWrite to replicationFeedSlaves (and replicationFeedStreamFromMasterStream) seems valid.

Comment From: ShooterIT

Hi @oranagra BTW, we may call much times of prepareClientToWrite for complex reply.

Comment From: oranagra

Hi @oranagra BTW, we may call much times of prepareClientToWrite for complex reply.

yes, but this is not a conceptually complex of expensive function to call. i.e. it does a series of checks which will end up doing nothing if it is called for the second time. it would be silly to call it in a loop if we can move the call out of the loop. but i don't see any reason to change it or avoid calling it.

Comment From: soloestoy

anything news about this issue? i think 7.0 GA should take this optimization, @xiaozhitaba could you submit a PR to fix it?

Comment From: oranagra

@ShooterIT maybe you want to pick this up?

Comment From: ShooterIT

OK, done this week

Redis [BUG]replicationFeedSlaves still high CPU utilization in redis7.0 when it Replication backlog and replicas use one global shared replication buffer

if 0

endif

if 1

endif

if 0

endif

if 1

endif