Describe the bug here is valgrind CPU between redis6.x and redis7.x when execute replicationFeedSlaves.
in redis6.x copy argv&argc to echo replica redis-server replicationFeedSlaves 67.42% (1.44%) 1040385×
in redis7.x use one global shared replication buffer, redis-server replicationFeedSlaves 58.22% (0.67%) 4099731×
redis-server feedReplicationBuffer 56.16% (9.35%) 41089161×
To reproduce
replicationFeedSlaves funcation still high flow rate cpu 。 when master have 10 Slaves ,
Expected behavior
replicationFeedSlaves CPU utilization < 10%
Additional information
redis-server.valgrind2.out.svg redis-server.valgrind.buf.out.svg
Comment From: xiaozhitaba
redis-server.valgrind2.out.svg::
redis-server.valgrind.buf.out.svg::
Comment From: xiaozhitaba
we call prepareReplicasToWrite in feedReplicationBuffer may be inappropriate.
redis-server prepareReplicasToWrite 38.55% (7.14%) 41089161×
Comment From: oranagra
@xiaozhitaba i'm not sure i follow you. please take a step backwards and explain what you mean to report. is there any regression here compared to previous versions? or is it just that you where expecting some improvement and it didn't materialize? or do you have a suggestion? in which case maybe push a PR?
your reproduction steps are not very clear to me. you mean you have 10 replicas on the one master?
Comment From: xiaozhitaba
@oranagra when we sent 1000k "set key value" , prepareReplicasToWrite in feedReplicationBuffer cost too much cpu. do we have a better way to avoid this problem? when we exec feedReplicationBuffer in replicationFeedSlaves , we need to traverse all Slaves, too waste cpu resources
Comment From: xiaozhitaba
@oranagra I suggest
void feedReplicationBuffer(char *s, size_t len) {
if 0
/* Install write handler for all replicas. */
prepareReplicasToWrite();
endif
...... }
void replicationFeedSlaves(list slaves, int dictid, robj *argv, int argc) { ......
if 1
/* Install write handler for all replicas. */
prepareReplicasToWrite();
endif
}
Comment From: oranagra
are you sure the problem is just the traversal of slaves and CPU utilization?
if you're suggesting to move the call to prepareReplicasToWrite to replicationFeedSlaves (and replicationFeedStreamFromMasterStream) seems valid.
did you / can you benchmark that impact of this?
please note that in the scenario you're describing, there's a severe amplification problem that is unavoidable. i.e. on every input byte from a client, there are 10 output bytes. each command that is read from a client socket, is written to 10 sockets.
Comment From: xiaozhitaba
@oranagra here is my way void feedReplicationBuffer(char s, size_t len) {
if 0
/ Install write handler for all replicas. */ prepareReplicasToWrite();
endif
...... }
void replicationFeedSlaves(list slaves, int dictid, robj argv, int argc) {
if 1
/ Install write handler for all replicas. */ prepareReplicasToWrite();
endif
...... }
send 1000k "set key value" replication to 10 salves, watch cpu utilization in valgrind. it is much better . replicationFeedSlaves 34.48% (0.99%) 1000096×
10 sockets write . cpu cost only 0.76%. writeToClient 0.76% (0.19%) 393860×
Comment From: ShooterIT
I will have a look. before 7.0, i noticed it costs much on appending buffer into replicas' output buffer. I thought #9166 resolved it, but from your test, it seems there still is an issue, thanks, i will test ASAP.
Comment From: ShooterIT
Hi @xiaozhitaba except different call number of prepareReplicasToWrite, did you find performance loss? I found your way has better performance, before
Summary:
throughput summary: 62438.34 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.726 0.096 0.703 1.007 1.383 2.599
after
Summary:
throughput summary: 65684.89 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.683 0.096 0.671 0.943 1.335 2.215
You can submit a PR, as oranagra said,
call to prepareReplicasToWrite to replicationFeedSlaves (and replicationFeedStreamFromMasterStream) seems valid.
Comment From: ShooterIT
Hi @oranagra BTW, we may call much times of prepareClientToWrite for complex reply.
Comment From: oranagra
Hi @oranagra BTW, we may call much times of prepareClientToWrite for complex reply.
yes, but this is not a conceptually complex of expensive function to call. i.e. it does a series of checks which will end up doing nothing if it is called for the second time. it would be silly to call it in a loop if we can move the call out of the loop. but i don't see any reason to change it or avoid calling it.
Comment From: soloestoy
anything news about this issue? i think 7.0 GA should take this optimization, @xiaozhitaba could you submit a PR to fix it?
Comment From: oranagra
@ShooterIT maybe you want to pick this up?
Comment From: ShooterIT
OK, done this week