Describe the bug
Hi all,
I was wondering if anyone might be able to shed some light on this rather unusual behaviour I've been experiencing.
I have a sorted set with 1M entries, and I'm using ZREVRANGE to get each player's ranking plus the two players above and below. I'm using node-redis on Ubuntu 20.04.3. However, now that I've upgraded from Redis 5.0.7 to 6.2.6, the very same code is taking 50% longer to complete. I'm using the node-redis multi() command, as (for reasons I'm not entirely clear about) it has better performance than batch(). Whether I bunch my requests up into 100k batches or 10k batches, the result is the same.
On Redis 5.0.7 (the version on apt-get), I can pull the results for every player in 966ms. On 6.2.6 that same set of queries takes 1.415s to complete.
Other queries are unaffected - I also fetch ZREVRANK for every player, and that took 1.8s before and takes 1.8s now.
To reproduce
It should be possible to reproduce with a 1-million-entry sorted set. The sorted set is literally just a list of incremental IDs, each with a randomly allocated score between 0 and 500.
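For anyone reproducing from scratch, here is a minimal population sketch (hypothetical; it uses hiredis, while the reporter used node-redis, and the memtier one-liner in the first comment below works just as well):

```c
#include <stdio.h>
#include <stdlib.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }
    for (long i = 1; i <= 1000000; i++) {
        double score = 500.0 * rand() / RAND_MAX;          /* score in [0, 500] */
        redisAppendCommand(c, "ZADD lb %f %ld", score, i); /* pipelined ZADD */
        if (i % 10000 == 0) {                              /* drain replies in chunks */
            for (int j = 0; j < 10000; j++) {
                redisReply *r;
                if (redisGetReply(c, (void **)&r) != REDIS_OK) return 1;
                freeReplyObject(r);
            }
        }
    }
    redisFree(c);
    return 0;
}
```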
Here's the code I'm using for the retrieval:

```js
// requires: npm install redis@3 (node-redis 3.x)
const redis = require('redis');
const { performance } = require('perf_hooks');

const client = redis.createClient();
const totalarr = [];

function getdata(n){
    const loopinc = 100000;
    const final = 1000000;
    let initial, target;
    if(n == 1){
        initial = 1;
        target = loopinc;
    }else{
        initial = (n - 1) * loopinc;
        target = n * loopinc;
    }
    console.log('start from: ' + initial + ', stop at: ' + target);

    // Build a fresh MULTI per batch. TIME is queued first and last so that
    // reply[0] and reply[reply.length - 1] are [seconds, microseconds]
    // pairs from the server, which the parsing below expects.
    const multi = client.multi();
    multi.time();
    for(let i = initial; i < target; i++){
        if(i < 2){
            multi.zrevrange('lb', 0, 5);
        }else{
            multi.zrevrange('lb', i - 2, i + 2);
        }
    }
    multi.time();

    const start = performance.now();
    multi.exec(function(err, reply){
        if(err) throw err;
        const end = performance.now();
        // convert both TIME replies to milliseconds and take the difference
        const redisstart = (parseInt(reply[0][0]) * 1000) + (parseInt(reply[0][1]) / 1000);
        const last = reply[reply.length - 1];
        const redisend = (parseInt(last[0]) * 1000) + (parseInt(last[1]) / 1000);
        const execution_time = redisend - redisstart; // server-side time in ms
        console.log('Execution time: ' + execution_time);
        totalarr.push(execution_time);
        console.log('Retrieved in ' + (end - start) + ' ms');
        if(target < final){
            getdata(n + 1);
        }else{
            const total = totalarr.reduce(function(prev, curr){ return prev + curr; }, 0);
            console.log('total execution time: ' + total);
            client.quit();
        }
    });
}

getdata(1);
```
Expected behavior
The same performance as on 5.0.7 rather than being 50% slower.
Additional information
Platform: Ubuntu 20.04.3 LTS
Redis: 5.0.7 and 6.2.6
Client: node-redis 3.1.2
Comment From: filipecosta90
@v-flashpoint you're right about the regression: 172K ops/sec on v5 vs 140K ops/sec on v6.2; the overall regression sits at around 25%.
I've added a reproduction in https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-zset-1M-elements-zrevrange-5-elements.yml, so this can easily be tested with memtier alone. It is now being tracked as part of our CI, so this type of regression on this command should not slip through unnoticed again. thank you @v-flashpoint!
TL;DR @redis/core-team: the introduction of deferred replies seems to be the main reason for the performance difference.
@redis/core-team I'll go over the numbers using a standalone deployment:
v5.0.7 - 172K ops/sec:
# populate
memtier_benchmark --key-maximum 1000000 --key-prefix "" --command="ZADD lb __key__ __key__" --command-key-pattern P --hide-histogram -t 4 -c 100 -s <SERVER>
# benchmark
memtier_benchmark --command="ZREVRANGE lb 5 10" --hide-histogram --test-time 60 -s <SERVER>
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%, 60 secs] 0 threads: 10315288 ops, 172839 (avg: 171915) ops/sec, 19.45MB/sec (avg: 19.35MB/sec), 1.16 (avg: 1.16) msec latency
4 Threads
50 Connections per thread
60 Seconds
ALL STATS
=====================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
-----------------------------------------------------------------------------------------------------
Zrevranges 171916.03 1.16247 1.07100 2.09500 2.38300 19810.64
Totals 171916.03 1.16247 1.07100 2.09500 2.38300 19810.64
details on v5.0.7:
Here's the flamegraph: https://s3.us-east-2.amazonaws.com/ci.benchmarks.redislabs/redis/redis/profiles//profileprimary-1-of-1perf_2022-02-18-08-44-58.out.flamegraph.svg
One thing to keep in mind is that, before, we used addReplyMultiBulkLen: https://github.com/redis/redis/blob/5.0/src/t_zset.c#L2461
v6.2.6 - 140K ops/sec:
memtier_benchmark --command="ZREVRANGE lb 5 10" --hide-histogram --test-time 60 -s <SERVER>
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%, 60 secs] 0 threads: 8346655 ops, 137191 (avg: 139105) ops/sec, 15.44MB/sec (avg: 15.65MB/sec), 1.46 (avg: 1.44) msec latency
4 Threads
50 Connections per thread
60 Seconds
ALL STATS
=====================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
-----------------------------------------------------------------------------------------------------
Zrevranges 139105.74 1.43686 1.33500 2.65500 2.83100 16029.76
Totals 139105.74 1.43686 1.33500 2.65500 2.83100 16029.76
details on v6.2:
Here's the flamegraph: https://s3.us-east-2.amazonaws.com/ci.benchmarks.redislabs/redis/redis/profiles//profileprimary-1-of-1perf_2022-02-18-08-51-53.out.flamegraph.svg
One thing to keep in mind is that NOW we use addReplyDeferredLen: https://github.com/redis/redis/blob/6.2/src/t_zset.c#L3044
And as expected, doing a difference between the stacks of the two versions, the ones that pop up (an increase in CPU cycles) are related to write+command performance.
libc_write itself consumes 6% more cycles.
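To make the difference concrete, here is a rough sketch of the two reply patterns (a sketch only: the real definitions live in Redis's networking.c; the declarations below mirror the actual signatures):

```c
typedef struct client client;
void addReplyMultiBulkLen(client *c, long length);            /* 5.0 API */
void *addReplyDeferredLen(client *c);                          /* 6.x API */
void setDeferredArrayLen(client *c, void *node, long length);  /* 6.x API */

/* 5.0-style path: the "*<n>\r\n" header is appended to the output buffer
 * immediately, so the whole reply stays contiguous. */
void reply_known_len(client *c, long rangelen) {
    addReplyMultiBulkLen(c, rangelen);
    /* ... emit rangelen bulk strings ... */
}

/* 6.2-style path: a placeholder node is linked into the reply list and
 * patched once the element count is known; that costs an extra allocation
 * and leaves the reply split across buffers (hence the extra write work). */
void reply_deferred_len(client *c) {
    void *replylen = addReplyDeferredLen(c);
    long count = 0;
    /* ... emit elements while incrementing count ... */
    setDeferredArrayLen(c, replylen, count);
}
```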
potential follow up
@redis/core-team given that within the rank-range code we don't change the result cardinality, IMHO we can completely avoid this deferred len and emit the proper length at the start (the result cardinality does not change!): https://github.com/redis/redis/blob/6.2/src/t_zset.c#L3054 Agree?
Comment From: oranagra
ZRANGE and ZREVRANGE don't need to use deferred replies (the count is known), but ZRANGEBYSCORE and ZRANGEBYLEX do!
The work that's been done in #7844 made ZRANGE and ZRANGESTORE handle these cases too, so the only way to avoid that regression is to add some specific hacks to the code that skip the deferred reply only when using indexes.
Note that these commands use only one deferred reply per command (unlike what we used to have in CLUSTER SLOTS and the COMMAND command, see https://github.com/redis/redis/pull/10056, https://github.com/redis/redis/pull/7123), which isn't expected to cause a big impact; the reason it does cause a big impact here is that it is used in a pipeline.
I think the solution for this issue is gonna be #9934. @filipecosta90 since you did reproduce this issue, maybe you can test the effect of that PR on this use case.
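To make the known-count point concrete, here is a simplified sketch of the index-range arithmetic (loosely adapted from genericZrangebyrankCommand; an illustration, not the exact code):

```c
/* For index (rank) ranges the reply cardinality is computable up front,
 * before any element is emitted. */
long zrange_by_rank_len(long start, long end, long llen) {
    if (start < 0) start = llen + start;  /* negative indexes count from the tail */
    if (end < 0) end = llen + end;
    if (start < 0) start = 0;
    if (start > end || start >= llen) return 0;  /* empty range */
    if (end >= llen) end = llen - 1;
    return (end - start) + 1;  /* known before iterating the zset */
}
/* BYSCORE/BYLEX ranges, by contrast, must walk the range (and apply any
 * LIMIT) to learn how many elements match, hence the deferred length. */
```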
Comment From: filipecosta90
> I think the solution for this issue is gonna be https://github.com/redis/redis/pull/9934. @filipecosta90 since you did reproduce this issue, maybe you can test the effect of that PR on this use case.
@oranagra WRT the pan/use-writev branch's impact on this use case, we can see it's small / nearly imperceptible:
- v5.0.7 - 172K ops/sec
- v6.2.6 - 140K ops/sec
- pan/use-writev - 141K ops/sec
$ memtier_benchmark --command="ZREVRANGE lb 5 10" --hide-histogram --test-time 60 -s 10.3.0.175 -p 6380
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%, 60 secs] 0 threads: 8435799 ops, 139497 (avg: 140591) ops/sec, 15.70MB/sec (avg: 15.82MB/sec), 1.43 (avg: 1.42) msec latency
4 Threads
50 Connections per thread
60 Seconds
ALL STATS
=====================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
-----------------------------------------------------------------------------------------------------
Zrevranges 140591.87 1.42170 1.32700 2.63900 2.81500 16201.02
Totals 140591.87 1.42170 1.32700 2.63900 2.81500 16201.02
WRT:
> and the reason it does cause a big impact here is that it is used in a pipeline.
notice that in my simple reproduction I'm not using pipelining.
> so the only way to avoid that regression is to add some specific hacks to the code that skip the deferred reply only when using indexes.
Can we move forward with this solution?
Comment From: oranagra
> Can we move forward with this solution?
if we have no other choice we can do that, but i'd rather avoid adding explicit hacks for ZRANGE that will not work for ZRANGE BYSCORE.
i wanna try figuring out why writev doesn't solve the problem. how did you conclude that it's a result of the deferred replies? did you do a POC that changes that and saw it fixed?
Comment From: filipecosta90
> how did you conclude that it's a result of the deferred replies? did you do a POC that changes that and saw it fixed?
I've noticed an increase in write/read/deferred-reply code within the command logic:
> doing a difference between the stacks of the two versions, the ones that pop up (an increase in CPU cycles) are related to write+command performance. libc_write itself consumes 6% more cycles.
WRT:
> did you do a POC that changes that and saw it fixed?
I'll do a quick POC and confirm that it's indeed fixed.
Comment From: oranagra
p.s. there might be a small impact from an additional write (per round trip), and that impact might be very visible when everything around it is small (a fast command with not a lot of data to write). with a long pipeline or a transaction (like the one described at the top), it should have a bigger impact, but i expect this impact to be dramatically reduced by writev
My point is that we have quite a few other commands with a single deferred reply per command, and i don't like to consider that pattern a problematic one. So i hope writev can drastically reduce the overhead when these are used with a pipeline, and i hope we can overlook their overhead when no pipeline is used.
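For context, a standalone illustration (not Redis code; the fd and buffers are placeholders) of how writev(2) flushes a split reply (the backfilled length node plus the body) in a single syscall where plain write(2) would need two:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* A reply built with a deferred length ends up split across buffers:
 * the backfilled "*<n>\r\n" header node plus the element payload. */
ssize_t flush_split_reply(int fd) {
    const char *hdr  = "*2\r\n";                  /* backfilled length node */
    const char *body = "$1\r\na\r\n$1\r\nb\r\n";  /* two bulk strings */
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = strlen(hdr)  },
        { .iov_base = (void *)body, .iov_len = strlen(body) },
    };
    /* one syscall for both buffers; write(2) would need one call each */
    return writev(fd, iov, 2);
}
```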
Comment From: filipecosta90
@oranagra you're absolutely right about the positive impact of #9934 when using pipelining.
- v5.0.7 pipeline 16 - 810K ops/sec, p50=4.04ms
- v6.2.6 pipeline 16 - 331K ops/sec, p50=9.59ms
- pan/use-writev pipeline 16 - 540K ops/sec, p50=5.11ms
still, the gap between 540K and 810K ops/sec is relevant, right? Meaning we need further changes apart from #9934, correct?
Comment From: oranagra
yes. let's try to figure that out.. please try your POC, taking unstable and avoiding the use of deferred replies (even without pipeline). we'll see if 100% of the regression comes from this, or from another reason.
Comment From: filipecosta90
@oranagra simply going back to the 5.0 code on zrevrange (with the changes to listpack: https://github.com/filipecosta90/redis/tree/zset.regression) gets us back to 170K ops/sec. This is using unstable base commit de6be8850f1cec69263ad4f5713ef06f0274d50e (meaning writev is already merged). Note: there is still a difference between 5.0.7 and filipecosta90/zset.regression, so I need to check exactly where the 9K ops/sec difference is still coming from.
######################
5.0.7
######################
Top Hotspots
Function Module CPU Time
-------------------------- --------------- --------
__libc_write libpthread.so.0 39.478s
__libc_read libpthread.so.0 6.332s
read redis-server 1.739s
epoll_wait libc.so.6 1.673s
[Outside any known module] [Unknown] 1.247s
[Others] N/A 9.370s
ALL STATS
=====================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
-----------------------------------------------------------------------------------------------------
Zrevranges 178644.09 1.11903 1.11900 1.41500 1.95900 20585.94
Totals 178644.09 1.11903 1.11900 1.41500 1.95900 20585.94
######################
unstable
######################
Top Hotspots
Function Module CPU Time
-------------------------- --------------- --------
__GI___writev libc.so.6 32.613s
__libc_read libpthread.so.0 5.022s
read redis-server 1.346s
zslGetElementByRank redis-server 1.298s
[Outside any known module] [Unknown] 1.046s
[Others] N/A 18.526s
ALL STATS
=====================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
-----------------------------------------------------------------------------------------------------
Zrevranges 142641.60 1.40161 1.39900 1.86300 2.28700 16437.22
Totals 142641.60 1.40161 1.39900 1.86300 2.28700 16437.22
######################
filipecosta90/zset.regression
######################
ALL STATS
=====================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
-----------------------------------------------------------------------------------------------------
Zrevranges 167960.71 1.19021 1.19100 1.53500 2.07900 19354.85
Totals 167960.71 1.19021 1.19100 1.53500 2.07900 19354.85
Comment From: oranagra
@filipecosta90 i don't understand.. did you use the latest unstable (with writev), and completely revert https://github.com/redis/redis/pull/7844? i assume this is the test without pipeline? anyway, there are too many factors here; i was looking at either taking 6.2 and adding some code to avoid using deferred replies, or, maybe slightly easier, taking 5.0 and changing it to use deferred replies.
Comment From: panjf2000
> @filipecosta90 i don't understand.. did you use the latest unstable (with writev), and completely revert #7844? i assume this is the test without pipeline? anyway, there are too many factors here; i was looking at either taking 6.2 and adding some code to avoid using deferred replies, or, maybe slightly easier, taking 5.0 and changing it to use deferred replies.
It seems that @filipecosta90 added zrangeGenericCommand() back and replaced the latest code in unstable with it, rather than reverting #7844; see the commit history: https://github.com/filipecosta90/redis/commits/zset.regression
@oranagra
Comment From: oranagra
recipe:
rm dump.rdb ; src/redis-server --save "" &
redis-cli zadd zz 0 a 1 b 2 c 3 d 4 e 5 f
memtier_benchmark --pipeline 16 --command "zrange zz 0 -1" --hide-histogram
on 5.0:
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 1003869.41 3.15023 3.00700 4.89500 6.23900 83329.00
without pipeline:
Zranges 195546.03 1.02185 0.91100 1.50300 2.04700 16231.85
on 6.2:
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 240024.84 13.32925 12.92700 15.16700 20.73500 19923.94
without pipeline:
Zranges 160567.46 1.24477 1.12700 1.84700 2.31900 13328.35
applying this diff on 5.0:
```diff
diff --git a/src/t_zset.c b/src/t_zset.c
index 989d5855e..25e8aec9b 100644
--- a/src/t_zset.c
+++ b/src/t_zset.c
@@ -2458,7 +2458,8 @@ void zrangeGenericCommand(client *c, int reverse) {
rangelen = (end-start)+1;
/* Return the result in form of a multi-bulk reply */
- addReplyMultiBulkLen(c, withscores ? (rangelen*2) : rangelen);
+ void *replylen = addDeferredMultiBulkLength(c);
+ long orig_range = rangelen;
if (zobj->encoding == OBJ_ENCODING_ZIPLIST) {
unsigned char *zl = zobj->ptr;
@@ -2520,6 +2521,7 @@ void zrangeGenericCommand(client *c, int reverse) {
} else {
serverPanic("Unknown sorted set encoding");
}
+ setDeferredMultiBulkLength(c, replylen, withscores ? (orig_range*2) : orig_range);
}
void zrangeCommand(client *c) {
```
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 202284.34 15.81500 15.74300 18.81500 21.88700 16791.18
without pipeline:
Zranges 166672.24 1.19906 1.07900 1.75100 2.94300 13835.10
raw patch on 6.2:
```diff
diff --git a/src/t_zset.c b/src/t_zset.c
index 2abc1b49b..f59f96e56 100644
--- a/src/t_zset.c
+++ b/src/t_zset.c
@@ -2870,7 +2870,7 @@ typedef enum {
typedef struct zrange_result_handler zrange_result_handler;
-typedef void (*zrangeResultBeginFunction)(zrange_result_handler *c);
+typedef void (*zrangeResultBeginFunction)(zrange_result_handler *c, long hint);
typedef void (*zrangeResultFinalizeFunction)(
zrange_result_handler *c, size_t result_count);
typedef void (*zrangeResultEmitCBufferFunction)(
@@ -2899,7 +2899,12 @@ struct zrange_result_handler {
};
/* Result handler methods for responding the ZRANGE to clients. */
-static void zrangeResultBeginClient(zrange_result_handler *handler) {
+static void zrangeResultBeginClient(zrange_result_handler *handler, long hint) {
+ if (hint > 0) {
+ addReplyArrayLen(handler->client, hint);
+ handler->userdata = NULL;
+ return;
+ }
handler->userdata = addReplyDeferredLen(handler->client);
}
@@ -2941,12 +2946,14 @@ static void zrangeResultFinalizeClient(zrange_result_handler *handler,
result_count *= 2;
}
- setDeferredArrayLen(handler->client, handler->userdata, result_count);
+ if (handler->userdata)
+ setDeferredArrayLen(handler->client, handler->userdata, result_count);
}
/* Result handler methods for storing the ZRANGESTORE to a zset. */
-static void zrangeResultBeginStore(zrange_result_handler *handler)
+static void zrangeResultBeginStore(zrange_result_handler *handler, long hint)
{
+ UNUSED(hint);
handler->dstobj = createZsetZiplistObject();
}
@@ -3041,7 +3048,7 @@ void genericZrangebyrankCommand(zrange_result_handler *handler,
if (end < 0) end = llen+end;
if (start < 0) start = 0;
- handler->beginResultEmission(handler);
+ handler->beginResultEmission(handler, end-start+1);
/* Invariant: start >= 0, so this test will be true when end < 0.
* The range is empty when start > end or start >= length. */
@@ -3148,7 +3155,7 @@ void genericZrangebyscoreCommand(zrange_result_handler *handler,
client *c = handler->client;
unsigned long rangelen = 0;
- handler->beginResultEmission(handler);
+ handler->beginResultEmission(handler,-1);
/* For invalid offset, return directly. */
if (offset > 0 && offset >= (long)zsetLength(zobj)) {
@@ -3437,7 +3444,7 @@ void genericZrangebylexCommand(zrange_result_handler *handler,
client *c = handler->client;
unsigned long rangelen = 0;
- handler->beginResultEmission(handler);
+ handler->beginResultEmission(handler,-1);
if (zobj->encoding == OBJ_ENCODING_ZIPLIST) {
unsigned char *zl = zobj->ptr;
@@ -3680,7 +3687,7 @@ void zrangeGenericCommand(zrange_result_handler *handler, int argc_start, int st
lookupKeyRead(c->db,key);
if (zobj == NULL) {
if (store) {
- handler->beginResultEmission(handler);
+ handler->beginResultEmission(handler,-1);
handler->finalizeResultEmission(handler, 0);
} else {
addReply(c, shared.emptyarray);
```
benchmark:
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 1455472.72 2.19597 2.27100 3.75900 5.37500 61118.48
without pipeline
Zranges 203077.08 0.98460 0.87900 1.40700 2.11100 8527.65
bottom line:
1. other than the regression from the deferred replies (which does have some impact on non-pipelined traffic too), 6.2 also includes some improvement.
2. 7.0 (unstable) might contain other improvements (or regressions).
Comment From: oranagra
for reference
7.0-RC1 (before writev):
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 228047.73 14.02753 13.95100 17.15100 20.99100 18929.74
without pipeline:
Zranges 141687.86 1.41228 1.44700 2.17500 3.90300 11761.20
unstable (with writev):
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 656797.63 4.86072 4.51100 8.89500 11.39100 54519.33
without pipeline
Zranges 139986.54 1.42811 1.47900 2.14300 2.94300 11619.98
7.0 with the above patch to avoid using deferred replies (without writev):
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 868107.17 3.67039 3.59900 5.59900 7.83900 72059.68
without pipeline:
Zranges 175740.25 1.13698 1.05500 1.69500 2.41500 14587.81
unstable with the above patch to avoid using deferred replies (with writev):
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
--------------------------------------------------------------------------------------------------
Zranges 801576.54 3.96586 3.93500 7.77500 8.95900 66537.12
without pipeline:
Zranges 163546.07 1.22226 1.26300 2.06300 5.56700 13575.60
bottom line:
1. writev doesn't completely eliminate the overheads of deferred replies.
2. there is another regression in unstable compared to 6.2.
Comment From: filipecosta90
@oranagra WRT:
> writev doesn't completely eliminate the overheads of deferred replies.
on unstable (de6be8850f1cec69263ad4f5713ef06f0274d50e), even with writev in use, there are 2.4% of CPU cycles within the zrange code that are not related to write/writev (the sprintf in setDeferredAggregateLen is costly).
> there is another regression in unstable compared to 6.2
apply the patch and profile? agree?
Comment From: oranagra
yes, obviously other than the write system calls, deferred replies involve more heap allocations.
and now i realize they also use sprintf instead of the optimization of using the pre-created shared.mbulkhdr.
we can probably easily fix the second part (use shared.mbulkhdr, and maybe create one for maps too).
feel free to apply my patch (it still needs correct handling of withscores and c->resp), code this sprintf optimization, and profile / benchmark to see what's left...
we'll surely be left with the heap allocation, dereference, and cache-miss overheads.
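A rough sketch of what that sprintf optimization could look like (the function and array names below are stand-ins; the real client/robj types, addReply*() functions, and shared.mbulkhdr live in the Redis source tree):

```c
#include <stdio.h>

typedef struct client client;
typedef struct redisObject robj;
void addReply(client *c, robj *obj);
void addReplyProto(client *c, const char *s, size_t len);

#define OBJ_SHARED_BULKHDR_LEN 32
extern robj *shared_mbulkhdr[OBJ_SHARED_BULKHDR_LEN]; /* pre-created "*<n>\r\n" objects */

void addReplyArrayLenFast(client *c, long length) {
    if (length >= 0 && length < OBJ_SHARED_BULKHDR_LEN) {
        addReply(c, shared_mbulkhdr[length]);   /* no snprintf, no allocation */
    } else {
        char buf[34];
        int n = snprintf(buf, sizeof(buf), "*%ld\r\n", length);
        addReplyProto(c, buf, (size_t)n);       /* slow path for large arrays */
    }
}
```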
Comment From: filipecosta90
@oranagra using the changes in https://github.com/redis/redis/pull/10337 we see that for pipeline 1 we're at the same level as v5, and that for pipeline 16 there is still a gap of around 17% in CPU cycles.
We can pinpoint ~8% of the extra CPU cycles to the following added logic (profiling with pipeline 16). Please advise if, in your opinion, we can squeeze further / reduce the overhead of the following features:
| Function | CPU Time: Total | Introduced |
|---|---|---|
| updateClientMemUsage | 3.6% | https://github.com/redis/redis/pull/8687 |
| updateCachedTime | 1.8% | https://github.com/redis/redis/pull/9194 |
| ACLCheckAllUserCommandPerm | 1.2% | https://github.com/redis/redis/pull/9974 |
| updateCommandLatencyHistogram | 0.8% | https://github.com/redis/redis/pull/9462/ |
WRT the cache-miss overhead difference between v5 and https://github.com/redis/redis/pull/10337, I was surprised to see that the % of cache misses and the total stall cycles remained the same between v5 and the PR:
- v5.0.7 stall cycles % (uops_executed.stall_cycles / uops_executed.core): 6.09%
- PR stall cycles % (uops_executed.stall_cycles / uops_executed.core): 5.90%
This indicates that the added logic / added CPU cycles are indeed the cause of the regression, and that according to the data there seems to be NO change in memory overhead/stall cycles. agree?
perf stat v5.0.7:
Performance counter stats for process id '5063':
59931.618609 cpu-clock (msec) # 0.999 CPUs utilized
213421613490 cpu-cycles # 3.561 GHz (72.72%)
526851248464 instructions # 2.47 insn per cycle (81.81%)
592920536223 uops_executed.core # 9893.284 M/sec (81.81%)
35011134086 uops_executed.stall_cycles # 584.185 M/sec (81.81%)
828763445 cache-references # 13.828 M/sec (81.82%)
709741 cache-misses # 0.086 % of all cache refs (81.83%)
35030337184 cycle_activity.stalls_total # 584.505 M/sec (81.83%)
23099126236 cycle_activity.stalls_mem_any # 385.425 M/sec (81.83%)
28164639 cycle_activity.stalls_l3_miss # 0.470 M/sec (81.82%)
9240003766 cycle_activity.stalls_l2_miss # 154.176 M/sec (81.82%)
10527552768 cycle_activity.stalls_l1d_miss # 175.659 M/sec (72.72%)
60.001101633 seconds time elapsed
perf stat PR:
Performance counter stats for process id '361':
59918.131991 cpu-clock (msec) # 0.999 CPUs utilized
212043287497 cpu-cycles # 3.539 GHz (72.72%)
510785333068 instructions # 2.41 insn per cycle (81.81%)
554512273393 uops_executed.core # 9254.499 M/sec (81.82%)
33775863959 uops_executed.stall_cycles # 563.700 M/sec (81.82%)
740966569 cache-references # 12.366 M/sec (81.82%)
429245 cache-misses # 0.058 % of all cache refs (81.82%)
33779169255 cycle_activity.stalls_total # 563.755 M/sec (81.82%)
21438524734 cycle_activity.stalls_mem_any # 357.797 M/sec (81.82%)
13705439 cycle_activity.stalls_l3_miss # 0.229 M/sec (81.82%)
7519901229 cycle_activity.stalls_l2_miss # 125.503 M/sec (81.82%)
8720668007 cycle_activity.stalls_l1d_miss # 145.543 M/sec (72.72%)
60.001379977 seconds time elapsed
Comment From: oranagra
@filipecosta90 please correct me if i'm wrong.
1. it doesn't matter if #10337 contains #9934 and #10334 or not, since it doesn't use deferred replies.
2. the list of PRs you're showing that introduced performance loss are all not specific to ZRANGE, i.e. they'll affect a pipeline of SETs too