Describe the bug

Hi all,

I was wondering if anyone might be able to shed some light on this rather unusual behaviour I've been experiencing.

I have a sorted set with 1M entries, and I'm using ZREVRANGE to get each player's ranking plus the two players above and below them. I'm using node-redis on Ubuntu 20.04.3. However, now that I've upgraded from Redis 5.0.7 to 6.2.6, the very same code takes 50% longer to complete. I'm using the node-redis multi() command, as (for reasons I'm not entirely clear about) it has better performance than batch(). Whether I bunch my requests into 100k batches or 10k batches, the result is the same.

On Redis 5.0.7 (the version on apt-get), I can pull the results for every player in 966ms. On 6.2.6 that same set of queries takes 1.415s to complete.

Other queries are unaffected - I also get ZREVRANK for every player, and that took 1.8s before, and 1.8s now.

To reproduce

It should be possible to reproduce with a 1-million-entry sorted set. The sorted set is literally just a list of incremental IDs, each with a randomly allocated score between 0 and 500.
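A minimal populate sketch (my own reconstruction, not code from the report): it assumes hiredis and a local Redis on the default port; the memtier_benchmark populate command further down in this thread does the same job.

#include <hiredis/hiredis.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    redisContext *ctx = redisConnect("127.0.0.1", 6379);
    if (ctx == NULL || ctx->err) { fprintf(stderr, "connect failed\n"); return 1; }

    const long total = 1000000, batch = 10000;
    for (long done = 0; done < total; done += batch) {
        /* pipeline one batch of ZADDs: incremental member id, random score in [0, 500) */
        for (long id = done + 1; id <= done + batch; id++)
            redisAppendCommand(ctx, "ZADD lb %d %ld", rand() % 500, id);
        /* drain the replies for this batch */
        for (long i = 0; i < batch; i++) {
            redisReply *r;
            if (redisGetReply(ctx, (void **)&r) != REDIS_OK) return 1;
            freeReplyObject(r);
        }
    }
    redisFree(ctx);
    return 0;
}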

Here's the code I'm using for the retrieval

// Setup (reconstructed so the snippet runs standalone; the original issue
// elided the client/multi/totalarr definitions).
const redis = require('redis');                  // node-redis 3.x
const { performance } = require('perf_hooks');
const client = redis.createClient();
var totalarr = [];

function getdata(n){
    var loopinc = 100000;
    var final = 1000000;
    if(n == 1){
        var target = loopinc;
        var initial = 1;
    }else{
        var initial = (n - 1) * loopinc;
        var target = n * loopinc;
    }
    console.log('start from: '+initial+',  stop at:  '+target);

    // A fresh MULTI per batch, with TIME queued first and last -- assumed, since
    // the reply[0] / reply[reply.length - 1] maths below parse [seconds, microseconds].
    var multi = client.multi();
    multi.time();

    for(let i = initial; i < target; i++){
        if(i < 2){
            multi.zrevrange('lb', 0, 5);
        }else{
            multi.zrevrange('lb', (i - 2), (i + 2));
        }
    }

    multi.time();

    var start = performance.now();
    multi.exec(function(err, reply){
        var end = performance.now();
        // TIME replies are [seconds, microseconds]; convert both to milliseconds.
        var redisstart = (parseInt(reply[0][0]) * 1000) + (parseInt(reply[0][1]) / 1000);
        var redisend   = (parseInt(reply[reply.length - 1][0]) * 1000) + (parseInt(reply[reply.length - 1][1]) / 1000);
        var execution_time = redisend - redisstart; // in milliseconds
        console.log('Execution time: '+execution_time);
        totalarr.push(execution_time);
        console.log('Retrieved in '+(end - start)+' ms');

        if(target < final){
            n++;
            getdata(n);
        }else{
            var total = totalarr.reduce(function(previousvalue, currentvalue) { return previousvalue + currentvalue; }, 0);
            console.log('total execution time: '+total);
            client.quit();
        }
    });
}

getdata(1);


Expected behavior

The same performance as on 5.0.7 rather than being 50% slower.

Additional information

Platform: Ubuntu 20.04.3 LTS
Redis: 5.0.7 or 6.2.6
Interfacing with node-redis 3.1.2

Comment From: filipecosta90

@v-flashpoint you're right about the regression: 172K ops/sec on v5 vs 140K ops/sec on v6.2; the overall regression sits at around 25%.

I've added a reproduction in https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-zset-1M-elements-zrevrange-5-elements.yml, which can be tested easily with memtier alone. It is now being tracked as part of our CI, so this type of regression on this command should not slip through again. Thank you @v-flashpoint!

TL;DR @redis/core-team: the introduction of deferred replies seems to be the main reason for the performance difference.

@redis/core-team I'll go over the numbers using a standalone deployment:

v5.0.7 - 172K ops/sec :

# populate
memtier_benchmark --key-maximum 1000000 --key-prefix "" --command="ZADD lb __key__ __key__"  --command-key-pattern P --hide-histogram -t 4 -c 100 -s <SERVER>


# benchmark
memtier_benchmark --command="ZREVRANGE lb 5 10"  --hide-histogram --test-time 60  -s <SERVER>

Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%,  60 secs]  0 threads:    10315288 ops,  172839 (avg:  171915) ops/sec, 19.45MB/sec (avg: 19.35MB/sec),  1.16 (avg:  1.16) msec latency

4         Threads
50        Connections per thread
60        Seconds


ALL STATS
=====================================================================================================
Type            Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
-----------------------------------------------------------------------------------------------------
Zrevranges    171916.03         1.16247         1.07100         2.09500         2.38300     19810.64 
Totals        171916.03         1.16247         1.07100         2.09500         2.38300     19810.64 

details on v5.0.7:

Here's the flamegraph: https://s3.us-east-2.amazonaws.com/ci.benchmarks.redislabs/redis/redis/profiles//profileprimary-1-of-1perf_2022-02-18-08-44-58.out.flamegraph.svg

One thing to keep in mind is that on 5.0 we used addReplyMultiBulkLen: https://github.com/redis/redis/blob/5.0/src/t_zset.c#L2461


v6.2.6 - 140K ops/sec :

memtier_benchmark --command="ZREVRANGE lb 5 10"  --hide-histogram --test-time 60 -s <SERVER>
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%,  60 secs]  0 threads:     8346655 ops,  137191 (avg:  139105) ops/sec, 15.44MB/sec (avg: 15.65MB/sec),  1.46 (avg:  1.44) msec latency

4         Threads
50        Connections per thread
60        Seconds


ALL STATS
=====================================================================================================
Type            Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
-----------------------------------------------------------------------------------------------------
Zrevranges    139105.74         1.43686         1.33500         2.65500         2.83100     16029.76 
Totals        139105.74         1.43686         1.33500         2.65500         2.83100     16029.76 

details on v6.2:

Here's the flamegraph: https://s3.us-east-2.amazonaws.com/ci.benchmarks.redislabs/redis/redis/profiles//profileprimary-1-of-1perf_2022-02-18-08-51-53.out.flamegraph.svg

One thing to keep in mind is that now we use addReplyDeferredLen: https://github.com/redis/redis/blob/6.2/src/t_zset.c#L3044
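To make the difference concrete, here's a small self-contained sketch of the two RESP framing patterns (my simplification, not the actual Redis code paths; in the real server the deferred path additionally costs a reply-list node/allocation and, before writev, an extra write() per reply):

#include <stdio.h>

/* 5.0-style: the array length is known up front, so the "*<len>\r\n" header is
 * written first and the whole reply ends up in one contiguous buffer. */
static size_t reply_known_len(char *out, long rangelen) {
    size_t n = (size_t)sprintf(out, "*%ld\r\n", rangelen);
    for (long i = 0; i < rangelen; i++)
        n += (size_t)sprintf(out + n, "$1\r\na\r\n");
    return n;
}

/* 6.2-style deferred length: elements are emitted while counting them, and the
 * header is formatted afterwards into a separate buffer ("patched in"),
 * so the reply is no longer one contiguous chunk. */
static size_t reply_deferred_len(char *hdr, char *body, long rangelen) {
    long count = 0;
    size_t n = 0;
    for (long i = 0; i < rangelen; i++) {
        n += (size_t)sprintf(body + n, "$1\r\na\r\n");
        count++;
    }
    sprintf(hdr, "*%ld\r\n", count);   /* header formatted only at the end */
    return n;
}

int main(void) {
    char buf[256], hdr[16], body[224];
    printf("known length:    %zu bytes in one contiguous buffer\n", reply_known_len(buf, 6));
    printf("deferred length: %zu body bytes plus a separately formatted header\n",
           reply_deferred_len(hdr, body, 6));
    return 0;
}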

And, as expected, when diffing the stacks between the two versions, the frames that pop up (i.e. that gained CPU cycles) are related to write + command performance.

__libc_write itself consumes 6% more cycles.

potential follow up

@redis/core-team given that within the rank-based range code the result cardinality does not change, IMHO we can completely avoid this deferred len and emit the proper length up front: https://github.com/redis/redis/blob/6.2/src/t_zset.c#L3054. Agree?

Comment From: oranagra

ZRANGE and ZREVRANGE don't need to use deferred replies (the count is known), but ZRANGEBYSCORE and ZRANGEBYLEX do!
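A small illustration of why (my own sketch, mirroring the clamping logic the rank-based range code uses, not the actual Redis source): for index ranges the reply length follows from the clamped indexes alone, while for score/lex ranges it is only known after walking the range.

#include <stdio.h>

static long index_range_len(long start, long end, long zset_len) {
    /* same clamping as zrangeGenericCommand */
    if (start < 0) start = zset_len + start;
    if (end < 0)   end   = zset_len + end;
    if (start < 0) start = 0;
    if (start > end || start >= zset_len) return 0;   /* empty range */
    if (end >= zset_len) end = zset_len - 1;
    return end - start + 1;                           /* known before any iteration */
}

int main(void) {
    /* e.g. "ZREVRANGE lb 5 10" on a 1M-element zset -> 6 elements, known up front */
    printf("%ld\n", index_range_len(5, 10, 1000000));
    /* for ZRANGEBYSCORE / ZRANGEBYLEX the count depends on the stored scores/members,
     * so it is only known after iterating -> a deferred length is needed */
    return 0;
}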

The work that's been done in #7844 made ZRANGE and ZRANGESTORE handle these cases too, so the only way to avoid that regression is to add some specific hacks to the code to avoid deferred reply only when using indexes.

Note that these commands use only one deferred reply per command (unlike what we used to have in CLUSTER SLOTS and the COMMAND command, see https://github.com/redis/redis/pull/10056, https://github.com/redis/redis/pull/7123), which isn't expected to cause a big impact; the reason it does cause a big impact here is that it is used in a pipeline.

I think the solution for this issue is gonna be #9934. @filipecosta90 since you did reproduce this issue, maybe you can test the effect of that PR on this use case.

Comment From: filipecosta90

I think the solution for this issue is gonna be https://github.com/redis/redis/pull/9934. @filipecosta90 since you did reproduce this issue, maybe you can test the effect of that PR on this use case.

@oranagra WRT to pan/use-writev branch impact on this use-case we can see it's small / nearly imperceptible:

  • v5.0.7 - 172K ops/sec
  • v6.2.6 - 140K ops/sec
  • pan/use-writev - 141K ops/sec
$ memtier_benchmark --command="ZREVRANGE lb 5 10"  --hide-histogram --test-time 60  -s 10.3.0.175 -p 6380
Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 100%,  60 secs]  0 threads:     8435799 ops,  139497 (avg:  140591) ops/sec, 15.70MB/sec (avg: 15.82MB/sec),  1.43 (avg:  1.42) msec latency

4         Threads
50        Connections per thread
60        Seconds


ALL STATS
=====================================================================================================
Type            Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
-----------------------------------------------------------------------------------------------------
Zrevranges    140591.87         1.42170         1.32700         2.63900         2.81500     16201.02 
Totals        140591.87         1.42170         1.32700         2.63900         2.81500     16201.02 

WRT to:

and the reason it does cause a big impact here is because it is used in a pipeline.

notice that on my simple reproduction I'm not using pipelining.

so the only way to avoid that regression is to add some specific hacks to the code to avoid deferred reply only when using indexes.

Can we move forward with this solution?

Comment From: oranagra

Can we move forward with this solution?

if we have no other choice we can do that, but i'd rather avoid adding explicit hacks for ZRANGE that will not work for ZRANGE BYSCORE.

i wanna try figuring out why writev doesn't solve the problem. how did you conclude that it's a result of the deferred replies? did you do a POC that changes that and saw it fixed?

Comment From: filipecosta90

how did you conclude that it's a result of the deferred replies? did you do a POC that changes that and saw it fixed?

I've noticed an increase in write/read and deferred-reply code within the command logic:

when diffing the stacks between the two versions, the frames that pop up (i.e. that gained CPU cycles) are related to write + command performance. __libc_write itself consumes 6% more cycles.


WRT to:

did you do a POC that changes that and saw it fixed?

I'll do a quick POC and confirm indeed it's fixed.

Comment From: oranagra

p.s. there might be a small impact from an additional write (per round trip), and that impact might be very visible when everything around it is small (a fast command without a lot of data to write). with a long pipeline or a transaction (like the one described at the top), it should have a bigger impact, but i expect that impact to be dramatically reduced by writev

My point is that we have quite a few other commands with a single deferred reply per command, and i don't like to consider that pattern a problematic one. So i hope writev can drastically reduce the overhead when these are used with a pipeline, and i hope we can overlook their overhead when no pipeline is used.
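For illustration, a tiny self-contained example of the idea behind the writev work (my sketch, writing to stdout rather than a client socket): writev() lets a non-contiguous reply, e.g. a deferred header plus the already-emitted body, go out in a single syscall instead of one write() per buffer.

#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    /* a deferred reply ends up as (at least) two buffers: the header that was
     * filled in last, and the body emitted before it */
    char hdr[]  = "*2\r\n";
    char body[] = "$1\r\na\r\n$1\r\nb\r\n";
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = sizeof(hdr)  - 1 },
        { .iov_base = body, .iov_len = sizeof(body) - 1 },
    };
    /* one syscall for both buffers, instead of two write() calls */
    ssize_t n = writev(STDOUT_FILENO, iov, 2);
    fprintf(stderr, "flushed %zd bytes in a single writev()\n", n);
    return 0;
}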

Comment From: filipecosta90

@oranagra you're absolutely right about #9934's positive impact when using pipelining.

  • v5.0.7 pipeline 16 - 810K ops/sec, p50=4.04ms
  • v6.2.6 pipeline 16 - 331K ops/sec, p50=9.59ms
  • pan/use-writev pipeline 16 - 540K ops/sec, p50=5.11ms

still, the gap between 540K and 810K ops/sec is relevant, right? Meaning we need further changes apart from #9934, correct?

Comment From: oranagra

yes. let's try to figure that out.. please try your POC, taking unstable and avoiding the use of deferred replies (even without pipeline). we'll see if 100% of the regression comes from this, or from another reason.

Comment From: filipecosta90

@oranagra simply going back to 5.0 code on zrevrange ( with the changes to listpack https://github.com/filipecosta90/redis/tree/zset.regression ) gets us back to 170K ops/sec. This is using unstable base commit de6be8850f1cec69263ad4f5713ef06f0274d50e ( meaning the writev is already merged ). Note: there is still a difference between 5.0.7 and filipecosta90/zset.regression so I need to check exactly where the 9K ops/sec difference is still coming from.


######################
5.0.7
######################

Top Hotspots
Function                    Module           CPU Time
--------------------------  ---------------  --------
__libc_write                libpthread.so.0   39.478s
__libc_read                 libpthread.so.0    6.332s
read                        redis-server       1.739s
epoll_wait                  libc.so.6          1.673s
[Outside any known module]  [Unknown]          1.247s
[Others]                    N/A                9.370s



ALL STATS
=====================================================================================================
Type            Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
-----------------------------------------------------------------------------------------------------
Zrevranges    178644.09         1.11903         1.11900         1.41500         1.95900     20585.94 
Totals        178644.09         1.11903         1.11900         1.41500         1.95900     20585.94 


######################
unstable
######################

Top Hotspots
Function                    Module           CPU Time
--------------------------  ---------------  --------
__GI___writev               libc.so.6         32.613s
__libc_read                 libpthread.so.0    5.022s
read                        redis-server       1.346s
zslGetElementByRank         redis-server       1.298s
[Outside any known module]  [Unknown]          1.046s
[Others]                    N/A               18.526s



ALL STATS
=====================================================================================================
Type            Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
-----------------------------------------------------------------------------------------------------
Zrevranges    142641.60         1.40161         1.39900         1.86300         2.28700     16437.22 
Totals        142641.60         1.40161         1.39900         1.86300         2.28700     16437.22



######################
filipecosta90/zset.regression
######################

ALL STATS
=====================================================================================================
Type            Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
-----------------------------------------------------------------------------------------------------
Zrevranges    167960.71         1.19021         1.19100         1.53500         2.07900     19354.85 
Totals        167960.71         1.19021         1.19100         1.53500         2.07900     19354.85

Comment From: oranagra

@filipecosta90 i don't understand.. did you use the latest unstable (with writev), and completely revert https://github.com/redis/redis/pull/7844? i assume this is the test without pipeline? anyway, there are too many factors here. i was looking at either taking 6.2 and adding some code to avoid using deferred replies, or, maybe slightly easier, taking 5.0 and changing it to use deferred replies.

Comment From: panjf2000

@filipecosta90 i don't understand.. did you use the latest unstable (with writev), and completely revert #7844? i assume this is the test without pipeline? anyway, there are too many factors here. i was looking at either taking 6.2 and adding some code to avoid using deferred replies, or, maybe slightly easier, taking 5.0 and changing it to use deferred replies.

It seems that @filipecosta90 added zrangeGenericCommand() back and replaced the latest code in unstable with it, rather than reverting #7844; see the commit history: https://github.com/filipecosta90/redis/commits/zset.regression @oranagra

Comment From: oranagra

recipe:

rm dump.rdb ; src/redis-server --save "" &
redis-cli zadd zz 0 a 1 b 2 c 3 d 4 e 5 f
memtier_benchmark --pipeline 16 --command "zrange zz 0 -1" --hide-histogram

on 5.0:

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges   1003869.41         3.15023         3.00700         4.89500         6.23900     83329.00 
without pipeline:
Zranges    195546.03         1.02185         0.91100         1.50300         2.04700     16231.85 

on 6.2:

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges    240024.84        13.32925        12.92700        15.16700        20.73500     19923.94 
without pipeline:
Zranges    160567.46         1.24477         1.12700         1.84700         2.31900     13328.35 

applying diff on 5.0:

diff --git a/src/t_zset.c b/src/t_zset.c
index 989d5855e..25e8aec9b 100644
--- a/src/t_zset.c
+++ b/src/t_zset.c
@@ -2458,7 +2458,8 @@ void zrangeGenericCommand(client *c, int reverse) {
     rangelen = (end-start)+1;

     /* Return the result in form of a multi-bulk reply */
-    addReplyMultiBulkLen(c, withscores ? (rangelen*2) : rangelen);
+    void *replylen = addDeferredMultiBulkLength(c);
+    long orig_range = rangelen;

     if (zobj->encoding == OBJ_ENCODING_ZIPLIST) {
         unsigned char *zl = zobj->ptr;
@@ -2520,6 +2521,7 @@ void zrangeGenericCommand(client *c, int reverse) {
     } else {
         serverPanic("Unknown sorted set encoding");
     }
+    setDeferredMultiBulkLength(c, replylen, withscores ? (orig_range*2) : orig_range);
 }

 void zrangeCommand(client *c) {

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges    202284.34        15.81500        15.74300        18.81500        21.88700     16791.18 
without pipeline:
Zranges    166672.24         1.19906         1.07900         1.75100         2.94300     13835.10 


raw patch on 6.2:
diff --git a/src/t_zset.c b/src/t_zset.c
index 2abc1b49b..f59f96e56 100644
--- a/src/t_zset.c
+++ b/src/t_zset.c
@@ -2870,7 +2870,7 @@ typedef enum {

 typedef struct zrange_result_handler zrange_result_handler;

-typedef void (*zrangeResultBeginFunction)(zrange_result_handler *c);
+typedef void (*zrangeResultBeginFunction)(zrange_result_handler *c, long hint);
 typedef void (*zrangeResultFinalizeFunction)(
     zrange_result_handler *c, size_t result_count);
 typedef void (*zrangeResultEmitCBufferFunction)(
@@ -2899,7 +2899,12 @@ struct zrange_result_handler {
 };

 /* Result handler methods for responding the ZRANGE to clients. */
-static void zrangeResultBeginClient(zrange_result_handler *handler) {
+static void zrangeResultBeginClient(zrange_result_handler *handler, long hint) {
+    if (hint > 0) {
+        addReplyArrayLen(handler->client, hint);
+        handler->userdata = NULL;
+        return;
+    }
     handler->userdata = addReplyDeferredLen(handler->client);
 }

@@ -2941,12 +2946,14 @@ static void zrangeResultFinalizeClient(zrange_result_handler *handler,
         result_count *= 2;
     }

-    setDeferredArrayLen(handler->client, handler->userdata, result_count);
+    if (handler->userdata)
+        setDeferredArrayLen(handler->client, handler->userdata, result_count);
 }

 /* Result handler methods for storing the ZRANGESTORE to a zset. */
-static void zrangeResultBeginStore(zrange_result_handler *handler)
+static void zrangeResultBeginStore(zrange_result_handler *handler, long hint)
 {
+    UNUSED(hint);
     handler->dstobj = createZsetZiplistObject();
 }

@@ -3041,7 +3048,7 @@ void genericZrangebyrankCommand(zrange_result_handler *handler,
     if (end < 0) end = llen+end;
     if (start < 0) start = 0;

-    handler->beginResultEmission(handler);
+    handler->beginResultEmission(handler, end-start+1);

     /* Invariant: start >= 0, so this test will be true when end < 0.
      * The range is empty when start > end or start >= length. */
@@ -3148,7 +3155,7 @@ void genericZrangebyscoreCommand(zrange_result_handler *handler,
     client *c = handler->client;
     unsigned long rangelen = 0;

-    handler->beginResultEmission(handler);
+    handler->beginResultEmission(handler,-1);

     /* For invalid offset, return directly. */
     if (offset > 0 && offset >= (long)zsetLength(zobj)) {
@@ -3437,7 +3444,7 @@ void genericZrangebylexCommand(zrange_result_handler *handler,
     client *c = handler->client;
     unsigned long rangelen = 0;

-    handler->beginResultEmission(handler);
+    handler->beginResultEmission(handler,-1);

     if (zobj->encoding == OBJ_ENCODING_ZIPLIST) {
         unsigned char *zl = zobj->ptr;
@@ -3680,7 +3687,7 @@ void zrangeGenericCommand(zrange_result_handler *handler, int argc_start, int st
         lookupKeyRead(c->db,key);
     if (zobj == NULL) {
         if (store) {
-            handler->beginResultEmission(handler);
+            handler->beginResultEmission(handler,-1);
             handler->finalizeResultEmission(handler, 0);
         } else {
             addReply(c, shared.emptyarray);

benchmark:

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges   1455472.72         2.19597         2.27100         3.75900         5.37500     61118.48 
without pipeline
Zranges    203077.08         0.98460         0.87900         1.40700         2.11100      8527.65 

bottom line:
1. other than the regression from the deferred replies (which does have some impact on non-pipelined traffic too), 6.2 also includes some improvements.
2. 7.0 (unstable) might contain other improvements (or regressions).

Comment From: oranagra

for reference

7.0-RC1 (before writev):

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges    228047.73        14.02753        13.95100        17.15100        20.99100     18929.74 
without pipeline:
Zranges    141687.86         1.41228         1.44700         2.17500         3.90300     11761.20 

unstable (with writev):

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges    656797.63         4.86072         4.51100         8.89500        11.39100     54519.33 
without pipeline
Zranges    139986.54         1.42811         1.47900         2.14300         2.94300     11619.98 

7.0 with the above patch to avoid using deferred replies (without writev):

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges    868107.17         3.67039         3.59900         5.59900         7.83900     72059.68 
without pipeline:
Zranges    175740.25         1.13698         1.05500         1.69500         2.41500     14587.81 

unstable with the above patch to avoid using deferred replies (with writev):

Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Zranges    801576.54         3.96586         3.93500         7.77500         8.95900     66537.12 
without pipeline:
Zranges    163546.07         1.22226         1.26300         2.06300         5.56700     13575.60 

bottom line:
1. writev doesn't completely eliminate the overheads of deferred replies.
2. there is another regression in unstable compared to 6.2.

Comment From: filipecosta90

@oranagra WRT to:

  1. writev doesn't completely eliminate the overheads of deferred replies.

on unstable ( de6be8850f1cec69263ad4f5713ef06f0274d50e ), even with writev in use, there are still 2.4% of CPU cycles within the zrange code that are not related to write/writev (the sprintf in setDeferredAggregateLen is costly).

  2. there is another regression in unstable compared to 6.2

apply patch and profile? agree?

Comment From: oranagra

yes, obviously other than the write system calls, deferred replies involve more heap allocations. and now i realize they also use sprintf instead of the optimization of using the pre-created shared.mbulkhdr. we can probably easily fix the second part (use shared.mbulkhdr, and maybe create one for maps too).

feel free to apply my patch (it still needs correct handling of withscores and c->resp), code this sprintf optimization, and profile / benchmark to see what's left... we'll surely be left with the heap allocation, dereference, and cache-miss overheads.
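For context, a sketch of the shared-header idea being referred to (my simplification, not the actual Redis code; the real pre-created headers live in the shared struct): common small lengths reuse a pre-formatted "*<n>\r\n" header, and snprintf is only the fallback.

#include <stdio.h>

#define OBJ_SHARED_BULKHDR_LEN 32   /* same constant name as in Redis */

static char *shared_mbulkhdr[OBJ_SHARED_BULKHDR_LEN];

/* done once at startup: pre-format the headers for small lengths */
static void create_shared_headers(void) {
    static char storage[OBJ_SHARED_BULKHDR_LEN][8];
    for (int i = 0; i < OBJ_SHARED_BULKHDR_LEN; i++) {
        snprintf(storage[i], sizeof(storage[i]), "*%d\r\n", i);
        shared_mbulkhdr[i] = storage[i];
    }
}

/* hot path: a table lookup for common lengths, snprintf only as the fallback */
static const char *mbulk_header(long len, char *fallback, size_t cap) {
    if (len >= 0 && len < OBJ_SHARED_BULKHDR_LEN)
        return shared_mbulkhdr[len];
    snprintf(fallback, cap, "*%ld\r\n", len);
    return fallback;
}

int main(void) {
    char tmp[32];
    create_shared_headers();
    printf("header for len=5:   %s", mbulk_header(5, tmp, sizeof(tmp)));
    printf("header for len=100: %s", mbulk_header(100, tmp, sizeof(tmp)));
    return 0;
}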

Comment From: filipecosta90

@oranagra using the changes in https://github.com/redis/redis/pull/10337 we see that for pipeline 1 we're at the same level as v5, and that for pipeline 16 we still have a gap of around 17% of CPU cycles.

We can pinpoint ~8% of the extra CPU cycles to the following added logic (profiling with pipeline 16). Please advise whether, in your opinion, we can squeeze further / reduce the overhead of the following features:

Function                       CPU Time: Total  Introduced
-----------------------------  ---------------  ------------------------------------------
updateClientMemUsage           3.6%             https://github.com/redis/redis/pull/8687
updateCachedTime               1.8%             https://github.com/redis/redis/pull/9194
ACLCheckAllUserCommandPerm     1.2%             https://github.com/redis/redis/pull/9974
updateCommandLatencyHistogram  0.8%             https://github.com/redis/redis/pull/9462/

WRT to the cache-miss overhead difference between v5 and https://github.com/redis/redis/pull/10337, I was surprised to see that the % of cache misses and the total stall cycles remained the same between v5 and the PR:

  • v5.0.7 stall cycles % ( uops_executed.stall_cycles / uops_executed.core ): 6.09%
  • PR stall cycles % ( uops_executed.stall_cycles / uops_executed.core ): 5.90%

This suggests that the added logic / added CPU cycles are indeed the cause of the regression, and that according to the data there is NO change in memory overhead/stall cycles. agree?

perf stat v5.0.7:

 Performance counter stats for process id '5063':

      59931.618609      cpu-clock (msec)          #    0.999 CPUs utilized          
      213421613490      cpu-cycles                #    3.561 GHz                      (72.72%)
      526851248464      instructions              #    2.47  insn per cycle           (81.81%)
      592920536223      uops_executed.core        # 9893.284 M/sec                    (81.81%)
       35011134086      uops_executed.stall_cycles #  584.185 M/sec                    (81.81%)
         828763445      cache-references          #   13.828 M/sec                    (81.82%)
            709741      cache-misses              #    0.086 % of all cache refs      (81.83%)
       35030337184      cycle_activity.stalls_total #  584.505 M/sec                    (81.83%)
       23099126236      cycle_activity.stalls_mem_any #  385.425 M/sec                    (81.83%)
          28164639      cycle_activity.stalls_l3_miss #    0.470 M/sec                    (81.82%)
        9240003766      cycle_activity.stalls_l2_miss #  154.176 M/sec                    (81.82%)
       10527552768      cycle_activity.stalls_l1d_miss #  175.659 M/sec                    (72.72%)

      60.001101633 seconds time elapsed

perf stat PR:


 Performance counter stats for process id '361':

      59918.131991      cpu-clock (msec)          #    0.999 CPUs utilized          
      212043287497      cpu-cycles                #    3.539 GHz                      (72.72%)
      510785333068      instructions              #    2.41  insn per cycle           (81.81%)
      554512273393      uops_executed.core        # 9254.499 M/sec                    (81.82%)
       33775863959      uops_executed.stall_cycles #  563.700 M/sec                    (81.82%)
         740966569      cache-references          #   12.366 M/sec                    (81.82%)
            429245      cache-misses              #    0.058 % of all cache refs      (81.82%)
       33779169255      cycle_activity.stalls_total #  563.755 M/sec                    (81.82%)
       21438524734      cycle_activity.stalls_mem_any #  357.797 M/sec                    (81.82%)
          13705439      cycle_activity.stalls_l3_miss #    0.229 M/sec                    (81.82%)
        7519901229      cycle_activity.stalls_l2_miss #  125.503 M/sec                    (81.82%)
        8720668007      cycle_activity.stalls_l1d_miss #  145.543 M/sec                    (72.72%)

      60.001379977 seconds time elapsed

Comment From: oranagra

@filipecosta90 please correct me if i'm wrong:
1. it doesn't matter whether #10337 contains #9934 and #10334 or not, since it doesn't use deferred replies.
2. the PRs you list as having introduced the performance loss are all not specific to ZRANGE, i.e. they'll affect a pipeline of SETs too.