In util.c, the function ll2string was sped up by an optimised algorithm and a digits10 helper. In networking.c, the function addReplyBulkLen could be sped up by digits10 as well.
Here is my version, written before 2.8.14 was released; unlike digits10, it is not recursive.
inline size_t decDigitCount(int64_t d) {
    size_t l = 0;
    if (d < 0) {
        d = -d;
        l = 1;
    } else {
        l = 0;
    }
    if (d > 10000000000) goto its_really_big;
    if (d < 10) return 1 + l;
    if (d < 100) return 2 + l;
    if (d < 1000) return 3 + l;
    if (d < 10000) return 4 + l;
    if (d < 100000) return 5 + l;
    if (d < 1000000) return 6 + l;
    if (d < 10000000) return 7 + l;
    if (d < 100000000) return 8 + l;
    if (d < 1000000000) return 9 + l;
    if (d < 10000000000) return 10 + l;
its_really_big:
    if (d < 100000000000) return 11 + l;
    if (d < 1000000000000) return 12 + l;
    if (d < 10000000000000) return 13 + l;
    if (d < 100000000000000) return 14 + l;
    if (d < 1000000000000000) return 15 + l;
    if (d < 10000000000000000) return 16 + l;
    if (d < 100000000000000000) return 17 + l;
    if (d < 1000000000000000000) return 18 + l;
    return 19 + l;
}
/* Create the length prefix of a bulk reply, example: $2234 */
void addReplyBulkLen(redisClient *c, robj *obj) {
    size_t len;

    if (obj->encoding == REDIS_ENCODING_RAW) {
        len = sdslen(obj->ptr);
    } else {
        long n = (long)obj->ptr;

        /* Compute how many bytes will take this integer as a radix 10 string */
#if 1
        len = decDigitCount(n);
#else
        len = 1;
        if (n < 0) {
            len++;
            n = -n;
        }
        while((n = n/10) != 0) {
            len++;
        }
#endif
    }
    if (len < REDIS_SHARED_BULKHDR_LEN)
        addReply(c,shared.bulkhdr[len]);
    else
        addReplyLongLongWithPrefix(c,len,'$');
}
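For reference, the digits10 in util.c that this is compared against is, roughly, a branch-grouped counter in the style below. This is a paraphrase for illustration only, not the exact code from util.c; note that the final branch recurses on v / 10^12, which is why digits10 is described as recursive.

#include <stdint.h>

static uint32_t digits10(uint64_t v) {
    if (v < 10) return 1;
    if (v < 100) return 2;
    if (v < 1000) return 3;
    if (v < 1000000000000ULL) {
        if (v < 100000000) {            /* v < 10^8 */
            if (v < 1000000) {          /* v < 10^6 */
                if (v < 10000) return 4;
                return 5 + (v >= 100000);
            }
            return 7 + (v >= 10000000);
        }
        if (v < 10000000000ULL)         /* v < 10^10 */
            return 9 + (v >= 1000000000);
        return 11 + (v >= 100000000000ULL);
    }
    /* >= 10^12: count twelve digits, then recurse on the remaining ones. */
    return 12 + digits10(v / 1000000000000ULL);
}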
Comment From: mattsta
If we compile without optimizations, decDigitCount is 40% faster than recursive digits10. decDigitCount is 10x to 30x faster than the existing len++ loop.
If we compile with optimizations (O2 or higher), all of the loops become irrelevant and everything returns immediately. With optimizations enabled, the compiler can generate efficient jump tables instead of looping through everything.
So, it looks like a good change, but I can't detect any performance impact in any compiled version people would be running in live usage.
Comment From: 191919
decDigitCount is 2.5x-3x faster than the 'divide by 10' loop, even with -O3 optimisation, under the Intel C Compiler, gcc and clang.
Here is what I extracted from my profiling unit:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

inline int decDigitCount_div10(int64_t d) {
    int l;
    if (d < 0) {
        l = 1;
        d = -d;
    } else {
        l = 0;
    }
    while (d) {
        l++;
        d /= 10;
    }
    return l;
}
inline size_t decDigitCount(int64_t d) {
    size_t l = 0;
    if (d < 0) {
        d = -d;
        l = 1;
    } else {
        l = 0;
    }
    if (d > 10000000000) goto its_really_big;
    if (d < 10) return 1 + l;
    if (d < 100) return 2 + l;
    if (d < 1000) return 3 + l;
    if (d < 10000) return 4 + l;
    if (d < 100000) return 5 + l;
    if (d < 1000000) return 6 + l;
    if (d < 10000000) return 7 + l;
    if (d < 100000000) return 8 + l;
    if (d < 1000000000) return 9 + l;
    if (d < 10000000000) return 10 + l;
its_really_big:
    if (d < 100000000000) return 11 + l;
    if (d < 1000000000000) return 12 + l;
    if (d < 10000000000000) return 13 + l;
    if (d < 100000000000000) return 14 + l;
    if (d < 1000000000000000) return 15 + l;
    if (d < 10000000000000000) return 16 + l;
    if (d < 100000000000000000) return 17 + l;
    if (d < 1000000000000000000) return 18 + l;
    return 19 + l;
}
int test_digit_count(int type) {
    srand(1234);
    int64_t r = 0;
    if (type == 1) {
        printf("div10\n");
        for (int i = 0; i < 100000000; ++i) {
            r += decDigitCount_div10((int64_t) rand() * rand());
        }
    } else {
        printf("decDigitCount\n");
        for (int i = 0; i < 100000000; ++i) {
            r += decDigitCount((int64_t) rand() * rand());
        }
    }
    printf("%lld\n", (long long) r);
    return 0;
}
With Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn):
$ clang -O3 -o l l.c
div10 = 5.07s decDigitCount = 2.05s
With gcc version 4.9.1 (GCC):
$ /opt/bin/gcc -std=c99 -O3 -o l l.c
div10 = 4.64s decDigitCount = 1.46s
With icc version 15.0.0 (gcc version 4.9.0 compatibility):
$ icc -std=c99 -O3 -o l l.c
div10 = 4.48s decDigitCount = 1.52s
And, decDigitCount is more readable than digits10.
In our production environment, addReplyBulkLen is a hot spot in the flame graph (https://github.com/brendangregg/FlameGraph), so IMHO, any effort to make it less hot is meaningful.
Comment From: mattsta
extracted from my profiling unit:
What profiling unit? It sounds like you're doing some interesting things, but we aren't seeing all of them. (Example: where are the timings coming from? I suspect the rand() calls will introduce noticeable latency not related to the actual number counting algorithm.)
Thanks for the more detailed performance details. I found the problem with my test and am getting mostly variable timings now too. (I was using a static number for testing and the compiler happily inlined it into the algorithm up front, generating automatic results with no calculation required.)
There are some optimization oddities here. gcc with O3 generates slower code than under O2, so that's something to double check in your tests (clang is fine on O2 or O3). Also, try removing inline from your functions and check the performance difference. I can tell a difference when I remove the inline qualifier. I think your decDigitCount should have no noticeable delay, so even 1.5s is still too slow.
For me, clang is optimizing decDigitCount to the exact same performance as digits10, and those are the best performing out of all the alternatives. Redis should update digit counting to use one of those from now on (and in the same compilation unit, not called from another file—you can actually speed up Redis a noticeable amount by using LTO; common operations like memory functions and string manipulation functions should be inlined in dozens of places).
Also interesting: the above rewritten version of decDigitCount_div10 is noticeably slower than the original version with assign and test directly in the while condition (while((n = n/10) != 0)).
I put my test program with results from clang O3 at https://gist.github.com/mattsta/65a07e304db4f99604c7 — check it out and compare it with your results!
Feel free to mention your platform details too. All my tests have just been on my i7 OS X laptop with both Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn) and gcc version 4.9.0 20140330 (experimental) (GCC).
Comment From: 191919
Thanks for sharing your thoughts. The profiling unit is a harness I wrote to test and compare different implementations of the same functionality while profiling and optimising Redis.
My environment: a Mid-2012 MacBook Retina with Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz, running OS X 10.9.4.
I noticed the difference between decDigitCount_div10 and the original Redis way and rewrote it as below:
inline size_t decCount_div10(int64_t n) {
    size_t len = 1;
    if (n < 0) {
        len++;
        n = -n;
    }
    while((n = n/10) != 0) {
        len++;
    }
    return len;
}
then re-compared it against decDigitCount. The timings are from the time(1) utility.
clang with -O3
div10 = 4.93s decDigitCount = 2.04s
gcc with -O2
div10 = 4.54s decDigitCount = 1.47s
gcc with -O3
div10 = 4.60s decDigitCount = 1.46s
icc with -O3
div10 = 4.49s decDigitCount = 1.49s
The use of rand is to prevent the optimiser from doing the calculation at compile time (if the compiler is smart enough, as almost all modern optimising compilers are, n = decDigitCount(123456789012345); will be substituted with n = 15;, which makes the timing meaningless). srand(1234) sets the initial random seed to a fixed value, so the sequence of random numbers is reproducible. The overhead of rand is there, but it is the same for both decCount_div10 and decDigitCount.
Comment From: mattsta
will be substituted with n = 15;, which makes the timing meaningless).
Correct! But if we make n depend on something provided at runtime (like argv[1]), then it can't be optimized ahead of time by mistake.
Just for fun, check your rand() overhead by timing a loop of 100 million rand() * rand() calls too. Then subtract the rand-only overhead from your actual output. You should see the rand() calls dominating your runtime, to the point that almost zero actual time is spent in decDigitCount.
Short version: you're right, but I think you can be even more right.
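As a minimal sketch of the argv[1] idea, assuming the decDigitCount defined earlier in this thread is in the same file (the loop bound and the +i perturbation are illustrative choices, not from the original harness):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <number>\n", argv[0]);
        return 1;
    }
    /* The input comes from the command line, so the compiler cannot fold
     * the digit count into a constant at build time. */
    int64_t n = strtoll(argv[1], NULL, 10);
    int64_t sum = 0;
    for (int i = 0; i < 100000000; ++i) {
        /* Perturb the input each iteration so the call is not hoisted
         * out of the loop (may overflow near INT64_MAX; fine for a test). */
        sum += decDigitCount(n + i);
    }
    printf("%lld\n", (long long) sum);
    return 0;
}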
Comment From: 191919
Yes, you are right: rand() takes time.
I modified the test loop as below:
for (int i = 0; i < 100000000; ++i) {
    r += (int64_t) rand() * rand();
}
then ran the tests. With icc, the time is 1.32s, which is the cost of 100000000 iterations of rand() * rand().
Thus decDigitCount takes 1.49 - 1.32 = 0.17s, while the div10 loop takes 4.49 - 1.32 = 3.17s, roughly 18 times as long as decDigitCount; the difference is the cost of the repeated division by 10.
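For anyone reproducing this, here is a sketch of doing the same baseline subtraction in-process with clock() rather than eyeballing time(1) output. It assumes the decDigitCount from above is in the same file; the bench() helper and iteration count are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define ITERATIONS 100000000

/* Time one pass over ITERATIONS random values; if count_digits is nonzero,
 * also run decDigitCount on each value. */
static double bench(int count_digits) {
    srand(1234);
    int64_t acc = 0;
    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; ++i) {
        int64_t v = (int64_t) rand() * rand();
        acc += count_digits ? decDigitCount(v) : v;
    }
    clock_t end = clock();
    /* Print the accumulator so the loop cannot be optimised away. */
    fprintf(stderr, "acc=%lld\n", (long long) acc);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void) {
    double baseline = bench(0);   /* rand() * rand() only */
    double counted = bench(1);    /* rand() * rand() plus digit counting */
    printf("decDigitCount alone: ~%.2fs\n", counted - baseline);
    return 0;
}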
Comment From: madolson
This got merged