We have 10 thousand connections on a redis server . The server is occasionally hang up.

We found in function "clientsCronResizeQueryBuffer", condition (querybuf_size > 1024 && idletime > 2) is matched many times, and a lot of connections release their querybuf at cron job at the same time.The hang up happened at the same time when bulk release.

if (((querybuf_size > PROTO_MBULK_BIG_ARG) &&
     (querybuf_size/(c->querybuf_peak+1)) > 2) ||
     (querybuf_size > 1024 && idletime > 2))
{
    /* Only resize the query buffer if it is actually wasting space. */
    if (sdsavail(c->querybuf) > 1024) {
        c->querybuf = sdsRemoveFreeSpace(c->querybuf);
    }
}

===== As we found querybuff is initially allocated with 32 * 1024 bytes. So we change the condition to "querybuf_size > 1024 * 32 && idletime > 2". The hang up is fixed.

Comment From: antirez

Thank you, I just fixed it in commit cec404f0. What do you think about the new implementation? Thanks.

Comment From: soloestoy

Hi @antirez , after commit cec404f there are still some problem we should fix I think:

c->querybuf_peak has not been updated correctly in readQueryFromClient.

c void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) { qblen = sdslen(c->querybuf); if (c->querybuf_peak < qblen) c->querybuf_peak = qblen; c->querybuf = sdsMakeRoomFor(c->querybuf, readlen); nread = read(fd, c->querybuf+qblen, readlen); ... sdsIncrLen(c->querybuf,nread);

As we can see, we update c->querybuf_peak before read and sdsIncrLen, it means that qblen here is the last c->querybuf length. But the length of c->querybuf after processInputBuffer is always 0, unless we didn't read the whole request at one time.

So, we should update c->querybuf_peak after sdsIncrLen I think.
We should use sdsalloc instead of sdsAllocSize.

Because we wanna get c->querybuf alloc space except for header, the sdsAllocSize value is always 32 * 1024 + sdsHdrSize + 1 grate than 32 * 1024.
Check if query buffer is > BIG_ARG and too big for latest peak only when c->querybuf_peak is not 0.

Function clientsCronResizeQueryBuffer reset c->querybuf_peak to be 0 no matter resize the query buffer or not.

c int clientsCronResizeQueryBuffer(client *c) { ... /* Reset the peak again to capture the peak memory usage in the next * cycle. */ c->querybuf_peak = 0;

If we don't get any requests after that, next time we will trigger the condition definitely.

Submit PR #5013.

Comment From: soloestoy

Another question.

As we found querybuff is initially allocated with 32 * 1024 bytes. So we change the condition to "querybuf_size > 1024 * 32 && idletime > 2". The hang up is fixed.

@northhurricane Sorry I cannot understand this fix.

If your fix is right it means that 1024 < querybuf_size <= 1024 * 32. But I can't find any scenario could make it happen.

As you said querybuff is initially allocated with 32 * 1024 bytes:

void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) {

    readlen = PROTO_IOBUF_LEN;

    c->querybuf = sdsMakeRoomFor(c->querybuf, readlen);

That's right, sdsMakeRoomFor can enlarge querybuf. But except for clientsCronResizeQueryBuffer, we have only one situation to resize querybuf:

int processMultibulkBuffer(client *c) {
            if (pos == 0 &&
                c->bulklen >= PROTO_MBULK_BIG_ARG &&
                sdslen(c->querybuf) == (size_t)(c->bulklen+2))
            {
                c->argv[c->argc++] = createObject(OBJ_STRING,c->querybuf);
                sdsIncrLen(c->querybuf,-2); /* remove CRLF */
                /* Assume that if we saw a fat argument we'll see another one
                 * likely... */
                c->querybuf = sdsnewlen(SDS_NOINIT,c->bulklen+2);
                sdsclear(c->querybuf);
                pos = 0;
            }

But we get a c->querybuf greater than PROTO_MBULK_BIG_ARG(32 * 1024).

Am I misunderstanding? Do you have any idea @antirez ?

Comment From: oranagra

@soloestoy can we close this one? or is there something more to improve?

Redis seconds hang up report

We found in function "clientsCronResizeQueryBuffer", condition (querybuf_size > 1024 && idletime > 2) is matched many times, and a lot of connections release their querybuf at cron job at the same time.The hang up happened at the same time when bulk release.