Hi,
Lazyfree is an amazing feature, redis can delete keys in background without blocking, but I just find a problem which can lead to memory leak.
We just use sds instead of robj to store fields for OBJ_HASH, OBJ_SET, OBJ_ZSET, OBJ_LIST. But for OBJ_STRING the value is still robj, it may be shared with others, e.t. slowlog entry:
slowlog.c
slowlogEntry *slowlogCreateEntry(client *c, robj **argv, int argc, long long duration) {
...
/* Trim too long strings as well... */
if (argv[j]->type == OBJ_STRING &&
sdsEncodedObject(argv[j]) &&
sdslen(argv[j]->ptr) > SLOWLOG_ENTRY_MAX_STRING)
{
sds s = sdsnewlen(argv[j]->ptr, SLOWLOG_ENTRY_MAX_STRING);
s = sdscatprintf(s,"... (%lu more bytes)",
(unsigned long)
sdslen(argv[j]->ptr) - SLOWLOG_ENTRY_MAX_STRING);
se->argv[j] = createObject(OBJ_STRING,s);
} else {
se->argv[j] = argv[j];
incrRefCount(argv[j]);
}
...
}
Oops, if sdslen is smaller than SLOWLOG_ENTRY_MAX_STRING, an slowlogEntry just point to the same object and incr the refcount.
I just did an experiment:
127.0.0.1:6379> config set slowlog-log-slower-than 1
OK
127.0.0.1:6379> set foo bar
OK
127.0.0.1:6379> object refcount foo
(integer) 2
The refcount is 2 now, that is dangerous! Memory leak may happen, let me explain step by step:
-
Do
FLUSHALL ASYNCorUNLINKtips:
UNLINKonly works when value's length is larger than LAZYFREE_THRESHOLD,OBJ_STRINGis always considered as 1, so we useFLUSHALL ASYNC.Now, the bio
lazyfreethread holds the valuebar, but we don't know when it really calldecrRefCount(). -
Do
SLOWLOG resetNow the
slowlogEntrywhich point to the valuebarwill be freed bydecrRefCount(), iflazyfreethread calldecrRefCount()at the same time, memory leak appears:void decrRefCount(robj *o) { if (o->refcount == 1) { switch(o->type) { case OBJ_STRING: freeStringObject(o); break; case OBJ_LIST: freeListObject(o); break; case OBJ_SET: freeSetObject(o); break; case OBJ_ZSET: freeZsetObject(o); break; case OBJ_HASH: freeHashObject(o); break; case OBJ_MODULE: freeModuleObject(o); break; default: serverPanic("Unknown object type"); break; } zfree(o); } else { if (o->refcount <= 0) serverPanic("decrRefCount against refcount <= 0"); if (o->refcount != OBJ_SHARED_REFCOUNT) o->refcount--; } }tips: if we reach the
server.slowlog_max_lenthat will also freeslowlogEntry
I just do a little change to fix it, see #4324, do you think it is ok? Or make a big change for all STRING commands is necessary?
Comment From: antirez
That's a very interesting finding @soloestoy! I never thought that string objects could be reclaimed in the background, because normally UNLINK will never do that since for non-aggregate objects, the deletion actually happens synchronously (because of the deallcation effort being 1). However as you noted FLUSHALL ASYNC changes the game here... Your fix seems sensible to me, but now what I'm worried about, is if there are other places where the object can end being shared. The client argv vector should be safe since, as soon as the command is executed and the object shared, the client argument vectors gets reset before any other client could call FLUSHALL. Did you tried to analyze other scenarios where this could happen? Thanks.
Comment From: soloestoy
I have checked scenarios about multi.c and script.c, I think they are safe and will do my best to analyze other scenarios.
Comment From: antirez
Thanks @soloestoy, maybe we can later identify some place where an assert() could be added to verify that refcount == 1.
Comment From: antirez
Fix merged, comment clarified. Good work 👍
Comment From: soloestoy
solved & merged