Describe the bug
There is a memory fragmentation issue in Redis. Given a log-normal value size distribution, it can lead to out-of-memory conditions. The problem is solved if you upgrade the jemalloc bundled with Redis from 5.1.0 to 5.2.0.
The issue happens with read-only and read-write workloads.
The only requirement to reproduce the issue is to load the database with values of log-normal size.
To reproduce
Here is a Python script to load the database (save it as load_redis.py and make it executable):
#!/usr/bin/env python3
import redis
import scipy.stats

def main():
    # Fixed seed so the generated value sizes are reproducible.
    # (scipy.random is an alias for numpy.random in older SciPy releases.)
    scipy.random.seed(1)
    # Heavy-tailed (log-normal) distribution of value sizes.
    dist = scipy.stats.lognorm(3.43, scale=206)
    conn = redis.Redis()
    for i in range(100000):
        key = 'key:' + str(i).zfill(12)
        value_len = max(1, int(dist.rvs()))
        value = 'x' * value_len
        conn.set(key, value)

if __name__ == "__main__":
    main()
Here are the reproduction steps:
git clone --depth=1 https://github.com/redis/redis.git
cd redis
make -sj
pip3 install redis scipy
src/redis-server --save '' &
./load_redis.py
timeout 10 src/redis-benchmark -c 500 -e -n 9999999 -r 100000 -t get
sleep 1
src/redis-cli info | grep mem_fr
The output will look like this:
mem_fragmentation_ratio:1.68
mem_fragmentation_bytes:4666941768
A ratio > 1 means that the memory is fragmented.
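The same fields can also be read programmatically. Below is a minimal sketch using redis-py against a default local instance (host and port are assumptions); it simply prints the two fragmentation fields from INFO memory:

import redis

conn = redis.Redis()  # assumes a default local instance
mem = conn.info('memory')
# INFO memory exposes the fields shown above.
print('mem_fragmentation_ratio:', mem['mem_fragmentation_ratio'])
print('mem_fragmentation_bytes:', mem['mem_fragmentation_bytes'])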
Expected behavior
Here are the steps to upgrade jemalloc:
cd deps
mv jemalloc jemalloc.original
wget https://github.com/jemalloc/jemalloc/releases/download/5.2.0/jemalloc-5.2.0.tar.bz2
tar xf jemalloc-5.2.0.tar.bz2
mv jemalloc-5.2.0 jemalloc
cd ..
make distclean
make -sj
If you restart the server and run the reproduction steps again, the output will look like this:
mem_fragmentation_ratio:0.93
mem_fragmentation_bytes:-485601336
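To double-check that the rebuilt binary really links the new allocator, INFO also reports the compiled-in allocator version. A small sketch with redis-py (same assumptions as above):

import redis

conn = redis.Redis()
# Reported as e.g. "jemalloc-5.1.0" before the upgrade and "jemalloc-5.2.0" after.
print('mem_allocator:', conn.info('memory')['mem_allocator'])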
Additional information
For your information, these are the commits in jemalloc that fix the issue: https://github.com/jemalloc/jemalloc/commit/fb56766ca9b398d07e2def5ead75a021fc08da03 https://github.com/jemalloc/jemalloc/commit/350809dc5d43ea994de04f7a970b6978a8fec6d2
Comment From: oranagra
@prekageo thanks for that tip. we do indeed need to upgrade jemalloc sooner or later.
From what i understand, this benchmark (python part) generates about 6GB of memory usage, and the redis-benchmark execution generates high client output buffer consumption, during which redis reaches a peak of some 12GB usage.
From what i can tell, there's no fragmentation at either of these points in time (the process RSS matches the used_memory, and the mem_fragmentation_ratio is near 1.0).
Then when the clients disconnect, redis releases all the output buffer memory, and redis's memory usage goes back to 6GB.
allocator_active and even allocator_resident (both sampled from jemalloc metrics) also show only 6GB (there's no actual memory "fragmentation").
If we look at MEMORY MALLOC-STATS, we can see that the retained memory is very high, and that's why the process RSS is still high.
I.e. jemalloc returned memory with MADV_FREE (not with the immediate MADV_DONTNEED), so the pages are still mapped into the process and still counted as RSS (note that this is not fragmentation!).
Then if you wait a minute, the mem_fragmentation_ratio (which is misnamed since it doesn't really show fragmentation) will get back to normal, once the kernel is done reclaiming these pages and the RSS is back to 6GB.
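To make that distinction concrete, the relevant counters can be compared side by side. The following is a rough sketch (assuming redis-py and the allocator_* fields Redis samples from jemalloc): a large gap between used_memory_rss and allocator_resident points at lazily reclaimed (MADV_FREE) pages, while a gap between allocator_active and allocator_allocated would indicate actual allocator fragmentation.

import redis

conn = redis.Redis()
mem = conn.info('memory')
# What the application asked for, what jemalloc holds in active/resident pages,
# and what the kernel still accounts as process RSS.
for field in ('used_memory', 'allocator_allocated', 'allocator_active',
              'allocator_resident', 'used_memory_rss'):
    print(field, mem.get(field))
# used_memory_rss >> allocator_resident: pages released with MADV_FREE that the
# kernel has not reclaimed yet (not fragmentation).
# allocator_active >> allocator_allocated: actual fragmentation inside jemalloc.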
Comment From: oranagra
@prekageo putting aside what this test actually achieves.
By looking at your script, it seems you're aiming for a very specific pathological edge case (considering the use of lognorm).
So i suppose that one may exist and your script just didn't manage to expose it (yet)?
Although in that case, i have to wonder why the redis-benchmark and sleep are needed.
Are you trying to reproduce a problem which you observed somewhere? or maybe something you learned by reading the code?
P.s. as far as i know jemalloc 5.2 no longer uses MADV_FREE by default, so that's probably one reason why this test would produce different results (i didn't check the two commits you referred to).
Comment From: prekageo
Hi @oranagra. Thanks for the detailed explanation.
I have observed in real-world situations that Redis was running out of memory while doing a read-only workload. That was unexpected for me so I started looking deeper into the problem. Then, I realized that the problem happens when using a heavy tailed distribution for the value sizes (such as log-normal). The script and reproduction steps are just a minimal example that uncovers this behavior.
I have not read the code of jemalloc to understand why this behavior happens. I have bisected its git history and I've come up with the 2 mentioned commits.
The execution of the redis-benchmark is the one that causes this behavior. If you actually remove the timeout 10 and let it run for a longer amount of time, Linux will kill either Redis or redis-benchmark due to OOM. The sleep is part of my reproduction script just to be sure that the benchmark has finished. It might not be necessary.
Comment From: oranagra
@prekageo thank you for clearing it up.
Read operations in redis do indeed consume memory and can either trigger key eviction (if maxmemory is set), or cause the process to grow, which can lead to an OOM kill.
There is a client-output-buffer-limit config that can be used to disconnect clients when their output buffer grows too much, and we also have other plans for improving this in the future, but all of that has nothing to do with fragmentation.
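For reference, a minimal sketch of how that limit can be applied to normal clients (the values below are purely illustrative, not a recommendation):

import redis

conn = redis.Redis()
# Disconnect a normal client whose output buffer exceeds 256MB, or stays above
# 64MB for 60 seconds (hard limit, soft limit, soft seconds; illustrative values).
conn.config_set('client-output-buffer-limit', 'normal 268435456 67108864 60')
# Equivalent redis.conf line:
#   client-output-buffer-limit normal 256mb 64mb 60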
As i said, from what i can see with your specific reproduction scenario, the difference between the old and new jemalloc here is just that the new one tells the kernel to release pages immediately, while the old one tells it to do so later, but in both cases these pages are eventually freed. Also, this specific problem would happen regardless of the size distribution of the data; it's just a matter of querying big values.
Maybe what you saw in the real case is indeed a fragmentation problem, in which case you may want to try reproducing it again in a different way.
Comment From: prekageo
@oranagra you are right. I've tried with a fixed value size of 300 MB and I observe the same behavior. And indeed the memory is eventually freed (as long as you stop the client before OOM).
Maybe it's my misunderstanding that this behavior is caused by fragmentation. I supposed so because one of the commits of jemalloc that fixes this behavior mentions "...which significantly improves VM fragmentation...".
In any case, this behavior prevents reading large values from Redis for a long period of time over long-lasting connections. I don't know how common a scenario that is and whether it's worth upgrading jemalloc to fix it. It is counter-intuitive, though, that Redis will consume more and more memory while running a read-only workload.
Comment From: oranagra
If client-output-buffer-limit and maxmemory are not set, and the client sends an uncontrolled pipeline, it can cause an OOM kill anyway; the new version of jemalloc doesn't change that. There are also other problems with client output buffers (even if limited), see #7676; we hope to get to handle that some day soon.
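To see where that memory goes while such a pipeline is running, the per-client output buffer usage can be inspected. A small sketch (assuming redis-py; CLIENT LIST reports output buffer memory as omem):

import redis

conn = redis.Redis()
# Sum the output buffer memory (omem) across all connected clients.
clients = conn.client_list()
total_omem = sum(int(c.get('omem', 0)) for c in clients)
print('clients:', len(clients), 'total omem bytes:', total_omem)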
Comment From: oranagra
Closing this as it turns out to be different issues (client output buffers and MADV_FREE). Feel free to reopen if needed.