Hello,

I run several master-slave setups in which a master redis receives all the writes and replicates to between 3 and 19 slaves that serve only reads. We set expirations on all keys, and have a maxmemory limit of 16gb with a maxmemory-policy of volatile-ttl.

When the master redis reaches the maxmemory limit, it begins evicting keys (as expected), but it ends up evicting all the keys in the database. The keys are URLs, and the values are short strings; we have roughly 66 million keys in the database when the maxmemory limit is reached. We are seeing this in at least 8 different redis replication setups that reach maxmemory. All have a 16gb or higher maxmemory limit.

The other related symptom is that once the master evicts all the keys, some of the slaves do not delete their keys and still have e.g. 65 million keys after finishing a sync with a master that has fewer than 200 thousand.

The gist linked below has logs for master (master.log), the slave that failed to re-sync properly (bad_slave.log) and the slave that did re-sync properly (good_slave.log). Logs were gathered at the verbose level.

Info:

$ uname -a
Linux <servername> 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ redis-server -v
Redis server v=4.0.1 sha=00000000:0 malloc=jemalloc-4.0.3 bits=64 build=59320dbc052344f

Master/slave configs and logs from the time when this occurred are at: https://gist.github.com/joshuawscott/f9e238aff68417292d6d8d107409d071

Please let me know if there's any other info you need to help get to the bottom of this issue.

Thanks!

Comment From: joshuawscott

When I was able to reproduce this in a smaller setup, reducing the client-output-buffer-limit slave setting seemed to fix the mass-eviction problem. Do these buffers count toward the maxmemory limit? Hitting the memory limit under a heavy pipelined write load with 5+ slaves seems to be the only way I can reproduce it.

My suspicion is that when maxmemory is hit, it evicts keys, and puts the DEL commands into the output buffer, which causes further eviction to be needed, issuing more DEL commands, and so forth. It doesn't seem like this is the expected behavior?

Comment From: joshuawscott

After running a git bisect with a test setup, it looks like afc4b9241c37f37d1ca15be1ec3130c6a9c04a2a is the commit that introduced this behavior. We've rolled back to 3.2.9 for now, but would really like to use the 4.0 features (especially LFU eviction). Also, the problem seems to occur when the hash table slots are doubled from 64M to 128M: CPU usage goes to 100% on the master redis, and it becomes completely unresponsive for about 5 minutes. Once it returns, there are no keys left.

Comment From: tdterry

This is an old ticket, but I just ran into a similar issue with redis in AWS ElastiCache, so I thought I would share what we found. Client buffers share memory space with the database itself. If clients request a lot of data and do not consume it quickly enough, client buffers can back up quickly. Without client-output-buffer-limit set, the client buffer will continue to grow. In our case, we had a database size of approximately 1.5GB, but under heavy load our redis node got network throttled, our node servers continued to request data, and we were not able to read it. In the span of a minute, the memory usage grew to 20GB, and Redis evicted almost the entire DB. Redis was still responsive, but because all of the memory was consumed by output buffers, it wouldn't accept any new objects.

https://redislabs.com/blog/top-redis-headaches-for-devops-client-buffers/
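One rough way to check how much memory client output buffers are holding is to sum the `omem` field from `CLIENT LIST` output. The sketch below only parses text in that format; the sample lines are made up for illustration, and the helper names are hypothetical, not part of any Redis client library.

```python
def parse_client_list(text):
    """Parse CLIENT LIST output (one client per line, space-separated key=value)."""
    clients = []
    for line in text.strip().splitlines():
        fields = {}
        for token in line.split():
            key, _, value = token.partition("=")
            fields[key] = value
        clients.append(fields)
    return clients


def total_output_memory(text):
    """Sum the omem (output buffer memory, in bytes) field over all clients."""
    return sum(int(c.get("omem", 0)) for c in parse_client_list(text))


# Made-up sample in the CLIENT LIST format; real output has more fields.
SAMPLE = """
id=3 addr=10.0.0.5:52555 name= flags=N db=0 obl=0 oll=0 omem=0 cmd=get
id=4 addr=10.0.0.6:52556 name= flags=N db=0 obl=0 oll=120 omem=2457600 cmd=mget
id=5 addr=10.0.0.7:52557 name= flags=S db=0 obl=0 oll=40 omem=819200 cmd=replconf
"""

total = total_output_memory(SAMPLE)
biggest = max(parse_client_list(SAMPLE), key=lambda c: int(c["omem"]))
print(total, biggest["id"])
```

Comparing this total against the dataset size over time would show whether buffers, rather than keys, are what pushes the node to maxmemory.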

Comment From: oranagra

@joshuawscott the problem of client output buffers causing eviction is a well known one, but a few things in your report don't make sense to me.
1. what does a bugfix in DEBUG DIGEST have to do with it? (you mentioned git bisect pointed to it).
2. slave output buffers are (and AFAIK always were) excluded from the memory figure that eviction works against, so eviction should not induce more eviction by itself.

i suspect the reason why some slaves retained the evicted keys was because they got disconnected (possibly by timeout or output buffer limit), so they didn't get the DELs.
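The accounting described above can be sketched roughly as follows. This is a simplified model, not the actual Redis code: the function name and the example figures are hypothetical, and it assumes the only excluded memory is the slave output buffers.

```python
GIB = 2**30


def bytes_to_evict(used_memory, slave_output_buffers, maxmemory):
    """Simplified model: slave output buffers are subtracted before the
    comparison with maxmemory; everything else (keys, normal client
    buffers, hash tables) is counted."""
    counted = used_memory - slave_output_buffers
    return max(0, counted - maxmemory)


maxmemory = 16 * GIB

# Hypothetical case A: 2 GiB of the 17 GiB in use sits in slave output buffers.
case_a = bytes_to_evict(17 * GIB, 2 * GIB, maxmemory)

# Hypothetical case B: the same 17 GiB in use, but the extra memory is in
# normal client output buffers, which are not excluded.
case_b = bytes_to_evict(17 * GIB, 0, maxmemory)

print(case_a // GIB, case_b // GIB)  # 0 1
```

Under this model, DELs queued for slaves do not by themselves push the master over the limit, while a backlog in normal client buffers does.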

Comment From: joshuawscott

  1. This is not necessarily the commit that introduced the problem, but it is the first commit that compiled after the problem was introduced, if I recall correctly.

Based on what I remember from digging into this a few years ago, the evictions seem to occur while the key dict is being rehashed.

Comment From: yossigo

@joshuawscott You are right about that, the hash table size is also accounted for, so rehashing can create a memory usage spike that consequently triggers an eviction spike. However, I believe output buffer related over-eviction is far more common.
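For a sense of scale, here is a back-of-the-envelope estimate of that spike for the table sizes reported earlier in the thread. It assumes "64M" means 2**26 slots, that each slot is one 8-byte pointer on a 64-bit build, and that the old and new bucket arrays coexist while rehashing is in progress; it ignores the entries themselves.

```python
POINTER_BYTES = 8   # assumption: 64-bit build, one pointer per slot
MIB = 2**20


def bucket_array_bytes(slots):
    """Size of a hash table's bucket array, pointers only."""
    return slots * POINTER_BYTES


old_slots = 64 * 2**20      # assumption: "64M" means 2**26 slots
new_slots = 2 * old_slots   # table doubles on expansion

old_bytes = bucket_array_bytes(old_slots)
new_bytes = bucket_array_bytes(new_slots)
peak_bytes = old_bytes + new_bytes  # both arrays exist during the rehash

print(old_bytes // MIB, new_bytes // MIB, peak_bytes // MIB)  # 512 1024 1536
```

So under these assumptions the expansion alone allocates about 1 GiB at once, which on a node already at maxmemory would have to be reclaimed by evicting keys.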

Related to #7676