Hi everyone. I'm having trouble dealing with memory growth in Redis 5.0.9. Our setup consists of 3 VMs: a master, a slave, and a Sentinel.
Our application pipeline consists of:
XADD mystream * sensor-id 1234 temperature 19.8
We work with 12 different streams.
XREADGROUP GROUP mygroup Alice COUNT 150 STREAMS mystream >
XTRIM mystream MAXLEN 500000
We understand that Redis does not release memory immediately when entries are removed with XDEL, so we switched to XTRIM to avoid OOM errors.
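For context, here is a minimal sketch of this pipeline as it might look with the go-redis v9 client (the client choice and everything other than the commands above is illustrative, not our exact code):

```go
package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Create the consumer group once (a BUSYGROUP error just means it already exists).
	rdb.XGroupCreateMkStream(ctx, "mystream", "mygroup", "0")

	// Producer: XADD mystream * sensor-id 1234 temperature 19.8
	rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "mystream",
		Values: map[string]interface{}{"sensor-id": "1234", "temperature": "19.8"},
	})

	// Consumer: XREADGROUP GROUP mygroup Alice COUNT 150 STREAMS mystream >
	streams, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
		Group:    "mygroup",
		Consumer: "Alice",
		Streams:  []string{"mystream", ">"},
		Count:    150,
	}).Result()
	if err == nil {
		for _, s := range streams {
			for _, msg := range s.Messages {
				// ... process msg.Values, then acknowledge the entry ...
				rdb.XAck(ctx, "mystream", "mygroup", msg.ID)
			}
		}
	}

	// Trim: keep roughly the newest 500000 entries (XTRIM mystream MAXLEN 500000).
	rdb.XTrimMaxLen(ctx, "mystream", 500000)
}
```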
Analyzing our metrics, we detected a correlation between XREADGROUP commands and the memory increase. Keep in mind that we also run XTRIM, yet memory continues to grow.
Some data to help figure out what is going on:
- The slave node's memory does not increase the way the master's does; replication is working fine.
- We are using the Golang client with MinConnIdle set to 100 per host (our default is 10 nodes).
- We have about 10 consumers per consumer node, with about 6 nodes dedicated to consumers.
- Pending messages are zero, and XLEN of our streams equals the XTRIM value (a quick check is sketched below).
- Our XADD rate is about 6k per second.
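A quick sketch of how those last two checks can be run from code (again assuming go-redis v9; names match the commands above):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// XPENDING mystream mygroup: summary of unacknowledged entries in the group.
	pending, err := rdb.XPending(ctx, "mystream", "mygroup").Result()
	if err == nil {
		fmt.Println("pending entries:", pending.Count) // we observe 0 here
	}

	// XLEN mystream: current number of entries in the stream.
	length, err := rdb.XLen(ctx, "mystream").Result()
	if err == nil {
		fmt.Println("stream length:", length) // stays at the XTRIM threshold
	}
}
```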
Redis INFO:
# Server
redis_version:5.0.9
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:3d4d32bb8c35cf6a
redis_mode:standalone
os:Linux 4.1.12-61.1.28.el6uek.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:sync-builtin
gcc_version:4.4.7
process_id:26640
run_id:b50f2a78f33faa60d6d2cf48a6939aac075fcc34
tcp_port:6379
uptime_in_seconds:177932
uptime_in_days:2
hz:20
configured_hz:10
lru_clock:4794776
executable:/usr/local/redis/src/redis-server
config_file:/data/redis.conf
# Clients
connected_clients:2136
client_recent_max_input_buffer:4
client_recent_max_output_buffer:123072
blocked_clients:45
# Memory
used_memory:7442392864
used_memory_human:6.93G
used_memory_rss:7431725056
used_memory_rss_human:6.92G
used_memory_peak:7452463192
used_memory_peak_human:6.94G
used_memory_peak_perc:99.86%
used_memory_overhead:1114066798
used_memory_startup:783544
used_memory_dataset:6328326066
used_memory_dataset_perc:85.04%
allocator_allocated:7442401520
allocator_active:7460909056
allocator_resident:7569297408
total_system_memory:33461932032
total_system_memory_human:31.16G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:25769803776
maxmemory_human:24.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.00
allocator_frag_bytes:18507536
allocator_rss_ratio:1.01
allocator_rss_bytes:108388352
rss_overhead_ratio:0.98
rss_overhead_bytes:-137572352
mem_fragmentation_ratio:1.00
mem_fragmentation_bytes:-10407736
mem_not_counted_for_evict:0
mem_replication_backlog:1073741824
mem_clients_slaves:16922
mem_clients_normal:39523900
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:1112266617
rdb_bgsave_in_progress:0
rdb_last_save_time:1615229752
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:294912
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
# Stats
total_connections_received:9213369
total_commands_processed:575153203
instantaneous_ops_per_sec:4189
total_net_input_bytes:374022659065
total_net_output_bytes:718721749681
instantaneous_input_kbps:2587.72
instantaneous_output_kbps:5761.27
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:1
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:22961146
keyspace_misses:3707
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:368
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
# Replication
role:master
connected_slaves:1
master_replid:ba3b6fd43ed2f0f336b4166e3a2ee5060a1151c5
master_replid2:ceccbc431f46f04ea7c32882de68c26d1f1f6c3b
master_repl_offset:8873053474828
second_repl_offset:8443132203955
repl_backlog_active:1
repl_backlog_size:1000000000
repl_backlog_first_byte_offset:8872053474829
repl_backlog_histlen:1000000000
# CPU
used_cpu_sys:22945.887250
used_cpu_user:24607.789431
used_cpu_sys_children:0.001750
used_cpu_user_children:0.000000
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=12,expires=0,avg_ttl=0
Comment From: itamarhaber
Hello @vsouza
If I understand correctly, the reason for submitting this issue is that the replica's memory usage doesn't correspond to that of the master, and that there appears to be an increase in the master's memory usage. If so, please share the full output of INFO ALL from both master and replica, taken at approximately the same time, so it can be demonstrated.
Also, it would be helpful, if possible, to see these metrics over time and the correlation you've identified.
One thing that does stand out above is the master's 1GB replication backlog (and the fact there's only a single replica), which means there's an active (slow?) sync going on.
Comment From: hwware
@vsouza I found this issue disappeared in version 5.0.10. After checking the difference between 5.0.9 and 5.0.10, I found it was solved by commit https://github.com/redis/redis/commit/a3ca53e4a792a5da18444fb587531029ed62f99b. @itamarhaber @oranagra @yossigo Could you please provide more background on this commit? Thanks a lot
Comment From: oranagra
The background is that alsoPropagate only worked in the context of call(), so when it was used in a blocked command (when it gets unblocked), it would have been ignored.
IIRC this commit is related to an issue or PR that @guybe7 opened.
Comment From: mohoch1
We are experiencing a similar issue with Redis version 6.2.7. We have a cluster with 4 master nodes and 4 slave nodes.
All the keys are written and expire as expected, and the memory usage is stable.
But when we use Redis Streams, the memory keeps building up.
With each XADD we also issue an XTRIM with MAXLEN 10000, and this limit is actually enforced. But the memory usage gradually increases, although the number of messages and their size are constant.
The node holding this key reached 19 GB and more, even though no other key was stored on that node.
Applying an XTRIM with MAXLEN 10 reduced the number of messages, but not the memory.
The only way to bring the memory down was to DEL the key.
Any idea?
Comment From: oranagra
@mohoch1 please share your config, specifically stream-node-max-entries and stream-node-max-bytes.
also, maybe you can share XINFO STREAM <key> on that key when it has very few entries but holds a huge amount of memory.
i also wonder what MEMORY USAGE <key> says on that key.
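(For reference, both values can also be collected from client code; a minimal sketch assuming go-redis v9, with a placeholder key name:)

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// XINFO STREAM <key>: length, radix-tree size, first/last entries, etc.
	info, err := rdb.XInfoStream(ctx, "mystream").Result()
	if err == nil {
		fmt.Printf("length=%d radix-tree-keys=%d radix-tree-nodes=%d\n",
			info.Length, info.RadixTreeKeys, info.RadixTreeNodes)
	}

	// MEMORY USAGE <key>: bytes attributed to the key and its value.
	usage, err := rdb.MemoryUsage(ctx, "mystream").Result()
	if err == nil {
		fmt.Println("memory usage (bytes):", usage)
	}
}
```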
@guybe7 do you have any other ideas?
Comment From: mohoch1
Hi.
- the config value for `stream-node-max-entries` is 100
- the config value for `stream-node-max-bytes` is 4096 (which is the default)
- I cleared the stream and let it run for a while. Using the Redis Insight memory analyzer, I can see that with 150 entries the key's memory usage is over 800 MB, while the overall node usage is still over 3 GB (when this is the only key on the node)
- An example of `XINFO STREAM <key>` with only 40 entries:
1) "length"
2) (integer) 40 3) "radix-tree-keys" 4) (integer) 4 5) "radix-tree-nodes" 6) (integer) 10 7) "last-generated-id" 8) "1681645685860-0" 9) "groups" 10) (integer) 1 11) "first-entry" 12) 1) "1681645682372-0" 2) 1) "database" 2) "default" 3) "names" 4) "my-stats" 5) "time" 6) "2023-04-16 11:48:02" 7) "tags" 8) "{\"machine-name\":\"MACHINE-NAME-04\",\"process-name\":\"MY.AgentService\",\"process-id\":27036}" 9) "fields" 10) "{\"retrievals\":0,\"retrieval-duration-max\":0,\"retrieval-duration-min\":0,\"retrieval-duration-avg\":0,\"retrievalErrors\":0,\"lockTimeouts\":0,\"lock-duration-max\":0}" 13) "last-entry" 14) 1) "1681645685860-0" 2) 1) "database" 2) "default" 3) "names" 4) "my-stats" 5) "time" 6) "2023-04-16 11:48:05" 7) "tags" 8) "{\"machine-name\":\"MACHINE-NAME-02\",\"process-name\":\"MY.AgentService\",\"process-id\":35784}" 9) "fields" 10) "{\"retrievals\":0,\"retrieval-duration-max\":0,\"retrieval-duration-min\":0,\"retrieval-duration-avg\":0,\"retrievalErrors\":0,\"lockTimeouts\":0,\"lock-duration-max\":0}"
Comment From: rishbard
Hi @mohoch1, we're facing a similar problem with our ElastiCache cluster.
Can you confirm whether you have the activedefrag parameter set to true for your cluster? Could that play a role in the memory not being cleared for streams after XTRIM?
Comment From: oranagra
activedefrag can fix problems with allocator_frag_bytes, and some cases of high RSS, but in this case the growth is in used_memory, so it's either a leak or some inefficiency in the stream structure (not fragmentation).
considering that the memory is released when deleting the key, it's not a leak.
Comment From: guybe7
@mohoch1 let's try to narrow down the problem... do you use consumer groups? can you try to build the same scenario without consumer groups?
one theory is that once an entry has been read by a consumer and is held in a PEL, it is not trimmed even if the original entry in the actual stream rax was trimmed
Comment From: mohoch1
@guybe7, yes I am reading the stream using a consumer group. I will try to remove that, as this is indeed not required in my scenario.
I will update after we finish checking.
Thanks.
Comment From: mohoch1
Well - consuming without a consumer group did not help. We are seeing the same memory leak.
Comment From: piavgh
@mohoch1: we are facing the same issue with Redis v7. What is your solution?
Comment From: mohoch1
Actually, for now I don't have any solution. From time to time I just delete the stream.
Comment From: guybe7
@mohoch1 what do you mean by "consuming without a consumer group"?
AFAICT it is true that deleting/trimming unacknowledged entries will not free the NACK entries from the PELs - but this is most likely by design - you should not delete/trim entries that were consumed by a consumer and not acknowledged (feels like a bug in the application? why would you delete an entry before the consumer processed it?)
a nasty workaround is to use XAUTOCLAIM every once in a while (XAUTOCLAIM will clear any deleted entries it stumbles upon from the PEL)
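To illustrate that workaround, here is a minimal sketch assuming the go-redis v9 client (the stream, group, and consumer names are placeholders); it walks the whole PEL in batches, and XAUTOCLAIM drops references to already-deleted entries as it goes:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	start := "0-0"
	for {
		// Note: entries that still exist get claimed (reassigned) to the "janitor"
		// consumer as a side effect, which is why this is only a workaround.
		claimed, next, err := rdb.XAutoClaim(ctx, &redis.XAutoClaimArgs{
			Stream:   "mystream",
			Group:    "mygroup",
			Consumer: "janitor",
			MinIdle:  0,
			Start:    start,
			Count:    100,
		}).Result()
		if err != nil {
			fmt.Println("xautoclaim failed:", err)
			return
		}
		_ = claimed // still-existing pending entries, now owned by "janitor"
		if next == "0-0" {
			break // a full pass over the PEL is complete
		}
		start = next
	}
}
```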
Comment From: oranagra
@mohoch1 please post the output of XINFO GROUPS <key>, so we can see if it's an ACK problem or something else.
Comment From: mohoch1
@guybe7, as you suggested, I tried consuming without a consumer group, using XREAD. This resulted in the same leak.
The reason I trim the stream is that I want to limit its size - in case no one consumes my messages, I don't want it to keep growing indefinitely.
If, by design, I am not supposed to trim the stream while messages are being consumed, how can I ensure that the stream will not swell up?
Comment From: mohoch1
@oranagra, the current result of XINFO GROUPS shows this:
1) "name"
2) "Processor"
3) "consumers"
4) (integer) 38636293
5) "pending"
6) (integer) 340
7) "last-delivered-id"
8) "1688914813124-0"
Comment From: oranagra
thanks. it would be nice to see the memory usage that accompanies it (to know how much excess there is).
anyway, i was expecting `pending` to be very high, but instead i see you have quite a lot of `consumers`.
maybe there's some bug in the app that leaks consumers, i.e. it keeps creating new ones instead of re-using them?
it's a little hard to estimate the memory usage of 38m consumers. in the past you said that deleting the stream key released the memory, maybe you can try to delete the consumer group, or better yet, write some script that deletes the excess consumers (using XINFO CONSUMERS will give you the list, but note that it'll be a very long list, which can cause the server to freeze for a moment)
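For example, a hypothetical cleanup script could look roughly like this (go-redis v9 assumed; the stream and group names are placeholders, and as noted above XINFO CONSUMERS will return a very large reply with this many consumers):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// XINFO CONSUMERS mystream Processor: list every consumer in the group.
	consumers, err := rdb.XInfoConsumers(ctx, "mystream", "Processor").Result()
	if err != nil {
		panic(err)
	}
	for _, c := range consumers {
		// Only delete consumers with nothing pending; XGROUP DELCONSUMER also
		// discards the consumer's PEL, so pending entries would otherwise be lost.
		if c.Pending == 0 {
			if err := rdb.XGroupDelConsumer(ctx, "mystream", "Processor", c.Name).Err(); err != nil {
				fmt.Println("failed to delete consumer", c.Name, ":", err)
			}
		}
	}
}
```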
Comment From: mohoch1
Well, I don't understand why we have so many consumers. We have only 1 application consuming the stream.
We use StackExchange.Redis and this is the code:
while (!_disposed)
{
    var entries = _redis.StreamRead(streamName, StreamPosition.NewMessages);
    if (entries.Any())
        ProcessEntries(entries);
    Thread.Sleep(pollIntervalMs);
}
The `StreamRead` method uses the _XREAD_ command.
What am I missing?
Is there a way of creating only one consumer?
Comment From: oranagra
@mgravell maybe you can help here?
Comment From: mohoch1
Is there any suggestion?
Comment From: oranagra
@NickCraver can you please have a [look](https://github.com/redis/redis/issues/8635#issuecomment-1628304700)?
Comment From: mpcathey
@vsouza Are you still seeing this? I was looking at using Streams and this has me concerned.
Comment From: mohoch1
Does anyone have an updated solution for this?
We are still having the same issue.
Comment From: sundb
@mohoch1 can you share the output of `XINFO GROUPS`? are there still 38m consumers?
Comment From: mohoch1
Currently I have 200M consumers and the size of the stream is around 35GB.
I haven't cleared it out for a while.
1) "name" 2) "Processor" 3) "consumers" 4) (integer) 200598882 5) "pending" 6) (integer) 340 7) "last-delivered-id" 8) "1722240329720-0" ```
Comment From: sundb
@mohoch1 can you share more detail about this?
Comment From: mohoch1
We are using the StackExchange.Redis library by @mgravell.
We have a constant loop that tries to read from the stream and process the entries received. If there is nothing there - we just wait for half a second or so, and then try again.
We call the StreamRead method, which under the hood calls Redis's XREAD command.
This should be straightforward - but it seems something is not cleared.
What am I missing? Do I need to Dispose a consumer? Do I need to acknowledge the message I read?
Comment From: sundb
@mohoch1 the core question is why so many consumers were created. 200M consumers will consume at least 11 GB just for the streamConsumer structs.
> What am I missing? Do I need to Dispose a consumer? Do I need to acknowledge the message I read?
yes, ACKing after consumption prevents the consumer->pel from becoming too large.
Comment From: mohoch1
I see that I have been acknowledging the messages.
Comment From: sundb
yes, "pending' is just 340, it's better to dispose the noused comsumers.