As you can see below, when the data set of an instance shrinks, used_memory_lua skews the fragmentation ratio because Lua memory is not included in used_memory. I assume used_memory only includes the memory allocated for data because the allocator is different. So perhaps a more reliable (less scary) fragmentation figure would be (RSS / (used_memory + used_memory_lua)) rather than the current (RSS / used_memory)? If this is how it is intended to be reported, feel free to close :)

127.0.0.1:6877> info memory
# Memory
used_memory:1062544
used_memory_human:1.01M
used_memory_rss:114487296
used_memory_peak:1140032096
used_memory_peak_human:1.06G
used_memory_lua:91031552
mem_fragmentation_ratio:107.75
mem_allocator:jemalloc-3.6.0
127.0.0.1:6877> script flush
OK
127.0.0.1:6877> info memory
# Memory
used_memory:1061312
used_memory_human:1.01M
used_memory_rss:3485696
used_memory_peak:1140032096
used_memory_peak_human:1.06G
used_memory_lua:36864
mem_fragmentation_ratio:3.28
mem_allocator:jemalloc-3.6.0
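
To put numbers on it, using the first INFO output above (just my own back-of-the-envelope arithmetic, not anything Redis reports):

used_memory = 1062544
used_memory_lua = 91031552
used_memory_rss = 114487296

# current calculation reported as mem_fragmentation_ratio
current_ratio = used_memory_rss / used_memory                       # ~107.75
# proposed calculation that also counts the Lua memory as "used"
proposed_ratio = used_memory_rss / (used_memory + used_memory_lua)  # ~1.24

print(round(current_ratio, 2), round(proposed_ratio, 2))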

And since we're on the topic, would it be possible to also display VIRT? I imagine there might be a good reason for its omission, though. I need to be able to monitor fragmentation, phy, virt (and frag_max, phy_max, virt_max), etc. for my redis instances, so it would be nice to get all of this from a single INFO command.
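
In the meantime I can scrape VIRT from the OS side, e.g. something like this on Linux (a rough sketch that assumes the monitor runs on the same host as the instance; the pid is the process_id from the Server info below):

# rough sketch: read VmSize (VIRT) and VmRSS from /proc/<pid>/status on Linux
def proc_memory(pid):
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                fields[key] = int(value.strip().split()[0]) * 1024  # kB -> bytes
    return fields

print(proc_memory(7257))  # 7257 is the process_id reported below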

More info from the same server if it helps:

# Server
redis_version:3.0.4
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:31bcb47bce320346
redis_mode:standalone
os:Linux 2.6.32-504.23.4.el6.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.7
process_id:7257
run_id:2e2c20c0262d3f541cb566b2110691de668c6acb
tcp_port:6877
uptime_in_seconds:165348
uptime_in_days:1
hz:10
lru_clock:9426321
...
...
# Commandstats
...
cmdstat_eval:calls=146,usec=8705779,usec_per_call=59628.62
cmdstat_script:calls=2,usec=396809,usec_per_call=198404.50

Comment From: yoav-steinberg

You are right regarding the inaccuracy of the fragmentation information. See: https://github.com/redis/redis/blob/f041990f2acde5ae1ef67351c6a505f3ef6fcf52/src/server.c#L4883-L4887 It seems like a good idea to exclude LUA allocation from here. @oranagra WDYT?

And since we're on the topic, would it be possible to also display VIRT?

I'm curious, what would you need the VIRT info for?

Comment From: oranagra

since this issue was opened, we've added real fragmentation metrics:

  • allocator_frag_ratio
  • allocator_frag_bytes
  • allocator_rss_ratio
  • allocator_rss_bytes
  • rss_overhead_ratio
  • rss_overhead_bytes

documented here: https://redis.io/commands/info, so the situation is already greatly improved.
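
for reference, as i understand the documentation, these ratios are derived from the allocator_* fields that INFO memory also exposes, roughly like this (a sketch of the relationships based on the docs linked above, not the actual redis code):

# sketch of how the new metrics relate to the raw allocator_* fields in INFO memory,
# based on my reading of the documentation (not taken from the redis source)
def derived_metrics(info):
    allocated = info["allocator_allocated"]  # bytes the application asked the allocator for
    active = info["allocator_active"]        # bytes in the allocator's active pages
    resident = info["allocator_resident"]    # allocator pages resident in RAM
    rss = info["used_memory_rss"]            # process RSS as seen by the OS
    return {
        "allocator_frag_ratio": active / allocated,
        "allocator_frag_bytes": active - allocated,
        "allocator_rss_ratio": resident / active,
        "allocator_rss_bytes": resident - active,
        "rss_overhead_ratio": rss / resident,
        "rss_overhead_bytes": rss - resident,
    }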

this old "fragmentation" metric is wrong in many aspects, and in essence it's just a ratio of used_memory and used_memory_rss. i don't wanna deduct the lua memory from rss, and at this point i also don't wanna include it in used_memory. i'm not sure about just deducting it when calculating the pseudo fragmentation metric. @yoav-steinberg WDYT?

Comment From: yoav-steinberg

Seems like we have decent metrics and we can think of phasing out the old one. Then we won't need any specific code handling the lua allocations when showing these ratios. So I think:

  1. We can close this ticket.
  2. Maybe think of deprecating mem_fragmentation_ratio.

Comment From: oranagra

Don't know how to deprecate the old metric. the way I see it, it can maybe serve as a red flag to tell you that there may be some issue, and then you need to look at other metrics to know what exactly it is. But as pointed out in the top description, there are cases where there's no issue at all.

However, maybe the more severe problem with this metric is that people (and even monitoring software) look at it on its own, without the context of mem_fragmentation_bytes, so sometimes on a completely empty process they see a very high frag ratio and get stressed over nothing (e.g. 2mb). See: https://github.com/redis/redis/issues/9256

So considering that, even fixing this Lua issue won't really solve the problem with this metric.
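
e.g. a monitoring check that looks at both the ratio and the absolute overhead won't fire on a near-empty instance (a rough sketch with redis-py; the thresholds are made up):

import redis

r = redis.Redis(host="127.0.0.1", port=6877)
mem = r.info("memory")

ratio = mem["mem_fragmentation_ratio"]
overhead = mem.get("mem_fragmentation_bytes", 0)

# only alert when the ratio is high AND there's a meaningful amount of wasted memory,
# so an idle instance with a tiny used_memory doesn't trigger a false alarm
if ratio > 1.5 and overhead > 100 * 1024 * 1024:  # arbitrary thresholds
    print(f"possible fragmentation: ratio={ratio}, overhead={overhead} bytes")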

Comment From: yoav-steinberg

the way I see it, it can maybe serve as a red flag to tell you that there may be some issue, and then you need to look at other metrics to know what exactly it is.

It can probably serve as a red flag, but in most cases allocator_frag_ratio is a better flag, and combined with rss_overhead_ratio I think it covers all the cases mem_fragmentation_ratio handles. What am I missing?

Comment From: oranagra

there's also allocator_rss_ratio, i.e. there are 3 metrics that are, more or less, a breakdown of mem_fragmentation_ratio. but anyway, too many people are already looking at that one; i don't think we can remove it, and i'm not sure how to fix it in a way that will be better.
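
to illustrate the "breakdown" point: as i understand it, the three ratios roughly multiply out to the old metric, ignoring the small difference between used_memory and allocator_allocated (and the lua memory discussed above):

# rough illustration, not redis code:
#   allocator_frag_ratio * allocator_rss_ratio * rss_overhead_ratio
# = (active / allocated) * (resident / active) * (rss / resident)
# = rss / allocated
# ~ used_memory_rss / used_memory == mem_fragmentation_ratio
def approx_old_ratio(info):
    return (info["allocator_frag_ratio"]
            * info["allocator_rss_ratio"]
            * info["rss_overhead_ratio"])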