Context
Two Redis clusters with the same hardware and defragmentation configuration:
activedefrag yes
active-defrag-threshold-lower 30
active-defrag-threshold-upper 50
active-defrag-cycle-max 10
We perform timed writes to these two clusters through pipelines: one cluster is written with strings via pipe.Set, the other with hashes via pipe.HSet (sketched below).
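For reference, a minimal sketch of the two write patterns, assuming the go-redis v8 client (the address, key names, and payloads are illustrative placeholders, not our production code):

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8; use github.com/redis/go-redis/v9 for v9
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Cluster A pattern: plain strings via pipe.Set.
	pipeA := rdb.Pipeline()
	for i := 0; i < 1000; i++ {
		pipeA.Set(ctx, fmt.Sprintf("item:%d", i), fmt.Sprintf("payload-%d", i), 0)
	}
	if _, err := pipeA.Exec(ctx); err != nil {
		panic(err)
	}

	// Cluster B pattern: hashes via pipe.HSet.
	pipeB := rdb.Pipeline()
	for i := 0; i < 1000; i++ {
		pipeB.HSet(ctx, fmt.Sprintf("item:%d", i),
			"price", "9.99",
			"title", "example",
			"embedding", []byte("protobuf-bytes-placeholder"),
		)
	}
	if _, err := pipeB.Exec(ctx); err != nil {
		panic(err)
	}
}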
Problems
- String write: fragmentation change over time (graph)
- Hash write: fragmentation change over time (graph)
Questions
- Why do these two write patterns show opposite fragmentation behavior?
- Why does the fragmentation ratio of the hash cluster stay at around 1.5?
- Given this, how should we tune our defrag configuration? (To protect latency, the current active-defrag-cycle-max should not be increased.)
Comment From: oranagra
Hi.
I'm not entirely sure why you have fragmentation at all, or what the traffic / access pattern looks like. Specifically:
1. What is pipe.HSet?
2. Do you have eviction or expiration? (Fragmentation only occurs when you delete stuff.)
3. Even with eviction and expiration, if all writes are objects of the same sizes, they'll plug the holes created by the deletions, so I don't expect any fragmentation.
4. Maybe these graphs would even look the same with activedefrag disabled (maybe it's just the pattern of allocations, releases, and then new allocations that plug the holes)?
5. Maybe it's not fragmentation at all, but some other RSS overhead. mem_fragmentation_ratio is a misleading metric (it compares RSS to used memory); you should look at allocator_frag_ratio and allocator_frag_bytes instead.
6. You're not showing any other metrics, only the ratio. Maybe it's just a few megabytes that are influenced by something else (i.e. a ratio of 1.5 could be just 50% of 2MB).
7. I'd also like to see the defragger metrics (hits, misses, etc.).
Regardless of all the questions above, I think the main focus should be to understand why one pattern ends up with 50% fragmentation, and then either figure out where it comes from or why the defragger isn't eliminating it (i.e. focus only on your second test).
For that, please post your full INFO ALL output, MEMORY MALLOC-STATS, and maybe even /proc/<pid>/smaps.
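A minimal sketch of collecting those diagnostics, assuming the go-redis v8 client (the address is a placeholder; /proc/<pid>/smaps of course has to be read on the Redis host itself):

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Full INFO dump, all sections (equivalent to `redis-cli INFO all`).
	info, err := rdb.Info(ctx, "all").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(info)

	// jemalloc's own statistics report (equivalent to `redis-cli MEMORY MALLOC-STATS`).
	mallocStats, err := rdb.Do(ctx, "memory", "malloc-stats").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(mallocStats)

	// /proc/<pid>/smaps: read it on the server host, using the process_id field from INFO server.
}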
Comment From: ghost
Hi,
Sorry for the incomplete context. Let me explain it in detail:
1. pipe.HSet is from the go-redis SDK; it runs the HSET command through a pipeline.
2. We do have eviction. Every time we write data, the old data gets deleted: the logic is to write the new data (around 6,000,000 hashes), then delete all of the previous data once the write completes.
3. Each hash is an item profile containing fields like price, embedding, and title; the embedding is a 32*float32 array, and all fields are converted to protobuf byte arrays. The hashes vary in size but are generally similar. (As for the "plug the holes" pattern, I'm not sure whether the above logic follows it; see the sketch after this list.)
4. We will run this experiment later and see the effect.
5. Our current allocator_frag_ratio is 1.55 and mem_fragmentation_ratio is 1.61, so I guess they are almost the same in our case.
6. The memory size of each node is around 4GB; please refer to the output at the bottom.
7. How can we get the hits and misses? Could you please share the related commands?
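To make items 2-3 concrete, here is a rough sketch of one write cycle, assuming the go-redis v8 client; the key scheme, field layout, and deletion step are simplified stand-ins for our real code:

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

// writeGeneration writes the new item profiles as hashes, then deletes the previous generation.
// In reality this covers roughly 6,000,000 hashes and the deletions are issued in batches.
func writeGeneration(ctx context.Context, rdb *redis.Client, gen int, profiles map[string][]byte) error {
	pipe := rdb.Pipeline()
	for id, pb := range profiles {
		// Each field (price, title, embedding, ...) is a protobuf-encoded byte array in production;
		// a single "profile" field stands in for them here.
		pipe.HSet(ctx, fmt.Sprintf("item:%d:%s", gen, id), "profile", pb)
	}
	if _, err := pipe.Exec(ctx); err != nil {
		return err
	}

	// Only after the new generation is fully written do we delete the old keys,
	// which produces one large burst of deletions.
	oldKeys := make([]string, 0, len(profiles))
	for id := range profiles {
		oldKeys = append(oldKeys, fmt.Sprintf("item:%d:%s", gen-1, id))
	}
	return rdb.Del(ctx, oldKeys...).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address
	_ = writeGeneration(ctx, rdb, 2, map[string][]byte{"sku-1": []byte("pb-bytes")})
}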
Here is the output you requested:
# Memory
used_memory:3527086552
used_memory_human:3.28G
used_memory_rss:5668630528
used_memory_rss_human:5.28G
used_memory_peak:6368233288
used_memory_peak_human:5.93G
used_memory_peak_perc:55.39%
used_memory_overhead:916112936
used_memory_startup:1507808
used_memory_dataset:2610973616
used_memory_dataset_perc:74.06%
allocator_allocated:3527647200
allocator_active:5457444864
allocator_resident:5495586816
total_system_memory:133919506432
total_system_memory_human:124.72G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:32212254720
maxmemory_human:30.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:1.55
allocator_frag_bytes:1929797664
allocator_rss_ratio:1.01
allocator_rss_bytes:38141952
rss_overhead_ratio:1.03
rss_overhead_bytes:173043712
mem_fragmentation_ratio:1.61
mem_fragmentation_bytes:2141564992
mem_not_counted_for_evict:0
mem_replication_backlog:536870912
mem_clients_slaves:0
mem_clients_normal:225608
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
___ Begin jemalloc statistics ___
Version: "5.1.0-0-g0"
Build-time option settings
config.cache_oblivious: true
config.debug: false
config.fill: true
config.lazy_lock: false
config.malloc_conf: ""
config.prof: false
config.prof_libgcc: false
config.prof_libunwind: false
config.stats: true
config.utrace: false
config.xmalloc: false
Run-time option settings
opt.abort: false
opt.abort_conf: false
opt.retain: true
opt.dss: "secondary"
opt.narenas: 64
opt.percpu_arena: "disabled"
opt.metadata_thp: "disabled"
opt.background_thread: false (background_thread: true)
opt.dirty_decay_ms: 10000 (arenas.dirty_decay_ms: 10000)
opt.muzzy_decay_ms: 10000 (arenas.muzzy_decay_ms: 10000)
opt.junk: "false"
opt.zero: false
opt.tcache: true
opt.lg_tcache_max: 15
opt.thp: "default"
opt.stats_print: false
opt.stats_print_opts: ""
Arenas: 64
Quantum size: 8
Page size: 65536
Maximum thread-cached size class: 229376
Number of bin size classes: 55
Number of thread-cache bin size classes: 55
Number of large size classes: 180
Allocated: 3527746568, active: 5454430208, metadata: 40341136 (n_thp 0), resident: 5495062528, mapped: 5500895232, retained: 1322188800
Background threads: 2, num_runs: 233348, run_interval: 12308867420 ns
n_lock_ops n_waiting n_spin_acq n_owner_switch total_wait_ns max_wait_ns max_n_thds
background_thread 62439312 0 0 3 0 0 0
ctl 124878347 0 0 1 0 0 0
prof 0 0 0 0 0 0 0
Merged arenas stats:
assigned threads: 2
uptime: 2872286262112012
dss allocation precedence: "N/A"
decaying: time npages sweeps madvises purged
dirty: N/A 0 170459 463588 2173115
muzzy: N/A 5 159218 449893 2116183
allocated nmalloc ndalloc nrequests
small: 2855674888 6495012815 6435156494 19537633801
large: 672071680 8 3 8
total: 3527746568 6495012823 6435156497 19537633809
active: 5454430208
mapped: 5500895232
retained: 1322188800
base: 40225928
internal: 115208
metadata_thp: 0
tcache_bytes: 690760
resident: 5495062528
n_lock_ops n_waiting n_spin_acq n_owner_switch total_wait_ns max_wait_ns max_n_thds
large 59967139 0 0 2 0 0 0
extent_avail 60612946 2 0 165506 0 0 1
extents_dirty 62421104 16 217 341283 0 0 1
extents_muzzy 61446762 0 2 355302 0 0 0
extents_retained 61395083 0 11 297560 0 0 0
decay_dirty 67901948 0 25 468229 0 0 0
decay_muzzy 67743111 0 26 446503 0 0 0
base 119966150 0 0 5 0 0 0
tcache_list 59967141 0 0 3 0 0 0
bins: size ind allocated nmalloc ndalloc nrequests curregs curslabs regs pgs util nfills nflushes nslabs nreslabs n_lock_ops n_waiting n_spin_acq n_owner_switch total_wait_ns max_wait_ns max_n_thds
8 0 92076560 892933164 881423594 1207540533 11509570 2226 8192 1 0.631 27208199 8931475 51785 16861739 12846168374 0 0 107767412 0 0 0
16 1 135684400 995310447 986830172 2573481909 8480275 3583 4096 1 0.577 10817071 9554144 37077 44312941 13245235444 0 0 108562154 0 0 0
24 2 453582480 2300646131 2281746861 3478989016 18899270 3825 8192 3 0.603 24675749 21685008 47030 62169749 28775983935 0 0 311984105 0 0 0
32 3 166355008 403902757 398704163 1817220876 5198594 4279 2048 1 0.593 11095154 4340597 38719 7981833 7475578799 0 0 75091788 0 0 0
40 4 64494520 132234017 130621654 555447521 1612363 300 8192 5 0.656 4205789 2468360 8282 1162127 1541505245 0 0 7758890 0 0 0
48 5 101394576 164834827 162722440 6620614719 2112387 775 4096 3 0.665 4862320 2433319 30886 120442 2174554201 0 0 3943632 0 0 0
56 6 237804112 761759652 757513150 1576661420 4246502 1026 8192 7 0.505 12027278 8045582 1748 2563271 3312947646 0 0 7666130 0 0 0
64 7 2016576 244468964 244437455 628473312 31509 74 1024 1 0.415 7109834 3460271 943 19044655 76857556 0 0 2180118 0 0 0
80 8 3047440 12249449 12211356 154989954 38093 24 4096 5 0.387 1047202 1976508 93 349012 62992531 0 0 3038 0 0 0
96 9 367658112 289856827 286027055 308439826 3829772 3272 2048 3 0.571 7203311 3228387 57853 4556250 6053214063 0 0 50857948 0 0 0
112 10 4701424 6512247 6470270 10235579 41977 20 4096 7 0.512 889759 1482897 58 360012 62339891 0 0 2 0 0 0
128 11 2764544 3422565 3400967 5827700 21598 80 512 1 0.527 675819 1108682 492 325316 61752544 0 0 2 0 0 0
160 12 4374240 3896768 3869429 2951151 27339 26 2048 5 0.513 488207 1066378 92 43799 61521882 0 0 2 0 0 0
192 13 4416 680307 680284 781938 23 1 1024 3 0.022 126525 239919 426 ... [remaining output truncated]
Comment From: oranagra
OK, so we have roughly 55% fragmentation, totaling nearly 2GB (allocator_frag_ratio:1.55, allocator_frag_bytes:1929797664).
We currently don't see active defrag running (active_defrag_running:0), but I assume it does kick in, since the fragmentation is above the thresholds (active-defrag-ignore-bytes and active-defrag-threshold-lower).
If we want to see what it can achieve, we need INFO stats output, ideally sampled twice at some interval.
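For example, something along these lines: a sketch assuming the go-redis v8 client, diffing the defrag counters (active_defrag_hits, active_defrag_misses, active_defrag_key_hits, active_defrag_key_misses from INFO stats) across an interval:

package main

import (
	"bufio"
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

// defragStats extracts the active_defrag_* counters from an INFO stats dump.
func defragStats(info string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(info))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "active_defrag_") {
			if kv := strings.SplitN(line, ":", 2); len(kv) == 2 {
				out[kv[0]] = kv[1]
			}
		}
	}
	return out
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	first := defragStats(rdb.Info(ctx, "stats").Val())
	time.Sleep(60 * time.Second) // sampling interval; pick whatever window makes sense
	second := defragStats(rdb.Info(ctx, "stats").Val())

	for k, v := range second {
		fmt.Printf("%s: %s -> %s\n", k, first[k], v)
	}
}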
Indeed we see high fragmentation in the allocator bins. For instance, the bin for 96 bytes holds roughly 370MB of allocations at 0.571 utilization, so on the order of 275MB of that bin's slab space is currently wasted.
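Back-of-envelope, taking utilization as allocated bytes divided by the bin's total slab capacity: 367,658,112 / 0.571 ≈ 644MB of slab capacity, so about 644MB - 368MB ≈ 276MB of slack in that one bin, which only new 96-byte-class allocations or the defragger can reclaim.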
I think the next thing to understand is your access-pattern cycle. You said you do have eviction, but also that you delete the previous run's data, so maybe you only have a short burst of deletions that creates a large debt of fragmentation, which the defragger then takes a long time to pay off.
Maybe all that's needed is to increase active-defrag-cycle-min and active-defrag-cycle-max to let it work harder (which may come at the expense of latency).
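For illustration only, in the same redis.conf style as the configuration above (the numbers are placeholders showing which knobs are involved, not a recommendation; they can also be changed at runtime with CONFIG SET, no restart needed):
activedefrag yes
# minimal CPU percentage the defragger may use per cycle
active-defrag-cycle-min 10
# maximal CPU percentage the defragger may use per cycle
active-defrag-cycle-max 25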
It would be best to stop all write traffic and see whether the defragger is then able to eventually undo all the fragmentation.
It would also be nice to plot a few other metrics next to the fragmentation graph on the same timeline, like active_defrag_running, active_defrag_hits, and so on.
Coming back to your original claim that for strings everything is fine while for hashes it isn't: how did you test that?
It would also be interesting to see whether your hashes are ziplist-encoded or hashtable (dict) encoded. Try running OBJECT ENCODING <key> on a few hashes of different sizes.
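A small sketch of that check, assuming the go-redis v8 client (the key names are placeholders); small hashes typically report "ziplist" (or "listpack" on newer Redis versions) and larger ones "hashtable":

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Placeholder keys; pick a few real item hashes of different sizes.
	for _, key := range []string{"item:small", "item:medium", "item:large"} {
		enc, err := rdb.ObjectEncoding(ctx, key).Result()
		if err != nil {
			fmt.Println(key, "error:", err)
			continue
		}
		fmt.Println(key, "->", enc) // e.g. "ziplist"/"listpack" vs "hashtable"
	}
}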