Context
Two Redis clusters with the same hardware and defragmentation configuration:
activedefrag yes
active-defrag-threshold-lower 30
active-defrag-threshold-upper 50
active-defrag-cycle-max 10
We perform timed writes to these two clusters through pipelines: one cluster is written with strings via pipe.Set, the other with hashes via pipe.HSet (sketched below).
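For reference, a minimal sketch of the two write patterns, assuming the go-redis v8 client (the address, key names, and payloads are illustrative placeholders, not our production code):

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8; use github.com/redis/go-redis/v9 for v9
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Cluster A pattern: plain strings via pipe.Set.
	pipeA := rdb.Pipeline()
	for i := 0; i < 1000; i++ {
		pipeA.Set(ctx, fmt.Sprintf("item:%d", i), fmt.Sprintf("payload-%d", i), 0)
	}
	if _, err := pipeA.Exec(ctx); err != nil {
		panic(err)
	}

	// Cluster B pattern: hashes via pipe.HSet.
	pipeB := rdb.Pipeline()
	for i := 0; i < 1000; i++ {
		pipeB.HSet(ctx, fmt.Sprintf("item:%d", i),
			"price", "9.99",
			"title", "example",
			"embedding", []byte("protobuf-bytes-placeholder"),
		)
	}
	if _, err := pipeB.Exec(ctx); err != nil {
		panic(err)
	}
}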
Problems
- String write: fragmentation change over time (graph)
- Hash write: fragmentation change over time (graph)
Questions
- Why do these two write patterns show opposite fragmentation behavior?
- Why does the fragmentation ratio of the hash cluster stay at around 1.5?
- Given this, how should we tune our defrag configuration? (To protect latency, the current active-defrag-cycle-max should not be increased.)
Comment From: oranagra
Hi.
I'm not entirely sure why you have fragmentation at all, or what the traffic / access pattern looks like. Specifically:
1. What is pipe.HSet?
2. Do you have eviction or expiration? (Fragmentation only occurs when you delete stuff.)
3. Even with eviction and expiration, if all writes are objects of the same sizes, they'll plug the holes created by the deletions, so I don't expect any fragmentation.
4. Maybe these graphs would even look the same with activedefrag disabled (maybe it's just the pattern of allocations, releases, and then new allocations that plug the holes)?
5. Maybe it's not fragmentation at all, but some other RSS overhead. mem_fragmentation_ratio is a misleading metric (it compares RSS to used memory); you should look at allocator_frag_ratio and allocator_frag_bytes instead.
6. You're not showing any other metrics, only the ratio. Maybe it's just a few megabytes that are influenced by something else (i.e. a ratio of 1.5 could be just 50% of 2MB).
7. I'd also like to see the defragger metrics (hits, misses, etc.).
Regardless of all the questions above, I think the main focus should be to understand why one pattern ends up with 50% fragmentation, and then either figure out where it comes from or why the defragger isn't eliminating it (i.e. focus only on your second test).
For that, please post your full INFO ALL output, MEMORY MALLOC-STATS, and maybe even /proc/<pid>/smaps.
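A minimal sketch of collecting those diagnostics, assuming the go-redis v8 client (the address is a placeholder; /proc/<pid>/smaps of course has to be read on the Redis host itself):

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Full INFO dump, all sections (equivalent to `redis-cli INFO all`).
	info, err := rdb.Info(ctx, "all").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(info)

	// jemalloc's own statistics report (equivalent to `redis-cli MEMORY MALLOC-STATS`).
	mallocStats, err := rdb.Do(ctx, "memory", "malloc-stats").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(mallocStats)

	// /proc/<pid>/smaps: read it on the server host, using the process_id field from INFO server.
}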
Comment From: ghost
Hi,
Sorry for the incomplete context. Let me explain it in detail:
1. pipe.HSet is from the go-redis SDK; it runs the HSET command through a pipeline.
2. We do have eviction. Every time we write data, the old data gets deleted: the logic is to write the new data (around 6,000,000 hashes), then delete all of the previous data once the write completes.
3. Each hash is an item profile containing fields like price, embedding, and title; the embedding is a 32*float32 array, and all fields are converted to protobuf byte arrays. The hashes vary in size but are generally similar. (As for the "plug the holes" pattern, I'm not sure whether the above logic follows it; see the sketch after this list.)
4. We will run this experiment later and see the effect.
5. Our current allocator_frag_ratio is 1.55 and mem_fragmentation_ratio is 1.61, so I guess they are almost the same in our case.
6. The memory size of each node is around 4GB; please refer to the output at the bottom.
7. How can we get the hits and misses? Could you please share the related commands?
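To make items 2-3 concrete, here is a rough sketch of one write cycle, assuming the go-redis v8 client; the key scheme, field layout, and deletion step are simplified stand-ins for our real code:

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

// writeGeneration writes the new item profiles as hashes, then deletes the previous generation.
// In reality this covers roughly 6,000,000 hashes and the deletions are issued in batches.
func writeGeneration(ctx context.Context, rdb *redis.Client, gen int, profiles map[string][]byte) error {
	pipe := rdb.Pipeline()
	for id, pb := range profiles {
		// Each field (price, title, embedding, ...) is a protobuf-encoded byte array in production;
		// a single "profile" field stands in for them here.
		pipe.HSet(ctx, fmt.Sprintf("item:%d:%s", gen, id), "profile", pb)
	}
	if _, err := pipe.Exec(ctx); err != nil {
		return err
	}

	// Only after the new generation is fully written do we delete the old keys,
	// which produces one large burst of deletions.
	oldKeys := make([]string, 0, len(profiles))
	for id := range profiles {
		oldKeys = append(oldKeys, fmt.Sprintf("item:%d:%s", gen-1, id))
	}
	return rdb.Del(ctx, oldKeys...).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address
	_ = writeGeneration(ctx, rdb, 2, map[string][]byte{"sku-1": []byte("pb-bytes")})
}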
Here is the output you requested:
# Memory
used_memory:3527086552
used_memory_human:3.28G
used_memory_rss:5668630528
used_memory_rss_human:5.28G
used_memory_peak:6368233288
used_memory_peak_human:5.93G
used_memory_peak_perc:55.39%
used_memory_overhead:916112936
used_memory_startup:1507808
used_memory_dataset:2610973616
used_memory_dataset_perc:74.06%
allocator_allocated:3527647200
allocator_active:5457444864
allocator_resident:5495586816
total_system_memory:133919506432
total_system_memory_human:124.72G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:32212254720
maxmemory_human:30.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:1.55
allocator_frag_bytes:1929797664
allocator_rss_ratio:1.01
allocator_rss_bytes:38141952
rss_overhead_ratio:1.03
rss_overhead_bytes:173043712
mem_fragmentation_ratio:1.61
mem_fragmentation_bytes:2141564992
mem_not_counted_for_evict:0
mem_replication_backlog:536870912
mem_clients_slaves:0
mem_clients_normal:225608
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
___ Begin jemalloc statistics ___
Version: "5.1.0-0-g0"
Build-time option settings
config.cache_oblivious: true
config.debug: false
config.fill: true
config.lazy_lock: false
config.malloc_conf: ""
config.prof: false
config.prof_libgcc: false
config.prof_libunwind: false
config.stats: true
config.utrace: false
config.xmalloc: false
Run-time option settings
opt.abort: false
opt.abort_conf: false
opt.retain: true
opt.dss: "secondary"
opt.narenas: 64
opt.percpu_arena: "disabled"
opt.metadata_thp: "disabled"
opt.background_thread: false (background_thread: true)
opt.dirty_decay_ms: 10000 (arenas.dirty_decay_ms: 10000)
opt.muzzy_decay_ms: 10000 (arenas.muzzy_decay_ms: 10000)
opt.junk: "false"
opt.zero: false
opt.tcache: true
opt.lg_tcache_max: 15
opt.thp: "default"
opt.stats_print: false
opt.stats_print_opts: ""
Arenas: 64
Quantum size: 8
Page size: 65536
Maximum thread-cached size class: 229376
Number of bin size classes: 55
Number of thread-cache bin size classes: 55
Number of large size classes: 180
Allocated: 3527746568, active: 5454430208, metadata: 40341136 (n_thp 0), resident: 5495062528, mapped: 5500895232, retained: 1322188800
Background threads: 2, num_runs: 233348, run_interval: 12308867420 ns
n_lock_ops n_waiting n_spin_acq n_owner_switch total_wait_ns max_wait_ns max_n_thds
background_thread 62439312 0 0 3 0 0 0
ctl 124878347 0 0 1 0 0 0
prof 0 0 0 0 0 0 0
Merged arenas stats:
assigned threads: 2
uptime: 2872286262112012
dss allocation precedence: "N/A"
decaying: time npages sweeps madvises purged
dirty: N/A 0 170459 463588 2173115
muzzy: N/A 5 159218 449893 2116183
allocated nmalloc ndalloc nrequests
small: 2855674888 6495012815 6435156494 19537633801
large: 672071680 8 3 8
total: 3527746568 6495012823 6435156497 19537633809
active: 5454430208
mapped: 5500895232
retained: 1322188800
base: 40225928
internal: 115208
metadata_thp: 0
tcache_bytes: 690760
resident: 5495062528
n_lock_ops n_waiting n_spin_acq n_owner_switch total_wait_ns max_wait_ns max_n_thds
large 59967139 0 0 2 0 0 0
extent_avail 60612946 2 0 165506 0 0 1
extents_dirty 62421104 16 217 341283 0 0 1
extents_muzzy 61446762 0 2 355302 0 0 0
extents_retained 61395083 0 11 297560 0 0 0
decay_dirty 67901948 0 25 468229 0 0 0
decay_muzzy 67743111 0 26 446503 0 0 0
base 119966150 0 0 5 0 0 0
tcache_list 59967141 0 0 3 0 0 0
bins: size ind allocated nmalloc ndalloc nrequests curregs curslabs regs pgs util nfills nflushes nslabs nreslabs n_lock_ops n_waiting n_spin_acq n_owner_switch total_wait_ns max_wait_ns max_n_thds
8 0 92076560 892933164 881423594 1207540533 11509570 2226 8192 1 0.631 27208199 8931475 51785 16861739 12846168374 0 0 107767412 0 0 0
16 1 135684400 995310447 986830172 2573481909 8480275 3583 4096 1 0.577 10817071 9554144 37077 44312941 13245235444 0 0 108562154 0 0 0
24 2 453582480 2300646131 2281746861 3478989016 18899270 3825 8192 3 0.603 24675749 21685008 47030 62169749 28775983935 0 0 311984105 0 0 0
32 3 166355008 403902757 398704163 1817220876 5198594 4279 2048 1 0.593 11095154 4340597 38719 7981833 7475578799 0 0 75091788 0 0 0
40 4 64494520 132234017 130621654 555447521 1612363 300 8192 5 0.656 4205789 2468360 8282 1162127 1541505245 0 0 7758890 0 0 0
48 5 101394576 164834827 162722440 6620614719 2112387 775 4096 3 0.665 4862320 2433319 30886 120442 2174554201 0 0 3943632 0 0 0
56 6 237804112 761759652 757513150 1576661420 4246502 1026 8192 7 0.505 12027278 8045582 1748 2563271 3312947646 0 0 7666130 0 0 0
64 7 2016576 244468964 244437455 628473312 31509 74 1024 1 0.415 7109834 3460271 943 19044655 76857556 0 0 2180118 0 0 0
80 8 3047440 12249449 12211356 154989954 38093 24 4096 5 0.387 1047202 1976508 93 349012 62992531 0 0 3038 0 0 0
96 9 367658112 289856827 286027055 308439826 3829772 3272 2048 3 0.571 7203311 3228387 57853 4556250 6053214063 0 0 50857948 0 0 0
112 10 4701424 6512247 6470270 10235579 41977 20 4096 7 0.512 889759 1482897 58 360012 62339891 0 0 2 0 0 0
128 11 2764544 3422565 3400967 5827700 21598 80 512 1 0.527 675819 1108682 492 325316 61752544 0 0 2 0 0 0
160 12 4374240 3896768 3869429 2951151 27339 26 2048 5 0.513 488207 1066378 92 43799 61521882 0 0 2 0 0 0
192 13 4416 680307 680284 781938 23 1 1024 3 0.022 126525 239919 426 ... [remaining output truncated]
Comment From: oranagra
OK, so we have roughly 55% fragmentation, totaling nearly 2GB (allocator_frag_ratio:1.55, allocator_frag_bytes:1929797664).
We currently don't see active defrag running (active_defrag_running:0), but I assume it does kick in, since the fragmentation is above the thresholds (active-defrag-ignore-bytes and active-defrag-threshold-lower).
If we want to see what it can achieve, we need INFO stats output, ideally sampled twice at some interval.
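For example, something along these lines: a sketch assuming the go-redis v8 client, diffing the defrag counters (active_defrag_hits, active_defrag_misses, active_defrag_key_hits, active_defrag_key_misses from INFO stats) across an interval:

package main

import (
	"bufio"
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

// defragStats extracts the active_defrag_* counters from an INFO stats dump.
func defragStats(info string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(info))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "active_defrag_") {
			if kv := strings.SplitN(line, ":", 2); len(kv) == 2 {
				out[kv[0]] = kv[1]
			}
		}
	}
	return out
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	first := defragStats(rdb.Info(ctx, "stats").Val())
	time.Sleep(60 * time.Second) // sampling interval; pick whatever window makes sense
	second := defragStats(rdb.Info(ctx, "stats").Val())

	for k, v := range second {
		fmt.Printf("%s: %s -> %s\n", k, first[k], v)
	}
}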
Indeed we see high fragmentation in the allocator bins. For instance, the bin for 96 bytes holds roughly 370MB of allocations at 0.571 utilization, so on the order of 275MB of that bin's slab space is currently wasted.
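Back-of-envelope, taking utilization as allocated bytes divided by the bin's total slab capacity: 367,658,112 / 0.571 ≈ 644MB of slab capacity, so about 644MB - 368MB ≈ 276MB of slack in that one bin, which only new 96-byte-class allocations or the defragger can reclaim.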
I think the next thing to understand is your access-pattern cycle. You said you do have eviction, but also that you delete the previous run's data, so maybe you only have a short burst of deletions that creates a large debt of fragmentation, which the defragger then takes a long time to pay off.
Maybe all that's needed is to increase active-defrag-cycle-min and active-defrag-cycle-max to let it work harder (which may come at the expense of latency).
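For illustration only, in the same redis.conf style as the configuration above (the numbers are placeholders showing which knobs are involved, not a recommendation; they can also be changed at runtime with CONFIG SET, no restart needed):
activedefrag yes
# minimal CPU percentage the defragger may use per cycle
active-defrag-cycle-min 10
# maximal CPU percentage the defragger may use per cycle
active-defrag-cycle-max 25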
It would be best to stop all write traffic and see whether the defragger is then able to eventually undo all the fragmentation.
It would also be nice to plot a few other metrics next to the fragmentation graph on the same timeline, like active_defrag_running, active_defrag_hits, and so on.
Coming back to your original claim that for strings everything is fine while for hashes it isn't: how did you test that?
It would also be interesting to see whether your hashes are ziplist-encoded or hashtable (dict) encoded. Try running OBJECT ENCODING <key> on a few hashes of different sizes.
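A small sketch of that check, assuming the go-redis v8 client (the key names are placeholders); small hashes typically report "ziplist" (or "listpack" on newer Redis versions) and larger ones "hashtable":

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumption: go-redis v8
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Placeholder keys; pick a few real item hashes of different sizes.
	for _, key := range []string{"item:small", "item:medium", "item:large"} {
		enc, err := rdb.ObjectEncoding(ctx, key).Result()
		if err != nil {
			fmt.Println(key, "error:", err)
			continue
		}
		fmt.Println(key, "->", enc) // e.g. "ziplist"/"listpack" vs "hashtable"
	}
}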