The problem

When inserting or deleting elements in the middle of ziplist, ziplist may have a cascading effect. ziplist is already used in streams, t_zset and quicklist should also use listpack instead of ziplist.

Description of the feature

Changes: 1) quicklist is compatible with the old ziplist storage format. 1) Use listpack instead of ziplsit in t_zset.c. 2) Use listpack instead of ziplsit in quicklist. 2) Use listpack instead of ziplsit in hash. 3) Personality of rdb format. - If deep sanitization is disable, does not convert when reading rdb, because the conversion from ziplist to listpack will cause slows downs rdb loading, We should convert when quicklistNode is being accessed. - if deep sanitization is enabled, direct ziplist to listpack conversion.

Comment From: zuiderkwast

If we're going to do take on this effort, we could as well take the step to be even more efficient than listpack.

I've figured we can save even more space by utilizing the suffix not only for storing a reverse-encoded length, but to tag the value itself and put payload bits in it. I have draft specs (and an incomplete proof-of-concept) here:

  • Encoding: https://github.com/zuiderkwast/bidipack/blob/main/bidipack.md
  • Size comparison with ziplist and listpack: https://github.com/zuiderkwast/bidipack/blob/main/comparison.md#element-encoding-sizes

Comment From: sundb

@zuiderkwast The reason for using listpack instead of ziplist is that ziplist may cause cascading updates when insert and delete in middle, which is the biggest problem. capacity is incremented by 1, 2 or 4, this can lead to a lot of memory waste, ziplsit and listpack are designed to save memory, so is it a loss or gain?

Comment From: zuiderkwast

Thanks for reading it. :-) I know cascading updates is the main potential problem and that there was a bug which was fixed and that's why listpack was designed. Now, it doesn't seem to be a problem in practice anymore though.

Listpack also stores numbers up to 127 in a slightly more compact way. The compactness can be explored further. That's my point. My "bidipack" is not a complete spec yet.

I'll clarify the capacity idea. It's not incremented by 1, 2 or 4 times. (It was badly explained by me.) It is the input to the capacity formula, which if you increment it by 1 gives the underlying allocation sizes of the allocations used by jemalloc (8, 16, 32, 48, ..., 8 KiB, 10 KiB, 12 KiB, .., 160 KiB, 192 KiB, 224 KiB, 256 KiB, ...) as per http://jemalloc.net/jemalloc.3.html#size_classes i.e. four sizes before the allocation is doubled. There is no point to allocate something in between these sizes, since it would be unused anyway. It's the sizes returned by malloc_size() for any allocation.

The problems this allocation scheme is designed to solve is the extra memory moves caused by realloc followed by memmove for every insertion. It's a problem which has been mentioned e.g. by @oranagra in this comment This is not the main part idea of the bidipack though and it can be completely skipped.

Comment From: oranagra

@zuiderkwast i guess you meant this comment

i didn't have time to read the entire thing yet, but from the portion i read i think that: 1. the idea of doing prereallocation is nice speedup, but it obviously comes on the expense of memory. for quicklist and stream it may be ok, since we'll trim that before creating a new node and the excess may be insignificant, but in hash, and zset it's probably not a good idea. 2. the idea of encoding growth strategy inside the listpack is probably not useful, even if we add some global config for that strategy (unlikely), we're very unlikely to add a per-key config. 3. encoding the capacity may not be needed, since we can call malloc_size, or even just do a realloc and count on it being very fast when it's a NOP. that's what i means in that other comment, when IIRC at that time i misunderstood that PR (that we should benchmark and see if there's reason to avoid these NOP calls). 4. one of the reasons behind listpack, was not only to avoid the cascading updates of ziplist, but also to simplify the code and make it less bug prone (we suspect there's still a bug in ziplist causing corruption, but we can't found it).

Comment From: zuiderkwast

Thanks for finding the relevant comment. I suppose many of those things in my draft are not needed, e.g. alloc strategy (unless to optimize for append only, i.e. streams), but the way I wrote that description was as a stand-alone idea and malloc_size isn't widely used. Avoiding calls to malloc_size could be a small speedup and avoiding realloc when it does move data is good to avoid moving data twice, but all that can be skipped for simplicity.

Comment From: huangzhw

@zuiderkwast Very interesting idea. Maybe a little complicated.

Comment From: zuiderkwast

Thanks @huangzhw, I think we can pick only a small part of it, to make it simple. Or just use listpack since it's already battle-tested in streams.

It could be useful to store a version tag somewhere to be able to update stuff in the future and to support multiple encodings.

Comment From: oranagra

you can't obviously change the encoding of listpacks in streams without breaking compatibility with old rdb files.

what we can do (either now, or in the future), is use different rdb opcodes when we store the new format like create RDB_TYPE_LIST_QUICKLIST_LISTPACK and RDB_TYPE_STREAM_BIDIPACK (same as the RDB_TYPE_ZSET_2 we have now).

then, we can either convert all payloads at rdb load time (if it's just about adding a version header), or support both side by side (i.e. support both listpack and bidipack, or a listpack with version header field, like we're gonna do for supporting ziplist side by side with listpack).

some formats, like hash, can simply have another (3rd or 4th) encoding, others like quicklist, have an "encoding" (container) field per node (see QUICKLIST_NODE_CONTAINER_ZIPLIST), so we can support both listpack and bidipack.