Right now sets only have two encodings, intset and hashtable. The intset is a fairly memory-dense structure, but it only supports integers. The hashtable alternative, however, has 32 bytes of overhead per entry, plus additional overhead for the dictionary itself. This means that for small sets, you pay a lot of memory in overhead. Unlike other collections, such as hashes and sorted sets, the set has no compact memory representation for when it is small.

The proposed solution here is to support a listpack encoding for sets. When a set is below some number of items or value size, the entries will simply be stored as a listpack of sequential items. Once the threshold is breached, the listpack will be converted into either an intset or a hashtable, depending on the content.
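As a rough sketch of the proposed decision, the encoding choice after an add might look like the following. The threshold constants and function names here are illustrative placeholders, not actual Redis configuration parameters or internals:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thresholds mirroring the proposal; the names are
 * illustrative, not real Redis config options. */
#define SET_MAX_LISTPACK_ENTRIES 128
#define SET_MAX_LISTPACK_VALUE   64

typedef enum { ENC_LISTPACK, ENC_INTSET, ENC_HASHTABLE } set_encoding;

/* Decide the encoding after adding an element: stay on listpack while
 * the set is small and its values are short; once a threshold is
 * breached, expand to intset (all integers) or hashtable (otherwise). */
static set_encoding encoding_after_add(size_t num_entries,
                                       size_t max_elem_len,
                                       bool all_integers) {
    if (num_entries <= SET_MAX_LISTPACK_ENTRIES &&
        max_elem_len <= SET_MAX_LISTPACK_VALUE)
        return ENC_LISTPACK;
    return all_integers ? ENC_INTSET : ENC_HASHTABLE;
}
```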

Comment From: zuiderkwast

When a set is below some number of items or value size, the entries will simply be stored as a listpack of sequential items. Once the threshold is breached, the listpack will be converted into either an intset or a hashtable, depending on the content.

We'd have to scan through the listpack to check if all elements are integers. For integers, I think it seems more straightforward to start off with an intset and then convert it to either listpack or hashtable when it grows or non-integers are added. For non-integers, obviously we'll start off with a listpack and convert to hashtable when it grows large.

Currently, we never convert a hashtable back to intset when elements are removed, so I think we don't need to convert back to listpacks. We don't do it for e.g. hashes either.
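The conversion flow described in this comment (start with intset for integers or listpack for strings, promote on growth or non-integer adds, never downgrade) could be sketched roughly like this. The threshold names and values are illustrative, not actual Redis configuration:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative limits, not real Redis config option names. */
#define SET_MAX_LISTPACK_ENTRIES 128
#define SET_MAX_INTSET_ENTRIES   512

typedef enum { ENC_LISTPACK, ENC_INTSET, ENC_HASHTABLE } set_encoding;

/* Pick the encoding after an add, given the current encoding, whether
 * the new element is an integer, and the resulting set size. */
static set_encoding next_encoding(set_encoding cur, bool adding_integer,
                                  size_t new_size) {
    switch (cur) {
    case ENC_INTSET:
        /* A non-integer forces a conversion: listpack if still small,
         * hashtable otherwise. Growing past the intset limit also
         * promotes to hashtable. */
        if (!adding_integer)
            return new_size <= SET_MAX_LISTPACK_ENTRIES ? ENC_LISTPACK
                                                        : ENC_HASHTABLE;
        if (new_size > SET_MAX_INTSET_ENTRIES) return ENC_HASHTABLE;
        return ENC_INTSET;
    case ENC_LISTPACK:
        /* Listpacks hold mixed content; only size forces a promotion. */
        return new_size > SET_MAX_LISTPACK_ENTRIES ? ENC_HASHTABLE
                                                   : ENC_LISTPACK;
    case ENC_HASHTABLE:
        /* Conversions are one-way; we never downgrade on removal. */
        return ENC_HASHTABLE;
    }
    return cur;
}
```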

Comment From: madolson

We'd have to scan through the listpack to check if all elements are integers. For integers, I think it seems more straightforward to start off with an intset and then convert it to either listpack or hashtable when it grows or non-integers are added. For non-integers, obviously we'll start off with a listpack and convert to hashtable when it grows large.

The scan should still be fast, but I think that either option makes sense.
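The scan being discussed, checking whether every element parses as an integer, is a single O(n) pass and stays cheap because the listpack is bounded by the small-set threshold anyway. A simplified sketch over an array of C strings (a real implementation would iterate listpack entries instead):

```c
#include <stdbool.h>
#include <stddef.h>

/* True if s is a decimal integer: optional sign, then at least one
 * digit and nothing else. */
static bool is_integer(const char *s) {
    if (*s == '-' || *s == '+') s++;
    if (*s == '\0') return false;
    for (; *s; s++)
        if (*s < '0' || *s > '9') return false;
    return true;
}

/* One linear pass over the elements: if everything is an integer, the
 * listpack can be expanded into an intset; otherwise a hashtable. */
static bool all_elements_integers(const char **elems, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (!is_integer(elems[i])) return false;
    return true;
}
```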