The known problems with the hash table:

  • huge allocations makes malloc/free freeze the system (#8611)
  • even more memory is used while incremental rehashing is ongoing
  • (anything else?)

It has been discussed to replace the hash table with a rax, but I think a HAMT is a better idea.

Hash Array Mapped Trie (HAMT) is a kind of nested hash table or hash tree.

It's a hash table of (a fixed number, typically) 32 slots, where each slot is either empty, an entry or (on collition) a subnode with the same structure. For each level, 5 bits of the hashed key is used as an index into the 32 slots. To save memory, each 32-slot node is stored in a compact way using a bitmap to flag the existence of each of the 32 potential slots.

  • Introduction: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
  • Detailed paper: https://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf

Benefits

  • No rehashing needed.
  • No huge allocations.
  • It's automatically balanced (if our hash function is good, which it is).
  • Speed comparable to classic dict.
  • Memory usage less than classic dict (roughly 2 words less per entry).
  • SCAN and random key are simple to implement.
  • All operations are safe during iteration.
  • Fairly simple design.

Why not a Radix Tree (RAX)?

  • The rax needs much memory for large keys without a shared prefix.
  • There are no known SCAN and fair random key selection algorithms.

The HAMT is similar to a RAX, but instead of the key itself, the hashed key is used as the path in the tree. Since a hashed key is a number, it can also be used as a cursor. A random number used as a cursor picks a random key.

Comment From: madolson

A note about the RAX is that it's ordered, which allows some cool stuff like deleting all keys with a given prefix.

Comment From: hpatro

Should we be making this modular to allow the customer choose the mechanism ?

  • Hash Table [Default]
  • HAMT [Provided with 7.0]
  • Radix Tree [If somebody implements it]

This would avoid any possible bug causing issues having nothing to fallback to. Also if the customers have usecase of lot's of keys with same prefix, they could take advantage of Radix tree.

Comment From: Super-long

Should we be making this modular to allow the customer choose the mechanism ?

  • Hash Table [Default]
  • HAMT [Provided with 7.0]
  • Radix Tree [If somebody implements it]

This would avoid any possible bug causing issues having nothing to fallback to. Also if the customers have usecase of lot's of keys with same prefix, they could take advantage of Radix tree.

I think this idea looks great, so we seem to have the ability to subsequently introduce more customer-specific data sets with special optimizations. It seems that changing the first member table in the struct dict to an enum is a feasible method.