The problem/use-case that the feature addresses

I found the hash function MurmurHash64A is the most hotspot when running benchmark for pfadd command. The benchmark I used is https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-pfadd-4KB-values-pipeline-10.yml

According to existing implementation of MurmurHash64A, each loop will process 8 bytes data and it has data dependency k *= m; k ^= k >> r; k *= m; h ^= k; h *= m; which make it execute in sequential order, and can't benefit from Instruction Level Parallelism of modern CPU.

Description of the feature

Provide a more efficiency hash function for hyperloglog, Cityhash is an option, it is widely used in industry. According to the CityHash64WithSeed implementation , in the core loop, when data length is larger than 64 bytes, each loop will process 64 bytes data, and it can be more parallel compared with MurmurHash64A.

In my local test environment (Intel Xeon Platinum 8380), the score can increase 19% when data size is 4096 bytes. I also test the small data size use above test suite by adjust the test data size. The larger data size, the better performance.

Redis [NEW] Performance improvement of hash function for hyperloglog

Hash Quality https://rurban.github.io/smhasher/ shows that the cityhash quality is better than Murmurhash.

Comment From: lipzhu

Above data is consider from performance perspective, if we consider the hash function from data quality perspective, which is the key for hyperloglog accuracy, we can find the wyhash and farmhash are better (they all passed all the SMHasher tests), On the contrary, MurmurHash64A failed 5 tests in the SMHasher test, especially the collisions test which may have impact the hyperloglog algorithm accuracy.

Conclusion: From performance and data quality perspectives, wyhash is preferred, and it is also the default hash function in go lang.

Redis [NEW] Performance improvement of hash function for hyperloglog

Bellow table list the detailed SMHasher comparison report for 4 hash functions: | Hash function | MiB/sec |cycl./hash|cycl./map | size| Quality problems | |:----------------------------------------------|-------------:|---------:|-----------:|----:|--------------------------------| |Murmur2B | 5919.38 | 38.18 |215.96 (3) |433 |UB, 1.8% bias, collisions, 3.4% distrib, BIC| |City64 | 13887.84 | 46.32 |239.77 (3) |1120 |Sparse, TwoBytes | |FarmHash64 | 12845.53 | 47.11 |251.58 (3) |3758 | | |wyhash | 22540.23 | 28.87 |236.16 (8) |474 | |

Comment From: lipzhu

@oranagra @madolson Is it possible to add an option for user to select a hash now if server is from scratch? What are your opinions?

Comment From: LiorKogan

I would consider Yann Collet's xxHash (XXH3) as an alternative. It is equally as fast. It also seems that the collision properties were tested much more rigorously and were proven to be much better than wyhash. See https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison with a comparison to wyhash.

(Yann Collet is also the author of https://github.com/lz4/lz4)