Redis [NEW] Optimize HyperLogLog by SIMD acceleration

We found that the dense-encoding part of HyperLogLog can be significantly accelerated by SIMD instructions.

Our benchmark tests the performance of merging 3 dense hll structures.

pfcount key1 key2 key3
pfmerge keyall key1 key2 key3

The result shows a 12x speedup compared to the scalar version.

======================================================================================================
Type             Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar    5280.76        37.93893        34.81500        69.63100        73.72700       283.63
PFCOUNT-avx2     69802.99         2.85844         2.75100         5.53500         6.97500      3749.18
------------------------------------------------------------------------------------------------------
PFMERGE-scalar    9445.56        21.17554        20.09500        38.91100        41.21500       590.35
PFMERGE-avx2    120642.02         1.65367         1.59100         3.21500         5.11900      7540.13
------------------------------------------------------------------------------------------------------

CPU:    13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS:     Ubuntu 22.04.5 LTS

Fork: https://github.com/Nugine/redis/tree/hll-simd
Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh

I'm not sure whether this change is acceptable to maintainers. I will create a PR if it is ok.

Comment From: sundb

@Nugine nice!! i saw the changed you made in https://github.com/Nugine/redis/commit/92dc40caba4e04840918804ebc1768cc0e4833f8. seems good to me, although I'm not familiar with SIMD, thanks.