I recently started using the HyperLogLog functionality of Redis and have run into a rather odd issue when adding entries in bulk. This all occurs when using PFADD. Here are the three methods I've used (unsuccessfully):

  1. Mass insertion using the format: cat data | mawk -F '|' '{ printf "*3\r\n" "$5\r\nPFADD\r\n" "$"length("test")"\r\ntest\r\n" "$"length($1)"\r\n"$1"\r\n" }' | redis-cli --pipe

  2. Doing one large query: echo "PFADD test 1 2 3 4 ... N" | redis-cli

  3. Using Lua to copy from a data key, where data was created using mass insertion via SADD: local matches = {} redis.replicate_commands() redis.set_repl(redis.REPL_NONE) matches = redis.call("SSCAN", data, 0, "COUNT", 1000000000) for key,value in pairs(matches[2]) do redis.call("PFADD", test, value) end

In all 3 cases, if I do around 25k entries, I'm able to do a PFADD to the test key after the fact and successfully add another entry. If I push it closer to 50k, all PFADDs done after the fact start to fail, no matter the value. I have been able to reproduce this issue using 3.2.6 and 4.0 on two separate VMs.

Any help is greatly appreciated. I can post full code examples as needed, with a sample file of 1 million entries. The data I've been using is a text file of 5 million MD5 hashes.
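For reference, the protocol stream that method 1 produces can also be generated outside of awk. The following is only a sketch in Python: the helper names `resp_command` and `pfadd_stream` and the `batch_size` parameter are hypothetical, and sending several elements per PFADD is an assumption of this sketch, not something the commands above do. It is not claimed to avoid the failure described.

```python
def resp_command(*args):
    """Encode one command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        # Lengths in RESP are byte lengths, so encode before measuring.
        data = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def pfadd_stream(key, elements, batch_size=1000):
    """Yield RESP-encoded PFADD commands, at most batch_size elements each.

    batch_size is a hypothetical tuning knob for this sketch.
    """
    batch = []
    for element in elements:
        batch.append(element)
        if len(batch) == batch_size:
            yield resp_command("PFADD", key, *batch)
            batch = []
    if batch:
        # Flush the final, possibly shorter, batch.
        yield resp_command("PFADD", key, *batch)
```

Writing each yielded chunk to standard output as raw bytes and piping the result into redis-cli --pipe would mirror the awk pipeline in method 1.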

Comment From: georgepsarakis

The first method, using the Redis protocol, seems to be the recommended approach, as suggested here.

Could you break down the process you describe here into separate steps:

In all 3 cases, if I do around 25k entries, I'm able to do a PFADD to the test key after the fact and successfully add another entry. If I push it closer to 50k, all PFADDs done after the fact start to fail, no matter the value.

Also, I think you do not mention what type of failure/error you encounter. Is it a timeout?

Comment From: fsaintjacques

Would it be possible to do a PFADD with raw HLL-encoded values instead? That is, I would compute the HyperLogLog counters in my ETL and simply do an add (or union) with the counters, so the payload is constant in size rather than linear in the number of keys I want to add.

Comment From: masoud-msk

@fsaintjacques Yes! As the documentation states:

The HyperLogLog, being a Redis string, can be retrieved with GET and restored with SET. Calling PFADD, PFCOUNT or PFMERGE commands with a corrupted HyperLogLog is never a problem, it may return random values but does not affect the stability of the server. Most of the times when corrupting a sparse representation, the server recognizes the corruption and returns an error.
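Building on the quoted passage, one way to union a precomputed HyperLogLog into an existing key is to SET the serialized value under a temporary key and then PFMERGE it. The sketch below is illustrative only: `merge_precomputed_hll` and the temporary key name are hypothetical, and `client` is assumed to be a redis-py style object exposing `set`, `pfmerge`, `delete` and `pfcount`. It also assumes the bytes are already in Redis's own HyperLogLog string format (for example, obtained with GET from another instance); counters produced by a different HLL library would not be compatible as-is.

```python
def merge_precomputed_hll(client, target_key, serialized_hll,
                          tmp_key="hll:import:tmp"):
    """Union a serialized Redis HyperLogLog into target_key.

    client is assumed to behave like a redis-py client; tmp_key is a
    hypothetical scratch key name chosen for this sketch.
    """
    # Restore the serialized HyperLogLog as a plain string value.
    client.set(tmp_key, serialized_hll)
    try:
        # PFMERGE treats an existing destination as one of the sources,
        # so this unions the imported counters into target_key.
        client.pfmerge(target_key, tmp_key)
    finally:
        # Remove the scratch key even if the merge raised an error.
        client.delete(tmp_key)
    return client.pfcount(target_key)
```

The payload sent per import is then bounded by the size of one serialized HyperLogLog, regardless of how many elements were counted upstream.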