Are there any practical limitations on the maximum number of entries in an HLL depending on the input data? I was using the HLL data structure to count 32-byte hex strings. Right now PFCOUNT hll always returns 244503751, and PFADD hll <ANY VALUE> always returns 0 regardless of what I try to add. If it helps, I can post the output of GET hll.

Is there some practical limitation that is not documented anywhere, or am I missing something fundamental? I would assume that, for example, a generated UUID should (almost) always increase the counter until the count gets somewhere close to 2**64.

Comment From: sundb

Are there any practical limitations on the maximum number of entries in an HLL depending on the input data? I was using the HLL data structure to count 32-byte hex strings. Right now PFCOUNT hll always returns 244503751, and PFADD hll <ANY VALUE> always returns 0 regardless of what I try to add. If it helps, I can post the output of GET hll.

in theory, hyperloglog can count up to 2^64 distinct items with a 0.81% standard error. this means, in the worst case, you may need to add approximately 244503751 * 0.0081 items to see a change.
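To make the 0.81% figure concrete, here is a minimal HyperLogLog sketch in Python. This is an illustration of the algorithm, not Redis's implementation: Redis uses 2^14 six-bit registers and MurmurHash64A with sparse/dense encodings, while this sketch uses SHA-1 and a plain list of ints.

```python
import hashlib
import math

P = 14                            # Redis also uses 2^14 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)  # bias-correction constant for large M

def hash64(item: str) -> int:
    # 64-bit hash derived from SHA-1 (Redis uses MurmurHash64A instead)
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

def pfadd(registers: list, item: str) -> None:
    h = hash64(item)
    idx = h & (M - 1)             # low P bits select a register
    rest = h >> P                 # remaining 64 - P bits
    # rank = 1-based position of the first set bit in the remainder
    rank = (64 - P) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rank)

def pfcount(registers: list) -> float:
    raw = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * M:            # small-range correction (linear counting)
        zeros = registers.count(0)
        if zeros:
            return M * math.log(M / zeros)
    return raw

registers = [0] * M
n = 100_000
for i in range(n):
    pfadd(registers, f"item-{i}")

estimate = pfcount(registers)
print(f"true={n} estimate={estimate:.0f} rel.err={(estimate - n) / n:+.2%}")
```

The standard error of 1.04 / sqrt(2^14) ≈ 0.81% comes entirely from the register count, which is why the absolute error (and the number of new items needed to move the estimate) grows with the cardinality.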

Is there some practical limitation that is not documented anywhere, or am I missing something fundamental? I would assume that, for example, a generated UUID should (almost) always increase the counter until the count gets somewhere close to 2**64.

using UUIDs doesn't mean there is no error; the uuid is still hashed internally, so the estimate is still probabilistic.

Comment From: viktorvsk

this means, in the worst case, you may need to add approximately 244503751 * 0.0081 items to see a change.

Does it mean that the error grows with the number of entries/changes? I thought that if PFCOUNT returns X, it means that ACTUAL_UNIQUE_ENTRIES is somewhere in X - (X*0.0081) < ACTUAL < X + (X*0.0081); in other words, if PFCOUNT gives 100, the real number could be, let's say, between 90 and 110, but no way could it be 200.

Do I get it wrong, and in my case with 244503751 there could actually be billions of unique entries? Given that each PFADD's chance of not changing the result is not only 0.81% but also depends on this value 244503751?

Comment From: sundb

this means, in the worst case, you may need to add approximately 244503751 * 0.0081 items to see a change.

Does it mean that the error grows with the number of entries/changes? I thought that if PFCOUNT returns X, it means that ACTUAL_UNIQUE_ENTRIES is somewhere in X - (X*0.0081) < ACTUAL < X + (X*0.0081); in other words, if PFCOUNT gives 100, the real number could be, let's say, between 90 and 110, but no way could it be 200.

AFAIK, 0.81% is a standard error, not a hard bound, which means ACTUAL can still fall outside X - (X*0.0081) < ACTUAL < X + (X*0.0081). 0.81% is only derived from probability; it does not mean that 200 cannot happen, just that the probability gets smaller the further the deviation goes beyond 0.81%.

Do I get it wrong, and in my case with 244503751 there could actually be billions of unique entries? Given that each PFADD's chance of not changing the result is not only 0.81% but also depends on this value 244503751?

yes.

Comment From: oertl

Does it mean that the error grows with the number of entries/changes? I thought that if PFCOUNT returns X, it means that ACTUAL_UNIQUE_ENTRIES is somewhere in X - (X*0.0081) < ACTUAL < X + (X*0.0081); in other words, if PFCOUNT gives 100, the real number could be, let's say, between 90 and 110, but no way could it be 200.

0.81% is the relative standard error, which means that the error is within those boundaries only with a probability of 68.2%.
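For the numbers in this thread, and assuming the error is roughly normally distributed (a common approximation, not something the Redis docs guarantee), the one/two/three-sigma intervals around a PFCOUNT of 244503751 work out like this:

```python
X = 244_503_751   # value returned by PFCOUNT in this thread
SE = 0.0081       # Redis HLL relative standard error (0.81%)

# assuming a roughly normal error distribution:
for k, coverage in [(1, "68.2%"), (2, "95.4%"), (3, "99.7%")]:
    lo, hi = X * (1 - k * SE), X * (1 + k * SE)
    print(f"within ±{k}·SE ({coverage} of the time): {lo:,.0f} .. {hi:,.0f}")
```

So the true cardinality is very likely within a few percent of X, but none of these intervals is a hard guarantee.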

Comment From: sundb

0.81% is the relative standard error,

nice, thanks for pointing that out.

which means that the error is within those boundaries only with a probability of 68.2%.

yes.

Comment From: viktorvsk

@sundb @oertl thanks for the explanation! Just to confirm, would it be fair to say something like:

“if I have a billion entries and I want to know how many of them are unique, and my hard requirement is to never be wrong by more than 10% (i.e. X * 0.9 < ACTUAL < X * 1.1), then HLL is not the right tool for the job”?

Comment From: sundb

@viktorvsk hyperloglog is probabilistic, so any deviation can happen; it cannot meet a hard requirement like that.
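One way to quantify "anything can happen": under the same rough normal-error assumption as above, a 10% error bound sits many standard deviations out, so there is no guarantee, only an astronomically small probability of exceeding it. A back-of-the-envelope sketch:

```python
SE = 0.0081      # Redis HLL relative standard error
BOUND = 0.10     # the 10% hard requirement above

k = BOUND / SE   # how many standard deviations a 10% error is
print(f"a 10% error is {k:.1f} standard deviations out")
```

This doesn't change the answer (a probabilistic structure still can't certify a hard bound); it only quantifies how unlikely a 10% miss would be under that model.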

Comment From: viktorvsk

Got it, thanks a lot @sundb