| Apr | MAY | Jun |
| 25 | ||
| 2021 | 2022 | 2023 |
COLLECTED BY
Collection: Common Crawl
At the end of the event Tommy comes to me with his figure (17) and I tell him
that I had a word with 46 people: I clearly am the winner, but Tommy’s frustrated
as he thinks I’ve counted the same people multiple times, as he only saw me with
15/20 people in total. So, the wager’s off and we decide that,
for our next event, we’ll be taking down names instead, so that we’re sure we’re
going to be counting unique people, and not just the total number of conversations.
At the end of the following conference we enthusiastically meet each other with
a very long list of names and, guess what, Tommy had a couple more encounters
than I did! We both laugh it off and while discussing our approach to counting
uniques, Tommy comes up with a great idea:
Alex, you know what? We can’t go around with pen and paper and track down a list of names, it’s really impractical!
Today I spoke to 65 different people and counting their names on this paper was a real pain in the back…I lost count 3 times and had to start from scratch!
Yeah, I know, but do we even have an alternative?
What if, for our next conference, instead of asking for names, we ask people the last 5 digits of their phone number?
Now, follow me: instead of winning by counting their names, the winner will be the one who spoke to someone with the longest sequence of leading zeroes in those digits.
Wait Tommy, you’re going too fast! Slow down a second and give me an example…
Sure, just ask people for those last 5 digits, ok? Let’s suppose you get 54701.
No leading zero, so the longest sequence of zeroes for you is 0.
The next person you talk to tells you it’s 02561 – that’s a leading zero! So your longest sequence comes to 1.
You’re starting to make sense to me…
Yeah, so if we speak to a couple people, chances are that are longest zero-sequence will be 0. But if we talk to ~10 people, we have more chances of it being 1.
Now, imagine you tell me your longest zero-sequence is 5 – you must have spoken to thousands of people to find someone with 00000 in their phone number!
Dude, you’re a damn genius!
And that, my friends, is how HyperLogLog fundamentally works: it allows us to
estimate uniques within a large dataset by recording the longest sequence of
zeroes within that set. This ends up creating an incredible advantage over keeping
track of each and every element in the set, making it an incredibly efficient way
to count unique values with relatively high accuracy:
The HyperLogLog algorithm can estimate cardinalities well beyond 10^9 with a relative accuracy (standard error) of 2% while only using 1.5kb of memory.
8% Right: Cardinality Estimation for Big Data
Since this is the usual me oversimplifying things
that I find hard to understand, let’s have a look at some more details of HLL.
1 2 3 | |
1 2 3 4 5 6 7 8 9 | |
00004 — jackpot! You might have won a wager
against Tommy, but if you use this method in real life chances are that specific
data in your set will negatively influence the estimation.
Fear no more, as this is a problem HLL was born to solve: not many are aware that Philippe
Flajolet, one of the brains behind HLL, was quite involved in cardinality-estimation
problems for a long time, long enough to have come up with the Flajolet-Martin
algorithm in 1984 and
(super-)LogLog in 2003,
which already addressed some of the problems with outlying hashed values by dividing
measurements into buckets, and (somewhat) averaging values across buckets.
If you got lost here, let me go back to our original example: instead of just
taking the last 5 digits of a phone number, we take 6 of them and store the longest
sequence of leading zeroes together with the first digit (the bucket). This means
that our data will look like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | |
APPROX_COUNT_DISTINCT())
●Reddit, where it’s used to calculate how many unique views a post has gathered
In particular, see how HLL impacts queries on BigQuery:
1 2 3 4 5 6 7 | |