TL;DR: HyperLogLog (HLL) is a probabilistic data structure that estimates unique counts by analyzing the bit patterns of hashed IDs. Instead of storing every user ID, it tracks the maximum number of leading zeros in hashed values, allowing you to estimate billions of unique views using about 12KB of memory with ~2% error.
Scaling unique view counts is a silent database killer. If you try to track every user_id for every post on a platform with millions of users, your infrastructure costs will eventually eclipse the value of the feature itself. You're effectively burning RAM to show a number on a UI that doesn't even need to be 100% precise.
I’ve seen plenty of teams try the naive route: a dedicated table of user IDs and a big COUNT(DISTINCT) query. At a certain scale, that stops being a query and starts being a resource exhaustion event. If you want to count millions of unique views across millions of posts without your database screaming for mercy, you have to stop storing data and start using math.
Why is the naive approach to unique counts so expensive?
Count distinct operations scale linearly with cardinality. Storing a billion 64-bit IDs consumes 8GB of RAM—unsustainable when you’re tracking millions of individual posts simultaneously.
The math is brutal. If you have 1,000,000 posts and each has 1,000 unique views, you aren't just storing a million numbers; you're storing a billion relations. Even if you hash those IDs to save space, the storage footprint remains massive. You are paying for 100% accuracy in a context where 98% accuracy is indistinguishable to the end user. Most platforms don't need to know a post has exactly 1,004,202 views; they just need to know it's around 1M.
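The back-of-envelope arithmetic above is easy to sanity-check. A minimal sketch, using the hypothetical numbers from the paragraph (1M posts × 1K unique viewers, 8 bytes per 64-bit ID):

```python
# Back-of-envelope cost of the naive "store every (post, user) pair" approach.
posts = 1_000_000
unique_views_per_post = 1_000
bytes_per_id = 8  # one 64-bit user ID

rows = posts * unique_views_per_post  # a billion relations
raw_bytes = rows * bytes_per_id

print(f"{rows:,} rows, {raw_bytes / 1e9:.0f} GB of raw IDs")
# → 1,000,000,000 rows, 8 GB of raw IDs
```

And that 8GB is just the IDs themselves, before indexes, row overhead, and replication multiply it.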
How does the leading zeros math actually work?
HyperLogLog hashes incoming data into a string of bits where zeros and ones are equally likely. By tracking the longest sequence of leading zeros observed across all hashes, the algorithm uses the probability of "coin flipping" to estimate total cardinality.
Think of it as a statistical shortcut. If I tell you I flipped a coin and got a sequence starting with a single zero, you wouldn’t be surprised; that happens 1 in 2 times. If I tell you I found a sequence starting with ten zeros in a row, you’d correctly guess that I’ve probably flipped that coin about 1,024 times, because a run that rare only shows up about once in 2^10 attempts.
Here is how HLL applies that to your data:
- Run a `user_id` through a hash function to get a binary string.
- Count the continuous zeros at the start of that string.
- Keep track of the maximum number of zeros you’ve seen for that specific post.
If the maximum number of continuous zeros you've seen is 10, the math (2^10) tells you that you’ve likely seen around 1,000 unique users. You aren't storing the IDs; you're just storing one small integer: the "max zeros" count.
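The steps above fit in a few lines of code. This is a single-register toy (real HLL averages across thousands of registers to tame the variance), and the 1,000 synthetic `user-N` IDs are just illustrative:

```python
import hashlib

def hash_bits(value: str, bits: int = 32) -> str:
    """Hash a value and return its first `bits` bits as a '0'/'1' string."""
    n = int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")
    return format(n, f"0{bits}b")

def leading_zeros(bitstring: str) -> int:
    """Count the continuous zeros at the start of the bit string."""
    count = 0
    for bit in bitstring:
        if bit != "0":
            break
        count += 1
    return count

# Feed 1,000 hypothetical unique user IDs through the hash,
# remembering only one small integer: the max leading-zeros count.
max_zeros = 0
for i in range(1_000):
    max_zeros = max(max_zeros, leading_zeros(hash_bits(f"user-{i}")))

print(max_zeros, 2 ** max_zeros)  # estimate: roughly 2^max_zeros uniques
```

A single register like this is noisy (the estimate can easily be off by 2-4x), which is exactly why production HLL splits hashes into many buckets and combines them.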
When should I choose an estimate over a precise count?
Choose probabilistic structures when the cost of storage outweighs the value of perfect precision. For high-volume metrics like video views, social media likes, or unique site visitors, a 2% error rate is a fair trade for a 99% reduction in memory usage.
In a real-world system like Redis, HyperLogLog structures are capped at about 12KB. This is a constant memory footprint. It doesn't matter if you are counting a hundred users or a hundred billion; that 12KB doesn't grow.
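That 12KB figure isn't arbitrary: Redis's dense HLL encoding uses 2^14 registers of 6 bits each, and the classic HLL error formula is 1.04/√m for m registers. A quick check of both numbers:

```python
import math

m = 2 ** 14            # registers in Redis's dense HLL encoding
register_bits = 6      # each register holds one max-leading-zeros count

footprint_kb = m * register_bits / 8 / 1024
std_error = 1.04 / math.sqrt(m)  # standard HLL error bound

print(f"{footprint_kb:.0f} KB, ~{std_error:.2%} standard error")
# → 12 KB, ~0.81% standard error
```

The footprint depends only on m, not on how many items you feed in, which is why the structure never grows.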
Scaling Strategy: Exact vs. Estimated
| Metric Type | Accuracy Required | Recommended Tool |
|---|---|---|
| Financial Transactions | 100% (Strict) | SQL COUNT(DISTINCT) |
| Unique Website Visitors | ~98% (High) | HyperLogLog (HLL) |
| Social Media View Counts | ~98% (High) | HyperLogLog (HLL) |
| Real-time Trending Tags | ~95% (Medium) | HLL or Top-K Algorithms |
FAQ
Can HyperLogLog handle merging counts from different time periods?
Yes. One of the best features of HLL is that it is "mergeable." If you have one HLL for Monday’s visitors and another for Tuesday’s, you can combine them to get a unique count for the whole 48-hour period without re-processing the raw data.
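Merging works because combining two sketches is just an element-wise max over their registers. A toy sketch with four registers (real implementations use thousands):

```python
# Each register stores the max leading-zeros count seen in its bucket.
monday = [3, 7, 2, 5]
tuesday = [4, 1, 6, 5]

# Merging two HLLs = element-wise max of their registers.
merged = [max(a, b) for a, b in zip(monday, tuesday)]
print(merged)  # → [4, 7, 6, 5]
```

Because max is commutative and idempotent, merges are lossless, order-independent, and safe to repeat, which is what makes per-day sketches composable into weekly or monthly counts.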
What happens if I have a very small number of users?
HyperLogLog is actually quite smart about this. Most implementations use "Linear Counting" for small sets to keep the error rate near zero, only switching to the probabilistic "leading zeros" math once the volume hits a certain threshold.
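The small-set fallback has a simple closed form: with m registers of which V are still zero, Linear Counting estimates m·ln(m/V). A minimal sketch:

```python
import math

def linear_count(m: int, zero_registers: int) -> float:
    """Linear Counting estimate: m * ln(m / V), where V is the
    number of registers still at zero. Accurate for small sets."""
    return m * math.log(m / zero_registers)

# With 16 registers and half still empty, the estimate is ~11 items.
print(linear_count(16, 8))
```

Once few registers remain empty, this estimate degrades and the implementation switches over to the leading-zeros math.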
Does the choice of hash function matter?
Absolutely. The hash function must be uniform, meaning every bit in the output has an equal 50/50 chance of being a 0 or a 1. If your hash function is biased, your estimates will be consistently wrong.
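You can spot-check uniformity empirically. This sketch samples SHA-256 (a hash known to behave uniformly) over synthetic IDs and confirms the first output bit is a fair coin:

```python
import hashlib

# Across many inputs, roughly half of the hashes should start with a
# 0 bit; a biased hash would skew this and every HLL estimate with it.
samples = 10_000
zeros = sum(
    1 for i in range(samples)
    if hashlib.sha256(f"user-{i}".encode()).digest()[0] < 128
)
zero_fraction = zeros / samples
print(zero_fraction)  # hovers near 0.5
```

The same check applied to, say, a truncated sequential ID instead of a hash would fail badly, which is the bias this FAQ warns about.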
Cheers,
Doogal