Doogal Simpson

Posted on • Originally published at doogal.dev

How Big Tech Scales View Counts: The Power of HyperLogLog and Harmonic Means

TL;DR: Scaling unique view counts for millions of posts requires more than just a COUNT(DISTINCT) query. Modern platforms use HyperLogLog, a probabilistic data structure that estimates cardinality using hashing and bucketing. By applying a harmonic mean across thousands of independent buckets, engineers can maintain high accuracy with a tiny memory footprint.

When we talk about scale, we often focus on throughput or latency, but memory consumption for unique metrics is a silent killer. If you are building a platform with millions of users and millions of posts, your first instinct might be to store a set of User IDs for every post to ensure you don't count the same person twice. If you have 10 million unique visitors on a post, and each user ID is a 64-bit integer, you are burning roughly 80MB of memory just for one post's view count. Multiply that by a million posts, and you are looking at an 80TB memory requirement just for view counts. It is simply not feasible.

To solve this, we use HyperLogLog (HLL). It allows us to estimate the number of unique items—what we call cardinality—without storing the actual items. Instead of keeping every User ID, we hash the IDs and look for patterns in the bits. Specifically, we look at the number of leading zeros in the hashed value.

How does HyperLogLog estimate millions of unique views?

HyperLogLog estimates cardinality by observing the maximum number of leading zeros in the hashed binary representations of incoming data. It operates on the mathematical probability that in a random distribution of bits, a sequence of n leading zeros will occur once every 2^n elements.

Think of it like flipping a coin. If I tell you that I flipped a coin until I got heads and it took me ten tries (meaning I got nine tails in a row), you can reasonably guess that I have been flipping for a while, because a run of nine tails shows up only about once in every 512 attempts. The same logic applies to hashes: if I see a hashed User ID that starts with 10 zeros, I can estimate that I have probably processed around 2^10 unique users. I don't need to know who the users are; I just need to record the highest number of leading zeros I've seen so far. This allows HLL to represent billions of unique items using only a few kilobytes of state.
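To make that intuition concrete, here is a minimal single-register sketch in Python. The `hash64` helper (first 8 bytes of a SHA-256 digest) and the 64-bit width are illustrative assumptions, not a production hash choice:

```python
import hashlib

def hash64(item: str) -> int:
    # Illustrative 64-bit hash: first 8 bytes of a SHA-256 digest.
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

def leading_zeros(value: int, bits: int = 64) -> int:
    # Number of leading zero bits in a fixed-width binary representation.
    return bits if value == 0 else bits - value.bit_length()

# Record only the longest run of leading zeros ever observed --
# this single integer is the entire state, regardless of traffic volume.
max_zeros = 0
for user_id in (f"user-{i}" for i in range(100_000)):
    max_zeros = max(max_zeros, leading_zeros(hash64(user_id)))

# The crude estimate: a run of n leading zeros suggests ~2^n unique items.
estimate = 2 ** max_zeros
```

Note that the loop never stores a single User ID; the only state that survives is `max_zeros`.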

Why is a single HyperLogLog estimate often inaccurate?

A single HLL estimate relies on the maximum number of leading zeros in a hash, meaning it can only scale in powers of two. Because the estimate is derived from a single maximum value, one "lucky" hash with an unusually long string of zeros can cause the entire estimate to jump from 1,024 to 2,048 instantly.

I like to think of this as a ladder where the rungs are spaced further apart the higher you go. If your only markers are at 1,024, 2,048, and 4,096, you have no way to represent 1,500. If you are tracking a post with 1,025 views, but one user’s ID happens to hash into a value that starts with 11 zeros by pure chance, your system will report 2,048 views. You are off by nearly 100% because of a single outlier. This variance is the primary weakness of a raw HLL implementation.
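A toy version of that jump, using the same hypothetical numbers as the example above:

```python
def estimate_from_max(max_leading_zeros: int) -> int:
    # A single-register estimator can only ever report powers of two.
    return 2 ** max_leading_zeros

# 1,025 real viewers; most hashes top out at 10 leading zeros...
print(estimate_from_max(10))  # 1024, close to the truth
# ...but a single lucky hash with 11 leading zeros doubles the report.
print(estimate_from_max(11))  # 2048, off by nearly 100%
```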

How does bucketing fix the estimation variance?

Bucketing divides the incoming data into thousands of independent streams, calculating a separate estimate for each to ensure that one outlier cannot corrupt the entire result. By using the first few bits of a hash to assign a User ID to a specific bucket, we distribute the "luck" of the hashes across a wider range of data points.

When a User ID comes in, we might use the first 10 bits of its hash to choose one of 1,024 buckets. We then perform the HLL leading-zero count on the remaining bits for just that bucket. Instead of one giant, shaky estimate for the entire post, we now have 1,024 small estimates. If one bucket gets a "lucky" hash and over-estimates its portion of the traffic, it only represents 1/1,024th of the total data. The other 1,023 buckets will likely be much more accurate, diluting the impact of the outlier.
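Here is a sketch of that bucketing step. The 64-bit SHA-256-based hash, the 10-bit index, and the bucket count are assumptions chosen to match the example, not a specific library's layout:

```python
import hashlib

INDEX_BITS = 10                 # first 10 bits of the hash choose the bucket
NUM_BUCKETS = 1 << INDEX_BITS   # 1,024 independent estimators
VALUE_BITS = 64 - INDEX_BITS    # leading zeros are counted on the rest

def hash64(item: str) -> int:
    # Illustrative 64-bit hash: first 8 bytes of a SHA-256 digest.
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

def leading_zeros(value: int, bits: int) -> int:
    return bits if value == 0 else bits - value.bit_length()

buckets = [0] * NUM_BUCKETS

def add(user_id: str) -> None:
    h = hash64(user_id)
    index = h >> VALUE_BITS                  # top bits pick the bucket
    remainder = h & ((1 << VALUE_BITS) - 1)  # bottom bits feed the zero count
    buckets[index] = max(buckets[index], leading_zeros(remainder, VALUE_BITS))

for i in range(50_000):
    add(f"user-{i}")
```

Each bucket holds one small integer, so the whole structure is roughly a kilobyte of state no matter how many views arrive.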

Why use a harmonic mean instead of a regular average?

The harmonic mean is used to aggregate bucket estimates because it is significantly more resilient to large outliers than a standard arithmetic mean. It effectively "ignores" the buckets that over-report due to chance, keeping the final count grounded in the majority of the data.

In a standard arithmetic mean, if you have ten buckets where nine report 100 and one reports 1,000, your average is 190. That single outlier has pulled your average nearly 100% higher than the reality. The harmonic mean—calculated as the reciprocal of the average of the reciprocals—weights smaller values more heavily. For that same set of numbers, the harmonic mean would be much closer to 100. Since HLL outliers are almost always over-estimations (the "lucky" hashes), the harmonic mean is the perfect tool to pull the estimate back toward the true count.
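The effect is easy to verify with the numbers from that example (plain Python, no libraries assumed):

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(values):
    # Reciprocal of the average of the reciprocals.
    return len(values) / sum(1.0 / v for v in values)

buckets = [100] * 9 + [1000]     # nine honest buckets, one "lucky" outlier

print(arithmetic_mean(buckets))  # 190.0 -- dragged up by the outlier
print(harmonic_mean(buckets))    # ~109.9 -- stays near the true value
```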

| Feature | Traditional Set (COUNT DISTINCT) | HyperLogLog with Bucketing |
| --- | --- | --- |
| Memory Scaling | O(n), grows with every new user | O(log log n), nearly constant |
| Accuracy | 100% (exact) | ~99.2% (probabilistic, ~0.81% standard error) |
| Storage Size | Megabytes to gigabytes | ~12KB per counter |
| Best Use Case | Financial transactions / billing | View counts / unique visitors |

FAQ

Can I use HyperLogLog to see if a specific user has viewed a post?
No. HyperLogLog is a "write-only" structure for the original data. It forgets the actual IDs immediately after processing them to save space. If you need to check for the existence of a specific ID, you would need a Bloom Filter.

What is the standard error rate I can expect?
For a typical implementation using 16,384 buckets (which takes about 12KB of space), the standard error is roughly 0.81%. For view counts on a social media post, this level of error is virtually indistinguishable to the end user.
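That figure falls out of the standard HyperLogLog error bound of roughly 1.04/√m for m buckets, which is quick to check:

```python
import math

def hll_standard_error(num_buckets: int) -> float:
    # Standard HLL relative error bound: ~1.04 / sqrt(m).
    return 1.04 / math.sqrt(num_buckets)

print(hll_standard_error(16_384))  # 0.008125, i.e. roughly 0.81%
```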

Is the harmonic mean expensive to calculate?
Compared to the massive CPU and I/O cost of querying millions of rows in a traditional database, the harmonic mean is negligible. It involves a single pass over your buckets (usually a few thousand integers) and a bit of division, which modern CPUs handle in microseconds.
