Yuvraj Raghuvanshi

Posted on May 19 • Originally published at yuvrajraghuvanshis.Medium on May 15

Counting a Billion Things With 1.5 Kilobytes

#algorithms #backend #datastructures

Here’s a problem that sounds trivial until you think about it for a moment.

You have a stream of data, user IDs hitting your API, search queries, IP addresses, product views. The stream is enormous: hundreds of millions of events per day. Someone asks: how many unique users did we serve today?

The naive answer is a set. Throw every user ID into a set, count its size at the end of the day. This is correct. It is also catastrophically expensive. A set of 100 million 64-bit integers takes roughly 800 megabytes of memory. Scale to a billion users and you’re at 8 gigabytes, just for one counter, just for one day. Redis running PFCOUNT on a key with a billion unique members would consume none of that. It uses 1.5 kilobytes.

That number is not a typo. The algorithm behind it is called HyperLogLog, it was invented by a French mathematician named Philippe Flajolet in 2007, and it is one of the most satisfying things I’ve encountered in a while. Let’s build it from scratch.

The Problem Has a Name

This is called the count-distinct problem , or cardinality estimation. You have a multiset (a collection where elements can repeat) and you want to know how many distinct elements it contains, without storing all of them.

The exact answer always requires memory proportional to the number of distinct elements. There’s no way around that. But if you’re willing to accept a small, predictable error (say, within 2% of the true count) you can do something remarkable: use memory proportional to the logarithm of the logarithm of the count. That’s where the “LogLog” in HyperLogLog comes from. It’s not a marketing name. It’s a description of the memory complexity.

To understand how this is possible, we need to start with a coin flip.

The Coin Flip Observation

Imagine flipping a fair coin repeatedly until you get heads, and recording how many tails you saw before the first heads. Call that number k.

If k = 0, you got heads immediately. Probability: 1/2. If k = 1, you got one tail then heads. Probability: 1/4. If k = 2, two tails then heads. Probability: 1/8.

The probability of seeing k leading tails before the first heads is 1 / 2^(k+1). Which means: if the longest run of leading tails you've ever seen across many experiments is k, you've probably run roughly 2^k experiments.

See where this is going?

If we hash every element in our stream to a sequence of bits, those bits behave like coin flips, roughly half start with 0, a quarter start with 00, an eighth start with 000. If the longest run of leading zeros we’ve seen in any hash is k, our estimate for how many distinct elements we've processed is 2^k.

Let’s check the intuition with real numbers. Say we have 1,000 distinct elements. The probability that at least one of them hashes to a value starting with 10 leading zeros is very high, because 2^10 = 1024 ≈ 1000. The probability that any of them hashes to 20 leading zeros is astronomically low (2^20 = 1,048,576) much larger than our actual count.

So the maximum run of leading zeros is a surprisingly good estimator of the order of magnitude of distinct elements seen. The key insight is that we only need to store one number (the current maximum) regardless of how many elements we’ve seen.

Here’s the simplest possible version:

# util.py
import hashlib

def count_leading_zeros(hash_bits: int, bit_length: int = 32) -> int:
    """Count leading zero bits in a hash value."""
    if hash_bits == 0:
        return bit_length
    count = 0
    for i in range(bit_length - 1, -1, -1):
        if (hash_bits >> i) & 1:
            break
        count += 1
    return count

def hash_element(element: str) -> int:
    """Hash an element to a 32-bit integer."""
    h = hashlib.md5(element.encode()).hexdigest()
    return int(h[:8], 16) # Take first 32 bits

# naive.py
from util import count_leading_zeros, hash_element

class NaiveEstimator:
    def __init__ (self):
        self.max_leading_zeros = 0

    def add(self, element: str):
        h = hash_element(element)
        k = count_leading_zeros(h)
        self.max_leading_zeros = max(self.max_leading_zeros, k)

    def estimate(self) -> int:
        return 2**self.max_leading_zeros

Let’s test it:

# test.py
import random
import string

from naive import NaiveEstimator

def random_id():
    return "".join(random.choices(string.ascii_lowercase, k=10))

naive_estimator = NaiveEstimator()
true_set = set()

for _ in range(100_000):
    elem = random_id()
    naive_estimator.add(elem)
    true_set.add(elem)

print(f"True count: {len(true_set)}\n")

est = naive_estimator.estimate()
error = abs(est - len(true_set)) / len(true_set) * 100

print(f"[naive] \tEstimated count: {est}")
print(f"[naive] \tError: {error:.2f}%\n")

Running this a few times gives results like:

True count: 100000
[naive] Estimated count: 32768 ← 2^15
[naive] Error: 67.23%

True count: 100000
[naive] Estimated count: 65536 ← 2^16
[naive] Error: 34.46%

It’s correct to within an order of magnitude, consistently. But notice the problem: the estimate can only ever be a power of 2. It jumps from 32,768 to 65,536 with no values in between. The estimate has extremely high variance, a single unlucky hash producing extra leading zeros throws everything off. This is the Flajolet-Martin algorithm from 1984. It works, but it’s rough.

Reducing Variance: Buckets

The solution to high variance is the same as it always is in statistics: take more samples and average them.

One approach: run multiple independent hash functions and average the results. But hashing everything multiple times is expensive.

A smarter approach, from Flajolet and Durand’s 2003 LogLog paper: use a single hash, but split it into two parts. Use the first b bits to choose a bucket (one of m = 2^b buckets), and run the leading-zero counter on the remaining bits.

Each bucket independently estimates the cardinality from the subset of elements that landed in it. We then combine the buckets by averaging. We’ve effectively gotten m independent estimates from a single hash function.

# loglog.py
from util import count_leading_zeros, hash_element

class LogLogEstimator:
    def __init__ (self, b: int = 8):
        """
        b: number of bits used for bucket index
        m: number of buckets = 2^b
        """
        self.b = b
        self.m = 2**b
        self.buckets = [0] * self.m

    def add(self, element: str):
        h = hash_element(element)

        # First b bits → bucket index
        bucket_index = h >> (32 - self.b)

        # Remaining 32-b bits → count leading zeros
        remaining = h & ((1 << (32 - self.b)) - 1)
        leading_zeros = count_leading_zeros(remaining, 32 - self.b)

        # Keep the maximum for this bucket
        self.buckets[bucket_index] = max(
            self.buckets[bucket_index],
            leading_zeros,
        )

    def estimate(self) -> float:
        # LogLog: geometric mean across buckets
        avg = sum(self.buckets) / self.m
        return self.m * (2**avg)

With b = 8, we have 256 buckets, each storing one small integer (max ~32). Total memory: 256 bytes. And the estimates are dramatically more stable than the naive version. The standard error of LogLog is 1.3 / sqrt(m), with 256 buckets, that's about 8%.

But we can do better. Still in the same 2003 paper, Flajolet noticed that the largest bucket values are outliers that inflate the estimate. He suggested keeping only the bottom 70% of bucket values for the average. This is SuperLogLog , and it reduces the error to 1.05 / sqrt(m), about 6.5% with 256 buckets, with no memory increase.

Then came 2007.

The Harmonic Mean: The HyperLogLog Insight

The jump from SuperLogLog to HyperLogLog is a single change: replace the geometric mean with the harmonic mean.

The harmonic mean of a set of values is:

n / (1/v1 + 1/v2 + ... + 1/vn)

Why does this help? The harmonic mean is less sensitive to large outliers than the geometric mean. When one bucket has seen an unusually large leading-zero count, because one element happened to hash to something that starts with twelve zeros, that bucket’s contribution to the harmonic mean is 1/2^12, which is tiny. It barely moves the needle. The geometric mean (averaging the exponents) would give it far more weight.

The HyperLogLog estimate is:

estimate = correction_factor * m^2 * 1 / sum(2^(-bucket[i]) for all i)

Where correction_factor is a constant that depends on m (approximately 0.7213 for large m), and the whole formula is just the harmonic mean of 2^bucket[i] values, scaled.

# hyperloglog.py
import math

from util import count_leading_zeros, hash_element

# Correction factors per the original paper
CORRECTION_FACTORS = {
    16: 0.673,
    32: 0.697,
    64: 0.709,
}

class HyperLogLog:
    def __init__ (self, b: int = 8):
        """
        b: number of index bits (between 4 and 16)
        m = 2^b buckets
        Standard error ≈ 1.04 / sqrt(m)
        """
        assert 4 <= b <= 16, "b must be between 4 and 16"
        self.b = b
        self.m = 2**b
        self.buckets = [0] * self.m

        if self.m in CORRECTION_FACTORS:
            self.alpha = CORRECTION_FACTORS[self.m]
        else:
            # for m >= 128
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, element: str):
        h = hash_element(element)

        # First b bits → bucket index
        bucket_index = h >> (32 - self.b)

        # Remaining bits → position of leftmost 1-bit
        # (= 1 + number of leading zeros in remaining bits)
        remaining = h & ((1 << (32 - self.b)) - 1)
        rho = count_leading_zeros(remaining, 32 - self.b) + 1

        self.buckets[bucket_index] = max(self.buckets[bucket_index], rho)

    def estimate(self) -> float:
        # Harmonic mean of 2^bucket values
        harmonic_sum = sum(2.0 ** (-b) for b in self.buckets)
        raw_estimate = self.alpha * (self.m**2) / harmonic_sum

        # Small range correction: use LinearCounting when estimate is small
        if raw_estimate <= 2.5 * self.m:
            zeros = self.buckets.count(0)
            if zeros > 0:
                return self.m * math.log(self.m / zeros)

        # Large range correction (for 32-bit hashes)
        if raw_estimate > (1.0 / 30.0) * (2**32):
            return -(2**32) * math.log(1 - raw_estimate / (2**32))

        return raw_estimate

Two corrections are applied at the edges:

Small range correction : When the estimate is less than 2.5x the number of buckets, many buckets are still empty (zero). In that regime, a separate algorithm called LinearCounting is more accurate. LinearCounting uses the number of empty buckets: m * ln(m / empty_buckets). HyperLogLog switches to it automatically.

Large range correction : When the estimate approaches the maximum value representable by a 32-bit hash (about 4 billion), hash collisions start causing systematic undercount. A logarithmic correction compensates.

Now let’s test this properly:

# test.py
import random
import string

from hyperloglog import HyperLogLog

def random_id():
    return "".join(random.choices(string.ascii_lowercase, k=10))

hyper_log_log = HyperLogLog(b=10) # 1024 buckets, ~1KB
true_set = set()

for _ in range(100_000):
    elem = random_id()
    hyper_log_log.add(elem)
    true_set.add(elem)

print(f"True count: {len(true_set)}\n")

est = hyper_log_log.estimate()
error = abs(est - len(true_set)) / len(true_set) * 100

print(f"[hyper log log] \tEstimated count: {est:,.0f}")
print(f"[hyper log log] \tError: {error:.2f}%")
print(f"[hyper log log] \tMemory (approx): {hyper_log_log.m} bytes for buckets")

Output:

True count: 100000

[hyper log log] Estimated count: 97,291
[hyper log log] Error: 2.71%
[hyper log log] Memory (approx): 1024 bytes for buckets

1024 bytes. About 2% error in this particular run. On 100,000 unique elements.

With b = 10 (1024 buckets), the theoretical standard error is 1.04 / sqrt(1024) = 3.25%. In practice it often lands well within that. With b = 12 (4096 buckets, still only 4KB), the standard error drops to 1.04 / sqrt(4096) = 1.6%.

The Memory Arithmetic

Let’s be precise about why this is so small.

Each bucket stores the maximum run of leading zeros it’s seen. With a 32-bit hash and b bits used for the index, the remaining 32 - b bits are used for the zero count. The maximum possible count is 32 - b, which for b = 10 is 22. You need 5 bits to represent numbers up to 22.

So each bucket needs 5 bits. With 1024 buckets: 1024 * 5 = 5120 bits = 640 bytes.

In practice Redis uses 6 bits per register and 16,384 registers (b = 14): 16384 * 6 = 98,304 bits = 12,288 bytes = 12KB. With this configuration the standard error is 1.04 / sqrt(16384) = 0.81%. Under 1% error, for a 12KB data structure, counting up to billions of distinct elements.

Redis also uses a dense/sparse encoding: when few elements have been added, only the non-zero buckets are stored. The 12KB limit is only reached with a large number of distinct elements.

The Property That Makes It Production-Useful

There’s one more thing that makes HyperLogLog more than just a clever approximation: HyperLogLog sketches are mergeable.

If you have a HyperLogLog for Monday’s traffic and one for Tuesday’s traffic, you can merge them into a HyperLogLog for Monday+Tuesday’s combined unique users by taking the element-wise maximum of the two bucket arrays:

def merge(hll1: HyperLogLog, hll2: HyperLogLog) -> HyperLogLog:
    assert hll1.b == hll2.b, "Can only merge HLLs with same b"
    merged = HyperLogLog(hll1.b)
    merged.buckets = [
        max(b1, b2)
        for b1, b2 in zip(hll1.buckets, hll2.buckets)
    ]
    return merged

This takes O(m) time and produces an estimate with the same error guarantee as a fresh HyperLogLog that had seen both datasets. No re-processing. No storing the original data.

This is why it appears in distributed systems everywhere. You can compute HyperLogLog sketches independently on shards, machines, or time windows, then merge them instantly. Reddit uses this for per-post unique view counts distributed across servers. BigQuery uses it for APPROX_COUNT_DISTINCT(). Facebook Presto, Apache Druid, Amazon Redshift - they all have it.

In Redis, PFADD adds elements and PFMERGE merges two HLLs. The PF prefix is a tribute to Philippe Flajolet, who died in 2011, four years after publishing the paper that named the algorithm.

What I Actually Took From This

There’s a category of algorithms where the core idea is so simple that it seems like it can’t possibly work, and the whole experience of learning it is going from skepticism to surprise to understanding. Counting distinct elements by tracking coin-flip statistics is in that category.

The thing I keep thinking about is the memory arithmetic. A set() in Python holding 100,000 integers uses roughly 4MB. The HyperLogLog above used 1KB and got within 2%. The set uses 4,000x more memory for exact precision. Whether that precision is worth 4,000x the memory depends entirely on what you're building.

For a unique user counter that needs to answer in real time, served from Redis, on billions of events, it’s not worth it. For a billing system that needs to charge per unique user to the cent, it might be. Knowing which situation you’re in is the actual engineering skill. HyperLogLog is just a tool. A very elegant one.

This article is rewritten using AI chatbots.