1. Introduction: Starting with a Classic Interview Question
In the realm of massive data processing, there is a classic interview question that has been asked ad nauseam:
"Given 1 billion IP addresses and a memory limit of only 1MB, how do you calculate the number of unique IPs (Cardinality Estimation)?"
If you attempt to use std::set or a HashSet, simply storing 1 billion IPs will exhaust your memory many times over. Even a Bitmap over the full IPv4 space needs 2^32 bits ≈ 512 MB, hundreds of times the 1 MB budget.
Today, let's talk about HyperLogLog (HLL). It is a probabilistic algorithm that can estimate the cardinality of datasets in the billions with a standard error of less than 2%, all while using just a few kilobytes of memory.
2. The Intuition: The "Black Magic" of Coin Flipping
Imagine you are playing a simple coin-tossing game: Tails is 0, Heads is 1. The rule is simple: you keep flipping the coin until you get the first Heads (1), at which point this round of the game ends.
According to probability theory, we observe the following phenomena:
- The probability of getting Heads on the 1st try is 1/2.
- The probability of getting Heads after 2 consecutive Tails is 1/8.
- The probability of getting Heads after n consecutive Tails is 1 / 2^(n+1).
There is a core logic hidden here: If you observe an extremely rare sequence—for example, flipping 20 consecutive Tails before seeing a Heads (a probability of roughly one in a million)—your intuition tells you: to encounter such a "stroke of luck," you must have been silently attempting this millions of times in the background.
We transform this intuition into an inference formula:
If the maximum length of "consecutive Tails" observed in our experiment is n, we can boldly speculate that approximately 2^(n+1) rounds of experiments have been conducted.
Note: This is a "reasonable estimate," not an exact value. Just as winning the lottery once doesn't prove you are a billionaire, HyperLogLog uses this probability distribution to make an approximate estimation.
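To make this intuition tangible before we touch any real data, here is a tiny self-contained simulation (a sketch in plain C; `rand()` stands in for the coin and is assumed to be fair enough for a demo). It plays the game many times, tracks the longest run of Tails, and prints the 2^(n+1) guess next to the true number of rounds.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    srand((unsigned)time(NULL));
    const long rounds = 1L << 20;            // play about one million rounds
    int max_run = 0;                         // longest run of Tails seen so far
    for (long i = 0; i < rounds; i++) {
        int run = 0;
        while ((rand() & 1) == 0) run++;     // keep flipping until the first Heads (1)
        if (run > max_run) max_run = run;
    }
    printf("Rounds actually played : %ld\n", rounds);
    printf("Longest run of Tails   : %d\n", max_run);
    printf("Guess 2^(n+1)          : %llu\n", 1ULL << (max_run + 1));
    return 0;
}
```

The guess typically lands within a small factor of the true count, never exactly on it; that residual wobble is precisely what the bucketing and averaging machinery later in this article is designed to tame.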
3. From Coin Flips to Bitstreams: The Mathematical Foundation
In the computer world, everything is bits (0s and 1s). We can cleverly view the binary representation of data as the result of a series of "coin flips": 0 represents Tails, 1 represents Heads.
We define the rule: Scan from right to left (low bit to high bit) until the first 1 (Heads) is encountered. This ends the "game."
- Sequence ...01: Heads on the first try. Game over. Value = 1.
- Sequence ...010: First Tails, second Heads. Value = 2. (Note: bits after the first 1 are ignored in this probabilistic model.)
- Sequence ...1000: Three Tails, then Heads. Value = 4.
- Sequence ...1 followed by n 0s: Indicates n consecutive Tails.
Now, let's solve a probability problem: In a randomly distributed bitstream, what is the probability of needing n consecutive 0s to end the game? Each of the n low bits must be 0 (probability (1/2)^n) and the next bit must be 1 (probability 1/2), so the answer is 1 / 2^(n+1).
Reverse Thinking: Suppose I have a pile of binary strings, and the longest run of consecutive zeros I observe is n. Based on our previous conclusion, we can infer: to stumble upon this "miracle," I must have looked at approximately 2^(n+1) distinct binary strings.
This is the core principle of HyperLogLog:
- Observe Extremes: Instead of recording the data itself, we only record the maximum run of trailing zeros observed across all data (the code stores this as ρ, the position of the first 1, i.e., the run length plus one).
- Infer Cardinality: Use this maximum length to back-calculate the number of unique elements (cardinality).
TL;DR: If you see n consecutive 0s, it suggests that approximately 2^(n+1) distinct items have passed through.
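In code, the "Value" column above is nothing more than the trailing-zero count plus one. Here is a minimal sketch (GCC/Clang only, since it uses the `__builtin_ctzll` intrinsic that the full implementation in Section 5 also relies on):

```c
#include <stdio.h>
#include <stdint.h>

// 1-based position of the first '1', scanning from the lowest bit.
static int first_one_position(uint64_t bits) {
    if (bits == 0) return 0;                  // no Heads at all in this stream
    return __builtin_ctzll(bits) + 1;
}

int main(void) {
    printf("...01   -> %d\n", first_one_position(0x1)); // 1: Heads on the first flip
    printf("...010  -> %d\n", first_one_position(0x2)); // 2: one Tails, then Heads
    printf("...1000 -> %d\n", first_one_position(0x8)); // 4: three Tails, then Heads
    return 0;
}
```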
4. Engineering Challenges: From Theory to Reality
4.1 Challenge 1: Breaking Patterns (Why We Need Hashing)
The mathematical derivation above relies on a "Strong Assumption": the probability of 0 and 1 appearing in the bitstream must be strictly 1:1, and bits must be independent. In other words, it must be a "perfectly fair coin."
However, in real-world engineering, raw data is often extremely "unfair." Imagine counting a sequence of User IDs: 1000, 1001, 1002... Their binary representations are almost identical in their long prefixes:
- 1000 -> ...001111101000
- 1001 -> ...001111101001
If the raw IDs happen to end in long runs of zeros (for example, round values like 1024, whose binary form ends in ten zeros), HLL will hallucinate, believing it has observed a "rare event" with a tiny probability, leading to an astronomically incorrect estimate.
To fix this loophole, we must introduce a Hash Function.
While we won't dive deep into hash implementation here, remember its core mission: to "scatter" patterned data into a uniformly distributed, randomly perturbed bitstream. In the world of HyperLogLog, hashing is not for encryption, but to ensure every "coin" we flip is "fair."
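To see the "scattering" effect concretely, the sketch below pushes a few consecutive IDs through the well-known splitmix64 finalizer (used here purely as a stand-in mixer; the article's own hash_64 in Section 5 would illustrate the same point) and prints the trailing-zero count of each result.

```c
#include <stdio.h>
#include <stdint.h>

// splitmix64 finalizer: scatters patterned input into uniformly distributed bits
static uint64_t mix64(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

int main(void) {
    // Raw IDs 1000..1003 share a long binary prefix; their hashes look unrelated.
    for (uint64_t id = 1000; id < 1004; id++) {
        uint64_t h = mix64(id);
        printf("id %4llu -> hash %016llx, trailing zeros: %d\n",
               (unsigned long long)id, (unsigned long long)h,
               h ? __builtin_ctzll(h) : 64);
    }
    return 0;
}
```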
4.2 Challenge 2: Eliminating the "Outlier" Bias
Even with a fair coin, probability still allows for extreme cases. What if you are incredibly lucky and flip 30 consecutive 0s on your very first try?
The algorithm would naively conclude that roughly two billion (2^31) items have passed through, when in reality you inserted just one. This deviation caused by a single point of failure is catastrophic in statistics.
To bring the results back to rationality, HyperLogLog introduces two major "noise reduction" tools: Bucketing and the Harmonic Mean.
4.2.1 Bucketing: Don't Put All Your Eggs in One Basket
Since one person's "luck" is unreliable, we look at the average level of a crowd.
We split the 64-bit hash value into two parts:
- b bits (Bucket Index): Determine which "room" the data goes to. If b = 10, we have m = 2^10 = 1024 independent buckets (Registers). (The implementation in Section 5 takes these from the low 10 bits of the hash.)
- Remaining 54 bits (Coin Flip): Used to record the maximum run of consecutive trailing zeros seen within that specific bucket.
This way, one "giant guess" is broken down into 1024 "independent small guesses." Even if one bucket records an anomalously high value due to sheer luck, it only affects 1/1024th of the decision weight, preventing it from skewing the overall picture.
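Concretely, with b = 10 the split is one mask and one shift. The snippet below uses a hand-picked hash value so the arithmetic is easy to follow; the field layout (low 10 bits for the bucket, the remaining 54 bits for the coin flips) matches the implementation in Section 5.

```c
#include <stdio.h>
#include <stdint.h>

#define B 10   // precision: 2^10 = 1024 buckets

int main(void) {
    uint64_t hash   = (1ULL << 20) | 5;           // hand-picked for readability
    uint64_t bucket = hash & ((1ULL << B) - 1);   // low 10 bits  -> bucket 5
    uint64_t rest   = hash >> B;                  // high 54 bits -> 1024 (a 1 followed by ten 0s)
    int rho = __builtin_ctzll(rest) + 1;          // ten trailing zeros -> rho = 11
    printf("bucket = %llu, rho = %d\n", (unsigned long long)bucket, rho);
    return 0;
}
```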
4.2.2 Harmonic Mean: A Dimensional Strike on Outliers
When summarizing the results of these 1024 buckets, using a simple Arithmetic Mean is dangerous.
Example:
Suppose 1023 buckets hold the value 1, but one bucket, thanks to extreme luck, holds 100.
- Arithmetic Mean of the register values: (1 * 1023 + 100) / 1024 ≈ 1.1. That looks stable, but the final estimate is exponential in R (each bucket effectively contributes 2^R), so even this small upward pull in the exponent inflates the count; and if you averaged the per-bucket estimates 2^R_i directly, the lone 2^100 term would swallow the other 1023 buckets entirely.
To solve this, HLL uses the Harmonic Mean of the per-bucket estimates:
H = m / Σ(1 / 2^R_i)
(Where m is the number of buckets and R_i is the value in bucket i. The final estimate then scales this harmonic mean by m and by the correction constant α_m introduced below: E = α_m * m^2 / Σ(1 / 2^R_i), which is exactly the formula used in the code.)
The Harmonic Mean has a crucial property: It is very insensitive to large outliers (the "nouveau riche") but sensitive to small values.
By doing this, extreme values generated by accidental "good luck" are effectively smoothed out, allowing the estimate to land firmly near the true cardinality.
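Plugging the 1023-buckets-of-1-plus-one-bucket-of-100 example into both kinds of mean shows the difference starkly (a small numeric sketch; 2^100 comfortably fits in a double):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    const int m = 1024;
    double arith_sum = 0.0, harm_sum = 0.0;
    for (int i = 0; i < m; i++) {
        int R = (i == 0) ? 100 : 1;      // one freakishly lucky bucket
        arith_sum += pow(2.0, R);        // summing the per-bucket estimates directly
        harm_sum  += 1.0 / pow(2.0, R);  // denominator of the harmonic mean
    }
    printf("Arithmetic mean of 2^R_i: %.3e\n", arith_sum / m);  // ~1.2e27, ruined by one bucket
    printf("Harmonic  mean of 2^R_i: %.3f\n",  m / harm_sum);   // ~2.002, barely disturbed
    return 0;
}
```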
4.2.3 Correction Parameters: Eliminating "Inherent Bias"
Even with Bucketing and Harmonic Mean, mathematicians discovered that the raw HLL formula still has a systematic inherent bias.
Simply put, the algorithm logically tends to "overestimate" or "underestimate" the true cardinality depending on the number of buckets m.
To make the estimate approximate the true value as closely as possible, HLL introduces a correction constant α_m. This constant varies with bucket precision:
- When m = 16, α_m = 0.673
- When m = 32, α_m = 0.697
- When m ≥ 128, α_m stabilizes at: 0.7213 / (1 + 1.079 / m)
The number 0.7213 / (1 + 1.079 / m) you see in the code acts like the final "Zero Calibration" on a precision scale. Without it, the entire error curve of HLL would be shifted.
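If you want the constant to follow the precision instead of hard-coding it, a small helper can encode the table above (a sketch; note the m = 64 value of 0.709 comes from the original HyperLogLog paper and is not listed above, and the implementation below simply inlines the m ≥ 128 formula because it fixes m = 1024):

```c
#include <stddef.h>

// Bias-correction constant alpha_m for a given bucket count m.
double hll_alpha(size_t m) {
    switch (m) {
        case 16: return 0.673;
        case 32: return 0.697;
        case 64: return 0.709;   // from the original HLL paper (not listed above)
        default: return 0.7213 / (1.0 + 1.079 / (double)m);   // m >= 128
    }
}
```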
5. Concrete Implementation
When implementing HyperLogLog, we need to focus on two engineering details: Efficient Bitwise Operations and Small Range Correction.
Core Implementation Highlights:
- Hardware Acceleration: We use the GCC builtin __builtin_ctzll. This invokes CPU-level instructions (like BSF or TZCNT on x86) to count trailing zeros, which is significantly faster than writing a loop.
- Memory Efficiency: Each Register only needs 8 bits (uint8_t) because the run of consecutive zeros in a 64-bit integer cannot exceed 64. With 1024 buckets, the entire data structure takes up just 1 KB of memory.
- Linear Counting Correction: Probabilistic algorithms have higher error rates when data volume is extremely small (buckets aren't full yet). Therefore, when the estimate is low, we switch to Linear Counting, using the ratio of empty buckets to calculate cardinality, ensuring accuracy across the full range.
```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#include <string.h>

// Precision parameter b (number of buckets = 2^10 = 1024)
#define HLL_PRECISION 10
#define HLL_REGISTERS (1 << HLL_PRECISION)
#define HLL_MASK (HLL_REGISTERS - 1)

typedef struct {
    uint8_t registers[HLL_REGISTERS];
} HyperLogLog;

/**
 * Simple 64-bit hash function (MurmurHash3-style mixing).
 * In production, use full MurmurHash3 or CityHash/FarmHash.
 */
uint64_t hash_64(const void *key, int len) {
    uint64_t h = 0x12345678abcdefULL;
    const uint8_t *data = (const uint8_t *)key;
    for (int i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0xbf58476d1ce4e5b9ULL;
        h ^= h >> 33;
    }
    return h;
}

/**
 * Get the position of the first '1' (the rho value):
 * discard the bucket bits, then count trailing zeros in what remains.
 */
uint8_t get_rho(uint64_t hash) {
    uint64_t v = hash >> HLL_PRECISION;
    if (v == 0) return 64 - HLL_PRECISION;
    // Count trailing zeros; +1 gives the 1-based position of the first '1'
    return (uint8_t)__builtin_ctzll(v) + 1;
}

// Initialize: zero out all registers
void hll_init(HyperLogLog *hll) {
    memset(hll->registers, 0, HLL_REGISTERS);
}

// Add an element: update the max rho value for the corresponding bucket
void hll_add(HyperLogLog *hll, const void *data, int len) {
    uint64_t hash = hash_64(data, len);
    int index = hash & HLL_MASK;   // Low bits select the bucket
    uint8_t rho = get_rho(hash);   // Remaining bits act as the coin flips
    if (rho > hll->registers[index]) {
        hll->registers[index] = rho;
    }
}

// Estimate cardinality: harmonic mean + bias correction
double hll_count(HyperLogLog *hll) {
    double m = HLL_REGISTERS;
    // Alpha_m correction coefficient (formula for m >= 128)
    double alpha_m = 0.7213 / (1 + 1.079 / m);
    double sum = 0.0;
    for (int i = 0; i < HLL_REGISTERS; i++) {
        // Accumulate 1 / 2^R[i]
        sum += 1.0 / (1ULL << hll->registers[i]);
    }
    double estimate = alpha_m * m * m / sum;

    // Small-range correction (Linear Counting):
    // if the estimate is <= 2.5 * m, correct using the ratio of empty buckets
    if (estimate <= 2.5 * m) {
        int zeros = 0;
        for (int i = 0; i < HLL_REGISTERS; i++) {
            if (hll->registers[i] == 0) zeros++;
        }
        if (zeros > 0) {
            estimate = m * log(m / (double)zeros);
        }
    }
    return estimate;
}

int main() {
    HyperLogLog hll;
    hll_init(&hll);

    printf("Simulating insertion of 100,000 unique elements...\n");
    for (int i = 0; i < 100000; i++) {
        char buf[32];
        sprintf(buf, "element_%d", i);
        hll_add(&hll, buf, strlen(buf));
    }

    double estimate = hll_count(&hll);
    printf("Actual Count: 100000\n");
    printf("Estimated Count: %.2f\n", estimate);
    printf("Error Rate: %.2f%%\n", fabs(estimate - 100000) / 100000 * 100);
    return 0;
}
```
Running Result:
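With m = 1024 buckets, the theoretical standard error of HyperLogLog is about 1.04 / √m ≈ 3.3%, so the estimate printed here should typically land within a few percent of the true 100,000. To try the sketch yourself, compile with GCC or Clang (required for __builtin_ctzll) and link the math library, e.g. `gcc hll.c -o hll -lm`.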
6. Conclusion: A Romantic Symphony of Math, Engineering, and AI
The somewhat eccentric name "HyperLogLog" actually inscribes a trilogy of algorithmic evolution, much like a grand symphony:
- Log: The core rhythm. The algorithm captures the bit-length ρ of consecutive zeros, mapping a massive cardinality to an elegant logarithmic scale.
- LogLog: The bridge. In 2003, Philippe Flajolet proved that one could glimpse the contours of massive data using minimal space.
- Hyper: The final movement. In 2007, the introduction of the harmonic mean and bias correction pushed precision to "Hyper" limits.
The key to solving engineering bottlenecks is often not "brute force computing," but profound "mathematical models." This philosophy of "governing the concrete with the abstract" continues to shine brightly today in AI parameter compression and high-efficiency retrieval.
From a random game of coin tossing to estimating the cardinality of billions of data points, HyperLogLog perfectly illustrates the ultimate romance of computer science: Using a square inch of memory to contain a universe of information; trading the "fuzziness" of probability for the "pinnacle" of performance.
A good engineer writes bug-free code; a distinguished engineer leverages the beauty of mathematics to touch the physical limits of the machine.



