Building a HyperLogLog CLI in Go — One Billion Distinct in 16 KB, a Harmonic Mean, a Hash Finalizer, and Mergeable Sketches

#go #algorithms #datastructures #cli

A HyperLogLog cardinality estimator as a Go CLI. It counts the number of distinct items in a massive stream using ~16 KB of fixed memory. The hinges: (1) it records the leading-zero run of each hash into m registers and combines them with a harmonic mean, so one lucky outlier can't blow up the estimate; (2) FNV-1a's weak low bits will wreck the estimate on structured input unless you run a finalizer mixer; (3) two sketches merge with zero loss of accuracy by taking the per-register max. Part two of a probabilistic-data-structure series after #274's Bloom filter.

📦 GitHub: https://github.com/sen-ltd/hyperloglog-cli

What HyperLogLog is

It answers "how many distinct values are in this stream?" Counting exactly means
remembering every key you've seen — a hash set whose memory grows with the answer.
HyperLogLog gives up exactness to make memory constant: count a billion
distinct values to ~1% accuracy in 16 KB, regardless of stream size.

It's what you want for "unique visitors", "distinct IPs", "distinct search
terms" — anywhere an approximate count in a single pass beats an exact count you
can't afford to keep in memory.

Here's the CLI, real output:

$ hll count --from visitors.txt --exact
lines read     100000
estimate       28547
registers      m=16384 (p=14), 16 KB, std error ~0.81%
exact distinct 28905
error          -1.24%

100,000 lines (28,905 truly distinct) estimated as 28,547 from 16 KB of
registers — with not a single key stored.

The idea: rare hash patterns leak scale

Hash each item to 64 random-looking bits. A hash that starts with k zeros is
about as likely as flipping k heads in a row, so spotting one hints you've seen
~2ᵏ distinct values. Tracking a single longest run would be wildly noisy, so HLL
takes the first p bits of the hash to pick a register (bucket), stores the
rank (leading-zero run + 1) of the rest in that bucket, and combines all
m = 2^p registers.

func (h *HLL) Add(item string) bool {
    x := hash(item)
    idx := x >> (64 - h.p) // top p bits select the register
    r := rank(x, h.p)      // leading-zero run of the rest, +1
    if r > h.registers[idx] {
        h.registers[idx] = r
        return true
    }
    return false
}

Hinge 1: a harmonic mean over m buckets

The estimate combines every register:

est := alpha(m) * m * m / sum   // sum = Σ 2^(-register)

Summing 2^(-rank) is a harmonic mean in disguise — buckets that saw long
zero-runs contribute little, so a single lucky outlier can't blow up the estimate
the way an arithmetic mean would. Averaging over m registers drops the relative
standard error to 1.04/√m: at p = 14 that's m = 16384 registers and
~0.81% error, for 16 KB of state (one byte per register).

At low cardinality most registers are still empty and the raw formula is biased,
so HLL hands off to linear counting:

if est <= 2.5*m && zeros > 0 {
    return m * math.Log(m/float64(zeros)) // linear counting
}

TestSmallRangeLinearCounting pins that 50 distinct items estimate to within 3.

Hinge 2: why the hash needs a finalizer

We take the register from the high bits and the rank from the rest. FNV-1a is
fast and dependency-free but has poor avalanche in its low bits — structured
inputs like user00001, user00002 share low-bit structure, which biases bucket
assignment and rank together and drifts the estimate. One MurmurHash3 fmix64
finalizer scrambles all 64 bits so they look independent:

func hash(item string) uint64 {
    f := fnv.New64a()
    _, _ = f.Write([]byte(item))
    x := f.Sum64()
    x ^= x >> 33
    x *= 0xff51afd7ed558ccd
    x ^= x >> 33
    x *= 0xc4ceb9fe1a85ec53
    x ^= x >> 33
    return x
}

TestFinalizerHandlesStructuredInput feeds 100k highly structured keys and
asserts the estimate still lands within tolerance.

Because we hash to 64 bits, the original 2007 paper's large-range correction —
which existed only to undo collisions as a 32-bit space fills near 2³² — is
unnecessary, and omitted.

The signature property: merge

Two sketches union by taking the per-register maximum:

func (h *HLL) Merge(other *HLL) error {
    if h.p != other.p {
        return fmt.Errorf("precision mismatch: %d vs %d", h.p, other.p)
    }
    for i, r := range other.registers {
        if r > h.registers[i] {
            h.registers[i] = r
        }
    }
    return nil
}

The result is exactly the sketch you'd have built by feeding both streams into
one — so distinct counts from different machines, days, or shards combine with
zero loss of accuracy. That's why HLL is a staple of log analytics and
distributed counting.

$ hll merge -o week.hll mon.hll tue.hll
merged 2 sketches → week.hll  union estimate=36248

TestMergeEqualsCombinedStream pins this: merging two disjoint 30k-key sketches
reproduces the combined sketch register for register, and estimates the 60k-key
union to within tolerance.

Persistence

A sketch serializes to a compact blob — "HLL1" | p | registers — exactly
5 + 2^p bytes, the same footprint it occupies in memory. Hand-rolled, no codec.

hll/hll.go — sketch: hashing, rank, estimate, merge, (de)serialize (11 tests)
main.go    — CLI: count / add / estimate / merge / info, file or stdin

Takeaway

HyperLogLog is the textbook case of "give up a little accuracy, get constant
memory." The keys are combining buckets with a harmonic mean so outliers
don't dominate, running a hash finalizer so structured input doesn't drift
the estimate, and the fact that sketches merge by max. Put next to a Bloom
filter (#274), you can see the shared design instinct of probabilistic data
structures: accept a bounded error to buy a tiny, fixed memory budget.

📦 GitHub: https://github.com/sen-ltd/hyperloglog-cli