DEV Community

SEN LLC
SEN LLC

Posted on

Building a HyperLogLog CLI in Go — One Billion Distinct in 16 KB, a Harmonic Mean, a Hash Finalizer, and Mergeable Sketches

A HyperLogLog cardinality estimator as a Go CLI. It counts the number of distinct items in a massive stream using ~16 KB of fixed memory. The hinges: (1) it records the leading-zero run of each hash into m registers and combines them with a harmonic mean, so one lucky outlier can't blow up the estimate; (2) FNV-1a's weak low bits will wreck the estimate on structured input unless you run a finalizer mixer; (3) two sketches merge with zero loss of accuracy by taking the per-register max. Part two of a probabilistic-data-structure series after #274's Bloom filter.

📦 GitHub: https://github.com/sen-ltd/hyperloglog-cli

Screenshot

What HyperLogLog is

It answers "how many distinct values are in this stream?" Counting exactly means
remembering every key you've seen — a hash set whose memory grows with the answer.
HyperLogLog gives up exactness to make memory constant: count a billion
distinct values to ~1% accuracy in 16 KB, regardless of stream size.

It's what you want for "unique visitors", "distinct IPs", "distinct search
terms" — anywhere an approximate count in a single pass beats an exact count you
can't afford to keep in memory.

Here's the CLI, real output:

$ hll count --from visitors.txt --exact
lines read     100000
estimate       28547
registers      m=16384 (p=14), 16 KB, std error ~0.81%
exact distinct 28905
error          -1.24%
Enter fullscreen mode Exit fullscreen mode

100,000 lines (28,905 truly distinct) estimated as 28,547 from 16 KB of
registers — with not a single key stored.

The idea: rare hash patterns leak scale

Hash each item to 64 random-looking bits. A hash that starts with k zeros is
about as likely as flipping k heads in a row, so spotting one hints you've seen
~2ᵏ distinct values. Tracking a single longest run would be wildly noisy, so HLL
takes the first p bits of the hash to pick a register (bucket), stores the
rank (leading-zero run + 1) of the rest in that bucket, and combines all
m = 2^p registers.

func (h *HLL) Add(item string) bool {
    x := hash(item)
    idx := x >> (64 - h.p) // top p bits select the register
    r := rank(x, h.p)      // leading-zero run of the rest, +1
    if r > h.registers[idx] {
        h.registers[idx] = r
        return true
    }
    return false
}
Enter fullscreen mode Exit fullscreen mode

Hinge 1: a harmonic mean over m buckets

The estimate combines every register:

est := alpha(m) * m * m / sum   // sum = Σ 2^(-register)
Enter fullscreen mode Exit fullscreen mode

Summing 2^(-rank) is a harmonic mean in disguise — buckets that saw long
zero-runs contribute little, so a single lucky outlier can't blow up the estimate
the way an arithmetic mean would. Averaging over m registers drops the relative
standard error to 1.04/√m: at p = 14 that's m = 16384 registers and
~0.81% error, for 16 KB of state (one byte per register).

At low cardinality most registers are still empty and the raw formula is biased,
so HLL hands off to linear counting:

if est <= 2.5*m && zeros > 0 {
    return m * math.Log(m/float64(zeros)) // linear counting
}
Enter fullscreen mode Exit fullscreen mode

TestSmallRangeLinearCounting pins that 50 distinct items estimate to within 3.

Hinge 2: why the hash needs a finalizer

We take the register from the high bits and the rank from the rest. FNV-1a is
fast and dependency-free but has poor avalanche in its low bits — structured
inputs like user00001, user00002 share low-bit structure, which biases bucket
assignment and rank together and drifts the estimate. One MurmurHash3 fmix64
finalizer scrambles all 64 bits so they look independent:

func hash(item string) uint64 {
    f := fnv.New64a()
    _, _ = f.Write([]byte(item))
    x := f.Sum64()
    x ^= x >> 33
    x *= 0xff51afd7ed558ccd
    x ^= x >> 33
    x *= 0xc4ceb9fe1a85ec53
    x ^= x >> 33
    return x
}
Enter fullscreen mode Exit fullscreen mode

TestFinalizerHandlesStructuredInput feeds 100k highly structured keys and
asserts the estimate still lands within tolerance.

Because we hash to 64 bits, the original 2007 paper's large-range correction
which existed only to undo collisions as a 32-bit space fills near 2³² — is
unnecessary, and omitted.

The signature property: merge

Two sketches union by taking the per-register maximum:

func (h *HLL) Merge(other *HLL) error {
    if h.p != other.p {
        return fmt.Errorf("precision mismatch: %d vs %d", h.p, other.p)
    }
    for i, r := range other.registers {
        if r > h.registers[i] {
            h.registers[i] = r
        }
    }
    return nil
}
Enter fullscreen mode Exit fullscreen mode

The result is exactly the sketch you'd have built by feeding both streams into
one — so distinct counts from different machines, days, or shards combine with
zero loss of accuracy. That's why HLL is a staple of log analytics and
distributed counting.

$ hll merge -o week.hll mon.hll tue.hll
merged 2 sketches → week.hll  union estimate=36248
Enter fullscreen mode Exit fullscreen mode

TestMergeEqualsCombinedStream pins this: merging two disjoint 30k-key sketches
reproduces the combined sketch register for register, and estimates the 60k-key
union to within tolerance.

Persistence

A sketch serializes to a compact blob — "HLL1" | p | registers — exactly
5 + 2^p bytes, the same footprint it occupies in memory. Hand-rolled, no codec.

hll/hll.go — sketch: hashing, rank, estimate, merge, (de)serialize (11 tests)
main.go    — CLI: count / add / estimate / merge / info, file or stdin
Enter fullscreen mode Exit fullscreen mode

Takeaway

HyperLogLog is the textbook case of "give up a little accuracy, get constant
memory." The keys are combining buckets with a harmonic mean so outliers
don't dominate, running a hash finalizer so structured input doesn't drift
the estimate, and the fact that sketches merge by max. Put next to a Bloom
filter (#274), you can see the shared design instinct of probabilistic data
structures: accept a bounded error to buy a tiny, fixed memory budget.

📦 GitHub: https://github.com/sen-ltd/hyperloglog-cli

Top comments (0)