A HyperLogLog cardinality estimator as a Go CLI. It counts the number of distinct items in a massive stream using ~16 KB of fixed memory. The hinges: (1) it records the leading-zero run of each hash into
mregisters and combines them with a harmonic mean, so one lucky outlier can't blow up the estimate; (2) FNV-1a's weak low bits will wreck the estimate on structured input unless you run a finalizer mixer; (3) two sketches merge with zero loss of accuracy by taking the per-register max. Part two of a probabilistic-data-structure series after #274's Bloom filter.
📦 GitHub: https://github.com/sen-ltd/hyperloglog-cli
What HyperLogLog is
It answers "how many distinct values are in this stream?" Counting exactly means
remembering every key you've seen — a hash set whose memory grows with the answer.
HyperLogLog gives up exactness to make memory constant: count a billion
distinct values to ~1% accuracy in 16 KB, regardless of stream size.
It's what you want for "unique visitors", "distinct IPs", "distinct search
terms" — anywhere an approximate count in a single pass beats an exact count you
can't afford to keep in memory.
Here's the CLI, real output:
$ hll count --from visitors.txt --exact
lines read 100000
estimate 28547
registers m=16384 (p=14), 16 KB, std error ~0.81%
exact distinct 28905
error -1.24%
100,000 lines (28,905 truly distinct) estimated as 28,547 from 16 KB of
registers — with not a single key stored.
The idea: rare hash patterns leak scale
Hash each item to 64 random-looking bits. A hash that starts with k zeros is
about as likely as flipping k heads in a row, so spotting one hints you've seen
~2ᵏ distinct values. Tracking a single longest run would be wildly noisy, so HLL
takes the first p bits of the hash to pick a register (bucket), stores the
rank (leading-zero run + 1) of the rest in that bucket, and combines all
m = 2^p registers.
func (h *HLL) Add(item string) bool {
x := hash(item)
idx := x >> (64 - h.p) // top p bits select the register
r := rank(x, h.p) // leading-zero run of the rest, +1
if r > h.registers[idx] {
h.registers[idx] = r
return true
}
return false
}
Hinge 1: a harmonic mean over m buckets
The estimate combines every register:
est := alpha(m) * m * m / sum // sum = Σ 2^(-register)
Summing 2^(-rank) is a harmonic mean in disguise — buckets that saw long
zero-runs contribute little, so a single lucky outlier can't blow up the estimate
the way an arithmetic mean would. Averaging over m registers drops the relative
standard error to 1.04/√m: at p = 14 that's m = 16384 registers and
~0.81% error, for 16 KB of state (one byte per register).
At low cardinality most registers are still empty and the raw formula is biased,
so HLL hands off to linear counting:
if est <= 2.5*m && zeros > 0 {
return m * math.Log(m/float64(zeros)) // linear counting
}
TestSmallRangeLinearCounting pins that 50 distinct items estimate to within 3.
Hinge 2: why the hash needs a finalizer
We take the register from the high bits and the rank from the rest. FNV-1a is
fast and dependency-free but has poor avalanche in its low bits — structured
inputs like user00001, user00002 share low-bit structure, which biases bucket
assignment and rank together and drifts the estimate. One MurmurHash3 fmix64
finalizer scrambles all 64 bits so they look independent:
func hash(item string) uint64 {
f := fnv.New64a()
_, _ = f.Write([]byte(item))
x := f.Sum64()
x ^= x >> 33
x *= 0xff51afd7ed558ccd
x ^= x >> 33
x *= 0xc4ceb9fe1a85ec53
x ^= x >> 33
return x
}
TestFinalizerHandlesStructuredInput feeds 100k highly structured keys and
asserts the estimate still lands within tolerance.
Because we hash to 64 bits, the original 2007 paper's large-range correction —
which existed only to undo collisions as a 32-bit space fills near 2³² — is
unnecessary, and omitted.
The signature property: merge
Two sketches union by taking the per-register maximum:
func (h *HLL) Merge(other *HLL) error {
if h.p != other.p {
return fmt.Errorf("precision mismatch: %d vs %d", h.p, other.p)
}
for i, r := range other.registers {
if r > h.registers[i] {
h.registers[i] = r
}
}
return nil
}
The result is exactly the sketch you'd have built by feeding both streams into
one — so distinct counts from different machines, days, or shards combine with
zero loss of accuracy. That's why HLL is a staple of log analytics and
distributed counting.
$ hll merge -o week.hll mon.hll tue.hll
merged 2 sketches → week.hll union estimate=36248
TestMergeEqualsCombinedStream pins this: merging two disjoint 30k-key sketches
reproduces the combined sketch register for register, and estimates the 60k-key
union to within tolerance.
Persistence
A sketch serializes to a compact blob — "HLL1" | p | registers — exactly
5 + 2^p bytes, the same footprint it occupies in memory. Hand-rolled, no codec.
hll/hll.go — sketch: hashing, rank, estimate, merge, (de)serialize (11 tests)
main.go — CLI: count / add / estimate / merge / info, file or stdin
Takeaway
HyperLogLog is the textbook case of "give up a little accuracy, get constant
memory." The keys are combining buckets with a harmonic mean so outliers
don't dominate, running a hash finalizer so structured input doesn't drift
the estimate, and the fact that sketches merge by max. Put next to a Bloom
filter (#274), you can see the shared design instinct of probabilistic data
structures: accept a bounded error to buy a tiny, fixed memory budget.

Top comments (0)