Building a Count-Min Sketch CLI in Rust — Sizing from a Target Error, d Counters from Two Hashes, No Underestimates, and Mergeable Sketches

#rust #algorithms #datastructures #cli

A Count-Min sketch as a Rust CLI. It answers "how many times have I seen X?" with fixed memory and an estimate that always rounds up. The hinges: (1) the table width w and depth d aren't guesses — they're determined by your error factor ε and failure probability δ; (2) you don't need d independent hashes — two are enough, via double hashing (Kirsch–Mitzenmacher); (3) it returns the minimum of its rows, so it can never underestimate — that's a guarantee, not a hope. Part three of a probabilistic-data-structure series after #274's Bloom filter (membership) and #275's HyperLogLog (cardinality) — this one is frequency, completing the trilogy.

📦 GitHub: https://github.com/sen-ltd/count-min-sketch

What a Count-Min sketch is

Where a Bloom filter answers "is X in the set?", a Count-Min sketch answers
"how many times have I seen X?" — approximately, and always rounding up:

the estimate is never below the true count. There are no underestimates.
it may overestimate, by a bounded amount, at a probability you choose.

In return you get a fixed-size table of counters, O(d) updates, and no stored
keys at all. That one-sided error — an honest floor, a hedged ceiling — is
the entire design. It's what you want for "top URLs", "heavy-hitter IPs", or
"trending queries" over a stream too large to count exactly.

Here's the CLI, real output:

$ cms new -e 0.0005 -d 0.01
created .cms  ε=0.0005  δ=0.01  w=5437  d=5  table=27185 counters (106 KiB)

$ cms add --from access.log
added 144000 occurrences  total N=144000  w=5437  d=5  error bound ε·N≈72

$ cms query / /login /api/search /favicon.ico /never-requested
/                 40002      (exact 40000)
/login            18002      (exact 18000)
/api/search       12005      (exact 12000)
/favicon.ico      9001       (exact  9000)
/never-requested  3          (exact     0)

144k log lines summarized in a 106 KiB counter table, heavy hitters estimated to
within a handful (overshoot only, inside the guaranteed ε·N ≈ 72), no keys kept.

Hinge 1: sizing from a target error and confidence

You don't pick the width w and depth d by feel. Given an error factor ε
and a failure probability δ, they're determined:

pub fn optimal_params(epsilon: f64, delta: f64) -> (usize, usize) {
    let w = (std::f64::consts::E / epsilon).ceil();   // columns
    let d = (1.0 / delta).ln().ceil();                // rows
    (w as usize, d as usize)
}

The guarantee they buy: an estimate overshoots the true count by at most ε·N
(N = total of all counts) with probability ≥ 1−δ. A test pins the exact
numbers so a refactor can't silently drift the math:

#[test]
fn optimal_params_match_formula() {
    // ε=0.001, δ=0.01 → w = ceil(e/0.001) = 2719, d = ceil(ln 100) = 5
    let (w, d) = CountMinSketch::optimal_params(0.001, 0.01);
    assert_eq!(w, 2719);
    assert_eq!(d, 5);
}

Hinge 2: d counters from two hashes (Kirsch–Mitzenmacher)

Each item needs one column per row — d independent hashes. Computing five
independent hashes per item would be wasteful, and unnecessary. Two base hashes
are enough:

fn columns(&self, item: &str) -> Vec<usize> {
    let h1 = fnv1a_64(item.as_bytes());
    let h2 = djb2_64(item.as_bytes()) | 1; // force odd so it never collapses
    (0..self.d)
        .map(|i| (h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.w as u64) as usize)
        .collect()
}

Two cheap hashes (FNV-1a and DJB2), d columns. h2 is forced odd so the step
never degenerates to stamping the same column in every row — the same trick as

274's Bloom filter.

Hinge 3: take the minimum — and why it can't underestimate

add bumps the item's counter in every row. Collisions from other items only
ever add to a counter, so the true count sits at or below every row's value.
estimate returns the minimum across rows — the tightest upper bound the
table can give:

pub fn estimate(&self, item: &str) -> u32 {
    self.columns(item).into_iter().enumerate()
        .map(|(row, col)| self.counters[row * self.w + col])
        .min().unwrap_or(0)
}

It's a property, not a hope, and it's tested directly: insert 5000 items with
known true counts and assert estimate ≥ true for every one
(never_underestimates). A second test (overestimate_within_bound) confirms a
heavy hitter among 100k items lands inside the ε·N envelope.

The signature property: merge

Count-Min sketches are linear — two sketches of identical dimensions union by
summing counters element-wise:

pub fn merge(&mut self, other: &CountMinSketch) -> Result<(), CmsError> { /* a[i] += b[i] */ }

The result is exactly the sketch you'd have built from both streams combined,
so per-shard or per-day counts merge with no loss. merge_equals_combined_stream
pins this — a nice mirror of HyperLogLog's (#275) merge-by-max.

Persistence

State is a compact binary blob — "CMS1" | w | d | total | counters — written
hand-rolled, no serde:

src/cms.rs  — sketch: sizing, double hashing, estimate, merge, (de)serialize (12 tests)
src/main.rs — clap CLI: new / add / query / merge / info, file or stdin

Takeaway

The probabilistic trilogy is complete: Bloom = membership (#274),
HyperLogLog = cardinality (#275), Count-Min = frequency (this one). All
share the same instinct — push the error to one side and accept it to buy a tiny,
fixed memory budget. In Count-Min that shows up as a one-sided "never
underestimate" error and a min-of-rows upper bound, with ε/δ as direct
dials on the memory-vs-accuracy trade.

📦 GitHub: https://github.com/sen-ltd/count-min-sketch