A Count-Min sketch as a Rust CLI. It answers "how many times have I seen X?" with fixed memory and an estimate that always rounds up. The hinges: (1) the table width
wand depthdaren't guesses — they're determined by your error factorεand failure probabilityδ; (2) you don't needdindependent hashes — two are enough, via double hashing (Kirsch–Mitzenmacher); (3) it returns the minimum of its rows, so it can never underestimate — that's a guarantee, not a hope. Part three of a probabilistic-data-structure series after #274's Bloom filter (membership) and #275's HyperLogLog (cardinality) — this one is frequency, completing the trilogy.
📦 GitHub: https://github.com/sen-ltd/count-min-sketch
What a Count-Min sketch is
Where a Bloom filter answers "is X in the set?", a Count-Min sketch answers
"how many times have I seen X?" — approximately, and always rounding up:
- the estimate is never below the true count. There are no underestimates.
- it may overestimate, by a bounded amount, at a probability you choose.
In return you get a fixed-size table of counters, O(d) updates, and no stored
keys at all. That one-sided error — an honest floor, a hedged ceiling — is
the entire design. It's what you want for "top URLs", "heavy-hitter IPs", or
"trending queries" over a stream too large to count exactly.
Here's the CLI, real output:
$ cms new -e 0.0005 -d 0.01
created .cms ε=0.0005 δ=0.01 w=5437 d=5 table=27185 counters (106 KiB)
$ cms add --from access.log
added 144000 occurrences total N=144000 w=5437 d=5 error bound ε·N≈72
$ cms query / /login /api/search /favicon.ico /never-requested
/ 40002 (exact 40000)
/login 18002 (exact 18000)
/api/search 12005 (exact 12000)
/favicon.ico 9001 (exact 9000)
/never-requested 3 (exact 0)
144k log lines summarized in a 106 KiB counter table, heavy hitters estimated to
within a handful (overshoot only, inside the guaranteed ε·N ≈ 72), no keys kept.
Hinge 1: sizing from a target error and confidence
You don't pick the width w and depth d by feel. Given an error factor ε
and a failure probability δ, they're determined:
pub fn optimal_params(epsilon: f64, delta: f64) -> (usize, usize) {
let w = (std::f64::consts::E / epsilon).ceil(); // columns
let d = (1.0 / delta).ln().ceil(); // rows
(w as usize, d as usize)
}
The guarantee they buy: an estimate overshoots the true count by at most ε·N
(N = total of all counts) with probability ≥ 1−δ. A test pins the exact
numbers so a refactor can't silently drift the math:
#[test]
fn optimal_params_match_formula() {
// ε=0.001, δ=0.01 → w = ceil(e/0.001) = 2719, d = ceil(ln 100) = 5
let (w, d) = CountMinSketch::optimal_params(0.001, 0.01);
assert_eq!(w, 2719);
assert_eq!(d, 5);
}
Hinge 2: d counters from two hashes (Kirsch–Mitzenmacher)
Each item needs one column per row — d independent hashes. Computing five
independent hashes per item would be wasteful, and unnecessary. Two base hashes
are enough:
fn columns(&self, item: &str) -> Vec<usize> {
let h1 = fnv1a_64(item.as_bytes());
let h2 = djb2_64(item.as_bytes()) | 1; // force odd so it never collapses
(0..self.d)
.map(|i| (h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.w as u64) as usize)
.collect()
}
Two cheap hashes (FNV-1a and DJB2), d columns. h2 is forced odd so the step
never degenerates to stamping the same column in every row — the same trick as
274's Bloom filter.
Hinge 3: take the minimum — and why it can't underestimate
add bumps the item's counter in every row. Collisions from other items only
ever add to a counter, so the true count sits at or below every row's value.
estimate returns the minimum across rows — the tightest upper bound the
table can give:
pub fn estimate(&self, item: &str) -> u32 {
self.columns(item).into_iter().enumerate()
.map(|(row, col)| self.counters[row * self.w + col])
.min().unwrap_or(0)
}
It's a property, not a hope, and it's tested directly: insert 5000 items with
known true counts and assert estimate ≥ true for every one
(never_underestimates). A second test (overestimate_within_bound) confirms a
heavy hitter among 100k items lands inside the ε·N envelope.
The signature property: merge
Count-Min sketches are linear — two sketches of identical dimensions union by
summing counters element-wise:
pub fn merge(&mut self, other: &CountMinSketch) -> Result<(), CmsError> { /* a[i] += b[i] */ }
The result is exactly the sketch you'd have built from both streams combined,
so per-shard or per-day counts merge with no loss. merge_equals_combined_stream
pins this — a nice mirror of HyperLogLog's (#275) merge-by-max.
Persistence
State is a compact binary blob — "CMS1" | w | d | total | counters — written
hand-rolled, no serde:
src/cms.rs — sketch: sizing, double hashing, estimate, merge, (de)serialize (12 tests)
src/main.rs — clap CLI: new / add / query / merge / info, file or stdin
Takeaway
The probabilistic trilogy is complete: Bloom = membership (#274),
HyperLogLog = cardinality (#275), Count-Min = frequency (this one). All
share the same instinct — push the error to one side and accept it to buy a tiny,
fixed memory budget. In Count-Min that shows up as a one-sided "never
underestimate" error and a min-of-rows upper bound, with ε/δ as direct
dials on the memory-vs-accuracy trade.

Top comments (0)