SEN LLC

Posted on Jun 26

Building a Bloom Filter CLI in Rust — Sizing from a Target FP Rate, k Hashes from Two, and Guaranteed No False Negatives

#rust #algorithms #datastructures #cli

A Bloom filter as a Rust CLI. It answers "is X in the set?" with one of two replies: "definitely absent" or "probably present." The hinges: (1) the bit count m and hash count k aren't guesses — they're determined by your item count n and target false-positive rate p; (2) you don't need k independent hashes — two are enough, via double hashing (Kirsch–Mitzenmacher); (3) there are no false negatives, and that's a guarantee, not a hope. First Rust entry since #272 dijkstra.

📦 GitHub: https://github.com/sen-ltd/bloom-filter-cli

What a Bloom filter is

It answers "is X in the set?" with exactly two possible replies:

"definitely absent" — always correct. There are never false negatives.
"probably present" — wrong some of the time (a false positive), at a rate you choose.

In return you get a tiny, fixed-size bit array, O(k) lookups, and no stored keys at all. That asymmetry — a confident no, a hedged yes — is the entire design. It's what you want when an occasional wrong "yes" is acceptable in exchange for a fraction of the memory: "have I seen this URL?", "is this in the spam set?", a cache admission filter.

Here's the CLI, real output:

$ bloom new --capacity 1000 --fp 0.01
created .bloom  capacity=1000  fp-target=0.01  m=9586 bits (2 KiB)  k=7

$ bloom add --from words.txt
added 18 new (18 seen)  total=18  m=9586 bits  k=7  fp≈0.000%

$ bloom check apple pineapple dragonfruit
apple        probably present
pineapple    probably present
dragonfruit  definitely absent

Hinge 1: sizing from a target false-positive rate

You don't pick m (bits) and k (hashes) by feel. Given item count n and target false-positive rate p, both fall out of the math:

m = ceil( -(n · ln p) / (ln 2)^2 )
k = round( (m / n) · ln 2 )

pub fn optimal_params(capacity: u64, fp: f64) -> (usize, u32) {
    let n = capacity.max(1) as f64;
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n * fp.ln()) / (ln2 * ln2)).ceil();   // bits
    let k = ((m / n) * ln2).round();                 // hashes
    (m as usize, k as u32)
}

For n = 1000, p = 0.01 that's m = 9586 bits (~1.2 KB) and k = 7. Pin the exact numbers so a refactor can't silently drift the formula:

#[test]
fn optimal_params_match_formula() {
    let (m, k) = BloomFilter::optimal_params(1000, 0.01);
    assert_eq!(m, 9586); // ceil(-(1000·ln 0.01)/(ln 2)^2)
    assert_eq!(k, 7);    // round((m/n)·ln 2)
}

Intuitively, a tighter p needs more bits — also tested (smaller_fp_needs_more_bits).

Hinge 2: k hashes from two (Kirsch–Mitzenmacher)

Each item needs k bit positions, i.e. k independent hashes. Computing seven independent hashes per item would be wasteful — and isn't necessary. Kirsch & Mitzenmacher showed two base hashes suffice:

g_i(x) = h1(x) + i · h2(x)   (mod m),   i = 0 .. k-1

behaves, for filter purposes, like k independent hashes:

fn indices(&self, item: &str) -> Vec<usize> {
    let h1 = fnv1a_64(item.as_bytes());
    let h2 = djb2_64(item.as_bytes()) | 1; // force odd so it never collapses
    (0..self.k)
        .map(|i| (h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.m as u64) as usize)
        .collect()
}

Two cheap hashes (FNV-1a and DJB2), k indices. h2 is forced odd so the step never degenerates to stamping the same bit over and over.

Hinge 3: no false negatives — guaranteed

Insertion only ever sets bits; it never clears them. So an inserted item's bits stay set and contains can't miss it. That's a property, not a hope, so it's a test:

#[test]
fn no_false_negatives() {
    let mut bf = BloomFilter::with_capacity(1000, 0.01);
    let words = ["apple", "banana", "δ-encoder", "🍎", ""];
    for w in &words { bf.add(w); }
    for w in &words { assert!(bf.contains(w)); } // always true
}

The false-positive rate, by contrast, is checked empirically: fill to design capacity, probe 20,000 absent keys, confirm the measured rate lands near the 1% target:

#[test]
fn measured_fp_rate_near_target() {
    let cap = 2000u64;
    let mut bf = BloomFilter::with_capacity(cap, 0.01);
    for i in 0..cap { bf.add(&format!("present-{}", i)); }
    let (mut fp, trials) = (0, 20_000);
    for i in 0..trials {
        if bf.contains(&format!("absent-query-{}", i)) { fp += 1; }
    }
    assert!(fp as f64 / trials as f64 < 0.03); // ~1%
}

The closed form is p ≈ (1 - e^(-k·n/m))^k; the info subcommand reports that estimate.

Counting distinct items honestly

The false-positive estimate depends on n — the number of distinct items inserted. Duplicate inserts would inflate it. So add returns whether the item was new (at least one bit was unset) and only then bumps n:

assert!(bf.add("x"));   // new
assert!(!bf.add("x"));  // duplicate → n unchanged

That's the added 1 new (2 seen) line: adding mango pineapple reports "1 new / 2 seen" because mango was already in words.txt.

Persistence

State is a hand-rolled compact binary blob — "BLM1" | m | k | n | bits — no serde:

pub fn serialize(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(24 + self.bits.len());
    out.extend_from_slice(b"BLM1");
    out.extend_from_slice(&(self.m as u64).to_le_bytes());
    out.extend_from_slice(&self.k.to_le_bytes());
    out.extend_from_slice(&self.n.to_le_bytes());
    out.extend_from_slice(&self.bits);
    out
}

deserialize validates the magic and bit-array length, rejecting corrupt input.

Architecture

src/bloom.rs — filter: sizing, double hashing, fp estimate, (de)serialize (12 tests)
src/main.rs  — clap CLI: new / add / check / info, file or stdin

A single clap dependency; bit ops, hashing and serialization are all hand-rolled.

Try it

cargo run -- new --capacity 1000 --fp 0.01
cargo run -- add --from examples/words.txt
cargo run -- check apple dragonfruit
cargo run -- info

Or Docker (static Alpine binary, non-root):

docker build -t bloom .
docker run --rm -v "$PWD":/data bloom add apple banana
docker run --rm -v "$PWD":/data bloom check apple

Takeaways

A Bloom filter says "definitely absent" with certainty, "probably present" with a hedge. No false negatives; tunable false positives.
(m, k) aren't guessed — they come from n and the target rate p.
You don't need k independent hashes — two suffice via h1 + i·h2 (keep h2 odd).
No false negatives is a guarantee; the false-positive rate is measured in a test.
Count distinct items for n and the fp estimate stays honest.
One dependency (clap); persistence is a hand-rolled binary format.

This is OSS portfolio entry #274 from SEN LLC. https://sen.ltd/portfolio/

DEV Community