A Bloom filter as a Rust CLI. It answers "is X in the set?" with one of two replies: "definitely absent" or "probably present." The hinges: (1) the bit count
mand hash countkaren't guesses — they're determined by your item countnand target false-positive ratep; (2) you don't needkindependent hashes — two are enough, via double hashing (Kirsch–Mitzenmacher); (3) there are no false negatives, and that's a guarantee, not a hope. First Rust entry since #272 dijkstra.
📦 GitHub: https://github.com/sen-ltd/bloom-filter-cli
What a Bloom filter is
It answers "is X in the set?" with exactly two possible replies:
- "definitely absent" — always correct. There are never false negatives.
- "probably present" — wrong some of the time (a false positive), at a rate you choose.
In return you get a tiny, fixed-size bit array, O(k) lookups, and no stored keys at all. That asymmetry — a confident no, a hedged yes — is the entire design. It's what you want when an occasional wrong "yes" is acceptable in exchange for a fraction of the memory: "have I seen this URL?", "is this in the spam set?", a cache admission filter.
Here's the CLI, real output:
$ bloom new --capacity 1000 --fp 0.01
created .bloom capacity=1000 fp-target=0.01 m=9586 bits (2 KiB) k=7
$ bloom add --from words.txt
added 18 new (18 seen) total=18 m=9586 bits k=7 fp≈0.000%
$ bloom check apple pineapple dragonfruit
apple probably present
pineapple probably present
dragonfruit definitely absent
Hinge 1: sizing from a target false-positive rate
You don't pick m (bits) and k (hashes) by feel. Given item count n and target false-positive rate p, both fall out of the math:
m = ceil( -(n · ln p) / (ln 2)^2 )
k = round( (m / n) · ln 2 )
pub fn optimal_params(capacity: u64, fp: f64) -> (usize, u32) {
let n = capacity.max(1) as f64;
let ln2 = std::f64::consts::LN_2;
let m = (-(n * fp.ln()) / (ln2 * ln2)).ceil(); // bits
let k = ((m / n) * ln2).round(); // hashes
(m as usize, k as u32)
}
For n = 1000, p = 0.01 that's m = 9586 bits (~1.2 KB) and k = 7. Pin the exact numbers so a refactor can't silently drift the formula:
#[test]
fn optimal_params_match_formula() {
let (m, k) = BloomFilter::optimal_params(1000, 0.01);
assert_eq!(m, 9586); // ceil(-(1000·ln 0.01)/(ln 2)^2)
assert_eq!(k, 7); // round((m/n)·ln 2)
}
Intuitively, a tighter p needs more bits — also tested (smaller_fp_needs_more_bits).
Hinge 2: k hashes from two (Kirsch–Mitzenmacher)
Each item needs k bit positions, i.e. k independent hashes. Computing seven independent hashes per item would be wasteful — and isn't necessary. Kirsch & Mitzenmacher showed two base hashes suffice:
g_i(x) = h1(x) + i · h2(x) (mod m), i = 0 .. k-1
behaves, for filter purposes, like k independent hashes:
fn indices(&self, item: &str) -> Vec<usize> {
let h1 = fnv1a_64(item.as_bytes());
let h2 = djb2_64(item.as_bytes()) | 1; // force odd so it never collapses
(0..self.k)
.map(|i| (h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.m as u64) as usize)
.collect()
}
Two cheap hashes (FNV-1a and DJB2), k indices. h2 is forced odd so the step never degenerates to stamping the same bit over and over.
Hinge 3: no false negatives — guaranteed
Insertion only ever sets bits; it never clears them. So an inserted item's bits stay set and contains can't miss it. That's a property, not a hope, so it's a test:
#[test]
fn no_false_negatives() {
let mut bf = BloomFilter::with_capacity(1000, 0.01);
let words = ["apple", "banana", "δ-encoder", "🍎", ""];
for w in &words { bf.add(w); }
for w in &words { assert!(bf.contains(w)); } // always true
}
The false-positive rate, by contrast, is checked empirically: fill to design capacity, probe 20,000 absent keys, confirm the measured rate lands near the 1% target:
#[test]
fn measured_fp_rate_near_target() {
let cap = 2000u64;
let mut bf = BloomFilter::with_capacity(cap, 0.01);
for i in 0..cap { bf.add(&format!("present-{}", i)); }
let (mut fp, trials) = (0, 20_000);
for i in 0..trials {
if bf.contains(&format!("absent-query-{}", i)) { fp += 1; }
}
assert!(fp as f64 / trials as f64 < 0.03); // ~1%
}
The closed form is p ≈ (1 - e^(-k·n/m))^k; the info subcommand reports that estimate.
Counting distinct items honestly
The false-positive estimate depends on n — the number of distinct items inserted. Duplicate inserts would inflate it. So add returns whether the item was new (at least one bit was unset) and only then bumps n:
assert!(bf.add("x")); // new
assert!(!bf.add("x")); // duplicate → n unchanged
That's the added 1 new (2 seen) line: adding mango pineapple reports "1 new / 2 seen" because mango was already in words.txt.
Persistence
State is a hand-rolled compact binary blob — "BLM1" | m | k | n | bits — no serde:
pub fn serialize(&self) -> Vec<u8> {
let mut out = Vec::with_capacity(24 + self.bits.len());
out.extend_from_slice(b"BLM1");
out.extend_from_slice(&(self.m as u64).to_le_bytes());
out.extend_from_slice(&self.k.to_le_bytes());
out.extend_from_slice(&self.n.to_le_bytes());
out.extend_from_slice(&self.bits);
out
}
deserialize validates the magic and bit-array length, rejecting corrupt input.
Architecture
src/bloom.rs — filter: sizing, double hashing, fp estimate, (de)serialize (12 tests)
src/main.rs — clap CLI: new / add / check / info, file or stdin
A single clap dependency; bit ops, hashing and serialization are all hand-rolled.
Try it
cargo run -- new --capacity 1000 --fp 0.01
cargo run -- add --from examples/words.txt
cargo run -- check apple dragonfruit
cargo run -- info
Or Docker (static Alpine binary, non-root):
docker build -t bloom .
docker run --rm -v "$PWD":/data bloom add apple banana
docker run --rm -v "$PWD":/data bloom check apple
Takeaways
- A Bloom filter says "definitely absent" with certainty, "probably present" with a hedge. No false negatives; tunable false positives.
-
(m, k)aren't guessed — they come fromnand the target ratep. - You don't need
kindependent hashes — two suffice viah1 + i·h2(keeph2odd). - No false negatives is a guarantee; the false-positive rate is measured in a test.
- Count distinct items for
nand the fp estimate stays honest. - One dependency (clap); persistence is a hand-rolled binary format.
This is OSS portfolio entry #274 from SEN LLC. https://sen.ltd/portfolio/

Top comments (0)