You reach for a HashSet when you need to ask "have I seen this before?" It works, it is exact, and it is fine until the set gets large enough that holding every element in memory becomes the bottleneck. A bloom filter is the answer to a narrower question: "is it safe to skip the expensive lookup?" It answers that in a fraction of the space, and the price is that it occasionally answers wrong in exactly one direction.
That one-directional error is the whole point, and it is also where people get burned. So let's be precise about what you are buying.
The trade, in actual numbers
A bloom filter is a bit array plus a handful of hash functions. To add an element, you hash it k times, map each hash to a position in an m-bit array, and set those bits to 1. To test membership, you hash the same way and check whether all k bits are already set. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element is probably in the set.
That asymmetry is the contract:
- No false negatives. If the filter says "not present," it is telling the truth. Always.
- False positives are possible. If the filter says "present," it might be lying, because some other elements happened to set the same bits.
The false-positive rate is tunable, and the math is friendlier than you'd guess. For a target rate p, you need roughly -ln(p) / (ln 2)^2 bits per element. Run the numbers:
- A 1% false-positive rate costs about 9.6 bits per element — call it 1.2 bytes.
- A 0.1% rate costs about 14.4 bits per element — under 1.8 bytes.
Compare that to storing the elements themselves. Even if you only kept a 64-bit hash of each item in a HashSet, that's 8 bytes per element before container overhead, and real hash tables carry pointers, load-factor slack, and bucket metadata that push the true cost well past that. A bloom filter at 1% gives you membership testing for roughly one-seventh the bytes of bare hashes, and it does not grow with the size of the elements — a 2 KB URL and a 4-byte integer both cost the same ~9.6 bits.
The number of hash functions also has an optimum:
k = (m/n) · ln 2. For a 1% target that's about 7 hashes; for 0.1% it's about 10. Too few hashes and collisions spike; too many and you saturate the bit array faster. Most libraries computekfor you from the size and target rate, so you rarely set it by hand — but it's worth knowing the knob exists.
Where it actually pays off
A bloom filter earns its place when a "no" lets you skip something genuinely expensive — a disk seek, a network round-trip, a cold cache read. The filter sits in front of the slow path and absorbs the majority of negative lookups in memory.
The canonical example lives inside the databases you already use. LSM-tree storage engines — RocksDB, Cassandra, LevelDB — store data across many on-disk sorted files (SSTables). Without help, a lookup for a missing key would have to check several files on disk. Each SSTable carries a bloom filter, so the engine asks the filter first: "could this key be in this file?" A "no" skips the disk read entirely. With a 1% false-positive rate, you avoid roughly 99% of the pointless disk seeks for keys that aren't there, at a memory cost small enough to keep the filters resident.
The pattern generalizes anywhere the negative case dominates and is cheap to short-circuit:
- Caching layers checking "have we definitely never cached this?" before hitting the origin.
- Crawlers and queues deduping URLs or job IDs without keeping the full visited set in RAM.
- Write paths that want to skip a uniqueness check against a remote store when the key is obviously new.
The common thread: the filter only has to be right about "no." A false positive just means you fall through to the exact check you would have done anyway. You pay a little wasted work, not a wrong answer.
That last sentence is the trap. A bloom filter is safe only when a false positive degrades gracefully — it sends you to the authoritative source, which corrects the lie. If you treat a "probably present" as final truth — blocking a username, charging an account, suppressing an alert — you will eventually act on a collision and have no way to know it happened. Always keep an exact backstop behind the filter when correctness matters.
The lies, and the ones people forget
The false positive is the famous failure mode, but two others bite harder because they're quieter.
You can't delete from a standard bloom filter. Clearing the bits for one element would also clear bits shared with others, reintroducing false negatives and breaking the one guarantee you cared about. If your set shrinks over time, a plain bloom filter is the wrong tool. A counting bloom filter (each slot is a small counter instead of a single bit) supports deletion, at several times the memory. If churn is heavy, you often just rebuild the filter periodically instead.
The false-positive rate is a promise about a specific size. You size the filter for n elements. Push past n and the bit array saturates — more bits flip to 1, and the actual false-positive rate climbs well above your target. A filter built for a million entries and fed ten million isn't "a bit worse," it's approaching uselessly noisy. If your data volume is unknown or unbounded, look at a scalable bloom filter that chains progressively larger filters as it fills.
Before reaching for a bloom filter, sanity-check the negative-lookup ratio. If most of your queries are for elements that are present, the filter says "probably yes" almost every time, you fall through to the real lookup almost every time, and you've added a hashing step that saves nothing. Bloom filters win on workloads dominated by misses, not hits.
None of this requires deep math to use in practice — the libraries handle the bit twiddling. What it requires is matching the structure to a workload where "definitely no" is the valuable answer and a rare false "yes" is harmless.
The mental model to keep: a bloom filter doesn't store your data, it stores a compressed, lossy hint about your data. The compression is the gift. The lossiness is the bill. As long as your code reads the hint as "safe to skip" rather than "known fact," the bill stays small.
Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.
Top comments (0)