DEV Community

Cover image for Bloom Filters: The Data Structure That's Wrong on Purpose (and Why That's Genius)
Hassan Zahar Rifat
Hassan Zahar Rifat

Posted on

Bloom Filters: The Data Structure That's Wrong on Purpose (and Why That's Genius)

How databases, browsers and CDNs answer "have I seen this before?" using a fraction of the memory — by being wrong in exactly one safe direction.

Picture a sign-up form. You type a username, and before you even finish, the field flashes red: "Already taken." Instant. No spinner.

Somewhere behind that, a system just answered the question "have we seen this string before?" against tens of millions of existing usernames — and it did it without querying the database, without loading a list into memory, and in roughly the time it takes a single CPU instruction to run.

It pulled that off with a data structure that has a strange, almost offensive property: it's allowed to be wrong.

Not randomly wrong. Wrong in exactly one direction, in a way you can mathematically bound, in exchange for using a fraction of the memory a normal set would. That trade is the whole trick, and once you see it you'll start noticing Bloom filters everywhere — in your database, your CDN, your browser, your favorite key-value store.

Let's build one from scratch.


First, why the obvious solutions fall apart

The question is dead simple: is element x in set S?

You already know how to answer this. A hash set. O(1) lookups, exact answers, done. So why do we need anything else?

Memory.

Say you're tracking 1 billion URLs a web crawler has already visited, so you don't crawl them twice. An average URL is ~70 bytes. Even if you only stored the raw strings:

1,000,000,000 URLs × 70 bytes ≈ 70 GB
Enter fullscreen mode Exit fullscreen mode

And that's before the overhead of the hash set itself — pointers, load-factor slack, bucket arrays. Realistically you're looking at well north of 100 GB of RAM to answer a yes/no question. You can shard it across machines, sure, but now every "have I seen this?" check is a network hop.

Here's the reframe that unlocks everything:

You don't actually need to store the URLs. You only need to recognize them.

A bouncer doesn't keep a photocopy of every ID he's ever checked. He keeps a much smaller mental fingerprint. A Bloom filter is that bouncer.


The core idea: stop storing, start fingerprinting

A Bloom filter is two things:

  1. A bit array of m bits, all starting at 0.
  2. A set of k independent hash functions, each of which maps any input to a position in that bit array. That's it. No keys, no values, no stored elements. Just a row of bits and some hash functions.

Two operations:

  • add(x) — hash x with all k functions, get k positions, set those bits to 1.
  • contains(x) — hash x with the same k functions, check those k positions. If all of them are 1, return "probably yes." If any is 0, return "definitely no." Read that last line again, because it's the entire personality of the data structure:

A 0 is proof of absence. A 1 is only a hint of presence.


add(x): one element is run through k hash functions, each setting one bit to 1

A worked example you can follow by hand

Let's use a tiny filter: m = 16 bits, k = 3 hash functions. Empty, it looks like this:

index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bits:   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Enter fullscreen mode Exit fullscreen mode

Add "cat". Our three hashes spit out positions 3, 9, 13. Flip them on:

index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bits:   0  0  0 [1] 0  0  0  0  0 [1] 0  0  0 [1] 0  0
Enter fullscreen mode Exit fullscreen mode

Add "dog". Hashes give 1, 9, 14. Note 9 is already 1 — that's fine, it just stays 1:

index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bits:   0 [1] 0 [1] 0  0  0  0  0 [1] 0  0  0 [1][1] 0
Enter fullscreen mode Exit fullscreen mode

Now let's query.

contains("cat") → check positions 3, 9, 13 → all 1"probably yes." Correct, we added it.

contains("fish") → hashes to 2, 9, 13 → position 2 is 0"definitely no." Correct — and notice we got a guaranteed-correct answer by checking a single zero bit. That's the cheap, certain half of the filter.

contains("bird") → hashes to 1, 13, 14 → position 1 is 1 (set by dog), 13 is 1 (set by cat), 14 is 1 (set by dog) → all three are 1"probably yes."

But we never added "bird".

That's a false positive. No single element claimed those three bits — they got lit up by different elements, and "bird" happened to land on exactly that combination. The filter has no way to tell the difference between "one element set all these" and "three different elements each set one."

This is the price. And here's the beautiful part: it only goes one way.


Step-by-step trace: adding cat then dog, then querying bird produces a false positive from shared bits

Why you get false positives but never false negatives

A false negative would mean: you added x, but contains(x) says no.

For that to happen, at least one of x's k bits would have to be 0. But add(x) set all of them to 1, and bits in a Bloom filter never go back to zero (we'll revisit that caveat later). So once you've added something, every one of its bits is permanently lit. A query for it can only ever find all 1s.

Added elements always return "yes." Bits never un-set. Therefore: zero false negatives, ever.

This asymmetry is what makes Bloom filters useful rather than just clever. "Definitely no" is a hard guarantee. "Probably yes" is a hint you can verify with a slower, exact check only when needed. You build systems around that shape: use the filter as a fast gate, and only pay the expensive cost when it says "maybe."


The implementation

Here's a clean Python version. The only mildly clever bit is how we get k hash functions without writing k of them — more on that right after.

import math
import hashlib

class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Size the filter from how many items you expect and the FP rate you'll tolerate.
        self.m = self._optimal_size(expected_items, false_positive_rate)
        self.k = self._optimal_hash_count(self.m, expected_items)
        self.bits = bytearray((self.m + 7) // 8)  # m bits, packed into bytes

    def _set_bit(self, index: int) -> None:
        self.bits[index // 8] |= (1 << (index % 8))

    def _get_bit(self, index: int) -> bool:
        return bool(self.bits[index // 8] & (1 << (index % 8)))

    def _hashes(self, item: str):
        """
        Kirsch-Mitzenmacher trick: derive k hashes from just TWO base hashes.
        g_i(x) = (h1(x) + i * h2(x)) mod m.  Statistically as good as k independent hashes.
        """
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(data).digest()[:8], "big")
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for index in self._hashes(item):
            self._set_bit(index)

    def contains(self, item: str) -> bool:
        return all(self._get_bit(index) for index in self._hashes(item))

    @staticmethod
    def _optimal_size(n: int, p: float) -> int:
        return max(1, int(-(n * math.log(p)) / (math.log(2) ** 2)))

    @staticmethod
    def _optimal_hash_count(m: int, n: int) -> int:
        return max(1, int((m / n) * math.log(2)))
Enter fullscreen mode Exit fullscreen mode

Usage:

bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)

bf.add("[email protected]")

bf.contains("[email protected]")   # True  (definitely added)
bf.contains("[email protected]")    # almost certainly False; ~1% chance of a false "True"
Enter fullscreen mode Exit fullscreen mode

That double-hashing trick is worth stealing

Naively, k = 7 hash functions means you need 7 genuinely different, well-distributed hash functions. Annoying. The Kirsch–Mitzenmacher result says you don't: take two good base hashes h1 and h2, and generate the rest as h1 + i·h2. The false-positive behavior is indistinguishable from using k independent hashes, and you compute two hashes instead of seven. Production libraries (like the ones in Cassandra and Guava) do exactly this.


The math you actually need (no PhD required)

Four variables run the whole show:

Symbol Meaning
n number of items you'll insert
m number of bits in the array
k number of hash functions
p false-positive probability you're willing to accept

You pick n (how much you'll store) and p (how often you'll tolerate a wrong "maybe"). The other two fall out:

Optimal bit-array size:

m = -(n × ln p) / (ln 2)²
Enter fullscreen mode Exit fullscreen mode

Optimal number of hash functions:

k = (m / n) × ln 2
Enter fullscreen mode Exit fullscreen mode

Plug in real numbers. For 1 million items at a 1% false-positive rate:

  • m ≈ 9.6 million bits ≈ 1.2 MB
  • k ≈ 7 hash functions Sit with that. A hash set of those keys would be tens of megabytes minimum. The Bloom filter answers the same membership question in 1.2 MB. Want a 0.1% false-positive rate instead? It climbs to ~1.8 MB. Still tiny.

False-positive rate drops exponentially as bits-per-element grows; ~9.6 bits gives 1%, ~14.4 bits gives 0.1%

The intuition behind the knobs:

  • Too few hash functions → not enough bits set per element → easy collisions don't get distinguished. High FP rate.
  • Too many hash functions → the array fills with 1s too fast → everything starts looking present. Also high FP rate.
  • There's a sweet spot (that k = (m/n) ln 2 formula), and it lands right around where roughly half the bits are set when the filter is full. That's not a coincidence — a half-full array is the point of maximum information per bit. And the punchline that makes this practical: the memory cost per item depends only on p, not on how big each item is. A 70-byte URL and a 10 KB JSON blob cost the same number of bits in the filter, because you only ever store the hashes. That's why it scales where a hash set can't.

Where this actually runs in production

This isn't a whiteboard toy. It's load-bearing infrastructure in systems you use daily.

The front-gate pattern: a

Databases (Cassandra, HBase, RocksDB, LevelDB, Bigtable). These use LSM-tree storage, where data lives in many on-disk files (SSTables). To find a key, you might have to check several files — and disk reads are brutally slow compared to memory. So each SSTable gets a Bloom filter held in RAM. Before touching disk, the DB asks the filter "could this key be in this file?" A "definitely no" skips the disk read entirely. Given how lopsided the cost is (RAM lookup vs. disk seek), even a filter that's only sometimes able to say no is a massive win.

Browsers (Chrome's Safe Browsing). Chrome warns you before you visit a known-malicious site — without shipping you a list of every dangerous URL on the internet (which would be enormous and instantly stale). A compact local filter answers "is this URL probably bad?" If no → safe, proceed instantly. If maybe → then it does a real network check. The filter turns "phone home on every single navigation" into "phone home only on the rare maybe."

CDNs and caches. "Is this object worth caching?" A common trick: only cache something on its second request. A Bloom filter cheaply remembers "have I seen this URL once before?" without storing the URLs, filtering out the long tail of one-hit-wonder objects that would just pollute the cache.

Redis ships Bloom filters as a module (BF.ADD, BF.EXISTS) for exactly this class of problem — dedup, "have I shown this user this item," and so on.

Medium reportedly used one to track which articles you've already read, so its recommendations don't keep serving you the same posts. A false positive just means it occasionally skips one you hadn't seen — a totally acceptable error for a recommendation feed.

Notice the pattern in every case: the filter is a cheap front gate, and the expensive operation (disk, network, recompute) only happens on a "maybe." A false positive costs you a wasted check. It never costs you correctness.


The catch nobody mentions first: you can't delete

Want to remove an element? You'd flip its k bits back to 0. Except — those bits might be shared with other elements you didn't remove. Zero out "cat"'s bits and you might have just told the filter that "dog" is gone too, because they overlapped at position 9.

Clear those shared bits and you've created the one thing a Bloom filter promised would never happen: a false negative. The whole guarantee collapses.

So a plain Bloom filter is add-only. If you need deletion, you reach for a variant:

  • Counting Bloom Filter — replace each bit with a small counter (say 4 bits). add increments, remove decrements, contains checks for non-zero. Deletion works, at ~4× the memory.
  • Scalable Bloom Filter — don't know n in advance? This one chains progressively larger filters as it fills, holding your target FP rate without you having to size it up front.
  • Cuckoo Filter — supports deletion and often uses less space than a counting filter at low FP rates, by storing tiny fingerprints in a cuckoo-hash table. The modern go-to when you need removals. You don't need these on day one. But knowing they exist means you won't paint yourself into a corner when a requirement changes.

When to reach for one (and when not to)

Reach for a Bloom filter when:

  • The expensive thing is a disk read, network call, or recomputation — and you want to skip it on definite misses.
  • "Probably yes, let me double-check" is an acceptable answer.
  • You have way more data than RAM, and exact membership is a luxury you can't afford.
    Skip it when:

  • You need the actual stored values back, not just yes/no. (It stores nothing. There's nothing to retrieve.)

  • A false positive is dangerous rather than merely wasteful — e.g. "is this transaction already processed?" where a wrong "yes" means silently dropping a real payment.

- Your dataset is small enough that a plain hash set fits comfortably. Don't add probabilistic machinery to save a few megabytes you weren't short on.

The one-line takeaway

A Bloom filter trades a sliver of accuracy — in a single, controllable, never-dangerous direction — for an enormous cut in memory. It can't tell you what's in the set, and it'll occasionally claim something's there when it isn't. But when it says "definitely not," it is never, ever wrong.

In a world where the slow part is almost always disk or network, a structure that lets you confidently skip work is worth far more than one that's perfectly precise. Being wrong on purpose, in exactly the right way, turns out to be one of the most useful tricks in systems engineering.


If you build one, try logging your actual false-positive rate against the p you configured — watching the math hold up in practice is oddly satisfying. And if you want to really internalize it: implement contains first and watch it return True for things you never added. That moment of "wait, that's a feature?" is when it clicks.

Top comments (0)