Covers: Probabilistic Data Structures, False Positives vs False Negatives, Counting Bloom Filters, Tuning, Real Implementations
A Question With a Surprising Answer
How does Chrome check, on every single page load, whether the URL you're visiting is on a list of millions of known malicious websites — without sending your browsing history to Google for every page you visit, and without downloading a multi-gigabyte database to your phone?
The answer is a data structure so simple it can be explained in one sentence, yet so powerful it underpins systems at Google, Cassandra, Akamai, and nearly every large-scale database: the Bloom Filter.
The Core Idea: A Probabilistic "Maybe"
A normal data structure (a hash set, a database) answers "is X in this collection?" with a definitive yes or no.
A Bloom Filter answers with:
- "Definitely NOT in the set" (100% certain)
- "Possibly in the set" (might be a false positive)
Bloom Filter says "NOT in set" → guaranteed correct, 100% of the time
Bloom Filter says "MAYBE in set" → could be wrong (false positive)
Bloom Filter NEVER says:
"Definitely IS in set" with the implication of certainty
"NOT in set" when it actually IS in set (no false negatives, ever)
This asymmetry — no false negatives, but possible false positives — is the entire trick, and it's precisely calibrated to be useful.
How It Works: Bit Arrays and Hash Functions
A Bloom Filter is a bit array of size m (initially all zeros) plus k independent hash functions.
Adding an Item
Bit array (m=16): [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Add "malware.com":
hash1("malware.com") % 16 = 2 → set bit 2 to 1
hash2("malware.com") % 16 = 7 → set bit 7 to 1
hash3("malware.com") % 16 = 11 → set bit 11 to 1
Bit array: [0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0]
↑ ↑ ↑
bit 2 bit 7 bit 11
Checking an Item
Check "malware.com" (already added):
hash1("malware.com") % 16 = 2 → bit 2 is 1 ✓
hash2("malware.com") % 16 = 7 → bit 7 is 1 ✓
hash3("malware.com") % 16 = 11 → bit 11 is 1 ✓
ALL bits set → "MAYBE in set" (correctly — it IS in the set)
Check "safe-site.com" (never added):
hash1("safe-site.com") % 16 = 3 → bit 3 is 0 ✗
ANY bit is 0 → "DEFINITELY NOT in set" (correct — it's not!)
Check "another-site.com" (never added, but...):
hash1("another-site.com") % 16 = 2 → bit 2 is 1 (set by malware.com)
hash2("another-site.com") % 16 = 11 → bit 11 is 1 (set by malware.com)
hash3("another-site.com") % 16 = 7 → bit 7 is 1 (set by malware.com)
ALL bits happen to be 1 → "MAYBE in set"
→ FALSE POSITIVE! "another-site.com" was never added, but its hash
positions happen to overlap with bits set by "malware.com"
Why no false negatives are possible: If an item was added, ALL its bits were set to 1 by definition. Checking those same bits later will always find them set to 1 (bits are never unset in a basic Bloom Filter). So a real member can never be reported as "not in set."
Why false positives are possible: Multiple items' hash functions can map to overlapping bit positions. An item that was never added might, by coincidence, have all its bit positions already set to 1 by other items.
Why Tiny Memory Footprint Is the Whole Point
Here's the number that makes Bloom Filters remarkable:
To store 1 million URLs with a 1% false positive rate:
A hash set storing actual URL strings: ~50-100 MB (depends on URL length)
A Bloom Filter: ~1.2 MB
To store 1 BILLION URLs with a 1% false positive rate:
Hash set: ~50-100 GB
Bloom Filter: ~1.2 GB
The Bloom Filter doesn't store the actual data — just bits representing "something hashed here." This is why Chrome can ship a Bloom Filter representing Google's entire Safe Browsing malicious URL list as a small download, updated periodically, checked entirely locally — no network call needed for the common case (the URL is safe).
Chrome's flow:
1. User visits a URL
2. Check LOCAL Bloom Filter: "is this URL possibly malicious?"
3a. Bloom Filter says "definitely not" → proceed immediately,
NO network call (the vast majority of URLs)
3b. Bloom Filter says "maybe" → NOW make a network call to Google's
full database to confirm (rare — only for the small % of false
positives plus actual malicious URLs)
The Bloom Filter acts as a fast local pre-filter — eliminating ~99%+ of cases from ever needing a network round trip, while never missing an actual malicious URL.
Tuning the False Positive Rate
Two parameters control accuracy: the size of the bit array (m) and the number of hash functions (k).
The formula (intuition, not memorization-required for interviews):
False positive rate ≈ (1 - e^(-kn/m))^k
Where:
n = number of items inserted
m = size of bit array
k = number of hash functions
The practical trade-off:
More bits per item (larger m relative to n):
→ Lower false positive rate
→ More memory used
Optimal k (number of hash functions):
k ≈ (m/n) × ln(2)
→ Too few hash functions: insufficient bit coverage, higher false positives
→ Too many hash functions: bit array fills up too fast, higher false positives
→ There's a sweet spot
Practical numbers (commonly cited):
~10 bits per item, 7 hash functions → ~1% false positive rate
~15 bits per item, 10 hash functions → ~0.1% false positive rate
The interview-ready statement: "Bloom filters trade memory for accuracy — you choose your acceptable false positive rate upfront, and that determines how many bits per item you need. Doubling the bits roughly squares the accuracy improvement (10x fewer false positives for roughly 50% more bits, in the practical range)."
Counting Bloom Filters: Adding Deletion
A basic Bloom Filter has a problem: you can't delete an item. Bits are shared between items (that's the whole mechanism) — unsetting a bit to "remove" one item might break the membership check for other items whose hashes also touched that bit.
Counting Bloom Filters solve this by using small counters instead of single bits:
Basic Bloom Filter: [0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0]
Counting Bloom Filter: [0,0,2,0,0,0,0,1,0,0,0,3,0,0,0,0]
↑ ↑
2 items hash here 3 items hash here
Adding an item: increment relevant counters
Removing an item: decrement relevant counters
(if a counter reaches 0, that "bit" is effectively unset)
Checking membership: same as before — are all relevant
counters > 0?
The cost: Counting Bloom Filters use more memory (typically 4 bits per counter instead of 1 bit) — but still dramatically less than storing actual items, while supporting deletion.
Real Implementation: Cassandra's SSTables
This is one of the most elegant uses of Bloom Filters in production databases, and directly relates to the LSM Tree structure we'll cover in Topic 25.
Cassandra stores data in immutable files called SSTables.
A single Cassandra node might have HUNDREDS of SSTable files on disk.
Without Bloom Filters:
Looking up "user_12345" requires checking EVERY SSTable file
on disk — even if "user_12345" only exists in ONE of them.
Each check = a disk read = slow (remember Day 1's latency numbers:
SSD read ≈ 150μs, much slower than memory).
With Bloom Filters:
Each SSTable has an associated Bloom Filter (kept in memory).
Before reading an SSTable from disk, Cassandra checks its
Bloom Filter: "might 'user_12345' be in this SSTable?"
- "Definitely not" (most SSTables) → skip this file entirely,
NO disk read
- "Maybe" (the 1-2 SSTables that actually contain the key, plus
occasional false positives) → read this file from disk
The impact: A query that might have required checking 100 SSTable files on disk now typically checks 1-2 — because 98 of them are eliminated by an in-memory Bloom Filter check that costs microseconds. This is one of the primary reasons Cassandra achieves its famous write and read performance despite using an architecture (LSM trees) that would otherwise require many disk reads per query.
Real Implementation: Google Bigtable / CDN "One-Hit Wonder" Filtering
Akamai's CDN faces a specific problem: a huge fraction of content requested from origin servers is requested exactly once — a user clicks a link nobody else will click, requesting a resource that will never be requested again ("one-hit wonders").
Caching every requested item — including one-hit wonders — wastes cache space on content that will never produce a cache hit.
Akamai's approach using a Bloom Filter:
1. First request for URL X → check Bloom Filter
2. "Definitely not seen before" → DON'T cache it (probably a one-hit
wonder), but ADD it to the Bloom Filter
3. Second request for the SAME URL X → check Bloom Filter
4. "Maybe seen before" (it's now in the filter from step 2)
→ THIS time, cache it — it's been requested twice,
likely to be requested again
This is a beautifully simple use of the false-positive-tolerant nature of Bloom Filters: occasionally caching a one-hit wonder due to a false positive is a minor inefficiency, but the overall cache hit rate improves significantly by not wasting cache slots on content unlikely to be re-requested.
Google Bigtable uses Bloom Filters similarly to Cassandra — to avoid unnecessary disk reads when checking if a row exists in a particular SSTable.
Bloom Filter vs Hash Set: When to Use Which
| Hash Set | Bloom Filter | |
|---|---|---|
| Memory | O(n) — stores actual items | O(n) but tiny constant — just bits |
| False positives | Never | Possible (tunable rate) |
| False negatives | Never | Never |
| Deletion | Yes | No (unless Counting Bloom Filter) |
| Retrieve actual items | Yes | No — only membership testing |
Use a Bloom Filter when:
- You need to test "might X be in this huge set?"
- Memory is constrained relative to the size of the set
- Occasional false positives are acceptable (because they trigger a more expensive but authoritative check — disk read, network call, etc.)
- You DON'T need to retrieve or enumerate the actual items — only test membership
Use a Hash Set when:
- You need exact membership testing (no false positives tolerable)
- You need to retrieve or iterate the actual stored items
- Memory isn't a constraint relative to your dataset size
Interview Scenario: "Reduce Database Load with Bloom Filters"
Q: Your application frequently checks "does this username exist?" — and most checks are for usernames that DON'T exist (new signups checking availability). How would you optimize this?
"Most of these checks are for usernames that don't exist — every database query for a non-existent username is essentially wasted I/O. I'd maintain a Bloom Filter containing all existing usernames, kept in memory and updated whenever a new user signs up.
The check flow becomes: first check the Bloom Filter. If it says 'definitely not in set,' the username is available — no database query needed at all, this is instant. If it says 'maybe in set,' THEN query the database to confirm — this covers both real existing usernames and the rare false positives.
Given that the vast majority of availability checks during signup are for usernames that genuinely don't exist, this eliminates the majority of database reads for this endpoint — turning a database query into an in-memory bit-check for most requests.
One consideration: the Bloom Filter needs to be kept in sync as users sign up — I'd update it synchronously on user creation (cheap — just set a few bits) so it never has false negatives for newly created users."
Key Takeaways
- A Bloom Filter answers "definitely not in set" (always correct) or "maybe in set" (possible false positive) — never a false negative.
- It's a bit array + multiple hash functions. Adding sets bits; checking verifies all relevant bits are set.
- Memory usage is a tiny constant per item — orders of magnitude smaller than storing actual data, regardless of how large the items themselves are.
- Tuning: more bits per item and the right number of hash functions reduces the false positive rate — it's a tunable trade-off, not fixed.
- Counting Bloom Filters use small counters instead of bits, enabling deletion at the cost of more memory.
- Cassandra/Bigtable: Bloom Filters per SSTable eliminate most unnecessary disk reads — checking 1-2 files instead of 100.
- Chrome Safe Browsing: a local Bloom Filter eliminates network calls for the vast majority of (safe) URLs.
- Akamai CDN: Bloom Filters identify "one-hit wonders" to avoid wasting cache space on content unlikely to be requested again.
What's Next
Topic 24 covers Geospatial Indexing — how Uber finds nearby drivers among millions of moving vehicles in milliseconds, comparing Quadtrees, Google's S2, and Uber's own H3 hexagonal grid system.
Topic 23 of the System Design Mastery series.
Tags: system-design bloom-filters data-structures cassandra backend algorithms interview-prep
Top comments (0)