Ever wondered how large-scale systems check username availability?
Do they iterate over every single existing username and respond with a YES or NO?
Well,that would take an eternity. So how do systems at scale handle this efficiently?
👉 Enter Bloom Filters.
What is a Bloom Filter?
The term Bloom Filter may sound fancy, but at its core, it's surprisingly simple.
A Bloom Filter is:
- A probabilistic data structure
- Backed by a bit array (0's and 1's)
- Designed to answer one question
"Have we seen this element before?"
What does a Bloom Filter guarantee?
A Bloom Filter can return only two types of answers:
- ❌ Definitely NO
- ⚠️ Maybe YES
It never produces false negatives, but false positives are possible.
False negative:
Bloom Filter ==> not present
Database ==> present (❌ never happens)
False positive:
Bloom Filter ==> present
Database ==> not present (⚠️ possible)
How does a Bloom Filter work?
Let's walk through the flow step by step.
Step 1 : User enters a word
cat
Step 2 : Initialize the Bloom Filter
To keep things simple, let's assume
- Bit array size = 10
- Number of hash functions = 2 (hash1, hash2)
Initial bit array
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
Step 3 : Hash the word
Next we pass the word through the hash functions
*Hash values are assumed for illustration
hash1("cat") ==> 4
hash2("cat") ==> 5
Hash output is reduced using modulo to fit the bit array.
Step 4 : Set the bits
We set the corresponding bits positions:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0
This indicates that we have seen this word.
The actual word is then stored in the database.
The Bloom Filter does not store the word itself.
How does a Bloom Filter check if a word exists?
When a user wants to check if a word exists, we repeat the same hashing process
Case 1 : Word does not exist
Let's check the word "dog":
hash1("dog") ==> 4
hash2("dog") ==> 7
Now we check these bits/indexes in our bit array
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0
↑ ↑
Since at least one bit is 0, we can say with certainty:
❌ This word does not exist. (Strict NO)
Case 2 : Word may exist
Now let's check "cat" again:
hash1("cat") ==> 4
hash2("cat") ==> 5
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0
↑ ↑
Here the corresponding bits/indexes are set (All bits are set)
⚠️ This word may exist
Why "May be YES"? (False Positives)
Bloom Filters suffer from hash collisions.
Multiple words can hash to the same bit positions.
Example:
word : "bat"
hash1("bat") ==> 4
hash2("bat") ==> 5
These bits were already set by "cat".
Even though the word "bat" was never inserted, the Bloom Filter reports:
⚠️ May be present
This is the case of false positive.
When do false positives increase?
False positives are affected by
- Small bit array size
- Large number of inserted elements
- Too many hash functions
Summary
In short:
- If any bit is not set → ❌ Definitely NO
- If all bits are set → ⚠️ Maybe YES
TL;DR
- Bloom Filters are an optimization technique
- They guarantee no false negatives
- They can produce false positives
In practice, Bloom filters are used as a fast pre-check before hitting a database.
Bloom Filters don't tell you "YES or NO".
They tell you "NO or MAYBE".
Top comments (0)