Prado7

Posted on Jan 21

Bloom Filters : Strictly NO, May be YES

#programming #bloomfilters #systemdesign

Ever wondered how large-scale systems check username availability?
Do they iterate over every single existing username and respond with a YES or NO?
Well,that would take an eternity. So how do systems at scale handle this efficiently?

👉 Enter Bloom Filters.

What is a Bloom Filter?

The term Bloom Filter may sound fancy, but at its core, it's surprisingly simple.
A Bloom Filter is:

A probabilistic data structure
Backed by a bit array (0's and 1's)
Designed to answer one question

"Have we seen this element before?"

What does a Bloom Filter guarantee?

A Bloom Filter can return only two types of answers:

❌ Definitely NO
⚠️ Maybe YES

It never produces false negatives, but false positives are possible.

False negative:
Bloom Filter ==> not present
Database ==> present (❌ never happens)

False positive:
Bloom Filter ==> present
Database ==> not present (⚠️ possible)

How does a Bloom Filter work?

Let's walk through the flow step by step.
Step 1 : User enters a word

cat

Step 2 : Initialize the Bloom Filter
To keep things simple, let's assume

Bit array size = 10
Number of hash functions = 2 (hash1, hash2)

Initial bit array

0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0

Step 3 : Hash the word

Next we pass the word through the hash functions
*Hash values are assumed for illustration

hash1("cat") ==> 4
hash2("cat") ==> 5

Hash output is reduced using modulo to fit the bit array.

Step 4 : Set the bits

We set the corresponding bits positions:

0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0

This indicates that we have seen this word.
The actual word is then stored in the database.

The Bloom Filter does not store the word itself.

How does a Bloom Filter check if a word exists?

When a user wants to check if a word exists, we repeat the same hashing process

Case 1 : Word does not exist

Let's check the word "dog":

hash1("dog") ==> 4
hash2("dog") ==> 7

Now we check these bits/indexes in our bit array


0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0
        ↑     ↑

Since at least one bit is 0, we can say with certainty:
❌ This word does not exist. (Strict NO)

Case 2 : Word may exist
Now let's check "cat" again:

hash1("cat") ==> 4
hash2("cat") ==> 5

0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0
        ↑ ↑

Here the corresponding bits/indexes are set (All bits are set)
⚠️ This word may exist

Why "May be YES"? (False Positives)

Bloom Filters suffer from hash collisions.
Multiple words can hash to the same bit positions.

Example:

word : "bat"
hash1("bat") ==> 4
hash2("bat") ==> 5

These bits were already set by "cat".
Even though the word "bat" was never inserted, the Bloom Filter reports:
⚠️ May be present
This is the case of false positive.

When do false positives increase?

False positives are affected by

Small bit array size
Large number of inserted elements
Too many hash functions

Summary

In short:

If any bit is not set → ❌ Definitely NO
If all bits are set → ⚠️ Maybe YES

TL;DR

Bloom Filters are an optimization technique
They guarantee no false negatives
They can produce false positives

In practice, Bloom filters are used as a fast pre-check before hitting a database.
Bloom Filters don't tell you "YES or NO".
They tell you "NO or MAYBE".