DEV Community

Prado7
Prado7

Posted on

Bloom Filters : Strictly NO, May be YES

Ever wondered how large-scale systems check username availability?
Do they iterate over every single existing username and respond with a YES or NO?
Well,that would take an eternity. So how do systems at scale handle this efficiently?

👉 Enter Bloom Filters.

What is a Bloom Filter?

The term Bloom Filter may sound fancy, but at its core, it's surprisingly simple.
A Bloom Filter is:

  • A probabilistic data structure
  • Backed by a bit array (0's and 1's)
  • Designed to answer one question

"Have we seen this element before?"

What does a Bloom Filter guarantee?

A Bloom Filter can return only two types of answers:

  • ❌ Definitely NO
  • ⚠️ Maybe YES

It never produces false negatives, but false positives are possible.

False negative:
Bloom Filter ==> not present
Database ==> present (❌ never happens)

False positive:
Bloom Filter ==> present
Database ==> not present (⚠️ possible)

How does a Bloom Filter work?

Let's walk through the flow step by step.
Step 1 : User enters a word

cat
Enter fullscreen mode Exit fullscreen mode

Step 2 : Initialize the Bloom Filter
To keep things simple, let's assume

  • Bit array size = 10
  • Number of hash functions = 2 (hash1, hash2)

Initial bit array

0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
Enter fullscreen mode Exit fullscreen mode

Step 3 : Hash the word

Next we pass the word through the hash functions
*Hash values are assumed for illustration

hash1("cat") ==> 4
hash2("cat") ==> 5
Enter fullscreen mode Exit fullscreen mode

Hash output is reduced using modulo to fit the bit array.

Step 4 : Set the bits

We set the corresponding bits positions:

0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0 
Enter fullscreen mode Exit fullscreen mode

This indicates that we have seen this word.
The actual word is then stored in the database.

The Bloom Filter does not store the word itself.

How does a Bloom Filter check if a word exists?

When a user wants to check if a word exists, we repeat the same hashing process

Case 1 : Word does not exist

Let's check the word "dog":

hash1("dog") ==> 4
hash2("dog") ==> 7
Enter fullscreen mode Exit fullscreen mode

Now we check these bits/indexes in our bit array


0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0
        ↑     ↑
Enter fullscreen mode Exit fullscreen mode

Since at least one bit is 0, we can say with certainty:
This word does not exist. (Strict NO)

Case 2 : Word may exist
Now let's check "cat" again:

hash1("cat") ==> 4
hash2("cat") ==> 5
Enter fullscreen mode Exit fullscreen mode
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 1 0 0 0 0
        ↑ ↑
Enter fullscreen mode Exit fullscreen mode

Here the corresponding bits/indexes are set (All bits are set)
⚠️ This word may exist

Why "May be YES"? (False Positives)

Bloom Filters suffer from hash collisions.
Multiple words can hash to the same bit positions.

Example:

word : "bat"
hash1("bat") ==> 4
hash2("bat") ==> 5
Enter fullscreen mode Exit fullscreen mode

These bits were already set by "cat".
Even though the word "bat" was never inserted, the Bloom Filter reports:
⚠️ May be present
This is the case of false positive.

When do false positives increase?

False positives are affected by

  • Small bit array size
  • Large number of inserted elements
  • Too many hash functions

Summary

In short:

  • If any bit is not set → ❌ Definitely NO
  • If all bits are set → ⚠️ Maybe YES

TL;DR

  • Bloom Filters are an optimization technique
  • They guarantee no false negatives
  • They can produce false positives

In practice, Bloom filters are used as a fast pre-check before hitting a database.
Bloom Filters don't tell you "YES or NO".
They tell you "NO or MAYBE".

Top comments (0)