Ayush Kumar Anand

You Can't Resize a Bloom Filter. Here's What To Do Instead.

What are Bloom filters?

Bloom filters are a probabilistic data structure that uses a bit array and several hash functions to answer membership queries cheaply, reducing the load on our main databases. You might ask: why not just use a caching layer like Redis? The answer comes down to which tradeoff you are willing to make: accuracy or memory. A plain Redis cache gives you exact answers but is not memory efficient, whereas a Bloom filter sits at the opposite end of that spectrum: tiny, but occasionally wrong.
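
To put rough numbers on that tradeoff: the sizing formula implemented in the next section works out to about -ln(p) / (ln(2)^2) bits per item for a target false-positive rate p. At p = 1% that is roughly 9.6 bits, just over one byte per item regardless of key length, while an exact cache must store every key in full.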

How do they work?

Suppose you have settled on the accuracy tradeoff, perhaps because you cannot afford the memory cost of an exact cache. Before using a Bloom filter, you need to decide two parameters in advance: the size of the bit array and the number of hash functions. This matters because once the size is fixed and data starts flowing in, the filter cannot be resized, and items, once written, cannot be deleted.

First, let us consider the number of hash functions. It cannot be too low: with only a few bits checked per item, a single accidental collision is enough to produce a false positive. But it cannot be too large either: every insert would then set many bits, filling the array faster, which again leads to accidental collisions.

But luckily, we have a formula:

Formula: k = (m / n) × ln(2)

where m = size of our bit array (e.g., 100 bits),
and n = number of items we expect to add (e.g., 10 users).
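
Plugging in the example numbers: with m = 100 bits and n = 10 items, k = (100 / 10) × ln(2) ≈ 6.9, so we would round up and use 7 hash functions.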

Below is a Go implementation of these formulas:

package main

import (
    "fmt"
    "math"
)

// CalculateOptimalParams calculates 'm' (size of array) and 'k' (num hashes)
// based on how many items (n) you have and what error rate (p) you accept.
func CalculateOptimalParams(n int, p float64) (uint, uint) {
    // 1. Calculate ideal Bit Array Size (m)
    // Formula: m = - (n * ln(p)) / (ln(2)^2)
    m := - (float64(n) * math.Log(p)) / math.Pow(math.Log(2), 2)

    // 2. Calculate ideal Number of Hash Functions (k)
    // Formula: k = (m / n) * ln(2)
    k := (m / float64(n)) * math.Log(2)

    return uint(math.Ceil(m)), uint(math.Ceil(k))
}

func main() {
    n := 1000 // We expect 1,000 users
    p := 0.01 // We accept a 1% false-positive rate

    m, k := CalculateOptimalParams(n, p)

    fmt.Println("Recommended Bit Array Size (m):", m)
    fmt.Println("Recommended Hash Functions (k):", k)
}
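
For these inputs, the program recommends m = 9586 bits and k = 7 hash functions: roughly 1.2 KB of memory to track 1,000 usernames at a 1% error rate.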

Now we come to the core logic. A Bloom filter works by answering one question: has this item ever been added to the data structure, i.e. the bit array?

Let us take an example.
Suppose we are building an app like Splitwise. We ask each user to choose a username, with one constraint: the username must be globally unique.

Now, when a user chooses a name, we can run a SQL query like:

SELECT 1 FROM users WHERE username = 'ayush' LIMIT 1

Now, if the query returns a row, we ask the user to choose another name. But even in the best case, assuming an index on the username column, the lookup cost is proportional to log(n), where n is the size of the table, and every signup attempt pays it.

Conversely, we can put a Bloom filter in front. We simply check the filter for the name. If the answer is a definite NO, we can let the user keep the username without touching the database; otherwise we query the database to confirm whether the name is already taken and proceed accordingly. Since most lookups never reach the database, the efficiency gain is huge.
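
A minimal sketch of that flow in Go (the Checker fields are hypothetical stand-ins: any Bloom filter lookup, and the SQL query shown above):

package main

import "fmt"

// Checker wires a Bloom filter in front of the real database lookup.
// Both function fields are hypothetical stand-ins for illustration.
type Checker struct {
    filterExists func(string) bool // the Bloom filter lookup
    takenInDB    func(string) bool // the SQL query shown above
}

// Available reports whether the username can be claimed.
// A "no" from the filter is definitive; a "maybe" falls through to the DB.
func (c Checker) Available(name string) bool {
    if !c.filterExists(name) {
        return true // definitely unused: no database round-trip needed
    }
    return !c.takenInDB(name) // confirm against the source of truth
}

func main() {
    c := Checker{
        // Pretend "ayush" has already been added to both filter and DB.
        filterExists: func(s string) bool { return s == "ayush" },
        takenInDB:    func(s string) bool { return s == "ayush" },
    }
    fmt.Println(c.Available("ayush")) // false: filter said "maybe", DB confirmed
    fmt.Println(c.Available("kumar")) // true: filter said "definitely not"
}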

The Maths Behind It

Bloom filters use two things:

  1. A Bit Array: An array of zeros and ones. (e.g., [0, 0, 0, 0, 0]).
  2. Hash Functions: Math formulas that turn a word into a number.

Let's walk through it: Imagine an empty array of 10 bits: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Step 1: Adding "Ayush"

We run "Ayush" through 2 hash functions:

  • Hash 1("Ayush") = 2
  • Hash 2("Ayush") = 7

We flip the bits at index 2 and 7 to 1. Array: [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

Step 2: Checking "Kumar"

We want to know if "Kumar" exists.

  • Hash 1("Kumar") = 3
  • Hash 2("Kumar") = 4

We check index 3 and 4. Are they 1?

Index 3 is 0.

Verdict: Kumar is DEFINITELY NOT in the set. (We didn't need to check the database!).

Step 3: The "False Positive" (The "Maybe")

Let's say we check "Anand".

  • Hash 1("Anand") = 2
  • Hash 2("Anand") = 7

We check the array. Index 2 is 1. Index 7 is 1. (Because "Ayush" set them earlier!).

[Diagram: Bloom filter architecture, collision, and scalable stacking]

  • Verdict: The filter says "Anand MAYBE exists."
  • Reality: Anand doesn't exist. This is a collision.
  • Action: Now (and only now), we check the real database to confirm.

Below is a Go implementation:

package main

import (
    "fmt"
    "hash/fnv"
)

// Defining the Bloom Filter
type BloomFilter struct {
    bitset []bool // The array of 0s and 1s
    size   int    // Size of the array
}

func NewBloomFilter(size int) *BloomFilter {
    return &BloomFilter{
        bitset: make([]bool, size),
        size:   size,
    }
}

// Hash Function 1
func (bf *BloomFilter) hash1(s string) int {
    h := fnv.New32a()
    h.Write([]byte(s))
    return int(h.Sum32()) % bf.size
}

// Hash Function 2 (Simulated by adding salt)
func (bf *BloomFilter) hash2(s string) int {
    h := fnv.New32a()
    h.Write([]byte(s + "salt")) // Add a simple salt to change the hash
    return int(h.Sum32()) % bf.size
}

// Add an item
func (bf *BloomFilter) Add(item string) {
    idx1 := bf.hash1(item)
    idx2 := bf.hash2(item)
    bf.bitset[idx1] = true
    bf.bitset[idx2] = true
    fmt.Printf("Added: %s (Indices: %d, %d)\n", item, idx1, idx2)
}

// Check if item exists
func (bf *BloomFilter) Exists(item string) bool {
    idx1 := bf.hash1(item)
    idx2 := bf.hash2(item)

    // If BOTH bits are set, the item *might* exist;
    // if ANY bit is clear, it definitely does NOT exist.
    return bf.bitset[idx1] && bf.bitset[idx2]
}

func main() {
    filter := NewBloomFilter(100)

    // 1. Add Ayush
    filter.Add("Ayush")

    // 2. Check Ayush
    fmt.Println("Does Ayush exist?", filter.Exists("Ayush")) // True (Maybe)

    // 3. Check Kumar
    fmt.Println("Does Kumar exist?", filter.Exists("Kumar")) // False (Definitely Not)
}
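
A side note on the hashing: the "salt" trick above generalizes. If you need the k = 7 hash functions we calculated earlier, real implementations usually derive all k indices from one or two base hashes via double hashing (the Kirsch-Mitzenmacher technique). A sketch, assuming FNV as the base hash:

package main

import (
    "fmt"
    "hash/fnv"
)

// indices derives k bit positions from a single 64-bit FNV hash of the item,
// split into two 32-bit halves h1, h2 and combined as h1 + i*h2.
func indices(item string, k, size int) []int {
    h := fnv.New64a()
    h.Write([]byte(item))
    sum := h.Sum64()
    h1, h2 := uint32(sum>>32), uint32(sum)
    out := make([]int, k)
    for i := 0; i < k; i++ {
        out[i] = int((h1 + uint32(i)*h2) % uint32(size))
    }
    return out
}

func main() {
    // 7 indices for the m = 9586, k = 7 we computed earlier.
    fmt.Println(indices("Ayush", 7, 9586))
}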

Some real-world examples:

  • Medium.com: Uses Bloom Filters to avoid recommending articles you have already read. Logic: "Has Ayush read this ID?" If Filter says "No", show it. If "Maybe", check DB.
  • Google Chrome: Malicious URL checker. Instead of storing every bad URL in the browser (too big), they store a Bloom Filter. If you visit a site, Chrome checks the filter. If it says "Maybe dangerous," it calls Google servers to confirm.
  • Cassandra/Postgres (Databases): Before searching a hard disk for a row, they check a Bloom Filter. If the row isn't there, they save a slow disk operation.

How to scale?

So far we have assumed we can decide the size of the array and the number of items up front. But that is just an estimate, and it will change in the future. So how do we scale? The standard approach is to stack additional Bloom filters of increasing size, a construction known as Scalable Bloom Filters (SBF). This is the technique used by the Bloom filter in Redis Stack.

So, how does the stacking algorithm work?

A. The "Write" Rule (Adding Data)
We only write to the Newest (Active) filter.

  • If you want to add "Ayush", you don't touch Filter 1 or Filter 2. You add him to Filter 3.
  • Why? Because Filters 1 and 2 are frozen: they have already reached their planned capacity.

B. The "Read" Rule (Checking Data)
We must check **ALL** filters, usually starting from the newest to the oldest.

  • Question: "Does Ayush exist?"
  • Check Filter 3: "No." (He isn't in the newest filter).
  • Check Filter 2: "No."
  • Check Filter 1: "Yes." (He was added 2 months ago).
  • Result: True.

The Logical OR: Result = F3.Exists() || F2.Exists() || F1.Exists()
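
Below is a minimal sketch of that stacking logic, reusing the BloomFilter type from the implementation above (the doubling growth factor and the fixed per-layer capacity are assumptions for illustration; real SBF implementations make both configurable):

// ScalableBloomFilter stacks fixed-size filters; only the newest accepts writes.
type ScalableBloomFilter struct {
    filters []*BloomFilter
    counts  []int // items added to each layer
    cap     int   // items allowed per layer before we grow
}

func NewScalableBloomFilter(initialSize, capPerLayer int) *ScalableBloomFilter {
    return &ScalableBloomFilter{
        filters: []*BloomFilter{NewBloomFilter(initialSize)},
        counts:  []int{0},
        cap:     capPerLayer,
    }
}

// Add writes only to the newest filter; older layers stay frozen.
func (s *ScalableBloomFilter) Add(item string) {
    last := len(s.filters) - 1
    if s.counts[last] >= s.cap {
        // The active layer is full: stack a new, larger filter on top.
        s.filters = append(s.filters, NewBloomFilter(s.filters[last].size*2))
        s.counts = append(s.counts, 0)
        last++
    }
    s.filters[last].Add(item)
    s.counts[last]++
}

// Exists is the logical OR across every layer, newest to oldest.
func (s *ScalableBloomFilter) Exists(item string) bool {
    for i := len(s.filters) - 1; i >= 0; i-- {
        if s.filters[i].Exists(item) {
            return true
        }
    }
    return false
}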

Problems with the scaling approach

Every time we add a new filter layer, we introduce a New Source of Lies (False Positives).

  • If Filter 1 lies 1% of the time, and Filter 2 lies 1% of the time...
  • Your total chance of being lied to is now roughly 2%.
  • If you have 10 filters, your error rate becomes roughly 10%. That is garbage.

How did Redis fix it?

They fixed it by making every new filter stricter than the last one.

  • Filter 1: We allow 0.1% error.
  • Filter 2: We allow 0.01% error. (10x stricter).
  • Filter 3: We allow 0.001% error. (100x stricter).

Hence, even if our stack uses 20 filters, the total error rate stays tiny (well under 1%) because the later filters are so precise.
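
The arithmetic behind that claim is a geometric series. Summing the per-layer error rates above:

Total ≈ 0.1% + 0.01% + 0.001% + ... = 0.1% × (1 / (1 - 0.1)) ≈ 0.111%

so no matter how many layers we stack, the combined false-positive rate stays barely above that of the first layer alone.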
