Haji Rufai

Posted on Jun 9

I Built a Bloom Filter from Scratch in Pure Python and Finally Understood How Databases Skip Reading Data

#python #database #dataengineering #programming

A few years into building data pipelines, I kept running into the same trick in systems I depended on every day.

Cassandra used it to avoid reading SSTables that did not contain my key. Bigtable used it for the same reason. Spark used it to skip whole partitions during joins. My CDN used it to decide whether a URL had ever been cached. Postgres extensions used it to prune rows before touching disk.

Same idea, five different systems. A tiny structure that answers one question fast: "have I probably seen this before?"

It is called a Bloom filter. And the day I built one from scratch was the day I stopped treating it like magic.

This is the toy version I wrote. No libraries, no bitarray, no mmh3. Just pure Python and the math underneath. By the end you will know exactly why a structure that fits in a few kilobytes can replace a lookup table that would cost you gigabytes.

The problem it actually solves

Imagine you run an LSM-tree storage engine. Writes go into many small sorted files on disk. To read one key, you might have to check ten files. Most of them do not contain the key. Each miss is a wasted disk read.

A disk read is slow. A memory check is fast. So before touching a file, you ask a small in-memory structure: "could this key be in here?"

If it says no, you skip the file entirely. If it says maybe, you read the file and check for real.

The key word is probably. A Bloom filter never says "yes". It says "no" with certainty, or "maybe" with a small chance of being wrong. It can produce false positives. It can never produce a false negative.

That asymmetry is the whole point. A false positive costs you one unnecessary read. A false negative would cost you correctness. Bloom filters trade a little wasted work for a hard guarantee.

The core idea in one paragraph

You keep a big array of bits, all starting at zero. To add an item, you run it through several hash functions. Each hash points at one position in the bit array. You set those bits to one.

To check an item, you hash it the same way and look at those same positions. If every one of them is a one, the item is probably present. If even one of them is a zero, the item is definitely absent, because adding it would have set that bit.

That is it. No stored keys, no buckets, no resizing. Just bits flipping on.

Building it: the bit array

Python has no native bit array, so I packed bits into a bytearray. Each byte holds eight bits. To touch bit i, I find its byte with i // 8 and its position inside that byte with i % 8.

class BitArray:
    def __init__(self, size_in_bits):
        self.size = size_in_bits
        self.data = bytearray((size_in_bits + 7) // 8)

    def set(self, i):
        self.data[i // 8] |= (1 << (i % 8))

    def get(self, i):
        return (self.data[i // 8] >> (i % 8)) & 1

Setting a bit uses a bitwise OR with a mask. Reading shifts the byte right and masks off everything but the lowest bit. This is the same packing trick databases use to keep the filter small enough to live in memory.

Building it: the hash functions

A Bloom filter needs several independent hash functions. Writing several good hashes by hand is painful, so I used a standard shortcut called double hashing. You compute two base hashes once, then combine them to fake as many as you need.

import hashlib

def _two_hashes(item):
    item = item.encode() if isinstance(item, str) else item
    h1 = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
    h2 = int.from_bytes(hashlib.md5(item).digest()[:8], "big")
    return h1, h2

def _hash_positions(item, num_hashes, size):
    h1, h2 = _two_hashes(item)
    for i in range(num_hashes):
        yield (h1 + i * h2) % size

The formula h1 + i * h2 gives a different position for each i. It behaves close enough to independent hashes for real use, and Cassandra ships essentially this approach.

Building it: the filter

Now the structure ties together. Add sets bits. Check reads bits and bails out the moment it sees a zero.

class BloomFilter:
    def __init__(self, size_in_bits, num_hashes):
        self.bits = BitArray(size_in_bits)
        self.num_hashes = num_hashes
        self.count = 0

    def add(self, item):
        for pos in _hash_positions(item, self.num_hashes, self.bits.size):
            self.bits.set(pos)
        self.count += 1

    def __contains__(self, item):
        for pos in _hash_positions(item, self.num_hashes, self.bits.size):
            if not self.bits.get(pos):
                return False
        return True

The early return False is where the certainty comes from. One zero bit means the item was never added. No exceptions.

bf = BloomFilter(size_in_bits=1000, num_hashes=4)
bf.add("user_42")
bf.add("user_99")

print("user_42" in bf)   # True
print("user_99" in bf)   # True
print("user_500" in bf)  # False, almost certainly

The part everyone gets wrong: sizing it

Here is where most tutorials wave their hands. The size and number of hashes are not guesses. They come from two numbers you choose: how many items you expect, and how many false positives you can tolerate.

The formulas are short. Given n expected items and a target false positive rate p:

import math

def optimal_size(n, p):
    # number of bits
    m = -(n * math.log(p)) / (math.log(2) ** 2)
    return int(math.ceil(m))

def optimal_hashes(m, n):
    # number of hash functions
    k = (m / n) * math.log(2)
    return max(1, int(round(k)))

Say you expect one million keys and accept a one percent false positive rate.

n = 1_000_000
p = 0.01
m = optimal_size(n, p)        # about 9,585,059 bits
k = optimal_hashes(m, n)      # 7 hash functions

print(f"{m} bits = {m / 8 / 1024 / 1024:.2f} MB")
print(f"{k} hash functions")

That is roughly 1.2 megabytes to track a million keys with one percent error. A real hash set of a million strings would cost you tens of megabytes. That gap is why storage engines reach for this instead of a set.

Push the error rate down to 0.1 percent and the filter grows to about 1.8 megabytes. Cheaper accuracy is mostly a matter of spending a few more bits.

Proving the false positive rate is real

Theory is nice. I wanted to see the one percent with my own eyes, so I built a filter, added known items, then queried items I never added and counted how often it lied.

def make_optimal(n, p):
    m = optimal_size(n, p)
    k = optimal_hashes(m, n)
    return BloomFilter(m, k)

bf = make_optimal(10_000, 0.01)

for i in range(10_000):
    bf.add(f"present_{i}")

false_positives = 0
trials = 100_000
for i in range(trials):
    if f"absent_{i}" in bf:
        false_positives += 1

print(f"measured rate: {false_positives / trials:.4f}")
# measured rate: roughly 0.0102

The measured rate lands right on top of the one percent target. The math is not a vibe. It is a contract the structure actually keeps.

What you cannot do, and why that is fine

You cannot delete from a plain Bloom filter. Clearing the bits for one item could clear bits another item depends on, and that would create a false negative, the one thing the structure promises never to happen. If you need deletes, you move to a counting Bloom filter that stores small counters instead of single bits.

You also cannot list what is inside. The filter holds no keys, only the shadow they cast on the bit array. That is exactly why it is so small. It remembers that you saw something without remembering what.

Why this changed how I read systems

After building it, the trick stopped being a black box and started being a tool I could reach for. Dedup a stream of billions of events without holding every ID in memory. Skip remote lookups for keys that were obviously never written. Prune join partitions before they ever hit the network.

Every one of those is the same one million keys in 1.2 megabytes idea, wearing a different costume.

The systems we depend on are not magic. They are a handful of sharp ideas, implemented carefully and reused everywhere. The fastest way to stop being intimidated by them is to build the toy version yourself.

Pick the structure under the system you lean on most. Write it from scratch this week. You will read your own infrastructure differently afterward.

If you build a Bloom filter after reading this, reply and tell me what false positive rate you measured. I read every comment, and I am always curious whether the math holds on someone else's machine.

Follow me for more "build the toy version of what you depend on" deep dives. Next up I am tearing into how databases keep millions of keys sorted on disk without ever loading them all into memory.

DEV Community