DEV Community

Sandeep Mahanty
Sandeep Mahanty

Posted on • Originally published at codejumbl.com

Bloom Filters: A probabilistic powerhouse

Introduction

When dealing with large-scale data systems — especially where memory and speed matter — we often face a simple but important question:

“Have I seen this item before?”

You might reach for a HashSet or database index — but what if memory is tight and occasional false positives are acceptable?

That’s where Bloom Filters come in.

What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can return:

  • Definitely not in the set
  • Possibly in the set

The trade-off? It can return false positives, but never false negatives.


How Does It Work?

A Bloom filter is essentially:

  • A bit array of size m, initialized to all 0s.
  • k different hash functions that map input to positions in the bit array.

Insertion

To insert an element:

  1. Hash the element with all k hash functions.
  2. Set each corresponding bit to 1.

Lookup

To check membership:

  1. Hash the element with the same k functions.
  2. If any of the corresponding bits are 0 → definitely not in the set.
  3. If all are 1 → possibly in the set.

When Should You Use a Bloom Filter?

Use cases:

  • Caching layers (check if it's worth looking up an item)
  • Distributed systems (avoid expensive queries)
  • Email spam detection
  • Web crawler visited URLs

Ideal when:

  • You don’t need to delete items.
  • You can tolerate false positives.
  • Memory usage matters more than perfect accuracy.

Real world uses of Bloom Filter

Cassandra the distributed highly available and high write throughput NoSQL database employs bloom filters extensively in its read path. Cassandra keeps data flushed into SSTables (String Sorted Tables) on disk to quickly read data. Bloom filters help here by efficiently determining if the data being requested exists in a particular SSTable file. Bloom Filters

Similarly Apache HBase and ScyllaDB are also example of some databases which use bloom filters.

CDNs like Akamai leverage bloom filters to quickly check if an object is present in the cache or not.

Bloom Filter Implementation

  public class BloomFilter {
    private BitSet bitset;
    private int bitSize;
    private int hashCount;

    public BloomFilter(int bitSize, int hashCount) {
        this.bitSize = bitSize;
        this.hashCount = hashCount;
        this.bitset = new BitSet(bitSize);
    }

    public void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            int hash = getHash(item, i);
            bitset.set(Math.abs(hash % bitSize));
        }
    }

    public boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            int hash = getHash(item, i);
            if (!bitset.get(Math.abs(hash % bitSize))) return false;
        }
        return true;
    }

    private int getHash(String item, int seed) {
        return item.hashCode() ^ (seed * 0x5bd1e995);
    }
}
Enter fullscreen mode Exit fullscreen mode

Final Thoughts

Bloom filters are a powerful addition to any backend engineer’s toolkit, especially when working at scale where memory efficiency is critical. This post aimed to offer a simplified introduction to the concept.

If you're interested in a more in-depth explanation—along with an interactive visualization—you can check out a post I wrote on the topic:
https://codejumbl.com/posts/bloom-filters

Top comments (0)