Introduction
When dealing with large-scale data systems — especially where memory and speed matter — we often face a simple but important question:
“Have I seen this item before?”
You might reach for a HashSet
or database index — but what if memory is tight and occasional false positives are acceptable?
That’s where Bloom Filters come in.
What is a Bloom Filter?
A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can return:
- Definitely not in the set
- Possibly in the set
The trade-off? It can return false positives, but never false negatives.
How Does It Work?
A Bloom filter is essentially:
- A bit array of size
m
, initialized to all 0s. -
k
different hash functions that map input to positions in the bit array.
Insertion
To insert an element:
- Hash the element with all
k
hash functions. - Set each corresponding bit to 1.
Lookup
To check membership:
- Hash the element with the same
k
functions. - If any of the corresponding bits are 0 → definitely not in the set.
- If all are 1 → possibly in the set.
When Should You Use a Bloom Filter?
Use cases:
- Caching layers (check if it's worth looking up an item)
- Distributed systems (avoid expensive queries)
- Email spam detection
- Web crawler visited URLs
Ideal when:
- You don’t need to delete items.
- You can tolerate false positives.
- Memory usage matters more than perfect accuracy.
Real world uses of Bloom Filter
Cassandra the distributed highly available and high write throughput NoSQL database employs bloom filters extensively in its read path. Cassandra keeps data flushed into SSTables (String Sorted Tables) on disk to quickly read data. Bloom filters help here by efficiently determining if the data being requested exists in a particular SSTable file. Bloom Filters
Similarly Apache HBase and ScyllaDB are also example of some databases which use bloom filters.
CDNs like Akamai leverage bloom filters to quickly check if an object is present in the cache or not.
Bloom Filter Implementation
public class BloomFilter {
private BitSet bitset;
private int bitSize;
private int hashCount;
public BloomFilter(int bitSize, int hashCount) {
this.bitSize = bitSize;
this.hashCount = hashCount;
this.bitset = new BitSet(bitSize);
}
public void add(String item) {
for (int i = 0; i < hashCount; i++) {
int hash = getHash(item, i);
bitset.set(Math.abs(hash % bitSize));
}
}
public boolean mightContain(String item) {
for (int i = 0; i < hashCount; i++) {
int hash = getHash(item, i);
if (!bitset.get(Math.abs(hash % bitSize))) return false;
}
return true;
}
private int getHash(String item, int seed) {
return item.hashCode() ^ (seed * 0x5bd1e995);
}
}
Final Thoughts
Bloom filters are a powerful addition to any backend engineer’s toolkit, especially when working at scale where memory efficiency is critical. This post aimed to offer a simplified introduction to the concept.
If you're interested in a more in-depth explanation—along with an interactive visualization—you can check out a post I wrote on the topic:
https://codejumbl.com/posts/bloom-filters
Top comments (0)