A couple of years ago I wanted to learn more about different types of commonly used data structures. It’s one thing to have used them, but knowing how they work or building them from scratch is something else. In this article, we’re doing a deep dive into bloom filters including some code examples and use-cases.
What is a Bloom Filter?
According to Wikipedia, a bloom filter is:
A space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.
So in its essence, a bloom filter is an array of bits (1/8 of a byte) where initially all bits are set to 0
. To add an element to the bloom filter a hashing function is required to “map” the element to a specific position in the bit array. Two or more elements can be mapped to the same position which can result in false positives (which is why it's called probabilistic). However, if the bit at the position of the element is 0
we know for sure that it's not included in the set.
In short:
A bloom filter can be used to efficiently check if an element is not included in a set.
Real-world usage
CRLite
In 2020 Mozilla announced that they are working on an alternative for validating revoked certificates in Firefox which is called CRLite. At the time a collection of all unexpired revoked certificates compromised 6.7GB on disk which was compressed to a ~ 1.3MB bloom filter. Verifying if a certificate is revoked is cheap for valid certificates as it only requires hashing the certificate serial number to get the position of the bit and verifying if it's set to 0
.
Cassandra
Apache Cassandra uses bloom filters to determine whether an SSTable has data for a particular partition. Verifying if the SSTable has data for a partition is cheap as it doesn’t need to read its content (avoiding IO operations).
Building one from scratch
We’re going to use Rust to create our bloom filter. Let’s start with a simple structure that includes the bit array:
struct BloomFilter<const N: usize> {
bytes: [u8; N],
}
impl<const N: usize> BloomFilter<N> {
fn new() -> Self {
Self { bytes: [0; N] }
}
}
fn main() {
let filter = BloomFilter::<10>::new();
}
Now we have a basic structure that holds our bit set and sets all bits to 0
when the bloom filter is created. Since there is no way to represent a single bit as a type our bit array is defined as an array of bytes. Thus size N
in this example is the size in bytes (as u8
is the data type for a byte in Rust), not bits.
The first thing we’re going to implement is adding a new element to our bloom filter. We need to hash the value to get a numeric value, get the position of the byte, get the position of the bit inside the byte, and set it to 1
. We’re going to use strings for our elements and use the DefaultHasher
as our hashing function.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
fn hash(value: &str) -> u64 {
let mut s = DefaultHasher::new();
value.hash(&mut s);
s.finish()
}
struct BloomFilter<const N: usize> {
bytes: [u8; N],
}
impl<const N: usize> BloomFilter<N> {
fn new() -> Self {
Self { bytes: [0; N] }
}
fn add(&mut self, key: &str) {
let bit_size = N * 8;
let pos = hash(key) % (bit_size as u64);
let (byte_idx, bit_idx) = (pos / 8, pos % 8);
self.bytes[byte_idx as usize] |= 1 << bit_idx;
}
}
There is a lot to unpack here. To get the absolute bit position, the element is hashed to get a numeric value and a modulo operation is used against the size of the bit array so that we end up with a valid position. Then we need to find the position of the byte and calculate the relative bit position. At last, we set the bit by using a bitwise operation.
The last step is to implement a method that checks if an element is not included in the set. The process is similar to adding an element but instead of setting the bit we need to check its value.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
fn hash(value: &str) -> u64 {
let mut s = DefaultHasher::new();
value.hash(&mut s);
s.finish()
}
struct BloomFilter<const N: usize> {
bytes: [u8; N],
}
impl<const N: usize> BloomFilter<N> {
fn new() -> Self {
Self { bytes: [0; N] }
}
fn get_positions(&self, key: &str) -> (usize, u64) {
let bit_size = N * 8;
let pos = hash(key) % (bit_size as u64);
((pos / 8) as usize, pos % 8)
}
fn add(&mut self, key: &str) {
let (byte_idx, bit_idx) = self.get_positions(key);
self.bytes[byte_idx] |= 1 << bit_idx;
}
fn contains(&self, key: &str) -> bool {
let (byte_idx, bit_idx) = self.get_positions(key);
self.bytes[byte_idx] & (1 << bit_idx) != 0
}
}
Most of the logic is moved to the get_positions
method which calculates the byte index and bit position. Let’s also add some test code to see if it works as intended:
fn main() {
let mut filter = BloomFilter::<10>::new();
filter.add("test1");
filter.add("test2");
println!("test1: {}", filter.contains("test1"));
println!("test2: {}", filter.contains("test2"));
println!("test3: {}", filter.contains("test3"));
}
On my machine this results in the following output:
➜ cargo run
Compiling bloom-filter v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 0.14s
Running `target/debug/bloom-filter`
test1: true
test2: true
test3: false
Remember that we can only guarantee that test3
is not included in the set. We can’t be sure that either test1
or test2
are included in the set as these could be false positives. The source code can be found on GitHub.
Closing remarks
In our bloom filter implementation, we choose a random size of 10
bytes (=80 bits). In the real world, the size should be based on your dataset and the expected probability rate of false positives. Choosing a size that’s too small can have negative effects on performance as there will be too many false positives. The best way to determine the properties of your bloom filter is by running performance tests and using the bloom filter calculator.
Top comments (0)