A Bloom filter is a probabilistic data structure. Which means when you check if it contains a value it responds with "No, I don't." or "I probably do." You might be reasonably thinking, "Why is that helpful?" Well, when built well, they are really, really, really fast and can help avoid time heavy operations.
Practical Examples of a Bloom filter
- When a Web Browser discovers a malicious website it adds it to a Bloom filter. And whenever you ask to go to a website, that site’s URL is checked against that Bloom filter. And it reports back that the website is safe, because it definitely wasn’t in there, or let’s you know that it might be a scary place and are you sure you want to visit it.
- Every Medium user has a Bloom filter that tracks what articles they’ve read. Before Medium recommends an article to you it checks that filter and if it sees that you definitely haven’t read it, than it recommends it. But if there’s a chance you have read it, Medium doesn’t pass it along.[1]
- When a word needs to be hyphenated at the end of a line, most of the time it doesn’t need any special rules, but in a relatively small amount of cases it does, so feed those words into a Bloom filter. And every time you need to hyphenate a word, check it against that filter to know if it definitely doesn’t need you to figure out an edge case, or maybe you do so you should do more calculations.[2]
So now that we know when to use a Bloom filter, let's look at how they actually work.
First let's make a simple one in JavaScript.
class BloomFilter {
constructor(size){
this.storage = [];
for (let i = 0; i < size; i++) {
this.storage.push(false);
}
}
hash (key) {
/* takes a key and hashes it several times
and returns an array of indexes based on those hashes */
}
add (key) {
this.hash(key).forEach(index => {
this.storage[index] = true;
}
}
contains (key) {
return this.hash.every(index=> this.storage[index]);
}
}
There are three important numbers to consider when you are making a Bloom filter.
- m is the number of indexes in the array
- k is the number of hashing functions
- n is the number of items you want to store in the filter
Without going deep into the math, the formula you want to use to compute your chance of a false positive is
That looks really complicated, but just know that the bigger n gets, the bigger m and k have to get to keep the number of false positives down.
Here's How a Bloom Filter Actually Works
When you add a value to the filter it gets pushed through k hash functions, for this example, let's say three. And those hashes get correlated to an index in the filter's storage array. So you flip those three indexes to true.
In this example red is false and green is true.
Let's add another value to the set.
Now when you want to look for a value you pass it through the same k hash functions and check if all the indexes come back true.
If any of them come back false, you know the value definitely isn't in the filter.
However, there is a chance that if all of them come back true, it could just be a coincidence.
So even if you get a positive result, you might still have to perform additional checks on the data, depending on what you built the filter for. But with a proper k and m you should be able to keep you rates of false positives very low. Which means you've substantially reduced the amount of data that you have to do intensive calculations on.
If you'd like to see an excellent interactive representation of a Bloom filter in action check out this link.
Top comments (0)