I talked with a peer about a bloom filter and it sounded interesting so I decided to explore further. A bloom filter , not a snapchat filter, is a data structure that prevents expensive queries on a database. In this blog I will talk about the inner workings of the bloom filter that make the bloom filter so efficient. I also have never made a bloom filter nor am I an expertise on bloom filters. This is my understanding of how bloom filters work based on the research I have done.
How it's made ?
The bloom filter uses an array to store a one at a certain locations in the array based on the hash value. Remember arrays are zero based indexed so the first bucket will be 0 bits. The hash value will be a number that is less than the the size of the array. Based on the length of the array and the amount of hash functions used will determine the probability of a false positive. A false positive in this case would mean that the bloom filter will return true for a member in the data set when really it isn’t. This happens occasionally. The good news is that it will never give you a false negative. Returning false for a member that’s actually in the data set.
Visuals
Looking at the array you will see 0s and 1s filled in the buckets. Items are hashed and stored in the set. Hash functions will always hash the same inputs to the same values (hash value). In the above picture x, y, and z are present in the data base. We know this by looking at the hash values for the letters which reference where the one will be stored in the array. If there is a one at every location in the array, it will return true. In the case where a member is not in the set like W. When looking at the locations and if zero is the value. The bloom filter will return false immediately. Preventing an expensive query. Sometimes an input value that is not in the database can be hashed to the same value as one that is in the database. This is where returning a false positive can happen. Rarely this happens but determining how many items you want to store and hash functions you want to use before hand will prevent the chances of this happening.
Conclusion
The bloom filter is an approximate data structure because occasionally you can get a false positive. It’s efficient because it checks an array to see if an input item has been there before. If it hasn’t it can immediately return false. This prevents going through the whole data base which can take a lot of time and programmers are all about optimizing time complexity. Many companies use bloom filters like Bing, Medium, and bitcoin. Here is a link on how those companies put the bloom filter to use https://en.wikipedia.org/wiki/Bloom_filter#Examples. Thanks for reading!
Top comments (0)