Shavin Anjitha

Posted on Oct 10

Scaling with Probabilistic Techniques in System Design: Bloom Filters

#systemdesign #datastructures #softwareengineering

Have you ever wondered how large-scale systems like Twitter, Facebook, or Instagram instantly check whether a username or email ID is already taken or not with thousands of requests per second to these systems?

We could check the database that stores user records to see if a username or email ID is available, but this method is not very scalable as the number of records grows. With billions of records in the table, querying the database’s tables can become a bottleneck for platforms like Twitter and Facebook, leading to performance issues when handling thousands of requests per second.

Additionally, it can lead to poor user experience and is a very time-consuming task.

One simple solution to this problem is to introduce a caching system. The caching system can reduce the number of times it hits the database each time a user requests availability. Caching can decrease the load on the database and significantly improve response time. It can help the system to scale efficiently by caching the data of repeating requests.

But also, a cache system alone is not sufficient. If the cache is missing, the system has to hit the database. Also, what if a cache issue steals data? And also, caches consume memory, and if not properly managed, it can lead to excessive memory usage.

So, what is the solution? The very efficient and scalable solution to this problem is Bloom Filters. Memory-efficient probabilistic data structure was introduced by computer scientist Howard Bloom in 1970.

Bloom filters were introduced as a technique for a system to check the availability of an element in a set when the number of elements in the set requires an impractically large amount of memory. As you know, achieving something great comes with a cost. In this case, the price for efficiency is that this solution is probabilistic in nature. This means there might be some false positive results, where it might indicate that an element is in the set when it is not. However, it can never give false negative results, meaning it will never tell you that an element is not in the set when it actually is.

One interesting aspect of bloom filters is that elements can be added to the set but not removed. As a result, the number of elements in the set continues to grow indefinitely, leading to a steadily increasing false positive rate.

Bloom filters provide constant time performance for insertion and querying, making them an efficient solution for testing membership problems. Additionally, they can be easily distributed across multiple servers, making them well-suited for cloud-native distributed systems.

How Bloom Filter Works

Bloom filters use fixed-size m-bit arrays all set to 0 initially. Filter equipped with k-number of hash functions, which maps elements to one of possible m position in the bit array. These hash functions should be independent and uniformly distributed for optimal functioning of the bloom filter.

Assume we want to insert an element into the set. First, we need to feed it to the k-hash functions, take the k-bit array positions, and turn those bits array positions to 1.

To test whether an element is in the set, feed it to the k-hash functions to get the k-bit array positions. If any of the bits at these positions in the m-bit array is 0, then the element is definitely not in the set. But if all the k-bit array positions are 1, then either the element is in the set or another element mapped to the same k-bit array positions, providing false positive results.

Removing an element from this simple bloom filter is impossible because if we remove those k-bit array positions from the list, there is a chance that we also removed the elements that mapped to the same k-bit array positions. Although setting those k-bit array positions seems simple, it can introduce the possibility of false negatives, which is something we should avoid.

Example of how bloom filter works

Suppose we want to check if a book is available in a large online bookstore. Each book has a unique ISBN for identification. One way to do this is by searching the book data table. However, as the number of records increases, this method can become very time-consuming.

We can introduce a bloom filter to tackle this problem. Let’s assume we create a 10-bit array with 3 independent, uniformly distributed hash functions.

arr = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

H(e) = {h1(e), h2(e), h3(e)}

Let's assume we have added the two books with hashes below:

H (Book A) = {1, 5, 8}

H (Book B) = {2, 7, 9}

Now the bit array looks like this

arr = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]

Now, if we want to check whether Book 3 is in the store, we can hash the book’s ISBN and take the 3-bit array positions like below:

H (Book 3) = {3, 5, 9}

arr[3] = 0, arr[5] = 1, arr[9] = 1

There are one-bit positions that are not set to 1, and therefore Book 3 is not in the book store.

But if another book, Book 4, gives the hashes like this:

H (Book 4) = {2, 5, 9}

arr[2] = 1, arr[5] = 1, arr[9] = 1

In this case, we can’t definitely say that the book is definitely in the store but with a false positive probability.

Choosing Good Hash Functions

It is important to design a k-different hash function for a bloom filter, but it can be challenging for large k values. A good hash function should be uniformly distributed and have little to no correlation with other hash functions. When using large m and k values, the independence constraint on hash functions can be relaxed with only a small increase in the false positive rate. Uniformity helps reduce the chance of collisions, thereby reducing the false positive rate.

We can use cryptographic hash functions such as SHA-256 or md5 for their stability and uniform distribution, but they are slower, which can become a bottleneck for large systems with high request volumes.

The best approach is to use non-cryptographic hash functions such as Murmur, FNV Series, Jenkins, and City hash functions. These hash functions are straightforward and provide uniformity, allowing for fast calculation of hashes.

Performance Consideration

Bloom filters have this unusual property that the time needed to insert an element and test the membership of an element is fixed. have O(k) complexity. Complexity is independent of the number of elements in the set. Hardware implementation of these bloom filters can provide excessively increased performance because k-lookups are independent and can be parallelized at the hardware level.

Also, bloom filters are very memory/space efficient because they maintain a fixed-size bit array independent of the number of elements in the set.

Optimizing Bloom Filter

If we assume that each hash function selects each array position with equal probability and there is an m-bit array with k-hash functions, we can calculate the False Positive Rate using the below formula:

Assuming we have a close approximation to the uniformity of hash functions, we have that the probability of false positives decreases as m increases and increases as n increases.

The optimal number of hash functions k for given m and n, that minimize the false positive rate probability is

By doing some mathematics, we can easily achieve the optimal number of hash functions and bit array length for a given false positive rate like the below:

This means that for a given false positive rate, the length of the boom filter array m is proportionate to the number of elements being filtered, and the number of hash functions only depends on the target false positive rate.

Predicting the value m and l for a given application:

Estimate the number of elements (n) your system is going to handle in the future.
Determine the false positive rate appropriate to your application. Be cautious when choosing a low false positive rate, it can increase the bit array size.
Calculate the bit array size using estimated n and false positive rate. Which can affect the space complexity of the filter.
Calculate the value k using a false positive rate. Which can affect the time complexity of the filter.

Example:

For 1,000,000,000 elements (n = 1000000000) and a 1% false positive rate (e = 0.01),

Bit array size (m) = 9585058377.37 = 8.9268 GB nearly

The number of hash functions (k) = 6.64 = 7 hash functions.

So, an 8.92 GB bit array with 7 hash functions is enough for maintaining a 1% false positive rate for a system with a billion elements.

Bloom Filter Applications

The famous content delivery provider Akamai Technologies uses a bloom filter to prevent ‘One-Hit Wonders’ from being stored in its disk caches. One hit wonder are web resources requested by the user just once. Something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using bloom filters to detect a second request for a web resource and caching that resource only prevents one-hit wonders from entering disk caches significantly.
The Google Chrome web browser previously used the bloom filter for its Safe Browsing feature to identify malicious URLs. The bloom filter helps quickly check if a URL is potentially harmful without sending the actual URL to Google servers.
The medium used bloom filters in its recommendation systems to avoid showing those articles that have already been seen by the user.
Google Big Table, HBase Apache Cassandra, PostgreSQL, and ScyllaDB use bloom filters to reduce the disk lookups for non-existent rows and columns by improving the query performance.
Bitcoin uses Bloom Filters to speed up wallet synchronization until privacy vulnerabilities with the implementation of the Bloom Filter were discovered.

Drawbacks of Bloom Filters

False positive: highlighted downside of the bloom filter while it can’t be avoided by nature. But it can be reduced using optimal parameters.
Need for multiple hash functions: bloom filter needs independent, uniformly distributed hash functions that are relatively easy to calculate to maintain the efficiency of hashing. Finding such hash functions could be a complex task.
Inflexibility: The size of the bit array is fixed. It cannot be shrunk or expanded during production. It must be designated during its development time. For the bloom filter to be successful, the amount of data that will be added to the system must be stated or made obvious in advance.
Only existing probability can be obtained from the filter, not the original data (nevertheless, this is not a requirement of a bloom filter).

Use Cases of Bloom Filters

Checking for email ID/username availability: for large-scale systems with billions of records, checking the availability of usernames in its distributed data storage systems would be impractical with thousands of requests per second. Bloom filters can permit the system to roughly estimate the individual ID’s availability.
Recommending New Content: Large content streaming systems like Medium store their content distributed across data centers around the world. So, when an algorithm proposes content for a user, it is impractical to crunch the database to check whether the user has already viewed that content before. Bloom filters can help at that point.
Cache Filtering: Content Delivery Networks deploy their web caches around the world to cache and serve web resources to users with greater performance and reliability. Analysis shows that nearly three-quarters of the URLs accessed from a typical web cache are “One Hit Wonders"—that are accessed by only one and never again. It’s wasteful of disk capacity to store one-hit wonders on a cache system. To prevent one-hit wonders from entering the web cache, we can deploy a bloom filter to keep track of the web URLs that are accessed by users. A web resource is cached only when it has been accessed at least once before, significantly reducing the write workload on the disks.
Ensuring the security level of a suspicious URL: If you’ve deployed a cloud security system that prevents you from viewing malicious URLs. This service could store an archive of billions of potentially hazardous URLs and process several millions of requests per minute. In this situation, looking for a database or cache is impossible. Bloom filter can be used to probabilistically check the URL’s security.

Counting Bloom Filter

One significant drawback of the simple bloom filter is that once we add an element to the set, we can’t remove it from it. Either we have to rebuild the entire bloom filter from scratch or we have to acknowledge the element persistent in the set forever. Counting bloom filter is an extension for the bloom filter which provides a way to implement a delete operation on a bloom filter without creating a bloom filter fresh.

In a counting bloom filter, single-bit array positions are extended to multi-bit array positions. The insertion operation is extended to increment the value of each array position the membership check process checks that each of the hashed positions is non-zero. The delete operation consists of decrementing the value of each hashed array position.

Arithmetic overflow of array positions is a problem and the array position should be large enough to make this rare. Because using a multi-bit array counting bloom filter consumes more space than a simple bloom filter.

DEV Community

Scaling with Probabilistic Techniques in System Design: Bloom Filters

How Bloom Filter Works

Example of how bloom filter works

Choosing Good Hash Functions

Performance Consideration

Optimizing Bloom Filter

Example:

Bloom Filter Applications

Drawbacks of Bloom Filters

Use Cases of Bloom Filters

Counting Bloom Filter

Top comments (0)

Read next

Why Use Nix package manager, Even on macOS?

Data Structures: Creating Custom Node Classes

DSA: Graph Theory Questions for Interviews and Practice

Minimum Spanning Trees Breaking the Norm