How Bloom Filters Help Tech Giants Manage Massive Caches

#systemdesign

Ever wondered how tech giants like Facebook, Google, and Akamai, manage their massive caches?

With millions of users and countless searches, how do they figure out what to store in their cache and what not to store in their cache?

How do they quickly find out if an item is a frequently searched item or not?

This is where bloom filters come into play.

What are Bloom filters?

According to the definition, "A Bloom filter is a space-efficient probabilistic data structure used to test set membership."

In layman's terms, a Bloom filter is a clever data structure that allows you to check if an item is present or not present in a set.

Bloom Filters use several independent hash functions and an array of fixed size containing only 0s and 1s

[0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

How does it work?

Let's take an example to understand how it works.

For this example, let's assume that there are 3 hash functions and the length of the bit array is 10.

Suppose a user comes on Facebook and searches for a word, let's say "Party".
Now Facebook first needs to figure out if search results for the word "Party" are in their cache or not. So Facebook consults its Bloom filter.
The word "Party" is passed through all the independent hash functions resulting in indices in the bit array.

h1("Party") = 2
h2("Party") = 5
h3("Party") = 1

Next we examine the bits at these positions in the array. Suppose our bit array currently looks like this:

array: [0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
index:  0, 1, 2, 3, 4, 5, 6, 7, 8, 9

The bits at 2, 5, 1 are 1, 0, 1 respectively.

Since not all the bits are set (bit at index 5 is 0), we can be certain the "Party" is not present in cache.
If all the bits were set to 1, it would indicate that "Party" might be present in the cache.

Bloom filters doesn't guarantee that the item will be present in the data set, but it guarantees that the item will not be present in the data set.

The example above demonstrates how to check if an item is present in the cache, but how can we determine whether a searched item is frequently requested?

How Do Bloom Filters Help Manage and Update the Cache?

Let’s build on our earlier example:

Suppose the word "Party" was not found in the cache because the bit at index 5 was unset. When we set this bit, the next search for "Party" will indicate a possible presence in the cache, allowing for a quick lookup.

However, when millions of users are making thousands of requests per second, this approach runs into a problem. If we set bits for every search, the Bloom filter’s bit array will quickly fill up with 1s, reducing its effectiveness and causing frequent false positives.

To address this, systems often use multiple Bloom filters (or layers of bit arrays). Here’s how it works:

Bloom filter 1: [0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
Bloom filter 2: [0, 1, 0, 1, 1, 0, 1, 1, 0, 0]
.
.
.
Bloom filter n: [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]

Each time "Party" is searched, we update a different layer or bit array.
For example, "Party" might appear present in the first Bloom filter, but absent in the second. This indicates it has been searched before, but not often enough to be considered “frequently searched.”
Only when "Party" is found to be present in all layers do we treat it as a frequently accessed item and consider storing or updating it in the cache.

This layered approach helps distinguish between items that are searched occasionally and those that are truly popular, ensuring that the cache is used efficiently and only for items with sustained demand.

False Positives

A false positive occurs when the Bloom filter suggests that an item might be present in the cache, but in reality, it isn’t. As discussed earlier, Bloom filters doesn't guarantee that the item will be present in the data set, but it guarantees that the item will not be present in the data set.

This happens because multiple different items can set the same bits in the bit array. So, when you query for an item and all its corresponding bits are set to 1, it’s possible those bits were set by other items, not the one you’re searching for.

Why is this acceptable?

This tradeoff is fundamental to Bloom filters. In many system design scenarios, it’s acceptable to occasionally perform an unnecessary cache lookup (due to a false positive), as this is far less costly than searching the entire database for every request. The Bloom filter helps minimize expensive database operations, even if it sometimes leads to extra cache lookups.

Final thoughts

Bloom filters might seem simple, but their ability to deliver fast, space efficient, and probabilistic checks makes them incredibly powerful for large scale systems. From reducing unnecessary database lookups to optimizing cache usage, they quietly power some of the biggest tech infrastructures behind the scenes.

Hi, I’m Samit, a software developer and freelancer passionate about building real world projects. If you’re looking to collaborate or just want to say hi, check out my portfolio. Let’s connect!