How X Checks Your Username in Milliseconds: A Gentle Guide to Bloom Filters

#webdev #programming #database #backend

You must have tried searching for your username while creating your X account. Ever wondered how you get such a quick confirmation of whether an account with that username exists or not?

Well, to be honest, even if it says that an account exists with that username, there might be no account with that username in the database.

Surprised? Let's learn about this amazing probabilistic data structure called Bloom Filters.

You see when you have usernames of 2B+ users, none of these approaches(to find out if the username exists already) is feasible:
❌ 1 database query for each lookup
❌ Maintaining a hash table of usernames in memory

The first approach of 1 database query for each lookup is pretty easy to understand as it brings with it a huge number of database queries and huge performance overhead.
The second approach of maintaining a hash table of usernames in memory sounds like it would do the job and it also probably will. However, the problem with this approach is the amount of RAM that would be required to store just the usernames of 2B+ users. Mathematically, this would require roughly 30 GB RAM. Hence, this approach too is not economically viable.

The solution lies in this new data structure that we began this thread with: Bloom Filters.

Bloom filters is simply an array of bits. Each cell of the array can contain either 0 or 1. The length of this array is different based on the number of usernames that must be handled. For simplicity, let's assume we have an array of length 10. Each element of array currently has 0 in them.

Also, we have hash functions, the number of which is again depending upon the number of usernames that is being handled. In our case for simplicity, we consider having just 2 hash functions, h1 and h2.
Mind you, a hash function is one that gives a different output for different input string. In our case, we expect each of these function to output a number from 0 to n-1 (where, n = length of the array).

Let's understand how this becomes helpful in our usecase.

For example, we are taking 4 usernames(in the same sequence) that we need to create accounts with:

prince
ankit
ayush
prince
mona

Now, let's understand what happens case-wise:

prince
'prince' gets passed through hash functions h1 and h2, and gives output 4 and 7 respectively. Now, next step is to turn the value stored in these indices in the array to 1. So now, the array has 0 in all its elements, except at indices 4 and 7 where it has 1.
ankit
'ankit' too gets passed to h1 and h2, and hashes received are 3 and 1 respectively. Again, we put 1 at those indices in the array.
Now, our array has 0 at all indices except, at indices 1, 3, 4 and 7.
ayush
'ayush' on getting passed to h1 and h2 gives hashes 9 and 5.

Now, the array has 1 at all indices except at 0, 2, 6, 8.

Hence, the updated array after adding 3 new users to our database now becomes as follows:
[ 0, 1, 0, 1, 1, 1, 0, 1, 0, 1 ]

Alright, now how do we check if a username exists the database after doing all these crazy stuff?

That is why I took a fourth username i.e. "prince", which already exists in our database. Again, we pass this string through h1 and h2, and we get the same hashes i.e. 4 and 7. This time we check that the indices values are already 1. This means that these indices were converted to 1 by some string earlier which was the same as "prince". Hence, we tell the user that this username already exists.

Now the 5th case i.e. "mona":
Let's see the h1 and h2 for this string gives indices 0 and 9 respectively. Here, we see that the index 0 in the array has value 0 while index 9 has the value 1. In this case too, the username will be marked as 'taken' because the value at one of the indices is 1.

You get it now? This is why you might get 'username taken' warning raised even when there is no such username in the database from before. We call this case of falsely marking the username taken as 'false positive'.

Now why do we use this data structure when it may give false positives? There are two benefits:

It never gives false negatives because everytime it marks a username as taken, it can be(for sure) taken when both bits in the array are 1, or partially mark a username taken when one of the bits is 1.
It is much efficient to be stored in the RAM since you are literally storing just an array of 'bits'.
It is practically not a big deal to have a small % of false positives while selecting newer usernames because the user would simply just try a newer username when prompted that their already exist.

Mathematically, we can even decrease the rate of false positives by adjusting the parameters:
m = length of array
k = number of hash functions
n = the number of values(i.e. usernames) we are expecting to handle

I share below the screenshot of formula for same, in case you're interested.

To put things in perspective, earlier we need almost 30 GB of RAM for handling 2B+ usernames with quick lookups, now using Bloom Filters we only would be needing about 0.5 GB RAM for the same usecase, without compromising on the quick lookups.

DEV Community

How X Checks Your Username in Milliseconds: A Gentle Guide to Bloom Filters

Top comments (0)