Vinit Jogani

The Top Trick to Scalable Search Algorithms

Aside: Okay, I agree -- that title is a bit clickbait, but it's not clickbait if it's (mostly) true... right? At least I did not call it a "secret" trick to build up the suspense for no reason... he says, as he only filibusters further away from the point. But enough about that, let's jump into it.

I'm going to come right out with this one -- I think, by far, the most powerful weapon in scaling search algorithms is hashing. When I talk about search algorithms here, by the way, I mean something more general than search engines -- most complex algorithms that use Big Data are actually search algorithms. You may want to find similar customers (lookalike modeling) or compare certain parameters of products or recommend books to readers. It's all search! OK -- back to hashing.

Hashing is so commonly used that it's almost underrated. Every computer scientist has likely studied the nitty-gritty properties of hashing every which way in their data structures class, but apart from its applications in hash tables, I don't think many people realize all the powerful things hashing is capable of.

Brief Background

(feel free to skip if you already know what hashing is)

For the uninitiated, hashing is a technique that takes an input of any size and converts it into a fixed-size output using a mathematical function. This output, called a hash, can be used to represent the original input in a more concise and efficient manner. Hashing is widely used in computer science for various purposes, including data storage and retrieval, security, and identification.

As mentioned earlier, perhaps the most common use of hashing is in hash tables. The goal is to efficiently store mappings of keys to values such that search/read/write is near constant time (for a sufficiently large hash table). Hashing lets us map any arbitrary key into one of a fixed number of K buckets in a pseudo-random way, which makes this problem possible and probabilistically efficient.
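
As a tiny illustration (in Python, using the built-in hash and a made-up bucket count), this is essentially what a hash table does under the hood:

def bucket_for(key, num_buckets=16):
    # Map an arbitrary key to one of K buckets pseudo-randomly;
    # equal keys always land in the same bucket.
    return hash(key) % num_buckets

print(bucket_for("alice"), bucket_for("bob"), bucket_for("alice"))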

There are a lot of other common applications of hashing, too. You may hash passwords so that the plaintext passwords, if leaked, are not as useful to attackers. You may also use hashes to verify the integrity of a file: if you download a file, you can hash it and check that the hash exactly matches the one advertised by the author, ensuring it has not been corrupted or tampered with.
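
For instance, a minimal integrity check might look like this (the file name and expected digest below are placeholders for illustration):

import hashlib

# Hash the downloaded file and compare against the checksum the author publishes.
expected = "...advertised sha256 digest..."
with open("release.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("intact" if digest == expected else "corrupted or tampered with")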

That's all cool, but I want to specifically discuss the more powerful applications of hashing within the context of writing scalable algorithms for difficult problems.

A Simple Start

Suppose you have N files with M bytes each, and you are tasked with finding duplicates within these files. Here is a very naive algorithm (in Python):

from itertools import combinations

def find_duplicates_naive(paths):
    # Compare every pair of files directly: O(M) work per pair, O(M * N^2) overall.
    for a, b in combinations(paths, 2):
        with open(a, "rb") as fa, open(b, "rb") as fb:
            if fa.read() == fb.read():
                print(a, b)

It is easy to see that this algorithm will take roughly O(MN^2) steps because, for each of the O(N^2) pairs of files, we may need to compare up to M bytes. When N or M is large, this quickly becomes intractable.

However, with hashing, you can do much better:

import hashlib

def find_duplicates_hashed(paths):
    # Hash each file once, then compare fixed-size hashes: O(M * N) overall.
    hash_table = {}
    for path in paths:
        hasher = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                hasher.update(chunk)  # fold each chunk of bytes into the running hash
        digest = hasher.hexdigest()
        if digest in hash_table:
            print(path, hash_table[digest])  # likely duplicates; confirm byte-for-byte
        else:
            hash_table[digest] = path

Instead of comparing each pair of files byte by byte, we first compute the hash of each file and then compare the hashes instead. If two files have the same hash, there is a good chance they are duplicates, and we can then perform a more expensive exact comparison to confirm. Assuming duplicates are only a small fraction of the files, this algorithm takes O(MN) steps, which is a significant improvement over the previous O(MN^2) algorithm.

While this problem is a bit contrived, many real problems share this structure -- say you want to find all products that have the same values for a bunch of features. Instead of comparing every feature for each pair of products, you can compare just the hashes, and that will likely be a good enough approximation. You can also reuse these hashes over time, so this is great for online algorithms as well.

In fact, one reason this is especially powerful is that it fits very nicely into the MapReduce structure, which we can heavily parallelize across many cores/machines. When you need to compare pairs of files directly, every machine in your cluster needs access to every file. Hashing, however, is an independent step that can be computed for all files in parallel across many machines. Then you can map each file name to its hash and use MapReduce to find the collisions. Generally, hashing improves efficiency on a single core/machine, but its powers multiply when you combine it with Big Data tools like Spark.
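
For instance, a rough PySpark sketch of that pipeline might look like this (the file paths and Spark setup are placeholders, not a production configuration):

import hashlib
from pyspark import SparkContext

def file_hash(path):
    # Digest of a file's bytes, same idea as find_duplicates_hashed above.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

sc = SparkContext("local[*]", "find-duplicates")
paths = ["a.bin", "b.bin", "c.bin"]           # placeholder file paths

duplicates = (
    sc.parallelize(paths)                     # distribute the file paths
      .map(lambda p: (file_hash(p), p))       # map: hash each file independently
      .groupByKey()                           # shuffle: files with equal hashes meet
      .mapValues(list)
      .filter(lambda kv: len(kv[1]) > 1)      # keep hashes shared by 2+ files
)
print(duplicates.collect())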

Fuzzy Matching?!

So exact matching of data is an obvious application of hashing which we can scale tremendously with parallel computing. However, what about more fuzzy matching?

The interesting problem here is comparing sets. Say you have many sets of items, and you want to find all pairs of sets A and B that are similar, i.e. have a high Jaccard similarity, which is just |A ∩ B| / |A ∪ B|. Hashing the entire set to a single number won't help us here because that is only good enough for exact matching.
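
For reference, the exact version is trivial on Python sets; the problem is doing it for every pair of large sets:

def jaccard(a, b):
    # Exact Jaccard similarity: size of intersection over size of union.
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.6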

One solution here is an algorithm called Min-Hashing. In this algorithm, each item in a set is hashed to a pseudo-random number, and the smallest hash value across the set's items is chosen as the hash value for that set. Typically we use K independent hash functions, giving each set a signature of K values, and then estimate the similarity using element-wise comparisons of these signatures. It can be shown that the fraction of matching positions is a very good approximation of the Jaccard similarity of the original sets, without having to do an intersection-over-union operation over the possibly much larger original sets. Not only is the signature a fixed-size vector of K values, we also only need element-wise comparisons, i.e. comparing the i-th hash value of two sets rather than all pairs of hash values.
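
Here is a minimal sketch of that idea in Python (the random affine hash functions and the parameters below are just one common choice, not the only one):

import random

PRIME = (1 << 61) - 1  # a large prime for the affine hash functions

def make_hash_functions(k, seed=42):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]

def minhash_signature(items, hash_funcs):
    # For each hash function, keep the minimum hash value over the set's items.
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of positions where two signatures agree approximates
    # the Jaccard similarity of the original sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

funcs = make_hash_functions(k=256)
a = {"the", "quick", "brown", "fox", "jumps"}
b = {"the", "quick", "brown", "dog", "jumps"}
print(estimate_jaccard(minhash_signature(a, funcs), minhash_signature(b, funcs)))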

Let's say each set has M items and we pick K to be much smaller than M (usually, K can be very small and still give decent results). If intersections and unions are implemented with hash tables, each takes O(M) steps, and we need to do this for all pairs of sets, so again O(MN^2). With Min-Hashing, we can do this in O(KN^2), which is better but, let's face it, not that great since we still have that N^2 term.

Well, need I remind you -- we still have the parallelism of the cloud. We can compute these K hash values in parallel and then use MapReduce to emit each set under each of its K hash values, giving a highly parallel implementation of the above. The idea is that we may not need to compare all N^2 pairs at all: if we bucket sets by each of their K hash values, we only need to compare sets that share at least one hash value, which may be a much smaller number of pairs in practice.
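
A sketch of that bucketing step (single-machine Python here for clarity, though each stage maps directly onto a MapReduce job):

from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures):
    # signatures: dict of set_id -> minhash signature (list of K values).
    # Emit each set under (position, hash value); only sets that collide in
    # at least one position ever need an exact comparison.
    buckets = defaultdict(list)
    for set_id, sig in signatures.items():
        for i, h in enumerate(sig):
            buckets[(i, h)].append(set_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs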

The recurring theme here is that hashing allows us to make very good approximations -- generally with Big Data, it's hard to get scalability and precision at the same time. In the vast majority of cases, though, the small loss in precision doesn't matter!

A very similar problem is finding near-duplicate documents, for example in Search at Google. If we have a large corpus of documents, we may want to find all documents that are almost, but not exactly, identical. This is a difficult problem to solve with traditional algorithms, but with Min-Hashing we get very efficient implementations (if we treat documents as sets of words or terms). In general, many problems can be framed as decomposing larger entities into sets of objects, e.g. a video is a set of images, a URL is a set of path segments, etc.
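
Treating a document as a set is as simple as "shingling" it into word n-grams, which can then be fed into the Min-Hashing sketch above (n=3 here is an arbitrary choice):

def shingles(text, n=3):
    # Represent a document as the set of its word n-grams ("shingles").
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

print(shingles("the quick brown fox jumps over the lazy dog"))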

There is another similar algorithm worth pointing out: Sim-Hashing. It solves a very similar problem but operates on bit-level representations of the inputs. Say you have two images where one has some noise added to it -- sim-hash can give you a fingerprint that is more robust to that noise.
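
A bare-bones sim-hash might look like this (using md5 purely as a convenient bit source; real implementations vary):

import hashlib

def simhash(features, bits=64):
    # Each feature votes +1/-1 on every bit of the fingerprint; the sign of the
    # total decides that bit. Similar inputs end up differing in only a few bits.
    counts = [0] * bits
    for feature in features:
        h = int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def hamming_distance(a, b):
    # Near-duplicates have fingerprints within a small Hamming distance.
    return bin(a ^ b).count("1")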

Adding Context!!

So far we have mostly talked about matching categorical data, where the exact values carry no contextual meaning. Real numbers, on the other hand, do carry semantic meaning in their relative distances: for example, 4 and 4.1 are more similar than 4 and 400. In some fuzzy matching algorithms, we would like our hashing to respect this additional context.

Enter, locality sensitive hashing (LSH). This is a technique for efficiently finding approximate nearest neighbors in high-dimensional data. It does this by hashing the input items into a series of buckets, such that similar items (e.g. close-by numbers) are more likely to be hashed to the same bucket. One of the key advantages of LSH is that it is highly scalable, making it well-suited to working with large datasets. By using randomization and hashing, LSH can find approximate nearest neighbors much more quickly than traditional methods, which often involve calculating distances between every pair of items in the dataset. This can greatly improve the performance of algorithms such as recommendation systems and anomaly detection.
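
One classic way to build such a hash for real-valued vectors is with random projections; here is a rough sketch (parameters like bucket_width and the example vectors are placeholders you would tune):

import numpy as np

def make_lsh_key(dim, num_hashes=8, bucket_width=1.0, seed=0):
    # Random-projection LSH: project onto random directions and quantize.
    # Nearby vectors tend to fall into the same tuple of buckets.
    rng = np.random.default_rng(seed)
    projections = rng.normal(size=(num_hashes, dim))
    offsets = rng.uniform(0, bucket_width, size=num_hashes)

    def lsh_key(x):
        return tuple(np.floor((projections @ x + offsets) / bucket_width).astype(int))

    return lsh_key

# Group items by bucket key; only items that share a key need exact comparison.
lsh_key = make_lsh_key(dim=3, bucket_width=2.0)
print(lsh_key(np.array([3.5, 3.5, 4.0])), lsh_key(np.array([3.4, 3.4, 3.9])))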

There is actually a very interesting paper by Google about Grale which uses LSH for graph learning -- the idea is to learn a graph by automatically building edges between similar items in a scalable way such that we don't need to compute pairwise distances.

I actually had to use a similar algorithm recently: I had a long list of products, each with several numeric parameters, say length, width, and height. I wanted to group all products with similar dimensions, and doing pairwise matching on every query was just too inefficient. Using LSH, I was able to significantly cut down the number of comparisons I actually had to make.

Conclusion

So hopefully I have conveyed how much I love hashing-derived algorithms and the breadth of their applications. As I have mentioned a few times, they are especially useful when approximate solutions are good enough and you can parallelize your operations across a cluster -- both of which are likely true for Big Data applications.

I have personally used hashing tricks in many diverse real-world projects to great effect. I haven't even covered all the non-search, non-scalability related applications of hashing like cryptography. As a computer scientist, I think this is one of the most useful tools to have in your pocket given how versatile and powerful it can be.
