Fast fuzzy searching of a string in a list

#programming #javascript #c

Original post: https://siderite.dev/blog/fast-fuzzy-searching-of-string-in-list
Sift4 algorithm: http://siderite.dev/blog/super-fast-and-accurate-string-distance.html

I hadn't worked on the Sift string distance algorithm for a while, but I was reminded of it when someone wanted to use it to suggest corrections to user input, something like Google's "Did you mean...?" or an autocomplete application. It got me thinking about ways to use Sift for bulk searching. I am still thinking about it, but in the meantime this can be achieved using the Sift4 algorithm, with up to a 40% improvement in speed over the naïve comparison with each item in the list.

While testing this solution, I realized that the maxDistance parameter did not work correctly. I apologize. The code is now fixed in the algorithm's blog post, so go and get it.

So what is this solution for mass search? We can use two pieces of knowledge about the problem space:

the minimum possible distance between two strings of lengths l1 and l2 will always be abs(l1-l2)
- the intuition behind it is easy to understand: one cannot generate a string of size 5 from a string of size 3 without adding at least two new letters, so the minimum distance would be 2
as we advance through the list of strings, we have a best distance value that we keep updating
- this maps very well onto the maxDistance option of Sift4
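
To make the first point concrete, here is a minimal sketch of the length-based lower bound used as a filter. The helper names minPossibleDistance and candidatesWithin are hypothetical, introduced only for illustration; they are not part of Sift4.

```javascript
// Lower bound on the distance between two strings, from their lengths alone.
function minPossibleDistance(a, b) {
    return Math.abs(a.length - b.length);
}

// Hypothetical helper: keep only the words that could still match
// at a distance of bestDistance or less.
function candidatesWithin(search, words, bestDistance) {
    return words.filter(word => minPossibleDistance(search, word) <= bestDistance);
}
```

For example, with a best distance of 1 and the search string "cat", a word like "category" can be discarded without computing any real distance, because its length alone puts it at least 5 edits away.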

Thus armed, we can find the best matches for our string from a list using the following steps:

set a bestDistance variable to a very large value
set a matches variable to an empty list
for each of the strings in the list:
1. compare the minimum distance between the search string and the string in the list (abs(l1-l2)) to bestDistance
  1. if the minimum distance is larger than bestDistance, ignore the string and move to the next
2. use Sift4 to get the distance between the search string and the string in the list, using bestDistance as the maxDistance parameter
  1. if the algorithm reaches a temporary distance that is larger than bestDistance, it will break early and report the temporary distance, which we will ignore
3. if distance<bestDistance, then clear the matches list and add the string to it, updating bestDistance to distance
4. if distance=bestDistance, then add the string to the list of matches

When using the common Sift4 version, which doesn't compute transpositions, the list of matches is retrieved 40% faster on average than by simply searching through the list of strings and updating the distance (about 15% faster with transpositions). Considering that Sift4 is already a lot faster than Levenshtein, this method allows searching through hundreds of thousands of strings really fast. The time gained can be used to further refine the matches list using a slower but more precise algorithm, like Levenshtein, on a much smaller set of possible matches.
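
The refinement step could look something like the sketch below. It assumes the short list of matches has already been produced by the Sift4 pass; levenshtein is a textbook two-row implementation, and refineMatches is a hypothetical helper name, not something from the Sift4 code.

```javascript
// Textbook Levenshtein distance, keeping only two rows of the matrix in memory.
function levenshtein(a, b) {
    if (a === b) return 0;
    if (a.length === 0) return b.length;
    if (b.length === 0) return a.length;
    let prev = new Array(b.length + 1);
    let curr = new Array(b.length + 1);
    for (let j = 0; j <= b.length; j++) prev[j] = j; // distance from the empty prefix
    for (let i = 1; i <= a.length; i++) {
        curr[0] = i;
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(
                prev[j] + 1,       // deletion
                curr[j - 1] + 1,   // insertion
                prev[j - 1] + cost // substitution or match
            );
        }
        [prev, curr] = [curr, prev]; // reuse the two rows
    }
    return prev[b.length];
}

// Hypothetical helper: re-rank the (small) list of matches by Levenshtein distance.
function refineMatches(search, matches) {
    return matches
        .map(word => ({ word, distance: levenshtein(search, word) }))
        .sort((x, y) => x.distance - y.distance);
}
```

Because refineMatches only runs on the handful of strings that survived the first pass, the quadratic cost of Levenshtein should matter much less than it would over the whole list.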

Here is a sample written in JavaScript, where we search for a random string in a list of English words:

const search = getRandomString(); // this is the search string
let matches=[]; // the list of found matches
let bestDistance=1000000; // the smallest distance to our search string found so far
const maxOffset=5; // a common value for searching similar strings
const l = search.length; // the length of the search string
for (const word of english) { // english is the list of words to search
    const minDist=Math.abs(l-word.length); // minimum possible distance
    if (minDist>bestDistance) continue; // if too large, skip this word
    const dist=sift4(search,word,maxOffset,bestDistance);
    if (dist<bestDistance) {
        matches = [word]; // new array with a single item
        bestDistance=dist;
        if (bestDistance==0) break; // if an exact match, we can exit (optional)
    } else if (dist==bestDistance) {
        matches.push(word); // add the match to the list
    }
}

There are further optimizations that can be added, beyond the scope of this post:

words can be grouped by length and the minimum distance check can be done on entire buckets of strings of the same lengths
words can be sorted, and when a string is rejected as a match, all strings with the same prefix can be rejected as well
- this requires an update of the Sift algorithm to return the offset at which it stopped (to which the maxOffset must be added)
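
As a rough sketch of the first idea only (grouping by length), something like the following could work. The names buildLengthBuckets and searchBuckets are hypothetical, and distanceFn(search, word, maxDistance) is an assumed signature standing in for a call such as sift4(search, word, maxOffset, maxDistance).

```javascript
// Group the words by length once, ahead of any search.
function buildLengthBuckets(words) {
    const buckets = new Map();
    for (const word of words) {
        const bucket = buckets.get(word.length);
        if (bucket) bucket.push(word);
        else buckets.set(word.length, [word]);
    }
    return buckets;
}

// distanceFn(search, word, maxDistance) is an assumed signature, not the real sift4 one.
function searchBuckets(search, buckets, distanceFn) {
    let matches = [];
    let bestDistance = Infinity;
    // Visit the buckets closest in length first, so bestDistance shrinks early.
    const lengths = [...buckets.keys()].sort(
        (a, b) => Math.abs(a - search.length) - Math.abs(b - search.length)
    );
    for (const len of lengths) {
        // Lengths are sorted by difference, so once one bucket is too far,
        // every remaining bucket is too far as well.
        if (Math.abs(len - search.length) > bestDistance) break;
        for (const word of buckets.get(len)) {
            const dist = distanceFn(search, word, bestDistance);
            if (dist < bestDistance) {
                matches = [word]; // strictly better: start a new list
                bestDistance = dist;
            } else if (dist === bestDistance) {
                matches.push(word); // equally good: keep it too
            }
        }
    }
    return { matches, bestDistance };
}
```

The minimum distance check now happens once per bucket instead of once per word, and visiting the nearest lengths first tends to lower bestDistance sooner, which lets more of the remaining buckets be skipped entirely.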

I am still thinking of performance improvements. The transposition table gives more control over the precision of the search, but it's rather inefficient and resource-consuming, not to mention that it adds code complexity and makes the algorithm harder to read. If I can't find a way to simplify and speed up the use of transpositions, I might give up on the concept entirely. Also, some sort of data structure could be created, regardless of how much time and space it requires, assuming that the list of strings to search is large and constant and that the number of searches will be very big.

Let me know what you think in the comments!