Akhil

Posted on Jul 5, 2020 • Edited on Jul 9, 2020

The algorithm behind Ctrl + F.

#webdev #javascript #computerscience #career

Ctrl + F on chrome opens up a search box that is used to find text on a web page, pdf, etc. It's one of the fastest ones I have seen and decided to dig deeper into what's going on.

So let's go on a journey of implementing a fast string matching algorithm.

Note: The algorithm which we will implement might be similar to the one used in Chrome, but since its Google we're talking about, they might've made optimizations

You might be wondering why do we need an algorithm when we have regular expression which does the same?

Yes we have regular expressions at our disposal but regular expressions are slow when we task it with finding patterns on large data, regular expression is awesome when we task it with finding a "dynamic patter" like all 10 digit phone numbers starting with +91, but in this case, we want to find one particular string.

If you want to know more Read here

This leaves us the only option of implementing a pattern matcher. Let's start with basic we can think about. We're given a document containing millions of words and we want to find one word, how shall we approach this? It's like finding a needle in a haystack.

Naive Approach

The first idea that we think of is comparing pattern and string character by character :

Implementation :



let string = "ATAATTACCAACATC";
let pattern = "ATC";
let position = [];
let found = true;
for(let i=0;i<string.length;i++){
  found = true;
  for(let j=0;j<pattern.length;j++){
    if(string[i+j] != pattern[j]){
      found = false;
      break;
    }
  }
  if(found){
    position.push(i);
  }
}

console.log(position);

But this performs in O(nm) time complexity, which is very slow.

How to optimize it?

For each string, if it doesn't match, we move by one character. How about skipping the whole word?

In this case, instead of starting all over again, we skip the string when it mismatches.

In the previous approach, we compared string nearly 45 times, here we compared string only 15 times which is a huge leap.

Here we can perform an optimization, instead of comparing from the front, how about comparing from the end?

In this case, we compared the string just 9 times, which is nearly half of the previous case.

But as you might've guessed this has a huge flaw of, what if the end characters match but starting character mismatch.

So we need a concrete algorithm that will skip characters such that overall character comparison decreases.

What other options do we have?

One thing which we could do is instead of moving the entire pattern, we move a part of the pattern.

We match each character between mismatched string and pattern, then we check if we have any common characters, if we do then we move only part of those characters.

In this case, we did 12 comparison operations and this will work if compare string and pattern from either side.

This algorithm is called the Boyer Moore Pattern Matching algorithm.

Implementation of Boyer Moore Pattern Matching algorithm

This is a modified version of the original algorithm, the original algorithm found only the first instance of the pattern, here we're finding all the occurrences of the pattern.

Step 1> create an empty map of size 256 (because 256 ASCII characters) and set to -1.



let string = "ATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATC";
let pattern = "ATC";

let M = pattern.length;
let N = string.length;
let skip;                            //to determine substring skip
let res = [];                        //to store result

let map = new Array(256);            //array of 256 length

Step 2> Map character to its index in the pattern.



for(let c = 0;c<256;c++){
  map[c] = -1;                       //initialize to -1
}

for(let j=0;j<M;j++){
  map[pattern[j]] = j;               //initialize to the it's index in pattern
}

Step 3> Loop over the string, notice that in the for loop, instead of "i++", we're using i+= skip, ie skip that part of the string.



for(let i=0;i<=N-M;i+=skip)

Step 4> Set skip to 0 during each iteration, this is important.



for(let i=0;i<=N-M;i+=skip){
  skip=0;
}

Step 5> Match pattern with string.



for(let i=0;i<=N-M;i+=skip){
  skip=0;
  for(let j = M-1;j>=0;j--){

    if(pattern[j] != string[i+j]){
      skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]);
      break;
    }
  }
}

Step 6> If there's a mismatch, find the length that must be skipped, here we perform



   skip = Math.max(1,j-map[string[i+j]]);

In some cases like eg : "ACC" and "ATC", in these cases the last character's match but rest do not.
Logically we must go back and match first "C" of the string with "C" of pattern, but doing so will mean that we are going back which we logically shouldn't or else we will be stuck in an infinite loop going back and forth.
To ensure that we keep on going forward with the matching process, we ensure that whenever we come across situations when there's a negative skip, we set skip to 1.

Step 7> If the skip is 0, ie there we no mismatch, add "i" to the result list.



if(skip == 0){
    console.log(skip)
    res.push(i);
    skip++;
  }

Combining them all :



let string = "ATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATC";
let pattern = "ATC";

let M = pattern.length;
let N = string.length;
let skip;
let res = [];

let map = new Array(256);

for(let c = 0;c<256;c++){
  map[c] = -1;
}

for(let j=0;j<M;j++){
  map[pattern[j]] = j;
}

for(let i=0;i<=N-M;i+=skip){
  skip=0;
  for(let j = M-1;j>=0;j--){

    if(pattern[j] != string[i+j]){
      skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]));
      break;
    }
  }
  if(skip == 0){
    res.push(i);
    skip++;
  }
}

console.log(res);

That's it! That's how Boyer Moore's pattern matching works.

There are many other Pattern Matching algorithms like Knuth Morris Pratt and Rabin Karp but these have their own use cases.

I found this on StackOverflow you can read it here but in a nutshell:

Boyer Moore : Takes O(m) space, O(mn) worst case ,best case Ω(m/n). preforms 25% better on dictionary words and long words. Pratcial usecase includes implementation of grep in GNU for string matching, chrome probably uses it for string search.

Knuth Morris Pratt: Takes O(m) space, O(m+n) worst case, works better on DNA sequences.

Rabin Karp: Use O(1) auxiliary space, this performs better under searching for long words in a document containing many long words (see StackOverflow link for more).

I hope you liked my explanation. I usually write about how to solve interview questions and real-life applications of algorithms.

If I messed up somewhere or explained something wrongly, please comment down below.

Thanks for reading! :)

github:https://github.com/AKHILP96/Data-Structures-and-Algorithms/blob/master/Algorithm/boyermoore.js

PS: I am looking for job, if you want someone who knows's how to design UI/UX while keeping development in mind, hit me up :) thanks!

Oldest comments (49)

Uriel Bitton • Jul 5 '20

very interesting!

Akhil • Jul 5 '20

Thanks for reading :)

Deepak Ahuja 👨‍💻 • Jul 5 '20

Amazing post! Thanks

Akhil • Jul 5 '20

Thanks for reading :)

Akhil • Jul 5 '20

Thanks for reading :)

Dibyamohan Acharya • Jul 5 '20

Nice one

Akhil • Jul 5 '20

Thanks for reading :)

Ashwin Sharma P • Jul 5 '20

Good one and very informative.

Akhil • Jul 5 '20

Thanks for reading :)

Naveen Yalla • Jul 5 '20

Good read

Akhil • Jul 6 '20

Thanks for reading :)

Fiqri Syah Redha • Jul 5 '20

This is interesting. Thanks for sharing.

Akhil • Jul 6 '20

Thanks alot for reading :)

Ben Halpern • Jul 6 '20

Very cool!

Akhil • Jul 6 '20

OMG !! Thanks for reading 🙏🙏🙏

Ben Halpern • Jul 6 '20

🤓

John Pham • Jul 6 '20 • Edited

Great write up on the different string matching algorithms!

The algorithm which we will implement might be similar to the one used in Chrome, but since its Google we're talking about, they might've made optimizations

The source code for the find in page is actually open source! You can see how it's implemented here: source.chromium.org/chromium/chrom...

Actual implementation: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/strings/string-search.h;l=281;drc=e3355a4a33909a48ebb8614048d90cffc67d287e?q=string%20search&ss=chromium&originalUrl=https:%2F%2Fcs.chromium.org%2F

Looks like they swap in different algorithms based on the context.

Akhil • Jul 6 '20

OMG ! thanks for sharing :). I read on StackOverflow that somewhere around 2008-2010 when chrome was picking of, they shared how they've implemented a version of Boyer Moore's pattern matching algorithm. But I couldn't find it on youtube.

Akhil • Jul 6 '20

Yep, that's the beauty of the algorithm since we will store the last index, the "skip" will be large. Since the skip is large if the characters from the end do not match, we will skip even more characters bringing down the overall number of comparisons.

View full discussion (49 comments)