DEV Community

Akhil
Akhil

Posted on • Edited on

The algorithm behind Ctrl + F.

Ctrl + F on chrome opens up a search box that is used to find text on a web page, pdf, etc. It's one of the fastest ones I have seen and decided to dig deeper into what's going on.

So let's go on a journey of implementing a fast string matching algorithm.

Note: The algorithm which we will implement might be similar to the one used in Chrome, but since its Google we're talking about, they might've made optimizations

You might be wondering why do we need an algorithm when we have regular expression which does the same?

Yes we have regular expressions at our disposal but regular expressions are slow when we task it with finding patterns on large data, regular expression is awesome when we task it with finding a "dynamic patter" like all 10 digit phone numbers starting with +91, but in this case, we want to find one particular string.

If you want to know more Read here

This leaves us the only option of implementing a pattern matcher. Let's start with basic we can think about. We're given a document containing millions of words and we want to find one word, how shall we approach this? It's like finding a needle in a haystack.

Naive Approach

The first idea that we think of is comparing pattern and string character by character :

Alt Text

Implementation :



let string = "ATAATTACCAACATC";
let pattern = "ATC";
let position = [];
let found = true;
for(let i=0;i<string.length;i++){
  found = true;
  for(let j=0;j<pattern.length;j++){
    if(string[i+j] != pattern[j]){
      found = false;
      break;
    }
  }
  if(found){
    position.push(i);
  }
}

console.log(position);


Enter fullscreen mode Exit fullscreen mode

But this performs in O(nm) time complexity, which is very slow.

How to optimize it?

For each string, if it doesn't match, we move by one character. How about skipping the whole word?

Alt Text

In this case, instead of starting all over again, we skip the string when it mismatches.

In the previous approach, we compared string nearly 45 times, here we compared string only 15 times which is a huge leap.

Here we can perform an optimization, instead of comparing from the front, how about comparing from the end?

Alt Text

In this case, we compared the string just 9 times, which is nearly half of the previous case.

But as you might've guessed this has a huge flaw of, what if the end characters match but starting character mismatch.

So we need a concrete algorithm that will skip characters such that overall character comparison decreases.

What other options do we have?

One thing which we could do is instead of moving the entire pattern, we move a part of the pattern.

We match each character between mismatched string and pattern, then we check if we have any common characters, if we do then we move only part of those characters.

Alt Text

In this case, we did 12 comparison operations and this will work if compare string and pattern from either side.

This algorithm is called the Boyer Moore Pattern Matching algorithm.

Implementation of Boyer Moore Pattern Matching algorithm

This is a modified version of the original algorithm, the original algorithm found only the first instance of the pattern, here we're finding all the occurrences of the pattern.

Step 1> create an empty map of size 256 (because 256 ASCII characters) and set to -1.



let string = "ATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATC";
let pattern = "ATC";

let M = pattern.length;
let N = string.length;
let skip;                            //to determine substring skip
let res = [];                        //to store result

let map = new Array(256);            //array of 256 length



Enter fullscreen mode Exit fullscreen mode

Step 2> Map character to its index in the pattern.



for(let c = 0;c<256;c++){
  map[c] = -1;                       //initialize to -1
}

for(let j=0;j<M;j++){
  map[pattern[j]] = j;               //initialize to the it's index in pattern
}


Enter fullscreen mode Exit fullscreen mode

Step 3> Loop over the string, notice that in the for loop, instead of "i++", we're using i+= skip, ie skip that part of the string.



for(let i=0;i<=N-M;i+=skip)


Enter fullscreen mode Exit fullscreen mode

Step 4> Set skip to 0 during each iteration, this is important.



for(let i=0;i<=N-M;i+=skip){
  skip=0;
}


Enter fullscreen mode Exit fullscreen mode

Step 5> Match pattern with string.



for(let i=0;i<=N-M;i+=skip){
  skip=0;
  for(let j = M-1;j>=0;j--){

    if(pattern[j] != string[i+j]){
      skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]);
      break;
    }
  }
}


Enter fullscreen mode Exit fullscreen mode

Step 6> If there's a mismatch, find the length that must be skipped, here we perform



   skip = Math.max(1,j-map[string[i+j]]);


Enter fullscreen mode Exit fullscreen mode

In some cases like eg : "ACC" and "ATC", in these cases the last character's match but rest do not.
Logically we must go back and match first "C" of the string with "C" of pattern, but doing so will mean that we are going back which we logically shouldn't or else we will be stuck in an infinite loop going back and forth.
To ensure that we keep on going forward with the matching process, we ensure that whenever we come across situations when there's a negative skip, we set skip to 1.

Step 7> If the skip is 0, ie there we no mismatch, add "i" to the result list.



if(skip == 0){
    console.log(skip)
    res.push(i);
    skip++;
  }


Enter fullscreen mode Exit fullscreen mode

Combining them all :



let string = "ATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATC";
let pattern = "ATC";

let M = pattern.length;
let N = string.length;
let skip;
let res = [];

let map = new Array(256);

for(let c = 0;c<256;c++){
  map[c] = -1;
}

for(let j=0;j<M;j++){
  map[pattern[j]] = j;
}

for(let i=0;i<=N-M;i+=skip){
  skip=0;
  for(let j = M-1;j>=0;j--){

    if(pattern[j] != string[i+j]){
      skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]));
      break;
    }
  }
  if(skip == 0){
    res.push(i);
    skip++;
  }
}

console.log(res);


Enter fullscreen mode Exit fullscreen mode

That's it! That's how Boyer Moore's pattern matching works.

There are many other Pattern Matching algorithms like Knuth Morris Pratt and Rabin Karp but these have their own use cases.

I found this on StackOverflow you can read it here but in a nutshell:

Boyer Moore : Takes O(m) space, O(mn) worst case ,best case Ω(m/n). preforms 25% better on dictionary words and long words. Pratcial usecase includes implementation of grep in GNU for string matching, chrome probably uses it for string search.

Knuth Morris Pratt: Takes O(m) space, O(m+n) worst case, works better on DNA sequences.

Rabin Karp: Use O(1) auxiliary space, this performs better under searching for long words in a document containing many long words (see StackOverflow link for more).

I hope you liked my explanation. I usually write about how to solve interview questions and real-life applications of algorithms.

If I messed up somewhere or explained something wrongly, please comment down below.

Thanks for reading! :)

github:https://github.com/AKHILP96/Data-Structures-and-Algorithms/blob/master/Algorithm/boyermoore.js

PS: I am looking for job, if you want someone who knows's how to design UI/UX while keeping development in mind, hit me up :) thanks!

Buy Me A Coffee

Oldest comments (49)

Collapse
 
urielbitton profile image
Uriel Bitton

very interesting!

Collapse
 
akhilpokle profile image
Akhil

Thanks for reading :)

Collapse
 
dpkahuja profile image
Deepak Ahuja 👨‍💻

Amazing post! Thanks

Collapse
 
akhilpokle profile image
Akhil

Thanks for reading :)

Collapse
 
Sloan, the sloth mascot
Comment deleted
Collapse
 
akhilpokle profile image
Akhil

Thanks for reading :)

Collapse
 
codewithsaviour profile image
Dibyamohan Acharya

Nice one

Collapse
 
akhilpokle profile image
Akhil

Thanks for reading :)

Collapse
 
ashwinsharmap profile image
Ashwin Sharma P

Good one and very informative.

Collapse
 
akhilpokle profile image
Akhil

Thanks for reading :)

Collapse
 
nyalla profile image
Naveen Yalla

Good read

Collapse
 
akhilpokle profile image
Akhil

Thanks for reading :)

Collapse
 
fiqrisr profile image
Fiqri Syah Redha

This is interesting. Thanks for sharing.

Collapse
 
akhilpokle profile image
Akhil

Thanks alot for reading :)

Collapse
 
ben profile image
Ben Halpern

Very cool!

Collapse
 
akhilpokle profile image
Akhil

OMG !! Thanks for reading 🙏🙏🙏

Collapse
 
ben profile image
Ben Halpern

🤓

Collapse
 
johnphamous profile image
John Pham • Edited

Great write up on the different string matching algorithms!

The algorithm which we will implement might be similar to the one used in Chrome, but since its Google we're talking about, they might've made optimizations

The source code for the find in page is actually open source! You can see how it's implemented here: source.chromium.org/chromium/chrom...

Actual implementation: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/strings/string-search.h;l=281;drc=e3355a4a33909a48ebb8614048d90cffc67d287e?q=string%20search&ss=chromium&originalUrl=https:%2F%2Fcs.chromium.org%2F

Looks like they swap in different algorithms based on the context.

Collapse
 
akhilpokle profile image
Akhil

OMG ! thanks for sharing :). I read on StackOverflow that somewhere around 2008-2010 when chrome was picking of, they shared how they've implemented a version of Boyer Moore's pattern matching algorithm. But I couldn't find it on youtube.

Collapse
 
akhilpokle profile image
Akhil

Yep, that's the beauty of the algorithm since we will store the last index, the "skip" will be large. Since the skip is large if the characters from the end do not match, we will skip even more characters bringing down the overall number of comparisons.