DEV Community

Nic
Nic

Posted on • Originally published at coderscat.com on

KMP String Search Algorithm

2020_05_10_kmp-string-search.org_20200510_135023.png

Knuth–Morris–Pratt string-searching algorithm (or KMP algorithm) is a string search algorithm.

It’s named with the combined names of authors and first published at 1970.

KMP algorithm is elegant and efficient. Give any text and pattern, the overall time complexity is O(M+N), where M is the length of text and N is the length of pattern.

To understanding KMP, we need to explain with naive string search algorithm and prefix string.

Naive string search algorithm

Naive string search has two loops, we iterate each index from text and check each substring with pattern.

The overall time complexity is O(M*N).

int naive_search(char* s, char* p) {
  int sl = strlen(s);
  int pl = strlen(p);
  int i = 0;
  int j = 0;
  while (i < sl && j < pl) {
    if (s[i] == p[j]) {
      i++;
      j++;
    } else {
      i = i - j + 1;
      j = 0; // Fallback to the beginning of pattern
    }
  }
  return j == pl ? (i-j) : (-1);
}

Enter fullscreen mode Exit fullscreen mode

The drawback of naive algorithm is we fallback to the beginning of pattern every time. It will perform badly for such an input:

text :    a a a d e d f
pattern : a a a c

loop #1 : a a a c
loop #2 :   a a a c
loop #3 :     a a a c
Enter fullscreen mode Exit fullscreen mode

Each time we checked almost all the characters of pattern, but the last one is different, and then we restart the checking process from the beginning.

How to solve this issue?

Prefix and suffix string sets

To understand how to optimize naive string search, we need to know what is the common prefix and suffix substring.

Given any string, a prefix string set is a set contains all the prefix substring of it. For example, the pefix string set of “hello” is:

{ "h", "he", "hel", "hell"}
Enter fullscreen mode Exit fullscreen mode

Suffix string set is similar but only contains suffix sub-string, for “hello”:

{ "o", "ol", "llo", "ello"}
Enter fullscreen mode Exit fullscreen mode

With these two sets, we could get the longest length of common string in two sets. For example:

text: "ababab"
Prefix set: { "a", "ab", "aba", "abab", "ababa" }
Suffix set: { "b", "ab", "bab", "abab", "babab" }
Longest common string length: length("abab") => 4
Enter fullscreen mode Exit fullscreen mode

How to use a common string of prefix and suffix substring to make string search quicker?

For each index of pattern, we build a table which contains the longest length of current substring.

For pattern “ccca”, the table is:

2020_05_10_kmp-string-search.org_20200510_133533.png

With text input text of “ccccccccca”, we can skip some part of pattern:

kmp-s1.gif

We even can skip multiple characters from pattern. Given the text of “ababababca”, let’s search the pattern of “abababca”:

kmp-s2.gif

KMP algorithm

With all above explaination, we can figure out KMP is similar with naive string search algorithm, but with an optimization of next table. We use next table to skip prefix parts of pattern:

void getNext(char* p, int* next){
  int i = 0, j = -1;
  next[0] = -1;
  while (i < strlen(p)) {
    if (j == -1 || p[i] == p[j]) {
      ++i;
      ++j;
      next[i] = j;
    }
    else j = next[j];
  }
}

int KMP(char* text, char* pattern) {
  int lt = strlen(text);
  int lp = strlen(pattern);
  int i = 0;
  int j = 0;

  int next[lp + 1];
  getNext(pattern, next);

  while (i < lt && j < lp) {
    if (j == -1 || text[i] == pattern[j]) {
      i++;
      j++;
    }
    else j = next[j];
  }
  if (j == lp) return i - j;
  return -1;
}

int main() {
  char* text = strdup("ababababca");
  char* pattern = strdup("abababca");
  printf("res: %d\n", KMP(text, pattern));
  return 0;
}

Enter fullscreen mode Exit fullscreen mode

The post KMP String Search Algorithm appeared first on Coder's Cat.

Top comments (1)

Collapse
 
mridulsharma01 profile image
Mridul Sharma

Thanks a lot!