[For Beginners] Understanding the KMP Algorithm by Comparing with the Brute-Force

#beginners #algorithms

The Knuth-Morris-Pratt algorithm (KMP), a well-known string search algorithm, is a method designed to efficiently perform pattern matching by minimizing unnecessary comparisons. To help both beginners and myself deepen our understanding, I’ll explain the basic ideas and workings of the KMP algorithm.

In this article, we’ll refer to the string pattern to be searched as the query and the string in which we want to find the query as the text.

The Limitations of the Brute-Force Method

When it comes to string search algorithms, the simplest form is the brute-force method (also referred to as the naive method). Few would dispute this. Here, let’s consider a text of length n and a query of length m. The brute-force method compares the query to the text character by character, starting from the beginning, and continues until a match is found.

However, this approach has a worst-case time complexity of O(nm), making it inefficient for long strings. For instance, in the following example, the brute-force method requires 20 comparisons:

Text: ABABAABABAB

Query: ABABAB

The KMP algorithm, on the other hand, can reduce this to just 14 comparisons.

The Key to the Algorithm: The Failure Table

The core idea of the KMP algorithm is to maximize the use of already matched parts even when a mismatch occurs. To achieve this, the algorithm relies on a data structure called the Failure Table (also referred to as the shift table or partial match table). This table helps determine where to resume the comparison after a mismatch.

The Failure Table has two main characteristics:

It grows row by row as the query pattern is built one character at a time. Each row represents a scenario where the text and query match up to a certain point.
It calculates the shift value by extracting the longest matching prefix and suffix for each substring of the query. This value represents the length of the longest overlapping prefix and suffix, which tells us how far we can skip ahead in the text after a mismatch.

The Failure Table is constructed using only the query string. By analyzing substrings of the query, from one character up to m characters, it records how much of the query matches itself in terms of prefixes and suffixes. Let’s take a look at an example to clarify.

Query: ABABAB

PATTERN	PREFIX + (others)	(others) + SUFFIX	SHIFT
A	-	-	0
AB	A + (b)	(a) B	0
ABA	A + (ba)	(ab) + A	1
ABAB	AB + (ab)	(ab) + AB	2
ABABA	ABA + (ba)	(ab) + ABA	3
ABABAB	ABAB + (ab)	(ab) + ABAB	4

Using the Failure Table: A Practical Example

Let’s see how the Failure Table is used during pattern matching with the KMP algorithm. We’ll use the same text and query as mentioned earlier.

On the first attempt, the algorithm successfully matches up to the 5th character. However, the final character results in a mismatch. Now, instead of restarting the comparison from the next character in the text, the KMP algorithm uses the Failure Table to decide whether to continue from the current position or to backtrack slightly.

Here’s how it works:

The algorithm identifies that the last matched character was at position 5 (A). In the Failure Table, this corresponds to index 4.
The Failure Table indicates that FailureTable[4] = A, showing a potential overlap. The shift value for this position is 3.
This means the next comparison can resume from text[5] and pattern[3], skipping unnecessary comparisons.

This efficient skip allows the algorithm to continue without rechecking the already matched portion, saving computational effort.

Repeating this process, the algorithm determines a match after a total of 14 comparisons. From a computational perspective, constructing the Failure Table has a complexity of O(m), and the search process has a complexity of O(n). Therefore, the worst-case time complexity of the KMP algorithm is O(n + m).

One key point to remember is that the KMP algorithm never backtracks the index on the text side. Keeping this feature in mind can help prevent confusion when implementing the algorithm.

At first glance, the KMP algorithm may seem complex, but its core idea is simple: to reduce unnecessary comparisons. By understanding and utilizing the Failure Table, efficient string search becomes achievable.

I hope this article has helped you grasp the basics of the KMP algorithm.