1. Introduction. Today's hook
In part 1, array search taught us to keep cutting the space of possibilities.
String search asks a similar question: where in this text does a pattern occur?
The naive approach tries every starting index and checks character by character. It works, but it repeats work that we could reuse.
Today we learn how to avoid that repetition with three classic tools: the Z algorithm, KMP, and Rabin–Karp. By the end you will recognize which tool fits which shape of problem, and you will know how to code each one quickly.
2. The naive idea
Given text s and pattern p, try every index i and compare s[i : i + len(p)] to p.
def find_all_naive(s: str, p: str):
    n, m = len(s), len(p)
    out = []
    for i in range(n - m + 1):
        # compare character by character to avoid creating a slice
        ok = True
        for j in range(m):
            if s[i + j] != p[j]:
                ok = False
                break
        if ok:
            out.append(i)
    return out
Complexity
- Time: O(n m).
- Space: O(1).
If you build the slice s[i : i + m] and compare whole strings, each slice copy costs O(m). That leads to O(n m) with larger constants. The rest of this post shows how to do better by reusing information across positions.
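For contrast, here is the slice-based variant as a minimal sketch (the name find_all_naive_slices is ours). It is shorter to write, but each comparison first allocates an O(m) copy of the window.

def find_all_naive_slices(s: str, p: str):
    m = len(p)
    # s[i : i + m] allocates a fresh length-m string before every comparison
    return [i for i in range(len(s) - m + 1) if s[i : i + m] == p]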
3. Z algorithm
3.1 Principle
The Z array for a string x records, for each position i, the length of the longest substring starting at i that matches the prefix of x.
If we compute Z on p#s where # is a separator that does not appear in either string, any Z value equal to len(p) marks a match of p inside s.
The algorithm maintains a window [L, R] called the Z box that represents the rightmost segment that matches a prefix. Inside the box we reuse previous values by mirroring positions. When we fall outside the box we match characters from scratch and extend the box.
3.2 Tradeoffs and assumptions
Pros
• Simple linear scan once you know the template
• Great for prefix based queries such as longest prefix that is also a suffix
Cons
• Slightly less familiar to many engineers than KMP
• Requires a safe separator when you concatenate pattern and text
Assumptions
• Input is indexable characters. For byte data the logic is identical.
Complexity
- Time: O(n + m) for the combined string.
- Space: O(n + m) for its Z array.
3.3 Standard implementation
Building the Z array
def z_array(x: str):
    n = len(x)
    Z = [0] * n
    L = R = 0
    for i in range(1, n):
        if i <= R:
            Z[i] = min(R - i + 1, Z[i - L])
        while i + Z[i] < n and x[Z[i]] == x[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] - 1 > R:
            L, R = i, i + Z[i] - 1
    return Z
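A quick sanity check on a small string makes the definition concrete: each value is the length of the match between the prefix and the substring starting at that position.

print(z_array("aabxaab"))  # [0, 1, 0, 0, 3, 1, 0]
# index 4 starts "aab...", which matches the prefix "aab" for 3 characters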
Pattern search with Z
def find_with_z(s: str, p: str):
    if not p:
        return list(range(len(s) + 1))
    combo = p + "#" + s
    Z = z_array(combo)
    m = len(p)
    hits = []
    for i in range(m + 1, len(combo)):
        if Z[i] == m:
            hits.append(i - m - 1)
    return hits
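For example:

print(find_with_z("abcabcab", "abc"))  # [0, 3]
# assumes "#" does not occur in either string, as noted above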
3.4 Example. Longest Happy Prefix
Goal. Find the longest prefix of s that is also a suffix and not equal to the whole string.
Why Z fits
Z stores lengths of segments that match the prefix. The answer corresponds to the first index i with i + Z[i] == n: that match runs to the end of the string, so s[:Z[i]] is both a prefix and a suffix, and the earliest such i gives the longest one.
def longest_prefix(s: str) -> str:
    Z = z_array(s)  # previously built function
    n = len(s)
    for i in range(1, n):
        if i + Z[i] == n:
            return s[:Z[i]]
    return ""
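For example:

print(longest_prefix("level"))   # "l"
print(longest_prefix("ababab"))  # "abab"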
4. KMP
4.1 Principle
KMP uses a prefix function called LPS that records the length of the longest proper prefix that is also a suffix for each prefix of the pattern.
When a mismatch occurs after matching k characters, KMP uses the LPS of k to jump the pattern to the next viable alignment, without moving the text backward. The text pointer never retreats. That is the key efficiency.
4.2 Tradeoffs and assumptions
Pros
• Deterministic linear time for search
• No hashing and no collision concerns
• Works well when there are many partial matches
Cons
• The LPS table is conceptually tricky for beginners
• Slightly more code than Z if you only need prefix queries
Assumptions
• Same as Z. Operates on indexable characters
Complexity
- Time: O(n + m).
- Space: O(m) for the LPS table.
4.3 Standard implementation
Build LPS
def build_lps(p: str):
    m = len(p)
    lps = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and p[i] != p[k]:
            k = lps[k - 1]
        if p[i] == p[k]:
            k += 1
        lps[i] = k
    return lps
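Tracing the table on a small pattern helps: each entry is the length of the longest proper prefix of p[:i + 1] that is also its suffix.

print(build_lps("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]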
Pattern search with LPS
def kmp_search(s: str, p: str):
    if not p:
        return list(range(len(s) + 1))
    lps = build_lps(p)
    i = j = 0
    hits = []
    while i < len(s):
        if s[i] == p[j]:
            i += 1
            j += 1
            if j == len(p):
                hits.append(i - j)
                j = lps[j - 1]
        else:
            if j > 0:
                j = lps[j - 1]
            else:
                i += 1
    return hits
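Note that the line j = lps[j - 1] after a full match lets overlapping occurrences be found:

print(kmp_search("ababab", "abab"))  # [0, 2]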
4.4 Example. Find the index of the first occurrence.
Use KMP to return the first match or minus one when absent.
def str_str(haystack: str, needle: str) -> int:
    n, m = len(haystack), len(needle)
    if not m or n < m:
        return -1
    lps = build_lps(needle)  # previously built function
    i = j = 0  # over haystack and needle
    while i < n:
        if haystack[i] == needle[j]:
            i += 1
            j += 1
            if j == m:
                return i - j
        else:
            if j > 0:
                j = lps[j - 1]
            else:
                i += 1
    return -1
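For example:

print(str_str("sadbutsad", "sad"))   # 0
print(str_str("leetcode", "leeto"))  # -1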
5. Rabin–Karp
5.1 Principle
The Rabin–Karp algorithm transforms substrings into numeric hashes so they can be compared in constant time.
Instead of comparing every character of a pattern with every substring, we slide a rolling hash window over the text and only verify characters when the hash values match.
It is one of the earliest practical examples of using hashing for pattern matching.
5.2 Best Practices
Choose a good base.
A common choice is a prime slightly larger than your alphabet size: for the 26 lowercase letters, base 31 is standard; for 256-value byte data, pick a prime above 256 such as 257.
Using a base that spreads digits well reduces collisions.
Choose a large prime modulus.
It keeps hash values small while minimizing collisions. Popular moduli are 10^9 + 7, 10^9 + 9, or 2^61 - 1 for 64-bit arithmetic.
Use modular arithmetic.
Hash updates must be done modulo mod to avoid overflow and to ensure a stable range.
Optional: double hashing.
Using two different moduli and storing hash pairs nearly eliminates collisions.
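A minimal sketch of the double-hash idea (hash_pair is our own helper name, not a library function):

def hash_pair(x: str, base: int = 131):
    # hash the same string under two moduli and compare the pairs
    mod1, mod2 = 10**9 + 7, 10**9 + 9
    h1 = h2 = 0
    for ch in x:
        h1 = (h1 * base + ord(ch)) % mod1
        h2 = (h2 * base + ord(ch)) % mod2
    return (h1, h2)

Two strings are treated as equal only when both components match, which makes an undetected collision vanishingly unlikely.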
5.3 Tradeoffs and assumptions
Pros
• Very fast in practice for fixed length checks
• Simple to combine with binary search on the answer
Cons
• Hash collisions are possible
• Requires careful choice of base and modulus
• For full robustness many solutions use two moduli
Assumptions
• Treat the string as integer codes. For Unicode consider normalizing or limiting to bytes
5.4 Standard implementation
Below is the clean, general Rabin–Karp search function, exactly what you’d use to find all occurrences of a substring in a larger text.
def rabin_karp(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if not m or n < m:
        return []
    base = 256  # or 31 for lowercase letters
    mod = 10**9 + 7  # large prime modulus
    power = pow(base, m - 1, mod)
    # compute initial hashes
    hp = 0  # hash of pattern
    hs = 0  # rolling hash of current window
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        hs = (hs * base + ord(text[i])) % mod
    res = []
    for i in range(n - m + 1):
        # if hashes match, verify characters
        if hp == hs and text[i:i + m] == pattern:
            res.append(i)
        # roll the hash window
        if i < n - m:
            hs = (hs - ord(text[i]) * power) % mod
            hs = (hs * base + ord(text[i + m])) % mod
            hs = (hs + mod) % mod  # avoid negatives (redundant in Python, needed in C-like languages)
    return res
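For example:

print(rabin_karp("abracadabra", "abra"))  # [0, 7]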
5.5 How It Works
- Compute the hash of the pattern and the first window of the text.
- Slide the window one character at a time.
- Update the hash in constant time: remove the leftmost character's contribution and add the new rightmost character.
- When hashes match, verify characters to avoid false positives.
Complexity
Each step is O(1), so the overall average complexity is O(n + m), though in the worst case (many hash collisions) it can degrade to O(n m).
With good parameters, this almost never happens.
5.6 Example: Repeated DNA Sequences
We can apply the same logic to find all 10-letter substrings that appear more than once.
def findRepeatedDnaSequences(s: str) -> list[str]:
    n = len(s)
    m = 10
    if n <= m:
        return []
    base = 131  # good spread for ASCII range
    mod = 10**9 + 7
    power = pow(base, m - 1, mod)
    hs = 0
    seen = {}  # hash -> start index of the most recent window with that hash
    res = set()
    # initial hash for first window
    for i in range(m):
        hs = (hs * base + ord(s[i])) % mod
    seen[hs] = 0
    # slide the window
    for i in range(n - m):
        hs = (hs - ord(s[i]) * power) % mod
        hs = (hs * base + ord(s[i + m])) % mod
        hs = (hs + mod) % mod
        if hs in seen:
            # verify characters to rule out a hash collision
            if s[seen[hs] : seen[hs] + m] == s[i + 1 : i + 1 + m]:
                res.add(s[i + 1 : i + 1 + m])
        seen[hs] = i + 1
    return list(res)
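On the classic test input this finds both repeated windows (order may vary because a set is used):

print(findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# ['AAAAACCCCC', 'CCCCCAAAAA']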
5.7 Advanced: Binary Search + Rabin–Karp
For problems like Longest Duplicate Substring or Longest Repeated Substring, we can combine Rabin–Karp with binary search on substring length:
- Binary search determines the candidate length L.
- Rabin–Karp checks whether any duplicate substring of length L exists.
- If yes, try a longer length; otherwise, shorter.
This hybrid approach runs in expected O(n log n) time and is one of the cleanest ways to solve those problems efficiently.
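Here is a minimal sketch of that combination for Longest Duplicate Substring (the names longest_dup_substring and has_dup are ours). It verifies every hash hit against the actual characters, as in the search function above.

def longest_dup_substring(s: str) -> str:
    base, mod = 131, (1 << 61) - 1  # 2^61 - 1, see the modulus advice above

    def has_dup(length: int) -> int:
        # return the start of some duplicated window of this length, else -1
        power = pow(base, length - 1, mod)
        h = 0
        for i in range(length):
            h = (h * base + ord(s[i])) % mod
        seen = {h: [0]}  # hash -> start indices already seen with that hash
        for i in range(1, len(s) - length + 1):
            h = ((h - ord(s[i - 1]) * power) * base + ord(s[i + length - 1])) % mod
            if h in seen:
                cur = s[i : i + length]
                if any(s[j : j + length] == cur for j in seen[h]):
                    return i
                seen[h].append(i)
            else:
                seen[h] = [i]
        return -1

    lo, hi, best = 1, len(s) - 1, ""
    while lo <= hi:
        mid = (lo + hi) // 2
        start = has_dup(mid)
        if start != -1:                 # a duplicate of length mid exists
            best = s[start : start + mid]
            lo = mid + 1                # try longer
        else:
            hi = mid - 1                # try shorter
    return best

print(longest_dup_substring("banana"))  # "ana"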
Complexity Summary
- Naive scan: O(n m) time, O(1) extra space.
- Z algorithm: O(n + m) time and space (Z array over pattern + separator + text).
- KMP: O(n + m) time, O(m) space for the LPS table.
- Rabin–Karp: O(n + m) time on average, O(n m) in the collision-heavy worst case, O(1) extra space.
- Rabin–Karp + binary search: expected O(n log n) time.
6. Quick comprehension check
Question
You have to find every position where pattern p occurs inside text s. Which method would you choose if you want deterministic linear time and you do not want to deal with hashing?
Answer
Use KMP. Build the LPS table for p and scan s once while reusing prefix information.
7. Final summary and review
All three methods avoid recomputing work that naive scanning repeats.
Z algorithm
Records how much the prefix matches from every position. Excellent for prefix queries and for fast detection of pattern matches by computing Z on pattern, separator, text.
KMP
Uses the prefix function to jump the pattern forward after mismatches. Deterministic linear time. A reliable default when hashing is not desirable.
Rabin–Karp
Uses rolling hashes to compare substrings by numbers. Very natural to combine with binary search on length for longest duplicate or longest repeated tasks.
8. Hook for tomorrow
We searched inside arrays and inside strings. Next we switch to structures that are not linear. Trees and graphs. We will learn how BFS and DFS explore state spaces, how to encode visited sets, and how to graft searching on top of traversal to solve pathfinding and many classic interview problems.

