1. Introduction. Today's hook
In part 1, array search taught us to keep cutting the space of possibilities.
String search asks a similar question: where in this text does a pattern occur?
The naive approach tries every starting index and checks character by character. It works, but it repeats work that we could reuse.
Today we learn how to avoid that repetition with three classic tools: the Z algorithm, KMP, and Rabin–Karp. By the end you will recognize which tool fits which shape of problem, and you will know how to code each one quickly.
2. The naive idea
Given text s and pattern p, try every index i and compare s[i : i + len(p)] to p.
def find_all_naive(s: str, p: str):
    n, m = len(s), len(p)
    out = []
    for i in range(n - m + 1):
        # compare character by character to avoid creating a slice
        ok = True
        for j in range(m):
            if s[i + j] != p[j]:
                ok = False
                break
        if ok:
            out.append(i)
    return out
Complexity
- Time: O(n m).
- Space: O(1).
If you build the slice s[i : i + m] and compare whole strings, each slice copy costs O(m). That leads to O(n m) with larger constants. The rest of this post shows how to do better by reusing information across positions.
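For contrast, here is the slice-based variant as a minimal sketch (the name find_all_naive_slices is ours). It is shorter to write, but each comparison first allocates an O(m) copy of the window.

def find_all_naive_slices(s: str, p: str):
    m = len(p)
    # s[i : i + m] allocates a fresh length-m string before every comparison
    return [i for i in range(len(s) - m + 1) if s[i : i + m] == p]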
3. Z algorithm
3.1 Principle
The Z array for a string x records, for each position i, the length of the longest substring starting at i that matches the prefix of x.
If we compute Z on p#s where # is a separator that does not appear in either string, any Z value equal to len(p) marks a match of p inside s.
The algorithm maintains a window [L, R] called the Z box that represents the rightmost segment that matches a prefix. Inside the box we reuse previous values by mirroring positions. When we fall outside the box we match characters from scratch and extend the box.
3.2 Tradeoffs and assumptions
Pros
• Simple linear scan once you know the template
• Great for prefix based queries such as longest prefix that is also a suffix
Cons
• Slightly less familiar to many engineers than KMP
• Requires a safe separator when you concatenate pattern and text
Assumptions
• Input is indexable characters. For byte data the logic is identical.
Complexity
- Time: O(n + m) for the combined string.
- Space: O(n + m) for its Z array.
3.3 Standard implementation
Building the Z array
def z_array(x: str):
    n = len(x)
    Z = [0] * n
    L = R = 0
    for i in range(1, n):
        if i <= R:
            Z[i] = min(R - i + 1, Z[i - L])
        while i + Z[i] < n and x[Z[i]] == x[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] - 1 > R:
            L, R = i, i + Z[i] - 1
    return Z
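A quick sanity check on a small string makes the definition concrete: each value is the length of the match between the prefix and the substring starting at that position.

print(z_array("aabxaab"))  # [0, 1, 0, 0, 3, 1, 0]
# index 4 starts "aab...", which matches the prefix "aab" for 3 characters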
Pattern search with Z
def find_with_z(s: str, p: str):
    if not p:
        return list(range(len(s) + 1))
    combo = p + "#" + s
    Z = z_array(combo)
    m = len(p)
    hits = []
    for i in range(m + 1, len(combo)):
        if Z[i] == m:
            hits.append(i - m - 1)
    return hits
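For example:

print(find_with_z("abcabcab", "abc"))  # [0, 3]
# assumes "#" does not occur in either string, as noted above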
3.4 Example. Longest Happy Prefix
Goal. Find the longest prefix of s that is also a suffix and not equal to the whole string.
Why Z fits
Z stores lengths of segments that match the prefix. The answer corresponds to the first index i with i + Z[i] == n: that match runs to the end of the string, so s[:Z[i]] is both a prefix and a suffix, and the earliest such i gives the longest one.
def longest_prefix(s: str) -> str:
    Z = z_array(s)  # previously built function
    n = len(s)
    for i in range(1, n):
        if i + Z[i] == n:
            return s[:Z[i]]
    return ""
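For example:

print(longest_prefix("level"))   # "l"
print(longest_prefix("ababab"))  # "abab"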
4. KMP
4.1 Principle
KMP uses a prefix function called LPS that records the length of the longest proper prefix that is also a suffix for each prefix of the pattern.
When a mismatch occurs after matching k characters, KMP uses the LPS of k to jump the pattern to the next viable alignment, without moving the text backward. The text pointer never retreats. That is the key efficiency.
4.2 Tradeoffs and assumptions
Pros
• Deterministic linear time for search
• No hashing and no collision concerns
• Works well when there are many partial matches
Cons
• The LPS table is conceptually tricky for beginners
• Slightly more code than Z if you only need prefix queries
Assumptions
• Same as Z. Operates on indexable characters
Complexity
- Time: O(n + m).
- Space: O(m) for the LPS table.
4.3 Standard implementation
Build LPS
def build_lps(p: str):
    m = len(p)
    lps = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and p[i] != p[k]:
            k = lps[k - 1]
        if p[i] == p[k]:
            k += 1
        lps[i] = k
    return lps
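Tracing the table on a small pattern helps: each entry is the length of the longest proper prefix of p[:i + 1] that is also its suffix.

print(build_lps("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]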
Pattern search with LPS
def kmp_search(s: str, p: str):
    if not p:
        return list(range(len(s) + 1))
    lps = build_lps(p)
    i = j = 0
    hits = []
    while i < len(s):
        if s[i] == p[j]:
            i += 1
            j += 1
            if j == len(p):
                hits.append(i - j)
                j = lps[j - 1]
        else:
            if j > 0:
                j = lps[j - 1]
            else:
                i += 1
    return hits
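Note that the line j = lps[j - 1] after a full match lets overlapping occurrences be found:

print(kmp_search("ababab", "abab"))  # [0, 2]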
4.4 Example. Find the index of the first occurrence.
Use KMP to return the first match or minus one when absent.
def str_str(haystack: str, needle: str) -> int:
    n, m = len(haystack), len(needle)
    if not m or n < m:
        return -1
    lps = build_lps(needle)  # previously built function
    i = j = 0  # over haystack and needle
    while i < n:
        if haystack[i] == needle[j]:
            i += 1
            j += 1
            if j == m:
                return i - j
        else:
            if j > 0:
                j = lps[j - 1]
            else:
                i += 1
    return -1
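For example:

print(str_str("sadbutsad", "sad"))   # 0
print(str_str("leetcode", "leeto"))  # -1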
5. Rabin–Karp
5.1 Principle
The Rabin–Karp algorithm transforms substrings into numeric hashes so they can be compared in constant time.
Instead of comparing every character of a pattern with every substring, we slide a rolling hash window over the text and only verify characters when the hash values match.
It is one of the earliest practical examples of using hashing for pattern matching.
5.2 Best Practices
Choose a good base.
A common choice is a prime slightly larger than your alphabet size: for the 26 lowercase letters, base 31 is standard; for 256-value byte data, pick a prime above 256 such as 257.
Using a base that spreads digits well reduces collisions.
Choose a large prime modulus.
It keeps hash values small while minimizing collisions. Popular moduli are 10^9 + 7, 10^9 + 9, or 2^61 - 1 for 64-bit arithmetic.
Use modular arithmetic.
Hash updates must be done modulo mod to avoid overflow and to ensure a stable range.
Optional: double hashing.
Using two different moduli and storing hash pairs nearly eliminates collisions.
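A minimal sketch of the double-hash idea (hash_pair is our own helper name, not a library function):

def hash_pair(x: str, base: int = 131):
    # hash the same string under two moduli and compare the pairs
    mod1, mod2 = 10**9 + 7, 10**9 + 9
    h1 = h2 = 0
    for ch in x:
        h1 = (h1 * base + ord(ch)) % mod1
        h2 = (h2 * base + ord(ch)) % mod2
    return (h1, h2)

Two strings are treated as equal only when both components match, which makes an undetected collision vanishingly unlikely.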
5.3 Tradeoffs and assumptions
Pros
• Very fast in practice for fixed length checks
• Simple to combine with binary search on the answer
Cons
• Hash collisions are possible
• Requires careful choice of base and modulus
• For full robustness many solutions use two moduli
Assumptions
• Treat the string as integer codes. For Unicode consider normalizing or limiting to bytes
5.4 Standard implementation
Below is the clean, general Rabin–Karp search function, exactly what you’d use to find all occurrences of a substring in a larger text.
def rabin_karp(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if not m or n < m:
        return []
    base = 256  # or 31 for lowercase letters
    mod = 10**9 + 7  # large prime modulus
    power = pow(base, m - 1, mod)
    # compute initial hashes
    hp = 0  # hash of pattern
    hs = 0  # rolling hash of current window
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        hs = (hs * base + ord(text[i])) % mod
    res = []
    for i in range(n - m + 1):
        # if hashes match, verify characters
        if hp == hs and text[i:i + m] == pattern:
            res.append(i)
        # roll the hash window
        if i < n - m:
            hs = (hs - ord(text[i]) * power) % mod
            hs = (hs * base + ord(text[i + m])) % mod
            hs = (hs + mod) % mod  # avoid negatives (redundant in Python, needed in C-like languages)
    return res
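For example:

print(rabin_karp("abracadabra", "abra"))  # [0, 7]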
5.5 How It Works
- Compute the hash of the pattern and the first window of the text.
- Slide the window one character at a time.
- Update the hash in constant time: remove the leftmost character's contribution and add the new rightmost character.
- When hashes match, verify characters to avoid false positives.
Complexity
Each step is O(1), so the overall average complexity is O(n + m), though in the worst case (many hash collisions) it can degrade to O(n m).
With good parameters, this almost never happens.
5.6 Example: Repeated DNA Sequences
We can apply the same logic to find all 10-letter substrings that appear more than once.
def findRepeatedDnaSequences(s: str) -> list[str]:
    n = len(s)
    m = 10
    if n <= m:
        return []
    base = 131  # good spread for ASCII range
    mod = 10**9 + 7
    power = pow(base, m - 1, mod)
    hs = 0
    seen = {}  # hash -> start index of the most recent window with that hash
    res = set()
    # initial hash for first window
    for i in range(m):
        hs = (hs * base + ord(s[i])) % mod
    seen[hs] = 0
    # slide the window
    for i in range(n - m):
        hs = (hs - ord(s[i]) * power) % mod
        hs = (hs * base + ord(s[i + m])) % mod
        hs = (hs + mod) % mod
        if hs in seen:
            # verify characters to rule out a hash collision
            if s[seen[hs] : seen[hs] + m] == s[i + 1 : i + 1 + m]:
                res.add(s[i + 1 : i + 1 + m])
        seen[hs] = i + 1
    return list(res)
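On the classic test input this finds both repeated windows (order may vary because a set is used):

print(findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# ['AAAAACCCCC', 'CCCCCAAAAA']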
5.7 Advanced: Binary Search + Rabin–Karp
For problems like Longest Duplicate Substring or Longest Repeated Substring, we can combine Rabin–Karp with binary search on substring length:
- Binary search determines the candidate length L.
- Rabin–Karp checks whether any duplicate substring of length L exists.
- If yes, try a longer length; otherwise, shorter.
This hybrid approach runs in expected O(n log n) time and is one of the cleanest ways to solve those problems efficiently.
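Here is a minimal sketch of that combination for Longest Duplicate Substring (the names longest_dup_substring and has_dup are ours). It verifies every hash hit against the actual characters, as in the search function above.

def longest_dup_substring(s: str) -> str:
    base, mod = 131, (1 << 61) - 1  # 2^61 - 1, see the modulus advice above

    def has_dup(length: int) -> int:
        # return the start of some duplicated window of this length, else -1
        power = pow(base, length - 1, mod)
        h = 0
        for i in range(length):
            h = (h * base + ord(s[i])) % mod
        seen = {h: [0]}  # hash -> start indices already seen with that hash
        for i in range(1, len(s) - length + 1):
            h = ((h - ord(s[i - 1]) * power) * base + ord(s[i + length - 1])) % mod
            if h in seen:
                cur = s[i : i + length]
                if any(s[j : j + length] == cur for j in seen[h]):
                    return i
                seen[h].append(i)
            else:
                seen[h] = [i]
        return -1

    lo, hi, best = 1, len(s) - 1, ""
    while lo <= hi:
        mid = (lo + hi) // 2
        start = has_dup(mid)
        if start != -1:                 # a duplicate of length mid exists
            best = s[start : start + mid]
            lo = mid + 1                # try longer
        else:
            hi = mid - 1                # try shorter
    return best

print(longest_dup_substring("banana"))  # "ana"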
Complexity Summary
- Naive scan: O(n m) time, O(1) extra space.
- Z algorithm: O(n + m) time and space (Z array over pattern + separator + text).
- KMP: O(n + m) time, O(m) space for the LPS table.
- Rabin–Karp: O(n + m) time on average, O(n m) in the collision-heavy worst case, O(1) extra space.
- Rabin–Karp + binary search: expected O(n log n) time.
6. Quick comprehension check
Question
You have to find every position where pattern p occurs inside text s. Which method would you choose if you want deterministic linear time and you do not want to deal with hashing?
Answer
Use KMP. Build the LPS table for p and scan s once while reusing prefix information.
7. Final summary and review
All three methods avoid recomputing work that naive scanning repeats.
Z algorithm
Records how much the prefix matches from every position. Excellent for prefix queries and for fast detection of pattern matches by computing Z on pattern, separator, text.
KMP
Uses the prefix function to jump the pattern forward after mismatches. Deterministic linear time. A reliable default when hashing is not desirable.
Rabin–Karp
Uses rolling hashes to compare substrings by numbers. Very natural to combine with binary search on length for longest duplicate or longest repeated tasks.
8. Hook for tomorrow
We searched inside arrays and inside strings. Next we switch to structures that are not linear. Trees and graphs. We will learn how BFS and DFS explore state spaces, how to encode visited sets, and how to graft searching on top of traversal to solve pathfinding and many classic interview problems.

