The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find a pattern (or patterns) in an input string.
It is not the quickest algorithm for finding a single pattern, however a small tweak can allow the algorithm to perform well for multiple patterns. Using Rabin-Karp to search for multiple patterns could come in handy for things like detecting common words, sentences, or characters between documents or even source code.
In this post I will detail resources for learning about Rabin-Karp, go over it's key features, & provide a complete + efficient implementation of the algorithm (with test cases).
Resources:
- Rabin-Karp Wikipedia
- MIT Video Explanation
- Rolling Hash Video Explanation
- Article Explanation
- Implementation
Takeaways:
- Aside from a tricky rolling-hash, the algorithm is simple compared to other string-searching algorithms.
- The algorithm is as follows:
- Accept an input
i
and a patternp
- Define a variable
n
to be the length ofi
and a variablem
to be the length ofp
. - Create a hash of pattern
p
(hashP
) and a hash of lengthm
of inputi
(hashI
).- I.e for an input
i = "abcde"
and a pattern lengthm
of 2,hashI
would be a hash of the substring"ab"
.
- I.e for an input
- If
hashP == hashI
then we have possibly found patternp
at the start of inputi
. Check each character of the substring representinghashI
with our patternp
to be sure (call this functionCheckEqual()
) - Otherwise iterate from pattern length
m
to input lengthn
- For each iteration recalculate the hash
hashI
:- We are sliding a window across our input
- Window is of size
m
(pattern length) - This means for each iteration we are removing a leading character from our hash and adding a trailing character
- If
i = "abcde"
,m = 2
, andhashI
represents"ab"
- then our first iteration would remove"a"
fromhashI
and add"c"
. MeaninghashI
now represents"bc"
. This part of the operation is done in constant time.
- After rehashing, check if
hashP == hashI
. If they are the same, run our character equality checkCheckEqual()
to be certain. - Explore all of
i
until we have found our pattern or no pattern exists ini
.
- Accept an input
- A rolling hash is a hash function where the input is hashed in a sliding window. As the window moves through the input the hash is updated quicker than rehashing all the characters in the new window - as the previous window and the current window share characters.
- E.g if our input is
"abcde"
and our window size is 3. We first hash"abc"
then slide our window and next hash"bcd"
- a rolling hash takes advantage of the fact that"bc"
was already hashed. Instead of hashing"bc"
again a rolling hash removes"a"
and appends to the hash of"bc"
the hash of"c"
. This means a hash for"bc
" was only calculated once for two hashes. - Extrapolated over larger inputs, and more times, a rolling hash can be very efficient. It's the reason Rabin-Karp is reasonably fast at string-searching.
- E.g if our input is
- Rabin-Karp's rolling hash is calculated modulo a large prime number. This means hash values are also relatively prime and the distribution of the hash values are more uniform.
- Choosing a large prime helps us avoid false positives - where two hashes are the same but the values they represent are not.
- Choosing a random prime means we can guard against the worst case running time, as no specific inputs will affect the running time predictably.
- The rolling hash is often computed multiplied by a relatively prime Base/Radix. This value should be at least as large as the character set. For my implementation I used 256 - as this is the number of ASCII characters.
- A common rolling hash for Rabin-Karp is:
S[i]*R^m-i % Q
. WhereS
is our input string,i
is the position (in a loop/iteration),R
is the radix/base,m
is the pattern length, andQ
is the prime.- Using Horner's Method this can be simplified in code to:
private long Hash(string input, int patternLength)
{
long hash = 0;
for (int i = 0; i < patternLength; i++)
{
int ascii = input[i];
// Assume Base/Prime are already defined
hash = (Base * hash + ascii) % Prime;
}
return hash;
}
- Time complexity of Rabin-Karp is linear (specifically, amortized
O(n + m)
wheren
is input length andm
is pattern length). - Brute forced approaches, and worst case Rabin-Karp (more likely with smaller, non-random primes, and/or bad rolling hash), are
O(n * m)
. - Space is
O(1)
- We can modify Rabin-Karp with a Bloom Filter or simply a hash set to search for multiple patterns.
- In my implementations I return the index (aka offset) the pattern starts at in the input. You could easily modify either the single or multiple pattern implementations to do other things (like count occurrences of patterns).
Below are Rabin-Karp implementations for single & multiple patterns (of the same size). The implementations assume ASCII input and will work for inputs up to a reasonably large size. The rolling hash uses a relatively large random prime to guard against poor running times:
As always, if you found any errors in this post please let me know!
Top comments (0)