The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find a pattern (or patterns) in an input string.
It is not the fastest algorithm for finding a single pattern; however, a small tweak lets it perform well when searching for multiple patterns. Using Rabin-Karp to search for multiple patterns can come in handy for things like detecting common words, sentences, or characters between documents or even source code.
In this post I will list resources for learning about Rabin-Karp, go over its key features, and provide a complete, efficient implementation of the algorithm (with test cases).
Resources:
- Rabin-Karp Wikipedia
- MIT Video Explanation
- Rolling Hash Video Explanation
- Article Explanation
- Implementation
Takeaways:
- Aside from a tricky rolling-hash, the algorithm is simple compared to other string-searching algorithms.
- The algorithm is as follows:
  - Accept an input `i` and a pattern `p`
  - Define a variable `n` to be the length of `i` and a variable `m` to be the length of `p`.
  - Create a hash of pattern `p` (`hashP`) and a hash of length `m` of input `i` (`hashI`).
    - I.e. for an input `i = "abcde"` and a pattern length `m` of 2, `hashI` would be a hash of the substring `"ab"`.
  - If `hashP == hashI` then we have possibly found pattern `p` at the start of input `i`. Check each character of the substring representing `hashI` with our pattern `p` to be sure (call this function `CheckEqual()`)
  - Otherwise iterate from pattern length `m` to input length `n`
  - For each iteration recalculate the hash `hashI`:
    - We are sliding a window across our input
    - Window is of size `m` (pattern length)
    - This means for each iteration we are removing a leading character from our hash and adding a trailing character
    - If `i = "abcde"`, `m = 2`, and `hashI` represents `"ab"` - then our first iteration would remove `"a"` from `hashI` and add `"c"`. Meaning `hashI` now represents `"bc"`. This part of the operation is done in constant time.
  - After rehashing, check if `hashP == hashI`. If they are the same, run our character equality check `CheckEqual()` to be certain.
  - Explore all of `i` until we have found our pattern or no pattern exists in `i`.
- A rolling hash is a hash function where the input is hashed in a sliding window. As the window moves through the input the hash is updated quicker than rehashing all the characters in the new window - as the previous window and the current window share characters.
  - E.g. if our input is `"abcde"` and our window size is 3, we first hash `"abc"`, then slide our window and next hash `"bcd"`. A rolling hash takes advantage of the fact that `"bc"` was already hashed: instead of hashing `"bc"` again, it removes the contribution of `"a"` and appends the contribution of `"d"`. This means the work for `"bc"` was only done once for two hashes.
  - Extrapolated over larger inputs and many window positions, a rolling hash can be very efficient. It's the reason Rabin-Karp is reasonably fast at string-searching.
- Rabin-Karp's rolling hash is calculated modulo a large prime number. This keeps hash values bounded and tends to distribute them more uniformly.
  - Choosing a large prime makes false positives less likely - where two hashes are the same but the values they represent are not.
  - Choosing a random prime guards against the worst case running time, as no fixed input can be prepared in advance to reliably trigger collisions.
- The rolling hash is usually computed by multiplying by a Base/Radix that is relatively prime to the modulus. This value should be at least as large as the character set. For my implementation I used 256, the number of distinct byte values, which covers ASCII (128 characters) and extended ASCII.
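- A common rolling hash for Rabin-Karp is: `(S[0]*R^(m-1) + S[1]*R^(m-2) + ... + S[m-1]*R^0) % Q`, so the character at position `i` contributes `S[i]*R^(m-1-i)`. Here `S` is our input string, `i` is the position (in a loop/iteration), `R` is the radix/base, `m` is the pattern length, and `Q` is the prime.
  - Using Horner's Method this can be simplified in code to: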
```csharp
private long Hash(string input, int patternLength)
{
    long hash = 0;
    for (int i = 0; i < patternLength; i++)
    {
        int ascii = input[i];
        // Assume Base/Prime are already defined
        hash = (Base * hash + ascii) % Prime;
    }
    return hash;
}
```
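Building on the `Hash` function above, the sliding step (remove the leading character, append the trailing one) might be sketched as follows. This is a minimal illustration, not the post's original implementation: the class name `RollingHashSketch`, the helpers `LeadingWeight` and `Roll`, and the fixed prime `1_000_000_007` are assumptions made for the example, and input is assumed to be ASCII.

```csharp
using System;

public static class RollingHashSketch
{
    // Assumed constants: 256 matches the radix used in the post;
    // the prime is an arbitrary fixed choice for illustration only.
    private const long Base = 256;
    private const long Prime = 1_000_000_007;

    // Hash of the first `length` characters, using Horner's method.
    public static long Hash(string input, int length)
    {
        long hash = 0;
        for (int i = 0; i < length; i++)
        {
            hash = (Base * hash + input[i]) % Prime;
        }
        return hash;
    }

    // Base^(length - 1) % Prime: the weight of the window's leading character.
    public static long LeadingWeight(int length)
    {
        long weight = 1;
        for (int i = 1; i < length; i++)
        {
            weight = (weight * Base) % Prime;
        }
        return weight;
    }

    // Slide the window one position: drop `outgoing`, append `incoming`.
    public static long Roll(long hash, char outgoing, char incoming, long leadingWeight)
    {
        // Adding Prime before the modulo keeps the intermediate value non-negative.
        hash = (hash + Prime - (outgoing * leadingWeight) % Prime) % Prime;
        return (hash * Base + incoming) % Prime;
    }
}
```

Rolling the hash of `"abc"` with `outgoing = 'a'` and `incoming = 'd'` should give the same value as hashing `"bcd"` from scratch, but in constant time rather than time proportional to the window size.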
- Time complexity of Rabin-Karp is linear on average (specifically, expected `O(n + m)` where `n` is input length and `m` is pattern length).
- Brute-force approaches, and worst case Rabin-Karp (more likely with smaller, non-random primes, and/or a bad rolling hash), are `O(n * m)`.
- Space is `O(1)` for a single pattern.
- We can modify Rabin-Karp with a Bloom Filter or simply a hash set to search for multiple patterns.
- In my implementations I return the index (aka offset) the pattern starts at in the input. You could easily modify either the single or multiple pattern implementations to do other things (like count occurrences of patterns).
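The steps listed above can be put together in a short single-pattern search. The following is a hedged sketch rather than the post's original code: the class name `RabinKarpSketch` and the fixed prime are assumptions (a real implementation would pick a random large prime, as discussed above), and input is assumed to be ASCII.

```csharp
using System;

public static class RabinKarpSketch
{
    private const long Base = 256;            // radix used in the post
    private const long Prime = 1_000_000_007; // assumed fixed prime, for illustration only

    // Returns the index of the first occurrence of pattern in input, or -1.
    public static int Search(string input, string pattern)
    {
        int n = input.Length;
        int m = pattern.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        long hashP = Hash(pattern, m);
        long hashI = Hash(input, m);

        // Weight of the window's leading character: Base^(m - 1) % Prime.
        long leadingWeight = 1;
        for (int k = 1; k < m; k++) leadingWeight = (leadingWeight * Base) % Prime;

        // Equal hashes are only a candidate; CheckEqual confirms the match.
        if (hashP == hashI && CheckEqual(input, 0, pattern)) return 0;

        for (int end = m; end < n; end++)
        {
            // Remove input[end - m] from the hash, then append input[end].
            hashI = (hashI + Prime - (input[end - m] * leadingWeight) % Prime) % Prime;
            hashI = (hashI * Base + input[end]) % Prime;

            int start = end - m + 1;
            if (hashP == hashI && CheckEqual(input, start, pattern)) return start;
        }
        return -1;
    }

    // Horner's method over the first `length` characters.
    private static long Hash(string text, int length)
    {
        long hash = 0;
        for (int i = 0; i < length; i++) hash = (Base * hash + text[i]) % Prime;
        return hash;
    }

    // Character-by-character comparison to rule out a hash collision.
    private static bool CheckEqual(string input, int start, string pattern)
    {
        for (int k = 0; k < pattern.Length; k++)
        {
            if (input[start + k] != pattern[k]) return false;
        }
        return true;
    }
}
```

Calling `RabinKarpSketch.Search("abcde", "cd")` slides the window over `"ab"`, `"bc"`, and `"cd"` and reports the offset of the last one. Returning `0` for an empty pattern is a design choice of this sketch, not something the post specifies.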
Below are Rabin-Karp implementations for single & multiple patterns (of the same size). The implementations assume ASCII input and will work for inputs up to a reasonably large size. The rolling hash uses a relatively large random prime to guard against poor running times:
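For the multiple-pattern case, a minimal sketch of the hash-set variant might look like the following. It is an illustration under stated assumptions, not the original implementation: the class name `MultiPatternSketch` is hypothetical, the prime is fixed rather than random, all patterns must share one length, and input is assumed to be ASCII.

```csharp
using System;
using System.Collections.Generic;

public static class MultiPatternSketch
{
    private const long Base = 256;            // radix used in the post
    private const long Prime = 1_000_000_007; // assumed fixed prime, for illustration only

    // Returns the offset of the first window matching any pattern, or -1.
    public static int Search(string input, IEnumerable<string> patterns)
    {
        var patternSet = new HashSet<string>(patterns);
        if (patternSet.Count == 0) return -1;

        int m = -1;
        var hashes = new HashSet<long>();
        foreach (string p in patternSet)
        {
            if (m == -1) m = p.Length;
            else if (p.Length != m) throw new ArgumentException("Patterns must share one length.");
            hashes.Add(Hash(p, m));
        }

        int n = input.Length;
        if (m == 0 || m > n) return -1;

        // Weight of the window's leading character: Base^(m - 1) % Prime.
        long leadingWeight = 1;
        for (int k = 1; k < m; k++) leadingWeight = (leadingWeight * Base) % Prime;

        long hashI = Hash(input, m);
        for (int start = 0; ; start++)
        {
            // A hash hit is only a candidate; confirm against the real strings.
            if (hashes.Contains(hashI) && patternSet.Contains(input.Substring(start, m)))
            {
                return start;
            }
            if (start + m >= n) return -1; // no further window fits

            // Remove input[start], then append input[start + m].
            hashI = (hashI + Prime - (input[start] * leadingWeight) % Prime) % Prime;
            hashI = (hashI * Base + input[start + m]) % Prime;
        }
    }

    // Horner's method over the first `length` characters.
    private static long Hash(string text, int length)
    {
        long hash = 0;
        for (int i = 0; i < length; i++) hash = (Base * hash + text[i]) % Prime;
        return hash;
    }
}
```

Each window costs one hash-set lookup regardless of how many patterns there are, which is why this variant scales better than running a single-pattern search once per pattern. The substring is only built when the window's hash is already in the set, so the extra allocation happens on candidates rather than on every position.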
As always, if you found any errors in this post please let me know!