rabin karp
Write a function that finds instances of a substring in a string in linear time.
Solution
Hashing the substring
Creating a hash works a bit like fingerprinting: it enables condensing a long and complex piece of data, like a string, into a simpler one, like an integer
.
The hash for pyth
is 2053428
.
Hashing a string works letter by letter:
- 1/ multiply the previous result by the base (26, as there are 26 letters in the ascii lowercase alphabet)
- 2/ add the ascii value of the current letter
- 3/ go to step 1 for the next letter
This does not ensure that the hash 2053428
can only represent pyth
since hashes can collide. But the probability that it represents pyth
is very high.
We create a hash, because comparing substrings is now only a matter of comparing two integers. It also enables us to roll the hash down the string in constant time.
Likewise, we'll pre-compute the hash for algo.
Rolling the hash
Let's replace our letters with variables a
, b
, c
and d
.
We can also replace our base with , and re-organise our equation a little.
We can use this rearranged formula, , to remove the left-most letter, and add a new letter to the right.
Here's an implementation, with pre-computed hashes:
1def rabin_karp(s, t):
2 s_n, t_n = len(s), len(t)
3 rolling = reduce(lambda h, c: h * 26 + ord(c), s[:t_n], 0)
4 target = reduce(lambda h, c: h * 26 + ord(c), t, 0)
5 for i in range(t_n, s_n):
6 if rolling == target and s[i - t_n : i] == t:
7 return i - t_n
8 rolling = (rolling - (ord(s[i - t_n]) * (26 ** (t_n - 1)))) * 26 + ord(s[i])
9 return s_n - t_n if rolling == target and s[-t_n:] == t else -1
With comments:
1def rabin_karp(s, t):
2 s_n, t_n = len(s), len(t)
3 # pre-compute our rolling hash for the string
4 rolling = reduce(lambda h, c: h * 26 + ord(c), s[:t_n], 0)
5 # pre-compute our target hash
6 target = reduce(lambda h, c: h * 26 + ord(c), t, 0)
7 # iterate from the end of the target to the end of the string
8
9 for i in range(t_n, s_n):
10 # if the rolling hash matches our target hash
11 # we still need to check for hash collisions
12 # by doing a string comparison
13 if rolling == target and s[i - t_n : i] == t:
14 return i - t_n
15 # a is the left-most character in the rolling hash
16 a = ord(s[i - t_n])
17 # here we multiply it by B^(n-1)
18 aBnm1 = a * (26 ** (t_n - 1))
19 # remove leftmost character from rolling hash
20 rolling -= aBnm1
21 # multiply all characters by the base (distributive property)
22 rolling *= 26
23 # add the new character to the rolling hash
24 rolling += ord(s[i])
25
26 # check the last iteration
27 return s_n - t_n if rolling == target and s[-t_n:] == t else -1
Here's another implementation, that doesn't require hashes to be pre-computed, or to check the last iteration separately:
1def rabin_karp(s, t):
2 s_n, t_n, H, T, s_pow = len(s), len(t), 0, 0, 26 ** (len(t) - 1)
3 if len(t) <= len(s) and len(t):
4 for i in range(s_n):
5 if i < t_n:
6 H, T = H * 26 + ord(s[i]), T * 26 + ord(t[i])
7 else:
8 H = (H - (ord(s[i - t_n]) * s_pow)) * 26 + ord(s[i])
9 if H == T and s[i - t_n + 1 : i + 1] == t:
10 return i - t_n + 1
11 return -1 if len(t) else 0