rabin karp

🏠

Write a function that finds instances of a substring in a string in linear time.

Solution

Hashing the substring

Creating a hash works a bit like fingerprinting: it enables condensing a long and complex piece of data, like a string, into a simpler one, like an integer.

The hash for pyth is 2053428.

Hashing a string works letter by letter:

This does not ensure that the hash 2053428 can only represent pyth since hashes can collide. But the probability that it represents pyth is very high.

We create a hash, because comparing substrings is now only a matter of comparing two integers. It also enables us to roll the hash down the string in constant time.

Likewise, we'll pre-compute the hash for algo.

Rolling the hash

Let's replace our letters with variables a, b, c and d.

We can also replace our base with , and re-organise our equation a little.

We can use this rearranged formula, , to remove the left-most letter, and add a new letter to the right.

Here's an implementation, with pre-computed hashes:

1def rabin_karp(s, t):
2    s_n, t_n = len(s), len(t)
3    rolling = reduce(lambda h, c: h * 26 + ord(c), s[:t_n], 0)
4    target = reduce(lambda h, c: h * 26 + ord(c), t, 0)
5    for i in range(t_n, s_n):
6        if rolling == target and s[i - t_n : i] == t:
7            return i - t_n
8        rolling = (rolling - (ord(s[i - t_n]) * (26 ** (t_n - 1)))) * 26 + ord(s[i])
9    return s_n - t_n if rolling == target and s[-t_n:] == t else -1

With comments:

 1def rabin_karp(s, t):
 2    s_n, t_n = len(s), len(t)
 3    # pre-compute our rolling hash for the string
 4    rolling = reduce(lambda h, c: h * 26 + ord(c), s[:t_n], 0)
 5    # pre-compute our target hash
 6    target = reduce(lambda h, c: h * 26 + ord(c), t, 0)
 7    # iterate from the end of the target to the end of the string
 8
 9    for i in range(t_n, s_n):
10        # if the rolling hash matches our target hash
11        # we still need to check for hash collisions
12        # by doing a string comparison
13        if rolling == target and s[i - t_n : i] == t:
14            return i - t_n
15        # a is the left-most character in the rolling hash
16        a = ord(s[i - t_n])
17        # here we multiply it by B^(n-1)
18        aBnm1 = a * (26 ** (t_n - 1))
19        # remove leftmost character from rolling hash
20        rolling -= aBnm1
21        # multiply all characters by the base (distributive property)
22        rolling *= 26
23        # add the new character to the rolling hash
24        rolling += ord(s[i])
25
26    # check the last iteration
27    return s_n - t_n if rolling == target and s[-t_n:] == t else -1

Here's another implementation, that doesn't require hashes to be pre-computed, or to check the last iteration separately:

 1def rabin_karp(s, t):
 2    s_n, t_n, H, T, s_pow = len(s), len(t), 0, 0, 26 ** (len(t) - 1)
 3    if len(t) <= len(s) and len(t):
 4        for i in range(s_n):
 5            if i < t_n:
 6                H, T = H * 26 + ord(s[i]), T * 26 + ord(t[i])
 7            else:
 8                H = (H - (ord(s[i - t_n]) * s_pow)) * 26 + ord(s[i])
 9            if H == T and s[i - t_n + 1 : i + 1] == t:
10                return i - t_n + 1
11    return -1 if len(t) else 0