levenshtein distance

🏠

The levenshtein distance is how many edits need to be made on word A to transform it into word B. Characters can be:

The levenstein distance is symmetrical, meaning transforming word A into word B takes as many operations as transforming word B into word A.

When computing the levenshtein distance using dynamic programming, the first step should be to focus on the transformations, and decisions taking place at each step.

Let's take saturday and sunday as examples.

By simply removing the letters in common, we can already get a hint that the distance is 3.

Let's now think about this in terms of transformations of saturday. Starting with:

Looking at the animation above should make it clear as day that 3 edits are required, thus the levenshtein distance is 3.

Defining the strategy

Clipping suffixes

As you can see, when the ends of the words match, we can simply clip them off one letter at a time.

1if a[-1] == b[-1]:
2  return dp(a[:-1], b[:-1])

This performs the:

transformations.

Next we'll try 3 different transformations, and see which one resulted in the shortest levenshtein distance (yes this is recursive, we'll come to that later).

1/ adding a character to a
2/ substituting the last character in a
3/ deleting the last character from a

adding b's last character to a

This would for example tranform:

because we are copying the last character from b, and appending it to a. At which point, we would, once again ask, "what's the levenshstein distance between: saturn and sun?".

substituting a's last character with b's last character

This would for example tranform:

At which point, we would, once again ask, "what's the levenshstein distance between: satun and sun?".

deleting a's last character

This would for example tranform:

At which point, we would, once again ask, "what's the levenshstein distance between: satu and sun?".

Taking the minimum

Remember we are interested in the minimum transformations. This is hard to reason about, as the definition is recursive. What transformations can i make that will result in the shortest levenshtein distance?

This does not apply to clipping characters off the end of both words, since we perform the same operation on both a and b, the levenshtein distance ends up being the same.

However, we tried a bunch of transformations on a, i.e adding substituting and deleting the last character. This is no smarter than trial and error.. we're just trying things out and asking the question again on these substrings.

The important point however, is we need to ask what resulted in the minimum distance for these transformations, ie.

Asking this question gives us the answer of how many transformations are required to transform substring a into substring b. However, If you observe carefully, you'll notice we performed an operation ourselves before asking the question: we added, substituted, or deleted in a first! So we need the: +1 to register that fact.

Base cases

But what happens when one string is empty? What's the levenshtein distance between:

The distance is 3! The length of sun. You'd need to add 3 characters to a for it to become b. Likewise, what's the distance between:

The distance is 4! You'd need to delete 4 characters from a to make it b. So our base cases are:

1if len(a) == 0 or len(b) == 0:
2  return len(a) or len(b)

And that's it! You have it. This is called a top down approach, because we start with the full words, and break it down step by step, recursively, while caching the solutions.

You could also compute the bottom up approach, by asking the question: "what's the levelshtein distance between s and s?" then progressively make the string bigger.

Here's our first implementation:

 1import functools
 2
 3
 4def lev(A, B):
 5    @functools.lru_cache(None)
 6    def dp(a, b):
 7        if len(a) == 0 or len(b) == 0:
 8            return len(a) or len(b)
 9
10        if a[-1] == b[-1]:
11            return dp(a[:-1], b[:-1])
12
13        add_last = dp(a + b[-1], b)
14        sub_last = dp(a[:-1] + b[-1], b)
15        del_last = dp(a[:-1], b)
16        return 1 + min(sub_last, add_last, del_last)
17
18    return dp(A, B)
19
20
21d = lev("saturday", "sunday")
22print(d)

Reducing the number of calls

Next, we can actually make some optimisations. There's no point adding b's last character to a's. See:

Once we compare saturn and sun the first thing we'll need to do is chop off the matching 'n' at the end of each strings. So instead of adding b's last character onto a, we can simply chop the last character off of b.

Yes adding b's last character to a ends up deleting the last character off of b, and leaving a intact. This optimisation saves on the number of calls.

1add_last = dp(a, b[:-1])

Likewise, if we substitute the last character off of a with the last character off of b, well both last characters end up getting chopped off, we can can simply cut to the chase.

1sub_last = dp(a[:-1], b[:-1])

Our optimised implementation:

 1import functools
 2
 3
 4def lev(A, B):
 5    @functools.lru_cache(None)
 6    def dp(a, b):
 7        if len(a) == 0 or len(b) == 0:
 8            return len(a) or len(b)
 9
10        if a[-1] == b[-1]:
11            return dp(a[:-1], b[:-1])
12
13        add_last = dp(a, b[:-1])
14        del_last = dp(a[:-1], b)
15        sub_last = dp(a[:-1], b[:-1])
16        return 1 + min(sub_last, add_last, del_last)
17
18    return dp(A, B)
19
20
21d = lev("saturday", "sunday")
22print(d)

passing around indices instead of substrings

Yet another optimisation we can make is noticing we don't need to pass a substring, but instead we can simply pass the indices of the character being worked on:

 1import functools
 2
 3
 4def lev(A, B):
 5    @functools.lru_cache(None)
 6    def dp(ai, bi):
 7        if ai < 0:
 8            return bi + 1
 9        elif bi < 0:
10            return ai + 1
11
12        if A[ai] == B[bi]:
13            return dp(ai - 1, bi - 1)
14
15        sub_last = dp(ai - 1, bi - 1)
16        add_last = dp(ai, bi - 1)
17        del_last = dp(ai - 1, bi)
18        return 1 + min(sub_last, add_last, del_last)
19
20    return dp(len(A) - 1, len(B) - 1)
21
22
23d = lev("saturday", "sunday")
24print(d)

This optimisation does, however, come at a great cost in terms of complexity (of understanding). You can make it easier to debug by creating substring variables at the top of it:

1a = A[:ai+1]
2b = B[:bi+1]

This is probably the form you want to remember, and be able to code up. The following version is less legible, but would still be acceptable.

 1from functools import lru_cache
 2
 3def lev(A, B):
 4    @lru_cache(None)
 5    def dp(ai, bi):
 6        if ai < 0 or bi < 0:
 7            return (ai, bi)[ai < 0] + 1
 8        if A[ai] == B[bi]:
 9            return dp(ai - 1, bi - 1)
10        return 1 + min(dp(ai - 1, bi - 1), dp(ai, bi - 1), dp(ai - 1, bi))
11    return dp(len(A) - 1, len(B) - 1)
12
13d = lev("saturday", "sunday")
14print(d)

See also:

minimum deletion