Pascal Thormeier

Posted on Oct 24, 2020 • Edited on Nov 3, 2020

Algorithm explained: Levenshtein edit distance

#computerscience #nlp #tutorial #php

A little while ago I finished my bachelor's thesis about audio text alignment and stumbled upon the Needleman-Wunsch algorithm. This algorithm, together with its cousin, the Smith-Waterman algorithm, are both used in bioinformatics, but can also be used in NLP. One example is the Levenshtein edit distance: It counts the number of necessary edits to one string to transform it into another.

In this post I will explain how the algorithm works in detail and do a practical example with PHP.

Examine the algorithm

Before we implement the algorithm, we try to understand what it does. PHP has the Levenshtein algorithm already built in, so we can examine its behaviour with a single line of code:

<?php
echo levenshtein('fried', 'fresh'); // 3

That looks reasonable: Replace i with e, replace e with s, replace d with h - that's 3 edits in total. Adding or removing characters are single operations as well:

<?php
echo levenshtein('fiend', 'friend'); // 1 edit: Adding
echo levenshtein('hello', 'hell'); // 1 edit: Removing

How does it behave if a character was moved around?

<?php
echo levenshtein('beer', 'bree'); // 2

2 edits - moving is not counted separately, it will count this as two replacements.

So the Levenshtein algorithm counts the following things as edits:

Replacing characters (1 edit per character)
Adding a character (1 edit per character added)
Removing a character (1 edit per character removed)

With this information we can start to have a look at how the algorithm actually counts those.

Score matrix and trace-back matrix

We need two matrices to implement the algorithm: The score matrix and the trace-back matrix. The score matrix keeps track of the number of edits, the trace-back matrix shows the path of least edits. We don't actually need it to calculate the number of edits, but we'll keep it in diagrams for illustration purposes. There's a near infinite number of possible ways to transform one string into another, we want the one with the least edits.

But first we label a few things:

<?php
declare(strict_types=1);

/**
 * Calculates the Levenshtein distance
 * @param string $a Source string
 * @param string $b Target string
 * @return int Distance
 */
function myLevenshtein(string $a, string $b): int
{
  $m = strlen($a);
  $n = strlen($b);

  return 0; // for now
}

echo myLevenshtein('fried', 'fresh'); // 0

n and m are now the lengths of the strings we're comparing. The matrices will be of size (n+1) * (m+1):

<?php
declare(strict_types=1);

/**
 * Declares a $width*$height sized matrix with `zero` values.
 * @param int $width Width of the matrix
 * @param int $height Height of the matrix
 * @return array The matrix.
 */
function declareMatrix(int $width, int $height): array
{
  return array_fill(
    0, 
    $height, 
    array_fill(0, $width, null)
  );
}

/**
 * Calculates the Levenshtein distance
 * @param string $a Source string
 * @param string $b Target string
 * @return int Distance
 */
function myLevenshtein(string $a, string $b): int
{
  $m = strlen($a);
  $n = strlen($b);

  $scoreMatrix = declareMatrix($m + 1, $n + 1);

  return 0; // for now
}

echo myLevenshtein('fried', 'fresh'); // 0

(For simplicity, I'll leave away the declareMatrix function from now on.)

Ok, so far, so good. Now, what do we do with these matrices?

Filling the matrices

The matrices now basically look like the following:

(The part we initialized is the empty 6x6 grid in the lower-right parts of the matrices. The words only function as labels in this illustration.)

We need to fill these with numbers somehow.

A single cell can represent four different values:

The character at $a[$x] match $b[$y]
The character at $a[$x] doesn't match $b[$y]
The character from $a at $x has to be removed
The character from $b at $y has to be removed

To calculate which of these values a cell holds, we take the following formula:

$isMatch = $a[$x - 1] === $b[$y - 1];

$scoreMatrix[$y][$x] = min(
  // Remove a character from $a:
  $scoreMatrix[$y][$x - 1] + 1, 

  // Remove a character from $b
  $scoreMatrix[$y - 1][$x] + 1,

  // Replace character or match
  $scoreMatrix[$y - 1][$x - 1] + ($isMatch ? 0 : 1)
);

Ok, that's a mouthful. What does it actually do? The formula looks at the left, the top-left and the top neighbors of a given cell. Together with the rest of the score matrix and the trace-back matrix, the algorithm examines all possible ways to align the two strings: By leaving out, replacing or matching characters. The path of least edits is then calculated by creating a path (up, left, diagonal) and always taking the smallest amount of change.

If a character needs to be removed in $a, we add 1 to the score of the cell to the left and that's our new score. If we need to remove a character from $b, we do the same for the above cell.

If the characters match, we don't add anything and take the previous score (no edit necessary). The same goes for a mismatch: A mismatch means 1 edit (replace), so we take the score of the top-left neighbor and add 1 to it.

Taking the smallest resulting score (determined using min()) decides which operation to take. This guarantees that only the smallest number of edits is ever measured. No need to add (go left) and remove (go up) a character, if replacing it (go diagonal) works as well.

But we need to prefill some fields before we can start calculating. I'll use the trace-back matrix to show which operation the resulting score would mean:

And off we go! We calculate the score for the first cell (F/F):

$isMatch = $a[0] === $b[0]; // true

$scoreMatrix[1][1] = min(
  // Remove a character from $a
  $scoreMatrix[1][0] + 1, // that's 2

  // Remove a character from $b
  $scoreMatrix[0][1] + 1, // that's also 2

  // Replace character or match
  $scoreMatrix[0][0] + ($isMatch ? 0 : 1) // true, so + 0
);

// Is evaluated to:
$scoreMatrix[1][1] = min(
  // Remove a character from $a
  2,

  // Remove a character from $b
  2,

  // Replace character or match
  0
); // === 0, a match was detected.

The formula detected a match: It doesn't add any edit. After looking at the first characters, the edit distance is therefore 0 (no edit necessary). Let's have a look at the next cell to the right (F of FRESH/R of FRIED):

$isMatch = $a[0] === $b[1]; // false

$scoreMatrix[1][2] = min(
  // Remove a character from $a
  $scoreMatrix[1][1] + 1, // that's 2 + 1

  // Remove a character from $b
  $scoreMatrix[0][2] + 1, // that's 0 + 1 === 1

  // Replace character or match
  $scoreMatrix[0][1] + ($isMatch ? 0 : 1) // false, so + 1
);

// Is evaluated to:
$scoreMatrix[1][1] = min(
  // Remove a character from $a
  1,

  // Remove a character from $b
  3,

  // Replace character or match
  2
); // === 1, so for this cell, to reach an ideal alignment, remove a character from $a.

Now the algorithm want's to remove a character! That makes no sense, why is that? The algorithm tries to find the amount of least edits necessary to transform one string into another. It might proof that removing a character at that place is actually the best option. It tries multiple different paths before ultimately settling for a single one. Let's have a look at the matrices with some more cells filled in:

And here we start to see a pattern in the trace-back matrix: The "ups" and "lefts" seem to always lead to the diagonal. This is because matches and replacements are actually the path of least edits. This is what the matrices look like when they're filled in completely:

We see that the bottom-right corner of the trace-back matrix says "diag", which means, it chose the "replace" operation. When we follow the path until "done", we'll see that the algorithm only ever does matches and replaces, which is exactly what we would have guessed.

The Levenshtein distance can now be read out: The bottom-right corner of the score matrix says 3 - which is the smallest possible edit distance.

Another example

Let's try hell and hello to illustrate what would happen if a character should be removed:

Indeed: The path on the trace-back matrix consists of 4 matches and one removing, resulting in an edit distance of 1. The score matrix says the same.

The implementation

So in order to calculate the score for a given cell, we introduce a new function getScore:

<?php

// ...

/**
 * Determines the score for a given cell
 * @param int $x The X-coordinate to determine the score for
 * @param int $y The Y-coordinate to determine the score for
 * @param array $matrix The entire matrix
 * @param string $a First word
 * @param string $b Second word
 * @return int The score for the given cell coordinates
 */
function getScore(int $x, int $y, array $matrix, string $a, string $b): int
{
  $isMatch = $a[$x - 1] === $b[$y - 1];

  return min(
    $matrix[$y - 1][$x] + 1,
    $matrix[$y][$x - 1] + 1,
    $matrix[$y - 1][$x - 1] + ($isMatch ? 0 : 1)
  );
}

As mentioned, we don't need the trace-back matrix in the implementation, so our final Levenshtein implementation looks like this:

<?php

// ...

/**
 * Calculates the Levenshtein distance
 * @param string $a Source string
 * @param string $b Target string
 * @return int Distance
 */
function myLevenshtein(string $a, string $b): int
{
  $m = strlen($a);
  $n = strlen($b);

  $scoreMatrix = declareMatrix($m + 1, $n + 1);

  // Prefill matrix edges
  $scoreMatrix[0][0] = 0;
  for ($y = 1; $y < $n + 1; $y++) {
    $scoreMatrix[$y][0] = $y;
  }
  for ($x = 1; $x < $m + 1; $x++) {
    $scoreMatrix[0][$x] = $x;
  }

  // Determine the score for every cell
  for ($y = 1; $y < $n + 1; $y++) {
    for ($x = 1; $x < $m + 1; $x++) {
      $scoreMatrix[$y][$x] = getScore($x, $y, $scoreMatrix, $a, $b);
    }
  }

  // Get the edit distance from the bottom-right cell
  return $scoreMatrix[$n][$m];
}

echo myLevenshtein('fried', 'fresh'); // 3
echo myLevenshtein('hello', 'hell'); // 0

Testing the implementation

Let's generate 1000 words and compare the Levenshtein distance our function calculates and the one calculated by the built-in function:

// Generate a number of random words.
$words = array_map(function() {
  $chars = range('A', 'Z');
  return implode('', array_map(function() use ($chars) {
    return $chars[mt_rand(0, 25)];
  }, range(0, mt_rand(1, 40))));
}, range(0, 1000));

$copiedWords = $words;
shuffle($copiedWords);
$pairs = array_map(null, $words, $copiedWords);

$result = true;
foreach ($pairs as $pair) {
    $result = $result && (myLevenshtein($pair[0], $pair[1]) === levenshtein($pair[0], $pair[1]));
}

var_dump($result); // true

Seems to be working just like the built-in version! Here's a PHPSandbox with the code!

Takeaway thoughts

In most cases the Levenshtein algorithm will follow the diagonal path. It only ever adds or removes characters if the two string lengths are not equal. It's also quite costly at O(m*n)

The Levenshtein distance is a very handy tool. I think it's pretty clever to use an algorithm from bioinformatics for NLP!

I write tech articles in my free time. If you enjoyed reading this post, consider buying me a coffee!

Top comments (2)

Cookie • Feb 9 '23

Hello, be aware that some (otherwise VERY nice) illustrations are wrong.
The score matrix and trackback matrix of the word pair "fresh, fried" contains calculation/logic errors:

for its score matrix:

cell E-E should read 1 not 2
cell E-D should therefor read 2 not 3
cell S-D should therefor read 2 not 3

for its trackback matrix:

cell E-D should indicate 'left'

Besides that, they are very clear and beautifully hand made! Nice diversion of the common computer generated ones.

Pascal Thormeier • Feb 10 '23

Hey Cookie,

I just re-executed the code and had a look at the score matrix generated for FRIED/FRESH. You're absolutely right! I don't know how these errors got in there, but they're there. Perks of doing stuff by hand, I guess :D

I'll correct them and update the post ASAP, thank you very much for noticing!