DEV Community

Cover image for ⚡ Levenshtein Distance: A Powerful Algorithm for String Comparison ⚡
Best Codes
Best Codes

Posted on

⚡ Levenshtein Distance: A Powerful Algorithm for String Comparison ⚡

Notice: Artificial Intelligence was a source of information, style changes, rephrasing, code generating, and ideas in this article. While this is not written by AI, AI was used.

Have you ever wondered how to measure the similarity between two strings? Whether you're working on spell-checking, DNA sequence analysis, Natural Language Processing, or even fuzzy search algorithms, the Levenshtein Distance algorithm is a powerful tool that can help you achieve accurate results. In this article, we'll look at the concept of Levenshtein Distance, explore how it works, and analyze Levenshtein Distance in JavaScript. So, let's get started!

What is Levenshtein Distance?

The Levenshtein Distance, also known as the Edit Distance, is a metric used to measure the difference between two strings. It calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. The higher the Levenshtein Distance, the more different the two strings are.
The concept of Levenshtein Distance was introduced by the Soviet mathematician Vladimir Levenshtein in 1965, and it was subsequently named after him. His work on this distance laid the foundation for its applications in various fields, making it a fundamental tool in string comparison and analysis.

How does Levenshtein Distance work?

The algorithm behind Levenshtein Distance is based on dynamic programming. It breaks down the problem into smaller sub-problems and solves them iteratively, building up the solution step by step. Here's an overview of how the algorithm works:

  1. The levenshteinDistance function takes two input strings, str1 and str2.

  2. It initializes variables m and n to store the lengths of str1 and str2 respectively.

  3. A matrix is created using the Array.from() method with the length of m + 1 and each element initialized with an array of length n + 1 filled with zeros. This matrix will store the Levenshtein distances between substrings of str1 and str2.

  4. The first row and column of the matrix are initialized with values representing the number of operations required to transform an empty string into the corresponding substring of str1 or str2.

  5. The matrix is then filled in using a nested loop. Each cell (i, j) in the matrix represents the minimum number of operations required to transform the substring str1[0...i-1] into the substring str2[0...j-1].

  6. If the characters at the corresponding positions in str1 and str2 are the same, the value in the current cell (i, j) is set to the value in the diagonal cell (i-1, j-1), as no operation is needed.

  7. If the characters are different, the value in the current cell (i, j) is set to the minimum of the three adjacent cells (i-1, j) + 1, (i, j-1) + 1, and (i-1, j-1) + 1. This accounts for the deletion, insertion, and substitution operations respectively.

  8. Finally, the function returns the value in the bottom right cell (m, n), which represents the minimum number of operations required to transform the entire str1 into str2.

Let's look at an example of transforming the word kitten into sitting:

      |   | s | i | t | t | i | n | g |
      |---|---|---|---|---|---|---|---|
      |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
      | k | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
      | i | 2 | 1 | 1 | 2 | 3 | 4 | 5 |
      | t | 3 | 2 | 2 | 1 | 2 | 3 | 4 |
      | t | 4 | 3 | 3 | 2 | 1 | 2 | 3 |
      | e | 5 | 4 | 4 | 3 | 2 | 2 | 3 |
      | n | 6 | 5 | 5 | 4 | 3 | 3 | 2 |
Enter fullscreen mode Exit fullscreen mode

The text above represents a matrix that is generated by the Levenshtein distance algorithm when calculating the minimum number of operations required to transform one string into another.

Each cell in the matrix represents the minimum number of operations required to transform a substring of the first string into a substring of the second string. The rows and columns of the matrix correspond to the characters of the two strings, with the first row and column representing the empty string and the remaining rows and columns representing the characters of the strings.

In this specific example, the matrix represents the Levenshtein distance between the strings “kitten” and “sitting”. The numbers in the cells represent the minimum number of operations required to transform the corresponding substrings.

For example, the cell (1, 2) represents the minimum number of operations required to transform the substring “k” into the substring “si”, which is 1. Similarly, the cell (3, 5) represents the minimum number of operations required to transform the substring “kit” into the substring “sitt”, which is 2.

By examining the matrix, you can see the progression of the minimum number of operations required as you move across the rows and down the columns. The last cell in the bottom right corner of the matrix, (6, 7), represents the minimum number of operations required to transform the entire “kitten” into “sitting”, which is 2 in this case.

An image is 1000 words

Why is Levenshtein Distance useful?

The Levenshtein Distance algorithm has a wide range of applications in various fields. Here are a few examples:

  1. Spell-checking: Levenshtein Distance can be used to suggest corrections for misspelled words by finding the closest matching words in a dictionary.
  2. DNA sequence analysis: It helps identify similarities and differences between DNA sequences, aiding in genetic research and analysis.
  3. Fuzzy search: Levenshtein Distance can be used to implement fuzzy search algorithms that find approximate matches for search queries, even when there are minor spelling mistakes or variations in the input.
  4. Natural Language Processing: In NLP, Levenshtein Distance is used for Fuzzy Matching, Text Categorization, Keyword Extraction, and Named Entity Recognition (NER).

Implementing Levenshtein Distance in Common Languages

Now that we understand the concept and significance of Levenshtein Distance, let's dive into some code implementation in common languages:

JavaScript

function levenshteinDistance(str1, str2) {
  const m = str1.length;
  const n = str2.length;

  // Create the matrix
  const matrix = Array.from({ length: m + 1 }, () => Array(n + 1).fill(0));

  // Initialize the first row and column
  for (let i = 0; i <= m; i++) {
    matrix[i][0] = i;
  }

  for (let j = 0; j <= n; j++) {
    matrix[0][j] = j;
  }

  // Fill in the matrix
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      if (str1[i - 1] === str2[j - 1]) {
        matrix[i][j] = matrix[i - 1][j - 1];
      } else {
        matrix[i][j] = Math.min(
          matrix[i - 1][j] + 1, // Deletion
          matrix[i][j - 1] + 1, // Insertion
          matrix[i - 1][j - 1] + 1 // Substitution
        );
      }
    }
  }

  // Return the Levenshtein Distance
  return matrix[m][n];
}

// Example usage
const distance = levenshteinDistance("kitten", "sitting");
console.log(distance); // Output: 3
Enter fullscreen mode Exit fullscreen mode

In the code snippet above, we define the Levenshtein Distance function that takes two strings, str1 and str2, as input. It follows the steps we discussed earlier to calculate the Levenshtein Distance between the two strings. Finally, we return the value in the bottom-right cell of the matrix, which represents the minimum number of edits required.

Python

def levenshtein_distance(str1, str2):
    m = len(str1)
    n = len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i

    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])

    return dp[m][n]
Enter fullscreen mode Exit fullscreen mode

C

#include <stdio.h>
#include <string.h>

int min(int a, int b, int c) {
    if (a < b) {
        if (a < c) {
            return a;
        } else {
            return c;
        }
    } else {
        if (b < c) {
            return b;
        } else {
            return c;
        }
    }
}

int levenshtein_distance(char* str1, char* str2) {
    int m = strlen(str1);
    int n = strlen(str2);
    int dp[m + 1][n + 1];

    for (int i = 0; i <= m; i++) {
        dp[i][0] = i;
    }

    for (int j = 0; j <= n; j++) {
        dp[0][j] = j;
    }

    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (str1[i - 1] == str2[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1];
            } else {
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]);
            }
        }
    }

    return dp[m][n];
}

int main() {
    char str1[] = "kitten";
    char str2[] = "sitting";

    int distance = levenshtein_distance(str1, str2);

    printf("Levenshtein Distance: %d\n", distance);

    return 0;
}
Enter fullscreen mode Exit fullscreen mode

Conclusion

In this blog post, we explored the concept of Levenshtein Distance, a powerful algorithm for measuring the difference between two strings. We discussed how it works, its applications in various fields, and provided a detailed code implementation in JavaScript. By understanding and implementing the Levenshtein Distance algorithm, you can enhance your string comparison and analysis capabilities in your projects.

Remember, the Levenshtein Distance algorithm is just one of many tools available for string comparison. Depending on your specific use case, there may be other algorithms or techniques that better suit your needs. However, the Levenshtein Distance algorithm remains a fundamental and widely-used approach in the field of string similarity measurement.

I hope this post has provided you with valuable insights into Levenshtein Distance and its applications. Feel free to experiment with the code and explore further possibilities. Happy coding!

Also, check out my site and the Wikipedia Articles:

https://en.wikipedia.org/wiki/Levenshtein_distance
https://the-best-codes.github.io/

Thanks for reading!

Top comments (8)

Collapse
 
robertobutti profile image
Roberto B.

Great article!

In PHP we are lucky, we have the native function:

levenshtein("kitten", "sitting");
Enter fullscreen mode Exit fullscreen mode
Collapse
 
best_codes profile image
Best Codes

Wow, that's awesome! It must be nice having it built in.

Collapse
 
artxe2 profile image
Yeom suyun

If you were to create a Levenshtein function as a database function, would it yield reasonable performance?
While functionalities like fuzzy queries may require the use of Elasticsearch, I'm curious about this approach.

Collapse
 
best_codes profile image
Best Codes

I believe it would, but I'm not completely sure what you're asking…

Collapse
 
artxe2 profile image
Yeom suyun

How much slower would it be, for example, compared to performing a Like search in a database?

Thread Thread
 
best_codes profile image
Best Codes

Not very, if any, as searches already use this concept (if they do similarity and not exact match already). It would take longer than a typical exact match search, but how much longer varies greatly by how much data you have.

Collapse
 
rgolawski profile image
Rafał Goławski

Cool post, added to my "Today I Learned" list 📚

Collapse
 
best_codes profile image
Best Codes

Thanks! I'm glad it was interesting.