Fabrizio Bagalà

Posted on May 26, 2023 • Edited on Jun 19, 2023

Levenshtein Distance

#programming #computerscience #algorithms #csharp

🔄 Updates
June 01, 2023 - Added an implementation with modifiable operation costs. Thanks to @cicirello for the suggestion.

The Levenshtein distance, or the edit distance, is a crucial algorithm in the landscape of computer science, especially pertinent in text analysis and natural language processing (NLP). This mathematical yardstick quantifies the 'distance' between two character strings by computing the least number of operations required to transform one string into the other.

Brief history

Named after Russian scientist Vladimir Levenshtein who introduced it in 1965, the Levenshtein distance calculates the difference between two sequences using insertion, deletion, and substitution operations. The significance of this concept has transcended its original purpose over time, finding applications in a myriad of contexts, ranging from spell-checking algorithms to voice recognition systems.

How it works

The calculation of the Levenshtein distance relies on dynamic programming. It involves constructing an m x n matrix, where m and n denote the lengths of the two strings being compared. Each cell (i, j) in this matrix encapsulates the Levenshtein distance between the first i characters of the first string and the first j characters of the second string.

The algorithm revolves around three elementary operations:

Insertion: Appending a new character to the string.
Deletion: Eradicating a character from the string.
Substitution: Replacing one character with another.

The algorithm computes the minimum distance for each cell, taking into account all potential operations. This computation depends on the values of neighboring cells, exemplifying the dynamic programming approach.

Here is a simple implementation in C#:

public int LevenshteinDistance(string source, string target)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(target);

    var m = source.Length;
    var n = target.Length;
    var distance = new int[m + 1, n + 1];

    for (var i = 0; i <= m; i++)
    {
        distance[i, 0] = i;
    }

    for (var j = 0; j <= n; j++)
    {
        distance[0, j] = j;
    }

    for (var i = 1; i <= m; i++)
    {
        for (var j = 1; j <= n; j++)
        {
            var cost = target[j - 1] == source[i - 1] ? 0 : 1;

            var deletion = distance[i - 1, j] + 1;
            var insertion = distance[i, j - 1] + 1;
            var substitution = distance[i - 1, j - 1] + cost;

            distance[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
        }
    }

    return distance[m, n];
}

Time complexity

Given that the Levenshtein distance algorithm operates on an m x n matrix, the time complexity is O(mn), making it polynomial. This is because each cell in the matrix has to be filled, and the number of cells is proportional to the product of the lengths of the two strings.

Space complexity

Similarly, the space complexity is also O(mn), as the entire matrix needs to be stored in memory. In some variations of the algorithm, the space complexity can be reduced to O(min(m, n)) by maintaining only the current and previous row (or column) in memory.

Other implementations

Let us explore below some implementations which, although similar to the original implementation, have some differences.

👉 Using two rows

public int LevenshteinDistance(string source, string target)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(target);

    var m = source.Length;
    var n = target.Length;

    if (m == 0)
    {
        return n;
    }

    if (n == 0)
    {
        return m;
    }

    var previousRow = new int[n + 1];
    var currentRow = new int[n + 1];

    for (var j = 0; j <= n; j++)
    {
        previousRow[j] = j;
    }

    for (var i = 1; i <= m; i++)
    {
        currentRow[0] = i;

        for (var j = 1; j <= n; j++)
        {
            var cost = source[i - 1] == target[j - 1] ? 0 : 1;

            var deletion = previousRow[j] + 1;
            var insertion = currentRow[j - 1] + 1;
            var substitution = previousRow[j - 1] + cost;

            currentRow[j] = Math.Min(Math.Min(deletion, insertion), substitution);
        }

        (previousRow, currentRow) = (currentRow, previousRow);
    }

    return previousRow[n];
}

In this version, a null check is first performed for the source and target strings are performed first, throwing an ArgumentNullException if either is null.

The algorithm returns the length of target or source if either string length is 0, as the Levenshtein distance is equivalent to the length of the other string.

It then initializes two integer arrays previousRow and currentRow, both of size n + 1, where n is the length of target. previousRow is filled with values from 0 to n.

Iterating over each character in source, currentRow[0] is set to the current source index. In a nested loop for each character in target, it calculates the cost of substitution, deletion, insertion, and substitution costs. Then, currentRow[j] is updated to the minimum of these costs.

At the end of the outer loop, previousRow and currentRow are swapped, preparing previousRow for the next iteration with the updated distances.

After all iterations, it returns the last value in previousRow, which is the Levenshtein distance between source and target.

This algorithm has a space complexity of O(n) and a time complexity of O(mn), where n and m are the lengths of the target and source strings, respectively.

👉 Using one row

public int LevenshteinDistance(string source, string target)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(target);

    if (source.Length < target.Length)
    {
        (source, target) = (target, source);
    }

    var m = source.Length;
    var n = target.Length;

    var currentRow = new int[n + 1];

    for (var i = 0; i <= n; i++)
    {
        currentRow[i] = i;
    }

    for (var i = 1; i <= m; i++)
    {
        var previousDiagonal = currentRow[0];
        currentRow[0] = i;

        for (var j = 1; j <= n; j++)
        {
            var previousDiagonalCopy = currentRow[j];

            var cost = source[i - 1] == target[j - 1] ? 0 : 1;

            var deletion = currentRow[j - 1] + 1;
            var insertion = currentRow[j] + 1;
            var substitution = previousDiagonal + cost;

            currentRow[j] = Math.Min(Math.Min(deletion, insertion), substitution);

            previousDiagonal = previousDiagonalCopy;
        }
    }

    return currentRow[n];
}

In this version, a null check is first performed for the source and target strings are performed first, throwing an ArgumentNullException if either is null.

The strings are then swapped if source is shorter than target to optimize space complexity.

The algorithm initializes an integer array currentRow of size n + 1, which represents the current row in the Levenshtein distance matrix, filled with values from 0 to n.

It iterates over each character in source, saving the first value of currentRow to previousDiagonal and updating the first value to the current source index.

In a nested loop, it iterates over each character in target. It calculates the cost of substitution, and also the deletion and insertion costs. Then, currentRow[j] is set to the smallest among deletion, insertion, and substitution. previousDiagonal is updated for the next iteration.

After all iterations, the function returns the last value in currentRow, representing the Levenshtein distance between source and target.

This function has a space complexity of O(n) and a time complexity of O(mn), where n and m are the lengths of the target and source strings, respectively.

👉 With modifiable operation costs

public int LevenshteinDistance(string source, string target, int insertionCost, int deletionCost, int substitutionCost)
{
    if (insertionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(insertionCost));
    }

    if (deletionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(deletionCost));
    }

    if (substitutionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(substitutionCost));
    }

    var m = source.Length;
    var n = target.Length;

    var distance = new int[m + 1, n + 1];

    for (var i = 0; i <= m; i++)
    {
        distance[i, 0] = i * deletionCost;
    }

    for (var j = 0; j <= n; j++)
    {
        distance[0, j] = j * insertionCost;
    }

    for (var i = 1; i <= m; i++)
    {
        for (var j = 1; j <= n; j++)
        {
            var cost = source[i - 1] == target[j - 1] ? 0 : substitutionCost;

            var deletion = distance[i - 1, j] + deletionCost;
            var insertion = distance[i, j - 1] + insertionCost;
            var substitution = distance[i - 1, j - 1] + cost;

            distance[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
        }
    }

    return distance[m, n];
}

In this version, the user can define the costs of insertion, deletion, and substitution. Initially, the algorithm checks if these costs are negative, throwing an ArgumentException exception if they are.

Next, the algorithm calculates the length of the two strings to be compared, called source and target. These lengths are stored in the variables m and n.

Subsequently, the algorithm creates a two-dimensional matrix, distance, of size (m + 1) x (n + 1) to hold the intermediate and final transformation costs.

The algorithm then initializes the first column of the matrix with the progressive deletion costs of the source string, and the first row with the progressive insertion costs into the target string.

The next step involves iterating over each character in both strings. For each pair of characters, the algorithm calculates the substitution cost (which is 0 if the characters are identical, otherwise it's the provided substitution cost), the deletion cost, and the insertion cost. The minimum among these three costs is then assigned to the corresponding element in the distance matrix.

Finally, the bottom-right element of the matrix represents the minimum Levenshtein distance between the two strings. This value is returned as the algorithm's output.

Similarity percentage

To determinate the similarity percentage between two strings leveraging the Levenshtein distance, initially, the greatest potential Levenshtein distance is calculated - equivalent to the length of the longer string. Then, the ratio of the actual Levenshtein distance to this maximum is evaluated.

The formula is this:

In this formula, a similarity of 1.0 denotes identical strings, while a similarity of 0.0 denotes completely distinct strings.

A representation in C#:

public double CalculateSimilarity(string source, string target)
{
    var distance = LevenshteinDistance(source, target);
    var maxLength = Math.Max(source.Length, target.Length);
    return 1.0 - (double)distance / maxLength;
}

This function calculates the Levenshtein distance between the source and target strings, determines the maximum length between the two, and finally computes the similarity percentage based on the aforementioned formula.

Applications

The Levenshtein distance has a broad range of applications:

Spell-checking: Levenshtein assists in pinpointing words 'closest' to a misspelled word, thereby suggesting the likeliest spelling correction.
Computational biology: It quantifies the resemblance between two nucleotide strings in the analysis of DNA sequences.
Speech recognition: In the realm of speech recognition, it's used to compare uttered words with those in a database.

Conclusion

The Levenshtein distance is a potent instrument for gauging the difference between two strings. While the concept might appear simple, its applications are astonishingly diverse and relevant, underscoring the importance of this algorithm in contemporary technology. Consequently, understanding the Levenshtein distance is pivotal for anyone involved in computer science and related fields.

References

Timescale – the developer's data platform for modern apps, built on PostgreSQL

Timescale Cloud is PostgreSQL optimized for speed, scale, and performance. Over 3 million IoT, AI, crypto, and dev tool apps are powered by Timescale. Try it free today! No credit card required.

Try free

Top comments (11)

Vincent A. Cicirello • May 26 '23

There's a version of the algorithm that allows you to specify the costs of the different edit operations. Very similar to original. You can find the details in:

R. A. Wagner and M. J. Fischer, "The string-to-string correction problem," Journal of the ACM, vol. 21, no. 1, pp. 168–173, January 1974.

Fabrizio Bagalà • May 26 '23

Thank you for informing me. I was not familiar with this version. I will take a look and possibly implement it in the article.

Vincent A. Cicirello • May 27 '23

You're welcome

Fabrizio Bagalà • May 30 '23

@cicirello if I understand the article correctly, the algorithm is the Damerau-Levenshtein algorithm. It handles asymmetric costs and character transpositions. Am I right?

Vincent A. Cicirello • May 30 '23

No, Wagner and Fischer's algorithm doesn't handle transpositions. It is basically like Levenshtein, but instead of the costs of insertions, deletions, and substitutions each being 1, the algorithm has parameters for the costs of each of those operations. You can specify any non-negative cost for deletions, any non-negative cost for insertions, and any non-negative cost for substitutions.

The algorithm is almost identical to the one your post is about. Basically, where you have the + 1 in the statement for the deletion cost, you'd instead add whatever the cost of a deletion operation. And then in the statement for the insertion case, you'd add the cost of an insertion instead of adding 1. And in the conditional statement for the case of a substitution, that 1 becomes the cost of a substitution. The rest of algorithm remains the same as what you have.

You can then just pass 1 for all 3 costs if you want the Levenshtein case. Or you can pass different costs for the operations if in your application you want to treat the operations differently.

Vincent A. Cicirello • May 30 '23 • Edited

@fabriziobagala after my previous reply, it dawned on me that it might be useful to share my Java implementation of Wagner and Fischer's edit distance algorithm. I have a couple different classes that implements it in a Java library:

This one has methods for computing the edit distance between arrays of various types, between Strings, between lists, etc, and uses ints for the three cost values: github.com/cicirello/JavaPermutati...
This one is just like the above, but uses doubles for the three cost parameters: github.com/cicirello/JavaPermutati...
This last one is for the special case of computing the distance between permutations. The algorithm is the same really. It just predates the other implementations in this library, from when the library was entirely focused on permutations: github.com/cicirello/JavaPermutati...

Fabrizio Bagalà • May 30 '23

Based on what you told me and the code you sent me, I thought of writing this version you suggested in the following way:

public int LevenshteinDistance(string source, string target, int insertionCost, int deletionCost, int substitutionCost)
{
    if (insertionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(insertionCost));
    }

    if (deletionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(deletionCost));
    }

    if (substitutionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(substitutionCost));
    }

    var m = source.Length;
    var n = target.Length;

    var distance = new int[m + 1, n + 1];

    for (var i = 0; i <= m; i++)
    {
        distance[i, 0] = i * deletionCost;
    }

    for (var j = 0; j <= n; j++)
    {
        distance[0, j] = j * insertionCost;
    }

    for (var i = 1; i <= m; i++)
    {
        for (var j = 1; j <= n; j++)
        {
            var cost = source[i - 1] == target[j - 1] ? 0 : substitutionCost;

            var deletion = distance[i - 1, j] + deletionCost;
            var insertion = distance[i, j - 1] + insertionCost;
            var substitution = distance[i - 1, j - 1] + cost;

            distance[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
        }
    }

    return distance[m, n];
}

What do you think of this implementation?

Vincent A. Cicirello • May 30 '23

I think that looks correct.

Fabrizio Bagalà • May 31 '23

Great! Then I will include this version in the article. Thank you

Anthony Fung • May 26 '23

Thanks for sharing.

I once used this algorithm (from a NuGet package) to do some fuzzy matching. I was trying to find the most likely thing that a user was typing in from a finite set of options - in case there were any typos.

Great to have a better insight into how it actually works.

Fabrizio Bagalà • May 26 '23

It is an algorithm that I particularly like. It can be used in different areas and it is worth knowing how it works 😉

View full discussion (11 comments)

DEV Community

Levenshtein Distance

Brief history

How it works

Time complexity

Space complexity

Other implementations

Similarity percentage

Applications

Conclusion

References

Timescale – the developer's data platform for modern apps, built on PostgreSQL

Top comments (11)

A Workflow Copilot. Tailored to You.

Read next

Game Dev Digest — Issue #260 - Effect Breakdowns, Networking, and more

This Is Why We Don't Test Private Methods

Next.js: La Guía Definitiva del Framework React más Popular

Optimizando la Integración de APIs de Blog: Lecciones Aprendidas con Dev.to y Hashnode

Okay