Discussion on: Write a script to identify similarity from two string.

View post

This is interesting. I tried for a little while to come up with my own solution, but, unfortunately, when I was researching the problem, I came across an algorithm for Levenshtein Distance. I couldn't think of any solution that I liked better than the one that I found.

So I thought I'd share it in case anybody else is interested. Here's a link to the original write-up I found.

"""Finds the percent similarity between two strings."""

import numpy as np

def similarity(first, second):
    """Returns the percent similarity between two strings.

    Calculated as the # of non-changed characters divided by
    the total characters in the smallest input (assuming all 
    added characters are changes that need made).
    """
    changes = levenshtein_distance(first, second)
    min_total_chars = min(len(first), len(second))

    return (min_total_chars - changes)/min_total_chars


def levenshtein_distance(first, second):
    """Calculates the Levenshtein Distance 
    (https://en.wikipedia.org/wiki/Levenshtein_distance) between two strings.

    Algorithm from Michael Gilleland (https://goo.gl/aUYbqW).
    """

    first_len = len(first)
    second_len = len(second)
    if first_len == 0 or second_len == 0:
        raise ValueError("Inputs must not have length 0")

    matrix = np.zeros((first_len+1, second_len+1), dtype=np.int)
    matrix[:,0] = range(first_len+1)
    matrix[0,:] = range(second_len+1)

    for i, first_char in enumerate(first, start=1):
        for j, second_char in enumerate(second, start=1):
            if first_char == second_char:
                cost = 0
            else:
                cost = 1

            min_cost = min(
                matrix[i-1, j] + 1,
                matrix[i, j-1] + 1,
                matrix[i-1, j-1] + cost
            )
            matrix[i, j] = min_cost

    return matrix[first_len, second_len]

Essentially, from what I can tell, you create a 2-dimensional matrix. As you build the matrix, moving from top-left to bottom-right, you're weighing the cumulative cost of:

Deleting a character in one of the words.
Inserting a new character in one of the words.
Changing (or leaving alone) the existing character.

The bottom right corner of the matrix ends up holding the minimum number of changes you need to make to make the two strings match.

If anybody can explain the reasoning behind it to me better, I would love to hear about it!