This is interesting. I tried for a little while to come up with my own solution, but, unfortunately, when I was researching the problem, I came across an algorithm for Levenshtein Distance. I couldn't think of any solution that I liked better than the one that I found.
So I thought I'd share it in case anybody else is interested. Here's a link to the original write-up I found.
"""Finds the percent similarity between two strings."""
import numpy as np
def similarity(first, second):
"""Returns the percent similarity between two strings.
Calculated as the # of non-changed characters divided by
the total characters in the smallest input (assuming all
added characters are changes that need made).
changes = levenshtein_distance(first, second)
min_total_chars = min(len(first), len(second))
return (min_total_chars - changes)/min_total_chars
def levenshtein_distance(first, second):
"""Calculates the Levenshtein Distance
(https://en.wikipedia.org/wiki/Levenshtein_distance) between two strings.
Algorithm from Michael Gilleland (https://goo.gl/aUYbqW).
first_len = len(first)
second_len = len(second)
if first_len == 0 or second_len == 0:
raise ValueError("Inputs must not have length 0")
matrix = np.zeros((first_len+1, second_len+1), dtype=np.int)
matrix[:,0] = range(first_len+1)
matrix[0,:] = range(second_len+1)
for i, first_char in enumerate(first, start=1):
for j, second_char in enumerate(second, start=1):
if first_char == second_char:
cost = 0
cost = 1
min_cost = min(
matrix[i-1, j] + 1,
matrix[i, j-1] + 1,
matrix[i-1, j-1] + cost
matrix[i, j] = min_cost
return matrix[first_len, second_len]
Essentially, from what I can tell, you create a 2-dimensional matrix. As you build the matrix, moving from top-left to bottom-right, you're weighing the cumulative cost of:
The bottom right corner of the matrix ends up holding the minimum number of changes you need to make to make the two strings match.
If anybody can explain the reasoning behind it to me better, I would love to hear about it!
We're a place where coders share, stay up-to-date and grow their careers.
We strive for transparency and don't collect excess data.