DEV Community

Davide Santangelo
Davide Santangelo

Posted on

Write a script to identify similarity from two string.

Implement a similarity method that compares two strings and returns a number representing the percent similarity between the two strings.

The method must be able to calculate the minimum number of operations you must do to change 1 string into another. In other words the the Levenshtein distance is the model.

  def similarity(string1, string2)
    # ...
    return similarity_percentage
  end
Enter fullscreen mode Exit fullscreen mode

Top comments (4)

Collapse
 
rpalo profile image
Ryan Palo

This is interesting. I tried for a little while to come up with my own solution, but, unfortunately, when I was researching the problem, I came across an algorithm for Levenshtein Distance. I couldn't think of any solution that I liked better than the one that I found.

So I thought I'd share it in case anybody else is interested. Here's a link to the original write-up I found.

"""Finds the percent similarity between two strings."""

import numpy as np

def similarity(first, second):
    """Returns the percent similarity between two strings.

    Calculated as the # of non-changed characters divided by
    the total characters in the smallest input (assuming all 
    added characters are changes that need made).
    """
    changes = levenshtein_distance(first, second)
    min_total_chars = min(len(first), len(second))

    return (min_total_chars - changes)/min_total_chars


def levenshtein_distance(first, second):
    """Calculates the Levenshtein Distance 
    (https://en.wikipedia.org/wiki/Levenshtein_distance) between two strings.

    Algorithm from Michael Gilleland (https://goo.gl/aUYbqW).
    """

    first_len = len(first)
    second_len = len(second)
    if first_len == 0 or second_len == 0:
        raise ValueError("Inputs must not have length 0")

    matrix = np.zeros((first_len+1, second_len+1), dtype=np.int)
    matrix[:,0] = range(first_len+1)
    matrix[0,:] = range(second_len+1)

    for i, first_char in enumerate(first, start=1):
        for j, second_char in enumerate(second, start=1):
            if first_char == second_char:
                cost = 0
            else:
                cost = 1

            min_cost = min(
                matrix[i-1, j] + 1,
                matrix[i, j-1] + 1,
                matrix[i-1, j-1] + cost
            )
            matrix[i, j] = min_cost

    return matrix[first_len, second_len]

Essentially, from what I can tell, you create a 2-dimensional matrix. As you build the matrix, moving from top-left to bottom-right, you're weighing the cumulative cost of:

  1. Deleting a character in one of the words.
  2. Inserting a new character in one of the words.
  3. Changing (or leaving alone) the existing character.

The bottom right corner of the matrix ends up holding the minimum number of changes you need to make to make the two strings match.

A completed matrix comparing GUMBO with GAMBOL

If anybody can explain the reasoning behind it to me better, I would love to hear about it!

Collapse
 
jay profile image
Jay

Rust solution for similarity of strings. It uses iterative approach with dynamic programming fashion. It creates a matrix of values and the bottom last cell has the number of changes needed to make the 2 strings similar.

fn similarity(s1: &str, s2: &str) -> i32 {
    let s: Vec<char> = String::from(s1).chars().collect();
    let t: Vec<char> = s2.to_string().chars().collect();

    let m = s1.len();
    let n = s2.len();

    // If any of the strings is empty the Levenshtein distance equal to the length of the other string.
    if m == 0 {
        return n as i32;
    }

    if n == 0 {
        return m as i32;
    }

    // Create mutable matrix of size (m+1, n+1);
    let mut d = vec![vec![0i32; n + 1]; m + 1];

    for i in 1..=m {
        d[i][0] = i as i32
    }

    for j in 1..=n {
        d[0][j] = j as i32
    }

    for j in 1..=n {
        for i in 1..=m {
            let mut cost = 0;
            // Check substitutuion cost.
            if s[i - 1] != t[j - 1] {
                cost = 1;
            }
            d[i][j] = min(
                d[i - 1][j] + 1,        // Deletion
                d[i][j - 1] + 1,        // Insertion
                d[i - 1][j - 1] + cost, // Substituition
            );
        }
    }

    d[m][n]
}

// Custom min() to get minimum of 3 numbers
fn min(a: i32, b: i32, c: i32) -> i32 {
    std::cmp::min(a, std::cmp::min(b, c))
}

// Usage
fn main() {
    println!("{}", similarity(&"kitten", &"sitting"));    // 3
}

I used the explanation from Wikipedia here.
The dynamic part work by first starting with prefix of both strings (Single characters at first) and them working it's way onwards to the complete string. It checks the similarity and computer the distance between the sub-strings, by checking if insertion, deletion, or substitution help to make the sub-strings similar. The last computer value will be the value of the differences of both the strings.

If the name of the algo was not given in the challenge, I would have gone with LCS (due to familiarity with the algo) and it would be good enough, as it works on the same principle but does not take substitutions into account.
I loved working on this, and it introduced me to a new algo; but please add some examples (test cases). A little more explanation using the examples would help a lot. The question just feels like try and implement this algo.
Some test cases I used:

#[test]
fn test_similarity() {
    assert_eq!(1, similarity(&"son", &"song"));
    assert_eq!(3, similarity(&"kitten", &"sitting"));
    assert_eq!(5, similarity(&"", &"abcde"));
    assert_eq!(2, similarity(&"flaw", &"lawn"));
    assert_eq!(0, similarity(&"rust", &"rust"));
}
Collapse
 
oskarlarsson profile image
Oskar Larsson

Are the strings of the same length? Or how do you define similarity? Are we supposed to come up with our own model?

Collapse
 
daviducolo profile image
Davide Santangelo

Thanks Oskar for suggestion. I just updated the description of the challenge.