Calculating the similarity or dissimilarity between strings accurately and cost-efficiently is a crucial task in domains such as natural language processing, data mining and database management.
In databases, string distance algorithms make tasks like spell checking, record linkage, data deduplication, and information retrieval possible.
In this blog, we will be focusing on three popular string distance algorithms:
- Levenshtein distance
- Jaro similarity
- Naive Recursive (Edit Distance)
We will discuss how each algorithm works, its significance in databases, and its advantages and disadvantages.
We will also cover the UTL_MATCH package available in Oracle Database for string comparison.
Levenshtein Distance
The Levenshtein distance is named after the Soviet mathematician Vladimir Levenshtein, who introduced it in 1965.
It follows a simple concept: calculate the minimum number of single-character edits required to transform one string into another. These edits are insertions, deletions and substitutions.
Let's try to understand it better with an example:
Suppose two words/strings: "kangaroo" and "potato"
This would be the initial matrix:
| | | k | a | n | g | a | r | o | o |
|---|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| p | 1 | | | | | | | | |
| o | 2 | | | | | | | | |
| t | 3 | | | | | | | | |
| a | 4 | | | | | | | | |
| t | 5 | | | | | | | | |
| o | 6 | | | | | | | | |
Now let's fill the matrix. Compare each character of the two strings. If a character matches, the value from the diagonal cell is copied. If they do not match, the minimum value out of those present in the left, diagonal and upper cells is selected and incremented by one.
| | | k | a | n | g | a | r | o | o |
|---|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| p | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| o | 2 | 2 | 2 | 3 | 4 | 5 | 6 | 6 | 7 |
| t | 3 | 3 | 3 | 3 | 4 | 5 | 6 | 7 | 7 |
| a | 4 | 4 | 3 | 4 | 4 | 4 | 5 | 6 | 7 |
| t | 5 | 5 | 4 | 4 | 5 | 5 | 5 | 6 | 7 |
| o | 6 | 6 | 5 | 5 | 5 | 6 | 6 | 5 | 6 |
The final value for the Levenshtein distance is the value calculated for the bottom right cell of the matrix, which in our case is 6. One way to reach it: substitute "k", "a", "n" with "p", "o", "t", delete "g", keep "a", substitute "r" with "t", keep one "o" and delete the other (four substitutions and two deletions).
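The matrix-filling procedure described above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation; the function name `levenshtein` and the variable names are assumptions made for this example.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Return the Levenshtein distance between s1 and s2 using the full matrix."""
    rows, cols = len(s2) + 1, len(s1) + 1
    # dist[i][j] = edits needed to turn the first j characters of s1
    # into the first i characters of s2.
    dist = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        dist[0][j] = j  # first row: 0, 1, 2, ...
    for i in range(rows):
        dist[i][0] = i  # first column: 0, 1, 2, ...
    for i in range(1, rows):
        for j in range(1, cols):
            if s2[i - 1] == s1[j - 1]:
                # Characters match: copy the diagonal value.
                dist[i][j] = dist[i - 1][j - 1]
            else:
                # Mismatch: 1 + minimum of left, diagonal and upper cells.
                dist[i][j] = 1 + min(dist[i][j - 1],
                                     dist[i - 1][j - 1],
                                     dist[i - 1][j])
    return dist[-1][-1]  # bottom right cell


print(levenshtein("kangaroo", "potato"))
```

Printing `dist` row by row is a quick way to check a hand-filled matrix against the computed one.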
Advantages:
- Provides a precise value for string dissimilarity considering edits required.
- Suitable for fuzzy matching, spell checking, and data deduplication.
- Can handle strings of differing lengths.
Disadvantages:
- Computationally expensive. Time complexity is O(m*n) where m and n are the lengths of the two strings.
- May perform poorly with lengthier strings.
Jaro Similarity
The Jaro similarity algorithm was introduced in 1989 by William E. Jaro. It calculates the similarity between two strings by comparing their characters and the order in which they appear. The result is a value between 0 and 1, where 0 indicates no similarity at all and 1 indicates a perfect match.
Let's use an example to try to understand better.
Suppose two strings: "marble" and "table"
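Jaro similarity is calculated using the following formula:
Jaro similarity = (m / |s1| + m / |s2| + (m - t) / m) / 3
where:
- m is the number of matching characters
- t is the number of transpositions, that is, half the number of matching characters that appear in a different order in the two strings
- |s1| and |s2| are the lengths of string 1 and string 2 respectively.
If there are no matching characters (m = 0), the similarity is 0.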
For our example ("marble" & "table"):
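- 4 matching characters ("a", "b", "l", "e"). m = 4
- The matching characters appear in the same order in both strings, so there are no transpositions. t = 0
Jaro similarity = (4 / 6 + 4 / 5 + (4 - 0) / 4) / 3 = 0.822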
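The calculation can be sketched in Python as below. The sketch assumes the usual definition of a "matching" character, which the text above does not spell out: two equal characters match only if their positions differ by at most floor(max(|s1|, |s2|) / 2) - 1. The function name `jaro` is chosen for this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Return the Jaro similarity of s1 and s2 (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Assumed match window: how far apart two equal characters may be.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t = half the number of matched characters that are out of order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro("marble", "table"), 3))  # 0.822
```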
Jaro-Winkler Similarity
Jaro-Winkler similarity is a slight modification of the Jaro similarity. It gives additional weight to a common prefix (the characters the two strings share at the very beginning).
It is useful for comparing shorter strings, or when strings that start the same way should score higher.
Here are the steps to calculate the Jaro-Winkler similarity:
- Calculate the Jaro similarity as done earlier. We will continue with the value from our example (0.822).
- Calculate the prefix bonus. The formula for this is: prefix bonus = p * common prefix length * (1 - Jaro similarity), where p is the prefix scaling factor (commonly 0.1) and the common prefix length is usually capped at 4 characters. "marble" and "table" share no common prefix, so for our example this would be: prefix bonus = 0.1 * 0 * (1 - 0.822) = 0
- Calculate the Jaro-Winkler similarity using the formula: Jaro-Winkler similarity = Jaro similarity + prefix bonus. We would get: Jaro-Winkler similarity = 0.822 + 0 = 0.822
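The prefix adjustment can be sketched in Python as follows, assuming the commonly used formulation: Jaro-Winkler = Jaro + (common prefix length * p * (1 - Jaro)), with p = 0.1 and the prefix length capped at 4. To keep the sketch short, the hypothetical function `jaro_winkler` takes an already computed Jaro score as an argument instead of recomputing it.

```python
def jaro_winkler(s1: str, s2: str, jaro_score: float,
                 p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix bonus to a precomputed Jaro score."""
    # Count how many leading characters the two strings share,
    # stopping at max_prefix (assumed cap of 4).
    prefix_len = 0
    for a, b in zip(s1, s2):
        if a != b or prefix_len == max_prefix:
            break
        prefix_len += 1
    return jaro_score + prefix_len * p * (1 - jaro_score)


# "marble" and "table" share no prefix, so the score is unchanged.
print(jaro_winkler("marble", "table", 0.822))  # 0.822
```

Because the bonus is scaled by (1 - Jaro), the result never exceeds 1 as long as p times the prefix cap is at most 1.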
Advantages:
- Considers the order of characters when calculating string similarity.
- Used in tasks such as record linkage, string matching, and deduplication.
- Can handle strings of different lengths as well as strings having transpositions.
Disadvantages:
- Does not count the edits needed to turn one string into the other, so the score is less directly interpretable than a Levenshtein distance.
- May not perform well in cases where common prefixes are not that significant.
Naive Recursive (Edit Distance)
The Naive Recursive algorithm, also known as the Edit Distance algorithm, computes the same quantity as the Levenshtein algorithm above: the minimum number of insertions, deletions and substitutions needed to turn one string into another. However, instead of filling a matrix, it uses a recursive approach. It breaks the problem down into smaller subproblems and solves them recursively, without storing intermediate results.
Let's once again use an example to understand this better using the strings "book" and "back".
- If either string is null/empty, the distance is the length of the other string.
- If the last characters of the strings match (common suffix), the distance is that of the two strings with their last characters removed.
- If the conditions above are false, the distance is the minimum of the following three recursive calls:
  - Insertion: EditDistance(s1, s2[:-1]) + 1
  - Deletion: EditDistance(s1[:-1], s2) + 1
  - Substitution: EditDistance(s1[:-1], s2[:-1]) + 1
For our example, the Edit Distance would be 2: the first and last characters match, and the two middle characters ("oo") are substituted with "ac".
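The three rules translate almost directly into Python. This is a minimal sketch of the naive recursion; the function name `edit_distance` is chosen for this example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Return the edit distance between s1 and s2 by plain recursion."""
    # Rule 1: if either string is empty, the distance is the other's length.
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    # Rule 2: matching last characters cost nothing.
    if s1[-1] == s2[-1]:
        return edit_distance(s1[:-1], s2[:-1])
    # Rule 3: try all three edits and keep the cheapest.
    return 1 + min(
        edit_distance(s1, s2[:-1]),       # insertion
        edit_distance(s1[:-1], s2),       # deletion
        edit_distance(s1[:-1], s2[:-1]),  # substitution
    )


print(edit_distance("book", "back"))  # 2
```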
Advantages:
- Simplistic, easy to understand and implement.
- Calculates an accurate distance between strings.
Disadvantages:
- Exponential time complexity, because the same subproblems are recomputed many times; only suitable for short strings.
- Computationally too costly for large datasets and real-time applications.
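To get a feel for the exponential growth, the sketch below counts how many times the naive recursion is entered and prints it next to the number of cells a matrix of the same size would have. The helper `count_calls` is hypothetical and exists only for this illustration; the inputs are chosen so that no characters match, which is the worst case.

```python
def count_calls(s1: str, s2: str) -> int:
    """Return how many times the naive recursion is entered for s1 and s2."""
    calls = 0

    def edit_distance(a: str, b: str) -> int:
        nonlocal calls
        calls += 1
        if not a:
            return len(b)
        if not b:
            return len(a)
        if a[-1] == b[-1]:
            return edit_distance(a[:-1], b[:-1])
        return 1 + min(
            edit_distance(a, b[:-1]),
            edit_distance(a[:-1], b),
            edit_distance(a[:-1], b[:-1]),
        )

    edit_distance(s1, s2)
    return calls


for n in range(1, 7):
    a, b = "a" * n, "b" * n  # worst case: no characters match
    matrix_cells = (n + 1) * (n + 1)
    print(n, count_calls(a, b), matrix_cells)
```

The call count grows much faster than the number of matrix cells, which is why the matrix-based approach is preferred for anything beyond short strings.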
Oracle Database
Oracle Database offers string comparison through its UTL_MATCH package. The package provides four functions: EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY, based on the Levenshtein distance, and JARO_WINKLER and JARO_WINKLER_SIMILARITY, based on the Jaro-Winkler algorithm discussed above.
Here is how the Levenshtein Distance is calculated using UTL_MATCH.EDIT_DISTANCE function:
```sql
-- Enable SET SERVEROUTPUT ON in the client to see DBMS_OUTPUT text.
DECLARE
  distance NUMBER;
BEGIN
  distance := UTL_MATCH.EDIT_DISTANCE('kangaroo', 'potato');
  DBMS_OUTPUT.PUT_LINE('Levenshtein Distance: ' || distance);
END;
/
```
Output will be: 'Levenshtein Distance: 6' (four substitutions and two deletions turn "kangaroo" into "potato").
We can also use the UTL_MATCH.JARO_WINKLER_SIMILARITY function to calculate the Jaro-Winkler similarity:
```sql
-- Enable SET SERVEROUTPUT ON in the client to see DBMS_OUTPUT text.
DECLARE
  similarity NUMBER;
BEGIN
  similarity := UTL_MATCH.JARO_WINKLER_SIMILARITY('marble', 'table');
  DBMS_OUTPUT.PUT_LINE('Jaro-Winkler Similarity: ' || similarity);
END;
/
```
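JARO_WINKLER_SIMILARITY returns an integer between 0 and 100 rather than a fraction, so the output here should be: 'Jaro-Winkler Similarity: 82', which corresponds to the value of 0.822 calculated by hand. To get the result on the 0 to 1 scale, use UTL_MATCH.JARO_WINKLER instead.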
The UTL_MATCH package in Oracle Database offers optimised functions for string comparison, further enhancing the capabilities of databases in handling string-related operations.
Understanding and utilising these algorithms and tools can greatly enhance the efficiency and accuracy of database systems.