Edit Distance is a metric used to quantify the dissimilarity between two strings by counting the amount of operation needed to transform one to another. Depending on the algorithm chosen, the operations allowed may include: addition, deletion, substitution and finally translation.
This measurement can then be used to calculate a percentage of similarity between two strings with this formula:
(maxLength - distance) / maxLength
To illustrate this, we will look at the Levenshtein algorithm. Levensthein is one of the most known edit distance algorithm. It was created by the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965. It allows insertions, deletions or substitutions.
For example, the Levenshtein distance between “kitten” and “sitting” is 3 since, at a minimum, 3 edits are required to change one into the other.
- kitten → sitten (substitution of “s” for “k”)
- sitten → sittin (substitution of “i” for “e”)
- sittin → sitting (insertion of “g” at the end).
Edit distance can be used in several application areas such as:
- Fuzzy search algorithms
- Correction of spelling mistakes
- Word completion
- DNA sequence alignment algorithms
- Pattern matching
- And many more!
Go-Edlib is a new open-source library for Golang that implements most popular edit distance algorithms and soon all of them! Currently, it includes: Levenshtein, LCS, Hamming, Damerau-Levenshtein (OSA and Adjacent transpositions algorithms), Jaro/Jaro-Winkler.
All these algorithms have been implemented in such a way as to be fully compatible with Unicode
It also includes fuzzy search algorithms based on edit distance and few others string comparisons functions.
I'm actively looking for feedback and/or contributions to improve this library or to have new functionality ideas to add! :)
PS: I would also like, if possible, opinions on the quality of my documentation and possibly points to improve