Calculating String Similarity in Python

Measuring how similar two strings are is useful in many applications, from text analysis to duplicate detection. Let's explore four ways to calculate the similarity between two strings in Python.

1. Levenshtein Distance

Levenshtein distance, also known as edit distance, is a common method for measuring how different two strings are. It is the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other, so a lower value means the strings are more similar. In Python, you can compute it with the python-Levenshtein library.

pip install python-Levenshtein

Here's a sample code snippet to calculate Levenshtein distance:

import Levenshtein

str1 = "kitten"
str2 = "sitting"

distance = Levenshtein.distance(str1, str2)
print(f"Levenshtein Distance: {distance}")
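To see what the library is computing, here is a minimal pure-Python sketch of the standard dynamic-programming approach to edit distance. The function names `levenshtein_distance` and `levenshtein_similarity` are made up for this illustration and are not part of python-Levenshtein. The second helper rescales the distance to a 0 to 1 score by dividing by the longer string's length, which is one common convention rather than the only one.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Illustrative helper: rescale the distance so 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_similarity("kitten", "sitting"))
```

For "kitten" and "sitting" the three edits are k→s, e→i, and an inserted g. The library version is written in C and will be much faster on long strings, so treat this sketch as an explanation rather than a replacement.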

2. SequenceMatcher

The SequenceMatcher class from difflib, a module in Python's standard library, compares two sequences, including strings, so there is nothing to install. Its ratio() method returns a value between 0 and 1, where 1 means the sequences are identical. It's useful not only for exact matches but also for partial matches where parts of one string match parts of another.

Here's a sample code snippet using SequenceMatcher to calculate string similarity:

from difflib import SequenceMatcher

str1 = "kitten"
str2 = "sitting"

matcher = SequenceMatcher(None, str1, str2)
match_ratio = matcher.ratio()
print(f"Match Ratio: {match_ratio}")
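The ratio is documented as 2 * M / T, where M is the number of matched elements and T is the combined length of both sequences. The sketch below recomputes it from `get_matching_blocks()` to make that formula concrete. The helper name `ratio_from_blocks` is made up for this example, and returning 1.0 for two empty strings is chosen to mirror what `ratio()` does.

```python
from difflib import SequenceMatcher


def ratio_from_blocks(a: str, b: str) -> float:
    """Recompute SequenceMatcher.ratio() as 2 * M / T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # Each block describes a run of matching characters; the final block is a
    # zero-length sentinel, so it adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2 * matched / total if total else 1.0


str1, str2 = "kitten", "sitting"
print(ratio_from_blocks(str1, str2))
print(SequenceMatcher(None, str1, str2).ratio())  # same value
```

For "kitten" and "sitting" the matched runs are "itt" and "n", so M is 4, T is 13, and the ratio is 8/13.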

3. Jaccard Similarity

Jaccard similarity measures the similarity between two sets as the size of their intersection divided by the size of their union. For text, it's often applied by splitting strings into tokens (e.g., words) and comparing the resulting sets, which means word order and repeated words are ignored.

Here's a sample code snippet to calculate Jaccard similarity:

str1 = "hello world"
str2 = "world hello"

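# Split on whitespace and keep only the unique words
set1 = set(str1.split())
set2 = set(str2.split())

# Count the words shared by both sets and the words found in either set
intersection = len(set1.intersection(set2))
union = len(set1.union(set2))

# Both strings contain the same two words, so this evaluates to 1.0
jaccard_similarity = intersection / union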
print(f"Jaccard Similarity: {jaccard_similarity}")
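The snippet above can be wrapped in a reusable function that also guards against division by zero when both inputs are empty. This is a sketch under a few assumptions: the names `jaccard_similarity` and `char_ngrams` are made up for this example, treating two empty inputs as identical is a choice rather than a rule, and character bigrams are just one possible tokenization for short strings where word splitting gives a single token.

```python
def jaccard_similarity(tokens_a, tokens_b) -> float:
    """Size of the intersection divided by size of the union of two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty inputs are treated as identical
    return len(set_a & set_b) / len(union)


def char_ngrams(text: str, n: int = 2) -> list:
    """Overlapping character n-grams, e.g. 'cat' -> ['ca', 'at'] for n=2."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Word tokens: one shared word out of three distinct words
print(jaccard_similarity("hello world".split(), "hello there".split()))

# Character bigrams: a finer-grained comparison for single words
print(jaccard_similarity(char_ngrams("kitten"), char_ngrams("sitting")))
```

With word tokens, "hello world" and "hello there" share one of three distinct words, giving 1/3. With bigrams, "kitten" and "sitting" share "it" and "tt" out of nine distinct bigrams, giving 2/9.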

4. Cosine Similarity

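Cosine similarity measures the cosine of the angle between two vectors. To apply it to text, each string is first converted to a vector representation (e.g., TF-IDF), and texts that use similar words in similar proportions score closer to 1. You can calculate it using the scikit-learn library.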

pip install scikit-learn

Here's a sample code snippet to calculate Cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

str1 = "I love programming"
str2 = "Programming is my passion"

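# Learn the vocabulary from both strings and build one TF-IDF row per string
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([str1, str2])

# cosine_similarity returns a 1x1 array here; [0][0] extracts the score
cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]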
print(f"Cosine Similarity: {cosine_sim}")
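To show the arithmetic behind the score, here is a standard-library sketch that computes cosine similarity as the dot product divided by the product of the vector lengths. Note the simplification: it uses raw lowercase word counts instead of TF-IDF weights and splits on whitespace only, so its result will not match the scikit-learn output above. The function name `cosine_similarity_counts` is made up for this example.

```python
import math
from collections import Counter


def cosine_similarity_counts(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using raw word counts (not TF-IDF)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product: only words present in both texts contribute
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)

    # Euclidean length of each count vector
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity_counts("I love programming", "Programming is my passion"))
```

Here the only shared word is "programming", so the dot product is 1, the vector lengths are √3 and 2, and the score is 1 / (2√3), roughly 0.29.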

Choose the method that best suits your requirements and data. Levenshtein distance and SequenceMatcher compare strings character by character, while Jaccard and cosine similarity, as used here, compare texts at the word level.
