Davide Santangelo

HybridSimilarity: A Game-Changing Algorithm for Smart Matching in Python πŸš€

Explaining the HybridSimilarity Algorithm

In this article, we will delve into the HybridSimilarity algorithm, a custom neural-network-based model for measuring the similarity between two pieces of text. The model combines lexical, phonetic, semantic, and syntactic signals into a single comprehensive similarity score.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sentence_transformers import SentenceTransformer
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone
import torch
import torch.nn as nn

class HybridSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = SentenceTransformer('all-MiniLM-L6-v2')
        self.tfidf = TfidfVectorizer()
        self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
        # Aggregation head: six hand-crafted features in, one score out.
        # These layers start randomly initialized; train them on labeled
        # pairs before relying on the output.
        self.fc = nn.Sequential(
            nn.Linear(6, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def _extract_features(self, text1, text2):
        # Multiple features
        features = {}

        # Lexical similarity
        features['levenshtein'] = levenshtein_ratio(text1, text2)
        words1, words2 = set(text1.split()), set(text2.split())
        union = words1 | words2
        features['jaccard'] = len(words1 & words2) / len(union) if union else 0.0

        # Phonetic similarity
        features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0

        # Semantic embedding (BERT)
        emb1 = self.bert.encode(text1, convert_to_tensor=True)
        emb2 = self.bert.encode(text2, convert_to_tensor=True)
        # The embeddings are 1-D tensors of shape (384,), so compare along dim=0.
        features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()

        # Syntactic similarity (LSA over TF-IDF)
        tfidf_matrix = self.tfidf.fit_transform([text1, text2])
        svd = TruncatedSVD(n_components=1)
        lsa = svd.fit_transform(tfidf_matrix)  # shape (2, 1): one LSA value per text
        norm = np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1])
        features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]) / norm) if norm else 0.0

        # Attention patterns: reshape each embedding to (seq_len=1, batch=1, embed_dim)
        att_output, _ = self.attention(
            emb1.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0)
        )
        features['attention_score'] = att_output.mean().item()

        return torch.tensor(list(features.values()), dtype=torch.float32).unsqueeze(0)

    def forward(self, text1, text2):
        features = self._extract_features(text1, text2)
        return self.fc(features).item()

def similarity_coefficient(text1, text2):
    model = HybridSimilarity()
    model.eval()
    with torch.no_grad():
        return model(text1, text2)

Key Components of the Algorithm

The HybridSimilarity model utilizes the following libraries and technologies:

  • SentenceTransformers: For semantic embedding generation using pre-trained transformer models.
  • Levenshtein Ratio: To calculate lexical similarity.
  • Phonetics (Metaphone): For phonetic similarity.
  • TF-IDF and TruncatedSVD: For syntactic similarity through Latent Semantic Analysis (LSA).
  • PyTorch: To define a custom neural network with attention mechanisms and fully connected layers.
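
If you want to run the code yourself, the imports map to these PyPI packages (package names are inferred from the import statements, so verify them against your environment; the Levenshtein import, for instance, is typically provided by python-Levenshtein):

pip install numpy scikit-learn sentence-transformers python-Levenshtein phonetics torch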

Step-by-Step Explanation

1. Model Initialization

The HybridSimilarity class inherits from nn.Module and initializes:

  • A BERT-based sentence embedding model (all-MiniLM-L6-v2), which produces 384-dimensional embeddings.
  • A TF-IDF vectorizer for text vectorization.
  • A multi-head attention mechanism (embed_dim=384, matching the embedding size) to capture interdependencies between the text pair.
  • A fully connected neural network for aggregating features and producing the final similarity score.
self.bert = SentenceTransformer('all-MiniLM-L6-v2')
self.tfidf = TfidfVectorizer()
self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
self.fc = nn.Sequential(
    nn.Linear(6, 256),  # six hand-crafted features in
    nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 1),
    nn.Sigmoid()
)
2. Feature Extraction

The _extract_features method calculates multiple similarity features:

  • Lexical Similarity
    • Levenshtein ratio: Measures character-level edits needed to convert one text into another.
    • Jaccard index: Compares sets of unique words in both texts.
features['levenshtein'] = levenshtein_ratio(text1, text2)
words1, words2 = set(text1.split()), set(text2.split())
union = words1 | words2
features['jaccard'] = len(words1 & words2) / len(union) if union else 0.0
  • Phonetic Similarity
    • Metaphone encoding: Checks whether the phonetic representations of the two texts match (a quick demo of the encoding follows the snippet below).
features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0
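To get a feel for the phonetic check, you can inspect the Metaphone encodings directly (the example words are my own; exact codes depend on the installed phonetics version):

from phonetics import metaphone

print(metaphone("smith"))  # similar-sounding words...
print(metaphone("smyth"))  # ...typically produce the same code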
  • Semantic Similarity
    • Sentence embeddings are generated using BERT, and cosine similarity is calculated between them.
emb1 = self.bert.encode(text1, convert_to_tensor=True)
emb2 = self.bert.encode(text2, convert_to_tensor=True)
# the embeddings are 1-D tensors of shape (384,), so compare along dim=0
features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()
  • Syntactic Similarity
    • TF-IDF is used to vectorize the two texts, and Latent Semantic Analysis (LSA) is applied via TruncatedSVD. Because the vectorizer is re-fitted on each pair, the resulting values are only comparable within that pair.
tfidf_matrix = self.tfidf.fit_transform([text1, text2])
svd = TruncatedSVD(n_components=1)
lsa = svd.fit_transform(tfidf_matrix)  # shape (2, 1): one LSA value per text
norm = np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1])
features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]) / norm) if norm else 0.0
  • Attention Mechanism
    • A multi-head attention mechanism is applied to the embeddings, and the average attention score is used as a feature.
# each embedding is reshaped to (seq_len=1, batch=1, embed_dim=384)
att_output, _ = self.attention(
    emb1.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0)
)
features['attention_score'] = att_output.mean().item()
3. Neural Network Aggregation

The extracted features are collected into a single tensor and passed through a fully connected network that outputs a similarity score between 0 and 1. As written, these layers are randomly initialized, so the score is illustrative; to make it meaningful you would train them on labeled pairs (a minimal sketch follows the snippet below).

def forward(self, text1, text2):
    features = self._extract_features(text1, text2)
    return self.fc(features).item()
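Here is a minimal sketch of what training the fully connected head could look like, assuming a small set of labeled pairs (the pairs data and the loop below are illustrative additions, not part of the original model):

# Hypothetical training loop for the fc head (sketch only).
# Only model.fc is optimized: _extract_features returns a detached tensor.
model = HybridSimilarity()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

pairs = [  # toy labeled data: (text1, text2, 1.0 = similar / 0.0 = dissimilar)
    ("the cat sat on the mat", "a cat is sitting on a mat", 1.0),
    ("the cat sat on the mat", "quarterly earnings fell sharply", 0.0),
]

for epoch in range(10):
    for text1, text2, label in pairs:
        pred = model.fc(model._extract_features(text1, text2)).squeeze()
        loss = loss_fn(pred, torch.tensor(label))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()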

Example Usage

The similarity_coefficient function initializes the model and calculates the similarity between two input texts.

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "A fast brown fox leaps over a sleepy hound"

print(f"Similarity coefficient: {similarity_coefficient(text_a, text_b):.4f}")

This function calls the HybridSimilarity model and outputs a similarity score as a float between 0 (completely dissimilar) and 1 (identical). Keep in mind that similarity_coefficient constructs a new HybridSimilarity instance, and therefore reloads the sentence-transformer weights, on every call; for repeated comparisons it is cheaper to reuse a single instance, as sketched below.
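A small usage variation that reuses one model across many pairs (the compare_many helper is my own naming, not part of the original code):

# Reuse a single model instance for many comparisons (illustrative helper).
model = HybridSimilarity()
model.eval()

def compare_many(pairs):
    with torch.no_grad():
        return [model(a, b) for a, b in pairs]

print(compare_many([
    ("The quick brown fox", "A fast brown fox"),
    ("The quick brown fox", "Quarterly earnings report"),
]))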

Conclusion

The HybridSimilarity algorithm is a robust solution that combines multiple dimensions of text similarity into a unified model. By integrating lexical, phonetic, semantic, and syntactic features, this hybrid approach ensures a nuanced and comprehensive similarity analysis. This makes it suitable for tasks such as duplicate detection, text clustering, and recommendation systems.
