Davide Santangelo

HybridSimilarity: A Game-Changing Algorithm for Smart Matching in Python πŸš€

Explaining the HybridSimilarity Algorithm

In this article, we will delve into the HybridSimilarity algorithm, a custom neural-network-based model for measuring the similarity between two pieces of text. The model combines lexical, phonetic, semantic, and syntactic signals into a single comprehensive similarity score.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sentence_transformers import SentenceTransformer
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone
import torch
import torch.nn as nn

class HybridSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = SentenceTransformer('all-MiniLM-L6-v2')
        self.tfidf = TfidfVectorizer()
        self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
        # Aggregation head: six hand-crafted features in, one score out.
        # These layers start randomly initialized; train them on labeled
        # pairs before relying on the output.
        self.fc = nn.Sequential(
            nn.Linear(6, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def _extract_features(self, text1, text2):
        # Multiple features
        features = {}

        # Lexical similarity
        features['levenshtein'] = levenshtein_ratio(text1, text2)
        words1, words2 = set(text1.split()), set(text2.split())
        union = words1 | words2
        features['jaccard'] = len(words1 & words2) / len(union) if union else 0.0

        # Phonetic similarity
        features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0

        # Semantic embedding (BERT)
        emb1 = self.bert.encode(text1, convert_to_tensor=True)
        emb2 = self.bert.encode(text2, convert_to_tensor=True)
        # The embeddings are 1-D tensors of shape (384,), so compare along dim=0.
        features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()

        # Syntactic similarity (LSA over TF-IDF)
        tfidf_matrix = self.tfidf.fit_transform([text1, text2])
        svd = TruncatedSVD(n_components=1)
        lsa = svd.fit_transform(tfidf_matrix)  # shape (2, 1): one LSA value per text
        norm = np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1])
        features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]) / norm) if norm else 0.0

        # Attention patterns: reshape each embedding to (seq_len=1, batch=1, embed_dim)
        att_output, _ = self.attention(
            emb1.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0)
        )
        features['attention_score'] = att_output.mean().item()

        return torch.tensor(list(features.values()), dtype=torch.float32).unsqueeze(0)

    def forward(self, text1, text2):
        features = self._extract_features(text1, text2)
        return self.fc(features).item()

def similarity_coefficient(text1, text2):
    model = HybridSimilarity()
    model.eval()
    with torch.no_grad():
        return model(text1, text2)

Key Components of the Algorithm

The HybridSimilarity model utilizes the following libraries and technologies:

  • SentenceTransformers: For semantic embedding generation using pre-trained transformer models.
  • Levenshtein Ratio: To calculate lexical similarity.
  • Phonetics (Metaphone): For phonetic similarity.
  • TF-IDF and TruncatedSVD: For syntactic similarity through Latent Semantic Analysis (LSA).
  • PyTorch: To define a custom neural network with attention mechanisms and fully connected layers.
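
If you want to run the code yourself, the imports map to these PyPI packages (package names are inferred from the import statements, so verify them against your environment; the Levenshtein import, for instance, is typically provided by python-Levenshtein):

pip install numpy scikit-learn sentence-transformers python-Levenshtein phonetics torch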

Step-by-Step Explanation

1. Model Initialization

The HybridSimilarity class inherits from nn.Module and initializes:

  • A BERT-based sentence embedding model (all-MiniLM-L6-v2), which produces 384-dimensional embeddings.
  • A TF-IDF vectorizer for text vectorization.
  • A multi-head attention mechanism (embed_dim=384, matching the embedding size) to capture interdependencies between the text pair.
  • A fully connected neural network for aggregating features and producing the final similarity score.
self.bert = SentenceTransformer('all-MiniLM-L6-v2')
self.tfidf = TfidfVectorizer()
self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
self.fc = nn.Sequential(
    nn.Linear(6, 256),  # six hand-crafted features in
    nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 1),
    nn.Sigmoid()
)
2. Feature Extraction

The _extract_features method calculates multiple similarity features:

  • Lexical Similarity
    • Levenshtein ratio: Measures character-level edits needed to convert one text into another.
    • Jaccard index: Compares sets of unique words in both texts.
features['levenshtein'] = levenshtein_ratio(text1, text2)
words1, words2 = set(text1.split()), set(text2.split())
union = words1 | words2
features['jaccard'] = len(words1 & words2) / len(union) if union else 0.0
  • Phonetic Similarity
    • Metaphone encoding: Checks whether the phonetic representations of the two texts match (a quick demo of the encoding follows the snippet below).
features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0
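To get a feel for the phonetic check, you can inspect the Metaphone encodings directly (the example words are my own; exact codes depend on the installed phonetics version):

from phonetics import metaphone

print(metaphone("smith"))  # similar-sounding words...
print(metaphone("smyth"))  # ...typically produce the same code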
  • Semantic Similarity
    • Sentence embeddings are generated using BERT, and cosine similarity is calculated between them.
emb1 = self.bert.encode(text1, convert_to_tensor=True)
emb2 = self.bert.encode(text2, convert_to_tensor=True)
# the embeddings are 1-D tensors of shape (384,), so compare along dim=0
features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()
  • Syntactic Similarity
    • TF-IDF is used to vectorize the two texts, and Latent Semantic Analysis (LSA) is applied via TruncatedSVD. Because the vectorizer is re-fitted on each pair, the resulting values are only comparable within that pair.
tfidf_matrix = self.tfidf.fit_transform([text1, text2])
svd = TruncatedSVD(n_components=1)
lsa = svd.fit_transform(tfidf_matrix)  # shape (2, 1): one LSA value per text
norm = np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1])
features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]) / norm) if norm else 0.0
  • Attention Mechanism
    • A multi-head attention mechanism is applied to the embeddings, and the average attention score is used as a feature.
# each embedding is reshaped to (seq_len=1, batch=1, embed_dim=384)
att_output, _ = self.attention(
    emb1.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0)
)
features['attention_score'] = att_output.mean().item()
3. Neural Network Aggregation

The extracted features are collected into a single tensor and passed through a fully connected network that outputs a similarity score between 0 and 1. As written, these layers are randomly initialized, so the score is illustrative; to make it meaningful you would train them on labeled pairs (a minimal sketch follows the snippet below).

def forward(self, text1, text2):
    features = self._extract_features(text1, text2)
    return self.fc(features).item()
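Here is a minimal sketch of what training the fully connected head could look like, assuming a small set of labeled pairs (the pairs data and the loop below are illustrative additions, not part of the original model):

# Hypothetical training loop for the fc head (sketch only).
# Only model.fc is optimized: _extract_features returns a detached tensor.
model = HybridSimilarity()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

pairs = [  # toy labeled data: (text1, text2, 1.0 = similar / 0.0 = dissimilar)
    ("the cat sat on the mat", "a cat is sitting on a mat", 1.0),
    ("the cat sat on the mat", "quarterly earnings fell sharply", 0.0),
]

for epoch in range(10):
    for text1, text2, label in pairs:
        pred = model.fc(model._extract_features(text1, text2)).squeeze()
        loss = loss_fn(pred, torch.tensor(label))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()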

Example Usage

The similarity_coefficient function initializes the model and calculates the similarity between two input texts.

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "A fast brown fox leaps over a sleepy hound"

print(f"Similarity coefficient: {similarity_coefficient(text_a, text_b):.4f}")

This function calls the HybridSimilarity model and outputs a similarity score as a float between 0 (completely dissimilar) and 1 (identical). Keep in mind that similarity_coefficient constructs a new HybridSimilarity instance, and therefore reloads the sentence-transformer weights, on every call; for repeated comparisons it is cheaper to reuse a single instance, as sketched below.
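A small usage variation that reuses one model across many pairs (the compare_many helper is my own naming, not part of the original code):

# Reuse a single model instance for many comparisons (illustrative helper).
model = HybridSimilarity()
model.eval()

def compare_many(pairs):
    with torch.no_grad():
        return [model(a, b) for a, b in pairs]

print(compare_many([
    ("The quick brown fox", "A fast brown fox"),
    ("The quick brown fox", "Quarterly earnings report"),
]))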

Conclusion

The HybridSimilarity algorithm is a robust solution that combines multiple dimensions of text similarity into a unified model. By integrating lexical, phonetic, semantic, and syntactic features, this hybrid approach ensures a nuanced and comprehensive similarity analysis. This makes it suitable for tasks such as duplicate detection, text clustering, and recommendation systems.
