Venu171

Posted on May 1

Understanding Text Similarity with Embeddings and Cosine Similarity

#ai #nlp #vectordatabase #webdev

How to measure semantic similarity between sentences using modern NLP techniques

Introduction

Have you ever wondered how search engines or chatbots understand that "Machine Learning affects all areas of life" is much more similar to "Artificial intelligence is transforming the world" than "Maradona was one of the best football players in history"?

This isn't magic — it's embeddings + cosine similarity.

In this blog post, we'll break down exactly how this works, starting from the mathematical foundation and ending with real, runnable Python code using Hugging Face Transformers.

By the end, you'll understand:

What text embeddings actually are
Why cosine similarity is the go-to metric
How to implement semantic text similarity from scratch
Real-world results using the BART model

Let's dive in!

What Are Text Embeddings?

Embeddings are numerical vectors that capture the meaning of text in a high-dimensional space.

Instead of treating words as isolated tokens, modern transformer models (like BERT, BART, or GPT) convert entire sentences into dense vectors (typically 768 or 1024 dimensions).

Key property: Semantically similar texts end up close to each other in this vector space.

"King" and "Queen" → close vectors
"King" and "Apple" → far apart vectors
"Artificial Intelligence" and "Machine Learning" → very close

This is the foundation of semantic search, RAG systems, recommendation engines, and plagiarism detection.

The Magic Metric: Cosine Similarity

Once we have two embedding vectors, how do we quantify how similar they are?

We use cosine similarity — the cosine of the angle between two vectors.

The Formula

Interpretation:

1.0 → Identical direction (very similar meaning)
0.0 → Orthogonal (unrelated)
-1.0 → Opposite direction (opposite meaning)

Why cosine and not Euclidean distance?

Cosine similarity is magnitude invariant. It only cares about the direction (i.e., the semantic orientation), not the length of the vectors. This makes it perfect for comparing texts of different lengths.

A Concrete Example (With Numbers!)

To make this crystal clear, let's use simplified 4-dimensional vectors (real models use 768-D, but the math is identical).

Our Sentences

Source: "Artificial intelligence is transforming the world."

Embedding: [0.85, 0.65, 0.12, 0.25]

Candidate 1 (sports): "Maradona was one of the best football players in history."

Embedding: [0.15, 0.08, 0.92, 0.30]

Candidate 2 (tech): "Machine Learning affects all areas of life."

Embedding: [0.78, 0.58, 0.18, 0.22]

Step-by-Step Calculation (Candidate 2)

1. Dot Product

0.85 × 0.78 + 0.65 × 0.58 + 0.12 × 0.18 + 0.25 × 0.22
= 0.663 + 0.377 + 0.0216 + 0.055 = 1.1166

2. Vector Magnitudes

Source: √(0.85² + 0.65² + 0.12² + 0.25²) ≈ 1.1054
Candidate 2: √(0.78² + 0.58² + 0.18² + 0.22²) ≈ 1.0127

3. Final Cosine Similarity

cos(θ) = 1.1166 / (1.1054 × 1.0127) ≈ **0.997**

Result:

Candidate 1 (sports): 0.336
Candidate 2 (tech): 0.997

The model correctly identifies that the second sentence is almost semantically identical to the source!

Real-World Implementation with BART

Now let's see how this works with actual transformer embeddings.

Here's the complete, production-ready code:

from transformers import pipeline
import torch

# Load feature extraction pipeline
feature_extractor = pipeline(
    "feature-extraction",
    model="facebook/bart-base"
)

def get_sentence_embedding(text):
    """Convert text to averaged embedding vector."""
    embeddings = feature_extractor(text)
    tensor = torch.tensor(embeddings).squeeze(0)  # Remove batch dim
    return tensor.mean(dim=0, keepdim=True)       # Average over tokens

def text_similarity(text1, text2):
    """Compute cosine similarity between two sentences."""
    emb1 = get_sentence_embedding(text1)
    emb2 = get_sentence_embedding(text2)
    return torch.nn.functional.cosine_similarity(emb1, emb2).item()

# Example usage
source = "Artificial intelligence is transforming the world."
candidates = [
    "Maradona was one of the best football players in history.",
    "Machine Learning affects all areas of life."
]

print(f"Source: {source}\n")
for cand in candidates:
    score = text_similarity(source, cand)
    print(f"→ {cand}")
    print(f"   Similarity: {score:.4f}\n")

Output (actual run):

Source: Artificial intelligence is transforming the world.

→ Maradona was one of the best football players in history.
   Similarity: 0.4625

→ Machine Learning affects all areas of life.
   Similarity: 0.7117

Beautiful! The model gives us exactly the expected behavior.

Why This Technique is So Powerful

This simple pattern powers many modern AI applications:

Application	How Cosine Similarity Helps
Semantic Search	Find documents with similar meaning, not just keywords
RAG Systems	Retrieve the most relevant context for LLMs
Duplicate Detection	Identify paraphrased content
Recommendation	Suggest similar articles, products, or movies
Clustering	Group documents by topic automatically

Key Takeaways

Embeddings turn text into numbers that capture meaning.
Averaging token embeddings gives you a robust sentence vector.
Cosine similarity is the standard way to compare these vectors.
You don't need massive models — even facebook/bart-base (535M params) works surprisingly well for this task.
This technique is foundational to almost every modern NLP application.

Try It Yourself

Want to experiment?

Install the dependencies:

   pip install transformers torch

Run the code above (first run will download ~535MB model).
Try your own sentences!

Conclusion

Text similarity using embeddings and cosine similarity is one of those "simple but incredibly powerful" techniques in NLP. Once you understand the vector space intuition and the math behind cosine similarity, a whole world of applications opens up — from building smarter search engines to improving RAG pipelines.

The best part? You now have the complete mental model and the working code to start building with it today.

What will you build with this technique?

Drop your ideas in the comments!

DEV Community