DEV Community

Venu171
Venu171

Posted on

Understanding Text Similarity with Embeddings and Cosine Similarity

How to measure semantic similarity between sentences using modern NLP techniques


Introduction

Have you ever wondered how search engines or chatbots understand that "Machine Learning affects all areas of life" is much more similar to "Artificial intelligence is transforming the world" than "Maradona was one of the best football players in history"?

This isn't magic — it's embeddings + cosine similarity.

In this blog post, we'll break down exactly how this works, starting from the mathematical foundation and ending with real, runnable Python code using Hugging Face Transformers.

By the end, you'll understand:

  • What text embeddings actually are
  • Why cosine similarity is the go-to metric
  • How to implement semantic text similarity from scratch
  • Real-world results using the BART model

Let's dive in!


What Are Text Embeddings?

Embeddings are numerical vectors that capture the meaning of text in a high-dimensional space.

Instead of treating words as isolated tokens, modern transformer models (like BERT, BART, or GPT) convert entire sentences into dense vectors (typically 768 or 1024 dimensions).

Key property: Semantically similar texts end up close to each other in this vector space.

  • "King" and "Queen" → close vectors
  • "King" and "Apple" → far apart vectors
  • "Artificial Intelligence" and "Machine Learning" → very close

This is the foundation of semantic search, RAG systems, recommendation engines, and plagiarism detection.


The Magic Metric: Cosine Similarity

Once we have two embedding vectors, how do we quantify how similar they are?

We use cosine similarity — the cosine of the angle between two vectors.

The Formula

Interpretation:

  • 1.0 → Identical direction (very similar meaning)
  • 0.0 → Orthogonal (unrelated)
  • -1.0 → Opposite direction (opposite meaning)

Why cosine and not Euclidean distance?

Cosine similarity is magnitude invariant. It only cares about the direction (i.e., the semantic orientation), not the length of the vectors. This makes it perfect for comparing texts of different lengths.


A Concrete Example (With Numbers!)

To make this crystal clear, let's use simplified 4-dimensional vectors (real models use 768-D, but the math is identical).

Our Sentences

Source: "Artificial intelligence is transforming the world."

Embedding: [0.85, 0.65, 0.12, 0.25]

Candidate 1 (sports): "Maradona was one of the best football players in history."

Embedding: [0.15, 0.08, 0.92, 0.30]

Candidate 2 (tech): "Machine Learning affects all areas of life."

Embedding: [0.78, 0.58, 0.18, 0.22]

Step-by-Step Calculation (Candidate 2)

1. Dot Product

0.85 × 0.78 + 0.65 × 0.58 + 0.12 × 0.18 + 0.25 × 0.22
= 0.663 + 0.377 + 0.0216 + 0.055 = 1.1166
Enter fullscreen mode Exit fullscreen mode

2. Vector Magnitudes

  • Source: √(0.85² + 0.65² + 0.12² + 0.25²) ≈ 1.1054
  • Candidate 2: √(0.78² + 0.58² + 0.18² + 0.22²) ≈ 1.0127

3. Final Cosine Similarity

cos(θ) = 1.1166 / (1.1054 × 1.0127) ≈ **0.997**
Enter fullscreen mode Exit fullscreen mode

Result:

  • Candidate 1 (sports): 0.336
  • Candidate 2 (tech): 0.997

The model correctly identifies that the second sentence is almost semantically identical to the source!


Real-World Implementation with BART

Now let's see how this works with actual transformer embeddings.

Here's the complete, production-ready code:

from transformers import pipeline
import torch

# Load feature extraction pipeline
feature_extractor = pipeline(
    "feature-extraction",
    model="facebook/bart-base"
)

def get_sentence_embedding(text):
    """Convert text to averaged embedding vector."""
    embeddings = feature_extractor(text)
    tensor = torch.tensor(embeddings).squeeze(0)  # Remove batch dim
    return tensor.mean(dim=0, keepdim=True)       # Average over tokens

def text_similarity(text1, text2):
    """Compute cosine similarity between two sentences."""
    emb1 = get_sentence_embedding(text1)
    emb2 = get_sentence_embedding(text2)
    return torch.nn.functional.cosine_similarity(emb1, emb2).item()

# Example usage
source = "Artificial intelligence is transforming the world."
candidates = [
    "Maradona was one of the best football players in history.",
    "Machine Learning affects all areas of life."
]

print(f"Source: {source}\n")
for cand in candidates:
    score = text_similarity(source, cand)
    print(f"{cand}")
    print(f"   Similarity: {score:.4f}\n")
Enter fullscreen mode Exit fullscreen mode

Output (actual run):

Source: Artificial intelligence is transforming the world.

→ Maradona was one of the best football players in history.
   Similarity: 0.4625

→ Machine Learning affects all areas of life.
   Similarity: 0.7117
Enter fullscreen mode Exit fullscreen mode

Beautiful! The model gives us exactly the expected behavior.


Why This Technique is So Powerful

This simple pattern powers many modern AI applications:

Application How Cosine Similarity Helps
Semantic Search Find documents with similar meaning, not just keywords
RAG Systems Retrieve the most relevant context for LLMs
Duplicate Detection Identify paraphrased content
Recommendation Suggest similar articles, products, or movies
Clustering Group documents by topic automatically

Key Takeaways

  1. Embeddings turn text into numbers that capture meaning.
  2. Averaging token embeddings gives you a robust sentence vector.
  3. Cosine similarity is the standard way to compare these vectors.
  4. You don't need massive models — even facebook/bart-base (535M params) works surprisingly well for this task.
  5. This technique is foundational to almost every modern NLP application.

Try It Yourself

Want to experiment?

  1. Install the dependencies:
   pip install transformers torch
Enter fullscreen mode Exit fullscreen mode
  1. Run the code above (first run will download ~535MB model).

  2. Try your own sentences!


Conclusion

Text similarity using embeddings and cosine similarity is one of those "simple but incredibly powerful" techniques in NLP. Once you understand the vector space intuition and the math behind cosine similarity, a whole world of applications opens up — from building smarter search engines to improving RAG pipelines.

The best part? You now have the complete mental model and the working code to start building with it today.


What will you build with this technique?

Drop your ideas in the comments!


Further Reading


Thanks for reading! If you found this helpful, consider sharing it with your network.

Written with ❤️ for the NLP community

May 2026

Top comments (0)