Vinicius Fagundes

Embeddings and Vector Similarity: How Machines Understand Meaning

πŸ“š Tech Acronyms Reference

Quick reference for acronyms used in this article:

  • ANN - Approximate Nearest Neighbor
  • API - Application Programming Interface
  • BERT - Bidirectional Encoder Representations from Transformers
  • BOW - Bag of Words
  • CBOW - Continuous Bag of Words
  • CPU - Central Processing Unit
  • GPU - Graphics Processing Unit
  • GloVe - Global Vectors for Word Representation
  • HNSW - Hierarchical Navigable Small World
  • LLM - Large Language Model
  • MiniLM - Mini Language Model (compact version of BERT)
  • MPNet - Masked and Permuted Pre-training Network
  • NLP - Natural Language Processing
  • RAM - Random Access Memory
  • RAG - Retrieval-Augmented Generation
  • ROI - Return on Investment
  • TF-IDF - Term Frequency-Inverse Document Frequency

Mathematical/Statistical Terms:

  • Cosine Similarity - Measures angle between vectors (direction similarity)
  • Dot Product - Sum of element-wise multiplication of two vectors
  • Euclidean Distance - Straight-line distance between two points
  • Magnitude (Norm) - Length of a vector
  • Vector - Array of numbers representing position in space
  • Dimension - Number of values in a vector (e.g., 384-dimensional = 384 numbers)

🎯 Introduction: Teaching Machines to Understand Meaning

Here's a problem: computers only understand numbers.

Text is just characters. "Cat" means nothing to a Central Processing Unit (CPU). It's just three bytes: 67, 97, 116.

But somehow, Large Language Models (LLMs) know that "cat" is closer to "dog" than to "airplane." They know "king - man + woman = queen." They understand that "I love this movie" and "This film is amazing" mean similar things.

How?

Embeddings.

An embedding is a way to represent words, sentences, or documents as vectors (lists of numbers) in a high-dimensional space. Similar meanings = nearby vectors. Different meanings = distant vectors.

As a data engineer, you'll find embeddings everywhere in modern Artificial Intelligence (AI) systems:

  • Semantic search: Find documents by meaning, not keywords
  • Recommendations: "Users who liked X also liked Y"
  • Clustering: Group similar support tickets automatically
  • Retrieval-Augmented Generation (RAG): Find relevant context for LLMs
  • Anomaly detection: Identify outliers in text data

Understanding embeddingsβ€”how they work, how to compare them, and how to choose the right onesβ€”is fundamental to building production AI systems.


πŸ’‘ Data Engineer's ROI Lens

For this article, we're focusing on:

  1. How do embeddings actually work? (From Word2Vec to transformer-based)
  2. How do I measure similarity? (Distance metrics and their trade-offs)
  3. How do I choose the right embedding model? (Dimensions, cost, performance)

These decisions directly impact search quality, storage costs, and inference latency at scale.


πŸ—ΊοΈ Part 1: From Words to Vectors

The Problem with Traditional Text Representation

Before embeddings, we had simpler approaches:

One-Hot Encoding:

Represent each word as a vector with a 1 in its position and 0s everywhere else:

Vocabulary: [cat, dog, bird, airplane]

"cat"      β†’ [1, 0, 0, 0]
"dog"      β†’ [0, 1, 0, 0]
"bird"     β†’ [0, 0, 1, 0]
"airplane" β†’ [0, 0, 0, 1]

Problems:

  • Huge vectors: 50,000-word vocabulary = 50,000-dimensional vectors
  • No relationships: "cat" and "dog" are as different as "cat" and "airplane" (see the sketch after this list)
  • Sparse: 99.99% zeros, wasting memory
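
A minimal NumPy sketch of the "no relationships" problem (toy vocabulary, purely illustrative): every pair of distinct one-hot vectors has cosine similarity 0, so "cat" is exactly as unrelated to "dog" as it is to "airplane".

import numpy as np

vocab = ["cat", "dog", "bird", "airplane"]

def one_hot(word):
    # A 1 at the word's index, 0 everywhere else
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot("cat"), one_hot("dog")))       # 0.0
print(cosine(one_hot("cat"), one_hot("airplane")))  # 0.0 -- no notion of "closer"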

Real-Life Analogy: The Phone Book Problem

One-hot encoding is like identifying people only by their phone number:

  • Person A: 555-0001
  • Person B: 555-0002
  • Person C: 555-9999

You can tell they're different, but you can't tell that A and B are neighbors while C lives across town. The numbers don't encode any meaningful relationships.

Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF):

Count word frequencies in documents:

Document 1: "The cat sat on the mat"
BOW: {the: 2, cat: 1, sat: 1, on: 1, mat: 1}

Document 2: "The dog sat on the rug"
BOW: {the: 2, dog: 1, sat: 1, on: 1, rug: 1}

Better, but still problems:

  • Word order lost: "Dog bites man" = "Man bites dog"
  • No semantic understanding: "happy" and "joyful" are unrelated (see the sketch after this list)
  • High-dimensional and sparse
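
A quick scikit-learn sketch of the "no semantic understanding" point (assuming scikit-learn is installed; the sentences are made up): "joyful" and "unhappy" both share zero terms with "happy", so TF-IDF scores the synonym and the near-opposite identically.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "I am happy today",
    "I am joyful today",    # synonym of doc 0
    "I am unhappy today",   # near-opposite of doc 0
]

tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

# Only the shared filler words ("am", "today") overlap, so both scores come out
# the same -- TF-IDF can't tell that "joyful" is closer in meaning than "unhappy".
print(sims[0, 1], sims[0, 2])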

The Embedding Revolution: Word2Vec

In 2013, researchers at Google introduced Word2Vec, and everything changed.

The Core Idea:

Words that appear in similar contexts have similar meanings.

"The cat sat on the mat"
"The dog sat on the rug"

"Cat" and "dog" appear in similar contexts (after "The", before "sat"). They should have similar representations.

Real-Life Analogy 1: The Library Organization

Imagine a magical library where books physically move closer together based on their content:

  • All mystery novels cluster together on one shelf
  • All romance novels cluster together on another shelf
  • Books that are "mystery with romance" sit between both clusters
  • A book that's "mystery + romance + set in Paris" sits near the intersection of three sections

In this library, you can literally walk toward the type of book you want. Similar books are physically nearby.

Embeddings work the same way, but in 300-dimensional space instead of 3D physical space. Words with similar meanings cluster together.

Real-Life Analogy 2: You Are Who You Hang Out With

Imagine you know nothing about someone, but you observe who they spend time with:

  • Person A hangs out at: dog parks, pet stores, veterinary clinics
  • Person B hangs out at: dog parks, pet stores, grooming salons
  • Person C hangs out at: airports, hotels, travel agencies

Without knowing anything else, you'd guess A and B have something in common (pet owners), while C is different (traveler).

Word2Vec does the same thing with words. It learns that "cat" and "dog" appear in similar contexts (near words like "pet", "fur", "food bowl"), so they get similar vector representations.

How Word2Vec Works

Training Approach 1: Continuous Bag of Words (CBOW)

Predict the middle word from surrounding context:

Context: "The ___ sat on the mat"
Predict: "cat"

Context: "I love my pet ___"
Predict: "dog" (or "cat", "hamster", etc.)

Training Approach 2: Skip-gram

Predict surrounding words from the middle word:

Given: "cat"
Predict: "The", "sat", "on", "mat" (words that appear nearby)

The Result:

After training on billions of words, Word2Vec produces dense vectors (typically 100-300 dimensions) where:

"cat"  β†’ [0.23, -0.45, 0.67, 0.12, ...]  (300 numbers)
"dog"  β†’ [0.25, -0.43, 0.65, 0.14, ...]  (similar!)
"airplane" β†’ [-0.78, 0.91, -0.23, 0.56, ...]  (different!)

The Famous Example: King - Man + Woman = Queen

Word2Vec captured something remarkable: semantic relationships as vector arithmetic.

# Vector math with meanings
king - man + woman β‰ˆ queen
paris - france + italy β‰ˆ rome
walking - walk + swim β‰ˆ swimming
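
You can try this arithmetic yourself with pretrained vectors. A minimal sketch using gensim's downloader, with the small pretrained GloVe vectors as a stand-in (the classic Word2Vec Google News vectors also work but are ~1.6 GB); this assumes gensim is installed, requires a one-time download, and the top result isn't always the textbook answer:

import gensim.downloader as api

# ~66 MB download on first run: 50-dimensional pretrained GloVe vectors
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# walking - walk + swim ≈ ?
print(vectors.most_similar(positive=["walking", "swim"], negative=["walk"], topn=3))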

Real-Life Analogy: GPS Coordinates for Meaning

Think of embeddings as GPS coordinates in "meaning space":

  • Paris: (48.8566Β° N, 2.3522Β° E) β€” a capital city
  • France: The country containing Paris
  • Rome: (41.9028Β° N, 12.4964Β° E) β€” another capital city
  • Italy: The country containing Rome

The relationship "Paris is to France as Rome is to Italy" is encoded in the relative positions. The vector from France β†’ Paris is similar to the vector from Italy β†’ Rome.

Embeddings work the same way. "King" and "Queen" are offset by a "gender vector." "Paris" and "Rome" are offset by a "country vector."

GloVe: Global Vectors for Word Representation

GloVe (Stanford, 2014) improved on Word2Vec by using global co-occurrence statistics.

The Idea:

Instead of predicting context word-by-word, analyze the entire corpus at once:

  • Count how often each word pair appears together
  • Learn vectors that predict these co-occurrence counts

Example:

In a large corpus, "ice" and "cold" co-occur frequently, but "ice" and "hot" rarely do. GloVe learns vectors where:

similarity(ice, cold) > similarity(ice, hot)

Trade-off:

  • Word2Vec: Faster to train, works well on smaller datasets
  • GloVe: Better at capturing global patterns, pre-computed vectors available

The Limitation: One Word, One Vector

Both Word2Vec and GloVe have a critical limitation: each word gets exactly one vector.

"bank" β†’ [0.34, -0.56, 0.78, ...] (always the same)

But "bank" has multiple meanings:

  • "I went to the bank to deposit money" (financial institution)
  • "The river bank was muddy" (edge of a river)

Same word, same vector, different meanings. This is a problem.

Real-Life Analogy: The Name Problem

Imagine everyone named "John" had the same profile:

  • John the doctor
  • John the musician
  • John the criminal

One profile for all Johns. You'd lose critical information.

Word2Vec treats every "bank" the same, regardless of context.

Transformer-Based Embeddings: Context Matters

This is where Bidirectional Encoder Representations from Transformers (BERT) and modern transformers shine.

The Solution: Contextualized Embeddings

Instead of one vector per word, generate a different vector based on context:

sentence_1 = "I deposited money at the bank"
sentence_2 = "I sat by the river bank"

# Same word "bank", different embeddings!
bank_embedding_1 = [0.34, -0.56, 0.78, ...]  # financial meaning
bank_embedding_2 = [0.12, 0.89, -0.34, ...]  # river meaning

How It Works:

BERT processes the entire sentence with attention (we covered this in Article 2). Every word attends to every other word. The embedding for "bank" incorporates information from:

  • "deposited", "money" β†’ financial context
  • "river", "sat" β†’ nature context

Real-Life Analogy: Reading the Room

Static embeddings (Word2Vec) are like having a fixed personality:

  • "I'm always the funny guy" (regardless of context)

Contextualized embeddings (BERT) are like reading the room:

  • At a funeral: serious, respectful
  • At a party: fun, energetic
  • Same person, different behavior based on context

Code Example:

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Two sentences with "bank"
sentences = [
    "I deposited money at the bank",
    "I sat by the river bank"
]

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Get all token embeddings
    embeddings = outputs.last_hidden_state[0]
    tokens = tokenizer.tokenize(sentence)

    # Find "bank" token and its embedding
    bank_idx = tokens.index("bank")
    bank_embedding = embeddings[bank_idx + 1]  # +1 for [CLS] token

    print(f"Sentence: {sentence}")
    print(f"'bank' embedding (first 5 dims): {bank_embedding[:5].tolist()}")
    print()

Output:

Sentence: I deposited money at the bank
'bank' embedding (first 5 dims): [0.234, -0.567, 0.891, 0.123, -0.456]

Sentence: I sat by the river bank
'bank' embedding (first 5 dims): [0.789, 0.234, -0.345, 0.567, 0.012]

Different contexts, different embeddings. The model understands "bank" means different things.
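
To put a number on it, you could compare the "bank" vectors directly. A self-contained sketch (it reloads the same model as above; raw BERT token similarities are noisy, so treat the exact values as illustrative, but the two financial uses of "bank" typically score closer to each other than to the river sense):

from transformers import BertTokenizer, BertModel
import torch
import torch.nn.functional as F

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Contextual embedding for the "bank" token in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[tokenizer.tokenize(sentence).index("bank") + 1]  # +1 skips [CLS]

money_bank = bank_vector("I deposited money at the bank")
river_bank = bank_vector("I sat by the river bank")
other_money_bank = bank_vector("She opened a savings account at the bank")

print(F.cosine_similarity(money_bank, river_bank, dim=0).item())        # typically lower
print(F.cosine_similarity(money_bank, other_money_bank, dim=0).item())  # typically higher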

Sentence and Document Embeddings

So far, we've discussed word embeddings. But what about entire sentences or documents?

Approach 1: Average Word Embeddings

Simple: average all word vectors in a sentence.

sentence = "The cat sat on the mat"
word_vectors = [embed("The"), embed("cat"), embed("sat"), embed("on"), embed("the"), embed("mat")]
sentence_embedding = average(word_vectors)

Problem: Loses word order. "Dog bites man" = "Man bites dog"
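
A toy sketch of exactly that problem (made-up 2-dimensional word vectors, purely illustrative): any permutation of the same words averages to the same point.

import numpy as np

word_vecs = {
    "dog":   np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man":   np.array([1.0, 1.0]),
}

def average_embedding(sentence):
    return np.mean([word_vecs[w] for w in sentence.split()], axis=0)

print(average_embedding("dog bites man"))  # ≈ [0.667, 0.667]
print(average_embedding("man bites dog"))  # ≈ [0.667, 0.667] -- identical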

Approach 2: Use [CLS] Token (BERT)

BERT adds a special [CLS] token at the start. After processing, this token's embedding represents the entire sentence.

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

# [CLS] token embedding = sentence representation
sentence_embedding = outputs.last_hidden_state[0, 0, :]

Better, but not optimized for similarity.

Approach 3: Sentence Transformers (Recommended)

Models specifically trained for sentence similarity:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Stock prices rose sharply"
]

embeddings = model.encode(sentences)

# embeddings[0] and embeddings[1] will be similar
# embeddings[2] will be different

Popular Sentence Transformer Models:

| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⚡ Fast | Good | General purpose, production |
| all-mpnet-base-v2 | 768 | Medium | Better | Higher quality needed |
| text-embedding-ada-002 (OpenAI) | 1536 | API call | Best | Premium quality, budget allows |

πŸ“ Part 2: Distance Metrics β€” Measuring Similarity

You have embeddings. Now how do you compare them?

But First: What IS a Vector? (Real-Life Intuition)

Before we talk about measuring distance, let's understand what we're actually measuring.

Real-Life Analogy: Your Daily Routine as a Vector

Imagine describing your entire day with just 5 numbers (a 5-dimensional vector):

You:        [8, 2, 3, 1, 6]
            β”‚  β”‚  β”‚  β”‚  └─ Hours of screen time
            β”‚  β”‚  β”‚  └──── Hours exercising
            β”‚  β”‚  └─────── Cups of coffee
            β”‚  └────────── Social interactions
            └───────────── Hours of sleep

Your friend: [7, 5, 2, 2, 4]

Each number captures one aspect of your day. Together, they create a unique "signature" of how you spent your time.

In embeddings, it's the same concept:

Each dimension captures some linguistic feature (we don't know exactly whatβ€”the model learns this). Together, they represent the word's meaning.

Real-Life Analogy: "How Close Are Two Cities?"

There are multiple ways to answer this:

  • Straight-line distance: 500 km (Euclidean)
  • Driving distance: 650 km (accounts for roads)
  • Direction similarity: "Both are north of here" (Cosine)

Different metrics for different purposes. Same with embeddings.
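
To make that concrete before defining each metric, here is the "daily routine" example from above in NumPy, measured three ways (each metric is explained in the sections that follow):

import numpy as np

you    = np.array([8, 2, 3, 1, 6])   # sleep, social, coffee, exercise, screen time
friend = np.array([7, 5, 2, 2, 4])

cosine    = you @ friend / (np.linalg.norm(you) * np.linalg.norm(friend))
euclidean = np.linalg.norm(you - friend)
dot       = you @ friend

print(f"cosine: {cosine:.3f}, euclidean: {euclidean:.1f}, dot: {dot}")
# cosine: 0.927, euclidean: 4.0, dot: 98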

Cosine Similarity (Most Common)

What it measures: The angle between two vectors, ignoring magnitude.

Formula:

cosine_similarity(A, B) = (A Β· B) / (||A|| Γ— ||B||)

Where:
- A Β· B = dot product (sum of element-wise multiplication)
- ||A|| = magnitude (length) of vector A

Range: -1 to +1

  • +1 = identical direction (same meaning)
  • 0 = perpendicular (unrelated)
  • -1 = opposite direction (opposite meaning)

Real-Life Analogy 1: The Compass Direction

Imagine you and your friend are both walking in a city:

You: Walk 10 blocks north, 2 blocks east
Your friend: Walk 5 blocks north, 1 block east

You walked twice as far, but you're walking in the same direction (roughly northeast at the same angle).

Cosine similarity = 1.0 (same direction, distance doesn't matter)

This is crucial for text: A 50-word positive review and a 500-word positive review should be similar, even though one is 10x longer. Cosine focuses on the "direction" of sentiment, not the "magnitude" of word count.

Real-Life Analogy 2: Movie Preferences

Imagine rating 5 movie genres on a scale:

You:   [Action: 5, Comedy: 2, Horror: 1, Romance: 1, Sci-Fi: 4]
Alice: [Action: 5, Comedy: 2, Horror: 1, Romance: 1, Sci-Fi: 4]
Bob:   [Action: 1, Comedy: 1, Horror: 5, Romance: 2, Sci-Fi: 1]

You vs Alice: Cosine similarity ≈ 1.0 (identical preferences)
You vs Bob: Cosine similarity ≈ 0.46 (a much weaker match: Bob loves the genre you avoid, and you love what he barely watches; note that with all-positive ratings the score can never actually dip below 0)

Now imagine Alice is a more enthusiastic reviewer:

You:   [Action: 5, Comedy: 2, Horror: 1, Romance: 1, Sci-Fi: 4]
Alice: [Action: 10, Comedy: 4, Horror: 2, Romance: 2, Sci-Fi: 8]

Alice's numbers are all doubled (she rates more enthusiastically), but the pattern is identical.

Cosine similarity still β‰ˆ 1.0 (same preferences, just different rating scale)

Real-Life Analogy 3: Two Writers, Same Style

Writer A publishes a 200-word article about coffee:

  • Positive sentiment: 80%
  • Technical vocabulary: 20%
  • Casual tone: 70%

Writer B publishes a 2,000-word article about coffee:

  • Positive sentiment: 80%
  • Technical vocabulary: 20%
  • Casual tone: 70%

They're writing in the same style and sentiment, just different lengths.

Cosine similarity captures this: "These articles are similar in nature, even though one is 10x longer."

Why This Matters for Search:

When you search "best coffee shops in Brooklyn," you want results about Brooklyn coffee, whether it's:

  • A 50-word tweet
  • A 500-word blog post
  • A 5,000-word comprehensive guide

Cosine similarity treats them all fairly if they're about the same topic. The length doesn't bias the similarity score.

Code Example:

import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# Example embeddings
cat = np.array([0.9, 0.2, 0.1])
dog = np.array([0.85, 0.25, 0.15])
airplane = np.array([0.1, 0.1, 0.95])

print(f"cat vs dog: {cosine_similarity(cat, dog):.3f}")      # 0.987 (very similar)
print(f"cat vs airplane: {cosine_similarity(cat, airplane):.3f}")  # 0.284 (different)

Euclidean Distance

What it measures: Straight-line distance between two points.

Formula:

euclidean_distance(A, B) = sqrt(Ξ£(A_i - B_i)Β²)

Range: 0 to infinity

  • 0 = identical vectors
  • Higher = more different

Real-Life Analogy 1: Walking in a City Grid

You're at the corner of 3rd Street and 5th Avenue.
Your friend is at the corner of 6th Street and 9th Avenue.

How far apart are you?

You:    (3, 5)
Friend: (6, 9)

Distance = sqrt((6-3)Β² + (9-5)Β²) 
        = sqrt(3Β² + 4Β²) 
        = sqrt(9 + 16) 
        = sqrt(25) 
        = 5 blocks (diagonal)

This is Euclidean distance: the straight-line "as the crow flies" distance.

Real-Life Analogy 2: Finding Similar Houses

Imagine describing houses with 3 features:

House A: [Size: 2000 sqft, Price: $500K, Age: 10 years]
House B: [Size: 2100 sqft, Price: $520K, Age: 12 years]
House C: [Size: 5000 sqft, Price: $2M, Age: 50 years]

Distance from A to B:

sqrt((2100-2000)Β² + (520-500)Β² + (12-10)Β²)
= sqrt(100Β² + 20Β² + 2Β²)
= sqrt(10,000 + 400 + 4)
= sqrt(10,404)
= 102

Distance from A to C:

sqrt((5000-2000)Β² + (2000-500)Β² + (50-10)Β²)
= sqrt(3000Β² + 1500Β² + 40Β²)
= sqrt(9,000,000 + 2,250,000 + 1,600)
= sqrt(11,251,600)
= 3,354

House B is much closer to A (102) than House C is (3,354). Euclidean distance captures this: B is a similar house, C is dramatically different.

Real-Life Analogy 3: Recipe Similarity

Compare three chocolate chip cookie recipes:

Recipe A: [Flour: 2 cups, Sugar: 1 cup, Chocolate chips: 1 cup, Butter: 0.5 cups]
Recipe B: [Flour: 2.2 cups, Sugar: 1.1 cups, Chocolate chips: 1.2 cups, Butter: 0.6 cups]
Recipe C: [Flour: 0 cups, Sugar: 5 cups, Chocolate chips: 0 cups, Butter: 0 cups]

Recipe B is almost identical to A (all ingredients within 10-20% difference) β†’ Small Euclidean distance

Recipe C is wildly different (basically just sugar!) β†’ Large Euclidean distance

When Euclidean Works Well:

When all dimensions matter equally and magnitude is important:

  • Clustering similar documents by multiple features
  • Finding similar products with numeric attributes
  • Grouping similar users by behavior metrics

When Euclidean Fails:

When magnitude shouldn't matter:

  • Comparing a tweet (50 words) to an article (5,000 words) about the same topic
  • Long reviews vs short reviews with the same sentiment

This is why cosine is preferred for textβ€”it ignores length and focuses on content direction.
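
A tiny sketch of that failure mode: take a "topic" vector and the same vector scaled up 10x as a stand-in for a much longer document on the same subject (real embeddings don't scale this neatly with length, but the geometry of the argument is the same). Euclidean calls them far apart; cosine calls them identical.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

tweet   = np.array([0.2, 0.9, 0.1])   # short text about a topic
article = 10 * tweet                  # "longer" text, same topic direction

print(np.linalg.norm(tweet - article))  # ≈ 8.3 -- Euclidean: "very different"
print(cosine(tweet, article))           # 1.0 -- cosine: "same topic"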

Code Example:

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

print(f"cat vs dog: {euclidean_distance(cat, dog):.3f}")      # 0.122 (close)
print(f"cat vs airplane: {euclidean_distance(cat, airplane):.3f}")  # 1.241 (far)

When to Use Euclidean:

  • When magnitude matters (not just direction)
  • Lower-dimensional spaces
  • Clustering tasks (k-means uses Euclidean)

Dot Product

What it measures: Combination of similarity and magnitude.

Formula:

dot_product(A, B) = Ξ£(A_i Γ— B_i)

Range: -infinity to +infinity (unbounded)

Real-Life Analogy 1: Agreement with Intensity

Imagine asking two friends to rate 5 topics (politics, sports, food, movies, music) from -5 (hate) to +5 (love):

You:     [Politics: 4, Sports: -2, Food: 5, Movies: 3, Music: 4]
Friend A: [Politics: 3, Sports: -1, Food: 4, Movies: 2, Music: 3]
Friend B: [Politics: 1, Sports: 0, Food: 2, Movies: 1, Music: 1]

Dot Product (You Γ— Friend A):

(4Γ—3) + (-2Γ—-1) + (5Γ—4) + (3Γ—2) + (4Γ—3)
= 12 + 2 + 20 + 6 + 12
= 52

Dot Product (You Γ— Friend B):

(4Γ—1) + (-2Γ—0) + (5Γ—2) + (3Γ—1) + (4Γ—1)
= 4 + 0 + 10 + 3 + 4
= 21

Friend A has a dot product of 52, Friend B has 21. What does this mean?

Friend A: Not only agrees with you on what you like (politics, food, movies, music), but is equally enthusiastic. You both love food (you: 5, them: 4), you both moderately like movies, and you both dislike sports.

Friend B: Agrees with the general direction (likes what you like), but is less enthusiastic about everything. Lukewarm agreement.

The dot product captures both:

  1. Do you agree? (same sign = positive contribution)
  2. How strongly? (larger numbers = bigger contribution)

Real-Life Analogy 2: Work Collaboration Compatibility

Two colleagues rating their skill levels (1-10) in different areas:

Developer A: [Frontend: 9, Backend: 3, DevOps: 2, Design: 8, Testing: 5]
Developer B: [Frontend: 8, Backend: 2, DevOps: 1, Design: 9, Testing: 4]
Developer C: [Frontend: 2, Backend: 9, DevOps: 8, Design: 1, Testing: 7]

Dot Product (A Γ— B): (9Γ—8) + (3Γ—2) + (2Γ—1) + (8Γ—9) + (5Γ—4) = 72 + 6 + 2 + 72 + 20 = 172

Dot Product (A Γ— C): (9Γ—2) + (3Γ—9) + (2Γ—8) + (8Γ—1) + (5Γ—7) = 18 + 27 + 16 + 8 + 35 = 104

A and B are highly compatible (dot product = 172): both strong in frontend and design, both weak in backend/DevOps. They'd work great on a UI project together.

A and C are less compatible (dot product = 104): A's strengths (frontend, design) are C's weaknesses. They complement each other but wouldn't naturally gravitate to the same projects.

Real-Life Analogy 3: Investment Portfolio Alignment

Two investors' portfolios (% allocation):

Investor 1: [Tech: 50%, Healthcare: 20%, Energy: 10%, Real Estate: 15%, Bonds: 5%]
Investor 2: [Tech: 45%, Healthcare: 25%, Energy: 5%, Real Estate: 20%, Bonds: 5%]
Investor 3: [Tech: 5%, Healthcare: 5%, Energy: 50%, Real Estate: 10%, Bonds: 30%]

Dot Product (1 Γ— 2):

(50Γ—45) + (20Γ—25) + (10Γ—5) + (15Γ—20) + (5Γ—5)
= 2250 + 500 + 50 + 300 + 25
= 3,125

Dot Product (1 Γ— 3):

(50Γ—5) + (20Γ—5) + (10Γ—50) + (15Γ—10) + (5Γ—30)
= 250 + 100 + 500 + 150 + 150
= 1,150

Investors 1 and 2 have very similar strategies (dot product = 3,125)β€”both heavily invested in tech and healthcare.

Investors 1 and 3 have very different strategies (dot product = 1,150)β€”one is tech-focused, the other is energy-focused.

When to Use Dot Product:

  • When vectors are already normalized (then it equals cosine similarity!)
  • Speed-critical applications (no division needed)
  • Attention mechanisms in transformers (faster computation)
  • Recommendation systems where "intensity of preference" matters

Code Example:

def dot_product(a, b):
    return np.dot(a, b)

print(f"cat vs dog: {dot_product(cat, dog):.3f}")      # 0.830
print(f"cat vs airplane: {dot_product(cat, airplane):.3f}")  # 0.225

Which Metric to Choose?

| Metric | Best For | Normalized Vectors? | Handles Different Lengths? |
|---|---|---|---|
| Cosine Similarity | Text similarity, search | Doesn't matter | ✅ Yes |
| Euclidean Distance | Clustering, lower dimensions | Recommended | ❌ No |
| Dot Product | Speed, attention, recommendations | Required | ❌ No |

Rule of Thumb for Data Engineers:

  • Semantic search / RAG: Use cosine similarity
  • Clustering documents: Use Euclidean (after normalizing)
  • Speed-critical applications: Use dot product with normalized vectors (see the sketch below)
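
A minimal NumPy check of that last shortcut, using the same made-up cat/dog vectors as earlier: normalize once at indexing time, and each query only needs a dot product.

import numpy as np

a = np.array([0.9, 0.2, 0.1])     # "cat"
b = np.array([0.85, 0.25, 0.15])  # "dog"

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

print(f"{cosine:.3f}")            # 0.996
print(f"{np.dot(a_n, b_n):.3f}")  # 0.996 -- same number, no norms needed at query time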

Why Cosine Wins for Text:

Document length varies wildly. A 10-word tweet and a 1,000-word article about the same topic should be similar. Cosine ignores magnitude, focusing only on meaning direction.


πŸ“Š Part 3: Dimensionality and Model Selection

The Dimension Trade-off

Embedding models produce vectors of different sizes:

| Model | Dimensions | Memory per 1M docs |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | 1.5 GB |
| all-mpnet-base-v2 | 768 | 3 GB |
| text-embedding-ada-002 | 1536 | 6 GB |
| text-embedding-3-large | 3072 | 12 GB |

More dimensions = More nuance, but higher cost

Real-Life Analogy 1: Describing a Person

3 dimensions: "Tall, young, male"

  • Fast to compare: "Are these two people similar?" β†’ Check 3 things
  • Misses a lot: Can't tell personality, interests, profession
  • Many false matches: Lots of tall young males in the world

10 dimensions: "Tall, young, male, brown hair, glasses, friendly, engineer, likes hiking, vegetarian, from Brazil"

  • More accurate matching
  • Richer representation
  • Still manageable to compare

100 dimensions: Add favorite music genres, political views, dietary restrictions, family history, education background, hobbies, work experience, language skills, travel preferences, social media behavior, spending habits, health metrics...

  • Extremely nuanced representation
  • Very accurate matching
  • But slower to compare
  • More storage needed

Real-Life Analogy 2: Restaurant Recommendations

Low Dimensions (5): [Price level, Cuisine type, Distance, Rating, Noise level]

  • Fast: Check 5 attributes
  • Decent recommendations
  • Might miss nuance: Can't capture "romantic ambiance" vs "family-friendly"

Medium Dimensions (50): Add ambiance, service speed, portion size, parking availability, kid-friendly, date-night appropriate, group accommodations, dietary options, outdoor seating, bar scene, live music, view quality, reservation difficulty...

  • Much better recommendations
  • Captures subtle preferences
  • Still reasonably fast

High Dimensions (500): Add every possible attribute including specific ingredients, chef background, wine list quality, gluten-free options, vegan menu size, Instagram-worthiness, bathroom cleanliness, WiFi speed, power outlet availability, lighting quality, chair comfort...

  • Perfect representation
  • But maybe overkill?
  • Slower comparisons
  • Expensive to store

Real-Life Analogy 3: Song Matching (Spotify-style)

Low Dimensions (20):

  • Genre, tempo, energy, danceability, acousticness, instrumentalness, liveness, speechiness, valence (happiness), loudness...
  • Works pretty well!
  • Fast recommendations
  • Might confuse similar-sounding songs from different moods

High Dimensions (200):

  • Everything above PLUS: specific instruments used, vocal characteristics, production style, decade, subgenre nuances, lyrical themes, cultural context, similar artists, chord progressions, time signatures, dynamic range...
  • Extremely accurate "you'll love this" recommendations
  • Catches subtle similarities
  • But is it worth 10x the storage and compute?

The Engineering Trade-off:

It's like choosing between:

  • Low-res photo (384d): Loads instantly, small file, you can still recognize faces
  • High-res photo (1536d): More detail, larger file, slower to load
  • Ultra high-res (3072d): Every pore visible, huge file, slow

For most applications, you don't need ultra high-res. A good "medium-res" embedding (384-768 dimensions) captures 90-95% of the meaning while being 4-8x cheaper.

Cost Implications for Data Engineers

Scenario: Index 10 million documents for semantic search

| Model | Dimensions | Storage | Indexing Time | Query Latency |
|---|---|---|---|---|
| MiniLM (384d) | 384 | 15 GB | 2 hours | 5ms |
| MPNet (768d) | 768 | 30 GB | 4 hours | 12ms |
| Ada-002 (1536d) | 1536 | 60 GB | 8 hours | 25ms |

Plus API costs for generating embeddings:

| Model | Price per 1M tokens | 10M docs (~50M tokens) |
|---|---|---|
| all-MiniLM-L6-v2 | Free (local) | $0 (just compute) |
| text-embedding-ada-002 | $0.10 | $5 |
| text-embedding-3-large | $0.13 | $6.50 |
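
Those figures are easy to sanity-check. A back-of-the-envelope sketch, assuming float32 vectors (4 bytes per dimension), no index overhead, and the rough ~5 tokens per document implied by the table above (real documents are usually longer, so scale accordingly):

def embedding_costs(num_docs, dims, tokens_per_doc=5, price_per_million_tokens=0.0):
    """Rough storage (GB) and one-time embedding API cost for a corpus."""
    storage_gb = num_docs * dims * 4 / 1e9
    api_cost = num_docs * tokens_per_doc / 1e6 * price_per_million_tokens
    return storage_gb, api_cost

# 10M docs: MiniLM (384d, local) vs Ada-002 (1536d, $0.10 per 1M tokens)
print(embedding_costs(10_000_000, 384))                                  # (~15.4 GB, $0.0)
print(embedding_costs(10_000_000, 1536, price_per_million_tokens=0.10))  # (~61.4 GB, $5.0)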

Real ROI Example:

E-commerce company with 50M product descriptions:

Option A: OpenAI Ada-002 (1536d)

  • Embedding cost: $25 (one-time)
  • Storage: 300 GB
  • Monthly vector DB cost: $500/month
  • Query quality: Excellent

Option B: MiniLM (384d)

  • Embedding cost: $0 (local GPU)
  • Storage: 75 GB
  • Monthly vector DB cost: $125/month
  • Query quality: Good (95% as good for most queries)

Annual savings with Option B: $4,500 (75% cost reduction)

For many use cases, the quality difference is negligible, but the cost difference is significant.

Choosing the Right Model

Decision Framework:

START
  β”‚
  β”œβ”€ Need multilingual support?
  β”‚    β”œβ”€ YES β†’ paraphrase-multilingual-MiniLM-L12-v2 (384d)
  β”‚    └─ NO β†’ Continue
  β”‚
  β”œβ”€ Need highest possible quality?
  β”‚    β”œβ”€ YES β†’ text-embedding-3-large (3072d) or Cohere embed-v3
  β”‚    └─ NO β†’ Continue
  β”‚
  β”œβ”€ Running locally (no API costs)?
  β”‚    β”œβ”€ YES β†’ all-MiniLM-L6-v2 (384d) or all-mpnet-base-v2 (768d)
  β”‚    └─ NO β†’ Continue
  β”‚
  β”œβ”€ Budget-constrained production?
  β”‚    β”œβ”€ YES β†’ text-embedding-ada-002 (1536d) β€” good balance
  β”‚    └─ NO β†’ text-embedding-3-large (3072d)
  β”‚
  └─ END
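
The same tree as a small helper function, in case you want to encode your own constraints in code (model names taken from the table below; adjust to your requirements):

def choose_embedding_model(multilingual=False, max_quality=False,
                           local=False, budget_tight=True):
    """Toy encoding of the decision framework above."""
    if multilingual:
        return "paraphrase-multilingual-MiniLM-L12-v2"   # 384d
    if max_quality:
        return "text-embedding-3-large"                  # 3072d (or Cohere embed-v3)
    if local:
        return "all-MiniLM-L6-v2"                        # 384d (all-mpnet-base-v2 for 768d)
    return "text-embedding-ada-002" if budget_tight else "text-embedding-3-large"

print(choose_embedding_model(local=True))  # all-MiniLM-L6-v2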

Popular Embedding Models Comparison

| Model | Provider | Dimensions | Best For |
|---|---|---|---|
| all-MiniLM-L6-v2 | HuggingFace | 384 | Fast, local, general purpose |
| all-mpnet-base-v2 | HuggingFace | 768 | Higher quality, local |
| text-embedding-ada-002 | OpenAI | 1536 | Production, via API |
| text-embedding-3-small | OpenAI | 1536 | Cheaper alternative to ada |
| text-embedding-3-large | OpenAI | 3072 | Highest quality OpenAI |
| embed-english-v3.0 | Cohere | 1024 | Production alternative |
| voyage-large-2 | Voyage AI | 1536 | Code and technical docs |

Practical Example: Building a Semantic Search System

from sentence_transformers import SentenceTransformer
import numpy as np

# 1. Choose model based on requirements
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384d, fast, local

# 2. Your document corpus
documents = [
    "How to reset my password",
    "I forgot my login credentials",
    "Change account password steps",
    "Billing and payment issues",
    "Refund policy and returns",
    "How to contact support"
]

# 3. Generate embeddings (do this once, store in vector DB)
doc_embeddings = model.encode(documents)
print(f"Embedding shape: {doc_embeddings.shape}")  # (6, 384)

# 4. User query
query = "I can't log into my account"
query_embedding = model.encode([query])[0]

# 5. Find similar documents (cosine similarity)
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# 6. Rank results
results = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)

print(f"\nQuery: '{query}'\n")
print("Top matches:")
for doc, score in results[:3]:
    print(f"  {score:.3f}: {doc}")

Output:

Embedding shape: (6, 384)

Query: 'I can't log into my account'

Top matches:
  0.823: I forgot my login credentials
  0.756: How to reset my password
  0.689: Change account password steps

The system found semantically similar documents even though "log into" doesn't appear in any of them. That's the power of embeddings.


🎯 Conclusion: Embeddings as the Foundation of Modern AI

Embeddings transform the fuzzy world of human language into precise mathematical space where machines can compute, compare, and reason.

The Business Impact:

These fundamentals directly control:

πŸ’° Cost:

  • Dimension choice = storage and compute costs (384d vs 1536d = 4x difference)
  • Local vs API = $0 vs $5+ in embedding fees for the corpora above
  • Distance metric choice affects query latency at scale

πŸ“Š Quality:

  • Contextualized embeddings (BERT) vs static (Word2Vec) = better semantic understanding
  • Model choice affects search relevance directly
  • Wrong metric = wrong results

⚑ Performance:

  • Lower dimensions = faster similarity computation
  • Normalized vectors + dot product = maximum speed
  • Pre-computing embeddings = fast query time

Key Takeaways for Data Engineers

On Embeddings:

  • Embeddings convert text to dense vectors where similar meanings are nearby
  • Static embeddings (Word2Vec, GloVe): one vector per word, fast, limited
  • Contextualized embeddings (BERT): different vector per context, powerful
  • Sentence Transformers: optimized for comparing sentences/documents
  • Action: Use sentence transformers for search and similarity tasks
  • ROI Impact: Choosing MiniLM over Ada-002 can save 75% on vector storage costs

On Distance Metrics:

  • Cosine similarity: measures angle, ignores magnitude (best for text)
  • Euclidean distance: measures straight-line distance (good for clustering)
  • Dot product: fastest when vectors are normalized
  • Action: Default to cosine similarity for semantic search
  • ROI Impact: Wrong metric = irrelevant results = user churn

On Model Selection:

  • More dimensions = more nuance = more cost
  • 384d is sufficient for 90% of use cases
  • Test before committing: run quality benchmarks on YOUR data
  • Action: Start with MiniLM, upgrade only if quality metrics demand it
  • ROI Impact: Right-sizing your embedding model saves thousands monthly at scale

The Embedding ROI Pattern

Every decision follows the same pattern:

  1. Understand the representation β†’ How embeddings encode meaning
  2. Choose the right comparison β†’ Cosine for text, Euclidean for clustering
  3. Right-size for your needs β†’ Don't over-engineer dimensions

Real-World ROI Example:

A customer support platform processing 100K tickets/day:

Before optimization:

  • OpenAI Ada-002 (1536d) for all embeddings
  • 300 GB vector storage
  • $800/month vector DB + $15/month API
  • Query latency: 45ms

After understanding embeddings:

  • MiniLM (384d) for ticket routing (95% of volume)
  • Ada-002 only for complex semantic search (5% of volume)
  • 80 GB vector storage
  • $200/month vector DB + $2/month API
  • Query latency: 12ms

Annual savings: $7,356 (75% cost reduction)
Latency improvement: 73% (45ms β†’ 12ms)

This is why understanding embeddings matters. Not to implement them from scratchβ€”but to make informed decisions that impact your bottom line.


Found this helpful? Share your experience with embedding models in production. What worked? What surprised you?
