Tech Acronyms Reference
Quick reference for acronyms used in this article:
- ANN - Approximate Nearest Neighbor
- API - Application Programming Interface
- BERT - Bidirectional Encoder Representations from Transformers
- BOW - Bag of Words
- CBOW - Continuous Bag of Words
- CPU - Central Processing Unit
- GPU - Graphics Processing Unit
- GloVe - Global Vectors for Word Representation
- HNSW - Hierarchical Navigable Small World
- LLM - Large Language Model
- MiniLM - Mini Language Model (a compact model distilled from BERT-style transformers)
- MPNet - Masked and Permuted Pre-training Network
- NLP - Natural Language Processing
- RAM - Random Access Memory
- RAG - Retrieval-Augmented Generation
- ROI - Return on Investment
- TF-IDF - Term Frequency-Inverse Document Frequency
Mathematical/Statistical Terms:
- Cosine Similarity - Measures angle between vectors (direction similarity)
- Dot Product - Sum of element-wise multiplication of two vectors
- Euclidean Distance - Straight-line distance between two points
- Magnitude (Norm) - Length of a vector
- Vector - Array of numbers representing position in space
- Dimension - Number of values in a vector (e.g., 384-dimensional = 384 numbers)
Introduction: Teaching Machines to Understand Meaning
Here's a problem: computers only understand numbers.
Text is just characters. "Cat" means nothing to a Central Processing Unit (CPU). It's just three bytes: 67, 97, 116.
But somehow, Large Language Models (LLMs) know that "cat" is closer to "dog" than to "airplane." They know "king - man + woman = queen." They understand that "I love this movie" and "This film is amazing" mean similar things.
How?
Embeddings.
An embedding is a way to represent words, sentences, or documents as vectors (lists of numbers) in a high-dimensional space. Similar meanings = nearby vectors. Different meanings = distant vectors.
For a data engineer, embeddings are everywhere in modern Artificial Intelligence (AI) systems:
- Semantic search: Find documents by meaning, not keywords
- Recommendations: "Users who liked X also liked Y"
- Clustering: Group similar support tickets automatically
- Retrieval-Augmented Generation (RAG): Find relevant context for LLMs
- Anomaly detection: Identify outliers in text data
Understanding embeddings (how they work, how to compare them, and how to choose the right ones) is fundamental to building production AI systems.
Data Engineer's ROI Lens
For this article, we're focusing on:
- How do embeddings actually work? (From Word2Vec to transformer-based)
- How do I measure similarity? (Distance metrics and their trade-offs)
- How do I choose the right embedding model? (Dimensions, cost, performance)
These decisions directly impact search quality, storage costs, and inference latency at scale.
Part 1: From Words to Vectors
The Problem with Traditional Text Representation
Before embeddings, we had simpler approaches:
One-Hot Encoding:
Represent each word as a vector with a 1 in its position and 0s everywhere else:
Vocabulary: [cat, dog, bird, airplane]
"cat" β [1, 0, 0, 0]
"dog" β [0, 1, 0, 0]
"bird" β [0, 0, 1, 0]
"airplane" β [0, 0, 0, 1]
Problems:
- Huge vectors: 50,000-word vocabulary = 50,000-dimensional vectors
- No relationships: "cat" and "dog" are as different as "cat" and "airplane"
- Sparse: 99.99% zeros, wasting memory
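A minimal NumPy sketch of one-hot encoding (using just the four example words as the vocabulary). Note how every pair of distinct words ends up equally unrelated:
import numpy as np

vocab = ["cat", "dog", "bird", "airplane"]
# One row of the identity matrix per word = one-hot vector
one_hot = {word: np.eye(len(vocab), dtype=int)[i] for i, word in enumerate(vocab)}

print(one_hot["cat"])       # [1 0 0 0]
print(one_hot["airplane"])  # [0 0 0 1]

# Similarity between any two different words is always 0: no relationships encoded
print(np.dot(one_hot["cat"], one_hot["dog"]))       # 0
print(np.dot(one_hot["cat"], one_hot["airplane"]))  # 0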
Real-Life Analogy: The Phone Book Problem
One-hot encoding is like identifying people only by their phone number:
- Person A: 555-0001
- Person B: 555-0002
- Person C: 555-9999
You can tell they're different, but you can't tell that A and B are neighbors while C lives across town. The numbers don't encode any meaningful relationships.
Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF):
Count word frequencies in documents:
Document 1: "The cat sat on the mat"
BOW: {the: 2, cat: 1, sat: 1, on: 1, mat: 1}
Document 2: "The dog sat on the rug"
BOW: {the: 2, dog: 1, sat: 1, on: 1, rug: 1}
Better, but still problems:
- Word order lost: "Dog bites man" = "Man bites dog"
- No semantic understanding: "happy" and "joyful" are unrelated
- High-dimensional and sparse
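To make this concrete, here is a small, hedged sketch using scikit-learn's CountVectorizer and TfidfVectorizer on the two example documents (the same library used in the search example later; exact TF-IDF weights depend on the library version):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The cat sat on the mat", "The dog sat on the rug"]

# Bag of Words: raw counts per word, one column per vocabulary term
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # ['cat' 'dog' 'mat' 'on' 'rug' 'sat' 'the']
print(counts.toarray())             # [[1 0 1 1 0 1 2]
                                    #  [0 1 0 1 1 1 2]]

# TF-IDF: down-weights words that appear in every document (like "the")
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
Notice that "cat" and "dog" still share no signal here; the vectors only record which words occur, not what they mean.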
The Embedding Revolution: Word2Vec
In 2013, researchers at Google introduced Word2Vec, and everything changed.
The Core Idea:
Words that appear in similar contexts have similar meanings.
"The cat sat on the mat"
"The dog sat on the rug"
"Cat" and "dog" appear in similar contexts (after "The", before "sat"). They should have similar representations.
Real-Life Analogy 1: The Library Organization
Imagine a magical library where books physically move closer together based on their content:
- All mystery novels cluster together on one shelf
- All romance novels cluster together on another shelf
- Books that are "mystery with romance" sit between both clusters
- A book that's "mystery + romance + set in Paris" sits near the intersection of three sections
In this library, you can literally walk toward the type of book you want. Similar books are physically nearby.
Embeddings work the same way, but in 300-dimensional space instead of 3D physical space. Words with similar meanings cluster together.
Real-Life Analogy 2: You Are Who You Hang Out With
Imagine you know nothing about someone, but you observe who they spend time with:
- Person A hangs out at: dog parks, pet stores, veterinary clinics
- Person B hangs out at: dog parks, pet stores, grooming salons
- Person C hangs out at: airports, hotels, travel agencies
Without knowing anything else, you'd guess A and B have something in common (pet owners), while C is different (traveler).
Word2Vec does the same thing with words. It learns that "cat" and "dog" appear in similar contexts (near words like "pet", "fur", "food bowl"), so they get similar vector representations.
How Word2Vec Works
Training Approach 1: Continuous Bag of Words (CBOW)
Predict the middle word from surrounding context:
Context: "The ___ sat on the mat"
Predict: "cat"
Context: "I love my pet ___"
Predict: "dog" (or "cat", "hamster", etc.)
Training Approach 2: Skip-gram
Predict surrounding words from the middle word:
Given: "cat"
Predict: "The", "sat", "on", "mat" (words that appear nearby)
The Result:
After training on billions of words, Word2Vec produces dense vectors (typically 100-300 dimensions) where:
"cat" β [0.23, -0.45, 0.67, 0.12, ...] (300 numbers)
"dog" β [0.25, -0.43, 0.65, 0.14, ...] (similar!)
"airplane" β [-0.78, 0.91, -0.23, 0.56, ...] (different!)
The Famous Example: King - Man + Woman = Queen
Word2Vec captured something remarkable: semantic relationships as vector arithmetic.
# Vector math with meanings
king - man + woman ≈ queen
paris - france + italy ≈ rome
walking - walk + swim ≈ swimming
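A hedged sketch of this arithmetic using Gensim's downloader and pre-trained GloVe vectors (the "glove-wiki-gigaword-100" dataset name is an assumption; any reasonable pre-trained word-vector set shows the same effect, and the expected word is typically, not always, the top match):
import gensim.downloader as api

# Downloads roughly 130 MB of pre-trained 100-dimensional GloVe vectors on first run
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is usually the top result

# paris - france + italy ≈ ?
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
# "rome" usually appears at or near the top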
Real-Life Analogy: GPS Coordinates for Meaning
Think of embeddings as GPS coordinates in "meaning space":
- Paris: (48.8566° N, 2.3522° E), a capital city
- France: The country containing Paris
- Rome: (41.9028° N, 12.4964° E), another capital city
- Italy: The country containing Rome
The relationship "Paris is to France as Rome is to Italy" is encoded in the relative positions. The vector from France → Paris is similar to the vector from Italy → Rome.
Embeddings work the same way. "King" and "Queen" are offset by a "gender vector." "Paris" and "Rome" are offset by a "country vector."
GloVe: Global Vectors for Word Representation
Global Vectors for Word Representation (GloVe) (Stanford, 2014) improved on Word2Vec by using global co-occurrence statistics.
The Idea:
Instead of predicting context word-by-word, analyze the entire corpus at once:
- Count how often each word pair appears together
- Learn vectors that predict these co-occurrence counts
Example:
In a large corpus, "ice" and "cold" co-occur frequently, but "ice" and "hot" rarely do. GloVe learns vectors where:
similarity(ice, cold) > similarity(ice, hot)
Trade-off:
- Word2Vec: Faster to train, works well on smaller datasets
- GloVe: Better at capturing global patterns, pre-computed vectors available
The Limitation: One Word, One Vector
Both Word2Vec and GloVe have a critical limitation: each word gets exactly one vector.
"bank" β [0.34, -0.56, 0.78, ...] (always the same)
But "bank" has multiple meanings:
- "I went to the bank to deposit money" (financial institution)
- "The river bank was muddy" (edge of a river)
Same word, same vector, different meanings. This is a problem.
Real-Life Analogy: The Name Problem
Imagine everyone named "John" had the same profile:
- John the doctor
- John the musician
- John the criminal
One profile for all Johns. You'd lose critical information.
Word2Vec treats every "bank" the same, regardless of context.
Transformer-Based Embeddings: Context Matters
This is where Bidirectional Encoder Representations from Transformers (BERT) and modern transformers shine.
The Solution: Contextualized Embeddings
Instead of one vector per word, generate a different vector based on context:
sentence_1 = "I deposited money at the bank"
sentence_2 = "I sat by the river bank"
# Same word "bank", different embeddings!
bank_embedding_1 = [0.34, -0.56, 0.78, ...] # financial meaning
bank_embedding_2 = [0.12, 0.89, -0.34, ...] # river meaning
How It Works:
BERT processes the entire sentence with attention (we covered this in Article 2). Every word attends to every other word. The embedding for "bank" incorporates information from:
- "deposited", "money" β financial context
- "river", "sat" β nature context
Real-Life Analogy: Reading the Room
Static embeddings (Word2Vec) are like having a fixed personality:
- "I'm always the funny guy" (regardless of context)
Contextualized embeddings (BERT) are like reading the room:
- At a funeral: serious, respectful
- At a party: fun, energetic
- Same person, different behavior based on context
Code Example:
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Two sentences with "bank"
sentences = [
"I deposited money at the bank",
"I sat by the river bank"
]
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Get all token embeddings
    embeddings = outputs.last_hidden_state[0]
    tokens = tokenizer.tokenize(sentence)
    # Find "bank" token and its embedding
    bank_idx = tokens.index("bank")
    bank_embedding = embeddings[bank_idx + 1]  # +1 for [CLS] token
    print(f"Sentence: {sentence}")
    print(f"'bank' embedding (first 5 dims): {bank_embedding[:5].tolist()}")
    print()
Output:
Sentence: I deposited money at the bank
'bank' embedding (first 5 dims): [0.234, -0.567, 0.891, 0.123, -0.456]
Sentence: I sat by the river bank
'bank' embedding (first 5 dims): [0.789, 0.234, -0.345, 0.567, 0.012]
Different contexts, different embeddings. The model understands "bank" means different things.
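To quantify that difference, here is a hedged, self-contained sketch that repeats the setup above and compares the contextual "bank" vectors with cosine similarity (the third sentence is an added illustration; the exact scores depend on the model weights):
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Contextual embedding of the token "bank" in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    idx = tokenizer.tokenize(sentence).index("bank") + 1  # +1 to skip [CLS]
    return outputs.last_hidden_state[0, idx]

financial = bank_vector("I deposited money at the bank")
river = bank_vector("I sat by the river bank")
financial_2 = bank_vector("She opened a savings account at the bank")

cos = torch.nn.functional.cosine_similarity
print(cos(financial, financial_2, dim=0).item())  # higher: both financial senses
print(cos(financial, river, dim=0).item())        # noticeably lower: different senses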
Sentence and Document Embeddings
So far, we've discussed word embeddings. But what about entire sentences or documents?
Approach 1: Average Word Embeddings
Simple: average all word vectors in a sentence.
sentence = "The cat sat on the mat"
word_vectors = [embed("The"), embed("cat"), embed("sat"), embed("on"), embed("the"), embed("mat")]
sentence_embedding = average(word_vectors)
Problem: Loses word order. "Dog bites man" = "Man bites dog"
Approach 2: Use [CLS] Token (BERT)
BERT adds a special [CLS] token at the start. After processing, this token's embedding represents the entire sentence.
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)
# [CLS] token embedding = sentence representation
sentence_embedding = outputs.last_hidden_state[0, 0, :]
Better, but not optimized for similarity.
Approach 3: Sentence Transformers (Recommended)
Models specifically trained for sentence similarity:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The cat sat on the mat",
"A feline rested on the rug",
"Stock prices rose sharply"
]
embeddings = model.encode(sentences)
# embeddings[0] and embeddings[1] will be similar
# embeddings[2] will be different
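To check those claims, a short sketch using the library's built-in cosine helper (re-encoding the same three sentences so it runs on its own; exact scores vary with the model version):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Stock prices rose sharply"
]
embeddings = model.encode(sentences, convert_to_tensor=True)

scores = util.cos_sim(embeddings, embeddings)  # 3x3 matrix of pairwise similarities
print(f"cat vs feline:       {scores[0][1].item():.3f}")  # high (same meaning)
print(f"cat vs stock prices: {scores[0][2].item():.3f}")  # low (unrelated)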
Popular Sentence Transformer Models:
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General purpose, production |
| all-mpnet-base-v2 | 768 | Medium | Better | Higher quality needed |
| text-embedding-ada-002 (OpenAI) | 1536 | API call | Best | Premium quality, budget allows |
Part 2: Distance Metrics - Measuring Similarity
You have embeddings. Now how do you compare them?
But First: What IS a Vector? (Real-Life Intuition)
Before we talk about measuring distance, let's understand what we're actually measuring.
Real-Life Analogy: Your Daily Routine as a Vector
Imagine describing your entire day with just 5 numbers (a 5-dimensional vector):
You: [8, 2, 3, 1, 6]
- 8 = hours of sleep
- 2 = social interactions
- 3 = cups of coffee
- 1 = hours exercising
- 6 = hours of screen time
Your friend: [7, 5, 2, 2, 4]
Each number captures one aspect of your day. Together, they create a unique "signature" of how you spent your time.
In embeddings, it's the same concept:
- "cat" β 0.23, -0.45, 0.67, 0.12, ...
- "dog" β 0.25, -0.43, 0.65, 0.14, ...
Each dimension captures some linguistic feature (we don't know exactly whatβthe model learns this). Together, they represent the word's meaning.
Real-Life Analogy: "How Close Are Two Cities?"
There are multiple ways to answer this:
- Straight-line distance: 500 km (Euclidean)
- Driving distance: 650 km (accounts for roads)
- Direction similarity: "Both are north of here" (Cosine)
Different metrics for different purposes. Same with embeddings.
Cosine Similarity (Most Common)
What it measures: The angle between two vectors, ignoring magnitude.
Formula:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B = dot product (sum of element-wise multiplication)
- ||A|| = magnitude (length) of vector A
Range: -1 to +1
- +1 = identical direction (same meaning)
- 0 = perpendicular (unrelated)
- -1 = opposite direction (opposite meaning)
Real-Life Analogy 1: The Compass Direction
Imagine you and your friend are both walking in a city:
You: Walk 10 blocks north, 2 blocks east
Your friend: Walk 5 blocks north, 1 block east
You walked twice as far, but you're walking in the same direction (roughly northeast at the same angle).
Cosine similarity = 1.0 (same direction, distance doesn't matter)
This is crucial for text: A 50-word positive review and a 500-word positive review should be similar, even though one is 10x longer. Cosine focuses on the "direction" of sentiment, not the "magnitude" of word count.
Real-Life Analogy 2: Movie Preferences
Imagine rating 5 movie genres on a scale:
You: [Action: 5, Comedy: 2, Horror: 1, Romance: 1, Sci-Fi: 4]
Alice: [Action: 5, Comedy: 2, Horror: 1, Romance: 1, Sci-Fi: 4]
Bob: [Action: 1, Comedy: 1, Horror: 5, Romance: 2, Sci-Fi: 1]
You vs Alice: Cosine similarity ≈ 1.0 (identical preferences)
You vs Bob: Cosine similarity ≈ 0.46 (much lower: Bob loves the horror you avoid). Note that with all-positive ratings the score never drops below 0; a truly negative score needs vectors that point in opposing directions (for example, ratings centered around each person's average).
Now imagine Alice is a more enthusiastic reviewer:
You: [Action: 5, Comedy: 2, Horror: 1, Romance: 1, Sci-Fi: 4]
Alice: [Action: 10, Comedy: 4, Horror: 2, Romance: 2, Sci-Fi: 8]
Alice's numbers are all doubled (she rates more enthusiastically), but the pattern is identical.
Cosine similarity still ≈ 1.0 (same preferences, just different rating scale)
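A quick numeric check of this analogy, using a small helper that mirrors the cosine formula above (the ratings are the made-up ones from the story):
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

you = np.array([5, 2, 1, 1, 4])
alice = np.array([5, 2, 1, 1, 4])
alice_enthusiastic = alice * 2           # same pattern, doubled scale
bob = np.array([1, 1, 5, 2, 1])

print(f"{cosine_similarity(you, alice):.3f}")               # 1.000 (identical)
print(f"{cosine_similarity(you, alice_enthusiastic):.3f}")  # 1.000 (scale ignored)
print(f"{cosine_similarity(you, bob):.3f}")                 # ~0.46 (weak pattern match)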
Real-Life Analogy 3: Two Writers, Same Style
Writer A publishes a 200-word article about coffee:
- Positive sentiment: 80%
- Technical vocabulary: 20%
- Casual tone: 70%
Writer B publishes a 2,000-word article about coffee:
- Positive sentiment: 80%
- Technical vocabulary: 20%
- Casual tone: 70%
They're writing in the same style and sentiment, just different lengths.
Cosine similarity captures this: "These articles are similar in nature, even though one is 10x longer."
Why This Matters for Search:
When you search "best coffee shops in Brooklyn," you want results about Brooklyn coffee, whether it's:
- A 50-word tweet
- A 500-word blog post
- A 5,000-word comprehensive guide
Cosine similarity treats them all fairly if they're about the same topic. The length doesn't bias the similarity score.
Code Example:
import numpy as np
from numpy.linalg import norm
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))
# Example embeddings
cat = np.array([0.9, 0.2, 0.1])
dog = np.array([0.85, 0.25, 0.15])
airplane = np.array([0.1, 0.1, 0.95])
print(f"cat vs dog: {cosine_similarity(cat, dog):.3f}") # 0.987 (very similar)
print(f"cat vs airplane: {cosine_similarity(cat, airplane):.3f}") # 0.284 (different)
Euclidean Distance
What it measures: Straight-line distance between two points.
Formula:
euclidean_distance(A, B) = sqrt(Σ(A_i - B_i)²)
Range: 0 to infinity
- 0 = identical vectors
- Higher = more different
Real-Life Analogy 1: Walking in a City Grid
You're at the corner of 3rd Street and 5th Avenue.
Your friend is at the corner of 6th Street and 9th Avenue.
How far apart are you?
You: (3, 5)
Friend: (6, 9)
Distance = sqrt((6-3)² + (9-5)²)
= sqrt(3² + 4²)
= sqrt(9 + 16)
= sqrt(25)
= 5 blocks (diagonal)
This is Euclidean distance: the straight-line "as the crow flies" distance.
Real-Life Analogy 2: Finding Similar Houses
Imagine describing houses with 3 features:
House A: [Size: 2000 sqft, Price: $500K, Age: 10 years]
House B: [Size: 2100 sqft, Price: $520K, Age: 12 years]
House C: [Size: 5000 sqft, Price: $2M, Age: 50 years]
Distance from A to B:
sqrt((2100-2000)² + (520-500)² + (12-10)²)
= sqrt(100² + 20² + 2²)
= sqrt(10,000 + 400 + 4)
= sqrt(10,404)
= 102
Distance from A to C:
sqrt((5000-2000)² + (2000-500)² + (50-10)²)
= sqrt(3000² + 1500² + 40²)
= sqrt(9,000,000 + 2,250,000 + 1,600)
= sqrt(11,251,600)
= 3,354
House B is much closer to A (102) than House C is (3,354). Euclidean distance captures this: B is a similar house, C is dramatically different.
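The same calculation in NumPy (prices in thousands of dollars, as in the worked example above):
import numpy as np

# Features: [size in sqft, price in $K, age in years]
house_a = np.array([2000, 500, 10])
house_b = np.array([2100, 520, 12])
house_c = np.array([5000, 2000, 50])

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

print(f"A vs B: {euclidean(house_a, house_b):.0f}")  # ~102  (similar house)
print(f"A vs C: {euclidean(house_a, house_c):.0f}")  # ~3354 (very different house)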
Real-Life Analogy 3: Recipe Similarity
Two recipes for chocolate chip cookies:
Recipe A: [Flour: 2 cups, Sugar: 1 cup, Chocolate chips: 1 cup, Butter: 0.5 cups]
Recipe B: [Flour: 2.2 cups, Sugar: 1.1 cups, Chocolate chips: 1.2 cups, Butter: 0.6 cups]
Recipe C: [Flour: 0 cups, Sugar: 5 cups, Chocolate chips: 0 cups, Butter: 0 cups]
Recipe B is almost identical to A (all ingredients within 10-20% difference) → small Euclidean distance
Recipe C is wildly different (basically just sugar!) → large Euclidean distance
When Euclidean Works Well:
When all dimensions matter equally and magnitude is important:
- Clustering similar documents by multiple features
- Finding similar products with numeric attributes
- Grouping similar users by behavior metrics
When Euclidean Fails:
When magnitude shouldn't matter:
- Comparing a tweet (50 words) to an article (5,000 words) about the same topic
- Long reviews vs short reviews with the same sentiment
This is why cosine is preferred for textβit ignores length and focuses on content direction.
Code Example:
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))
print(f"cat vs dog: {euclidean_distance(cat, dog):.3f}") # 0.122 (close)
print(f"cat vs airplane: {euclidean_distance(cat, airplane):.3f}") # 1.241 (far)
When to Use Euclidean:
- When magnitude matters (not just direction)
- Lower-dimensional spaces
- Clustering tasks (k-means uses Euclidean; see the sketch below)
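A hedged sketch of that clustering use case: embed a few support tickets with Sentence Transformers, L2-normalize them, and run k-means (the ticket texts and cluster count are illustrative; normalizing first makes Euclidean k-means behave much like clustering by cosine):
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
tickets = [
    "I can't log into my account",
    "Password reset link not working",
    "I was charged twice this month",
    "Need a refund for my last invoice"
]

# L2-normalize so Euclidean distances reflect direction, not vector length
embeddings = normalize(model.encode(tickets))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(embeddings)
print(kmeans.labels_)  # e.g., [0 0 1 1]: login issues vs billing issues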
Dot Product
What it measures: Combination of similarity and magnitude.
Formula:
dot_product(A, B) = Σ(A_i × B_i)
Range: -infinity to +infinity (unbounded)
Real-Life Analogy 1: Agreement with Intensity
Imagine asking two friends to rate 5 topics (politics, sports, food, movies, music) from -5 (hate) to +5 (love):
You: [Politics: 4, Sports: -2, Food: 5, Movies: 3, Music: 4]
Friend A: [Politics: 3, Sports: -1, Food: 4, Movies: 2, Music: 3]
Friend B: [Politics: 1, Sports: 0, Food: 2, Movies: 1, Music: 1]
Dot Product (You × Friend A):
(4×3) + (-2×-1) + (5×4) + (3×2) + (4×3)
= 12 + 2 + 20 + 6 + 12
= 52
Dot Product (You × Friend B):
(4×1) + (-2×0) + (5×2) + (3×1) + (4×1)
= 4 + 0 + 10 + 3 + 4
= 21
Friend A has a dot product of 52, Friend B has 21. What does this mean?
Friend A: Not only agrees with you on what you like (politics, food, movies, music), but is equally enthusiastic. You both love food (you: 5, them: 4), you both moderate-like movies, you both dislike sports.
Friend B: Agrees with the general direction (likes what you like), but is less enthusiastic about everything. Lukewarm agreement.
The dot product captures both:
- Do you agree? (same sign = positive contribution)
- How strongly? (larger numbers = bigger contribution)
Real-Life Analogy 2: Work Collaboration Compatibility
Two colleagues rating their skill levels (1-10) in different areas:
Developer A: [Frontend: 9, Backend: 3, DevOps: 2, Design: 8, Testing: 5]
Developer B: [Frontend: 8, Backend: 2, DevOps: 1, Design: 9, Testing: 4]
Developer C: [Frontend: 2, Backend: 9, DevOps: 8, Design: 1, Testing: 7]
Dot Product (A × B): (9×8) + (3×2) + (2×1) + (8×9) + (5×4) = 72 + 6 + 2 + 72 + 20 = 172
Dot Product (A × C): (9×2) + (3×9) + (2×8) + (8×1) + (5×7) = 18 + 27 + 16 + 8 + 35 = 104
A and B are highly compatible (dot product = 172): both strong in frontend and design, both weak in backend/DevOps. They'd work great on a UI project together.
A and C are less compatible (dot product = 104): A's strengths (frontend, design) are C's weaknesses. They complement each other but wouldn't naturally gravitate to the same projects.
Real-Life Analogy 3: Investment Portfolio Alignment
Two investors' portfolios (% allocation):
Investor 1: [Tech: 50%, Healthcare: 20%, Energy: 10%, Real Estate: 15%, Bonds: 5%]
Investor 2: [Tech: 45%, Healthcare: 25%, Energy: 5%, Real Estate: 20%, Bonds: 5%]
Investor 3: [Tech: 5%, Healthcare: 5%, Energy: 50%, Real Estate: 10%, Bonds: 30%]
Dot Product (1 × 2):
(50×45) + (20×25) + (10×5) + (15×20) + (5×5)
= 2250 + 500 + 50 + 300 + 25
= 3,125
Dot Product (1 × 3):
(50×5) + (20×5) + (10×50) + (15×10) + (5×30)
= 250 + 100 + 500 + 150 + 150
= 1,150
Investors 1 and 2 have very similar strategies (dot product = 3,125): both heavily invested in tech and healthcare.
Investors 1 and 3 have very different strategies (dot product = 1,150): one is tech-focused, the other is energy-focused.
When to Use Dot Product:
- When vectors are already normalized (then it equals cosine similarity!)
- Speed-critical applications (no division needed)
- Attention mechanisms in transformers (faster computation)
- Recommendation systems where "intensity of preference" matters
Code Example:
def dot_product(a, b):
    return np.dot(a, b)
print(f"cat vs dog: {dot_product(cat, dog):.3f}") # 0.830
print(f"cat vs airplane: {dot_product(cat, airplane):.3f}") # 0.225
Which Metric to Choose?
| Metric | Best For | Normalized Vectors? | Handles Different Lengths? |
|---|---|---|---|
| Cosine Similarity | Text similarity, search | Doesn't matter | Yes |
| Euclidean Distance | Clustering, lower dimensions | Recommended | No |
| Dot Product | Speed, attention, recommendations | Required | No |
Rule of Thumb for Data Engineers:
- Semantic search / RAG: Use cosine similarity
- Clustering documents: Use Euclidean (after normalizing)
- Speed-critical applications: Use dot product with normalized vectors
Why Cosine Wins for Text:
Document length varies wildly. A 10-word tweet and a 1,000-word article about the same topic should be similar. Cosine ignores magnitude, focusing only on meaning direction.
Part 3: Dimensionality and Model Selection
The Dimension Trade-off
Embedding models produce vectors of different sizes:
| Model | Dimensions | Memory per 1M docs |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | 1.5 GB |
| all-mpnet-base-v2 | 768 | 3 GB |
| text-embedding-ada-002 | 1536 | 6 GB |
| text-embedding-3-large | 3072 | 12 GB |
More dimensions = More nuance, but higher cost
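A rough way to estimate those memory numbers yourself, assuming float32 vectors (4 bytes per dimension) and ignoring index overhead; this roughly reproduces the table above:
def embedding_storage_gb(num_docs, dimensions, bytes_per_value=4):
    # float32 vectors only; real vector DBs add index and metadata overhead
    return num_docs * dimensions * bytes_per_value / 1e9

for dims in (384, 768, 1536, 3072):
    print(f"{dims}d x 1M docs: {embedding_storage_gb(1_000_000, dims):.1f} GB")
# 384d x 1M docs: 1.5 GB ... 3072d x 1M docs: 12.3 GB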
Real-Life Analogy 1: Describing a Person
3 dimensions: "Tall, young, male"
- Fast to compare: "Are these two people similar?" β Check 3 things
- Misses a lot: Can't tell personality, interests, profession
- Many false matches: Lots of tall young males in the world
10 dimensions: "Tall, young, male, brown hair, glasses, friendly, engineer, likes hiking, vegetarian, from Brazil"
- More accurate matching
- Richer representation
- Still manageable to compare
100 dimensions: Add favorite music genres, political views, dietary restrictions, family history, education background, hobbies, work experience, language skills, travel preferences, social media behavior, spending habits, health metrics...
- Extremely nuanced representation
- Very accurate matching
- But slower to compare
- More storage needed
Real-Life Analogy 2: Restaurant Recommendations
Low Dimensions (5): [Price level, Cuisine type, Distance, Rating, Noise level]
- Fast: Check 5 attributes
- Decent recommendations
- Might miss nuance: Can't capture "romantic ambiance" vs "family-friendly"
Medium Dimensions (50): Add ambiance, service speed, portion size, parking availability, kid-friendly, date-night appropriate, group accommodations, dietary options, outdoor seating, bar scene, live music, view quality, reservation difficulty...
- Much better recommendations
- Captures subtle preferences
- Still reasonably fast
High Dimensions (500): Add every possible attribute including specific ingredients, chef background, wine list quality, gluten-free options, vegan menu size, Instagram-worthiness, bathroom cleanliness, WiFi speed, power outlet availability, lighting quality, chair comfort...
- Perfect representation
- But maybe overkill?
- Slower comparisons
- Expensive to store
Real-Life Analogy 3: Song Matching (Spotify-style)
Low Dimensions (20):
- Genre, tempo, energy, danceability, acousticness, instrumentalness, liveness, speechiness, valence (happiness), loudness...
- Works pretty well!
- Fast recommendations
- Might confuse similar-sounding songs from different moods
High Dimensions (200):
- Everything above PLUS: specific instruments used, vocal characteristics, production style, decade, subgenre nuances, lyrical themes, cultural context, similar artists, chord progressions, time signatures, dynamic range...
- Extremely accurate "you'll love this" recommendations
- Catches subtle similarities
- But is it worth 10x the storage and compute?
The Engineering Trade-off:
It's like choosing between:
- Low-res photo (384d): Loads instantly, small file, you can still recognize faces
- High-res photo (1536d): More detail, larger file, slower to load
- Ultra high-res (3072d): Every pore visible, huge file, slow
For most applications, you don't need ultra high-res. A good "medium-res" embedding (384-768 dimensions) captures 90-95% of the meaning while being 4-8x cheaper.
Cost Implications for Data Engineers
Scenario: Index 10 million documents for semantic search
| Model | Dimensions | Storage | Indexing Time | Query Latency |
|---|---|---|---|---|
| MiniLM (384d) | 384 | 15 GB | 2 hours | 5ms |
| MPNet (768d) | 768 | 30 GB | 4 hours | 12ms |
| Ada-002 (1536d) | 1536 | 60 GB | 8 hours | 25ms |
Plus API costs for generating embeddings:
| Model | Price per 1M tokens | 10M docs (~50M tokens) |
|---|---|---|
| all-MiniLM-L6-v2 | Free (local) | $0 (just compute) |
| text-embedding-ada-002 | $0.10 / 1M tokens | $5 |
| text-embedding-3-large | $0.13 / 1M tokens | $6.50 |
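The API figures are just (tokens ÷ 1M) × price, as in this small sketch (prices come from the table above and may change):
def api_embedding_cost(total_tokens, price_per_million_tokens):
    return total_tokens / 1_000_000 * price_per_million_tokens

tokens = 50_000_000  # ~10M documents, per the table above
print(f"ada-002: ${api_embedding_cost(tokens, 0.10):.2f}")  # $5.00
print(f"3-large: ${api_embedding_cost(tokens, 0.13):.2f}")  # $6.50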
Real ROI Example:
E-commerce company with 50M product descriptions:
Option A: OpenAI Ada-002 (1536d)
- Embedding cost: $25 (one-time)
- Storage: 300 GB
- Monthly vector DB cost: $500/month
- Query quality: Excellent
Option B: MiniLM (384d)
- Embedding cost: $0 (local GPU)
- Storage: 75 GB
- Monthly vector DB cost: $125/month
- Query quality: Good (95% as good for most queries)
Annual savings with Option B: $4,500 (75% cost reduction)
For many use cases, the quality difference is negligible, but the cost difference is significant.
Choosing the Right Model
Decision Framework:
START
│
├─ Need multilingual support?
│   ├─ YES → paraphrase-multilingual-MiniLM-L12-v2 (384d)
│   └─ NO → Continue
│
├─ Need highest possible quality?
│   ├─ YES → text-embedding-3-large (3072d) or Cohere embed-v3
│   └─ NO → Continue
│
├─ Running locally (no API costs)?
│   ├─ YES → all-MiniLM-L6-v2 (384d) or all-mpnet-base-v2 (768d)
│   └─ NO → Continue
│
├─ Budget-constrained production?
│   ├─ YES → text-embedding-ada-002 (1536d), a good balance
│   └─ NO → text-embedding-3-large (3072d)
│
└─ END
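If you want this logic in code, here is a small helper that mirrors the flowchart (the questions and model names come straight from the diagram; treat it as a starting point, not a definitive policy):
def choose_embedding_model(multilingual=False, highest_quality=False,
                           run_locally=False, budget_constrained=True):
    # Returns (model name, dimensions), following the decision tree above
    if multilingual:
        return "paraphrase-multilingual-MiniLM-L12-v2", 384
    if highest_quality:
        return "text-embedding-3-large", 3072
    if run_locally:
        return "all-MiniLM-L6-v2", 384  # or all-mpnet-base-v2 (768d) for higher quality
    if budget_constrained:
        return "text-embedding-ada-002", 1536
    return "text-embedding-3-large", 3072

print(choose_embedding_model(run_locally=True))  # ('all-MiniLM-L6-v2', 384)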
Popular Embedding Models Comparison
| Model | Provider | Dimensions | Best For |
|---|---|---|---|
| all-MiniLM-L6-v2 | HuggingFace | 384 | Fast, local, general purpose |
| all-mpnet-base-v2 | HuggingFace | 768 | Higher quality, local |
| text-embedding-ada-002 | OpenAI | 1536 | Production, via API |
| text-embedding-3-small | OpenAI | 1536 | Cheaper alternative to ada |
| text-embedding-3-large | OpenAI | 3072 | Highest quality OpenAI |
| embed-english-v3.0 | Cohere | 1024 | Production alternative |
| voyage-large-2 | Voyage AI | 1536 | Code and technical docs |
Practical Example: Building a Semantic Search System
from sentence_transformers import SentenceTransformer
import numpy as np
# 1. Choose model based on requirements
model = SentenceTransformer('all-MiniLM-L6-v2') # 384d, fast, local
# 2. Your document corpus
documents = [
"How to reset my password",
"I forgot my login credentials",
"Change account password steps",
"Billing and payment issues",
"Refund policy and returns",
"How to contact support"
]
# 3. Generate embeddings (do this once, store in vector DB)
doc_embeddings = model.encode(documents)
print(f"Embedding shape: {doc_embeddings.shape}") # (6, 384)
# 4. User query
query = "I can't log into my account"
query_embedding = model.encode([query])[0]
# 5. Find similar documents (cosine similarity)
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
# 6. Rank results
results = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
print(f"\nQuery: '{query}'\n")
print("Top matches:")
for doc, score in results[:3]:
    print(f"  {score:.3f}: {doc}")
Output:
Embedding shape: (6, 384)
Query: 'I can't log into my account'
Top matches:
0.823: I forgot my login credentials
0.756: How to reset my password
0.689: Change account password steps
The system found semantically similar documents even though "log into" doesn't appear in any of them. That's the power of embeddings.
Conclusion: Embeddings as the Foundation of Modern AI
Embeddings transform the fuzzy world of human language into precise mathematical space where machines can compute, compare, and reason.
The Business Impact:
These fundamentals directly control:
Cost:
- Dimension choice = storage and compute costs (384d vs 1536d = 4x difference)
- Local vs API = $0 vs $5+ per million documents
- Distance metric choice affects query latency at scale
Quality:
- Contextualized embeddings (BERT) vs static (Word2Vec) = better semantic understanding
- Model choice affects search relevance directly
- Wrong metric = wrong results
Performance:
- Lower dimensions = faster similarity computation
- Normalized vectors + dot product = maximum speed
- Pre-computing embeddings = fast query time
Key Takeaways for Data Engineers
On Embeddings:
- Embeddings convert text to dense vectors where similar meanings are nearby
- Static embeddings (Word2Vec, GloVe): one vector per word, fast, limited
- Contextualized embeddings (BERT): different vector per context, powerful
- Sentence Transformers: optimized for comparing sentences/documents
- Action: Use sentence transformers for search and similarity tasks
- ROI Impact: Choosing MiniLM over Ada-002 can save 75% on vector storage costs
On Distance Metrics:
- Cosine similarity: measures angle, ignores magnitude (best for text)
- Euclidean distance: measures straight-line distance (good for clustering)
- Dot product: fastest when vectors are normalized
- Action: Default to cosine similarity for semantic search
- ROI Impact: Wrong metric = irrelevant results = user churn
On Model Selection:
- More dimensions = more nuance = more cost
- 384d is sufficient for 90% of use cases
- Test before committing: run quality benchmarks on YOUR data
- Action: Start with MiniLM, upgrade only if quality metrics demand it
- ROI Impact: Right-sizing your embedding model saves thousands monthly at scale
The Embedding ROI Pattern
Every decision follows the same pattern:
- Understand the representation → How embeddings encode meaning
- Choose the right comparison → Cosine for text, Euclidean for clustering
- Right-size for your needs → Don't over-engineer dimensions
Real-World ROI Example:
A customer support platform processing 100K tickets/day:
Before optimization:
- OpenAI Ada-002 (1536d) for all embeddings
- 300 GB vector storage
- $800/month vector DB + $15/month API
- Query latency: 45ms
After understanding embeddings:
- MiniLM (384d) for ticket routing (95% of volume)
- Ada-002 only for complex semantic search (5% of volume)
- 80 GB vector storage
- $200/month vector DB + $2/month API
- Query latency: 12ms
Annual savings: $7,356 (75% cost reduction)
Latency improvement: 73% (45ms → 12ms)
This is why understanding embeddings matters. Not to implement them from scratch, but to make informed decisions that impact your bottom line.
Found this helpful? Share your experience with embedding models in production. What worked? What surprised you?