When a search engine retrieves a document about automobiles in response to a query about cars, it is not matching text character by character. Somewhere beneath the interface, the system understands that these two words are semantically related. The mechanism behind that understanding is the word embedding — and once you see the geometry, you cannot unsee it.
This article walks through the key mathematical operations that make embeddings work: distance, similarity, arithmetic, scaling, and the dot product. Each concept is illustrated with concrete numerical vectors so the math is visible, not just described. Real embeddings typically use hundreds of dimensions; the 3- and 4-dimensional examples here preserve the essential structure while staying readable on a page.
1 · What is a Word Embedding?
A word embedding is a representation of a word as a vector — an ordered list of numbers — in a high-dimensional space. A typical embedding model might use 300 dimensions, so the word cat becomes a point with 300 coordinates. The key insight: the position of that point encodes meaning.
This is what researchers call a semantic space. Words with related meanings end up positioned close to each other. King and Queen live near each other. Paris and London live near each other. Bicycle and democracy live far apart.
Example: 4-dimensional vectors (simplified from real 300-dim embeddings)
vec("King") = [ 0.9, 0.7, 0.4, +0.6 ]
vec("Queen") = [ 0.9, 0.7, 0.4, -0.6 ]
vec("Man") = [ 0.5, 0.3, 0.1, +0.8 ]
vec("Woman") = [ 0.5, 0.3, 0.1, -0.8 ]
The first three dimensions encode royalty, authority, and age. The fourth dimension encodes gender: positive = masculine, negative = feminine.
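Here is the same toy example as a minimal Python sketch (NumPy is my choice of tool here, not something any particular embedding model prescribes); the later sections reuse these vectors:

```python
import numpy as np

# Toy 4-dimensional "embeddings" from the table above.
# Dims 1-3: royalty, authority, age. Dim 4: gender (+ masculine, - feminine).
king  = np.array([0.9, 0.7, 0.4,  0.6])
queen = np.array([0.9, 0.7, 0.4, -0.6])
man   = np.array([0.5, 0.3, 0.1,  0.8])
woman = np.array([0.5, 0.3, 0.1, -0.8])
```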
2 · The Geometry of Meaning: Distance and Similarity
Once words are points in space, we need a way to measure how close they are. Two approaches dominate: Euclidean distance and cosine similarity.
Temperature vectors (3 dimensions)
vec("Hot") = [ 1.0, 0.8, 0.6 ]
vec("Warm") = [ 0.8, 0.6, 0.4 ]
vec("Cold") = [-0.6, 0.4, -0.8 ]
2.1 Euclidean Distance
The straight-line gap between the tips of two vectors:
d(a, b) = √ Σᵢ (aᵢ − bᵢ)²
Worked example:
```
d(Hot, Warm) = √[(1.0−0.8)² + (0.8−0.6)² + (0.6−0.4)²]
             = √[0.04 + 0.04 + 0.04] = √0.12 ≈ 0.346   ← small: close together

d(Hot, Cold) = √[(1.0−(−0.6))² + (0.8−0.4)² + (0.6−(−0.8))²]
             = √[2.56 + 0.16 + 1.96] = √4.68 ≈ 2.163   ← large: far apart
```
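These values are easy to verify in code; a minimal sketch, assuming NumPy and the temperature vectors above:

```python
import numpy as np

hot  = np.array([ 1.0, 0.8,  0.6])
warm = np.array([ 0.8, 0.6,  0.4])
cold = np.array([-0.6, 0.4, -0.8])

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line gap: the norm of the difference vector."""
    return float(np.linalg.norm(a - b))

print(euclidean(hot, warm))  # ~0.346, small: close together
print(euclidean(hot, cold))  # ~2.163, large: far apart
```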
2.2 Cosine Similarity — The Industry Standard
In practice, NLP systems almost universally prefer cosine similarity. It ignores vector length entirely and focuses only on the angle between two vectors — vectors pointing in the same direction score 1.0 regardless of magnitude.
cos(θ) = (a · b) / (‖a‖ × ‖b‖)
Range: −1 (opposite) → 0 (orthogonal) → +1 (identical direction)
Worked example:
```
‖Hot‖  = √(1.0² + 0.8² + 0.6²)       = √2.00 ≈ 1.414
‖Warm‖ = √(0.8² + 0.6² + 0.4²)       = √1.16 ≈ 1.077
‖Cold‖ = √((−0.6)² + 0.4² + (−0.8)²) = √1.16 ≈ 1.077

dot(Hot, Warm) = (1.0)(0.8) + (0.8)(0.6) + (0.6)(0.4)   =  1.52
cos(Hot, Warm) = 1.52 / (1.414 × 1.077) ≈ +0.998   ← nearly identical direction

dot(Hot, Cold) = (1.0)(−0.6) + (0.8)(0.4) + (0.6)(−0.8) = −0.76
cos(Hot, Cold) = −0.76 / (1.414 × 1.077) ≈ −0.499  ← opposite directions
```
| Word Pair | Euclidean d | cos(θ) | Interpretation |
|---|---|---|---|
| Hot vs Warm | 0.346 | +0.998 | Nearly identical direction — closely related |
| Hot vs Cold | 2.163 | −0.499 | Opposite directions — antonyms |
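Both columns of the table can be reproduced in a few lines; a minimal sketch, again assuming NumPy and the temperature vectors:

```python
import numpy as np

hot  = np.array([ 1.0, 0.8,  0.6])
warm = np.array([ 0.8, 0.6,  0.4])
cold = np.array([-0.6, 0.4, -0.8])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(hot, warm))  # ~ +0.998: nearly identical direction
print(cosine(hot, cold))  # ~ -0.499: opposite directions
```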
3 · Vector Arithmetic: Meaning You Can Add and Subtract
Because words are vectors, you can perform arithmetic on them — and the results are semantically meaningful. The most famous example:
vec("King") − vec("Man") + vec("Woman") ≈ vec("Queen")
Worked example:
```
King  = [ 0.9, 0.7, 0.4, +0.6 ]
Man   = [ 0.5, 0.3, 0.1, +0.8 ]
Woman = [ 0.5, 0.3, 0.1, -0.8 ]

King − Man           = [ 0.4, 0.4, 0.3, -0.2 ]
(King − Man) + Woman = [ 0.9, 0.7, 0.4, -1.0 ]

d(result, Queen)   ≈ 0.400   ← nearest
d(result, Woman)   ≈ 0.671
d(result, King)    = 1.600
cos(result, Queen) ≈ 0.974   ← highest cosine similarity
```
What happened geometrically? Subtracting Man cancelled most of the masculine component, leaving the royalty structure intact. Adding Woman injected the feminine gender value (the fourth dimension slightly overshoots Queen's −0.6, landing at −1.0, which is why the distance is 0.4 rather than 0). The result sits 0.4 units from Queen — the nearest word in this vocabulary.
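In code, the whole analogy is a nearest-neighbour search over the vocabulary; a sketch with just these four words (a real model would scan hundreds of thousands):

```python
import numpy as np

vocab = {
    "King":  np.array([0.9, 0.7, 0.4,  0.6]),
    "Queen": np.array([0.9, 0.7, 0.4, -0.6]),
    "Man":   np.array([0.5, 0.3, 0.1,  0.8]),
    "Woman": np.array([0.5, 0.3, 0.1, -0.8]),
}

result = vocab["King"] - vocab["Man"] + vocab["Woman"]  # [0.9, 0.7, 0.4, -1.0]

# Rank every vocabulary word by Euclidean distance to the arithmetic result.
ranked = sorted(vocab, key=lambda w: np.linalg.norm(result - vocab[w]))
for word in ranked:
    print(word, round(float(np.linalg.norm(result - vocab[word])), 3))
# prints Queen first (0.4), then Woman, King, Man
```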
4 · Scalar Multiplication and Division: Changing Intensity
Multiplying or dividing a vector by a scalar changes its magnitude without changing its direction. This maps onto the idea of degree in language — Tiny, Large, and Gigantic all point in roughly the same semantic direction, at different intensities.
Size vectors (3 dimensions)
vec("Tiny") = [ 0.10, 0.20, 0.10 ]
vec("Large") = [ 0.50, 0.70, 0.40 ]
vec("Gigantic") = [ 1.10, 1.50, 0.90 ]
Worked example:
```
Large × 2       = [ 1.00, 1.40, 0.80 ]
vec("Gigantic") = [ 1.10, 1.50, 0.90 ]
d(Large × 2, Gigantic) ≈ 0.173   ← very close

Large × 0.2     = [ 0.10, 0.14, 0.08 ]
vec("Tiny")     = [ 0.10, 0.20, 0.10 ]
d(Large × 0.2, Tiny) ≈ 0.063     ← very close
```
Loud ÷ 2 lands near "Soft" — direction unchanged, intensity halved.
Key intuition: Scalar operations change how much of something a vector represents, without changing what kind of thing it represents.
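A quick sketch of both scalings, using the size vectors above:

```python
import numpy as np

tiny     = np.array([0.10, 0.20, 0.10])
large    = np.array([0.50, 0.70, 0.40])
gigantic = np.array([1.10, 1.50, 0.90])

# Scaling changes length (intensity) but not direction (kind of meaning).
print(np.linalg.norm(large * 2.0 - gigantic))  # ~0.173: lands near "Gigantic"
print(np.linalg.norm(large * 0.2 - tiny))      # ~0.063: lands near "Tiny"

# Direction is unchanged: cosine between Large and any positive multiple is 1.
doubled = large * 2.0
print(large @ doubled / (np.linalg.norm(large) * np.linalg.norm(doubled)))  # 1.0
```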
5 · The Dot Product: Agreement and Magnitude Together
a · b = Σᵢ (aᵢ × bᵢ) = a₁b₁ + a₂b₂ + … + aₙbₙ
The dot product captures two things simultaneously: directional agreement (the angle) and size (the magnitudes of the two vectors). Cosine similarity captures only the first.
Worked example — Very Loud vs A Little Loud:
vec("A Little Loud") = [ 0.30, 0.40, 0.20 ] |magnitude| = 0.539
vec("Very Loud") = [ 0.90, 1.20, 0.60 ] |magnitude| = 1.616
dot(AL, VL) = (0.3)(0.9) + (0.4)(1.2) + (0.2)(0.6) = 0.87
cos(AL, VL) = 0.87 / (0.539 × 1.616) ≈ 1.000 ← perfect alignment
AL · AL = 0.09 + 0.16 + 0.04 = 0.29
VL · VL = 0.81 + 1.44 + 0.36 = 2.61
| Vector | Magnitude ‖v‖ | cos(AL, VL) | v · v |
|---|---|---|---|
| A Little Loud | 0.539 | 1.000 | 0.29 |
| Very Loud | 1.616 | 1.000 | 2.61 |
Both are perfectly collinear — cosine similarity is 1.0 in both cases. But the dot products are 0.29 vs 2.61, a 9× difference. This is why recommendation systems and attention mechanisms in transformer models often prefer raw dot products: when you want to know not just whether a document is relevant but also how prominently it discusses a topic, the dot product gives you both signals at once.
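The contrast in code: same direction, very different dot products (a sketch, assuming the loudness vectors above):

```python
import numpy as np

a_little_loud = np.array([0.3, 0.4, 0.2])
very_loud     = np.array([0.9, 1.2, 0.6])   # exactly 3x a_little_loud

# Cosine sees only direction; the dot product also sees magnitude.
print(a_little_loud @ very_loud)      # 0.87
print(a_little_loud @ a_little_loud)  # 0.29
print(very_loud @ very_loud)          # 2.61, 9x larger in the same direction
```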
6 · Practical Applications
Search engines convert your query into a vector and retrieve documents whose vectors are nearest to it — using cosine similarity to rank by relevance regardless of exact word match. When you search for car insurance and get results about vehicle coverage, that's nearest-neighbour lookup in embedding space.
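A toy sketch of that lookup step. The query and document vectors here are invented for illustration, not produced by any real model:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.2, 0.4])   # e.g. the embedded query "car insurance"
documents = {
    "vehicle coverage explained": np.array([0.8, 0.3, 0.5]),
    "best pasta recipes":         np.array([0.1, 0.9, 0.1]),
}

# Rank documents by cosine similarity to the query: no shared words required.
for title in sorted(documents, key=lambda t: -cosine(query, documents[t])):
    print(title, round(cosine(query, documents[title]), 3))
```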
Recommendation systems represent your interests as a vector computed from your history, then find products whose vectors are closest. The dot product is particularly useful here: a highly-relevant item with a large magnitude will score higher than a mildly-relevant item even if they point in the same direction.
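A sketch of why magnitude matters here, with made-up vectors: two items point the same way as the user profile, but one has more of the topic "in" it.

```python
import numpy as np

user_profile = np.array([0.5, 0.5, 0.0])   # invented interest vector
item_strong  = np.array([0.9, 0.9, 0.0])   # heavily about the topic
item_mild    = np.array([0.3, 0.3, 0.0])   # same direction, less of it

# Identical cosine (1.0), but the dot product ranks the stronger item first.
print(user_profile @ item_strong)  # 0.9
print(user_profile @ item_mild)    # 0.3
```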
Large language models use the scaled dot product directly inside the attention mechanism. For every token, a query vector and a set of key vectors are compared via dot product to determine which parts of the context deserve attention — a direct descendant of the arithmetic in Section 5.
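A bare-bones sketch of that score computation for a single query vector (shapes and numbers are illustrative; real models add learned projections and run many heads in parallel):

```python
import numpy as np

d_k = 4                               # query/key dimensionality (illustrative)
rng = np.random.default_rng(0)
q = rng.random(d_k)                   # query vector for the current token
K = rng.random((5, d_k))              # key vectors for 5 context tokens

scores  = K @ q / np.sqrt(d_k)        # scaled dot product, one score per token
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
print(weights)   # sums to 1; a bigger dot product means more attention
```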
Quick Reference: Embedding Operations
| Operation | Formula | Result |
|---|---|---|
| Euclidean Distance | √(Σ(aᵢ−bᵢ)²) | d(Hot,Warm)=0.346 / d(Hot,Cold)=2.163 |
| Cosine Similarity | (a·b)/(‖a‖×‖b‖) | cos(Hot,Warm)=+0.998 / cos(Hot,Cold)=-0.499 |
| Vector Arithmetic | a ± b | King−Man+Woman → nearest Queen (d=0.400) |
| Scalar Multiplication | λ · a | Large×2 → near Gigantic / Loud÷2 → near Soft |
| Dot Product | a·b = Σaᵢbᵢ | cos=1.00 for both; dot 0.29 (A Little Loud) vs 2.61 (Very Loud) |