How Computers Understand Meaning: The Math Behind Embeddings
Here's a question that sounds impossible.
How do you teach a computer that "king" and "queen" are related, but "king" and "sandwich" are not?
You can't use a dictionary — dictionaries give definitions, not relationships. You can't hard-code every connection — there are too many words, too many meanings, too many ways context shifts what something means. So how do modern AI systems do it?
The answer is embeddings. And once you understand them, almost everything in AI starts to make sense.
Words as Coordinates
An embedding is a list of numbers. A vector. Every word, sentence, image, or piece of code gets mapped to a point in space.
But the power isn't in any single number — it's in the geometry. Words with similar meanings end up close together in this space. Words with unrelated meanings end up far apart. "Dog" and "puppy" are neighbors. "Dog" and "democracy" are distant.
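Here's that idea as a runnable sketch. The vectors below are tiny, hand-invented stand-ins (real embeddings have hundreds of dimensions learned from data), but the measurement is the real one: cosine similarity, which compares the direction of two vectors.

```python
from math import sqrt

# Toy 4-dimensional "embeddings", invented for illustration only.
# Real models learn hundreds of dimensions from data.
vectors = {
    "dog":       [0.90, 0.80, 0.10, 0.00],
    "puppy":     [0.85, 0.90, 0.05, 0.10],
    "democracy": [0.00, 0.10, 0.90, 0.80],
}

def cosine(a, b):
    """Cosine similarity: near 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["dog"], vectors["puppy"]))      # high: neighbors
print(cosine(vectors["dog"], vectors["democracy"]))  # low: distant
```

Everything that follows in this post — search, RAG, recommendations — is ultimately a variation on this one comparison.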
The question is: who decides which numbers go where? Nobody. You let the data decide.
The Embarrassingly Simple Training Trick
The method is almost anticlimactic in its simplicity. Take a massive pile of text — Wikipedia, books, the internet — and give a model one job: predict the missing word.
"The cat sat on the ___." What goes there? Mat. Floor. Rug. The model guesses, checks the real answer, and adjusts its numbers. Do this billions of times across billions of sentences, and something remarkable emerges.
Words that appear in similar contexts get pushed toward similar coordinates. "Dog" and "puppy" keep showing up near the same words — "bark," "leash," "treats" — so their vectors drift together. "Dog" and "airplane" almost never share context, so they drift apart.
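You can see the raw signal with nothing but counting. The sketch below (toy corpus invented for illustration; real embeddings come from prediction-trained networks, not raw counts) tallies which words appear near each target word — its "company":

```python
from collections import Counter

# A tiny invented corpus. Real training uses billions of sentences.
corpus = [
    "the dog chewed the leash",
    "the puppy chewed the leash",
    "the dog wanted treats",
    "the puppy wanted treats",
    "the airplane left the runway",
]

def context_counts(word, window=2):
    """Count the words appearing within `window` positions of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == word:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts

print(context_counts("dog"))       # leash-and-treats company
print(context_counts("puppy"))     # the same company
print(context_counts("airplane"))  # entirely different company
```

"Dog" and "puppy" end up with identical context profiles; "airplane" shares almost nothing. A model trained to predict missing words is, in effect, compressing exactly these co-occurrence patterns into coordinates.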
No one programmed these relationships. They emerged automatically from patterns in raw text. This idea has a name: distributional semantics. You shall know a word by the company it keeps.
The Geometry Gets Weird (in the Best Way)
Here's where things stop being just useful and become genuinely strange.
These vectors don't just cluster by similarity. They encode relationships as directions.
Take the vector for "king." Subtract "man." Add "woman." You land almost exactly on "queen."
That's not a coincidence or a party trick. The direction from "man" to "woman" captures the concept of gender — and that same direction works everywhere. "Uncle" minus "man" plus "woman" lands near "aunt." "Boy" → "girl." "He" → "she." The embedding space has converted an abstract concept into geometry.
Gender is a direction. Verb tense is a direction. Country-to-capital is a direction. The model never studied grammar or geography. It just found these structures by predicting missing words, billions of times.
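The arithmetic itself is ordinary vector math. In the sketch below the embeddings are hand-crafted so the "gender direction" is consistent — real models learn this structure; these three dimensions are invented — but the king − man + woman computation is exactly the famous one:

```python
# Toy embeddings, hand-crafted so the gender direction is consistent.
# Real models learn this structure; the numbers here are invented.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
    "uncle": [0.5, 0.9, 0.1],
    "aunt":  [0.5, 0.1, 0.9],
}

def nearest(v, exclude=()):
    """Return the vocabulary word closest (squared Euclidean) to v."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min((w for w in vectors if w not in exclude),
               key=lambda w: dist(vectors[w], v))

# king - man + woman  ->  queen
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(target, exclude=("king", "man", "woman")))  # -> queen
```

The same offset works for other pairs: uncle − man + woman lands on "aunt" in this toy space too, which is the point — the relationship is one reusable direction, not a per-word rule.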
And these aren't two-dimensional or three-dimensional spaces where you could draw the picture. Real embeddings use hundreds of dimensions — 768, 1,536, sometimes more. Each dimension captures some subtle feature of meaning that no human explicitly defined. We can't visualize it, but the math works the same: distance equals similarity, direction equals relationship.
What You Actually Do with Vectors
So now you have words as points in high-dimensional space. What's that good for?
Semantic search. Not keyword matching — meaning matching. You type "how to fix a slow website." Traditional search looks for those exact words. Embedding-based search converts your query into a vector and finds the closest vectors in the database. It might surface a document about "optimizing page load performance" even though it shares zero words with your query. Same meaning, different words.
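At its core that's a nearest-neighbor lookup over vectors. A minimal sketch, assuming the embeddings are already computed (in a real system an embedding model produces them; the documents and numbers below are invented):

```python
from math import sqrt

# Pre-computed toy document embeddings, invented for illustration.
doc_vectors = {
    "optimizing page load performance": [0.90, 0.80, 0.10],
    "recipes for sourdough bread":      [0.10, 0.00, 0.90],
    "css flexbox layout guide":         [0.60, 0.20, 0.30],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def search(query_vector, top_k=1):
    """Rank documents by cosine similarity to the query vector."""
    ranked = sorted(doc_vectors,
                    key=lambda d: cosine(doc_vectors[d], query_vector),
                    reverse=True)
    return ranked[:top_k]

# Pretend this is the embedding of "how to fix a slow website".
query = [0.85, 0.75, 0.05]
print(search(query))  # the performance doc ranks first, zero shared words
```

Production systems swap the `sorted` call for an approximate nearest-neighbor index so the lookup stays fast over millions of documents, but the logic is the same.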
RAG — Retrieval Augmented Generation. This is how you give a language model knowledge it was never trained on. You embed your documents, embed an incoming question, find the nearest matches, and feed that context to the model. The model gets relevant information just-in-time, not baked in at training.
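The retrieval-and-prompt-assembly half of that pipeline fits in a few lines. A hedged sketch — the document store, embeddings, and prompt template are all invented, and the final call to a language model is left out:

```python
from math import sqrt

# Toy document store with invented embeddings. In practice, documents
# and questions are embedded by the same model.
documents = {
    "Invoices are due within 30 days of receipt.": [0.9, 0.1, 0.2],
    "The office is closed on public holidays.":    [0.1, 0.9, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(question_vector, k=1):
    """Find the k documents nearest the embedded question."""
    return sorted(documents,
                  key=lambda d: cosine(documents[d], question_vector),
                  reverse=True)[:k]

def build_prompt(question, question_vector):
    """Feed the retrieved context to the model just-in-time."""
    context = "\n".join(retrieve(question_vector))
    return (f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer using only the context.")

# Pretend this vector is the embedding of the question below.
print(build_prompt("When do invoices have to be paid?", [0.85, 0.2, 0.15]))
```

The model never needed the invoice policy in its training data — the embedding space routed the right document into the prompt at question time.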
Cross-modal search. This is where it really breaks your intuition. You can embed images and text into the same space. That's how CLIP works — a photo of a dog and the text "a photo of a dog" end up near the same point in the shared embedding space. Now you can search images with text, or search text with images. The vector space becomes a universal translator.
Recommendations. Spotify doesn't just look at what songs you've liked — it embeds listening behavior, audio features, and playlist context into vectors. Songs that "feel" similar end up nearby. Your next recommendation is just a nearest-neighbor lookup.
The Invisible Infrastructure
Here's the thing about embeddings: they're everywhere, and almost nobody notices them.
Every time you search Google, embeddings helped rank results. Every time ChatGPT answers a question, it's operating in an embedding space. Every time Spotify suggests a song, embeddings drove it. Every time GitHub Copilot autocompletes code, embeddings connected your current context to relevant patterns.
They're the invisible geometry underneath modern AI. Not just a feature of large language models — the foundation they're built on.
The insight that made all of this possible is almost philosophical: meaning isn't a property of words themselves. It's a property of relationships between words. And relationships can be encoded as geometry.
King minus man plus woman equals queen. That equation shouldn't work. But it does. And that tells you something deep about what language actually is.
Watch the full animated breakdown — including how training works step by step, the vector arithmetic visualized, and how embeddings power real systems — on YouTube:
How Computers Understand Meaning: Word Embeddings Explained
Neural Download — visual mental models for computer science and machine learning.