A computer has no idea what "king" means. If you hand it words, the first thing you have to decide is how to turn them into numbers. The naive way is a one-hot vector: one slot per word in the vocabulary, all zeros except a single 1. The trouble is that every one-hot vector sits exactly the same distance from every other. In that scheme "king" is no closer to "queen" than it is to "banana". You've encoded identity, not meaning.
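To make the equal-distance problem concrete, here is a minimal sketch using a made-up three-word vocabulary. The helper names `one_hot` and `euclidean` are mine, purely for illustration.

```python
import math

# A made-up three-word vocabulary, purely for illustration.
vocab = ["king", "queen", "banana"]

def one_hot(word, vocab):
    """Return a vector with a single 1.0 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

king, queen, banana = (one_hot(w, vocab) for w in vocab)
d_king_queen = euclidean(king, queen)
d_king_banana = euclidean(king, banana)

# Both print as 1.4142135623730951 (the square root of 2).
print(d_king_queen, d_king_banana)
```

Any two distinct one-hot vectors differ in exactly two slots, so the distance is the same whichever pair you pick.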
Word2Vec fixes this by learning a dense vector for every word, and the vectors come out so well organised that you can do arithmetic with meaning: king − man + woman lands right next to queen. I built a tiny version that trains in the browser, and it reproduces that result most of the time. Here's how it works.
The distributional hypothesis
The whole thing rests on one old idea from linguistics, summed up in J.R. Firth's 1957 line: "you shall know a word by the company it keeps." Words that show up in similar contexts tend to mean similar things. "King" and "queen" both appear near "rules", "crown", "kingdom". So if we learn vectors such that words with similar neighbours get similar vectors, those vectors will capture meaning, with no dictionary and no labels, just raw text.
Skip-gram: predict the neighbours
Word2Vec turns that idea into a prediction game called skip-gram. Slide a window over the text. Take the center word and try to predict each surrounding context word. From "the king rules the kingdom" with a window of 2, the center "king" should predict "the", "rules" and "the" ("kingdom" sits three positions away, just outside the window). Every sentence hands you training pairs for free, which is what makes the method self-supervised. To predict its context well, a word's vector has to capture what kind of company it keeps, which is precisely the meaning we're after.
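Generating the training pairs can be sketched in a few lines. The function name `skipgram_pairs` and the whitespace tokenisation are assumptions for illustration, not necessarily what the demo does.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip the window at the start
        hi = min(len(tokens), i + window + 1)   # ...and at the end
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the king rules the kingdom".split()
pairs = skipgram_pairs(tokens, window=2)

rules_contexts = [ctx for center, ctx in pairs if center == "rules"]
print(rules_contexts)  # ['the', 'king', 'the', 'kingdom']
print(len(pairs))      # 14
```

Only the middle word "rules" has a full window here. Words near the start or end of the sentence simply get fewer neighbours.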
Each word actually gets two vectors: an input vector for when it's the center word, and an output vector for when it's a context word. Prediction is just a dot product between a center vector and a candidate context vector, squeezed through a sigmoid. After training, we keep the input vectors as our embeddings.
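As a rough sketch of the two-table setup: the names `W_in`, `W_out` and `pair_probability` are hypothetical, the dimension of 8 is arbitrary, and the random initialisation of both tables is my assumption (implementations differ on how they initialise the output vectors).

```python
import math
import random

random.seed(0)
vocab = ["the", "king", "rules", "kingdom"]
dim = 8  # arbitrary embedding size for this sketch

# Two separate tables: input (center) vectors and output (context) vectors.
W_in = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
W_out = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_probability(center, context):
    """Model's estimate that `context` really appears near `center`."""
    score = sum(a * b for a, b in zip(W_in[center], W_out[context]))
    return sigmoid(score)

p = pair_probability("king", "rules")
print(p)  # some value strictly between 0 and 1; untrained, so not meaningful yet
```

After training, `W_in` would be the table you keep as the embeddings.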
Negative sampling: skip the giant softmax
There's a catch. A proper probability over context words needs a softmax over the entire vocabulary: one dot product per word, per training pair. With 100,000 words that's brutal. Negative sampling replaces it with something far cheaper. Call each real (center, context) pair a positive example. Then draw a handful of random words, say five, and call those negatives. Now the task is a tiny binary question: is this pair real or fake? That's six dot products per training pair instead of 100,000. The negatives are drawn from the unigram frequency distribution raised to the 0.75 power. Common words are still drawn more often than rare ones, but the exponent flattens the distribution, so rare words come up somewhat more often than their raw frequency would allow and the most frequent words don't dominate.
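Here is a small sketch of that sampling distribution. The word counts are made up, the helper names are hypothetical, and rejecting the words in `exclude` is one possible design choice rather than a fixed rule.

```python
import random
from collections import Counter

def negative_sampling_probs(counts, power=0.75):
    """Sampling probabilities proportional to count ** power."""
    scaled = {w: c ** power for w, c in counts.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}

def draw_negatives(probs, k, exclude, rng):
    """Draw k words according to probs, rejecting any word in exclude."""
    words = list(probs)
    weights = [probs[w] for w in words]
    negatives = []
    while len(negatives) < k:
        w = rng.choices(words, weights=weights, k=1)[0]
        if w not in exclude:
            negatives.append(w)
    return negatives

# Made-up counts, for illustration only.
counts = Counter({"the": 100, "king": 10, "queen": 10, "crown": 1, "banana": 1})
probs = negative_sampling_probs(counts)
raw_total = sum(counts.values())

for w in counts:
    # raw unigram probability vs. flattened sampling probability
    print(w, round(counts[w] / raw_total, 3), round(probs[w], 3))

rng = random.Random(0)
negatives = draw_negatives(probs, k=5, exclude={"king", "crown"}, rng=rng)
print(negatives)
```

In the printed table, "the" ends up with a smaller share than its raw frequency and the rare words with a larger one, while "the" remains the most likely draw.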
The update is a tug-of-war
The gradient of the binary cross-entropy loss with respect to the dot product is beautifully simple: (predicted − label). For a positive pair the label is 1, so that value is negative and the update nudges the center and context vectors toward each other, raising their dot product. For each negative pair the label is 0, so the value is positive and the update pushes the two vectors apart. That's it. Run this over millions of pairs and co-occurring words get dragged together in the space while unrelated words drift away. The geometry of meaning is carved out entirely by that pull-together, push-apart dance.
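One training step might look like the sketch below. It is plain SGD under stated assumptions: the function name `sgns_step`, the learning rate of 0.1, the dimension of 8 and the tiny vocabulary are all mine, and the loop repeats a single example 100 times only to make the effect visible.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.1):
    """One SGD step on a positive pair plus its negatives.

    Returns the loss measured before the update.
    """
    v = W_in[center]
    grad_v = [0.0] * len(v)
    loss = 0.0
    examples = [(context, 1.0)] + [(n, 0.0) for n in negatives]
    for word, label in examples:
        u = W_out[word]
        pred = sigmoid(dot(v, u))
        loss -= math.log(pred if label == 1.0 else 1.0 - pred)
        g = pred - label                 # gradient w.r.t. the dot product
        for i in range(len(v)):
            grad_v[i] += g * u[i]        # accumulate the center-vector gradient
            u[i] -= lr * g * v[i]        # update the output vector in place
    for i in range(len(v)):
        v[i] -= lr * grad_v[i]           # apply the accumulated center update
    return loss

random.seed(0)
vocab = ["king", "crown", "banana", "apple"]
dim = 8
W_in = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
W_out = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

before_pos = dot(W_in["king"], W_out["crown"])
before_neg = dot(W_in["king"], W_out["banana"])
losses = [
    sgns_step(W_in, W_out, "king", "crown", ["banana", "apple"])
    for _ in range(100)
]
after_pos = dot(W_in["king"], W_out["crown"])
after_neg = dot(W_in["king"], W_out["banana"])

print("positive pair:", before_pos, "->", after_pos)
print("negative pair:", before_neg, "->", after_neg)
print("loss:", losses[0], "->", losses[-1])
```

The center-vector gradient is accumulated across the positive and all negatives and applied once at the end, so every example in the step sees the same center vector.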
To compare finished vectors we use cosine similarity — the normalised dot product, measuring the angle between them. Direction carries the meaning, not magnitude. Ranking the whole vocabulary by cosine to a query vector gives you its nearest neighbours.
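A minimal sketch of the lookup, using hand-picked 2-D vectors rather than trained embeddings. The helper names `cosine` and `nearest` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between a and b (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, embeddings, k=3, exclude=()):
    """Rank words by cosine similarity to the query vector, best first."""
    scored = [
        (word, cosine(query, vec))
        for word, vec in embeddings.items()
        if word not in exclude
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hand-picked toy vectors, for illustration only.
embeddings = {
    "king": [2.0, 1.0],
    "queen": [4.0, 2.2],    # about twice as long, nearly the same direction
    "banana": [-1.0, 3.0],  # points somewhere else entirely
}

neighbours = nearest(embeddings["king"], embeddings, k=2, exclude={"king"})
print(neighbours)  # "queen" ranks first, "banana" second

# Scaling a vector changes its length but not its direction.
doubled = [2 * x for x in embeddings["king"]]
print(cosine(embeddings["king"], doubled))
```

"queen" wins despite being about twice as long as "king", because only the angle counts.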
Why the analogy falls out
Here's the striking part. Relationships turn into consistent directions. Because "king" and "queen" differ mainly by gender, just like "man" and "woman" do, the vector from man to woman is roughly the same as the one from king to queen. So subtracting "man" from "king" strips out the male part, and adding "woman" supplies the female part — landing you near "queen". Nobody engineered this; it just emerges from the linear structure the co-occurrence objective builds. My toy demo, trained on about forty little sentences about royalty, animals and food, gets queen as the top hit for king − man + woman most of the time, and always ranks it above random words. (The corpus is deliberately small and clustered so the analogy is stable.)
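The arithmetic itself is short. The sketch below uses hand-built 3-D vectors with a rough "royalty" axis and a rough "gender" axis, so it illustrates the mechanics only and is not output from the trained demo. Leaving the three query words out of the ranking is a common convention that I am assuming here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-built toy vectors: axis 0 ~ royalty, axis 1 ~ gender, axis 2 ~ everything else.
embeddings = {
    "king":   [0.9,  1.0, 0.1],
    "queen":  [1.0, -0.9, 0.0],
    "man":    [0.1,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.1],
    "banana": [0.0,  0.0, 1.0],
}

def analogy(a, b, c, embeddings):
    """Solve a - b + c, ranking every other word by cosine to the result."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(target, embeddings[w]), reverse=True)

ranking = analogy("king", "man", "woman", embeddings)
print(ranking)  # ['queen', 'banana']
```

Subtracting "man" cancels most of the gender axis in "king" and adding "woman" flips it, which leaves a vector pointing almost exactly where "queen" does.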
The limit worth knowing
Word2Vec gives each word one static vector. So "bank" the riverside and "bank" the money place collapse into a single blurry vector, no matter the sentence. That limitation is exactly what pushed the field toward contextual embeddings — ELMo, then BERT and the Transformer family — where a word's vector is computed from its actual sentence. But the core idea, meaning as geometry learned from context, is the foundation they all build on.
Play with the live version — train it, click words to see their neighbours, and solve your own analogies: https://dev48v.infy.uk/dl/day23-word2vec.html