TF-IDF, Cosine Similarity, and Word2Vec
By the end of this post, you'll understand two fundamentally different ways of representing words as vectors:
- sparse count-based vectors from information retrieval, and
- dense learned vectors from Word2Vec.
You'll know how cosine similarity measures word closeness, how the skip-gram algorithm learns embeddings by training and then discarding a binary classifier, and why the resulting vectors can solve analogies like king - man + woman ≈ queen without anyone teaching the algorithm what "gender" or "royalty" means. You'll also understand why these embeddings inherit the biases of their training data, and what the difference is between static embeddings (one vector per word) and contextual embeddings (one vector per word per sentence).
Two ideas connect everything here. First: you can represent a word's meaning by the company it keeps. Second: predicting context is a better way to learn meaning than counting context. Those two ideas took NLP from sparse lookup tables to dense learned representations, which is what made modern language models feasible.
Quick Recap: Why We Need Word Vectors
Last post ended with Wittgenstein's principle, "the meaning of a word is its use in the language," and the distributional hypothesis: words in similar contexts have similar meanings. Now we turn that into math.
The practical motivation is simple. In a sentiment classifier, "terrible" in training and "awful" in testing are unrelated as raw strings. The classifier breaks. But if both words map to nearby vectors, the classifier generalizes. That's the payoff.
There are two approaches. The first one's limitations are exactly what motivate the second.
Approach 1: Count [Sparse Vectors and TF-IDF]
Words as Rows, Documents as Columns
Take a corpus of Shakespeare's plays. For each play, count how many times each word appears. Arrange this into a matrix: words as rows, plays as columns. This is the term-document matrix. Each column is now a vector representing a play.
Compute cosine similarity between the column vectors, and you get something useful: comedies cluster together, tragedies cluster together. "As You Like It" is more similar to "Twelfth Night" than to "Julius Caesar" because they share vocabulary. More "fool" and "love," less "battle" and "sword."
The same idea works at the word level. Build a word-word co-occurrence matrix: for each word, count how often every other word appears nearby (within some context window). Each row is now a word vector. Words that co-occur with similar neighbors tend to have similar vectors.
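To make the word-word matrix concrete, here's a minimal sketch of counting co-occurrences within a window. The function name, the toy corpus, and the window size are illustrative choices, not anything from a standard library:

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = [["I", "like", "deep", "learning"],
          ["I", "like", "NLP"],
          ["I", "enjoy", "flying"]]
counts = cooccurrence_matrix(corpus, window=1)
# "like" co-occurs with "I" in two sentences, so counts["like"]["I"] == 2
```

Each row of `counts` is the (sparse) vector for one word; real corpora produce rows with tens of thousands of columns, almost all zero.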
Measuring Closeness: Cosine Similarity
Given two word vectors, how similar are they? The standard metric is cosine similarity, the dot product of the vectors normalized by their lengths:

cosine(v, w) = (v · w) / (|v| |w|)
The normalization matters. Without it, frequent words would always appear more similar just because their count vectors are longer. Cosine measures the angle between vectors, not their magnitudes, so word frequency doesn't distort the comparison.
For non-negative count vectors, cosine ranges from 0 (no overlap) to 1 (identical direction). From the textbook: cosine(digital, information) = 0.996, while cosine(cherry, information) = 0.018. "Digital" and "information" share contexts involving "computer" and "data." "Cherry" doesn't.
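A small sketch of the computation, using made-up toy count vectors (the context dimensions and counts below are invented for illustration, so the similarities only approximate the textbook's numbers):

```python
import math

def cosine(v, w):
    """Dot product normalized by the two vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Toy count vectors over the contexts [computer, data, pie, sugar]
digital     = [5, 4, 0, 0]
information = [6, 5, 0, 1]
cherry      = [0, 0, 4, 5]

print(round(cosine(digital, information), 3))  # close to 1
print(round(cosine(cherry, information), 3))   # close to 0
```

Note that scaling a vector by any positive constant leaves its cosine with every other vector unchanged, which is exactly why frequency doesn't distort the comparison.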
TF-IDF: Not All Counts Are Equal
Raw co-occurrence counts have a problem. The word "the" co-occurs with everything. It dominates every vector without carrying useful meaning. TF-IDF fixes this with two adjustments:
Term Frequency (TF): use a log-scaled count instead of a raw count:

tf(t, d) = log10(count(t, d) + 1)
This squashes large differences. A word appearing 10,000 times is not 10x more informative than one appearing 1,000 times.
Inverse Document Frequency (IDF): weight by how rare the word is across the collection:

idf(t) = log10(N / df(t))

where N is the total number of documents and df(t) is the number of documents containing term t. A word in every document (like "the") gets IDF near zero. A word in only one document (like "Romeo" among Shakespeare's plays) gets high IDF.
TF-IDF = TF × IDF. Credit for frequency, penalized by commonness. This is roughly what search engines compute behind the scenes: cosine similarity between your query vector and document vectors, weighted by TF-IDF.
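The two pieces combine in a few lines. This sketch assumes the log10(count + 1) term-frequency variant; the Shakespeare document frequencies below are illustrative:

```python
import math

def tf_idf(term_counts, num_docs, doc_freq):
    """TF-IDF weights for one document.
    term_counts: raw count of each term in this document.
    doc_freq: number of documents in the collection containing each term."""
    out = {}
    for term, count in term_counts.items():
        tf = math.log10(count + 1)                    # log-scaled term frequency
        idf = math.log10(num_docs / doc_freq[term])   # rarity across the collection
        out[term] = tf * idf
    return out

counts = {"the": 120, "romeo": 15}
df = {"the": 37, "romeo": 1}   # "the" appears in all 37 plays, "romeo" in one
weights = tf_idf(counts, num_docs=37, doc_freq=df)
# idf("the") = log10(37/37) = 0, so its weight vanishes no matter how often it occurs
```

"Romeo" ends up with a large weight despite a modest count, while "the" is zeroed out entirely.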
Where Sparse Vectors Break Down
TF-IDF vectors work well for search, but they have a basic limitation: they're huge and mostly empty. A vocabulary of 50,000 words means each vector has 50,000 dimensions, and most entries are zero.
Two main consequences:
- No synonymy. "Car" and "automobile" occupy separate dimensions with no connection. If two words never directly co-occur, their similarity is zero, even if they mean the same thing.
- No generalization. The vector knows what it observed. Nothing more. And storing 50,000-dimensional vectors is wasteful when 99% of entries are zeros.
Sparse vectors capture something about word meaning, and they're useful for information retrieval. But they're not representations that a neural network can learn from well. For that, we need something denser.
Approach 2: Predict [Dense Vectors and Word2Vec]
Dense vectors are short (50–300 dimensions), with most elements non-zero. Each dimension captures some abstract, learned aspect of meaning rather than corresponding to a specific vocabulary word.
Why 300 Dimensions?
- Too few (say, 2): over-generalizes. All of word meaning compressed into two numbers. Everything blurs together.
- Too many (say, 30,000): you're back to sparse vectors. "Car" and "automobile" are in separate dimensions again.
- 200–300: enough capacity to represent word relationships while still forcing generalization between synonyms.
You can't label individual dimensions. Dimension 47 isn't "size" and dimension 183 isn't "animacy." The dimensions are abstractions that the algorithm discovers on its own.
Word2Vec: Train a Classifier, Keep the Weights
Word2Vec doesn't build a co-occurrence matrix. It trains a binary classifier on a task it doesn't actually care about, then throws the classifier away and keeps the learned weights as word embeddings.
Self-Supervision: The Corpus Is the Training Signal
Given a sentence from the corpus:
"...tablespoon of apricot jam a..."
The target word is "apricot." The context window (±2 words) gives positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a). These are word pairs that actually co-occur.
For each positive pair, randomly sample noise words from the vocabulary: (apricot, aardvark), (apricot, seven), (apricot, forever). These are negative examples, random words that probably don't belong near "apricot."
No human labeled anything. The running text is the training signal. This is self-supervision.
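The pair-generation step above can be sketched directly. The function name and the tiny vocabulary are illustrative; real implementations also sample negatives from a frequency-weighted distribution rather than uniformly:

```python
import random

def skipgram_pairs(tokens, vocab, window=2, k=2, seed=0):
    """Yield (target, context, label) triples: label 1 for real
    co-occurrences, label 0 for k randomly sampled noise words each."""
    rng = random.Random(seed)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield (target, tokens[j], 1)               # positive pair
            for _ in range(k):
                yield (target, rng.choice(vocab), 0)   # negative (noise) pair

tokens = ["tablespoon", "of", "apricot", "jam", "a"]
vocab = ["aardvark", "seven", "forever", "jam", "of"]
pairs = list(skipgram_pairs(tokens, vocab))
# "apricot" gets 4 positive pairs (its ±2 window) and 8 negatives (k=2 each)
```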
The Skip-Gram Algorithm (SGNS)
Four steps:
1. For each target word w, collect positive context pairs (w, c_pos) from the window and sample k negative noise pairs (w, c_neg).
2. Model the probability that c is a real context word of w using the sigmoid of their dot product:

P(+ | w, c) = σ(c · w) = 1 / (1 + e^(−c · w))
If the dot product is large (vectors point in similar directions), the probability is high. If it's small or negative, the probability is low.
3. The loss function pushes real context words closer and noise words farther away:

L = −[log σ(c_pos · w) + Σ_{i=1..k} log σ(−c_neg_i · w)]
4. Gradient descent adjusts the embedding vectors. After training on the whole corpus, discard the classifier. The learned vectors are the embeddings.
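The four steps reduce to a short NumPy sketch. Everything here (matrix names `W` and `C`, dimensions, learning rate) is an illustrative assumption, and real implementations vectorize the negative updates and add subsampling:

```python
import numpy as np

def sgns_step(W, C, target, pos, negs, lr=0.1):
    """One gradient step of skip-gram with negative sampling.
    W: target embeddings, C: context embeddings (one row per word).
    target, pos, negs are word indices."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    w = W[target]
    # Positive pair: push sigma(c_pos . w) toward 1
    g = sigmoid(C[pos] @ w) - 1.0
    grad_w = g * C[pos]
    C[pos] -= lr * g * w
    # Negative pairs: push sigma(c_neg . w) toward 0
    for n in negs:
        gn = sigmoid(C[n] @ w)
        grad_w = grad_w + gn * C[n]
        C[n] -= lr * gn * w
    W[target] -= lr * grad_w

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 8))   # 5 words, 8 dimensions
C = rng.normal(scale=0.1, size=(5, 8))
before = C[1] @ W[0]
for _ in range(50):
    sgns_step(W, C, target=0, pos=1, negs=[3, 4])
# the positive pair's dot product grows with training; the classifier
# itself is then discarded, and W (or W + C) is kept as the embeddings
```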
Two Matrices, One Choice
Word2Vec learns two embedding matrices: a target matrix W and a context matrix C. Each word gets a vector in both. In practice, you either use W alone or combine them as W + C. Both work.
Why two matrices instead of one? A word might play different roles as a target vs. as context. The word "the" as a target wants to be near every content word. But "the" as a context word doesn't tell you much about the target (because it's too common). Having separate matrices lets the model handle this asymmetry. In practice, the difference is small, and many people just use W.
FastText: What About Unknown Words?
Word2Vec has a blind spot: words never seen during training have no embedding. FastText fixes this by breaking each word into character n-grams.
"Where" becomes the trigrams <wh, whe, her, ere, re>, plus a special n-gram for the whole word, <where>. The word's embedding is the sum of all its n-gram embeddings. At test time, an unknown word can still be represented from its constituent n-grams.
This matters for social media text (abbreviations, slang, misspellings) and for morphologically rich languages like Hindi, Turkish, or Finnish, where a single root generates dozens of inflected forms that may not all appear in training.
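Extracting the n-grams is straightforward. This sketch uses only n=3 for brevity (FastText's default actually spans several n-gram lengths):

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, plus the whole
    marked word, in the style of FastText (only one n for brevity)."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

An unseen word like "wheres" shares most of these n-grams with "where", so its composed embedding lands nearby instead of being undefined.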
Static vs. Contextual: One Vector or Many?
Word2Vec gives each word one fixed vector regardless of context. "Bank" gets a single embedding that blends financial institution, river bank, and everything else. This is a static embedding.
Contextual embeddings (like BERT, which we'll cover later) produce a different vector for "bank" depending on the sentence. "I deposited money at the bank" and "I sat by the river bank" yield different vectors for the same word. Context-sensitivity is a major upgrade, but static embeddings remain useful and fast.
What the Vectors Actually Capture
Window Size Changes the Neighbors
The context window size during training shapes what kind of similarity the embeddings learn:
- Small window (±2): nearest neighbors tend to be taxonomically similar, same part of speech. "Hogwarts" → Sunnydale, Evernight (other fictional schools).
- Large window (±5): nearest neighbors are topically related, same semantic field. "Hogwarts" → Dumbledore, Malfoy, half-blood (Harry Potter universe).
Neither is "better." It depends on whether you need similarity or relatedness for your task.
The Parallelogram Test
If embeddings capture relational meaning, they should solve analogies. The parallelogram method works by vector arithmetic:

vector(king) − vector(man) + vector(woman) ≈ vector(queen)

To solve a : b :: a* : ?, return the vocabulary word whose vector is closest to b − a + a*.
The embedding space encodes relationships like gender, capital-city-of, and comparative morphology as consistent directional offsets. Nobody told the algorithm about countries or royalty. It picked up these relationships from word co-occurrence patterns alone.
Caveats on analogy testing: the parallelogram method works best for frequent words, short relational distances, and specific relationship types (capitals, inflections, gender). It's less reliable for complex or abstract analogies. The closest vector returned is often one of the input words or a morphological variant, so those must be excluded. Some researchers argue that the method is too simple to model how humans actually form analogies.
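A minimal sketch of the parallelogram method, including the exclusion of input words. The four-word "embedding space" below is hand-built for illustration (dimension 0 playing "royalty", dimension 1 playing "gender"); real analogies require embeddings trained on a large corpus:

```python
import numpy as np

def analogy(emb, a, b, a2):
    """Solve a : b :: a2 : ? by finding the word closest to b - a + a2,
    excluding the three input words. emb maps word -> unit vector."""
    target = emb[b] - emb[a] + emb[a2]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in (a, b, a2):
            continue                    # exclude the inputs themselves
        sim = vec @ target              # cosine, since vectors are unit-length
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hand-built toy vectors: [royalty, gender], normalized to unit length
emb = {w: v / np.linalg.norm(v) for w, v in {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}.items()}
print(analogy(emb, "man", "woman", "king"))  # -> queen
```

The exclusion step in the loop is exactly the caveat noted above: without it, the nearest vector to king − man + woman is usually "king" itself.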
Meaning Shifts Over Time
Train separate embedding spaces on text from different decades, and you can watch meanings drift. "Awful" meant "full of awe" in the 1850s. By the 1900s, it meant "terrible." "Gay" shifted from "cheerful" to its modern meaning over the 20th century. "Broadcast" went from agricultural (casting seeds) to radio/TV transmission.
Embeddings make these changes computable, not just anecdotal. You train on decade-specific corpora and measure how a word's nearest neighbors change.
Bias: Embeddings Learn What the Corpus Contains
Embeddings reflect the text they're trained on. If that text contains gendered associations, the embeddings reproduce them:
- father : doctor :: mother : nurse
- man : computer programmer :: woman : homemaker
This is not bias the algorithm invented. It's bias already present in the written text the model trained on. But the consequences are real. Embeddings used in hiring tools or search engines will perpetuate whatever stereotypes exist in their training data.
Debiasing is an active research area. Some approaches adjust the embedding space after training. Others change the training procedure itself. Neither approach fully solves the problem yet.
What You Now Have
Seven things you didn't have before this post:
Sparse vs. dense vectors. Sparse vectors (TF-IDF) use one dimension per vocabulary word, are mostly zeros, and can't capture synonymy. Dense vectors (50–300 dimensions) generalize between words by compressing meaning into learned abstractions. Sparse vectors are the near miss — useful for retrieval, but not for learning.
TF-IDF weighting. Term frequency (log-scaled) times inverse document frequency. Gives credit for a word appearing while penalizing words that appear everywhere. The standard weighting scheme in information retrieval.
Cosine similarity. The normalized dot product. Ranges 0–1 for count vectors. Measures the angle between vectors rather than their magnitude, so word frequency doesn't distort similarity.
Word2Vec (skip-gram with negative sampling). Train a logistic classifier to distinguish real context words from random noise. Throw away the classifier. Keep the weight vectors. Self-supervised: the corpus generates its own training labels.
FastText. Extends Word2Vec with character n-grams, letting unknown words get embeddings from their subword components. Handles misspellings, rare morphological forms, and out-of-vocabulary words.
What embeddings encode. Window size controls whether you get taxonomic similarity or topical relatedness. Analogy arithmetic (king - man + woman ≈ queen) shows that relational meaning is preserved as directional offsets. Historical corpora reveal how word meanings drift over decades.
Bias is inherited. Embeddings absorb whatever associations exist in training text. Gender stereotypes, racial biases, cultural assumptions — they all show up in the vector space. Debiasing is an open problem.
Next post: neural networks — feedforward architectures, backpropagation, and how they change the way we process sequences of words.