Akash
Perplexity, Smoothing, and What Words Mean

By the end of this post, you'll know how to evaluate a language model using perplexity, why unseen n-grams break everything and how smoothing patches the holes, and how interpolation lets you mix n-gram orders instead of betting on one. You'll also understand why word meaning is harder to pin down than it looks, what kinds of relationships exist between words, and how an insight from philosopher Ludwig Wittgenstein (published in his 1953 Philosophical Investigations) laid the intellectual groundwork for word embeddings.

Two halves, one thread: the first half shows you the limits of n-gram language models. The second half shows you why those limits forced NLP to rethink how words are represented, which is where the deep learning side of NLP starts.


Where We Left Off

Last post, we built n-gram language models: chain rule, Markov assumption, unigrams, bigrams, MLE. We left knowing how to build one. Two questions were still open: how do you know if your model is any good? and what happens when the training data doesn't cover a word combination your test data needs?


MLE on Real Data

The MLE bigram formula from last time:

P(w_i \mid w_{i-1}) = \frac{C(w_{i-1},\; w_i)}{C(w_{i-1})}

Applying this to the Berkeley Restaurant Project corpus (9,222 sentences of people asking about restaurants in Berkeley), you build a bigram count table. The first thing that stands out: most cells are zero. The majority of word pairs just never appear together.

The non-zero entries are interesting, though. P(\text{want} \mid \text{I}) = 0.33, which makes sense since "I want" is a common English construction. P(\text{to} \mid \text{want}) = 0.66, because "want to" is practically a single unit.


Sentence probability is just a product of bigrams:

\begin{aligned} P(\langle s \rangle \;\text{I want English food}\; \langle /s \rangle) &= P(\text{I} \mid \langle s \rangle) \times P(\text{want} \mid \text{I}) \times P(\text{English} \mid \text{want}) \times P(\text{food} \mid \text{English}) \times P(\langle /s \rangle \mid \text{food}) \\ &= 0.25 \times 0.33 \times 0.0011 \times 0.5 \times 0.68 \\ &= 0.000031 \end{aligned}

Different bigram probabilities encode different kinds of knowledge. P(\text{to} \mid \text{want}) = 0.66 is syntactic, reflecting that "want to" is a verb construction. P(\text{Chinese} \mid \text{want}) > P(\text{English} \mid \text{want}) might be cultural, reflecting Berkeley's dining preferences.

Log Space

Multiplying many small probabilities causes numerical underflow. Always work in log space:

\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4

Store log-probabilities. Add them. Convert back with \exp only at the end. Addition is faster than multiplication, too.
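A minimal sketch of the log-space computation, reusing the bigram probabilities from the Berkeley example above (treated here as given constants):

```python
import math

# Bigram probabilities taken from the Berkeley Restaurant example above
bigram_probs = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "English"): 0.0011,
    ("English", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_log_prob(words, probs):
    """Sum log-probabilities over consecutive bigrams instead of multiplying raw probabilities."""
    return sum(math.log(probs[(w1, w2)]) for w1, w2 in zip(words, words[1:]))

sentence = ["<s>", "I", "want", "English", "food", "</s>"]
log_p = sentence_log_prob(sentence, bigram_probs)
print(math.exp(log_p))  # ≈ 0.000031, matching the hand computation
```

For a five-bigram sentence the direct product is harmless, but for a thousand-word document it underflows to zero; the log-space sum does not.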


Perplexity: Measuring How Good a Language Model Is

You've built two language models. Which is better?

Extrinsic evaluation: plug the LM into a real application (speech recognition, machine translation) and measure task performance. Reliable, but it can take days to run.

Intrinsic evaluation: compute a metric directly on a held-out test set. Faster, and the standard metric is perplexity.

The Intuition: A Guessing Game

Perplexity measures how surprised the model is by the actual next word. Picture a fill-in-the-blank game:

  • "I always order pizza with cheese and ___": a few plausible options. Low surprise.
  • "The 33rd President of the U.S. was ___": basically one answer. Very low surprise.
  • "I saw a ___": could be anything. High surprise.

A model with low perplexity guesses well, assigning high probability to the words that actually appear. A model with high perplexity is consistently wrong about what comes next.


The Math

\text{PP}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}

Inverse probability of the test set, normalized by the number of words. Lower perplexity = better model. Minimizing perplexity is the same as maximizing the probability the model assigns to the test data.

For bigrams:

\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
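This formula is a one-liner in log space. A sketch using the toy Berkeley bigram probabilities from earlier (assumed values; here N counts the bigram transitions):

```python
import math

# Bigram probabilities from the Berkeley Restaurant example
bigram_probs = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "English"): 0.0011,
    ("English", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def perplexity(words, probs):
    """PP = exp(-1/N * sum of log P): the exponentiated average negative log-probability."""
    log_probs = [math.log(probs[(w1, w2)]) for w1, w2 in zip(words, words[1:])]
    return math.exp(-sum(log_probs) / len(log_probs))

sentence = ["<s>", "I", "want", "English", "food", "</s>"]
print(perplexity(sentence, bigram_probs))  # ≈ 7.98
```

A perplexity near 8 means that, on average, the model is about as uncertain at each step as if it were choosing uniformly among eight words.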

Perplexity as Branching Factor

Another angle: perplexity is the weighted average number of choices the model faces at each step.

Recognizing one of 10 equally likely digits? Perplexity = 10. That's the branching factor — 10 options, equally uncertain.

Now imagine a call-routing phone system. It gets 120,000 calls. Three-quarters are for "operator," "sales," or "tech support" (each 1 in 4). The remaining 30,000 calls are for 30,000 different employee names (each appears once). The perplexity of this sequence works out to 52.6, not 30,003. The common categories dominate, pulling the weighted average way down.

More information about what's likely = lower perplexity.
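The 52.6 figure can be checked directly. A sketch of the weighted-average computation for the call-routing example (probabilities as stated above):

```python
import math

N = 120_000
# 90,000 calls split evenly across three routing words, each assigned P = 1/4;
# 30,000 calls to 30,000 distinct names share the remaining 1/4, so each has P = 1/120,000
total_log2 = 90_000 * math.log2(1 / 4) + 30_000 * math.log2(1 / 120_000)
perplexity = 2 ** (-total_log2 / N)
print(round(perplexity, 1))  # 52.6
```

The three common categories contribute only 2 bits of surprise each, so even 30,000 rare names can't drag the average anywhere near 30,003.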

Real Numbers

Wall Street Journal, trained on 38M words, tested on 1.5M:

| Model   | Perplexity |
| ------- | ---------- |
| Unigram | 962        |
| Bigram  | 170        |
| Trigram | 109        |

One word of context (bigram) cuts perplexity by ~5.5x over no context. Two words of context (trigram) cuts it further. More context = less surprise.


Generating Text from a Language Model

Language models aren't just scorers; they can also generate text. The procedure for bigram generation:

  1. Start with \langle s \rangle
  2. Sample a word from P(w \mid \langle s \rangle) — say "I"
  3. Sample from P(w \mid \text{I}) — say "want"
  4. Keep going: "want" → "to" → "eat" → "Chinese" → "food" → \langle /s \rangle
  5. Result: "I want to eat Chinese food"

This is the same loop running inside every LLM: predict, sample, append, repeat. The only difference is the machinery doing the prediction.
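The loop above fits in a few lines. A sketch with made-up bigram counts (the numbers and the tiny vocabulary are invented for illustration):

```python
import random

# Toy bigram counts — invented numbers, just enough to generate short sentences
counts = {
    "<s>":     {"I": 5, "can": 2},
    "I":       {"want": 4, "can": 1},
    "can":     {"eat": 3},
    "want":    {"to": 6},
    "to":      {"eat": 5},
    "eat":     {"Chinese": 2, "food": 1},
    "Chinese": {"food": 3},
    "food":    {"</s>": 4},
}

def generate(counts, max_len=20):
    """Predict, sample, append, repeat — the same loop every LLM runs."""
    word, out = "<s>", []
    for _ in range(max_len):
        nexts = counts[word]
        # Sample the next word in proportion to its bigram count (i.e., its MLE probability)
        word = random.choices(list(nexts), weights=nexts.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

random.seed(0)
print(generate(counts))
```

Because sampling is weighted by counts, frequent continuations like "want to" dominate the output, which is exactly why generated text mirrors the training corpus.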

The output mirrors the training corpus. Shakespeare trigrams produce pseudo-Shakespeare. WSJ trigrams produce pseudo-financial news. Jane Austen trigrams produce pseudo-Austen. The n-gram model essentially becomes a stylistic fingerprint of its training data, which is the basis for author identification.



The Zero Problem

This is where n-gram models break down.

Shakespeare's corpus: 884,647 tokens, vocabulary V = 29{,}066. Possible bigrams: V^2 \approx 844 million. Actually observed: 300,000. That's 99.96% zeros.

If "denied the offer" never appeared in training:

P(\text{offer} \mid \text{denied the}) = 0

One zero anywhere in the test set, and the entire test set probability becomes zero. Perplexity becomes undefined. You can't evaluate the model at all.


The fix is smoothing: take a little probability mass from the things you did see and spread it to the things you didn't.

Add-One (Laplace) Smoothing

The simplest possible fix. Pretend every bigram was seen one extra time:

P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}

Add 1 to every numerator. Add VV to the denominator to keep things normalized.

This is not a good fix. It's a working fix. With V = 1{,}446 in the Berkeley corpus, adding 1 to each of the 1,446 possible continuations of every context word dilutes the probability mass heavily. The effective count of "want to" drops from 608 to 238. Add-one smoothing eliminates zeros, but it distorts the counts you actually trusted.

Good enough for text classification where the vocabulary is small. Not good enough for language modeling. We need something smarter.
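A minimal sketch of the formula, on a tiny made-up corpus (the seven-token text and its counts are invented for illustration):

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, V):
    """Add-one smoothed bigram probability: (C(prev, w) + 1) / (C(prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

# Tiny made-up corpus
tokens = "I want to eat I want food".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
V = len(unigram_counts)  # vocabulary size = 5

# A seen bigram keeps most of its mass; an unseen one is no longer zero
print(laplace_bigram_prob("I", "want", bigram_counts, unigram_counts, V))  # (2+1)/(2+5) ≈ 0.43
print(laplace_bigram_prob("I", "food", bigram_counts, unigram_counts, V))  # (0+1)/(2+5) ≈ 0.14
```

Note how much mass the unseen bigram receives: with a large V, that dilution is exactly the distortion described above.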


Backoff and Interpolation

A better idea: don't commit to one n-gram order.

Sometimes you have enough data for a reliable trigram. Sometimes you don't, and the bigram is more trustworthy. Sometimes even the bigram is sparse and you need the unigram.

Backoff picks the highest-order n-gram you have good counts for and uses that alone. If the trigram count is zero, fall back to bigram. If that's zero, fall back to unigram.

Interpolation mixes all orders simultaneously:

\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)

where \lambda_1 + \lambda_2 + \lambda_3 = 1.

The λ\lambda weights are learned from a held-out corpus. You search for the combination that makes the held-out data most probable. Interpolation beats backoff because you're always using signal from every order, not discarding the lower ones when the higher one happens to have counts.

Backoff vs. Interpolation: when does each make sense?

Backoff is simpler to implement and computationally cheaper — you only compute one probability. It works well when you have a massive corpus (like Google's web n-grams) where high-order counts are usually reliable and you only fall back rarely. Interpolation is better when data is sparser, because it always hedges — even a weak trigram estimate contributes something when blended with a strong bigram.
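The interpolation formula is a weighted sum. A sketch with assumed lambda values (in practice they'd be tuned on held-out data, not hardcoded):

```python
def interpolated_prob(trigram_p, bigram_p, unigram_p, lambdas=(0.5, 0.3, 0.2)):
    """Blend all three n-gram orders. The lambdas (assumed here, normally learned
    from a held-out corpus) must sum to 1 so the result stays a valid probability."""
    l1, l2, l3 = lambdas
    return l1 * trigram_p + l2 * bigram_p + l3 * unigram_p

# Even when the trigram estimate is zero (unseen in training),
# the bigram and unigram terms keep the blended probability non-zero.
print(interpolated_prob(0.0, 0.4, 0.01))  # 0.5*0.0 + 0.3*0.4 + 0.2*0.01 = 0.122
```

This is the hedging described above: a zero at one order never zeroes out the whole estimate.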


Part 2: What Does a Word Even Mean?

Everything up to this point has a shared limitation: words are just strings. In n-gram models, "cat" is index 4,217 in the vocabulary. "Dog" is index 2,903. They are as unrelated as "cat" and "photosynthesis." For models that reason about language, we need representations that carry meaning.

That's what embeddings are about. But before jumping to algorithms, we need to ask: what does word meaning actually involve? It's messier than you'd expect.


Lemmas and Senses

Take "pepper." One word — the lemma, the dictionary entry form. But it has at least five senses:

  1. The spice (black pepper, peppercorns)
  2. The plant (Piper nigrum)
  3. Capsicum varieties (bell pepper, chili)
  4. California pepper tree
  5. Extended uses ("pepper someone with questions")

One form, many meanings. WordNet, a structured lexical database, catalogs all of this: senses, definitions, usage frequencies, and relationships between words. For decades, WordNet was the backbone of NLP systems that needed to reason about meaning.



Seven Ways Words Relate to Each Other

Words don't exist in isolation. They connect through multiple kinds of relationships. Embeddings need to capture all of them, which is part of what makes the problem hard.

1. Synonymy: roughly the same meaning. Couch/sofa, big/large, car/automobile. But true perfect synonymy may not exist. If two words meant exactly the same thing in every context, why would the language keep both? This is the principle of contrast: a difference in form always signals some difference in meaning. "Water" and "H₂O" name the same substance, but you'd never write "H₂O" in a hiking guide.

2. Similarity: shared elements of meaning, but not interchangeable. Car and bicycle are similar (both vehicles). Cow and horse are similar (both large animals). Humans rate these reliably: vanish/disappear scores 9.8 out of 10 on the SimLex-999 dataset, hole/agreement scores 0.3.

3. Relatedness: connected not by shared meaning, but by co-participation in situations. Car and gasoline aren't similar — one is a vehicle, the other is a liquid. But they're tightly related because they show up in the same events. Scalpel and surgeon: completely different objects, strongly associated.

This distinction matters. Similarity and relatedness are different signals, and embeddings that confuse them will make downstream mistakes.

4. Semantic fields: clusters of words that cover a domain. Hospital: surgeon, scalpel, nurse, anesthetic. Restaurant: waiter, menu, plate, chef. These field structures give embeddings their neighborhood quality; words from the same field land near each other.

5. Antonymy: opposites. Dark/light, hot/cold, up/down, rise/fall. The tricky part is that antonyms are actually very similar. Dark and light share almost all features of meaning — both are about illumination. They differ on just one dimension. This creates a problem for embeddings: should "dark" and "light" be close together (similar concept) or far apart (opposite value)?

6. Taxonomic relations: hierarchies. Vehicle is a superordinate of car. Mango is a subordinate of fruit. These IS-A chains form the skeleton of meaning.

7. Basic level categories: not all levels in a taxonomy are equal. Show someone a beagle, and they say "dog." Not "beagle." Not "animal." "Dog" is the basic level, the one humans default to. Basic-level words are learned first by children, are the shortest, and are the most frequent. We perceive the world at this level.


Connotation: on top of all the above, words carry affective charge. Happy = positive. Sad = negative. Near-synonyms can diverge sharply: "innocent" (positive) vs. "naive" (negative). "Replica" (neutral) vs. "forgery" (negative). Words vary along three affective dimensions: valence (pleasant/unpleasant), arousal (exciting/calm), and dominance (controlling/controlled).


Why Formal Definitions Failed

Early NLP tried to pin down word meaning with logic. A square: four sides, all straight, a closed figure, planar, equal-length sides, right angles. Done. Clean. Works for geometry.

Now try "cup."

William Labov did. His formal definition of "cup" involved ratios of depth to width, the presence or absence of handles, material opacity, whether it's used for hot liquid, and probability functions over these features. A full paragraph of mathematical notation, just to define a cup. And it still broke on edge cases. At what point does a cup become a bowl? A mug? A vase?

This was real NLP for decades: hand-building lexicons of feature-based definitions. Slow, brittle, and it never scaled.



Wittgenstein's Way Out

Ludwig Wittgenstein, philosopher of language, offered one sentence that reframed the whole problem:

"The meaning of a word is its use in the language."

Stop trying to write definitions. A word's meaning is just the contexts it shows up in (the words that surround it). If two words consistently appear in the same environments, they mean similar things.

This is testable. Consider a word you've never seen: ongchoi. You encounter:

  • "Ongchoi is delicious sautéed with garlic."
  • "Ongchoi is superb over rice."
  • "Ongchoi leaves with salty sauces."

And you've seen similar contexts for spinach, chard, and collard greens. Without a definition, without a feature list, without WordNet, you know ongchoi is a leafy green vegetable. The context told you.

This principle, that meaning lives in usage patterns, is the distributional hypothesis. It's the idea that eventually became word embeddings. An embedding is a vector that encodes a word's usage across a massive corpus. You don't define "dog" with a feature list. You let millions of contexts define it for you.
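The ongchoi intuition can be sketched crudely in a few lines. This is not a real embedding method, just a naive context-overlap count over invented sentences, to show that shared contexts are a measurable signal:

```python
from collections import Counter

# Invented sentences: a made-up word, a known vegetable, and an unrelated word
sentences = [
    "ongchoi is delicious sauteed with garlic",
    "ongchoi is superb over rice",
    "spinach is delicious sauteed with garlic",
    "spinach is superb in salads",
    "engine is loud under the hood",
]

def context_counts(target, sentences):
    """Count the words that co-occur with `target` in the same sentence."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

def overlap(a, b, sentences):
    """Crude similarity: how many context words the two targets share."""
    return len(context_counts(a, sentences) & context_counts(b, sentences))

print(overlap("ongchoi", "spinach", sentences))  # many shared contexts
print(overlap("ongchoi", "engine", sentences))   # almost none
```

Real embeddings replace this raw overlap with vectors built from millions of contexts, but the signal they exploit is the same one.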

The next post turns this insight into math: co-occurrence matrices, sparse vs. dense vectors, and Word2Vec.


What You Now Have

Six things from this post:

  1. MLE on real data: bigram count tables are mostly zeros, the non-zero entries encode syntactic and cultural patterns, and you always compute in log space.

  2. Perplexity: the standard intrinsic metric for language models. Inverse probability of the test set, normalized by length. Lower = better. Interpretable as the weighted average branching factor: how many options the model is confused between at each step.

  3. Sentence generation: sample from the probability distribution, append, repeat. Same loop in n-grams and LLMs. The output mirrors the training corpus so faithfully that you can identify the author from n-gram statistics alone.

  4. The zero problem and smoothing: most possible n-grams are unseen. One zero kills the whole computation. Add-one smoothing is a working fix, not a good one. Interpolation mixes n-gram orders with learned weights and actually works well.

  5. The landscape of word meaning: synonymy, similarity, relatedness, antonymy, taxonomic hierarchies, basic level categories, connotation. These are the phenomena that embeddings need to capture. Formal definitions tried and failed.

  6. Wittgenstein's principle: "the meaning of a word is its use in the language." This one idea is the philosophical foundation of word embeddings: meaning is not a feature list, it's a usage pattern. The distributional hypothesis made it computational.

Next post: turning words into actual vectors, count-based embeddings, Word2Vec, and the cosine similarity measure that ties it all together.
