By the end of this post, you'll understand exactly how the simplest language models work, the chain rule, the Markov assumption, n-grams, maximum likelihood estimation, and you'll see why every one of these ideas is still alive inside the LLMs you use daily. You'll also understand the specific limitations that forced the field to move beyond counting and into neural prediction. This isn't history for history's sake. This is the conceptual foundation without which transformers don't make sense. This is how n-gram language models laid the foundation for every idea that transformers run on today.
One Task, One Question
Every language model, from a 1990s bigram counter to GPT-4, does the same job: given some words, figure out what word comes next.
More precisely, a language model computes one of two things:
- The probability of a full sentence: P(w_1, w_2, …, w_n)
- The probability of the next word given everything before it: P(w_n | w_1, …, w_{n-1})
That's the whole definition. Any model that computes either of these is a language model. The difference between n-grams and LLMs isn't the task; it's the machinery.
Why Would Anyone Need Sentence Probabilities?
Before we get into how language models work, let's ground this in real tasks where you need one:
Machine translation: Your system translates a Spanish sentence and produces two candidates: "high winds tonight" and "large winds tonight." P(high winds tonight) > P(large winds tonight). "High winds" sounds right. The language model picks it.
Spell correction: "The office is about fifteen minuets from my house." Both "minutes" and "minuets" are real English words. (Minuet is a dance.) But P(about fifteen minutes from) ≫ P(about fifteen minuets from), and the language model knows the difference.
Speech recognition: Audio is ambiguous. "I saw a van" or "eyes awe of an"? P(I saw a van) ≫ P(eyes awe of an). Obvious to you. Not obvious to a machine without a language model.
Language models also power autocomplete, summarization, and question answering. And yes, LLMs are language models. They're language models trained at a scale that changes what's possible. But the core task hasn't moved.
One more property to flag before we move on: language models are generative. Predict the next word, sample it, append it, repeat. That generate-one-word-at-a-time loop is exactly what ChatGPT does. The idea is older than you think.
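The generate-one-word-at-a-time loop can be sketched in a few lines. The next-word distributions below are hand-built toys (an assumption for illustration); in a real language model they would come from counts or learned parameters, but the loop itself — predict, sample, append, repeat — is the same:

```python
import random

# Toy next-word distributions: a hand-built stand-in for a real model.
# "<s>" marks sentence start, "</s>" marks sentence end.
NEXT_WORD = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}

def generate(max_len=10):
    """Predict the next word, sample it, append it, repeat."""
    words, current = [], "<s>"
    for _ in range(max_len):
        dist = NEXT_WORD[current]
        current = random.choices(list(dist), weights=list(dist.values()))[0]
        if current == "</s>":
            break
        words.append(current)
    return " ".join(words)

print(generate())  # e.g. "the cat sat"
```

Swap the toy dictionary for a trained model and this is, structurally, what ChatGPT's decoding loop does.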
The Counting Problem
Here's the first real question. We want to compute P(its water is so transparent that). How?
The brute-force answer: go to a corpus, count how many times this exact six-word sequence appears, and divide by total sentences. But language is creative. People produce new sentences constantly. You'll almost never find an exact match for any long sentence in your data.
We need something smarter. Three ideas, stacked on top of each other, get us there.
Idea 1: The Chain Rule (Break the Sentence Apart)
Instead of computing the probability of the full sentence at once, we decompose it:

P(w_1, w_2, …, w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) ⋯ P(w_n | w_1, …, w_{n-1})

Compactly:

P(w_1, …, w_n) = ∏ᵢ P(w_i | w_1, …, w_{i-1})
The probability of a sentence is the probability of the first word, times the probability of the second word given the first, times the third given the first two, and so on.
For our example:

P(its water is so transparent that) = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so) × P(that | its water is so transparent)

This is mathematically exact. No approximation. But look at that last term:

P(that | its water is so transparent)
You need to have seen "its water is so transparent" enough times in your corpus to estimate anything. And the conditioning context grows with every word. For long sentences, you'll never have enough data.
How would we estimate each of these terms? Count and divide:

P(w_i | w_1, …, w_{i-1}) = Count(w_1 … w_i) / Count(w_1 … w_{i-1})
Count how many times the full sequence appears. Divide by how many times the prefix appears. Simple, but impossible for long sequences. Nobody's corpus is big enough.
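Count-and-divide can be made concrete in a few lines. The corpus below is a tiny toy (an assumption for illustration) chosen so the sequence actually appears; on real text, long sequences almost never do:

```python
# Tiny toy corpus, chosen so the example sequence actually occurs.
corpus = "its water is so transparent that its water is so transparent indeed".split()

def seq_count(tokens, seq):
    """Count occurrences of a contiguous word sequence in the token list."""
    n = len(seq)
    return sum(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))

prefix = ["its", "water", "is", "so", "transparent"]
full = prefix + ["that"]

# P(that | its water is so transparent) = Count(full) / Count(prefix)
p = seq_count(corpus, full) / seq_count(corpus, prefix)
print(p)  # 0.5: the prefix occurs twice, once followed by "that"
```

The code works; the data doesn't scale. For a twenty-word sentence, `seq_count` on any real corpus returns zero almost every time.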
Idea 2: The Markov Assumption (Forget Most of the History)
Andrei Markov's insight: You don't need the entire history. The last few words are enough.
Instead of:

P(that | its water is so transparent)

Approximate with:

P(that | transparent) — bigram, last word only — or P(that | so transparent) — trigram, last two words.
This is the Markov assumption: the next word depends only on the recent past, not the full history.
It's wrong. Language has long-range dependencies. "The computer which I had just put into the machine room on the fifth floor crashed." The verb "crashed" depends on "computer," fourteen words back. A bigram model can't see that far.
But it works well enough to be useful. The general n-gram approximation:

P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-n+1}, …, w_{i-1})

where n is the n-gram order: n = 2 for bigrams, n = 3 for trigrams.
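In code, the Markov assumption is just a slice: throw away everything except the last n − 1 words of the history. A minimal sketch:

```python
def markov_context(history, n):
    """Keep only the last n-1 words of the history (the Markov assumption)."""
    return history[-(n - 1):] if n > 1 else []

history = ["its", "water", "is", "so", "transparent"]
print(markov_context(history, 2))  # bigram context:  ['transparent']
print(markov_context(history, 3))  # trigram context: ['so', 'transparent']
print(markov_context(history, 1))  # unigram context: []
```

Everything before the slice is discarded — which is exactly why "crashed" can't see "computer" from fourteen words away.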
Idea 3: The N-gram Models (Count Short Sequences)
An n-gram is a contiguous sequence of n words. The n-gram model uses the previous n − 1 words to predict the next one. Three versions, each a little less naive than the last.
Unigram: No Context At All
The simplest possible language model. Zero context. Each word is generated independently:

P(w_1, w_2, …, w_n) ≈ P(w_1) · P(w_2) ⋯ P(w_n)
Words are drawn purely by frequency. Generate from a unigram model, and you get word soup:
"fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass, thrift, did, eighty, said, hard, 'm, july, bullish"
"The" appears a lot because it's the most frequent English word, not because it belongs next to "an" or "of." Every word is independent of every other word. This model technically is a language model, but barely.
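A unigram model is a frequency table and nothing more. A minimal sketch, using a toy corpus (an assumption for illustration):

```python
import random
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()  # toy corpus

counts = Counter(corpus)
total = sum(counts.values())

# Unigram MLE: P(w) = Count(w) / total tokens
unigram_p = {w: c / total for w, c in counts.items()}
print(unigram_p["the"])  # 3/9: "the" is the most frequent word here

# Generation ignores all context: each word is drawn independently,
# weighted only by its corpus frequency. Word soup.
words = random.choices(list(unigram_p), weights=list(unigram_p.values()), k=8)
print(" ".join(words))
```

Frequent words dominate the output, but nothing constrains which word follows which — hence the "word soup."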
Bigram: One Word of Memory
Now each word is conditioned on the one previous word: P(w_i | w_{i-1}).
One word of context. Already noticeably better:
"texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen"
"Without permission." "Five hundred fifty five." Real collocations, word pairs that naturally occur together. The bigram model captures local patterns. But zoom out, and the sentence is still nonsense.
How do we get bigram probabilities? Maximum Likelihood Estimation (count and divide):

P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
Scan your corpus. Count how many times the pair (w_{i-1}, w_i) appears. Divide by how many times w_{i-1} appears alone. That's the whole algorithm. An n-gram language model is, at bottom, a lookup table of counts turned into ratios.
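The whole estimation algorithm fits in a few lines. A minimal sketch on a toy two-sentence corpus (an assumption for illustration, with `<s>`/`</s>` as sentence boundary markers):

```python
from collections import Counter

corpus = "<s> I want Chinese food </s> <s> I want Italian food </s>".split()

# Tally every adjacent pair and every single word.
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_p(prev, word):
    """MLE: P(word | prev) = Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_p("I", "want"))         # 1.0 -- "I" is always followed by "want"
print(bigram_p("want", "Chinese"))   # 0.5 -- "want" splits between two foods
print(bigram_p("want", "Japanese"))  # 0.0 -- never observed: hard zero
```

That last line is the failure mode the rest of this post keeps returning to: an unseen pair gets probability zero, with no way to recover.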
Trigrams and Beyond: More Context, More Data Hunger
Trigrams (n = 3), 4-grams, 5-grams; each step up means more context, better text. But the count-based method hits a wall. The number of possible n-grams grows exponentially with n, and you never have enough data to get reliable counts for most of them.
There's also an overfitting trap: with small corpora, high-order n-grams just memorize chunks of training data instead of learning general patterns. Generate from a 4-gram model trained on Shakespeare, and you get... Shakespeare. Verbatim. Not because the model learned English, but because it ran out of options.
Google compiled a massive n-gram corpus from the web in 2006, pushing the limits of count-based models. Even at web scale, the approach has hard ceilings.
Why does overfitting happen with high-order n-grams?
Shakespeare's corpus has about 884,000 tokens and a vocabulary of ~29,000 words (V = 29,066). That means V² ≈ 844 million possible bigrams, but only about 300,000 were ever observed. That's 99.96% zeros. For 4-grams, the possible space is V⁴ ≈ 7 × 10¹⁷. Almost every 4-gram in the model was seen exactly once, so "generating" just replays the training data.
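The sparsity numbers above are worth checking by hand, using the figures from the text:

```python
V = 29_066               # Shakespeare vocabulary size
observed_bigrams = 300_000

possible_bigrams = V ** 2
zero_fraction = 1 - observed_bigrams / possible_bigrams
print(f"{possible_bigrams:,}")   # 844,832,356 possible bigrams
print(f"{zero_fraction:.2%}")    # 99.96% of them never observed

possible_4grams = V ** 4
print(f"{possible_4grams:.1e}")  # 7.1e+17 possible 4-grams
```

With 884,000 tokens of training data against 7 × 10¹⁷ possible 4-grams, almost every observed 4-gram is unique — memorization is the only option.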
So What Changed? N-grams vs. LLMs
Let's cycle back to where we started. N-gram models and LLMs both generate text. Both use context to pick the next word. Both are language models. What's actually different?
Not the task. The machinery.
Context size. N-grams look at 1, 2, maybe 5 previous words. LLMs condition on thousands to millions of tokens. A bigram sees one word back. GPT sees the whole conversation.
Counting vs. predicting. This is the distinction that matters most. N-gram models estimate probabilities from counts. You tally co-occurrences, compute ratios, and store them in a table. If a word pair never appeared in training, its probability is zero. Done. No recovery.
LLMs predict the next word through learned parameters. They build continuous representations of words and contexts. If "I want Japanese food" never appeared in training, but "I want Chinese food" and "I want Italian food" did, an LLM can bridge the gap. An n-gram model cannot.
This is not the same thing done better. It's a different kind of operation, estimation from observations vs. prediction from learned structure.
Training data. N-gram models use modest corpora. LLMs consume the internet. And instead of storing count tables that grow exponentially, the neural architecture compresses everything into fixed-size parameters.
| | N-gram LMs | LLMs |
|---|---|---|
| How | Estimate probabilities from counts | Predict next word via learned parameters |
| Context | 1-5 words (practical limit) | Thousands to millions of tokens |
| Training data | Modest corpora | The entire internet |
| Generalization | Can only use what was literally observed | Can generalize to unseen combinations |
| Representation | Words are discrete symbols | Words are dense vectors in continuous space |
That last row is the deep issue. N-gram models treat words as atomic, unrelated symbols. "Cat" and "dog" are as different as "cat" and "quantum." No similarity, no transfer, no generalization. Neural language models fix this with embeddings, mapping words into continuous vector spaces where similar words land near each other. But that's the next post.
What You Now Have
Five things you didn't have before reading this:
The definition of a language model: any model that assigns probabilities to word sequences or predicts the next word. N-grams and LLMs are both language models. The task is identical.
The chain rule decomposition: how to break a sentence's probability into a product of conditional probabilities, and why you'd want to.
The Markov assumption: the decision to throw away most of the history and keep only the last few words. Wrong in theory, useful in practice, and the reason n-grams are computationally tractable.
How n-gram estimation actually works: count and divide. Unigrams produce word soup. Bigrams produce local coherence. Higher-order n-grams overfit small data. The whole thing is a lookup table of ratios.
The specific gap that LLMs fill: n-grams can't generalize, can't handle long context, and can't represent word similarity. LLMs solve all three by moving from count-based estimation to neural prediction. Different machinery, same task.
Next post: Perplexity (how you measure whether a language model is any good), the zero problem (what happens when your model has never seen a word pair), and smoothing (how you fix it). That's where the math gets interesting.





