When a language model or a translation decoder writes text, it never hands you a finished sentence. It only ever gives you one thing: the probability of the next token, given everything written so far. Turning that into a whole sentence is a decision you make at every step — and how you make it changes the output completely.
Today (Day 25 of my DeepLearningFromZero series) I built a tiny, fully interactive beam-search visualiser so you can watch those decisions play out on a real probability tree.
Play with it here: https://dev48v.infy.uk/dl/day25-beam-search.html
Decoding is a search problem
The probability of a finished sentence is the product of its per-step probabilities. The catch: the number of possible sentences is the vocabulary size raised to the length — billions upon billions. You cannot score them all and pick the best. So decoding is really a search through an enormous tree, where each node is a partial sentence and each branch is a possible next word.
Greedy: fast but short-sighted
The simplest search is greedy: at every step take the single highest-probability token and never look back. It is quick and needs no extra memory. But it is myopic. A word that looks best right now can lead into a dead end where every continuation is mediocre, while a slightly worse word now might open a far better sentence. Greedy commits at each step and can never undo that early mistake.
In the demo, set the beam width to k = 1 (that is exactly greedy) and watch it grab the likeliest first word — then get stuck with a whole sentence that scores lower than one a wider search finds.
Beam search: keep the top-k
Beam search generalises greedy with a single knob, the beam width k. Instead of keeping one partial sentence, you keep the best k of them at every step. Each iteration is three moves:
- Expand — grow every surviving hypothesis by every possible next token.
- Score — give each candidate its parent's running score plus the log-probability of the new token.
- Prune — sort all candidates and keep only the top k; throw the rest away.
Any hypothesis that emits the end-of-sequence token is set aside as finished. Repeat until the beam is empty. With k = 1 you have greedy; as k grows toward the whole vocabulary you approach an exhaustive search that is guaranteed to find the most probable sentence. In practice, translation uses a small beam, around k = 4 to 10, because the gains flatten out fast while the cost keeps rising.
Why log-probabilities?
Scoring uses the cumulative log-probability, the sum of log p along the path — not the raw product of probabilities. Multiplying many numbers below 1 underflows to zero in floating point for anything but the shortest sequences. Taking logs turns that fragile product into a stable sum, and because log is monotonic the ranking is unchanged: the likeliest sentence still has the highest score. Every box in the demo shows this running sum.
Beam is a heuristic, not a guarantee
It is tempting to think a wider beam always returns the single most probable sentence. It does not, unless k equals the entire vocabulary. If the true best sentence starts with weak-looking tokens, its prefix can be pruned before its strong ending ever appears. Beam search is a bounded-width best-first search — much better than greedy, but still an approximation.
The short-sequence bias, and how to fix it
Here is a subtle trap. Every extra token adds a negative log-probability, so longer sentences almost always score lower than short ones. Beam search therefore learns to love the end-of-sequence token and returns stunted outputs. The standard cure is length normalization: divide the final score by the length raised to a power alpha (often around 0.6 to 0.7). Toggle it in the demo and watch a longer, well-formed sentence overtake a short one — the winner actually flips.
Beam vs sampling
Beam is deterministic and chases the most probable text, which is what you want when there is a correct answer. But for open-ended writing the likeliest continuation is often bland or repetitive. Sampling does the opposite: it draws the next token from the distribution instead of maximising it, shaped by temperature and top-k / top-p (nucleus). The result is diverse and creative, at the risk of wandering off.
The rule of thumb: beam for translation, summarization, speech-to-text and code; sampling for chatbots, stories and brainstorming. Modern chat models sample; Google Translate beams. Maximum likelihood is not the same as best quality.
The visualiser runs genuine log-prob math and real top-k pruning on a scripted probability tree, plus a step-by-step explainer and a from-scratch Python implementation. Have a look:
Top comments (0)