How to Make an LLM 2-3x Faster Without Changing a Single Word It Says

#ai #beginners #llm #machinelearning

Large language models are slow for one stubborn reason: they write one token at a time. To produce a 200-token answer, the model runs its full stack of billions of parameters 200 separate times, and each run has to finish before the next can start. You can't compute token 5 until you know token 4. It's a strictly sequential grind.

Worse, each run barely uses your hardware. A forward pass for a single token spends most of its time hauling weights out of memory, not doing math, so your expensive GPU's compute units sit mostly idle. That single fact (one token per slow, memory-bound pass) is the wall every fast-inference trick is trying to knock down.
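To put a rough number on that wall, here is a back-of-the-envelope sketch. The model size, weight precision, and memory bandwidth are all illustrative assumptions rather than measurements, and the estimate ignores compute time and the KV cache entirely.

```python
# Rough ceiling on decode speed when each pass is limited by reading the weights.
# Every number below is an illustrative assumption, not a benchmark.

def seconds_per_pass(num_params: float, bytes_per_param: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Time to stream every weight from memory once (ignores compute and KV cache)."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s

params = 7e9          # assumed 7B-parameter model
bytes_per_param = 2   # assumed 16-bit weights
bandwidth = 1e12      # assumed 1 TB/s of memory bandwidth

pass_time = seconds_per_pass(params, bytes_per_param, bandwidth)
tokens_per_second = 1 / pass_time
answer_time = 200 * pass_time  # the 200-token answer from the intro

print(f"{pass_time * 1e3:.0f} ms per pass, ~{tokens_per_second:.0f} tokens/s, "
      f"{answer_time:.1f} s for 200 tokens")
```

Under these assumptions the ceiling is set by memory traffic alone: the pass takes about as long whether it scores one token or a handful, which is the slack speculative decoding exploits.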

Speculative decoding knocks it down with a trick that sounds too cheap to work: guess ahead with a small model, then have the big model check all the guesses at once. And the output comes out exactly the same as if you'd never used the trick. Same words, same order, just faster.

The insight: checking is cheaper than writing

Here's the asymmetry everything hinges on. Generating five tokens the normal way costs five slow passes. But checking five already-written tokens costs almost the same as checking one — because you pay the memory cost once and get all five verdicts in a single pass. Writing is sequential and slow. Verifying is parallel and nearly free.

So if some faster process could propose the next few tokens, the big model could confirm a whole batch of them in one shot instead of grinding them out individually. That "faster process" is a second, much smaller model.

Two models: a draft and a target

You run two models that share the same vocabulary.

The target is the big, accurate model whose output you actually want. Its answers must not change.
The draft is a small, fast model whose only job is to guess ahead cheaply.

The draft doesn't need to be smart. It just needs to be right often enough that its guesses usually survive. You keep the quality of the big model and borrow the speed of the small one.

The loop, one round at a time

1. Propose. The draft decodes K tokens ahead on its own — say 4. Because the draft is tiny, those 4 sequential guesses are quick and cheap. You now have a little chain of speculative tokens.

2. Verify. The target runs once over your current text plus all 4 guesses. Thanks to how attention works, that single pass produces the target's own predicted token at every position in parallel, as if you'd asked "what would you have written here?" at each step, simultaneously. One expensive pass, five predictions: one for each of the 4 guessed positions, plus one for the position right after the last guess.

3. Accept. Now walk the guesses left to right and keep each one as long as it matches what the target wanted at that spot. The instant a guess disagrees, you stop. That mismatched token is rejected, and everything after it gets thrown away — those later guesses were built on a token the target won't keep. A round might accept all 4, or just 1, or 0.

4. Correct. Even when a guess is rejected, that same pass already computed the target's own token for that position, so you take it as a free correction. This guarantees progress: every round writes at least one genuine target token, even when the draft got everything wrong. And if all K guesses are accepted, the pass also gives you a bonus token for the position just past them.

So each round commits between 1 and K+1 tokens — always including at least one the target itself chose — for the cost of a single target pass.
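The four steps above can be sketched in a few lines for the greedy case. This is a toy under stated assumptions, not a production implementation: a "model" here is a hypothetical plain function from a token list to its next token, and the names `speculative_round`, `speculative_decode`, `target_next`, and `draft_next` are made up for illustration. A real target would produce all K+1 predictions in one forward pass; the toy calls the function once per position to stay readable.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: token list in, greedy next token out.
NextToken = Callable[[List[int]], int]

def speculative_round(tokens: List[int], draft: NextToken,
                      target: NextToken, k: int) -> List[int]:
    """Run one propose/verify/accept/correct round; return the committed tokens."""
    # 1. Propose: the draft decodes k tokens sequentially.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(tokens + guesses))

    # 2. Verify: the target's own choice at each of the k+1 positions.
    # (A real target gets all of these from a single forward pass.)
    wanted = [target(tokens + guesses[:i]) for i in range(k + 1)]

    # 3. Accept guesses left to right, stopping at the first mismatch.
    committed: List[int] = []
    for guess, want in zip(guesses, wanted):
        if guess != want:
            break
        committed.append(guess)

    # 4. Correct: take the target's token at the first unaccepted position.
    # If every guess was accepted, this is the bonus token.
    committed.append(wanted[len(committed)])
    return committed

def speculative_decode(prompt: List[int], draft: NextToken, target: NextToken,
                       k: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Decode with speculation; return (tokens, number of target passes)."""
    tokens, passes = list(prompt), 0
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_round(tokens, draft, target, k)
        passes += 1
    return tokens[: len(prompt) + max_new_tokens], passes

def greedy_decode(prompt: List[int], target: NextToken, max_new_tokens: int) -> List[int]:
    """Baseline: one target pass per token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens

# Toy models: the target counts upward mod 10; the draft agrees except after a 6.
def target_next(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10

def draft_next(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 6 else (tokens[-1] + 1) % 10

baseline = greedy_decode([0], target_next, 12)
spec_tokens, spec_passes = speculative_decode([0], draft_next, target_next, k=4, max_new_tokens=12)
print(spec_tokens == baseline, spec_passes)
```

With this toy draft, the speculative run produces the same 12 tokens as the baseline while counting 3 target passes instead of 12. The round where the draft stumbles still commits two tokens: the one accepted guess plus the target's correction.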

Why the output never changes

This is the part that makes it more than a hack. Under greedy decoding, the final text is token-for-token identical to what the target would have produced alone, because the target has veto power at every position. A draft token only survives if the target agrees; any disagreement is overwritten by the target's own choice. With sampling, exact matching is replaced by a cleverer probabilistic accept/reject rule that provably reproduces the target's output distribution, so the results are statistically indistinguishable from sampling the target directly. The draft never injects its opinions. It only proposes candidates the target is free to confirm or reject.
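For the sampling case, the standard rule is: accept a draft token x with probability min(1, p(x)/q(x)), where p is the target's distribution and q is the draft's, and on rejection resample from the leftover mass max(0, p - q), renormalized. Here is a minimal sketch for a single position. The function names and the example distributions are made up for illustration, and real systems work on logits over a full vocabulary.

```python
import random
from typing import List

def accept_or_resample(token: int, p: List[float], q: List[float],
                       rng: random.Random) -> int:
    """Given a token the draft sampled from q, return a token distributed as p."""
    # Accept the draft's token with probability min(1, p/q).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    # Otherwise resample from the leftover mass max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]

def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token accept_or_resample ends up returning."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x] / q[x])
    reject_prob = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:  # p and q are identical, so nothing is ever rejected
        return accepted
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]

# Made-up example: the draft is overconfident in tokens 1 and 2.
p = [0.6, 0.3, 0.1]  # target distribution
q = [0.2, 0.5, 0.3]  # draft distribution
print(output_distribution(p, q))
```

Working through the arithmetic for this example, the accepted mass plus the redistributed rejected mass comes back to p (up to floating-point rounding), which is the sense in which the trick is lossless under sampling.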

Lossless. That word matters. You are not trading quality for speed here.

What actually drives the speedup

Everything rides on how often the draft guesses right — the acceptance rate.

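Draft usually correct → most rounds accept the whole run of K tokens → many tokens committed per target pass → tokens per target pass approaches K+1 (the wall-clock speedup is somewhat lower, because the draft's own passes aren't free).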
Draft often wrong → rounds accept a token or two → you barely beat the plain baseline.

In practice, a decent draft on predictable text gets you roughly 2–3× fewer target passes for the same output. That's why it's a staple of production serving stacks.
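To see where a figure like that can come from, here is a small sketch of the expected tokens per target pass. It rests on a simplifying assumption that real text does not satisfy exactly: each guess is accepted independently with the same probability `alpha`. Under that assumption the expected count is the geometric sum 1 + alpha + ... + alpha^K. The function name is made up for illustration.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass, assuming each of the k guesses
    is accepted independently with probability alpha (a simplification)."""
    if alpha >= 1.0:
        return float(k + 1)  # every guess lands, plus the bonus token
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.3, 0.6, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, 4), 2))
```

With K = 4, an acceptance probability of 0.8 works out to about 3.4 tokens per target pass, while 0.3 gives about 1.4. The first lands in the 2-3× range once draft overhead is subtracted; the second barely beats the baseline.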

When it helps, when it hurts

It shines on predictable, low-entropy text — code, structured formats, obvious continuations — where guesses land often and accepted runs are long. It helps most when the target is large and memory-bound, so parallel verification is a big relative win.

It helps less, or can even hurt, when the draft is a poor match for the target, when the text is highly creative or random, or when K is set so large that most proposed tokens are wasted. The craft is picking a fast-but-decent draft and a K that fits your workload.
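Picking K can be framed as a small optimization. The sketch below is a simplified cost model, not a benchmark: it assumes each guess is accepted independently with probability `alpha`, and that one draft pass costs a fixed fraction `draft_cost` of one target pass. Both function names and all the numbers are illustrative assumptions.

```python
def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding under a simplified model.

    alpha:      assumed independent per-guess acceptance probability
    draft_cost: assumed (time of one draft pass) / (time of one target pass)
    """
    if alpha >= 1.0:
        tokens = float(k + 1)
    else:
        tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # expected tokens per round
    round_cost = 1 + k * draft_cost  # one target pass plus k draft passes
    return tokens / round_cost

def best_k(alpha: float, draft_cost: float, max_k: int = 16) -> int:
    """Return the K in 1..max_k with the highest estimated speedup."""
    return max(range(1, max_k + 1),
               key=lambda k: estimated_speedup(alpha, k, draft_cost))

print(round(estimated_speedup(0.8, 4, 0.05), 2))  # good draft, cheap to run
print(round(estimated_speedup(0.1, 8, 0.2), 2))   # poor draft, K set too large
print(best_k(0.5, 0.1), best_k(0.9, 0.1))
```

In this model, a poorly matched draft combined with a large K pushes the estimate below 1, meaning slower than not speculating at all, and the best K grows as the acceptance probability rises. Treat the exact values as a guide for intuition; measuring on your own workload is the real test.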

You rarely hand-roll the loop in production. Hugging Face transformers exposes it as assisted generation (you pass the draft to generate() as assistant_model); vLLM and TensorRT-LLM enable it through configuration, using a draft model, n-gram lookups, or Medusa heads. Same output, fewer passes.
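Of those draft sources, n-gram lookup is the easiest to picture, because it needs no second model at all: it guesses that text which appeared earlier in the context is about to repeat. Here is a minimal sketch of the idea. The function `ngram_propose` is hypothetical and is not how any particular library implements it.

```python
from typing import List

def ngram_propose(tokens: List[int], n: int, k: int) -> List[int]:
    """Propose up to k tokens by copying what followed the most recent earlier
    occurrence of the last n tokens. Returns [] when there is no match."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards over earlier start positions, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []

context = [5, 6, 7, 8, 9, 5, 6]  # the pair (5, 6) appeared before, followed by 7, 8, 9
print(ngram_propose(context, n=2, k=3))
```

The proposals then go through the same verify/accept/correct steps as any other draft, so a bad guess costs a wasted slot in the verification pass and never changes the output.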

I built an interactive version where you drag a "draft accuracy" slider and watch the accept rate — and the speedup — climb in real time:

https://dev48v.infy.uk/ai/days/day21-speculative-decoding.html