DEV Community

jidonglab

Speculative Decoding: Why Two Models Decode Faster Than One

A 70B model generating one token at a time spends almost none of its time computing. It spends its time moving weights from HBM into the compute units, doing a tiny matrix-vector multiply, and throwing the loaded weights away. Autoregressive decoding is memory-bandwidth bound, not compute bound. That single fact is why speculative decoding, which runs a small draft model alongside the big one, can make inference roughly 2-3x faster without changing the output distribution at all. You are paying for the trip to HBM anyway, so you might as well verify several tokens per trip.

This post is the mechanism, the correctness proof, the acceptance-rate math, and the knobs that actually move latency.

TL;DR

  • Speculative decoding runs a cheap draft model to propose γ tokens, then the expensive target model scores all γ of them in one parallel forward pass (which also yields the distribution for position γ+1) and accepts the longest valid prefix.
  • It is lossless: a modified rejection-sampling step guarantees accepted tokens are distributed exactly as if sampled from the target model alone. No quality trade-off.
  • Speedup comes from arithmetic intensity: autoregressive decode is memory-bound, so verifying γ+1 positions costs roughly the same wall-clock time as decoding one.
  • The win is governed by the acceptance rate α. Expected tokens per target call ≈ (1 - α^(γ+1)) / (1 - α). High α (well-aligned draft) is everything.
  • Gains shrink at large batch sizes, where decode becomes compute-bound. Variants like EAGLE, Medusa, and n-gram lookahead remove the separate draft model.

Why is autoregressive decoding slow in the first place?

Because each token requires reading the entire model's weights from memory to produce a single column of activations. For a 70B model in fp16, that is ~140 GB of reads per token. On an H100 with ~3.35 TB/s of bandwidth, you are bounded near ~24 tokens/sec from memory alone, and the FLOPs for that one token are a rounding error against the chip's compute ceiling. The GPU is starving.
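The bound above is plain division, and a few lines make it easy to rerun for other model sizes or GPUs. This is a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes weight reads are the only memory traffic (KV-cache reads and activations are ignored).

```python
def max_decode_tokens_per_sec(n_params: float,
                              bytes_per_param: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode speed if every token
    must stream all weights from memory exactly once."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_sec / weight_bytes


# 70B parameters, fp16 (2 bytes each), ~3.35 TB/s of HBM bandwidth
bound = max_decode_tokens_per_sec(70e9, 2, 3.35e12)
print(f"{bound:.1f} tokens/sec")
```

Real systems land below this ceiling, since the KV cache and kernel launch overheads also compete for time.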

The key asymmetry: a forward pass over a sequence of K tokens reads those same weights once and does K times the math. Prefill (processing the prompt) exploits this — it is fast per token. Decode does not, because token t+1 depends on token t.

Speculative decoding breaks the dependency by guessing. If you have candidate tokens for positions t+1…t+γ from somewhere cheap, you can score all of them in one target forward pass, for nearly the price of one decode step.

How does speculative decoding actually work?

Three steps, repeated:

  1. Draft. A small model M_q (e.g. a 1-7B model) autoregressively generates γ candidate tokens x₁…x_γ, recording its probabilities q(x_i).
  2. Verify. The large target model M_p runs one forward pass over the γ candidates in parallel, producing target probabilities p(x_i | context) for every position simultaneously, plus a distribution for the position after the last draft token.
  3. Accept/reject. Walk left to right. Accept each draft token with probability min(1, p(x_i)/q(x_i)). On the first rejection, discard the rest and resample one token from the adjusted distribution. Then loop.

Because the target's forward pass is parallel across positions and decode is memory-bound, that verification pass costs roughly one normal decode step. If the draft is good, you committed several tokens for the price of one.

Why is speculative decoding lossless?

This is the part people assume must hurt quality, and it doesn't. The accept/reject rule is a modified rejection sampler that provably reproduces the target distribution p(x).

Accept candidate x (drawn from draft q) with probability min(1, p(x)/q(x)). If rejected, sample a replacement from the residual distribution:

p'(x) = normalize( max(0, p(x) - q(x)) )

The combined probability of emitting token x — either accepting it from the draft or producing it after a rejection — works out to exactly p(x). Sketch: probability of accepting x is q(x)·min(1, p(x)/q(x)) = min(q(x), p(x)). The rejection branch fires with probability 1 - Σ min(q,p) = Σ max(0, p-q), and conditioned on rejection you draw from p', contributing max(0, p(x)-q(x)). Sum the two branches: min(q(x),p(x)) + max(0,p(x)-q(x)) = p(x). QED.
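The identity in the sketch can also be checked numerically. The snippet below computes the emitted distribution exactly (no sampling) for one position; the function name and the example vectors p and q are arbitrary choices made for illustration.

```python
import numpy as np


def emitted_distribution(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Exact distribution of the token emitted at one position when the
    draft samples from q and the accept/reject rule is applied against p."""
    accept_mass = np.minimum(p, q)           # P(draft proposes x AND it is accepted)
    reject_prob = 1.0 - accept_mass.sum()    # P(the proposal is rejected)
    residual = np.clip(p - q, 0.0, None)     # unnormalized (p - q)_+
    if residual.sum() > 0:                   # zero only when p == q (never rejects)
        residual = residual / residual.sum()
    return accept_mass + reject_prob * residual


p = np.array([0.50, 0.30, 0.15, 0.05])   # target distribution (toy values)
q = np.array([0.25, 0.25, 0.25, 0.25])   # draft distribution (toy values)
out = emitted_distribution(p, q)
assert np.allclose(out, p)               # the two branches sum back to p
```

The check passes for any pair of valid distributions over the same vocabulary, not just this one.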

So with temperature sampling, speculative decoding's output is statistically indistinguishable from sampling the target directly. For greedy decoding it degenerates to a simpler rule: accept while the draft's argmax equals the target's argmax. Either way — no approximation, no quality loss. This is the property that separates speculative decoding from lossy tricks like aggressive quantization or distillation.

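import numpy as np

def sample(dist, rng):
    """Draw one token id from a probability vector."""
    return int(rng.choice(len(dist), p=dist))

def verify(draft_tokens, q_probs, p_probs, rng):
    """One verification step.

    q_probs[i] and p_probs[i] are the draft and target distributions (1-D
    NumPy arrays over the vocabulary) at draft position i. p_probs carries one
    extra trailing entry for the bonus position. rng is a numpy.random.Generator.
    Returns the accepted tokens plus one resampled or bonus token."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_x, q_x = p_probs[i][tok], q_probs[i][tok]
        # q_x > 0 because tok was sampled from the draft distribution
        if rng.random() < min(1.0, p_x / q_x):
            accepted.append(tok)            # draft token survives
        else:
            # rejection: sample from normalized (p - q)_+
            residual = np.clip(p_probs[i] - q_probs[i], 0.0, None)
            residual /= residual.sum()
            accepted.append(sample(residual, rng))
            return accepted                 # discard the rest of the draft
    # all γ accepted → free bonus token from the trailing target dist
    accepted.append(sample(p_probs[len(draft_tokens)], rng))
    return accepted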

Note the bonus token: if all γ drafts are accepted, the target already computed the distribution for position γ+1 in the same pass, so you get one extra token for free. That is why best case is γ+1 tokens per target call.
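For the greedy special case mentioned earlier (accept while the draft's argmax equals the target's argmax), the whole draft/verify/accept loop fits in a short toy. Everything here is hypothetical scaffolding: the "models" are plain Python functions mapping a context to its next token, and the verify step is emulated with a loop but counted as a single target call, which is what one parallel forward pass would be on a real model.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # toy "model": context -> greedy next token


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding: one model call per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, gamma: int) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target calls)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_calls = 0
    while len(out) < goal:
        # Draft: gamma cheap sequential guesses.
        guesses: List[int] = []
        for _ in range(gamma):
            guesses.append(draft(out + guesses))
        # Verify: the target's greedy token at each of the gamma+1 positions.
        target_calls += 1
        checks = [target(out + guesses[:i]) for i in range(gamma + 1)]
        # Accept the longest prefix where draft and target agree.
        n_ok = 0
        while n_ok < gamma and guesses[n_ok] == checks[n_ok]:
            n_ok += 1
        out.extend(guesses[:n_ok])
        out.append(checks[n_ok])  # correction on mismatch, bonus token otherwise
    return out[:goal], target_calls


# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10

spec_out, calls = speculative_greedy(target, draft, [0], n_new=20, gamma=3)
assert spec_out == greedy_decode(target, [0], 20)  # identical output
print(f"{calls} target calls for 20 tokens")
```

The output always matches target-only greedy decoding because every committed token is the target's own choice given the accepted prefix; the draft only decides how many of those choices get committed per call.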

What determines the actual speedup?

The acceptance rate α — the probability the target accepts a given draft token — dominates everything. Model the per-position acceptances as i.i.d. with probability α (a simplification, but a useful one). The expected number of tokens produced per target forward pass with draft length γ is:

E[tokens] = (1 - α^(γ+1)) / (1 - α)
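For readers who want the missing step, the closed form follows from a geometric sum. Here N (a symbol introduced only for this derivation) is the number of draft tokens accepted in one round, and every round emits one extra token: the resampled token on a rejection, or the bonus token if all γ are accepted.

```latex
\begin{aligned}
\Pr[N \ge k] &= \alpha^{k}, \qquad k = 1, \dots, \gamma \\
\mathbb{E}[\text{tokens}] &= 1 + \mathbb{E}[N]
  = 1 + \sum_{k=1}^{\gamma} \alpha^{k}
  = \sum_{k=0}^{\gamma} \alpha^{k}
  = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}
\end{aligned}
```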

At α = 0.8, γ = 4: (1 - 0.8⁵)/(1 - 0.8) ≈ 3.4 tokens per target call. If a target call costs roughly the same as one decode step, that is ~3.4x the throughput before subtracting overhead. At α = 0.5 it collapses to ~1.9. At α = 0.3 it is barely above 1.4, and now the draft model's own latency can eat the gains entirely.
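The formula is easy to tabulate for your own α and γ. A minimal helper, with a function name chosen for illustration, evaluated at the three acceptance rates discussed above:

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target forward pass, assuming each
    draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)  # limit of the geometric sum as alpha -> 1
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


for alpha in (0.8, 0.5, 0.3):
    print(f"alpha={alpha}: {expected_tokens(alpha, 4):.4f} tokens per target call")
```

Note how quickly the curve flattens at low α: raising γ barely helps once α^(γ+1) is already close to zero.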

Two competing costs set the optimal γ:

  • Too small and you under-utilize the parallel verify pass.
  • Too large and you waste draft compute on tokens that get rejected anyway (everything after the first rejection is thrown away), and you pay γ sequential draft steps before each verify.

In practice γ between 4 and 8 is the usual sweet spot, and the optimum rises with α. Tune it per workload.
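To see the trade-off concretely, the sketch below folds draft cost into the estimate. It assumes a simplified cost model that is not stated in this post: one draft step costs a fraction c of one target pass, so a round costs γ·c + 1 target-pass units. The names `speedup` and `best_gamma` and the value c = 0.05 are illustrative.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens per target call under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimated speedup over plain decoding. One round costs gamma draft
    steps (c target-units each) plus one target verify pass."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Draft length in [1, max_gamma] that maximizes the estimated speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))


for alpha in (0.6, 0.8, 0.9):
    g = best_gamma(alpha, c=0.05)
    print(f"alpha={alpha}: best gamma={g}, speedup~{speedup(alpha, g, 0.05):.2f}x")

# A poorly aligned, relatively expensive draft makes things slower, not faster.
assert speedup(0.3, 4, c=0.3) < 1.0
```

Under this model the best γ grows with α, consistent with the rule of thumb above. Real optima also depend on batch size and kernel overheads, so treat the numbers as a starting point for tuning.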

When does speculative decoding stop helping?

At large batch sizes. The whole premise is that decode is memory-bound — that the target's verify pass is "free" because the GPU was idle waiting on weights. As you batch more concurrent requests, you amortize the weight reads across the batch and decode becomes compute-bound. Now those extra verification FLOPs over γ positions × batch are real work, and the draft model is pure overhead. High-throughput serving with large batches often sees speculative decoding regress. It shines for low-latency, low-batch scenarios: interactive chat, single-stream agents, latency-SLA endpoints.
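A crude roofline estimate shows where the "free" verify pass stops being free. This is a sketch under loud assumptions: step time is the larger of weight-read time and matmul time, FLOPs are approximated as 2 × parameters per token position, attention and KV-cache traffic are ignored, and the default peak throughput (~990 TFLOP/s dense 16-bit) is an approximate H100 figure assumed here, not taken from this post.

```python
def step_time(batch: int, positions: int,
              n_params: float = 70e9, bytes_per_param: float = 2.0,
              bandwidth: float = 3.35e12, peak_flops: float = 9.9e14) -> float:
    """Roofline estimate (seconds) of one forward pass that scores
    `positions` token positions for each of `batch` sequences."""
    memory_time = n_params * bytes_per_param / bandwidth   # stream the weights once
    compute_time = 2.0 * n_params * batch * positions / peak_flops
    return max(memory_time, compute_time)


def verify_cost_ratio(batch: int, gamma: int) -> float:
    """Cost of verifying gamma drafts (+1 bonus position) relative to
    one ordinary decode step at the same batch size."""
    return step_time(batch, gamma + 1) / step_time(batch, 1)


for batch in (1, 16, 64, 256, 512):
    print(f"batch={batch:>3}: verify costs {verify_cost_ratio(batch, 4):.2f}x a decode step")
```

At batch 1 the ratio is 1.0, so verification really is free in this model. Once batch × (γ+1) pushes the pass past the compute roofline the ratio climbs toward γ+1, at which point speculation can no longer beat plain decoding.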

Other failure modes:

  • Draft misalignment. A draft from a different tokenizer or training distribution gives low α. The draft should be a smaller sibling of the target (same family, same tokenizer). A distilled or pruned version of the target is ideal.
  • Distribution shift mid-generation. α is not constant; it drops on hard, high-entropy spans (novel reasoning) and spikes on boilerplate (closing braces, repeated structure).

Do you even need a separate draft model?

No — and the newest variants drop it. The separate-model approach has two annoyances: you must host a second model, and keeping it aligned is work. Three alternatives:

  • Self-speculation / Medusa. Bolt extra decoding "heads" onto the target that predict tokens t+1, t+2, … in parallel from the same hidden state. One model, trained heads, tree-structured candidates.
  • EAGLE. Autoregress at the feature level (the second-to-last hidden state) rather than the token level, which is far more predictable, then verify. EAGLE-style methods report some of the highest acceptance rates because feature-space drafting aligns tightly with the target.
  • N-gram / prompt lookahead. No model at all: propose continuations by matching the recent context against earlier text. Astonishingly effective when the output echoes the input — RAG with quoted sources, code edits, JSON that mirrors a schema, summarization that lifts spans. The draft is a hash-table lookup; α on copy-heavy spans is enormous.

For agentic and RAG pipelines where outputs repeat input structure, prompt-lookahead decoding is often the highest ROI version: zero extra GPU memory, large α exactly where text is repetitive.
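A minimal sketch of the prompt-lookup idea, assuming token ids in a plain list. The function name and defaults are illustrative, and it uses a simple right-to-left scan instead of a hash table to keep the example short.

```python
from typing import List, Sequence


def ngram_propose(context: Sequence[int], ngram: int = 3, gamma: int = 4) -> List[int]:
    """Propose up to `gamma` draft tokens by finding the most recent earlier
    occurrence of the last `ngram` tokens and copying what followed it."""
    if len(context) <= ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan candidate start positions right to left, skipping the suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(context[follow:follow + gamma])
    return []  # no match: fall back to ordinary decoding for this step


# The span 1,2,3 appeared earlier followed by 4,5,6,0, so those become the draft.
context = [7, 8, 9, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3]
assert ngram_propose(context) == [4, 5, 6, 0]
```

The proposed tokens still go through the normal verify step, so a bad match costs only a wasted verification and never changes the output.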

Where this fits in a production stack

Most serious inference servers (vLLM, TensorRT-LLM, SGLang) ship speculative decoding as a config flag: you pick a draft model or enable n-gram/EAGLE and set γ. Combine it with the other wins for memory-bound decode: KV-cache paging, prefix caching, and a quantized draft. The mental model to carry: speculative decoding spends otherwise idle compute to buy latency, losslessly. It is one of the rare optimizations with no accuracy tax. The only costs are engineering effort and, at high batch sizes, throughput.

So why are two models faster than one?

Speculative decoding is faster because autoregressive LLM decoding wastes the GPU: it is bottlenecked on reading weights from memory, not on math. A small draft model proposes several tokens cheaply, the large target model verifies all of them in a single parallel forward pass for roughly the cost of generating one token, and a rejection-sampling step guarantees the accepted tokens follow the target model's exact distribution. The result is typically 2-3x lower latency with zero quality loss. The gain is governed by the acceptance rate α through the expected tokens per target call, (1 - α^(γ+1))/(1 - α). It is largest at small batch sizes, and it is increasingly available without a second model at all via EAGLE, Medusa, and n-gram lookahead.
