Tech_Nuggets

Speculative decoding: when and why it actually speeds up inference

Your chat endpoint serves 200 requests per second. The model is a 70B Llama 3 fine-tune. The GPU is sitting at 78% utilization, but the user-facing latency is still bad — 380 ms to first token on the median request, 1.1 s P99. The naive read is "we need a bigger box." The actual read is that the GPU is memory-bound, not compute-bound: most of the time is spent shipping weights and KV-cache state from HBM into the SMs, one token at a time, waiting for the next one. Speculative decoding is the technique that turns that one-token-at-a-time pipeline into a several-tokens-at-a-time pipeline without changing what the model actually samples. In our case it dropped p50 TTFT from 380 ms to 140 ms with the same hardware and the same 70B weights.

Here's what it is, what the variants are, and when it stops being a free lunch.

Why this matters in practice

The throughput ceiling for an autoregressive LLM on a single GPU is set by memory bandwidth, not by FLOPs: every decode step has to stream the full set of weights and the KV cache from HBM to produce one token. Doubling the model's parameters roughly doubles the time per token on a memory-bound workload, yet the SMs still spend most of each step idle, waiting on memory. Speculative decoding puts that idle compute to work: a much smaller draft model proposes K tokens cheaply, and the target model checks all of them in one forward pass instead of running one pass per token.
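A back-of-envelope calculation shows why decode is bandwidth-limited. This is a minimal sketch, not a measurement: the function name, the parameter counts, and the bandwidth figure are illustrative assumptions, and it ignores KV-cache traffic, batching, and kernel overheads.

```python
def min_decode_ms_per_token(param_count: float, bytes_per_param: float,
                            hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on decode time per token if every weight is read once per step."""
    weight_bytes = param_count * bytes_per_param
    seconds = weight_bytes / (hbm_bandwidth_gb_s * 1e9)
    return seconds * 1e3

# Assumed numbers, for illustration only: bf16 weights (2 bytes per parameter)
# sharded across 4 GPUs with a combined HBM bandwidth of 4 x 3000 GB/s.
ms_70b = min_decode_ms_per_token(70e9, 2, 4 * 3000)
ms_8b = min_decode_ms_per_token(8e9, 2, 4 * 3000)
print(f"70B target: >= {ms_70b:.1f} ms/token, 8B draft-sized model: >= {ms_8b:.1f} ms/token")
```

The bound scales with the number of bytes read, which is why a draft model an order of magnitude smaller can propose several tokens in the time the target needs for one.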

The property people forget until it bites them: speculative decoding is an exact decoding accelerator. The output distribution is provably identical to running the target model alone, because every proposed token is verified by the target. If the target rejects a proposal, the algorithm resamples that position from a corrected distribution. If the target accepts it, the token shares one target forward pass with the other accepted tokens instead of paying for a pass of its own. You don't trade output quality for speed. You trade VRAM and engineering effort for speed.

How the algorithm actually works

The original formulation is from DeepMind's Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper, "Accelerating Large Language Model Decoding with Speculative Sampling" (Feb 2023). The setup:

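  1. The draft model M_q generates K candidate tokens autoregressively, one at a time. It is much smaller than the target.
  2. The target model M_p scores all K drafted tokens in a single forward pass, producing its own distribution at K+1 positions (one per drafted token, plus one position past the end of the draft).
  3. For each proposed token x_t, in order, compute the acceptance probability r = min(1, M_p(x_t) / M_q(x_t)).
  4. Sample a uniform u in [0, 1). Accept x_t if u < r. On the first rejection, discard the rest of the draft and resample that position from the residual distribution max(0, M_p - M_q), normalized to sum to 1. If all K tokens are accepted, sample one bonus token from the target's distribution at position K+1.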
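The verification steps above can be sketched in a few lines of plain Python. This is a toy version over explicit probability lists, assuming a three-token vocabulary and fixed distributions at every position; `speculative_step` is an illustrative name, not a function from any library.

```python
import random

def speculative_step(draft_tokens, q_dists, p_dists, rng):
    """Verify drafted tokens against the target and return the tokens to emit.

    draft_tokens: K token ids sampled from the draft model.
    q_dists: K draft distributions (probability lists over the vocabulary).
    p_dists: K+1 target distributions (same positions, plus one past the draft).
    """
    emitted = []
    for t, x in enumerate(draft_tokens):
        p, q = p_dists[t], q_dists[t]
        r = min(1.0, p[x] / q[x])  # q[x] > 0 because x was sampled from q
        if rng.random() < r:
            emitted.append(x)
            continue
        # First rejection: resample from normalized max(0, p - q), drop the rest.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every drafted token was accepted: take one bonus token from the target.
    last = p_dists[-1]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted

rng = random.Random(0)
q = [0.6, 0.3, 0.1]  # toy draft distribution
p = [0.3, 0.5, 0.2]  # toy target distribution
draft = rng.choices(range(3), weights=q, k=2)
print(speculative_step(draft, [q, q], [p, p, p], rng))
```

Each call emits between 1 and K+1 tokens. Summing the accept and resample paths gives min(p, q) + max(0, p - q) = p for every token, which is why the emitted tokens follow the target's distribution exactly.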

The number of accepted tokens per cycle is a random variable. If the draft model is well-aligned with the target — close to it in distribution — the expected accepted length is high and the speedup is high. If they diverge (different tokenizer offset, different training data, different fine-tune), most proposals get rejected and you're paying the draft cost for nothing.
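To get a feel for how alignment drives accepted length, here is a small sketch under a simplifying assumption that does not hold exactly in practice: each drafted token is accepted independently with the same probability alpha. The function name is illustrative.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Mean accepted draft tokens per cycle under i.i.d. acceptance probability alpha."""
    # Token i counts only if tokens 1..i were all accepted, which has probability alpha**i.
    return sum(alpha ** i for i in range(1, k + 1))

for alpha in (0.5, 0.8, 0.9):
    row = [round(expected_accepted(alpha, k), 2) for k in (2, 5, 10)]
    print(f"alpha={alpha}: K=2,5,10 -> {row}")
```

Under this assumption the expected accepted length can never exceed alpha / (1 - alpha), however large K gets, so a weakly aligned draft cannot be rescued by drafting more tokens.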

```mermaid
flowchart LR
    A[Prompt] --> B[Draft model Mq<br/>generates K tokens<br/>autoregressively]
    B --> C[Target model Mp<br/>one forward pass<br/>over K+1 positions]
    C --> D{Acceptance<br/>check per token}
    D -- accept --> E[Emit token]
    D -- reject --> F[Resample from<br/>residual distribution]
    E --> G[Loop until EOS]
    F --> G
```

The cycle cost is roughly K forward passes of M_q + 1 forward pass of M_p + K cheap logit comparisons. A cycle that accepts n drafted tokens emits n + 1 tokens (the accepted ones plus the resampled or bonus token), so the time saved per cycle is the difference between the n + 1 M_p forward passes the unaccelerated decoder would have needed and the actual cycle cost.
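That accounting turns into a rough speedup estimate. The sketch below assumes costs are measured in units of one target forward pass, treats the logit comparisons as free, and uses illustrative names (`estimated_speedup`, `draft_cost_ratio`) that are not part of any library.

```python
def estimated_speedup(mean_accepted: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens emitted per cycle divided by cycle cost, in target-forward-pass units."""
    tokens_per_cycle = mean_accepted + 1.0   # accepted proposals + one token from the target pass
    cycle_cost = k * draft_cost_ratio + 1.0  # K draft passes + one target pass
    return tokens_per_cycle / cycle_cost

# Same mean accepted length, different K: a longer draft costs more per cycle.
for k in (3, 5, 10):
    print(f"K={k}: {estimated_speedup(mean_accepted=2.5, k=k, draft_cost_ratio=0.1):.2f}x")
```

The estimate is optimistic because it ignores scheduling, sampling, and KV-cache bookkeeping, so treat it as an upper bound to compare against measured numbers.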

Variants: which proposer to use

This is where the field has moved fast. The naive draft model (e.g. a 1B draft for a 70B target) still works, but a few smarter variants have taken over the recommended-default slot. vLLM's speculative decoding docs (v0.22.0, released May 2026) list nine built-in methods; the ones that matter for most teams are these.

| Method | What it is | Best for | Cost / risk |
| --- | --- | --- | --- |
| EAGLE / EAGLE-2 / EAGLE-3 (Li et al., 2024) | A small draft head that works from the target's hidden-state features instead of from tokens alone. In the original EAGLE it predicts the next position's feature and reuses the target's LM head to turn it into a token. | General-purpose, best raw acceptance length. Recommended default for Llama-style models. | Need a trained EAGLE head per target model. |
| Multi-Token Prediction (MTP) | Built into the target model itself during training (DeepSeek-V3 style). The model emits several candidate tokens per forward pass. | Targets that ship with native MTP weights. | Zero extra parameters. Not in the open Llama 3 / Mistral / Gemma 2/3 line. |
| N-gram (prompt lookup) | No model. Look up the next N tokens as a suffix in the prompt or recent context. | Code completion, templated outputs, JSON extraction. | Falls off a cliff on free-form prose. |
| Suffix decoding | Match against a suffix tree built from the prompt and recent generations. | Codebases, JSON, anything with repeated structure. | Same as n-gram: useless on chat. |
| MLP speculator | A tiny MLP trained on the target's hidden states. | Cases where an EAGLE head is overkill. | Lower acceptance than EAGLE. |
| Self-speculative / Medusa | Multiple prediction heads bolted onto the target. | When you can fine-tune the target. | Adds heads to every forward pass. |
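To make the n-gram row concrete, here is a minimal sketch of prompt lookup over a list of token ids. It illustrates the idea only and is not vLLM's implementation; the function name and the `max_ngram` parameter are assumptions made for this example.

```python
def ngram_propose(context, max_ngram=3, num_speculative_tokens=5):
    """Propose tokens by finding an earlier occurrence of the context's trailing n-gram."""
    for n in range(max_ngram, 0, -1):  # prefer the longest matching suffix
        if len(context) <= n:
            continue
        suffix = context[-n:]
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(context) - n - 1, -1, -1):
            if context[start:start + n] == suffix:
                # Propose whatever followed that occurrence last time.
                return context[start + n:start + n + num_speculative_tokens]
    return []  # no match, so there is nothing to speculate this cycle

# Token ids standing in for a prompt with repeated structure.
history = [1, 2, 3, 4, 9, 1, 2, 3]
print(ngram_propose(history, max_ngram=3, num_speculative_tokens=2))
```

The proposer costs a scan over the context and no model call, which is why it is nearly free when it misses and effective only when the output repeats earlier text.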

The qualitative table in the vLLM docs is sharper than most blog summaries: under low QPS (latency-focused) EAGLE and MTP give the highest gains, while under high QPS (throughput-focused) the gap narrows because large batches already keep the GPU busy, leaving less idle compute for verification to use. n-gram and suffix give modest, predictable gains across both regimes without a draft model at all.

A working example with vLLM

Here's a real, runnable config that uses EAGLE for offline batched generation. It's straight from the vLLM repo's eagle.md example:

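```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # target model
    tensor_parallel_size=4,                       # target is sharded across 4 GPUs
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # EAGLE head trained for this target
        "draft_tensor_parallel_size": 1,              # draft stays on a single GPU
        "num_speculative_tokens": 2,                  # K: tokens drafted per cycle
        "method": "eagle",
    },
)

outputs = llm.generate(prompts, sampling_params)

# Print each prompt with its completion.
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```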

For a server, the CLI form is:

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-config '{
    "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
    "draft_tensor_parallel_size": 1,
    "num_speculative_tokens": 5,
    "method": "eagle"
  }'
```

Two notes from running this in production:

  • num_speculative_tokens is the K from the algorithm. Default is 5. Setting it too high (8, 16) increases per-cycle cost without proportionally raising acceptance length. Setting it to 2–4 is usually optimal for EAGLE on 7B/8B targets; for 70B targets the optimal K shifts higher.
  • draft_tensor_parallel_size is the number of GPUs the draft runs on. Do not give the draft the same parallelism as the target: sharding a model this small adds inter-GPU communication to every draft step and buys almost nothing. Keep the draft on one GPU even when the target spans eight.

If you'd rather skip the EAGLE head and just try the n-gram proposer on a code-completion workload:

```json
{
  "method": "ngram",
  "num_speculative_tokens": 5
}
```

Save this as config.json and pass it with --speculative-config "$(cat config.json)". The flag takes a JSON string, as in the serve command above, so a YAML file will not parse.

No draft model needed, no extra VRAM, no acceptance model. On code with repeated imports and function signatures you'll see a 1.4–1.8x speedup; on open-ended chat you'll see 1.0x and wonder why you bothered.

Acceptance rate and the metric that actually matters

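Speedup is a function of mean accepted tokens per cycle (μ). Using the cycle cost from earlier (K draft passes plus one target pass), the relationship for a single-stream workload is roughly:

```
speedup ≈ (1 + μ) / (K * draft_cost_ratio + 1)
```

where K is num_speculative_tokens and draft_cost_ratio is the per-token cost of the draft model as a fraction of the target's per-token cost. The numerator is the number of tokens emitted per cycle: μ accepted proposals plus the one token the target pass always yields. With K = 5 and a draft that costs 10% of the target, μ = 4 gives about 3.3x and μ = 1 gives about 1.3x. Break-even is μ = K * draft_cost_ratio (0.5 in this example), and the scheduling and sampling overheads the formula ignores push the real break-even higher. This is the single number to watch in any benchmark report claiming a "2x speedup from speculative decoding." If they don't report mean accepted tokens, the speedup isn't reproducible.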

Measure it. vLLM exposes request-level acceptance rate in examples/features/speculative_decoding/spec_decode_offline.py. Run it on a representative sample of your traffic before turning the flag on in production. A draft model that scores μ = 4.2 on HumanEval prompts can drop to μ = 1.1 on your support chat corpus. Same weights, different world.
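Whatever tool produces the raw counts, the arithmetic is the same. The sketch below assumes you can obtain three counters per corpus; the class, the field names, and the sample numbers are hypothetical and not tied to any vLLM output format.

```python
from dataclasses import dataclass

@dataclass
class SpecDecodeCounters:
    num_cycles: int           # draft-and-verify cycles that were run
    num_draft_tokens: int     # tokens proposed by the draft
    num_accepted_tokens: int  # proposals the target accepted

def summarize(c: SpecDecodeCounters) -> tuple[float, float]:
    """Return (mean accepted tokens per cycle, per-token acceptance rate)."""
    mean_accepted = c.num_accepted_tokens / c.num_cycles
    acceptance_rate = c.num_accepted_tokens / c.num_draft_tokens
    return mean_accepted, acceptance_rate

# Made-up counts for two corpora, only to show the comparison you want to run.
corpora = {
    "code prompts": SpecDecodeCounters(1000, 5000, 4200),
    "support chat": SpecDecodeCounters(1000, 5000, 1100),
}
for name, counters in corpora.items():
    mean_accepted, rate = summarize(counters)
    print(f"{name}: mean accepted = {mean_accepted:.2f}, acceptance rate = {rate:.0%}")
```

Reporting both numbers matters: the acceptance rate is per proposed token, while mean accepted tokens per cycle is the quantity the speedup depends on.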

Common pitfalls

A few traps that bite teams the first time:

  • Tokenizer mismatch between draft and target. If the draft and target use different BPE merges or have different added special tokens, the proposed token ids can be valid for the draft but invalid for the target. The acceptance check still runs, but acceptance collapses to near-zero. EAGLE heads published for a given target model are already aligned; ad-hoc draft pairs often are not.
  • Mismatched chat template. Speculative decoding requires the draft to see the exact same prompt prefix the target sees, including system prompt, chat template, and any tool calls. If your serving layer applies the template once, before the prompt reaches either model, both get the same prefix. If instead you cache a templated prompt for the target and pass a raw prompt to the draft, alignment is gone.
  • High num_speculative_tokens with a weak draft. The cost per cycle grows linearly in K. With a draft that achieves μ = 1.5, doubling K from 5 to 10 roughly doubles the wasted work per rejected cycle. Benchmark, don't guess.
  • Greedy decoding interactions. The acceptance probability is defined for stochastic sampling; in the pure-greedy limit (temperature 0) it reduces to an exact-match test: a drafted token is accepted if it is the target's argmax and rejected otherwise, and the first miss ends the cycle. Acceptance therefore depends on the sampling temperature, and a benchmark run at one temperature does not transfer to another. If you serve a chat product that always uses temperature 0, measure μ at temperature 0 rather than trusting blog benchmarks run under different settings.
  • Forgetting to include the draft's VRAM in capacity planning. A 1B EAGLE head is small (~2 GB in bf16), but if you're already at 95% VRAM on an H100, the draft won't fit and you'll OOM at serve time, not at model load.
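The tokenizer pitfall is cheap to check before any benchmarking. A minimal sketch, assuming you can get each model's vocabulary as a token-to-id dict (Hugging Face tokenizers expose one through `get_vocab()`); the helper name and the toy vocabularies are illustrative.

```python
def vocab_mismatches(target_vocab: dict, draft_vocab: dict) -> dict:
    """Return tokens whose ids differ or that exist in only one vocabulary.

    Values are (target_id, draft_id) pairs; None means the token is missing on that side.
    """
    problems = {}
    for token in target_vocab.keys() | draft_vocab.keys():
        target_id = target_vocab.get(token)
        draft_id = draft_vocab.get(token)
        if target_id != draft_id:
            problems[token] = (target_id, draft_id)
    return problems

# Toy vocabularies: the draft adds a special token and shifts another one's id.
target = {"hello": 0, "world": 1, "<eot>": 2}
draft = {"hello": 0, "world": 1, "<eot>": 3, "<pad>": 2}
for token, ids in sorted(vocab_mismatches(target, draft).items()):
    print(token, ids)
```

An empty result does not prove the pair is well aligned, but any mismatch is a strong sign that acceptance will collapse.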

When NOT to use it

Speculative decoding is the wrong tool if:

  • Your workload is throughput-bound, not latency-bound. If you're doing bulk batched generation at 1000+ concurrent requests on a 70B model, you're probably compute-bound, not memory-bound. Speculative decoding will help each individual user, but your aggregate tokens/sec will not improve much, and the draft cost is real.
  • You can't find a draft model for your target. Without a published EAGLE head, training one is a project of its own (the vllm-project/speculators library, v0.5.0 as of April 2026, helps, but you still need the target's training data distribution). For a one-off fine-tune on a small dataset, the engineering cost of training a draft often exceeds the latency win.
  • Your outputs are short and high-temperature. A 20-token generation at temperature 1.0 has 20 chances to be rejected, and the resampled token at the end is a guess. The acceptance math still works, but the per-cycle cost dominates because you have so few tokens to amortize it across. For short-form, high-entropy outputs, prefix caching and KV-cache quantization will get you further.
  • You're already running a non-default serving setup. If you use FlashInfer, FP8 weights, paged attention, chunked prefill, and disaggregated prefill/decode, verify that speculative decoding is compatible with all of them. The flags in --speculative-config don't always compose cleanly with the rest of the engine config.

TL;DR

  • Speculative decoding generates K tokens with a small draft model and verifies them in a single forward pass of the target. It is exact — the output distribution is provably identical to running the target alone.
  • The original paper is Chen et al., DeepMind, 2023. The dominant modern variant is EAGLE-3, which drafts at the hidden-state level instead of the token level.
  • vLLM v0.22.0 (May 2026) ships nine built-in methods: EAGLE, MTP, draft model, PARD, MLP, n-gram, suffix, hidden-state extraction, and a custom-proposer hook.
  • The single number to measure is mean accepted tokens per cycle (μ). μ = 4–5 is good. Below 2, the draft cost is not worth it.
  • It is a latency optimization on memory-bound, low-to-medium-QPS workloads. It is not a throughput hack. Pair it with a high-quality EAGLE head for your target model and a realistic traffic sample for benchmarking.

Next post: KV cache quantization — how FP8 / INT8 KV caches change the memory budget, and why some of them silently break speculative decoding's acceptance rate.


If you have a draft model recommendation for a target I haven't covered, drop it in the comments — I'm collecting community picks for a follow-up.
