You are running a 70B model on two H100s for an internal coding
assistant. The product team wants the median response under two
seconds. The model is sitting at 4.1s. You already swapped to
FP8 quantization. You already turned on continuous batching. The
GPU memory is comfortable. The bottleneck is the one thing
quantization does not fix: every output token is one full forward
pass through 70 billion parameters, in series, one after the
next. A 200-token answer is 200 of those passes back to back.
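Do the arithmetic: 4.1 s over a 200-token answer is roughly 20 ms per forward pass, so the serial decode loop is where nearly all of the median lives.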
Speculative decoding is the trick that breaks the series.
What it actually does
The idea, from Leviathan et al., 2023,
is to split the work in two. A small "draft" model generates a
short run of candidate tokens cheaply. The large "target" model
then runs one forward pass that scores all of those candidates
in parallel, accepting the longest prefix that matches what it
would have produced anyway, and falling back to its own sample at
the first mismatch.
When the draft is right, you get several tokens for the price of
one target step. When it is wrong, you paid for the draft calls,
and the target pass still yields only the single token it would
have produced anyway. The asymmetry is the win: target forward
passes are the expensive part, and you pay one of them per draft
batch instead of one per token.
Because the acceptance rule is a rejection sampler with the right
weights, the output distribution is identical to the target
model's. The check is exact.
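To see why, here is a minimal sketch of the accept/resample step, with p and q as toy token-to-probability dicts for the target and draft. This is the math from the paper, not vLLM's internals:

import random

def accept_or_resample(x, p, q):
    """One speculative-decoding acceptance step (Leviathan et al., 2023).
    x is the draft's proposed token; p and q map token -> probability
    under the target and draft models respectively."""
    # Accept the draft token with probability min(1, p(x)/q(x)).
    if random.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the residual max(p - q, 0),
    # renormalized. This correction is exactly what makes the
    # output distribution match the target model's.
    tokens = list(p)
    weights = [max(p[t] - q[t], 0.0) for t in tokens]
    return random.choices(tokens, weights=weights)[0]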
When the math pays off
Three conditions stack the deck:
- The target model dominates wall time. A 70B target running alongside a 7B draft is the textbook case. The target step is roughly an order of magnitude slower than the draft step, so wasted draft work is cheap.
- The draft and target agree often. Same family helps a lot. Llama 3.1 8B drafting for Llama 3.1 70B is the canonical pairing. Different families work but acceptance rates drop.
- The decode is long enough that draft setup amortizes. Short answers (10-20 tokens) barely benefit. The break-even is usually above 50 output tokens.
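A back-of-envelope model makes the stacking concrete. Assuming, as the Leviathan et al. analysis does, that each drafted token is accepted independently with probability alpha:

def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Sketch of the speedup model from Leviathan et al., 2023.
    alpha: per-token acceptance probability (assumed i.i.d., < 1)
    gamma: tokens the draft proposes per round
    c:     draft step cost as a fraction of a target step
    """
    # Expected tokens emitted per round: the accepted prefix plus
    # the target's own corrective/bonus token.
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost per round in target-step units: gamma draft steps
    # plus the one target pass.
    cost_per_round = gamma * c + 1
    return tokens_per_round / cost_per_round

# A 70B target with an 8B draft (c ~ 0.1), 70% acceptance, 5 drafted
# tokens: expected_speedup(0.7, 5, 0.1) ~ 2.0x.

Push the acceptance rate down or the draft cost up and the curve flattens fast.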
Speedups in self-hosted setups vary widely. Latency wins of
1.5x to 2.5x on chat and coding workloads are common in published
benchmarks, with some narrower benchmarks reporting more. The
Red Hat April 2026 write-up on vLLM with gpt-oss
reports peak throughput improvements around 27% on conversational
loads and 20% on code tasks at high concurrency. Those figures
are from Red Hat's tuned setup; reproducing them requires the
same draft pairing, batch policy, and hardware. Your mileage
will swing on draft choice, prompt distribution, and how you
batch.
The honest version: do not paste any number above into a launch
deck. Run your own benchmark on your own traffic before you
commit. Speedups vary a lot more across workloads than people
admit.
When it does not help
It is worth being specific about the failure modes, because they
catch teams off guard.
- Short outputs. Classification, single-token answers, routing. The draft setup overhead eats the win. You are better off with a smaller target.
- Tiny target models. Speculative decoding on a 7B target is mostly a wash. The target step is already fast. There is no big rock to amortize.
- Low draft-target overlap. A draft from a different family, or one trained on a very different distribution, will get rejected often. Acceptance rates under 40% can make the whole thing slower than plain decoding.
- Memory-bound serving. If your GPUs are pinned at 95% memory and you cannot fit a draft model, you do not have room for this technique.
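For scale: plugging alpha = 0.4 into the sketch above (gamma = 5, c = 0.1) gives about 1.1x before any scheduling or memory overhead, and real serving stacks eat that margin easily.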
A short test on representative prompts tells you which bucket
you are in. Run it before committing.
vLLM: the config
vLLM's current syntax (as of writing — see the vLLM speculative-decoding docs)
puts everything in a single speculative_config dict. The
older --speculative-model flag is deprecated.
Server-side launch:
vllm serve Qwen/Qwen3-8B \
  --speculative-config \
  '{"model": "Qwen/Qwen3-0.6B",
    "num_speculative_tokens": 5,
    "method": "draft_model"}'
Same thing from Python:
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
    speculative_config={
        "model": "Qwen/Qwen3-0.6B",
        "num_speculative_tokens": 5,
        "method": "draft_model",
    },
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Write a Python function that reverses a list."],
    params,
)
print(outputs[0].outputs[0].text)
A few knobs worth knowing before you tune:
- num_speculative_tokens is how many tokens the draft proposes per round. Five is the common starting point. Going to 7 or 10 helps when acceptance is high; it hurts when it is not.
- method can be draft_model, ngram (prompt lookup, no draft model needed), or one of the EAGLE variants. ngram is free and helps a surprising amount on prompts with repeated spans (code, JSON, structured output).
- draft_tensor_parallel_size defaults to the target's TP. If the draft is small enough, drop it to 1 to free a GPU rank.
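The ngram path is the cheapest experiment, since no second model loads. A minimal sketch, with key names as they appear in recent vLLM docs (verify against your version):

from vllm import LLM

# n-gram / prompt-lookup speculation: draft candidates come from
# spans already present in the prompt, so no draft model is loaded.
llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        # Longest prompt n-gram to match when proposing (assumed
        # key name; check your vLLM version's docs).
        "prompt_lookup_max": 4,
    },
)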
llama.cpp: the same idea, fewer flags
If you are running quantized models on a Mac or a single GPU,
llama.cpp's speculative decoding
gives you the same shape with simpler ergonomics:
llama-server \
  --model ./llama-3.1-70b.Q4_K_M.gguf \
  --model-draft ./llama-3.1-8b.Q4_K_M.gguf \
  --spec-draft-n-max 16 \
  --spec-draft-n-min 4
--spec-draft-n-max is the proposal length cap.
--spec-draft-n-min is a floor that prevents the draft from
giving up on weak runs too early. (Older write-ups used
--draft-max/--draft-min; those flags were removed.) The
defaults are reasonable. Two things move the needle: the
draft-target pairing, and your sampling temperature. Lower
temperature makes the target more predictable, and acceptance
climbs.
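Once the server is up, it speaks the OpenAI-compatible API; a quick sanity check with curl, assuming the default port 8080:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Reverse a list in Python."}],
    "temperature": 0.2,
    "max_tokens": 128
  }'

The temperature matters here for the reason above: at 0.2 the target is easier to predict, so more drafted tokens survive.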
A 30-line measurement script
If you want to know what speculative decoding is doing for your
workload, run something like this against a representative
sample:
import time
from statistics import median

from vllm import LLM, SamplingParams

PROMPTS = [
    # Replace with 30-100 prompts from real traffic
    "Explain context propagation in OTel.",
    "Refactor this Python function to use a list comp.",
    # ...
]

def bench(llm, prompts, params):
    # Median end-to-end latency over one-prompt generate calls.
    latencies = []
    for p in prompts:
        t0 = time.perf_counter()
        llm.generate([p], params)
        latencies.append(time.perf_counter() - t0)
    return median(latencies)

params = SamplingParams(temperature=0.2, max_tokens=256)

baseline = LLM(model="Qwen/Qwen3-8B")
print("baseline median:", bench(baseline, PROMPTS, params))

# Two engines may not fit on one GPU at once; if loading the second
# one OOMs, run the baseline and spec halves as separate processes.
spec = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "model": "Qwen/Qwen3-0.6B",
        "num_speculative_tokens": 5,
        "method": "draft_model",
    },
)
print("spec median:", bench(spec, PROMPTS, params))
Two LLM instances back to back, median of 30+ prompts each. If
your speedup is under 1.3x on representative traffic, do not ship
it: the operational cost of a second model is real, and the win
needs to be visible.
If you want to see acceptance rate directly, vLLM exposes
spec-decode metrics on its Prometheus endpoint — check the
metric names against your version; the v1 engine renamed
several of them. The ratio of accepted draft tokens to total
draft tokens is your acceptance rate. As a rule of thumb (not
a hard cutoff), once that ratio drops below roughly 50%, the
technique is fighting you — you are paying for the draft model
without earning enough free tokens.
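A quick way to eyeball the ratio while a benchmark is running, assuming the default port; the metric names in the comment are the ones some builds expose, so grep first and trust what your version actually prints:

curl -s localhost:8000/metrics | grep -i spec_decode
# Acceptance rate = accepted draft tokens / total draft tokens, e.g.
#   vllm:spec_decode_num_accepted_tokens_total
# divided by
#   vllm:spec_decode_num_draft_tokens_total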
What to do on Monday
If you have a target above 30B parameters, a same-family draft
under 8B, and decoding that runs past 50 tokens per response,
spend an afternoon on this. Get a baseline median. Get a spec
median. Look at the acceptance rate. Decide.
If your model is already small, your outputs are short, or your
GPUs are pinned, do not start here. There are bigger wins
elsewhere: better quantization, smaller target, prompt
compaction, KV cache reuse. Speculative decoding is a late
optimization, not an early one. It pays off after the obvious
moves are done.
The papers are honest about the bound: the acceptance rate and
the target-to-draft speed ratio together set your ceiling.
Everything else is overhead. Measure first, claim second.
If this was useful
Latency of self-hosted models is one of those problems that has
a textbook answer (faster forward pass) and a real-world answer
(half a dozen techniques layered together). The
LLM Observability Pocket Guide
is the field guide for the second one: the metrics, traces, and
eval gates that make it safe to ship optimizations like this
without flying blind.
