I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel.
I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.
The setup
My home lab runs Kubernetes on a machine called Shadowstack. Two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp.
For this test I deployed two models:
- Gemma 4 26B-A4B: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.
- Qwen3-32B: A dense 32B model. All parameters active per token. Runs at 20 tok/s.
Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.
Quick note on why the MoE model is so much faster: Gemma 4 only activates a fraction of its parameters per token, so there's way less weight data to read from VRAM on each forward pass. MoE routing overhead eats into some of that advantage, but it's still a huge win on bandwidth-constrained hardware.
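To put rough numbers on that, here's a back-of-envelope ceiling. All inputs are assumptions, not measurements: ~448 GB/s spec-sheet bandwidth per 5060 Ti (treating the two-GPU layer split as sequential reads at one card's bandwidth) and ~0.56 bytes per weight for Q4_K_M:

```python
# Back-of-envelope: bandwidth-bound decode speed ceiling.
# Assumed numbers (not measured):
#   - RTX 5060 Ti 16GB: ~448 GB/s VRAM bandwidth per card
#   - Q4_K_M: ~4.5 bits/weight effective, i.e. ~0.5625 bytes/weight

def decode_ceiling_toks(active_params_b: float, bw_gbs: float,
                        bytes_per_weight: float = 0.5625) -> float:
    """Upper bound on tok/s if every token must stream all active
    weights from VRAM exactly once (ignores KV cache reads,
    activations, routing, and kernel overheads)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bw_gbs * 1e9 / bytes_per_token

moe = decode_ceiling_toks(4, 448)    # MoE: ~4B active params/token
dense = decode_ceiling_toks(32, 448) # dense: all 32B params/token
print(f"MoE ceiling ~{moe:.0f} tok/s, dense ceiling ~{dense:.0f} tok/s")
# → MoE ceiling ~199 tok/s, dense ceiling ~25 tok/s
```

The observed 88 and 20 tok/s sit below those ceilings, as expected once routing, KV cache reads, and kernel overheads are counted, but the 8x gap in active parameters is clearly the dominant term.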
What I tested
llama.cpp has built-in n-gram speculative decoding. No draft model needed, you just pass a few flags:
```
--spec-type ngram-mod
--draft-max 64
--draft-min 48
--spec-ngram-size-n 24
--spec-ngram-size-m 48
```
How it works: llama.cpp builds an n-gram lookup table from the recent context (both the input prompt and generated output so far). When it spots a pattern it's seen before, it speculatively drafts the next several tokens and verifies them in a single forward pass. If the predictions are right, you get multiple tokens for the cost of one.
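A toy version of that lookup, just to make the mechanism concrete (illustration only, not llama.cpp's actual implementation):

```python
# Toy n-gram drafter: map each n-gram seen in context to the tokens
# that followed it, then draft by repeated lookup on a match.
from collections import defaultdict

def build_table(tokens, n):
    """n-gram -> list of tokens that followed each occurrence."""
    table = defaultdict(list)
    for i in range(len(tokens) - n):
        table[tuple(tokens[i:i + n])].append(tokens[i + n])
    return table

def draft(tokens, n, max_draft):
    """Speculate up to max_draft tokens by repeatedly looking up the
    last n tokens; the real decoder then verifies the whole draft in
    one forward pass and keeps only the accepted prefix."""
    table = build_table(tokens, n)
    ctx = list(tokens)
    out = []
    for _ in range(max_draft):
        continuations = table.get(tuple(ctx[-n:]))
        if not continuations:
            break
        out.append(continuations[-1])  # most recent continuation
        ctx.append(out[-1])
    return out

print(draft(list("abcabcabc"), 2, 4))  # → ['a', 'b', 'c', 'a']
print(draft(list("abcdefgh"), 2, 4))   # → [] (no repeats, nothing to draft)
```

On a repetitive token stream the table drafts long runs for free; on a stream with no repeated n-grams it drafts nothing at all, which matters for the results later in this post.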
Important: this is specifically n-gram speculative decoding, not draft-model approaches like EAGLE-3 or Medusa. Those use a separate trained model to generate speculations. N-gram lookup is simpler and doesn't require any extra model files.
With LLMKube, switching between configs is just updating the extraArgs field in the InferenceService CRD and letting the operator restart the pod:
```yaml
spec:
  modelRef: gemma4-26b-a4b
  extraArgs:
    - "--spec-type"
    - "ngram-mod"
    - "--draft-max"
    - "64"
```
I tested two variants: ngram-simple (basic lookup) and ngram-mod (the variant recommended for MoE models in the llama.cpp docs).
The result that fooled me
My first test ran the same prompt 10 times in a row. The numbers looked incredible:
| Run | tok/s |
|---|---|
| 1 (cold) | 88.3 |
| 2 | 105.7 |
| 3 | 112.4 |
| 5 | 186.4 |
| 8 | 336.5 |
| 10 | 419.5 |
Almost 5x speedup by run 10. I was ready to write a very different article.
Then I ran 8 different prompts. Code generation, API design, Go functions, bash scripts, technical explanations. Real variety.
| Prompt | Baseline (tok/s) | + ngram-mod (tok/s) |
|---|---|---|
| BST implementation | 88.3 | 94.2 |
| K8s operator explanation | 88.3 | 88.3 |
| GPU monitoring script | 88.3 | 87.6 |
| REST API design | 88.3 | 88.2 |
| GGUF parser in Go | 88.3 | 88.2 |
| Parallelism explainer | 88.3 | 88.1 |
| Benchmark script | 88.2 | 88.2 |
| Helm chart design | 88.1 | 88.2 |
| Median | 88.3 | 88.2 |
Effectively zero improvement at the median; the BST prompt's +6.7% was the lone outlier. The 419 tok/s "speedup" was the n-gram cache memorizing repeated output patterns. With diverse prompts, there's nothing useful to cache.
Same story on the dense model
Qwen3-32B showed the same pattern. 20.4 tok/s baseline, 20.6 tok/s with ngram-simple. Within measurement noise.
| Model | Type | Baseline | + ngram-simple | + ngram-mod |
|---|---|---|---|---|
| Gemma 4 26B | MoE | 88.3 | 87.2 (-1.2%) | 88.2 (0%) |
| Qwen3-32B | Dense | 20.4 | 20.6 (+1%) | not tested |
Why it doesn't help on these GPUs
The bottleneck on RTX 5060 Ti is memory bandwidth, not compute. Every token requires reading model weights from VRAM. Speculative decoding tries to batch multiple verification steps together, but when you're already saturating the memory bus during single-token generation, there's not enough idle compute for the speculative verification to pay for itself.
This is different from high-end datacenter GPUs (A100, H100), where the compute-to-memory-bandwidth ratio is much higher. An H100 SXM has roughly 3,350 GB/s of memory bandwidth and nearly 2,000 TFLOPS of FP16 compute (with sparsity; roughly half that dense). That ratio means there's genuine idle compute at small batch sizes for speculative verification to exploit. Consumer GPUs don't have the same headroom.
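The gap is easy to see as arithmetic intensity headroom, FLOPs available per byte of VRAM bandwidth. The spec numbers below are rounded assumptions from public data sheets (dense FP16, no sparsity), not measurements:

```python
# Rough arithmetic-intensity comparison: FLOPs of compute available
# per byte of VRAM bandwidth. Assumed spec-sheet numbers, rounded:
#   - RTX 5060 Ti: ~24 TFLOPS dense FP16, ~448 GB/s
#   - H100 SXM:    ~990 TFLOPS dense FP16, ~3,350 GB/s

def flops_per_byte(tflops: float, bw_gbs: float) -> float:
    return (tflops * 1e12) / (bw_gbs * 1e9)

consumer = flops_per_byte(24, 448)
datacenter = flops_per_byte(990, 3350)
print(f"5060 Ti ~{consumer:.0f} FLOPs/byte, H100 ~{datacenter:.0f} FLOPs/byte")
```

By this crude measure the H100 has over 5x more compute sitting idle per byte streamed, which is the slack that speculative verification batches are designed to fill.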
For MoE models specifically, there's an additional wrinkle. Each speculative token in a verification batch may activate different experts, which means more expert weight blocks need to be read. This reduces the batching advantage that speculative decoding relies on in dense models, where weight reads stay roughly constant regardless of batch size.
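A crude model of that effect, using hypothetical routing numbers (64 experts, 8 active per token — an illustration, not Gemma's actual config) and an unrealistically simple assumption of independent uniform routing:

```python
# Expected number of distinct experts touched by a verification
# batch of b drafted tokens, with k-of-E routing per token.
# Assumes independent uniform routing across tokens (unrealistic,
# but enough to show the trend).

def expected_experts(E: int, k: int, b: int) -> float:
    # P(a given expert is untouched by one token) = 1 - k/E
    return E * (1 - (1 - k / E) ** b)

for b in (1, 4, 8):
    print(b, round(expected_experts(64, 8, b), 1))
# → 1 8.0
#   4 26.5
#   8 42.0
```

Under these toy numbers, an 8-token verification batch touches ~42 experts versus 8 for a single token, so weight reads grow roughly 5x instead of staying flat the way they do for a dense model.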
Caveat: there are scenarios where n-gram spec decoding can help even on consumer hardware. If your model is partially CPU-offloaded (doesn't fit in VRAM), the PCIe bandwidth bottleneck is severe enough that speculative batching can provide real gains. And for highly repetitive or templated outputs (think structured JSON, boilerplate code), the n-gram cache hit rate goes way up. My testing focused on single-user inference with fully VRAM-resident models and diverse prompts.
What about EAGLE-3?
I originally wanted to test EAGLE-3, which uses a trained draft head instead of n-gram lookup. Three problems:
- No EAGLE-3 draft model exists for Gemma 4 (no one has trained one)
- The llama.cpp EAGLE-3 PR (#18039) is still open and in draft as of April 5, 2026
- The PR's own benchmarks show MoE models getting roughly 0.89-1.06x on certain prompts, with some actually slower due to the expert activation overhead during batch verification
Even with a trained draft head, the fundamental bandwidth constraint on consumer GPUs would remain.
What actually helps on consumer GPUs
If you're running local LLMs on consumer hardware, here's what actually moves the needle:
- Flash attention: Already standard, significant memory savings
- KV cache quantization: q4_0 or q8_0 reduces cache memory pressure without meaningful quality loss
- MoE over dense: Gemma 4 activates ~4B parameters per token vs Qwen3-32B's 32B. That's the primary driver of the throughput difference, though MoE routing overhead means the speedup isn't a clean 8x ratio.
- Multi-GPU split: Doubles your available memory bandwidth, which is the actual bottleneck
- Context size tuning: Smaller context = less KV cache = more VRAM headroom
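KV cache quantization from that list is also just an extraArgs change in LLMKube. A sketch using llama.cpp's `--cache-type-k`/`--cache-type-v` flags (the quantized V cache requires flash attention, enabled here); q8_0 is my suggestion, not something benchmarked in this post:

```yaml
spec:
  modelRef: gemma4-26b-a4b
  flashAttention: true
  extraArgs:
    - "--cache-type-k"
    - "q8_0"
    - "--cache-type-v"
    - "q8_0"
```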
The benchmarking lesson
The biggest takeaway wasn't about speculative decoding. It was about benchmark methodology.
If I'd only tested with repeated prompts, I would have reported a 4.75x speedup and been completely wrong. The n-gram cache is doing something real, but only in a narrow scenario where outputs are highly repetitive or templated. For interactive chat, coding assistance, or any workload with diverse inputs, it provides no benefit on this hardware.
Be skeptical of speculative decoding benchmarks that don't disclose their prompt diversity. And if you see someone reporting huge n-gram gains, check if they're running the same prompt over and over.
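If I were folding that lesson into a benchmark harness, the aggregation step would look something like this (a hypothetical helper, not part of llama.cpp or LLMKube): take the median per prompt, then the median across distinct prompts, and refuse to aggregate a single repeated prompt at all.

```python
# Methodology guard: aggregate throughput only across distinct
# prompts, so a warm n-gram cache on one repeated prompt can't
# inflate the headline number.
from statistics import median

def summarize(runs: list[tuple[str, float]]) -> float:
    """runs = [(prompt, tok_per_s), ...] -> median tok/s across
    distinct prompts, using each prompt's own median first."""
    prompts = {p for p, _ in runs}
    if len(prompts) < 2:
        raise ValueError("need diverse prompts: n-gram caching "
                         "inflates repeated-prompt numbers")
    per_prompt = [median(t for q, t in runs if q == p) for p in prompts]
    return median(per_prompt)
```

With my diverse-prompt data this reports ~88 tok/s; fed ten runs of one prompt, it raises instead of reporting 419.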
Try it yourself
Everything I tested runs on Kubernetes via LLMKube. The InferenceService CRD's extraArgs field makes it trivial to swap between configs without touching your deployment:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: gemma4-spec-bench
spec:
  modelRef: gemma4-26b-a4b
  image: ghcr.io/ggml-org/llama.cpp:server-cuda
  contextSize: 8192
  flashAttention: true
  extraArgs:
    - "--spec-type"
    - "ngram-mod"
    - "--draft-max"
    - "64"
  resources:
    gpu: 2
```
LLMKube is open source, Apache 2.0: github.com/defilantech/llmkube