I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel.
I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.
The setup
My home lab runs Kubernetes on a machine called Shadowstack. Two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp.
For this test I deployed two models:
- Gemma 4 26B-A4B: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.
- Qwen3-32B: A dense 32B model. All parameters active per token. Runs at 20 tok/s.
Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.
Quick note on why the MoE model is so much faster: Gemma 4 only activates a fraction of its parameters per token, so there's way less weight data to read from VRAM on each forward pass. MoE routing overhead eats into some of that advantage, but it's still a huge win on bandwidth-constrained hardware.
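To put rough numbers on that, here's a back-of-envelope ceiling. All inputs are assumptions, not measurements: ~448 GB/s spec-sheet bandwidth per 5060 Ti (treating the two-GPU layer split as sequential reads at one card's bandwidth) and ~0.56 bytes per weight for Q4_K_M:

```python
# Back-of-envelope: bandwidth-bound decode speed ceiling.
# Assumed numbers (not measured):
#   - RTX 5060 Ti 16GB: ~448 GB/s VRAM bandwidth per card
#   - Q4_K_M: ~4.5 bits/weight effective, i.e. ~0.5625 bytes/weight

def decode_ceiling_toks(active_params_b: float, bw_gbs: float,
                        bytes_per_weight: float = 0.5625) -> float:
    """Upper bound on tok/s if every token must stream all active
    weights from VRAM exactly once (ignores KV cache reads,
    activations, routing, and kernel overheads)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bw_gbs * 1e9 / bytes_per_token

moe = decode_ceiling_toks(4, 448)    # MoE: ~4B active params/token
dense = decode_ceiling_toks(32, 448) # dense: all 32B params/token
print(f"MoE ceiling ~{moe:.0f} tok/s, dense ceiling ~{dense:.0f} tok/s")
# → MoE ceiling ~199 tok/s, dense ceiling ~25 tok/s
```

The observed 88 and 20 tok/s sit below those ceilings, as expected once routing, KV cache reads, and kernel overheads are counted, but the 8x gap in active parameters is clearly the dominant term.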
What I tested
llama.cpp has built-in n-gram speculative decoding. No draft model needed, you just pass a few flags:
```
--spec-type ngram-mod
--draft-max 64
--draft-min 48
--spec-ngram-size-n 24
--spec-ngram-size-m 48
```
How it works: llama.cpp builds an n-gram lookup table from the recent context (both the input prompt and generated output so far). When it spots a pattern it's seen before, it speculatively drafts the next several tokens and verifies them in a single forward pass. If the predictions are right, you get multiple tokens for the cost of one.
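A toy version of that lookup, just to make the mechanism concrete (illustration only, not llama.cpp's actual implementation):

```python
# Toy n-gram drafter: map each n-gram seen in context to the tokens
# that followed it, then draft by repeated lookup on a match.
from collections import defaultdict

def build_table(tokens, n):
    """n-gram -> list of tokens that followed each occurrence."""
    table = defaultdict(list)
    for i in range(len(tokens) - n):
        table[tuple(tokens[i:i + n])].append(tokens[i + n])
    return table

def draft(tokens, n, max_draft):
    """Speculate up to max_draft tokens by repeatedly looking up the
    last n tokens; the real decoder then verifies the whole draft in
    one forward pass and keeps only the accepted prefix."""
    table = build_table(tokens, n)
    ctx = list(tokens)
    out = []
    for _ in range(max_draft):
        continuations = table.get(tuple(ctx[-n:]))
        if not continuations:
            break
        out.append(continuations[-1])  # most recent continuation
        ctx.append(out[-1])
    return out

print(draft(list("abcabcabc"), 2, 4))  # → ['a', 'b', 'c', 'a']
print(draft(list("abcdefgh"), 2, 4))   # → [] (no repeats, nothing to draft)
```

On a repetitive token stream the table drafts long runs for free; on a stream with no repeated n-grams it drafts nothing at all, which matters for the results later in this post.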
Important: this is specifically n-gram speculative decoding, not draft-model approaches like EAGLE-3 or Medusa. Those use a separate trained model to generate speculations. N-gram lookup is simpler and doesn't require any extra model files.
With LLMKube, switching between configs is just updating the extraArgs field in the InferenceService CRD and letting the operator restart the pod:
```yaml
spec:
  modelRef: gemma4-26b-a4b
  extraArgs:
    - "--spec-type"
    - "ngram-mod"
    - "--draft-max"
    - "64"
```
I tested two variants: ngram-simple (basic lookup) and ngram-mod (the variant recommended for MoE models in the llama.cpp docs).
The result that fooled me
My first test ran the same prompt 10 times in a row. The numbers looked incredible:
| Run | tok/s |
|---|---|
| 1 (cold) | 88.3 |
| 2 | 105.7 |
| 3 | 112.4 |
| 5 | 186.4 |
| 8 | 336.5 |
| 10 | 419.5 |
Almost 5x speedup by run 10. I was ready to write a very different article.
Then I ran 8 different prompts. Code generation, API design, Go functions, bash scripts, technical explanations. Real variety.
| Prompt | Baseline (tok/s) | + ngram-mod (tok/s) |
|---|---|---|
| BST implementation | 88.3 | 94.2 |
| K8s operator explanation | 88.3 | 88.3 |
| GPU monitoring script | 88.3 | 87.6 |
| REST API design | 88.3 | 88.2 |
| GGUF parser in Go | 88.3 | 88.2 |
| Parallelism explainer | 88.3 | 88.1 |
| Benchmark script | 88.2 | 88.2 |
| Helm chart design | 88.1 | 88.2 |
| Median | 88.3 | 88.2 |
Effectively zero improvement at the median; the BST prompt's +6.7% was the lone outlier. The 419 tok/s "speedup" was the n-gram cache memorizing repeated output patterns. With diverse prompts, there's nothing useful to cache.
Same story on the dense model
Qwen3-32B showed the same pattern. 20.4 tok/s baseline, 20.6 tok/s with ngram-simple. Within measurement noise.
| Model | Type | Baseline | + ngram-simple | + ngram-mod |
|---|---|---|---|---|
| Gemma 4 26B | MoE | 88.3 | 87.2 (-1.2%) | 88.2 (0%) |
| Qwen3-32B | Dense | 20.4 | 20.6 (+1%) | not tested |
Why it doesn't help on these GPUs
The bottleneck on RTX 5060 Ti is memory bandwidth, not compute. Every token requires reading model weights from VRAM. Speculative decoding tries to batch multiple verification steps together, but when you're already saturating the memory bus during single-token generation, there's not enough idle compute for the speculative verification to pay for itself.
This is different from high-end datacenter GPUs (A100, H100), where the compute-to-memory-bandwidth ratio is much higher. An H100 SXM has roughly 3,350 GB/s of memory bandwidth and nearly 2,000 TFLOPS of FP16 compute (with sparsity; roughly half that dense). That ratio means there's genuine idle compute at small batch sizes for speculative verification to exploit. Consumer GPUs don't have the same headroom.
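The gap is easy to see as arithmetic intensity headroom, FLOPs available per byte of VRAM bandwidth. The spec numbers below are rounded assumptions from public data sheets (dense FP16, no sparsity), not measurements:

```python
# Rough arithmetic-intensity comparison: FLOPs of compute available
# per byte of VRAM bandwidth. Assumed spec-sheet numbers, rounded:
#   - RTX 5060 Ti: ~24 TFLOPS dense FP16, ~448 GB/s
#   - H100 SXM:    ~990 TFLOPS dense FP16, ~3,350 GB/s

def flops_per_byte(tflops: float, bw_gbs: float) -> float:
    return (tflops * 1e12) / (bw_gbs * 1e9)

consumer = flops_per_byte(24, 448)
datacenter = flops_per_byte(990, 3350)
print(f"5060 Ti ~{consumer:.0f} FLOPs/byte, H100 ~{datacenter:.0f} FLOPs/byte")
```

By this crude measure the H100 has over 5x more compute sitting idle per byte streamed, which is the slack that speculative verification batches are designed to fill.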
For MoE models specifically, there's an additional wrinkle. Each speculative token in a verification batch may activate different experts, which means more expert weight blocks need to be read. This reduces the batching advantage that speculative decoding relies on in dense models, where weight reads stay roughly constant regardless of batch size.
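A crude model of that effect, using hypothetical routing numbers (64 experts, 8 active per token — an illustration, not Gemma's actual config) and an unrealistically simple assumption of independent uniform routing:

```python
# Expected number of distinct experts touched by a verification
# batch of b drafted tokens, with k-of-E routing per token.
# Assumes independent uniform routing across tokens (unrealistic,
# but enough to show the trend).

def expected_experts(E: int, k: int, b: int) -> float:
    # P(a given expert is untouched by one token) = 1 - k/E
    return E * (1 - (1 - k / E) ** b)

for b in (1, 4, 8):
    print(b, round(expected_experts(64, 8, b), 1))
# → 1 8.0
#   4 26.5
#   8 42.0
```

Under these toy numbers, an 8-token verification batch touches ~42 experts versus 8 for a single token, so weight reads grow roughly 5x instead of staying flat the way they do for a dense model.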
Caveat: there are scenarios where n-gram spec decoding can help even on consumer hardware. If your model is partially CPU-offloaded (doesn't fit in VRAM), the PCIe bandwidth bottleneck is severe enough that speculative batching can provide real gains. And for highly repetitive or templated outputs (think structured JSON, boilerplate code), the n-gram cache hit rate goes way up. My testing focused on single-user inference with fully VRAM-resident models and diverse prompts.
What about EAGLE-3?
I originally wanted to test EAGLE-3, which uses a trained draft head instead of n-gram lookup. Three problems:
- No EAGLE-3 draft model exists for Gemma 4 (no one has trained one)
- The llama.cpp EAGLE-3 PR (#18039) is still open and in draft as of April 5, 2026
- The PR's own benchmarks show MoE models getting roughly 0.89-1.06x on certain prompts, with some actually slower due to the expert activation overhead during batch verification
Even with a trained draft head, the fundamental bandwidth constraint on consumer GPUs would remain.
What actually helps on consumer GPUs
If you're running local LLMs on consumer hardware, here's what actually moves the needle:
- Flash attention: Already standard, significant memory savings
- KV cache quantization: q4_0 or q8_0 reduces cache memory pressure without meaningful quality loss
- MoE over dense: Gemma 4 activates ~4B parameters per token vs Qwen3-32B's 32B. That's the primary driver of the throughput difference, though MoE routing overhead means the speedup isn't a clean 8x ratio.
- Multi-GPU split: Doubles your available memory bandwidth, which is the actual bottleneck
- Context size tuning: Smaller context = less KV cache = more VRAM headroom
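KV cache quantization from that list is also just an extraArgs change in LLMKube. A sketch using llama.cpp's `--cache-type-k`/`--cache-type-v` flags (the quantized V cache requires flash attention, enabled here); q8_0 is my suggestion, not something benchmarked in this post:

```yaml
spec:
  modelRef: gemma4-26b-a4b
  flashAttention: true
  extraArgs:
    - "--cache-type-k"
    - "q8_0"
    - "--cache-type-v"
    - "q8_0"
```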
The benchmarking lesson
The biggest takeaway wasn't about speculative decoding. It was about benchmark methodology.
If I'd only tested with repeated prompts, I would have reported a 4.75x speedup and been completely wrong. The n-gram cache is doing something real, but only in a narrow scenario where outputs are highly repetitive or templated. For interactive chat, coding assistance, or any workload with diverse inputs, it provides no benefit on this hardware.
Be skeptical of speculative decoding benchmarks that don't disclose their prompt diversity. And if you see someone reporting huge n-gram gains, check if they're running the same prompt over and over.
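If I were folding that lesson into a benchmark harness, the aggregation step would look something like this (a hypothetical helper, not part of llama.cpp or LLMKube): take the median per prompt, then the median across distinct prompts, and refuse to aggregate a single repeated prompt at all.

```python
# Methodology guard: aggregate throughput only across distinct
# prompts, so a warm n-gram cache on one repeated prompt can't
# inflate the headline number.
from statistics import median

def summarize(runs: list[tuple[str, float]]) -> float:
    """runs = [(prompt, tok_per_s), ...] -> median tok/s across
    distinct prompts, using each prompt's own median first."""
    prompts = {p for p, _ in runs}
    if len(prompts) < 2:
        raise ValueError("need diverse prompts: n-gram caching "
                         "inflates repeated-prompt numbers")
    per_prompt = [median(t for q, t in runs if q == p) for p in prompts]
    return median(per_prompt)
```

With my diverse-prompt data this reports ~88 tok/s; fed ten runs of one prompt, it raises instead of reporting 419.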
Try it yourself
Everything I tested runs on Kubernetes via LLMKube. The InferenceService CRD's extraArgs field makes it trivial to swap between configs without touching your deployment:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: gemma4-spec-bench
spec:
  modelRef: gemma4-26b-a4b
  image: ghcr.io/ggml-org/llama.cpp:server-cuda
  contextSize: 8192
  flashAttention: true
  extraArgs:
    - "--spec-type"
    - "ngram-mod"
    - "--draft-max"
    - "64"
  resources:
    gpu: 2
```
LLMKube is open source, Apache 2.0: github.com/defilantech/llmkube