I Tried Speculative Decoding on RTX 4060 8GB — Every Config Was Slower Than Baseline
All numbers in this article are from my own hardware: Ryzen 7 7845HS / 32GB DDR5 / RTX 4060 Laptop 8GB.
The Sales Pitch
"Use a small model to draft tokens, then a large model to verify. You get the quality of the large model at nearly the speed of the small one."
Speculative decoding's pitch is seductive. Since the 2022 Google Research paper (Leviathan et al.), llama.cpp has supported it: just add a `-md` (model-draft) flag. In theory, it's magic.
My setup:
- Qwen3.5-27B (dense): 3.5 t/s (ngl=24, daily driver)
- Qwen3.5-9B: 33 t/s (ngl=99, fully on GPU)
If I could get 27B quality at anywhere near 33 t/s — incredible. Draft 8 tokens with 9B, verify in batch with 27B. High acceptance rate = multiple tokens confirmed per verification pass. Theoretical speedup: 2-3x.
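Where that "2-3x" intuition comes from can be made concrete. Under the standard analysis (e.g. Leviathan et al.), if each drafted token is accepted independently with probability α and you draft k tokens per round, the expected number of tokens confirmed per verification pass is a geometric sum. A minimal sketch:

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected tokens confirmed per verification pass, assuming a
    constant per-token acceptance rate alpha and draft length k
    (geometric-series result from the speculative decoding analysis)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With k=8 drafted tokens per round:
print(expected_accepted(0.8, 8))  # high acceptance -> ~4.3 tokens/pass
print(expected_accepted(0.3, 8))  # low acceptance  -> ~1.4 tokens/pass
```

At a high acceptance rate you confirm several tokens per (batched) verification pass, which is exactly where the theoretical 2-3x comes from; at a low rate you barely beat one token per pass while still paying for the draft.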
I tested three draft model sizes: 0.8B, 4B, and 9B. Every single one was slower than baseline.
Test Setup
| Component | Spec |
|---|---|
| CPU | Ryzen 7 7845HS (8C/16T) |
| RAM | 32GB DDR5-4800 |
| GPU | RTX 4060 Laptop 8GB GDDR6 (272 GB/s) |
| llama.cpp | b8233 (CUDA 12.4) |
| Main model | Qwen3.5-27B-Q4_K_M (~16.5GB) |
| Draft models | Qwen3.5-0.8B / 4B / 9B-Q4_K_M |
The 8GB Reality: You Can't Fit Both
First wall. Speculative decoding requires both models loaded simultaneously.
VRAM estimate (27B + 9B):

```
9B  (Q4_K_M) ≈  5.5 GB
27B (Q4_K_M) ≈ 16.5 GB
KV cache ×2  ≈  0.5-1 GB
────────────────────────
Total        ≈ 22.5 GB  ← almost 3x the RTX 4060's 8GB
```
Physically impossible. Even running 27B alone, ngl=24 (24 of 48 layers on GPU) is my limit. Two models at once? No way — that was my first reaction.
The Workaround: Split VRAM Allocation
llama.cpp lets you set -ngl (main model GPU layers) and -ngld (draft GPU layers) independently. So you can split 8GB across "part of main + all of draft."
The smaller the draft, the more VRAM the main model gets:
```
Draft 0.8B (0.6GB) → 27B gets 7.4GB → ngl=20 (20/48 layers on GPU)
Draft 4B   (2.5GB) → 27B gets 5.5GB → ngl=13 (13/48 layers on GPU)
Draft 9B   (5.5GB) → 27B gets 2.5GB → ngl=0  (fully on CPU)
```
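A back-of-envelope version of this split, assuming the 27B's ~16.5 GB of weights are spread roughly evenly over its 48 layers (~0.34 GB/layer). The printed `ngl` is an optimistic upper bound: KV caches for both models and CUDA compute buffers also claim VRAM, which is why the usable values above come out lower (down to 0 for the 9B draft):

```python
VRAM_GB = 8.0
MAIN_GB, MAIN_LAYERS = 16.5, 48
per_layer = MAIN_GB / MAIN_LAYERS  # ~0.34 GB per 27B layer

for name, draft_gb in [("0.8B", 0.6), ("4B", 2.5), ("9B", 5.5)]:
    free = VRAM_GB - draft_gb - 0.5  # keep ~0.5 GB headroom for buffers
    ngl_upper = max(0, int(free / per_layer))
    print(f"draft {name}: {free:.1f} GB left for 27B -> ngl <= {ngl_upper}")
```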
```sh
# Example: 0.8B draft config
# -ngl 20       → 27B: 20 layers on GPU
# -ngld 99      → 0.8B draft: all layers on GPU
# --draft-max 8 → up to 8 speculated tokens per pass
llama-cli \
  -m Qwen3.5-27B-Q4_K_M.gguf \
  -md Qwen3.5-0.8B-Q4_K_M.gguf \
  -ngl 20 -ngld 99 \
  --draft-max 8 \
  -c 2048 -t 8 -n 256 --temp 0.0
```

(Note: comments can't sit after a `\` line continuation in a shell command, so they go above the invocation.)
Here's the tradeoff: bigger draft = higher acceptance but fewer GPU layers for main. Smaller draft = more GPU layers but lower acceptance. Where's the sweet spot? I tested all three.
Results: All Three Configs Lost
Test 1: 27B (ngl=20) + 0.8B draft (ngld=99)

```
VRAM: 0.8B ~0.6GB + 27B 20 layers ~6.5GB ≈ 7.1GB
Generation: 3.1 t/s  ← 11% slower than baseline
```

Test 2: 27B (ngl=13) + 4B draft (ngld=99)

```
VRAM: 4B ~2.5GB + 27B 13 layers ~4.5GB ≈ 7.0GB
Generation: 2.8 t/s  ← 20% slower
```

Test 3: 27B (ngl=0) + 9B draft (ngld=99)

```
VRAM: 9B ~5.5GB + 27B 0 layers = 5.5GB
Generation: 2.5 t/s  ← 29% slower
```

Baseline: 27B normal (ngl=24)

```
Generation: 3.5 t/s
```

Reference: 9B standalone (ngl=99)

```
Generation: 33 t/s
```
| Config | Draft VRAM | 27B GPU Layers | Speed | vs Baseline |
|---|---|---|---|---|
| 27B + 0.8B | 0.6GB | 20/48 | 3.1 t/s | -11% |
| 27B + 4B | 2.5GB | 13/48 | 2.8 t/s | -20% |
| 27B + 9B | 5.5GB | 0/48 | 2.5 t/s | -29% |
| 27B normal | — | 24/48 | 3.5 t/s | baseline |
| 9B standalone | — | all | 33 t/s | — |
All three configs lost to baseline. Smaller draft was less bad, but still slower than vanilla 27B.
Why Every Config Fails: The VRAM Tug-of-War
The results form a clean gradient — the bigger the draft, the slower it gets. That pattern is the clue.
Two-Way Trap
Making the draft bigger (9B):
9B should have high acceptance, but it eats 5.5GB of VRAM. 27B falls entirely to CPU (ngl=0). nvidia-smi tells the story:
```
GPU Utilization : 5-10%
VRAM Used       : ~7140 MiB
GPU Temperature : 42°C (basically idle)
```
The GPU is barely working. 27B verification takes seconds on CPU while the GPU just waits.
Making the draft smaller (0.8B):
0.8B only needs 0.6GB, leaving 20 GPU layers for 27B — close to normal operation (ngl=24). But the capability gap between 0.8B and 27B is massive. Acceptance rate drops. If only 2-3 out of 8 speculated tokens get accepted, the verification overhead makes it a net loss.
4B in the middle:
Gets hit by both weaknesses, halfway.
Timeline (9B draft case):

```
GPU: [draft 8tok] [idle........] [draft 8tok] [idle........]
CPU: [idle      ] [verify 8tok ] [idle      ] [verify 8tok ]
                  ↑ this is the bottleneck
```
On 8GB VRAM, you lose either way. Big draft starves GPU layers. Small draft starves acceptance. The vanilla 27B config at ngl=24 — where VRAM is dedicated to a single model — can't be beaten by any split configuration.
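This timeline can be put into a toy serial model: each round spends k·t_draft drafting on the GPU plus one batched verification pass on the CPU-bound 27B, and only the accepted tokens count. The timings below are illustrative guesses, not measurements:

```python
def spec_tps(t_draft: float, t_verify: float, k: int, accepted: float) -> float:
    """Effective tokens/s for one draft-then-verify round: k tokens drafted
    at t_draft seconds each, one batched verify pass of t_verify seconds,
    `accepted` tokens kept on average."""
    return accepted / (k * t_draft + t_verify)

# Illustrative guess for the 9B-draft config: drafting at ~33 t/s
# (0.03 s/token) on GPU, a batched CPU verify pass of ~2 s, and a
# generous 5 accepted tokens per round:
print(f"{spec_tps(0.03, 2.0, 8, 5):.1f} t/s")
```

Even with generous assumptions, the CPU verify pass dominates the denominator, landing in the low-2 t/s range, which is roughly where the measured 2.5 t/s sits.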
DRAM Bandwidth Is the Real Enemy
One level deeper. When 27B runs on CPU, the bottleneck isn't compute — it's memory bandwidth.
```
DDR5-4800 (dual channel):
  Theoretical bandwidth: 76.8 GB/s

Qwen3.5-27B-Q4_K_M:
  Model size:     ~16.5 GB
  Read per token: ~16.5 GB
  → Theoretical max: 76.8 / 16.5 ≈ 4.7 t/s

RTX 4060 GDDR6:
  Theoretical bandwidth: 272 GB/s
  → Theoretical max: 272 / 16.5 ≈ 16.5 t/s (3.5x faster)
```
Moving model layers from GPU (272 GB/s) to CPU (76.8 GB/s) cuts bandwidth by 3.5x. The speed gains from speculative parallelism get completely erased by CPU memory bandwidth constraints.
The measured 2.5 t/s falling short of the 4.7 t/s theoretical makes sense — KV cache I/O and OS memory management overhead eat the rest. Hitting 53% of theoretical is actually reasonable.
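The same arithmetic as a tiny roofline helper: dense decoding streams every weight once per generated token, so tokens/s is capped at bandwidth divided by model size:

```python
def roofline_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on dense decode speed: each token reads all weights once."""
    return bandwidth_gb_s / model_gb

print(f"CPU (DDR5-4800): {roofline_tps(76.8, 16.5):.1f} t/s")   # ~4.7
print(f"GPU (RTX 4060):  {roofline_tps(272.0, 16.5):.1f} t/s")  # ~16.5
```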
What This Algorithm Was Actually Designed For
The physics is clear. Step back and think about the design assumptions behind speculative decoding.
The original paper (Leviathan et al., 2022) assumes an environment where both models coexist in the same high-speed memory, with draft generation and verification alternating rapidly. Everything stays in GPU VRAM. The draft→verify round trip is fast.
On 8GB, that assumption collapses. The moment you split VRAM between two models, part of the main model falls to CPU, and the verification step becomes throttled by slow memory. The speculative parallelism gains vanish the instant you cross memory hierarchy boundaries. It's not slow because the bandwidth is low — it's slow because the design premise of "fast round-trips within a single memory space" doesn't hold.
When It Actually Works
- Condition 1: Both models fit in VRAM simultaneously
- Condition 2: Draft speed >> main speed (5x minimum)
- Condition 3: Acceptance rate is high enough (70%+ for same-family pairs)
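Conditions 2 and 3 can be folded into one break-even check: per round, speculation must confirm at least as many tokens as the main model alone would have produced in the same wall time. A sketch with illustrative timings (not measurements):

```python
def break_even_accepted(k: int, t_draft: float, t_verify: float,
                        t_main: float) -> float:
    """Minimum tokens a draft-then-verify round must confirm to tie the
    baseline (main model generating one token every t_main seconds)."""
    return (k * t_draft + t_verify) / t_main

# Intended regime: both models on GPU, verify is one cheap batched pass
print(break_even_accepted(8, 0.005, 0.05, 0.04))  # 2.25 tokens/round to tie
# This article's 9B-draft config: verify falls to CPU (~2 s/pass),
# baseline is 3.5 t/s (0.286 s/token)
print(break_even_accepted(8, 0.03, 2.0, 0.286))   # ~7.8 of 8 tokens needed
```

In the all-GPU regime a modest acceptance rate clears the bar; in the CPU-verify regime you would need nearly every drafted token accepted just to tie the baseline, which no 0.8B-vs-27B pairing will deliver.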
On RTX 4060 8GB, trying to meet Condition 1:
| Pair | 27B GPU Layers | Draft VRAM | Measured | Verdict |
|---|---|---|---|---|
| 27B + 0.8B | 20/48 | 0.6 GB | 3.1 t/s | Slower than baseline |
| 27B + 4B | 13/48 | 2.5 GB | 2.8 t/s | Same |
| 27B + 9B | 0/48 | 5.5 GB | 2.5 t/s | Worst |
| 7B + 1.5B (hypothetical) | all | ~1.0 GB | untested | Works if 7B quality is acceptable |
At 8GB, "no draft size beats baseline 27B" is confirmed across three data points. 7B + 1.5B would both fit on GPU, but if you're okay with 7B quality, you might as well run 9B standalone at 33 t/s.
The Real Target Hardware
Speculative decoding shines on:
- A100/H100 80GB: 70B + 7B coexist comfortably
- RTX 4090 24GB: 27B + 7B or 14B + 1.5B are realistic
- M4 Max 128GB Unified Memory: large model pairs possible on unified architecture
Consumer 8GB GPUs were never the intended target for this technique.
The Answer for 8GB Was MoE All Along
Pulling in data from my previous benchmark:
| Model | Architecture | Speed | VRAM | Quality |
|---|---|---|---|---|
| Qwen3.5-9B | dense, ngl=99 | 33 t/s | 7.1 GB | Great |
| Qwen3.5-35B-A3B | MoE, partial | 8.6 t/s | ~7 GB | Excellent |
| Qwen3.5-27B | dense, ngl=24 | 3.5 t/s | ~6.5 GB | Excellent |
| 27B + 0.8B spec | split | 3.1 t/s | ~7.1 GB | Excellent |
| 27B + 4B spec | split | 2.8 t/s | ~7.0 GB | Excellent |
| 27B + 9B spec | split | 2.5 t/s | ~7.1 GB | Excellent |
To get 27B-class quality faster:
- Speculative (best case, 0.8B draft): 3.1 t/s — slower than vanilla 27B
- MoE (35B-A3B): 8.6 t/s — same or better quality, 3x faster
On 8GB VRAM, MoE (Mixture-of-Experts) is fundamentally more efficient than speculative decoding. MoE is designed from the ground up with "large total parameters, small active parameters during inference." It solves the same problem speculative decoding tries to solve — "big model quality at small model speed" — but at the architecture level instead of the inference level.
```
Speculative decoding:
  Inference = 2 models loaded simultaneously → VRAM pressure
  Gain      = speculative parallelism (meaningless on CPU)

MoE:
  Inference = 1 model, only active parameters computed
  Gain      = architectural efficiency
```
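The difference shows up directly in bytes read per token: under the same bandwidth roofline, what matters is active parameters, not total. The ~0.61 bytes/parameter below is derived from this article's own numbers (16.5 GB for 27B at Q4_K_M); treating "A3B" as ~3B active parameters, and ignoring that different tokens touch different experts, are assumptions:

```python
def read_gb_per_token(active_params_b: float,
                      bytes_per_param: float = 0.61) -> float:
    """GB streamed per generated token; 0.61 B/param ≈ 16.5 GB / 27B (Q4_K_M).
    For MoE, only the active experts' weights are read (simplification)."""
    return active_params_b * bytes_per_param

print(f"dense 27B:      {read_gb_per_token(27.0):.1f} GB/token")  # ~16.5
print(f"MoE, 3B active: {read_gb_per_token(3.0):.1f} GB/token")   # ~1.8
```

Roughly 9x less memory traffic per token is why the 35B-A3B runs at 8.6 t/s on the same machine where dense 27B manages 3.5.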
When Speculative Decoding Makes a Comeback
This article concludes "speculative decoding doesn't work on 8GB VRAM." The technique itself isn't dead, though.
- More VRAM: If RTX 5060 ships with 12-16GB, a 27B + 3B pair becomes realistic
- Tinier draft models: I tested down to 0.8B, still -11%. Ultra-light models like SmolLM2-135M would barely touch VRAM, but acceptance rate drops further. Cross-architecture speculative support is still limited
- Hardware specialization: NPUs or AMD XDNA-style inference engines could handle draft generation on a separate memory path
Option 2 — tiny same-family draft models — is the most realistic near-term improvement, especially as 0.5B models in major architectures become more common. But ultra-small drafts have inherently lower acceptance, so validating that tradeoff is the next step.
No Magic Beats Physics
Speculative decoding on RTX 4060 8GB with Qwen3.5-27B, 3 configs:

```
27B + 0.8B draft: 3.1 t/s (-11%)
27B + 4B draft:   2.8 t/s (-20%)
27B + 9B draft:   2.5 t/s (-29%)
27B normal:       3.5 t/s (baseline)
```

Splitting 8GB of VRAM between two models can't beat a single model using all of it. MoE (35B-A3B: 8.6 t/s) is the answer.
Three draft sizes, both ends of the tradeoff explored, clean sweep loss. This isn't "I haven't found the right config yet" — speculative decoding structurally cannot work under the physical constraint of 8GB VRAM.
Two days of testing for a total wipeout stings. For now, MoE 35B-A3B at 8.6 t/s is the pragmatic choice. But something nags at me — Qwen3.5's thinking mode. The acceptance rate for <think>...</think> tokens during speculation might behave differently from regular text generation. A high-performance model without the thinking overhead — Mistral or Llama family — could change the equation. That's going on the list for the next experiment.
References
- 3 Qwen3.5 Models Head-to-Head on RTX 4060 8GB — What Spec Sheets Won't Tell You — 9B / 27B / 35B-A3B benchmark comparison
- Qwen2.5-32B Running on RTX 4060 8GB — Full Optimization Guide That Beat M4 — ngl optimization fundamentals
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2022) — original paper
- llama.cpp speculative decoding docs — `-md`, `-ngld`, `--draft-max` parameter reference