plasmon

I Tried Speculative Decoding on RTX 4060 8GB — Every Config Was Slower Than Baseline

All numbers in this article are from my own hardware: Ryzen 7 7845HS / 32GB DDR5 / RTX 4060 Laptop 8GB.


The Sales Pitch

"Use a small model to draft tokens, then a large model to verify. You get the quality of the large model at nearly the speed of the small one."

Speculative decoding's pitch is seductive. Since the 2022 Google Research paper (Leviathan et al.), llama.cpp has supported it: just add the -md (--model-draft) flag. In theory, it's magic.

My setup:

  • Qwen3.5-27B (dense): 3.5 t/s (ngl=24, daily driver)
  • Qwen3.5-9B: 33 t/s (ngl=99, fully on GPU)

If I could get 27B quality at anywhere near 33 t/s — incredible. Draft 8 tokens with 9B, verify in batch with 27B. High acceptance rate = multiple tokens confirmed per verification pass. Theoretical speedup: 2-3x.
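That 2-3x figure can be sanity-checked with the closed-form expectation from the speculative decoding paper: if each drafted token is accepted independently with probability α, the expected number of tokens produced per verification pass has a simple formula. A minimal sketch, with α = 0.75 as an assumed (optimistic) acceptance rate:

```python
# Expected tokens produced per verification pass (Leviathan et al.),
# assuming each drafted token is accepted i.i.d. with probability alpha.
def expected_tokens(alpha: float, k: int) -> float:
    # 1 + alpha + alpha^2 + ... + alpha^k; the "+1" term is the bonus
    # token the main model emits even when it rejects the whole draft.
    return (1 - alpha ** (k + 1)) / (1 - alpha) if alpha < 1 else k + 1

# k=8 drafted tokens at 75% acceptance:
print(round(expected_tokens(0.75, 8), 2))  # prints 3.7 tokens per verify pass
```

Roughly 3.7 tokens per 27B forward pass is where the "2-3x" headline number comes from, once you subtract the drafting cost.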

I tested three draft model sizes: 0.8B, 4B, and 9B. Every single one was slower than baseline.


Test Setup

| Component | Spec |
|---|---|
| CPU | Ryzen 7 7845HS (8C/16T) |
| RAM | 32GB DDR5-4800 |
| GPU | RTX 4060 Laptop 8GB GDDR6 (272 GB/s) |
| llama.cpp | b8233 (CUDA 12.4) |
| Main model | Qwen3.5-27B-Q4_K_M (~16.5GB) |
| Draft models | Qwen3.5-0.8B / 4B / 9B-Q4_K_M |

The 8GB Reality: You Can't Fit Both

First wall. Speculative decoding requires both models loaded simultaneously.

VRAM estimate (27B + 9B):
  9B (Q4_K_M)  ≈ 5.5 GB
  27B (Q4_K_M) ≈ 16.5 GB
  KV cache ×2  ≈ 0.5-1 GB
  ─────────────────────────
  Total        ≈ 22.5 GB   ← 3x the RTX 4060's 8GB

Physically impossible. Even running 27B alone, ngl=24 (24 of 48 layers on GPU) is my limit. Two models at once? No way — that was my first reaction.

The Workaround: Split VRAM Allocation

llama.cpp lets you set -ngl (main model GPU layers) and -ngld (draft GPU layers) independently. So you can split 8GB across "part of main + all of draft."

The smaller the draft, the more VRAM the main model gets:

Draft 0.8B (0.6GB) → 27B gets 7.4GB → ngl=20 (20/48 layers on GPU)
Draft 4B   (2.5GB) → 27B gets 5.5GB → ngl=13 (13/48 layers on GPU)
Draft 9B   (5.5GB) → 27B gets 2.5GB → ngl=0  (fully on CPU)
# Example: 0.8B draft config
#   -m / -ngl 20:   27B main model, 20 layers on GPU
#   -md / -ngld 99: 0.8B draft model, all layers on GPU
#   --draft-max 8:  up to 8 speculated tokens per round
# (comments can't follow "\" on the same line; they'd break the continuation)
llama-cli \
  -m  Qwen3.5-27B-Q4_K_M.gguf \
  -md Qwen3.5-0.8B-Q4_K_M.gguf \
  -ngl 20 \
  -ngld 99 \
  --draft-max 8 \
  -c 2048 -t 8 -n 256 --temp 0.0

Here's the tradeoff: bigger draft = higher acceptance but fewer GPU layers for main. Smaller draft = more GPU layers but lower acceptance. Where's the sweet spot? I tested all three.
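The layer counts above follow from simple budget arithmetic. A sketch, assuming uniform per-layer size and ~0.5 GB of KV-cache headroom (my actual ngl settings were tuned by trial and error, so they differ by a layer or two from what this prints):

```python
# Rough VRAM split: how many 27B layers fit after reserving the draft
# model and some KV headroom. Estimates, not llama.cpp's allocator math.
TOTAL_VRAM_GB = 8.0
MAIN_SIZE_GB, MAIN_LAYERS = 16.5, 48
PER_LAYER_GB = MAIN_SIZE_GB / MAIN_LAYERS   # ~0.34 GB per layer
KV_HEADROOM_GB = 0.5

def main_layers_on_gpu(draft_gb: float) -> int:
    budget = TOTAL_VRAM_GB - draft_gb - KV_HEADROOM_GB
    return max(0, min(MAIN_LAYERS, int(budget / PER_LAYER_GB)))

for name, draft_gb in [("0.8B", 0.6), ("4B", 2.5), ("9B", 5.5)]:
    # Prints 20 / 14 / 5; my real configs (20 / 13 / 0) were a bit more
    # conservative because the 9B draft's own KV cache needs room too.
    print(name, main_layers_on_gpu(draft_gb))
```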


Results: All Three Configs Lost

Test 1: 27B (ngl=20) + 0.8B draft (ngld=99)
  VRAM: 0.8B ~0.6GB + 27B 20 layers ~6.5GB ≈ 7.1GB
  Generation:   3.1 t/s    ← 11% slower than baseline

Test 2: 27B (ngl=13) + 4B draft (ngld=99)
  VRAM: 4B ~2.5GB + 27B 13 layers ~4.5GB ≈ 7.0GB
  Generation:   2.8 t/s    ← 20% slower

Test 3: 27B (ngl=0) + 9B draft (ngld=99)
  VRAM: 9B ~5.5GB + 27B 0 layers = 5.5GB
  Generation:   2.5 t/s    ← 29% slower

Baseline: 27B normal (ngl=24)
  Generation:   3.5 t/s

Reference: 9B standalone (ngl=99)
  Generation:  33 t/s
| Config | Draft VRAM | 27B GPU Layers | Speed | vs Baseline |
|---|---|---|---|---|
| 27B + 0.8B | 0.6GB | 20/48 | 3.1 t/s | -11% |
| 27B + 4B | 2.5GB | 13/48 | 2.8 t/s | -20% |
| 27B + 9B | 5.5GB | 0/48 | 2.5 t/s | -29% |
| 27B normal | | 24/48 | 3.5 t/s | baseline |
| 9B standalone | | all | 33 t/s | |

All three configs lost to baseline. Smaller draft was less bad, but still slower than vanilla 27B.


Why Every Config Fails: The VRAM Tug-of-War

The results form a clean gradient — the bigger the draft, the slower it gets. That pattern is the clue.

Two-Way Trap

Making the draft bigger (9B):
9B should have high acceptance, but it eats 5.5GB of VRAM. 27B falls entirely to CPU (ngl=0). nvidia-smi tells the story:

GPU Utilization:  5-10%
VRAM Used:        ~7140 MiB
GPU Temperature:  42°C (basically idle)

The GPU is barely working. 27B verification takes seconds on CPU while the GPU just waits.

Making the draft smaller (0.8B):
0.8B only needs 0.6GB, leaving 20 GPU layers for 27B — close to normal operation (ngl=24). But the capability gap between 0.8B and 27B is massive. Acceptance rate drops. If only 2-3 out of 8 speculated tokens get accepted, the verification overhead makes it a net loss.

4B in the middle:
Gets hit by both weaknesses, halfway.
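The acceptance side of the trap can be put in numbers with a toy cost model: assume the batched verify pass costs about one main-model forward (reasonable when memory-bound) and each draft forward costs a fraction c of that. The α = 0.30 and c = 0.1 values below are illustrative assumptions, not measurements:

```python
# Net speedup of one speculation round, with time normalized to a single
# main-model forward pass. c = draft forward cost relative to main.
def speedup(alpha: float, k: int, c: float) -> float:
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)  # accepted + bonus token
    return expected / (1 + k * c)                    # tokens per unit time

# Assumed: the 0.8B draft is ~10x cheaper than the 27B verify pass.
print(round(speedup(0.30, 8, 0.1), 2))  # prints 0.79: a net LOSS
print(round(speedup(0.75, 8, 0.1), 2))  # prints 2.06: roughly 2x
```

So even with a nearly free draft model, acceptance below ~40% at k=8 turns speculation into a loss before any VRAM penalty is counted.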

Timeline → (9B draft case)
GPU:  [draft 8tok] [idle........] [draft 8tok] [idle........]
CPU:  [idle]       [verify 8tok ] [idle]       [verify 8tok ]
                   ↑ this is the bottleneck

On 8GB VRAM, you lose either way. Big draft starves GPU layers. Small draft starves acceptance. The vanilla 27B config at ngl=24 — where VRAM is dedicated to a single model — can't be beaten by any split configuration.

DRAM Bandwidth Is the Real Enemy

One level deeper. When 27B runs on CPU, the bottleneck isn't compute — it's memory bandwidth.

DDR5-4800 (dual channel):
  Theoretical bandwidth: 76.8 GB/s

Qwen3.5-27B-Q4_K_M:
  Model size: ~16.5 GB
  Read per token: ~16.5 GB
  → Theoretical max: 76.8 / 16.5 ≈ 4.7 t/s

RTX 4060 GDDR6:
  Theoretical bandwidth: 272 GB/s
  → Theoretical max: 272 / 16.5 ≈ 16.5 t/s (3.5x faster)

Moving model layers from GPU (272 GB/s) to CPU (76.8 GB/s) cuts bandwidth by 3.5x. The speed gains from speculative parallelism get completely erased by CPU memory bandwidth constraints.

The measured 2.5 t/s falling short of the 4.7 t/s theoretical makes sense — KV cache I/O and OS memory management overhead eat the rest. Hitting 53% of theoretical is actually reasonable.
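The same arithmetic extends to a split model: every generated token must stream each weight once, from whichever memory pool holds it, and the slow pool dominates. A back-of-envelope ceiling calculator (it ignores KV-cache traffic and overhead, which is why real numbers land well below these):

```python
# Upper-bound tokens/s for a dense model split across GPU and CPU memory.
GPU_BW, CPU_BW = 272.0, 76.8          # GB/s: GDDR6 vs dual-channel DDR5-4800
MODEL_GB, LAYERS = 16.5, 48

def max_tps(gpu_layers: int) -> float:
    gpu_gb = MODEL_GB * gpu_layers / LAYERS
    cpu_gb = MODEL_GB - gpu_gb
    seconds_per_token = gpu_gb / GPU_BW + cpu_gb / CPU_BW
    return 1 / seconds_per_token

print(round(max_tps(0), 1))    # prints 4.7  (all CPU: the 9B-draft case)
print(round(max_tps(24), 1))   # prints 7.3  (ngl=24: the baseline config)
print(round(max_tps(48), 1))   # prints 16.5 (fully on GPU, if it fit)
```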


What This Algorithm Was Actually Designed For

The physics is clear. Step back and think about the design assumptions behind speculative decoding.

The original paper (Leviathan et al., 2022) assumes an environment where both models coexist in the same high-speed memory, with draft generation and verification alternating rapidly. Everything stays in GPU VRAM. The draft→verify round trip is fast.

On 8GB, that assumption collapses. The moment you split VRAM between two models, part of the main model falls to CPU, and the verification step becomes throttled by slow memory. The speculative parallelism gains vanish the instant you cross memory hierarchy boundaries. It's not slow because the bandwidth is low — it's slow because the design premise of "fast round-trips within a single memory space" doesn't hold.

When It Actually Works

Condition 1: Both models fit in VRAM simultaneously
Condition 2: Draft speed >> main speed (5x minimum)
Condition 3: Acceptance rate is high enough (70%+ for same-family pairs)
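Those three conditions fold into a trivial pre-flight check. The 1 GB KV headroom and the 5x / 70% thresholds are the rules of thumb above, not hard limits:

```python
# Quick sanity check of the three conditions before trying a model pair.
def speculative_viable(vram_gb: float, main_gb: float, draft_gb: float,
                       draft_tps: float, main_tps: float,
                       acceptance: float) -> bool:
    fits = main_gb + draft_gb + 1.0 <= vram_gb  # +1 GB KV headroom (guess)
    fast_enough = draft_tps >= 5 * main_tps     # draft must be >> main
    accepts = acceptance >= 0.70                # same-family rule of thumb
    return fits and fast_enough and accepts

# My 8GB card with 27B + 9B fails on VRAM alone:
print(speculative_viable(8, 16.5, 5.5, 33, 3.5, 0.8))  # prints False
```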

On RTX 4060 8GB, trying to meet Condition 1:

| Pair | 27B GPU Layers | Draft VRAM | Measured | Verdict |
|---|---|---|---|---|
| 27B + 0.8B | 20/48 | 0.6 GB | 3.1 t/s | Slower than baseline |
| 27B + 4B | 13/48 | 2.5 GB | 2.8 t/s | Same |
| 27B + 9B | 0/48 | 5.5 GB | 2.5 t/s | Worst |
| 7B + 1.5B (hypothetical) | all | ~1.0 GB | untested | Works if 7B quality is acceptable |

At 8GB, "no draft size beats baseline 27B" is confirmed across three data points. 7B + 1.5B would both fit on GPU, but if you're okay with 7B quality, you might as well run 9B standalone at 33 t/s.

The Real Target Hardware

Speculative decoding shines on:

  • A100/H100 80GB: 70B + 7B coexist comfortably
  • RTX 4090 24GB: 27B + 7B or 14B + 1.5B are realistic
  • M4 Max 128GB Unified Memory: large model pairs possible on unified architecture

Consumer 8GB GPUs were never the intended target for this technique.


The Answer for 8GB Was MoE All Along

Pulling in data from my previous benchmark:

| Model | Architecture | Speed | VRAM | Quality |
|---|---|---|---|---|
| Qwen3.5-9B | dense, ngl=99 | 33 t/s | 7.1 GB | Great |
| Qwen3.5-35B-A3B | MoE, partial offload | 8.6 t/s | ~7 GB | Excellent |
| Qwen3.5-27B | dense, ngl=24 | 3.5 t/s | ~6.5 GB | Excellent |
| 27B + 0.8B | speculative split | 3.1 t/s | ~7.1 GB | Excellent |
| 27B + 4B | speculative split | 2.8 t/s | ~7.0 GB | Excellent |
| 27B + 9B | speculative split | 2.5 t/s | ~7.1 GB | Excellent |

To get 27B-class quality faster:

  • Speculative (best case, 0.8B draft): 3.1 t/s, still slower than vanilla 27B
  • MoE (35B-A3B): 8.6 t/s, same or better quality, roughly 2.5x faster than vanilla 27B and ~3x faster than any speculative config

On 8GB VRAM, MoE (Mixture-of-Experts) is fundamentally more efficient than speculative decoding. MoE is designed from the ground up with "large total parameters, small active parameters during inference." It solves the same problem speculative decoding tries to solve — "big model quality at small model speed" — but at the architecture level instead of the inference level.

Speculative Decoding:
  Inference = 2 models loaded simultaneously → VRAM pressure
  Gain = speculative parallelism (meaningless on CPU)

MoE:
  Inference = 1 model, only active parameters computed
  Gain = architectural efficiency
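The bandwidth framing makes MoE's advantage concrete: only the active experts' weights stream per token. The ~2 GB active-weight figure below is my rough Q4 estimate for ~3B active parameters, not an exact file size:

```python
# CPU-side throughput ceiling: bandwidth / bytes streamed per token.
CPU_BW = 76.8                     # GB/s, dual-channel DDR5-4800

def cpu_ceiling_tps(read_gb_per_token: float) -> float:
    return CPU_BW / read_gb_per_token

print(round(cpu_ceiling_tps(16.5), 1))  # dense 27B: prints 4.7
print(round(cpu_ceiling_tps(2.0), 1))   # MoE ~3B active: prints 38.4
```

The MoE ceiling is far above my measured 8.6 t/s; expert routing, partial offload, and KV traffic eat the difference, but the headroom is why MoE still wins by 3x in practice.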

When Speculative Decoding Makes a Comeback

This article concludes "speculative decoding doesn't work on 8GB VRAM." The technique itself isn't dead, though.

  1. More VRAM: If RTX 5060 ships with 12-16GB, a 27B + 3B pair becomes realistic
  2. Tinier draft models: I tested down to 0.8B, still -11%. Ultra-light models like SmolLM2-135M would barely touch VRAM, but acceptance rate drops further. Cross-architecture speculative support is still limited
  3. Hardware specialization: NPUs or AMD XDNA-style inference engines could handle draft generation on a separate memory path

Option 2 — tiny same-family draft models — is the most realistic near-term improvement, especially as 0.5B models in major architectures become more common. But ultra-small drafts have inherently lower acceptance, so validating that tradeoff is the next step.


No Magic Beats Physics

Speculative decoding on RTX 4060 8GB with Qwen3.5-27B, 3 configs:

  27B + 0.8B draft:  3.1 t/s (-11%)
  27B + 4B draft:    2.8 t/s (-20%)
  27B + 9B draft:    2.5 t/s (-29%)
  27B normal:        3.5 t/s (baseline)

Splitting 8GB VRAM between two models can't beat
a single model using all the VRAM for itself.
MoE (35B-A3B: 8.6 t/s) is the answer.

Three draft sizes, both ends of the tradeoff explored, clean sweep loss. This isn't "I haven't found the right config yet" — speculative decoding structurally cannot work under the physical constraint of 8GB VRAM.

Two days of testing for a total wipeout stings. For now, MoE 35B-A3B at 8.6 t/s is the pragmatic choice. But something nags at me — Qwen3.5's thinking mode. The acceptance rate for <think>...</think> tokens during speculation might behave differently from regular text generation. A high-performance model without the thinking overhead — Mistral or Llama family — could change the equation. That's going on the list for the next experiment.

