The fix was swapping a 4B draft model for a 0.6B one in my speculative decoding config. That's the whole punchline. But the path there touched every assumption I had about how spec decode interacts with VRAM budgets on consumer hardware, so here's the full story.
TL;DR
| Change | Result |
|---|---|
| 4B draft → 0.6B draft | ~2 GiB saved, same MoE throughput |
| Embedding parallelism 16 → 8 | ~8 GiB freed |
| Combined | Dropped from ~97 GiB to ~87.7 GiB, no more OOM |
Spec decode isn't free. You're paying VRAM for both models simultaneously.
The Setup
I run a local LLM inference gateway on two AMD-based mini PCs — GMKTec EVO-X2 boxes with Strix Halo APUs and 160 GB of unified memory each. The gateway serves around 20 models through llama-swap, a process manager that loads and evicts models on demand behind an OpenAI-compatible API. Think of it as a poor man's model router: one port per logical model, llama-swap starts the right llama.cpp process on request, and idle models get evicted when memory gets tight.
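For context on the moving parts: llama-swap is driven by a single YAML file that maps logical model names to the commands that serve them. Here's a minimal sketch of one entry; the paths, model name, and ttl value are placeholders rather than my actual config:

```yaml
# Hypothetical llama-swap entry: a request naming "qwen3-32b" makes
# llama-swap launch this llama-server process (substituting ${PORT}),
# proxy the request to it, and evict it after `ttl` idle seconds.
models:
  "qwen3-32b":
    cmd: |
      /opt/llama.cpp/llama-server
      --port ${PORT}
      -m /models/Qwen3-32B-Q4_K_M.gguf
      -ngl 99
    ttl: 300
```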
Speculative Decoding (Quick Context)
Speculative decoding pairs a large target model with a smaller draft model. The draft proposes tokens cheaply; the target verifies them in a single forward pass. When the draft is right — and for well-matched model families, it often is — you get roughly 1.5–2× throughput. The important detail that bites people: both models are resident in memory at the same time.
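In llama.cpp terms, the draft is just a second GGUF handed to the same llama-server process via `--model-draft` (`-md`), which is exactly why the memory bill covers both. A sketch of a spec-decode entry pairing a target with the Qwen3-4B draft; paths are placeholders, and the speculative-decoding flag names have shifted a bit across llama.cpp versions:

```yaml
# Sketch of a speculative-decoding entry: the target (-m) and the draft
# (-md) are both loaded by one llama-server process, so VRAM has to hold
# the two of them at once.
models:
  "qwen3.5-122b-a10b":
    cmd: |
      /opt/llama.cpp/llama-server
      --port ${PORT}
      -m /models/qwen3.5-122b-a10b-Q4_K_M.gguf
      -md /models/Qwen3-4B-Q4_K_M.gguf
      --draft-max 16
      -ngl 99
```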
The Bad Assumption
I was running a blanket policy: every Qwen3-family model gets the Qwen3-4B draft. Four billion parameters felt like the safe middle ground — big enough to draft well, small enough to fit. Or so I thought.
The Crash
The problem surfaced when I tried to load qwen3.5-122b-a10b (roughly 71 GiB at Q4_K_M) alongside my always-resident embedding model. On paper, the embedding model was supposed to run around 16 GiB. In practice:
```
embed:              ~23.8 GiB
122B + 4B draft:    ~73.6 GiB
─────────────────────────────
total:              ~97.6 GiB
available:          ~96.0 GiB
```
Intermittent OOM crashes followed.
The Diagnosis
Pulling real numbers from rocm-smi told a different story than my estimates. The embedding model was actually consuming 23.8 GiB, not 16. The culprit was KV cache pre-allocation: with parallelism set to 16 and context at 8,192 tokens, the runtime was pre-allocating 16 full-context-length KV cache slots simultaneously, and that adds up fast.
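If you want to pull the same numbers yourself, `rocm-smi --showmeminfo vram` (and `gtt`, which matters on unified-memory APUs) reports the allocated pools. The slot math also checks out on the back of an envelope, working only from the figures above and the ~8 GiB that halving the slot count later freed (rounded; the exact per-token cost depends on layer count, KV-head count, and cache precision):

```
~8 GiB freed / 8 slots removed   ≈  1 GiB of KV cache per slot
 1 GiB / 8,192 tokens            ≈  128 KiB of KV per token
16 slots × ~1 GiB                ≈  16 GiB of pre-allocated KV
23.8 GiB measured − ~16 GiB KV   ≈  8 GiB left for the weights themselves
```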
Two Knobs, Both Pulled
At that point I had two levers: reduce embedding parallelism, or shrink the draft model. I did both.
Dropping embedding parallelism from 16 to 8 freed roughly 8 GiB while keeping context length at 8,192 tokens, which still comfortably covers my p99 usage around 2,532 tokens. On the draft side, the key insight was that not every model needs the same draft. A 0.6B draft — about 0.4 GiB — performs nearly as well as the 4B for MoE architectures, where sparse activation already limits how much a larger draft model can contribute. Total consumption dropped from roughly 97 GiB to around 87.7 GiB. Stable, no crashes.
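Concretely, the two changes amount to a few lines of config. A sketch of the after state, again with placeholder paths and recent-llama.cpp flag spellings:

```yaml
# After the fix. Embedding server: 8 pre-allocated KV slots of 8,192 tokens
# each instead of 16 (llama-server splits --ctx-size across --parallel slots).
# MoE target: drafts with the 0.6B model (~0.4 GiB) instead of the 4B.
models:
  "qwen3-embed":
    cmd: |
      /opt/llama.cpp/llama-server
      --port ${PORT}
      -m /models/qwen3-embedding.gguf
      --embeddings
      -c 65536
      -np 8
  "qwen3.5-122b-a10b":
    cmd: |
      /opt/llama.cpp/llama-server
      --port ${PORT}
      -m /models/qwen3.5-122b-a10b-Q4_K_M.gguf
      -md /models/Qwen3-0.6B.gguf
      --draft-max 16
      -ngl 99
```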
What I Learned
- Measure actual VRAM usage, not estimated usage. They are not the same number.
- Draft model sizing should follow model architecture, not a one-size-fits-all policy.
- KV cache pre-allocation scales with parallelism — and it will surprise you.
- Spec decode costs memory. Budget for two models, not one.
- Working inside tight constraints forces you to understand your system at a level that comfortable headroom never would.

