Josh Green

Posted on May 28 • Edited on Jun 8 • Originally published at joshgreen.hashnode.dev

Why DDR5 Bandwidth Kills Dual-LLM Inference on APUs (Benchmarks Inside)

#ai #llm #minipc #selfhosted

Did you know that a 35-billion-parameter Mixture of Experts model can generate tokens at roughly the same per-token compute cost as a 4B dense model? That single fact made me abandon a multi-model agent architecture I'd spent a weekend building. But I had to run the benchmarks first to understand why.

Here's the full breakdown, with commands, numbers, and the architectural reason it all falls apart on shared-memory hardware.

The Discovery That Changed Everything

I'd been running qwen3.6:35b on my Minisforum UM790Pro for weeks -- it's my daily driver for everything from coding to running GeometryViewer for 3D model previews. It generates 17.8 tokens/second, which is genuinely usable for interactive work. But I kept wondering: could I run a lightweight sidecar model alongside it for quick classification and tool-calling in an agent pipeline?

Before I even started benchmarking, I dug into what qwen3.6:35b actually is under the hood. It's a Mixture of Experts model: 256 total experts with only 8 activated per token. The architecture also incorporates SSM (State Space Model) components alongside traditional attention -- Mamba-style layers that handle certain sequence patterns more efficiently than pure transformers.

The math hit me: 8 out of 256 experts means each token only touches roughly 4-5B parameters worth of compute. The model carries 36 billion parameters of knowledge, but its per-token cost is comparable to a small dense model. I was planning to run a separate 4B model for "fast tasks" next to a model that already operates at 4B-class speed.
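To make that arithmetic concrete, here is a minimal sketch. The 36B total and the 8-of-256 routing come from the model description above; the 0.90 expert share is a hypothetical number picked purely for illustration, not a published figure for this model. It only shows how an estimate in the 4-5B range can fall out once the always-on layers are counted alongside the routed experts.

```shell
# active_params TOTAL_B EXPERT_SHARE ACTIVE_EXPERTS TOTAL_EXPERTS
# Prints the estimated parameters (in billions) touched per token.
# EXPERT_SHARE is the assumed fraction of all weights that live in experts.
active_params() {
  awk -v total="$1" -v share="$2" -v k="$3" -v n="$4" 'BEGIN {
    always_on = total * (1 - share)      # attention, SSM, embeddings, router
    routed    = total * share * k / n    # only k of n experts run per token
    printf "%.1f\n", always_on + routed
  }'
}

# 36B total, 90% of weights in experts (assumption), 8 of 256 experts active
active_params 36 0.90 8 256
```

With those inputs the estimate comes out at 4.6B. A different expert share moves the result, so treat it as a ballpark rather than a spec.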

But I had to prove it with numbers.

Hardware and Ollama Setup

The UM790Pro specs that matter for this experiment:

CPU: AMD Ryzen 9 7940HS (Zen 4, 8C/16T)
iGPU: AMD Radeon 780M (12 RDNA 3 compute units)
RAM: 96 GB DDR5-5600 (~80 GB/s bandwidth)
GPU memory pool: 2 GB dedicated VRAM + 46 GB GTT = 48 GB GPU-accessible

That 48 GB GPU pool sounds enormous until you realize it's carved from the same DDR5 that the CPU also uses. There is no separate GDDR6 bus. Everything -- CPU inference, GPU inference, KV caches, OS operations -- flows through one 80 GB/s pipe.
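For reference, the paper bandwidth of a DDR5 configuration can be worked out from the transfer rate. This sketch assumes two 64-bit channels (8 bytes per transfer each), which is the usual layout for a two-SO-DIMM machine but is my assumption, not something measured here.

```shell
# peak_bandwidth MEGATRANSFERS_PER_S BYTES_PER_TRANSFER CHANNELS
# Prints theoretical peak bandwidth in GB/s.
peak_bandwidth() {
  awk -v mts="$1" -v bytes="$2" -v ch="$3" 'BEGIN {
    printf "%.1f\n", mts * 1e6 * bytes * ch / 1e9
  }'
}

# DDR5-5600, 8 bytes per transfer, 2 channels (assumed dual-channel)
peak_bandwidth 5600 8 2
```

That prints 89.6, so the ~80 GB/s figure used in this post sits a little below the on-paper ceiling, which is what you'd expect from real-world memory efficiency.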

Four models under test, managed through Ollama:

# Pull the models
ollama pull qwen3.6:35b
ollama pull gemma4-e2b-abliterated
ollama pull qwen3:4b-instruct
ollama pull qwen2.5:1.5b

# Check what's loaded and where
ollama ps

ollama ps shows you which models are in memory and whether they're on GPU or CPU. To force CPU-only inference (critical for these tests), you pass num_gpu as a model parameter:

# Force a model onto CPU -- zero GPU layers
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4-e2b-abliterated",
  "prompt": "Explain quicksort in 3 sentences.",
  "options": { "num_gpu": 0 }
}'

Setting num_gpu: 0 tells Ollama to offload zero layers to the GPU, keeping the entire model in system RAM for CPU-only inference. This is how I isolated CPU vs GPU performance and tested mixed configurations.

To verify VRAM allocation, ollama ps gives you the breakdown:

NAME                          SIZE     PROCESSOR    UNTIL
qwen3.6:35b                   32.2 GB  100% GPU     4 minutes from now
gemma4-e2b-abliterated:latest  4.1 GB  100% GPU     4 minutes from now

On a discrete NVIDIA card you'd cross-reference with nvidia-smi. On an AMD APU, GTT usage shows up in ollama ps, in the amdgpu sysfs counters (for example /sys/class/drm/card0/device/mem_info_gtt_used; the card index may differ), or in /sys/kernel/debug/dri/0/amdgpu_gem_info.

The Benchmark Results

Every test used identical prompts fired simultaneously at both models. I measured generation throughput (tokens/second) across solo and dual-model runs.
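One way to get a tokens/second figure out of Ollama is to divide eval_count by eval_duration from a non-streaming /api/generate response. The sketch below is an illustration of that calculation, not necessarily the exact script behind the numbers in this post; tok_per_sec is a hypothetical helper name and the sample JSON is made up.

```shell
# tok_per_sec: read one non-streaming /api/generate response on stdin and
# print generation throughput in tokens/second.
tok_per_sec() {
  json=$(cat)
  # The leading quote keeps "eval_count" from matching "prompt_eval_count".
  count=$(printf '%s' "$json" | grep -oE '"eval_count": *[0-9]+' | grep -oE '[0-9]+$')
  nanos=$(printf '%s' "$json" | grep -oE '"eval_duration": *[0-9]+' | grep -oE '[0-9]+$')
  awk -v c="${count:-0}" -v d="${nanos:-0}" 'BEGIN {
    if (d == 0) { print "n/a"; exit }
    printf "%.1f\n", c / (d / 1e9)   # eval_duration is in nanoseconds
  }'
}

# Made-up sample response, trimmed to the fields that matter here.
printf '%s' '{"prompt_eval_count":12,"eval_count":250,"eval_duration":10000000000}' | tok_per_sec
```

Against a live server, piping curl -s http://localhost:11434/api/generate with "stream": false in the request body into tok_per_sec should give the same kind of number. Launching two such pipelines with & and then calling wait is one simple way to fire both requests at the same moment.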

Solo Baselines

Model	Parameters	GPU (tok/s)	CPU (tok/s)
qwen3.6:35b	36B (MoE)	17.8	--
gemma4-e2b-abliterated	4.6B	42.9	28.7
qwen3:4b-instruct	4B	26.2	19.6
qwen2.5:1.5b	1.5B	--	53.4

Dual-Model Runs

Both on GPU -- qwen3.6:35b + gemma4-e2b:

Model	Solo (tok/s)	Dual (tok/s)	Performance Hit
qwen3.6:35b (GPU)	17.8	13.1	-26%
gemma4-e2b (GPU)	42.9	25.3	-41%

GPU + tiny CPU -- qwen3.6:35b (GPU) + qwen2.5:1.5b (CPU):

Model	Solo (tok/s)	Dual (tok/s)	Performance Hit
qwen3.6:35b (GPU)	17.8	14.9	-16%
qwen2.5:1.5b (CPU)	53.4	26.2	-51%

GPU + medium CPU -- qwen3.6:35b (GPU) + gemma4-e2b (CPU, num_gpu=0):

Model	Solo (tok/s)	Dual (tok/s)	Performance Hit
qwen3.6:35b (GPU)	17.8	13.0	-27%
gemma4-e2b (CPU)	28.7	13.4	-53%

GPU + large-context CPU -- qwen3.6:35b (GPU) + qwen3:4b-instruct (CPU, num_gpu=0):

Model	Solo (tok/s)	Dual (tok/s)	Performance Hit
qwen3.6:35b (GPU)	17.8	11.6	-35%
qwen3:4b-instruct (CPU)	19.6	11.1	-43%

That last combination hit the 35B model hardest. The 4B instruct model supports 256K context, and its KV cache ballooned to 24.2 GB. Combined with the 35B model's 32 GB GPU allocation, that left more than 56 GB of resident data competing for the same memory bus.
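The "Performance Hit" column in the tables above is just the relative change from solo to dual throughput. A small sketch of that calculation, with perf_hit as an illustrative helper name:

```shell
# perf_hit SOLO_TOK_S DUAL_TOK_S
# Prints the relative throughput change as a signed, rounded percentage.
perf_hit() {
  awk -v solo="$1" -v dual="$2" 'BEGIN {
    printf "%+.0f%%\n", (dual - solo) / solo * 100
  }'
}

# Both-on-GPU run from the tables above
perf_hit 17.8 13.1   # qwen3.6:35b
perf_hit 42.9 25.3   # gemma4-e2b
```

Those two calls print -26% and -41%, matching the first dual-model table.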

Why It Happens: One Bus to Rule Them All

On a discrete GPU setup, the CPU reads model weights from DDR5 over its memory controller while the GPU reads from its own GDDR6 over a completely separate bus (often 300+ GB/s). Two independent pipes, no contention.

On an APU, both the Zen 4 CPU cores and the RDNA 3 compute units share a single memory controller connected to the same DDR5 DIMMs. Peak bandwidth is ~80 GB/s (dual-channel DDR5-5600 tops out at 89.6 GB/s on paper), and that bandwidth is divided among every consumer.

DDR5-5600 (96 GB) -- ~80 GB/s shared
       |
  +----+----+
  |         |
CPU cores  780M iGPU
(Zen 4)    (12 CUs)
  |         |
 model      model
weights    weights
  |         |
  +-- SAME MEMORY CONTROLLER --+

LLM inference is almost entirely memory-bound. Each generated token requires streaming the model's weights through the compute units. A 35B MoE model activating 8 experts per token still needs to read those expert weights from memory every single time. When a CPU-side model is doing the same thing simultaneously, the two streams compete for the same bandwidth.
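A back-of-envelope ceiling follows from that: tokens/second cannot exceed bandwidth divided by the bytes streamed per token. The sketch below plugs in numbers from this post (80 GB/s, a 23.9 GB blob for 36B parameters) plus 4.5B active parameters as the midpoint of the 4-5B estimate. It assumes uniform quantization across all weights and ignores KV cache reads, so read it as a rough upper bound only.

```shell
# tok_ceiling BANDWIDTH_GBS ACTIVE_PARAMS_B BLOB_GB TOTAL_PARAMS_B
# Prints the bandwidth-limited upper bound on tokens/second.
tok_ceiling() {
  awk -v bw="$1" -v active="$2" -v blob="$3" -v total="$4" 'BEGIN {
    bytes_per_param = blob / total           # average, assumes uniform quantization
    gb_per_token = active * bytes_per_param  # weights streamed for each token
    printf "%.1f\n", bw / gb_per_token
  }'
}

# 80 GB/s bus, ~4.5B active params (assumed midpoint), 23.9 GB blob, 36B total
tok_ceiling 80 4.5 23.9 36
```

This prints 26.8, so the observed 17.8 tok/s solo figure sits at roughly two-thirds of the estimated ceiling. Any second stream of weight reads on the same bus eats directly into that headroom.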

Even the "best" dual-model result (35B GPU + 1.5B CPU) cost 16% on the big model. The 1.5B model is tiny enough that its memory footprint barely dents bandwidth -- but it still halved its own throughput because the 35B model was dominating the bus.

The Agent Framework Problem

My original goal was a planner-executor agent setup: the 35B model reasons about what to do, a small model handles tool calls. Sounds efficient in theory.

In practice, a planner-executor loop is sequential. The planner generates a plan, then the executor runs a tool, then the planner evaluates the result. At any given moment, only one model is actively generating. The other sits idle in memory, consuming VRAM or RAM that could instead feed the active model a larger context window.

Combined with the MoE insight -- the 35B model already runs at small-model speeds -- the dual-model architecture solves a problem that does not exist on this hardware.

Bonus: Finding Orphan Blobs in Ollama

While investigating model storage during this project, I found 12.9 GB of wasted disk space. Ollama uses content-addressed storage under ~/.ollama/models/, so multiple model tags can reference the same weight blob. But when you delete a model, the blob sometimes lingers.

Here's how to find orphans:

# 1. Collect every blob hash referenced by a manifest
find ~/.ollama/models/manifests -type f \
  -exec grep -oh 'sha256:[a-f0-9]*' {} + | sort -u > /tmp/referenced_blobs.txt

# 2. List every blob on disk (files are named sha256-<hash>; manifests use sha256:<hash>)
ls ~/.ollama/models/blobs/ | sed 's/^sha256-/sha256:/' | sort -u > /tmp/disk_blobs.txt

# 3. Find blobs on disk that no manifest references
comm -13 /tmp/referenced_blobs.txt /tmp/disk_blobs.txt

Any hash that appears in the output of step 3 is an orphan. There's no ollama prune command yet, so you delete them manually. On my system this reclaimed nearly 13 GB from a single forgotten blob.
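Before deleting anything, it helps to see how big each orphan actually is. Here is a minimal sketch that maps the digests from step 3 back to file names and prints their sizes; orphan_sizes is a hypothetical helper, and the demo runs against a throwaway directory rather than the real blob store.

```shell
# orphan_sizes BLOBS_DIR: read digests like sha256:<hex> on stdin and print
# "<bytes><TAB><path>" for each one that exists as a file in BLOBS_DIR.
orphan_sizes() {
  blobs_dir=$1
  while IFS= read -r digest; do
    # On disk the separator is a hyphen: sha256-<hex>
    path="$blobs_dir/$(printf '%s' "$digest" | sed 's/^sha256:/sha256-/')"
    if [ -f "$path" ]; then
      printf '%s\t%s\n' "$(wc -c < "$path" | tr -d ' ')" "$path"
    fi
  done
}

# Demo on a throwaway directory so nothing under ~/.ollama is touched.
demo_dir=$(mktemp -d)
printf 'abcde' > "$demo_dir/sha256-aa11"
printf 'sha256:aa11\nsha256:bb22\n' | orphan_sizes "$demo_dir"
rm -rf "$demo_dir"
```

On a real system, piping the output of step 3 into orphan_sizes ~/.ollama/models/blobs should list each orphan with its size in bytes. Check the list by hand before running rm on anything.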

Also worth knowing: qwen3.6:35b, qwen3.6:latest, and qwen3.6:35b-nothink all resolve to the same 23.9 GB blob. Ollama's content-addressing means you're not actually tripling your disk usage by pulling multiple tags of the same weights.

The Verdict

If you're running local LLMs on a shared-memory APU (any AMD APU, any Intel with Arc iGPU, any machine without a discrete GPU), here's the takeaway:

One model at a time. The memory bus is your bottleneck, and dual-model inference taxes it regardless of CPU/GPU split.
MoE models are your best friend on this hardware. You get large-model reasoning quality at small-model inference cost. No need for a sidecar.
Use your surplus RAM for context, not extra models. A single 35B MoE with a 64K context window is more useful than two models fighting over bandwidth.
Watch for Ollama's iGPU memory reporting bug (#14953) -- loading multiple models can trigger OOM crashes because Ollama misjudges available iGPU memory.
Audit your blob storage. Orphan blobs from deleted models add up fast.
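For the "surplus RAM for context" point, Ollama accepts a num_ctx option per request. The helper below is a rough sketch that builds such a request body; context_request is an illustrative name, and it does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```shell
# context_request MODEL NUM_CTX PROMPT
# Prints a JSON body for POST /api/generate with an explicit context size.
# No JSON escaping is done: keep quotes and backslashes out of PROMPT.
context_request() {
  printf '{"model": "%s", "prompt": "%s", "stream": false, "options": {"num_ctx": %d}}\n' \
    "$1" "$3" "$2"
}

# Single 35B MoE with a 64K context window
context_request qwen3.6:35b 65536 "Explain quicksort in 3 sentences."
```

To send it, pipe the output into curl -s http://localhost:11434/api/generate -d @- on a machine where Ollama is running.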

The UM790Pro with 96 GB of DDR5 is genuinely impressive hardware for local inference. 17.8 tok/s from a 35B-class model on an integrated GPU, in a box the size of a paperback. Just don't try to make it do two things at once.

If you're into 3D printing and web dev like me, check out GeometryViewer -- a browser-based 3D model viewer I built that runs great on this same hardware. And you can find my other projects on GitHub.

Tested on: Minisforum UM790Pro, Ryzen 9 7940HS, 96 GB DDR5-5600, Ollama v0.9.x, Ubuntu Linux.

DEV Community