byeongsoo kang

Posted on Jun 3 • Originally published at bric.pe.kr

Running a 35B MoE (Qwen3.6-35B-A3B) on 2x GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

#llm #machinelearning #gpu #ollama

TL;DR (Quick Answer)

I actually ran Qwen3.6-35B-A3B — a 35B-parameter mixture-of-experts model (only 3B active per token) — on a pair of 8-year-old GTX 1080 Ti cards (22 GB combined). Real, measured numbers:

Generation speed: ~20 tokens/sec on 2× 1080 Ti (IQ4_XS quant), stable across runs (19.4 / 21.4 / 20.0).
Single GPU: ~16.8 tok/s. CPU-only (i9-14900K): ~17.1 tok/s. The second 1080 Ti buys only ~20% over one card — and, the kicker, the GPUs barely beat a modern CPU here (~+18%), because the MoE experts stay mmap'd in CPU RAM regardless. See the honest update below.
It only "fits" because of the MoE + CPU-mmap trick. ~13 GB of the model sits on the two GPUs; ~18 GB of expert weights are mmap'd from CPU RAM, and only the active 3B runs each token.
Quant matters for 22 GB: the default qwen3.6:35b-a3b tag is 23.9 GB and spills to CPU. You want ≤ IQ4_XS (~17.7 GB) to keep it (mostly) on the GPUs.

Bottom line: a 35B MoE is genuinely usable on this box in 2026 — but the honest workhorse turned out to be the i9-14900K CPU; the used 1080 Ti cards add only ~18%. Pick a sparse MoE and a quant that mostly fits — and know that for an offload-heavy MoE, a fast CPU + RAM bandwidth matters as much as the GPUs.

The setup (and one gotcha)

GPUs: 2× NVIDIA GeForce GTX 1080 Ti (11 GB each, 22 GB total), Pascal, compute capability 6.1.
Driver: 581.57 (Windows host, used via WSL2 passthrough). This matters — recent Ollama bundles CUDA 13, which refuses drivers older than 570. On the older 560 driver it silently fell back to CPU (total_vram=0). Updating to 581 fixed it.
Ollama: v0.30.2. Interesting detail: its cuda_v13 build skips Pascal ("compute capability not in compiled architectures", cc 6.1), so it auto-falls back to the bundled cuda_v12 build to use the 1080 Ti. Good to know if you're on old hardware.
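
If you want to catch the silent CPU fallback before you benchmark, a small driver check helps. This is a minimal sketch, not something Ollama ships: the helper names are hypothetical, the 570 floor is the figure quoted above, and it assumes nvidia-smi is on your PATH.

```python
import subprocess

# Driver floor quoted in this post for Ollama builds that bundle CUDA 13.
MIN_DRIVER_MAJOR = 570


def driver_major(version: str) -> int:
    """Return the major part of a driver string such as '581.57'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version meets the minimum."""
    return driver_major(version) >= minimum


def installed_driver_versions() -> list[str]:
    """Query nvidia-smi for one driver version per visible GPU.

    Requires nvidia-smi on PATH; raises if it is missing or fails.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Calling all(driver_is_new_enough(v) for v in installed_driver_versions()) before starting the server gives a quick yes/no. It does not tell you which CUDA build Ollama picked; for that you still need the server log.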

Why a "35B" model runs on old cards at all

Qwen3.6-35B-A3B is a mixture-of-experts (MoE): 35B total parameters, but only ~3B are active for any given token. So the compute per token is small (3B-class), even though all the experts must be available in memory.

That's the whole reason this works on Pascal: the GTX 1080 Ti has no tensor cores and modest FP16, so a dense 35B would crawl. A sparse 3B-active MoE keeps the per-token math light, and the bottleneck shifts to where the weights live — which is exactly what the dual-GPU question is about.
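
To put a rough number on "where the weights live", here is a back-of-the-envelope sketch. It takes the IQ4_XS file size (~17.7 GB) and the ~20.3 tok/s dual-GPU result from this post, and it assumes the quantized bits are spread evenly over all 35B parameters and that only the ~3B active ones are read per token. Neither assumption is exact, so treat the result as an order of magnitude only.

```python
def active_gb_per_token(file_gb: float, total_params_b: float, active_params_b: float) -> float:
    """Approximate GB of weights read per generated token.

    Assumes bits are spread evenly over all parameters, which real
    quants only approximate (some tensors are kept at higher precision).
    """
    return file_gb * active_params_b / total_params_b


def weight_read_gb_per_s(gb_per_token: float, tokens_per_s: float) -> float:
    """Sustained weight traffic implied by a given generation speed."""
    return gb_per_token * tokens_per_s


per_token = active_gb_per_token(file_gb=17.7, total_params_b=35, active_params_b=3)
traffic = weight_read_gb_per_s(per_token, tokens_per_s=20.3)
print(f"~{per_token:.2f} GB of weights per token, ~{traffic:.0f} GB/s at 20.3 tok/s")
```

Under those assumptions each token touches roughly 1.5 GB of weights, which works out to about 31 GB/s of sustained reads at the measured speed. The estimate ignores the KV cache and any always-active shared layers.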

Quant fit on 22 GB

You can't just ollama pull qwen3.6:35b-a3b — that default is 23.9 GB and won't sit on 22 GB of VRAM. Measured GGUF sizes:

Quant	Size	Fits 22 GB?
Q3_K_M	~16.6–17.1 GB	✅ comfortable
IQ4_XS	~17.7 GB	✅ best quality that fits
Q4_K_S	~21 GB	⚠️ too tight (spills with KV cache)
Q4_K_M / default	23.9 GB+	❌ offloads to CPU

I used IQ4_XS.
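
The table's logic can be sketched as a simple size check. The 2 GB headroom for KV cache and CUDA buffers is my assumption, not a measurement, and "largest file that fits" is only a crude proxy for "best quality". It also says nothing about how Ollama actually splits layers: the Results section reports only ~13 GB resident on the GPUs even at IQ4_XS.

```python
from typing import Optional

VRAM_GB = 22.0       # 2x GTX 1080 Ti, 11 GB each
HEADROOM_GB = 2.0    # assumed allowance for KV cache and CUDA buffers, not measured

# Upper-end file sizes from the table above, in GB.
QUANT_SIZES_GB = {
    "Q3_K_M": 17.1,
    "IQ4_XS": 17.7,
    "Q4_K_S": 21.0,
    "Q4_K_M": 23.9,
}


def fits(size_gb: float, vram_gb: float = VRAM_GB, headroom_gb: float = HEADROOM_GB) -> bool:
    """True when the file plus the assumed headroom is within total VRAM."""
    return size_gb + headroom_gb <= vram_gb


def largest_fitting(sizes: dict) -> Optional[str]:
    """Name of the largest quant that passes fits(), or None if none do."""
    candidates = {name: gb for name, gb in sizes.items() if fits(gb)}
    return max(candidates, key=candidates.get) if candidates else None


for name, gb in QUANT_SIZES_GB.items():
    print(f"{name}: {gb} GB -> {'fits' if fits(gb) else 'spills'}")
```

With these inputs largest_fitting(QUANT_SIZES_GB) picks IQ4_XS, matching the choice above. Raise HEADROOM_GB if you plan to use long contexts.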

Results: single vs dual 1080 Ti

Same model (IQ4_XS), same prompt, num_predict=256, measured via Ollama's /api/generate:

Config	Generation	Prefill	Model on GPU	Model on CPU (mmap)
CPU only (i9-14900K)	~17.1 tok/s	—	0 GB	whole model in RAM
1× GTX 1080 Ti	~16.8 tok/s	~50 tok/s	~3 GB	~18 GB+
2× GTX 1080 Ti	~20.3 tok/s	~50 tok/s	~13 GB (4 + 9.3)	~18 GB

Under load, the busier card drew up to ~101 W and GPU utilization sat around 26–33%. That is telling: the cards spend much of their time waiting, because the CPU-mmap'd experts are the bottleneck, not raw GPU FLOPs.
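
Utilization and power figures like these can be sampled with nvidia-smi while a generation is running. The sketch below is one way to do it, assuming nvidia-smi is on your PATH and reports numeric values for every field (some cards return "[N/A]" for power, which this parser does not handle). The function names are hypothetical.

```python
import subprocess

QUERY_FIELDS = "index,utilization.gpu,power.draw,memory.used"


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, power, mem = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": float(util),
            "power_w": float(power),
            "mem_used_mib": float(mem),
        })
    return rows


def sample_gpus() -> list[dict]:
    """Take one snapshot of every visible GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Call sample_gpus() in a loop from a second terminal during generation. A single snapshot can easily miss the peak.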

Update (2026-06-03): the honest punchline, after an r/ollama reader pushed back ("those numbers are slow for A3B"). I measured CPU-only on the same box, an Intel i9-14900K (32 threads, DDR5): ~17.1 tok/s. That's basically tied with a single 1080 Ti, and both GPUs combined are only ~18% ahead of it. So for this offload-heavy MoE, the old Pascal cards barely beat a modern CPU: the 14900K does most of the work and the GPUs mostly shave overhead. The honest framing isn't "a 35B runs on 2× 1080 Ti" so much as "a 35B MoE runs on a fast desktop CPU, and old GPUs add ~18%." When the experts have to live in CPU RAM, your CPU and memory bandwidth, not the GPU, set the ceiling. (On hardware where the whole MoE is VRAM-resident, the GPU story would look very different.)

So, does the second 1080 Ti help? A little: ~+20% over one card and ~+18% over CPU-only, by keeping ~10 GB more of the model in VRAM (~13 GB vs ~3 GB). But it's not 2×, and not the win you'd hope for: an MoE that overflows your combined VRAM is gated by the CPU-side experts in every config here.
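
For anyone who wants to check the percentages, the arithmetic is short. This sketch uses only numbers already in this post: the three dual-GPU runs from the TL;DR and the single-GPU and CPU-only rates from the table.

```python
from statistics import mean

dual_runs = [19.4, 21.4, 20.0]   # three dual-GPU runs from the TL;DR, in tok/s
single_gpu = 16.8                # 1x GTX 1080 Ti
cpu_only = 17.1                  # i9-14900K


def speedup_pct(new: float, base: float) -> float:
    """Percentage gain of `new` over `base`."""
    return (new / base - 1.0) * 100.0


dual = mean(dual_runs)
print(f"dual mean: {dual:.1f} tok/s")
print(f"vs one card: +{speedup_pct(dual, single_gpu):.1f}%")
print(f"vs CPU-only: +{speedup_pct(dual, cpu_only):.1f}%")
```

The results land within about a point of the rounded ~20% and ~18% quoted here. With only three runs, the spread between them (19.4 to 21.4) is as large as the single-GPU vs CPU-only gap.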

Reproduce it

# (driver must be 570+ for current Ollama; check with: nvidia-smi)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

# generate + read the eval rate
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS",
  "prompt": "Explain mixture-of-experts in 150 words.",
  "stream": false,
  "options": {"num_predict": 256}
}'
# tokens/sec = eval_count / (eval_duration / 1e9)

To force a single GPU for comparison, start the server with CUDA_VISIBLE_DEVICES=0 ollama serve.
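
If you'd rather not do the division by hand, the same request and calculation can be sketched in Python with only the standard library. The URL and request body mirror the curl call above. The function names are my own. I'm assuming the reply carries eval_count, eval_duration, prompt_eval_count and prompt_eval_duration (durations in nanoseconds), and the helper returns None for any rate whose fields are missing or zero.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def rates(response: dict) -> dict:
    """Turn Ollama's nanosecond timings into tokens per second."""
    def per_second(count_key: str, duration_key: str):
        count = response.get(count_key)
        duration_ns = response.get(duration_key)
        if not count or not duration_ns:
            return None  # field missing or zero
        return count / (duration_ns / 1e9)

    return {
        "generation_tok_s": per_second("eval_count", "eval_duration"),
        "prefill_tok_s": per_second("prompt_eval_count", "prompt_eval_duration"),
    }


def generate(model: str, prompt: str, num_predict: int = 256) -> dict:
    """POST one non-streaming request and return the parsed JSON reply."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Usage is rates(generate("hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS", "Explain mixture-of-experts in 150 words.")). Run it a few times and discard the first call, which includes model load time in the total but not in eval_duration.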

Honest Limitations

One quant, one model, one box. IQ4_XS on 2× 1080 Ti; your tokens/sec will shift with quant, context length, CPU, and RAM speed.
Prefill measured on a short prompt (~55 tokens) — treat ~50 tok/s as a ballpark; long-context prefill on Pascal will be slower.
IQ4_XS is a ~4-bit quant — fine for chat/drafting, but it's not full-precision quality.
MoE-specific. These conclusions (the modest dual-GPU gain, the CPU-mmap behavior) are about this sparse MoE. A dense model that fully fits VRAM would scale differently across two cards.
A few runs, not a statistical study — numbers are representative, not p-valued.

FAQ

Q: Can a GTX 1080 Ti really run a 35B model in 2026?

A sparse MoE one, yes: Qwen3.6-35B-A3B at IQ4_XS ran at ~20 tok/s on two of them. A dense 35B would not be usable. The 3B-active design is what makes it work.

Q: Will a second 1080 Ti double my speed?

No. Here it added ~20%. The MoE experts stay memory-mapped in CPU RAM in both single- and dual-GPU setups, so the second card helps but doesn't scale linearly.

Q: Why did Ollama ignore my GPU until I updated the driver?

Recent Ollama bundles CUDA 13, which requires NVIDIA driver ≥ 570. On an older driver it falls back to CPU silently. Update the driver; Ollama then uses its cuda_v12 build for Pascal cards.

Q: Which quant should I use on 22 GB?

IQ4_XS (~17.7 GB) for the best quality that stays (mostly) on the GPUs; Q3_K_M if you want more headroom for context. Avoid the 23.9 GB default — it spills to CPU.

Resources

Model: Qwen3.6-35B-A3B GGUF (bartowski)
Ollama · benchmark via /api/generate (eval_count / eval_duration).

Top comments (4)

Josh Green • Jun 5

The update section where you added the CPU-only numbers is the most honest thing I've seen in a benchmarking post in a while. Most people would just quietly delete the original conclusions but you left both which makes this way more useful 👏

The 14900K tying a single 1080 Ti at ~17 tok/s on a MoE basically confirms what I suspected: these sparse models are memory-bandwidth limited, not compute limited. DDR5 on a modern CPU can nearly match PCIe 3.0 x16 when most of the expert weights are being mmapped from system RAM anyway. Would be really interesting to see how this changes with a dense 12B model that actually fits in VRAM entirely.

byeongsoo kang • Jun 5

Tested it instead of guessing 😄 Q4, ~20 generations, short prompts vs long context-heavy prompts (matched output length):

Short prompts → glitches show up (~8 stray chars / 22k, bursts like 慕 / 이야날 / بحسب dropped into otherwise-fine English)
Long, context-heavy prompts → 0 glitches
Q8 → 0 either way, so it's a Q4 quantization artifact, not the model itself.

My guess (just a guess): a short prompt leaves the next-token distribution flatter, so 4-bit logit error can tip it past the right token into some random high-id token — which lands in CJK/Arabic ranges. A long, specific context sharpens the distribution enough that the quantized logits still pick the right token. The model is most fragile exactly when it's least anchored.

And 100% on the HEIDI point being the scary one: a speed glitch you can see, but a confidently-inverted domain answer is invisible unless you already know. That's why I keep a tiny local RAG over the actual papers and check the model against the source instead of trusting it on niche facts. Q8 fixed both for me, ~30% slower.

The UM790 "throttled from both sides" framing nails it — my 1080 Ti number is misleading the same way: the 12B fits in VRAM so it's fine, but my 35B-MoE test found the old GPUs barely beat the CPU, because the experts get mmap'd to RAM and it goes memory-bandwidth-bound. So for offload-heavy models your shared-DDR5 iGPU and my "dual GPU" are closer than they look. Are you keeping Gemma fully in the 780M allocation, or letting it spill to shared system RAM?

Josh Green • Jun 8

On an APU there's no real distinction between "in the GPU allocation" and "shared system RAM" — the GTT pool is system RAM, just mapped for GPU access. The 780M has 2 GB of dedicated VRAM that fits nothing useful, so everything runs through the 46 GB GTT pool, which is the same DDR5 the CPU is using. Ollama reports it as "100% GPU" but that's a bit misleading — it means GPU-accessible, not on separate VRAM.

So your mmap analysis describes what's happening here by default, even for models that are nominally "fully on GPU." Your 35B experts spill to system RAM over PCIe, mine spill to... the same system RAM, just without the PCIe hop. Different path, same bottleneck. That's probably why our numbers converge more than you'd expect from the hardware gap.

byeongsoo kang • Jun 9

This is the clearest explanation of the APU memory model I've read — thank you. "The GTT pool is system RAM, just GPU-mapped" reframes it perfectly: Ollama's "100% GPU" means GPU-accessible, not on separate VRAM, so your 780M is basically running my MoE-spill case by default. Different path, same DDR5 wall — the PCIe hop is almost a rounding error next to the bandwidth limit. That's exactly why the numbers converge.

And it answers your earlier question from the other direction: I did finally test a dense 12B that fully fits VRAM (Gemma 4 12B, ~7 GB on a single 1080 Ti). It runs ~28–31 tok/s and scales cleanly — because nothing spills to system RAM, it's actually compute-bound on the GPU, not bandwidth-bound.

So the whole thing collapses to one axis: the moment any weights live in system RAM (MoE experts over PCIe, or an APU's GTT pool), you hit the DDR5 wall and the "GPU" label stops meaning much; fit it entirely in real VRAM and the bottleneck moves back to compute. Your APU just makes the system-RAM case the default. Genuinely great comment.