DEV Community

byeongsoo kang

Posted on • Originally published at bric.pe.kr

Running a 35B MoE (Qwen3.6-35B-A3B) on 2x GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

TL;DR (Quick Answer)

I actually ran Qwen3.6-35B-A3B — a 35B-parameter mixture-of-experts model (only 3B active per token) — on a pair of 8-year-old GTX 1080 Ti cards (22 GB combined). Real, measured numbers:

  • Generation speed: ~20 tokens/sec on 2× 1080 Ti (IQ4_XS quant), stable across runs (19.4 / 21.4 / 20.0).
  • Single GPU: ~16.8 tok/s. CPU-only (i9-14900K): ~17.1 tok/s. The second 1080 Ti buys only ~20% over one card. The kicker: one card is no faster than a modern CPU, and both cards together beat it by just ~18%, because the MoE experts stay mmap'd in CPU RAM regardless. See the honest update below.
  • It only "fits" because of the MoE + CPU-mmap trick. ~13 GB of the model sits on the two GPUs; ~18 GB of expert weights are mmap'd from CPU RAM, and only the active 3B runs each token.
  • Quant matters for 22 GB: the default qwen3.6:35b-a3b tag is 23.9 GB and spills to CPU. You want ≤ IQ4_XS (~17.7 GB) to keep it (mostly) on the GPUs.

Bottom line: a 35B MoE is genuinely usable on this box in 2026 — but the honest workhorse turned out to be the i9-14900K CPU; the used 1080 Ti cards add only ~18%. Pick a sparse MoE and a quant that mostly fits — and know that for an offload-heavy MoE, a fast CPU + RAM bandwidth matters as much as the GPUs.

The setup (and one gotcha)

  • GPUs: 2× NVIDIA GeForce GTX 1080 Ti (11 GB each, 22 GB total), Pascal, compute capability 6.1.
  • Driver: 581.57 (Windows host, used via WSL2 passthrough). This matters — recent Ollama bundles CUDA 13, which refuses drivers older than 570. On the older 560 driver it silently fell back to CPU (total_vram=0). Updating to 581 fixed it.
  • Ollama: v0.30.2. Interesting detail: its cuda_v13 build skips Pascal ("compute capability not in compiled architectures", cc 6.1), so it auto-falls back to the bundled cuda_v12 build to use the 1080 Ti. Good to know if you're on old hardware.
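
Because the failure mode here was silent (Ollama just fell back to CPU), it can help to check the driver before benchmarking. The following is a minimal sketch, not Ollama's own logic: the 570 threshold is the one quoted in this post, and the helper names are hypothetical. It only uses the `--query-gpu=driver_version` and `--format=csv,noheader` options of `nvidia-smi`.

```python
import subprocess

# Threshold quoted in this post; treat it as an assumption for other Ollama versions.
MIN_DRIVER_MAJOR = 570


def driver_major(version: str) -> int:
    """Return the major part of a driver string such as '581.57'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version meets the assumed minimum."""
    return driver_major(version) >= minimum


def query_driver_versions() -> list[str]:
    """Ask nvidia-smi for the driver version; it prints one line per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


# The two drivers mentioned above: 581.57 worked, the 560 series fell back to CPU.
print(driver_is_new_enough("581.57"))  # True
print(driver_is_new_enough("560"))     # False
```

On a machine with the NVIDIA driver installed, `all(driver_is_new_enough(v) for v in query_driver_versions())` gives a quick yes/no before starting `ollama serve`.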

Why a "35B" model runs on old cards at all

Qwen3.6-35B-A3B is a mixture-of-experts (MoE): 35B total parameters, but only ~3B are active for any given token. So the compute per token is small (3B-class), even though all the experts must be available in memory.

That's the whole reason this works on Pascal: the GTX 1080 Ti has no tensor cores and modest FP16, so a dense 35B would crawl. A sparse 3B-active MoE keeps the per-token math light, and the bottleneck shifts to where the weights live — which is exactly what the dual-GPU question is about.
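
A rough back-of-the-envelope version of that argument, as a sketch under loud assumptions: it treats the weights read per token as the active-parameter share (3B of 35B) of the IQ4_XS file size (~17.7 GB), and it ignores always-active shared layers, the KV cache, and the fact that "3B" is itself approximate. The function names are illustrative only.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters touched per token in a sparse MoE."""
    return active_params_b / total_params_b


def gb_read_per_token(model_size_gb: float, active_params_b: float,
                      total_params_b: float) -> float:
    """Approximate weight bytes read per token, assuming size scales with params."""
    return model_size_gb * active_fraction(active_params_b, total_params_b)


def implied_bandwidth_gbs(tokens_per_sec: float, gb_per_token: float) -> float:
    """Weight traffic per second that a given generation rate implies."""
    return tokens_per_sec * gb_per_token


frac = active_fraction(3, 35)
per_token = gb_read_per_token(17.7, 3, 35)   # IQ4_XS file size from this post
bandwidth = implied_bandwidth_gbs(20.3, per_token)  # dual-GPU rate from this post

print(f"active share: {frac:.1%}")                      # active share: 8.6%
print(f"weights per token: {per_token:.2f} GB")         # weights per token: 1.52 GB
print(f"implied weight traffic: {bandwidth:.1f} GB/s")  # implied weight traffic: 30.8 GB/s
```

Under these assumptions each token touches roughly 1.5 GB of weights rather than the full 17.7 GB, which is why where those weights live matters more than raw FLOPs.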

Quant fit on 22 GB

You can't just ollama pull qwen3.6:35b-a3b — that default is 23.9 GB and won't sit on 22 GB of VRAM. Measured GGUF sizes:

| Quant | Size | Fits 22 GB? |
| --- | --- | --- |
| Q3_K_M | ~16.6–17.1 GB | ✅ comfortable |
| IQ4_XS | ~17.7 GB | ✅ best quality that fits |
| Q4_K_S | ~21 GB | ⚠️ too tight (spills with KV cache) |
| Q4_K_M / default | 23.9 GB+ | ❌ offloads to CPU |

I used IQ4_XS.

Results: single vs dual 1080 Ti

Same model (IQ4_XS), same prompt, num_predict=256, measured via Ollama's /api/generate:

| Config | Generation | Prefill | Model on GPU | Model on CPU (mmap) |
| --- | --- | --- | --- | --- |
| CPU only (i9-14900K) | ~17.1 tok/s | n/a | 0 GB | whole model in RAM |
| 1× GTX 1080 Ti | ~16.8 tok/s | ~50 tok/s | ~3 GB | ~18 GB+ |
| 2× GTX 1080 Ti | ~20.3 tok/s | ~50 tok/s | ~13 GB (4 + 9.3) | ~18 GB |
  • Under load, the busier card drew up to ~101 W and GPU utilization sat around 26–33%. That is telling: the cards spend much of their time waiting, because the CPU-mmap'd experts are the bottleneck, not raw GPU FLOPs.

Update (2026-06-03): the honest punchline, after an r/ollama reader pushed back ("those numbers are slow for A3B"). I measured CPU-only on the same box, an Intel i9-14900K (32 threads, DDR5): ~17.1 tok/s. That is essentially tied with a single 1080 Ti (~16.8 tok/s), and both GPUs combined are only ~18% faster. So for this offload-heavy MoE, the old Pascal cards barely beat a modern CPU: the 14900K does most of the work and the GPUs mostly shave overhead. The honest framing isn't "a 35B runs on 2× 1080 Ti" so much as "a 35B MoE runs on a fast desktop CPU, and old GPUs add ~18%." When the experts have to live in CPU RAM, CPU and memory bandwidth, not the GPU, set the ceiling. (On hardware where the whole MoE is VRAM-resident, the GPU story would look very different.)

So, does the second 1080 Ti help? A little — ~+20% over one card, ~+18% over CPU-only — by keeping ~9 GB more of the model in VRAM. But not 2×, and not the win you'd hope: an MoE that overflows your combined VRAM is gated by the CPU-side experts in every config here.
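
For anyone who wants to check the percentages, they follow directly from the three generation rates in the table. A small sketch (the dictionary keys are just labels chosen here):

```python
# Generation rates in tok/s, taken from the results table.
RATES = {"cpu_only": 17.1, "one_gpu": 16.8, "two_gpu": 20.3}


def pct_gain(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new / base - 1.0) * 100.0


print(f"2 GPUs vs 1 GPU:   {pct_gain(RATES['two_gpu'], RATES['one_gpu']):+.1f}%")   # +20.8%
print(f"2 GPUs vs CPU:     {pct_gain(RATES['two_gpu'], RATES['cpu_only']):+.1f}%")  # +18.7%
print(f"1 GPU vs CPU:      {pct_gain(RATES['one_gpu'], RATES['cpu_only']):+.1f}%")  # -1.8%
```

The last line is the uncomfortable one: a single 1080 Ti comes out marginally behind CPU-only, though with only a few runs that gap is within noise.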

Reproduce it

```bash
# (driver must be 570+ for current Ollama; check with: nvidia-smi)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

# generate + read the eval rate
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS",
  "prompt": "Explain mixture-of-experts in 150 words.",
  "stream": false,
  "options": {"num_predict": 256}
}'
# tokens/sec = eval_count / (eval_duration / 1e9)
```

To force a single GPU for comparison, start the server with CUDA_VISIBLE_DEVICES=0 ollama serve.
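
If you would rather not do the division by hand, the same request and formula can be scripted. This is a minimal sketch using only the Python standard library: `generate` and `rates` are names made up here, and the `sample` dictionary holds illustrative round numbers, not a captured response. The timing fields (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`) are the ones Ollama returns from `/api/generate`, with durations in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def generate(model: str, prompt: str, num_predict: int = 256) -> dict:
    """POST a non-streaming generate request and return the parsed JSON reply."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rates(response: dict) -> dict:
    """Convert Ollama's counts and nanosecond durations into tokens per second."""
    result = {
        "generation_tok_s": response["eval_count"] / (response["eval_duration"] / 1e9)
    }
    # Prefill fields can be absent (for example when the prompt is cached), so guard them.
    if response.get("prompt_eval_count") and response.get("prompt_eval_duration"):
        result["prefill_tok_s"] = (
            response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)
        )
    return result


# Illustrative values only (NOT a real captured response).
sample = {
    "eval_count": 256,
    "eval_duration": 12_800_000_000,
    "prompt_eval_count": 55,
    "prompt_eval_duration": 1_100_000_000,
}
print({key: round(value, 1) for key, value in rates(sample).items()})
```

With the server running, `rates(generate("hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS", "Explain mixture-of-experts in 150 words."))` returns the measured numbers for your own box.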

Honest Limitations

  1. One quant, one model, one box. IQ4_XS on 2× 1080 Ti; your tokens/sec will shift with quant, context length, CPU, and RAM speed.
  2. Prefill measured on a short prompt (~55 tokens) — treat ~50 tok/s as a ballpark; long-context prefill on Pascal will be slower.
  3. IQ4_XS is a ~4-bit quant — fine for chat/drafting, but it's not full-precision quality.
  4. MoE-specific. These conclusions (the modest dual-GPU gain, the CPU-mmap behavior) are about this sparse MoE. A dense model that fully fits VRAM would scale differently across two cards.
  5. A few runs, not a statistical study — numbers are representative, not p-valued.
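
To put a number on "a few runs": the three dual-GPU runs quoted in the TL;DR (19.4 / 21.4 / 20.0 tok/s) can be summarized with the standard library. This is only a sketch of the spread; three samples are far too few for anything stronger than a sanity check.

```python
import statistics

# Dual-GPU generation rates in tok/s, from the TL;DR.
runs = [19.4, 21.4, 20.0]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)  # sample standard deviation (n - 1)
spread_pct = stdev / mean * 100

print(f"mean:   {mean:.1f} tok/s")    # mean:   20.3 tok/s
print(f"stdev:  {stdev:.2f} tok/s")   # stdev:  1.03 tok/s
print(f"spread: {spread_pct:.1f}%")   # spread: 5.1%
```

A run-to-run spread of about 5% is worth keeping in mind when reading the ~18–20% gains above, and it is larger than the gap between one GPU and CPU-only.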

FAQ

Q: Can a GTX 1080 Ti really run a 35B model in 2026?

A sparse MoE one, yes — Qwen3.6-35B-A3B at IQ4_XS ran ~20 tok/s on two of them. A dense 35B would not be usable. The 3B-active design is what makes it work.

Q: Will a second 1080 Ti double my speed?

No. Here it added ~20%. The MoE experts stay memory-mapped in CPU RAM in both single- and dual-GPU setups, so the second card helps but doesn't scale linearly.

Q: Why did Ollama ignore my GPU until I updated the driver?

Recent Ollama bundles CUDA 13, which requires NVIDIA driver ≥ 570. On an older driver it falls back to CPU silently. Update the driver; Ollama then uses its cuda_v12 build for Pascal cards.

Q: Which quant should I use on 22 GB?

IQ4_XS (~17.7 GB) for the best quality that stays (mostly) on the GPUs; Q3_K_M if you want more headroom for context. Avoid the 23.9 GB default — it spills to CPU.

