TL;DR (Quick Answer)
Gemma 4 12B just dropped, so I ran it on a GTX 1080 Ti (Pascal, 2017) to see what an 8-year-old card does with a 2026 model. Real numbers, and a few honest surprises:
- Speed: ~28 tok/s at Q4_K_M on a single 1080 Ti (~8 GB VRAM). The 12B fits one card, so the second GPU sits idle.
-
Three things broke before it worked: the GGUF is multimodal and its vision projector crashes Ollama; it's a reasoning model that hides its answer in a
thinkingchannel; and Q4 produces visible token glitches. - The interesting part — Q4 vs Q8. I asked it real bioinformatics questions. At Q4 it answered concepts and code well but got a niche method (the HEIDI test) confidently backwards, with garbled characters sprinkled in. Going to Q8_0 (12.7 GB, split across both 1080 Tis, ~30% slower at ~19.5 tok/s) removed the glitches and fixed the wrong answer.
Bottom line: for chat and drafting, Q4 on one old card is genuinely usable. For work where details matter, the higher quant across two cards is worth the speed hit — and it's the one case where the second 1080 Ti finally earns its keep.
Setup
- Hardware: 2× NVIDIA GTX 1080 Ti (11 GB each), Pascal cc 6.1, driver 581.57, via WSL2.
-
Runtime: Ollama 0.30.2. Gemma 4 isn't in Ollama's library yet, so I pulled the unsloth GGUF:
ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M.
The 3 things that broke first
1. It's multimodal — and the vision projector crashes Ollama.
First generation returned nothing. The logs:
error: Failed to load CLIP model from .../blobs/sha256-7d10888...
llama-server process has terminated: exit status 1
Gemma 4 12B-it ships with a vision (CLIP) projector, and Ollama 0.30.2 fails to load it — taking down the whole model server. If you only want text, you have to strip the projector. Pull the model, then rebuild it text-only from the same blobs (no re-download):
ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > Gemma4.Modelfile
# delete the second `FROM ...` line — the mmproj/CLIP blob — keep only the text GGUF
ollama create gemma4-12b-text -f Gemma4.Modelfile
2. It's a reasoning model — your answer hides in thinking.
With the text model, generation worked but content came back empty while eval_count was 200. The output was all going into the reasoning channel and getting cut off mid-thought at the token cap. Fix: disable thinking.
{ "model": "gemma4-12b-text", "think": false, "messages": [ ... ] }
With think: false, clean answers in ~10 seconds.
3. Q4 has visible token glitches.
At Q4_K_M, prose came out with occasional garbled characters — literally self-さattention, ściindicates, stray Korean/Japanese codepoints injected mid-word. Code blocks were clean; only prose was affected. (Spoiler: Q8 fixes this.)
Speed (Q4, single 1080 Ti)
think: false, num_predict=256, measured via Ollama's API:
- Generation: ~27.6 tok/s (27.5 / 27.6 / 27.7 — rock stable)
- VRAM: ~8 GB on GPU0; GPU1 completely idle (0 MiB) — a 12B at Q4 fits one card, so the second GPU does nothing.
Quality: I asked it about my actual field
Speed is easy; is it useful for real work? I gave it four bioinformatics questions and checked the answers honestly:
| Question | Verdict |
|---|---|
| RNA-seq normalization (raw vs TPM vs FPKM; DESeq2 input) | ✅ Correct and precise |
| Pandas function to filter a DESeq2 results table | ✅ Correct, clean, usable |
| Troubleshoot an implausibly high DEG count | ✅ Good — batch effects, PCA, outliers, covariates |
| What a small HEIDI p-value means (SMR/colocalization) | ❌ Confidently backwards |
That last one is the lesson. HEIDI is a niche test: a small p-value means the locus fails (heterogeneity/linkage — you filter it out). Q4 Gemma told me a small p-value means a single causal gene — the exact opposite. It was fluent and sure of itself. If you don't already know the answer, that's the dangerous kind of wrong.
The payoff: Q4 vs Q8
So I pulled Q8_0 (12.7 GB) and rebuilt it text-only the same way. At 12.7 GB it no longer fits one card — Ollama splits it across both 1080 Tis (~7 GB each). Same questions:
| Q4_K_M | Q8_0 | |
|---|---|---|
| Size / GPUs | 7 GB / 1 card (GPU1 idle) | 12.7 GB / 2 cards (~7 GB each) |
| Speed | ~28 tok/s | ~19.5 tok/s (−30%) |
| Token glitches |
self-さattention etc. |
gone — clean ✅ |
| HEIDI answer | backwards ❌ | correct ✅ ("small p = fails, filter it out") |
Less quantization bought three things: the glitches disappeared, it got the niche domain detail right, and — because the bigger file overflows one card — the otherwise-idle second 1080 Ti finally did work. The cost was ~30% throughput.
(Honesty note: I asked Q8 the HEIDI question with a more pointed framing than Q4, so that single comparison isn't perfectly controlled. The token-glitch difference, on identical prompts, is unambiguous.)
When does the second 1080 Ti actually help?
Combining this with an earlier 35B-MoE run, a clear rule emerges:
- Model fits one card (12B Q4): second GPU is idle — useless.
- Model overflows one card (12B Q8, or a 35B): it spills to the second card, which now helps.
The second 1080 Ti isn't about speed; it's about fitting a bigger or higher-precision model.
Honest Limitations
- One model, two quants, one box; your tok/s will vary with CPU, RAM, and context length.
- Q8 HEIDI test used a more direct prompt — suggestive, not a controlled A/B.
- Quality judged on a handful of prompts, not a benchmark suite.
- Ollama 0.30.2's Gemma 4 support is early (the CLIP crash, the reasoning-channel behavior); later versions may change this.
Reproduce
ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M # or :Q8_0 for the 2-card run
ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > m.Modelfile
# remove the mmproj/CLIP `FROM` line, keep the text GGUF
ollama create gemma4-12b-text -f m.Modelfile
# then call /api/chat with "think": false
FAQ
Q: Can a GTX 1080 Ti run Gemma 4 12B?
Yes — ~28 tok/s at Q4 on a single card, ~19.5 tok/s at Q8 across two. Just strip the vision projector (it crashes Ollama 0.30.2) and disable the reasoning channel with think: false.
Q: Q4 or Q8?
Q4 for speed and casual use (one card). Q8 when correctness matters: on my field's questions it removed the token glitches and fixed an answer Q4 got backwards — at ~30% lower speed, and it needs both cards.
Q: Why did the second GPU sit idle at Q4?
A 12B at Q4 is ~7 GB and fits one 11 GB card, so Ollama uses one GPU. Only when the model overflows one card (Q8, or a larger model) does the second card get used.
Resources
- Model: unsloth/gemma-4-12b-it-GGUF
- Related: 35B MoE on 2× 1080 Ti · Ollama
Top comments (0)