byeongsoo kang

Posted on Jun 5 • Originally published at bric.pe.kr

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field

#llm #ollama #gpu #machinelearning

TL;DR (Quick Answer)

Gemma 4 12B just dropped, so I ran it on a GTX 1080 Ti (Pascal, 2017) to see what an 8-year-old card does with a 2026 model. Real numbers, and a few honest surprises:

Speed: ~28 tok/s at Q4_K_M on a single 1080 Ti (~8 GB VRAM). The 12B fits one card, so the second GPU sits idle.
Three things broke before it worked: the GGUF is multimodal and its vision projector crashes Ollama; it's a reasoning model that hides its answer in a thinking channel; and Q4 produces visible token glitches.
The interesting part — Q4 vs Q8. I asked it real bioinformatics questions. At Q4 it answered concepts and code well but got a niche method (the HEIDI test) confidently backwards, with garbled characters sprinkled in. Going to Q8_0 (12.7 GB, split across both 1080 Tis, ~30% slower at ~19.5 tok/s) removed the glitches and fixed the wrong answer.

Bottom line: for chat and drafting, Q4 on one old card is genuinely usable. For work where details matter, the higher quant across two cards is worth the speed hit — and it's the one case where the second 1080 Ti finally earns its keep.

Setup

Hardware: 2× NVIDIA GTX 1080 Ti (11 GB each), Pascal cc 6.1, driver 581.57, via WSL2.
Runtime: Ollama 0.30.2. Gemma 4 isn't in Ollama's library yet, so I pulled the unsloth GGUF: ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M.

The 3 things that broke first

1. It's multimodal — and the vision projector crashes Ollama.
First generation returned nothing. The logs:

error: Failed to load CLIP model from .../blobs/sha256-7d10888...
llama-server process has terminated: exit status 1

Gemma 4 12B-it ships with a vision (CLIP) projector, and Ollama 0.30.2 fails to load it — taking down the whole model server. If you only want text, you have to strip the projector. Pull the model, then rebuild it text-only from the same blobs (no re-download):

ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > Gemma4.Modelfile
# delete the second `FROM ...` line — the mmproj/CLIP blob — keep only the text GGUF
ollama create gemma4-12b-text -f Gemma4.Modelfile

2. It's a reasoning model — your answer hides in thinking.
With the text model, generation worked but content came back empty while eval_count was 200. The output was all going into the reasoning channel and getting cut off mid-thought at the token cap. Fix: disable thinking.

{ "model": "gemma4-12b-text", "think": false, "messages": [ ... ] }

With think: false, clean answers in ~10 seconds.

3. Q4 has visible token glitches.
At Q4_K_M, prose came out with occasional garbled characters — literally self-さattention, ściindicates, stray Korean/Japanese codepoints injected mid-word. Code blocks were clean; only prose was affected. (Spoiler: Q8 fixes this.)

Speed (Q4, single 1080 Ti)

think: false, num_predict=256, measured via Ollama's API:

Generation: ~27.6 tok/s (27.5 / 27.6 / 27.7 — rock stable)
VRAM: ~8 GB on GPU0; GPU1 completely idle (0 MiB) — a 12B at Q4 fits one card, so the second GPU does nothing.

Quality: I asked it about my actual field

Speed is easy; is it useful for real work? I gave it four bioinformatics questions and checked the answers honestly:

Question	Verdict
RNA-seq normalization (raw vs TPM vs FPKM; DESeq2 input)	✅ Correct and precise
Pandas function to filter a DESeq2 results table	✅ Correct, clean, usable
Troubleshoot an implausibly high DEG count	✅ Good — batch effects, PCA, outliers, covariates
What a small HEIDI p-value means (SMR/colocalization)	❌ Confidently backwards

That last one is the lesson. HEIDI is a niche test: a small p-value means the locus fails (heterogeneity/linkage — you filter it out). Q4 Gemma told me a small p-value means a single causal gene — the exact opposite. It was fluent and sure of itself. If you don't already know the answer, that's the dangerous kind of wrong.

The payoff: Q4 vs Q8

So I pulled Q8_0 (12.7 GB) and rebuilt it text-only the same way. At 12.7 GB it no longer fits one card — Ollama splits it across both 1080 Tis (~7 GB each). Same questions:

	Q4_K_M	Q8_0
Size / GPUs	7 GB / 1 card (GPU1 idle)	12.7 GB / 2 cards (~7 GB each)
Speed	~28 tok/s	~19.5 tok/s (−30%)
Token glitches	`self-さattention` etc.	gone — clean ✅
HEIDI answer	backwards ❌	correct ✅ ("small p = fails, filter it out")

Less quantization bought three things: the glitches disappeared, it got the niche domain detail right, and — because the bigger file overflows one card — the otherwise-idle second 1080 Ti finally did work. The cost was ~30% throughput.

(Honesty note: I asked Q8 the HEIDI question with a more pointed framing than Q4, so that single comparison isn't perfectly controlled. The token-glitch difference, on identical prompts, is unambiguous.)

When does the second 1080 Ti actually help?

Combining this with an earlier 35B-MoE run, a clear rule emerges:

Model fits one card (12B Q4): second GPU is idle — useless.
Model overflows one card (12B Q8, or a 35B): it spills to the second card, which now helps.

The second 1080 Ti isn't about speed; it's about fitting a bigger or higher-precision model.

Honest Limitations

One model, two quants, one box; your tok/s will vary with CPU, RAM, and context length.
Q8 HEIDI test used a more direct prompt — suggestive, not a controlled A/B.
Quality judged on a handful of prompts, not a benchmark suite.
Ollama 0.30.2's Gemma 4 support is early (the CLIP crash, the reasoning-channel behavior); later versions may change this.

Reproduce

ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M       # or :Q8_0 for the 2-card run
ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > m.Modelfile
# remove the mmproj/CLIP `FROM` line, keep the text GGUF
ollama create gemma4-12b-text -f m.Modelfile
# then call /api/chat with "think": false

FAQ

Q: Can a GTX 1080 Ti run Gemma 4 12B?

Yes — ~28 tok/s at Q4 on a single card, ~19.5 tok/s at Q8 across two. Just strip the vision projector (it crashes Ollama 0.30.2) and disable the reasoning channel with think: false.

Q: Q4 or Q8?

Q4 for speed and casual use (one card). Q8 when correctness matters: on my field's questions it removed the token glitches and fixed an answer Q4 got backwards — at ~30% lower speed, and it needs both cards.

Q: Why did the second GPU sit idle at Q4?

A 12B at Q4 is ~7 GB and fits one 11 GB card, so Ollama uses one GPU. Only when the model overflows one card (Q8, or a larger model) does the second card get used.

Resources

Model: unsloth/gemma-4-12b-it-GGUF
Related: 35B MoE on 2× 1080 Ti · Ollama

Top comments (1)

Josh Green • Jun 5

The HEIDI p-value thing is wild. Q4 confidently giving you the exact opposite answer on a domain specific question is basically the worst failure mode, you cant even tell its wrong unless you already know the answer.

I've been running Gemma on a UM790 Pro (780M iGPU, shared DDR5) and the bandwidth bottleneck is similar but different. The iGPU shares memory bandwidth with the CPU so you get throttled from both sides at once. 27 tok/s on a 1080 Ti is honestly better than I expected for Pascal at this point. Did you notice any difference in the token glitching between longer vs shorter prompts on Q4? 🤨