The "GPU poor" narrative has flipped this year: 24GB-and-below cards are suddenly fine, thanks to quantization-aware training (QAT: near-bf16 quality at Q4 size) and multi-token prediction (MTP: free decode speed). But most of those posts are running 3090s and 4080s. I wanted the floor: what actually runs well on a GTX 1080 Ti, an 8-year-old card with 11GB, in 2026? So I measured it.
The numbers
Single GTX 1080 Ti (11GB), Ollama with flash-attention, num_ctx 8192, all running 100% on the GPU (no CPU offload):
| model | size | gen tok/s | prefill tok/s | VRAM |
|---|---|---|---|---|
| Qwen3 8B | 5.2 GB | ~46 | ~1390 | 6.6 GB |
| Gemma 4 12B QAT (unsloth UD-Q4_K_XL) | 6.7 GB | ~32 | ~316 | 8.0 GB |
| Gemma 4 12B QAT (Google) | 7.0 GB | ~31 | ~314 | 8.2 GB |
| Gemma 4 12B (regular Q4) | 7.1 GB | ~29 | — | 8.4 GB |
The headline: a 12B model runs at ~30 tok/s on an 8-year-old card, comfortably inside 11GB, fully on the GPU. That's faster than most people read, and well into "usable for daily work" territory. An 8B sits around 46 tok/s with much faster prefill.
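For anyone who wants to reproduce the two speed columns, here is a minimal sketch of reading them from Ollama's own timings. It assumes a local Ollama server on the default port, started with `OLLAMA_FLASH_ATTENTION=1`, and uses the `/api/generate` response fields `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` (durations are in nanoseconds). The helper names are mine, not part of any library, and this is not necessarily the exact script behind the table.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Non-streaming request, so the timing fields arrive in one JSON object."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def rates_from_response(resp: dict) -> dict:
    """Convert Ollama's nanosecond timings into tokens per second."""
    ns_per_s = 1e9
    return {
        "prefill_tok_s": resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / ns_per_s),
        "gen_tok_s": resp["eval_count"] / (resp["eval_duration"] / ns_per_s),
    }


def measure(model: str, prompt: str) -> dict:
    """Send one request to a running Ollama server and return both rates."""
    body = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return rates_from_response(json.load(r))
```

Calling `measure("qwen3:8b", long_prompt)` would return both rates for one request; the model tag is an assumption, so substitute whatever `ollama list` shows on your machine. Use a prompt of a few thousand tokens so the prefill figure is not dominated by startup overhead.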
A few things worth noting
QAT buys a small but real speed edge. The Gemma 4 12B QAT builds (~31–32 tok/s) come in a bit faster than the regular Q4 (~29) and are slightly smaller (6.7–7.0 GB vs 7.1 GB). That is about a 9% gen-speed gain, consistent with what I measured earlier on the same card. Not magic, but free.
Prefill scales with size, hard. Qwen3 8B processes the prompt at ~1390 tok/s; the Gemma 12Bs manage ~315, roughly 4.4× slower. On a Pascal card the prompt-processing stage is where you feel the model size, so for long prompts the smaller model's lead widens well beyond the ~1.5× gen-speed gap. (This is the same prefill wall story, scaled to old hardware.)
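To make that concrete, here is a rough back-of-the-envelope sketch using the table's rates. The workload (a 6000-token prompt and a 300-token reply) is a made-up example, not something I benchmarked, and the model assumes prefill and decode simply add with no other overhead.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tok_s, gen_tok_s):
    """Split one request into prefill time, decode time, and their sum."""
    prefill = prompt_tokens / prefill_tok_s
    decode = output_tokens / gen_tok_s
    return prefill, decode, prefill + decode


# (prefill tok/s, gen tok/s), rounded from the table above
MODELS = {"Qwen3 8B": (1390, 46), "Gemma 4 12B QAT": (315, 31)}

for name, (prefill_rate, gen_rate) in MODELS.items():
    prefill, decode, total = request_seconds(6000, 300, prefill_rate, gen_rate)
    print(f"{name}: prefill {prefill:.1f}s + decode {decode:.1f}s = {total:.1f}s")

# Qwen3 8B: prefill 4.3s + decode 6.5s = 10.8s
# Gemma 4 12B QAT: prefill 19.0s + decode 9.7s = 28.7s
```

On this hypothetical request the 12B takes about 2.7× as long end to end, even though its generation speed is only about 1.5× slower.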
12B is the comfortable ceiling for one card. A 12B Q4 lands around 8GB and leaves room for a real context. The QAT 12B even fits 16k context on an 8GB card with KV-cache quantization, so an 11GB 1080 Ti has comfortable headroom.
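If you want to estimate how much of that headroom a longer context eats, the standard KV-cache size formula is a reasonable starting point. This is a sketch only: the layer count, KV-head count, and head dimension below are placeholder values, not Gemma 4's real configuration, and models that use sliding-window attention on some layers will need less than this predicts. The bytes-per-element figures follow the ggml block layouts for `q8_0` and `q4_0`.

```python
# Approximate bytes per cached element for each KV-cache type.
# q8_0 stores 32 values in 34 bytes; q4_0 stores 32 values in 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_tokens, cache_type="f16"):
    """Full-attention KV-cache size: one K and one V vector per layer per token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


# HYPOTHETICAL architecture, for illustration only.
layers, kv_heads, head_dim = 40, 8, 128

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(layers, kv_heads, head_dim, 16384, cache_type)
    print(f"16k context, {cache_type}: {size:.2f} GiB")
```

Plug in the real values from the model's config to get a meaningful number. In Ollama the cache type is selected server-side with the `OLLAMA_KV_CACHE_TYPE` environment variable, which requires flash attention to be enabled.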
Where the ceiling actually is
A dense 27B (Q4 ≈ 17GB) does not fit one 11GB card — you either split across two cards or it spills to system RAM and crawls. And spilling is worse than it sounds on this class of hardware: I benchmarked the 35B-A3B MoE on 2× 1080 Ti and it ran at only ~17 tok/s, because the experts get mmapped to system RAM and the whole thing goes memory-bandwidth-bound — a CPU nearly tied it. So "more VRAM via a second old card" helps you fit bigger models, but the bandwidth ceiling means a dense 12B that lives entirely on one card often feels snappier than a 35B that spills.
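The bandwidth ceiling can be sketched with a simple roofline-style estimate: if decoding one token has to stream every weight once, tokens per second cannot exceed memory bandwidth divided by model size. The code below assumes the 1080 Ti's published 484 GB/s and ignores KV-cache reads and compute, so treat it as an upper bound rather than a prediction.

```python
GTX_1080_TI_GB_S = 484.0  # published memory bandwidth of the GTX 1080 Ti


def bandwidth_bound_tok_s(weights_gb, bandwidth_gb_s=GTX_1080_TI_GB_S):
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


# (model size in GB, measured gen tok/s) from the table above
MEASURED = {
    "Qwen3 8B": (5.2, 46),
    "Gemma 4 12B QAT (Google)": (7.0, 31),
}

for name, (size_gb, tok_s) in MEASURED.items():
    bound = bandwidth_bound_tok_s(size_gb)
    print(f"{name}: bound {bound:.0f} tok/s, measured {tok_s} ({tok_s / bound:.0%} of bound)")

# Qwen3 8B: bound 93 tok/s, measured 46 (49% of bound)
# Gemma 4 12B QAT (Google): bound 69 tok/s, measured 31 (45% of bound)
```

Both dense models land at a bit under half of the bound, which is what you would expect from a bandwidth-limited decode. For weights that spill to system RAM, the same function applies with your machine's RAM bandwidth passed as `bandwidth_gb_s`; for a MoE, pass only the weights active per token rather than the full file size. System RAM is typically far slower than the card's GDDR5X, which is why spilled models crawl.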
The takeaway
If you've got a 1080 Ti gathering dust: in 2026 it runs a 12B at ~30 tok/s and an 8B at ~46, fully on the GPU, no cloud, no rate limits. QAT made the quality competitive and the size friendly; the card was always fast enough for this once the models got small enough. The "GPU poor are eating well" story reaches all the way back to 2017 silicon — you just stay at or below 12B and let it sit entirely in VRAM.
Caveats
- Single GTX 1080 Ti, single request, Ollama + flash-attention, num_ctx 8192, the specific quants above. Gen tok/s from the server's own token timings; numbers are stable but not claimed to ±0.1.
- The 35B-A3B 2× 1080 Ti figure is from my earlier write-up, not this run.
- 27B+ "doesn't fit" assumes a single 11GB card and Q4-class quant; a second card or heavier KV-quant changes the math (at a speed cost).