What Actually Runs Well on a GTX 1080 Ti in 2026 (Measured)

#llm #machinelearning #gemma #localllama

The "GPU poor" narrative has flipped this year: cards with 24GB or less are suddenly fine, thanks to quantization-aware training (QAT: near-bf16 quality at Q4 size) and multi-token prediction (MTP: free decode speed). But most of those posts are about 3090s and 4080s. I wanted the floor: what actually runs well on a GTX 1080 Ti, an 8-year-old card with 11GB, in 2026? So I measured it.

The numbers

Single GTX 1080 Ti (11GB), Ollama with flash-attention, num_ctx 8192, all running 100% on the GPU (no CPU offload):

model	size	gen tok/s	prefill tok/s	VRAM
Qwen3 8B	5.2 GB	~46	~1390	6.6 GB
Gemma 4 12B QAT (unsloth UD-Q4_K_XL)	6.7 GB	~32	~316	8.0 GB
Gemma 4 12B QAT (Google)	7.0 GB	~31	~314	8.2 GB
Gemma 4 12B (regular Q4)	7.1 GB	~29	—	8.4 GB

The headline: a 12B model runs at ~30 tok/s on an 8-year-old card, comfortably inside 11GB, fully on the GPU. That's faster than most people read, and well into "usable for daily work" territory. An 8B sits around 46 tok/s with much faster prefill.
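For anyone who wants to reproduce figures like these, here is a minimal sketch of how tok/s can be derived from the timing fields Ollama returns with a non-streaming /api/generate response (eval_count, eval_duration, prompt_eval_count, prompt_eval_duration, with durations in nanoseconds). The helper names throughput and measure are mine, the endpoint is Ollama's default local address, and any model tag you pass in is your own choice. Flash attention is a server-side setting and is not touched here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def throughput(resp: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into tokens per second."""
    ns = 1e9
    out = {"gen_tok_s": resp["eval_count"] / (resp["eval_duration"] / ns)}
    # Prefill timings can be missing or zero when the prompt was served from cache.
    if resp.get("prompt_eval_duration"):
        out["prefill_tok_s"] = resp["prompt_eval_count"] / (
            resp["prompt_eval_duration"] / ns
        )
    return out


def measure(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Send one non-streaming request and return the server-reported throughput."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return throughput(json.loads(r.read()))


# Example call (needs a running Ollama server and a model you have pulled):
# print(measure("your-model-tag", "Explain KV caching in two paragraphs."))
```

Using a fresh prompt on each run helps keep the prefill number honest, since a cached prompt will not be re-processed.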

A few things worth noting

QAT buys a small but real speed edge. The Gemma 4 12B QAT builds (~31–32 tok/s) come in a bit faster than the regular Q4 (~29) and slightly smaller (6.7–7.0 GB vs 7.1 GB). That is about a 9% gen-speed gain, consistent with what I measured earlier on the same card. Not magic, but free.

Prefill scales with size, hard. Qwen3 8B processes the prompt at ~1390 tok/s; the Gemma 12Bs manage ~315, roughly 4.4× slower for a model with only 1.5× the parameters. On a Pascal card the prompt-processing stage is where you feel the model size, so for long prompts the smaller model's lead widens beyond the gen-speed gap. (This is the same prefill wall story, scaled to old hardware.)
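To make the "lead widens" point concrete, here is a rough back-of-the-envelope sketch using the table's throughput numbers. It assumes both rates stay constant regardless of context length, which is a simplification (decode usually slows somewhat as the context fills), and the prompt and output lengths are arbitrary examples of mine.

```python
# (prefill tok/s, gen tok/s) taken from the table above
MEASURED = {
    "Qwen3 8B": (1390, 46),
    "Gemma 4 12B QAT": (315, 32),
}


def response_time(prompt_tokens, output_tokens, prefill_tok_s, gen_tok_s):
    """Seconds to process the prompt plus seconds to generate the reply."""
    return prompt_tokens / prefill_tok_s + output_tokens / gen_tok_s


for name, (prefill, gen) in MEASURED.items():
    short = response_time(200, 300, prefill, gen)    # chat-style prompt
    long = response_time(6000, 300, prefill, gen)    # long document in the prompt
    print(f"{name}: short prompt {short:.1f}s, long prompt {long:.1f}s")
```

Under these assumptions the 12B takes about 1.5× as long as the 8B on the short prompt (10.0s vs 6.7s) and about 2.6× as long on the 6000-token prompt (28.4s vs 10.8s).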

12B is the comfortable ceiling for one card. A 12B Q4 lands around 8GB of VRAM at num_ctx 8192 (8.0–8.4 GB in the table), which leaves roughly 3GB of the card's 11GB for a longer context. The QAT 12B even fits 16k context on an 8GB card with KV-cache quantization, so an 11GB 1080 Ti has comfortable headroom.
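A rough way to see why KV-cache quantization buys context is to estimate the cache size directly. The sketch below uses the textbook formula for a full-attention transformer (keys plus values, per layer, per position). The layer count, KV-head count, and head dimension are HYPOTHETICAL placeholders, not the real Gemma 4 or Qwen3 values, and models that use sliding-window or other reduced-cache layers will need less than this.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Keys + values for every layer and every position, in GiB."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


# HYPOTHETICAL architecture numbers, chosen only to illustrate the arithmetic.
ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for n_ctx in (8192, 16384):
    f16 = kv_cache_gib(n_ctx=n_ctx, **ARCH)  # 2 bytes per element
    # An 8-bit cache is roughly 1 byte per element (block scales ignored here).
    q8 = kv_cache_gib(n_ctx=n_ctx, bytes_per_elem=1.0, **ARCH)
    print(f"ctx {n_ctx}: f16 cache {f16:.2f} GiB, 8-bit cache {q8:.2f} GiB")
```

With these placeholder numbers, doubling the context from 8k to 16k doubles the cache, and an 8-bit cache at 16k costs the same as an f16 cache at 8k. Plug in the real architecture values from a model's config to get an actual estimate.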

Where the ceiling actually is

A dense 27B (Q4 ≈ 17GB) does not fit one 11GB card: you either split it across two cards or it spills to system RAM and crawls. And spilling is worse than it sounds on this class of hardware. I benchmarked the 35B-A3B MoE on 2× 1080 Ti and it ran at only ~17 tok/s, because the experts get mmapped to system RAM and the whole thing goes memory-bandwidth-bound; CPU-only inference nearly tied it. So "more VRAM via a second old card" helps you fit bigger models, but the bandwidth ceiling means a dense 12B that lives entirely on one card often feels snappier than a 35B that spills.
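The bandwidth argument can be sketched as a simple ceiling: if decoding is memory-bound, tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The 484 GB/s figure is the 1080 Ti's published memory bandwidth. The system-RAM bandwidth and the per-token read size for the MoE are ASSUMPTIONS of mine for illustration (I did not measure either), and the model treats all active MoE weights as coming from system RAM, which is a worst case.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when each token streams its weights once."""
    return bandwidth_gb_s / gb_read_per_token


GTX_1080_TI_GB_S = 484.0  # the card's published memory bandwidth
SYSTEM_RAM_GB_S = 50.0    # ASSUMPTION: ballpark for dual-channel desktop RAM

DENSE_12B_GB = 7.0                    # Gemma 4 12B QAT size from the table
GB_PER_B_PARAMS = DENSE_12B_GB / 12   # ASSUMPTION: same Q4 density for the MoE
MOE_ACTIVE_GB = 3 * GB_PER_B_PARAMS   # ~3B active parameters per token ("A3B")

dense = decode_ceiling_tok_s(GTX_1080_TI_GB_S, DENSE_12B_GB)
spilled = decode_ceiling_tok_s(SYSTEM_RAM_GB_S, MOE_ACTIVE_GB)
print(f"dense 12B in VRAM: ceiling ~{dense:.0f} tok/s")
print(f"MoE experts in system RAM: ceiling ~{spilled:.0f} tok/s")
```

Under these assumptions the dense 12B has a ceiling near 69 tok/s and the spilled MoE a ceiling near 29 tok/s, even though the MoE reads about a quarter of the bytes per token. The measured ~31 and ~17 tok/s both sit below their respective ceilings, which is what you would expect from a bound like this.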

The takeaway

If you've got a 1080 Ti gathering dust: in 2026 it runs a 12B at ~30 tok/s and an 8B at ~46, fully on the GPU, no cloud, no rate limits. QAT made the quality competitive and the size friendly; the card was always fast enough for this once the models got small enough. The "GPU poor are eating well" story reaches all the way back to 2017 silicon — you just stay at or below 12B and let it sit entirely in VRAM.

Caveats

Single GTX 1080 Ti, single request, Ollama + flash-attention, num_ctx 8192, the specific quants above. Gen tok/s from the server's own token timings; numbers are stable but not claimed to ±0.1.
The 35B-A3B 2× 1080 Ti figure is from my earlier write-up, not this run.
27B+ "doesn't fit" assumes a single 11GB card and Q4-class quant; a second card or heavier KV-quant changes the math (at a speed cost).
