Here I compare the speed of several LLMs running on a GPU with 16GB of VRAM, and choose the best one for self-hosting.
I ran these LLMs on llama.cpp with 19K, 32K, and 64K token context windows.
For the broader performance picture (throughput versus latency, VRAM limits, parallel requests, and how benchmarks fit together across hardware and runtimes), see LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization.
The quality of the response is analysed in other articles, for instance:
- Best LLMs for OpenCode - Tested Locally
- Comparison of Hugo Page Translation quality - LLMs on Ollama
I ran a similar test for LLMs on Ollama: Best LLMs for Ollama on 16GB VRAM GPU.
In this post I am recording my attempts to squeeze out as much speed as possible.
LLM speed comparison table (tokens per second and VRAM)
| Model | Size (GB) | 19K VRAM | 19K Load | 19K T/s | 32K VRAM | 32K Load | 32K T/s | 64K VRAM | 64K Load | 64K T/s |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-UD-IQ3_S | 13.6 | 14.3 | 93/100 | 136.4 | 14.6 | 93/100 | 138.5 | 14.9 | 88/115 | 136.8 |
| Qwen3.5-27B-UD-IQ3_XXS | 11.5 | 12.9 | 98/100 | 45.3 | 13.7 | 98/100 | 45.1 | 14.7 | 45/410 | 22.7 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 44.7 | 14.7 | 30/470 | 22.3 | 14.7 | 30/480 | 21.8 | 14.7 | 28/490 | 21.5 |
| nvidia Nemotron-Cascade-2-30B IQ4_XS | 18.2 | 14.6 | 60/305 | 115.8 | 14.7 | 57/311 | 113.6 | 14.7 | 55/324 | 103.4 |
| gemma-4-26B-A4B-it-UD-IQ4_XS | 13.4 | 14.7 | 95/100 | 121.7 | 14.9 | 95/115 | 114.9 | 14.9 | 75/190 | 96.1 |
| gemma-4-31B-it-UD-IQ3_XXS | 11.8 | 14.8 | 68/287 | 29.2 | 14.8 | 41/480 | 18.4 | 14.8 | 18/634 | 8.1 |
| GLM-4.7-Flash-IQ4_XS | 16.3 | 15.0 | 66/240 | 91.8 | 14.9 | 62/262 | 86.1 | 14.9 | 53/313 | 72.5 |
| GLM-4.7-Flash-REAP-23B IQ4_XS | 12.6 | 13.7 | 92/100 | 122.0 | 14.4 | 95/102 | 123.2 | 14.9 | 71/196 | 97.1 |
19K, 32K, and 64K are the context sizes; the VRAM columns are in GB.
The Load columns show GPU%/CPU% load.
A low GPU number in these columns means the model is running mostly on the CPU and cannot reach any decent speed on this hardware. That pattern matches what people see when too little of the model fits on the GPU, or when a large context pushes work back to the host.
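That rule of thumb can be written down directly. The 50% threshold below is my own guess, not anything llama.cpp or nvitop reports:

```python
def likely_cpu_bound(gpu_load_pct, cpu_load_pct, threshold=50):
    """Heuristic: low GPU load plus high CPU load (>100% means multiple
    cores busy, as nvitop/htop report it) suggests layers spilled to CPU."""
    return gpu_load_pct < threshold and cpu_load_pct > 100

# gemma-4-31B at 64K ran at 18% GPU / 634% CPU in the table above:
print(likely_cpu_bound(18, 634))   # True  -> CPU-bound
print(likely_cpu_bound(93, 100))   # False -> mostly on GPU
```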
Why context length changes tokens per second
As you move from 19K to 32K or 64K tokens, the KV cache grows and VRAM pressure rises. Some rows show a big drop in tokens per second at 64K while others stay flat, which is the signal to revisit quants, context limits, or layer offload rather than assuming the model is “slow” in general.
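The growth above can be sketched with a back-of-the-envelope formula: the KV cache holds K and V tensors for every layer. The model dimensions below (48 layers, 8 KV heads, head dim 128) are made-up placeholders for a 30B-class dense model, not the real dimensions of any model in the table; llama.cpp defaults to an f16 cache (2 bytes per element):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_el=2):
    """Estimate KV cache size: 2 tensors (K and V) per layer,
    n_kv_heads * head_dim elements per token, f16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_el

# Hypothetical 30B-class dense model with GQA:
for ctx in (19_000, 32_000, 64_000):
    gib = kv_cache_bytes(48, 8, 128, ctx) / 2**30
    print(f"{ctx // 1000}K context -> ~{gib:.1f} GiB KV cache")
# -> roughly 3.5, 5.9, and 11.7 GiB
```

Going from 32K to 64K roughly doubles the cache, which on a 16GB card can be the difference between fitting entirely in VRAM and spilling layers to the CPU.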
I chose these models and quants so I could run them myself and see whether they give a good cost/benefit gain on this hardware. So no Q8 quants with 200K context here :) ...
GPU/CPU load was measured with nvitop.
When llama.cpp auto-configures how many layers to offload to the GPU, it tries to keep about 1GB of VRAM free.
The number of offloaded layers can also be set manually via the -ngl command-line parameter, but I'm not fine-tuning it here.
The takeaway: if there is a significant performance drop when increasing the context window from 32K to 64K, we can try to recover speed at 64K by tuning the number of offloaded layers.
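If you want a starting point for a manual -ngl value, a crude estimate (my own sketch, not llama.cpp's actual allocation logic) is to divide the VRAM budget left after the KV cache and a ~1GB reserve by the average size of one layer:

```python
def estimate_ngl(vram_free_gib, kv_cache_gib, n_layers, model_gib, reserve_gib=1.0):
    """Guess how many layers fit on the GPU.

    Assumes layers are roughly equal in size (model_gib / n_layers each)
    and mirrors llama.cpp's habit of keeping ~1 GiB of VRAM free."""
    per_layer = model_gib / n_layers
    budget = vram_free_gib - kv_cache_gib - reserve_gib
    return max(0, min(n_layers, int(budget / per_layer)))

# Hypothetical numbers: 15.7 GiB free on a 16 GiB card, 3 GiB KV cache,
# a 13.6 GiB quantized model with 48 layers:
print(estimate_ngl(15.7, 3.0, 48, 13.6))  # 41
```

Start there, watch VRAM with nvitop, and nudge -ngl up or down until you stop seeing out-of-memory errors or CPU spill.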
Test hardware and llama.cpp setup
I tested the LLM speed on a PC with this config:
- CPU Intel i7-14700
- RAM 64GB DDR5-6000 (2x32GB)
- GPU RTX-4080
- Ubuntu with NVidia drivers
- llama.cpp/llama-cli, number of offloaded layers left unspecified (auto)
- Initial VRAM used, before starting llama-cli: 300MB
Extra runs at 128K context (Qwen3.5 27B and 122B)
| Model | 128K Load | 128K T/s |
|---|---|---|
| Qwen3.5-27B-UD-IQ3_XXS | 16/625 | 9.6 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 27/496 | 19.2 |
Takeaways for 16 GB VRAM builds
- My current favorite, Qwen3.5-27B-UD-IQ3_XXS, looks good at its sweet spot of a 50K context (I get approximately 36 t/s).
- Qwen3.5-122B-A10B-UD-IQ3_XXS overtakes Qwen3.5 27B performance-wise at contexts above 64K.
- I can push Qwen3.5-35B-A3B-UD-IQ3_S to handle a 100K-token context, and it still fits into VRAM, so there is no performance drop.
- I will not use gemma-4-31B on 16GB VRAM, but gemma-4-26B might be medium-well... need to test.
- I still need to test how well Nemotron Cascade 2 and GLM-4.7 Flash REAP 23B work. Will they be better than Qwen3.5-35B Q3? I doubt it, but I might test anyway to confirm the suspicion.