Here I compare the speed of several LLMs running on a GPU with 16GB of VRAM, and choose the best one for self-hosting.
I ran these LLMs on llama.cpp with 19K, 32K, and 64K token context windows.
For the broader performance picture (throughput versus latency, VRAM limits, parallel requests, and how benchmarks fit together across hardware and runtimes), see LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization.
The quality of the response is analysed in other articles, for instance:
- Best LLMs for OpenCode - Tested Locally
- Comparison of Hugo Page Translation quality - LLMs on Ollama
I ran a similar test for LLMs on Ollama: Best LLMs for Ollama on 16GB VRAM GPU.
In this post I am recording my attempts to squeeze out as much speed as possible.
LLM speed comparison table (tokens per second and VRAM)
| Model | Size (GB) | 19K VRAM | 19K Load | 19K T/s | 32K VRAM | 32K Load | 32K T/s | 64K VRAM | 64K Load | 64K T/s |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-UD-IQ3_S | 13.6 | 14.3 | 93/100 | 136.4 | 14.6 | 93/100 | 138.5 | 14.9 | 88/115 | 136.8 |
| Qwen3.5-27B-UD-IQ3_XXS | 11.5 | 12.9 | 98/100 | 45.3 | 13.7 | 98/100 | 45.1 | 14.7 | 45/410 | 22.7 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 44.7 | 14.7 | 30/470 | 22.3 | 14.7 | 30/480 | 21.8 | 14.7 | 28/490 | 21.5 |
| nvidia Nemotron-Cascade-2-30B IQ4_XS | 18.2 | 14.6 | 60/305 | 115.8 | 14.7 | 57/311 | 113.6 | 14.7 | 55/324 | 103.4 |
| gemma-4-26B-A4B-it-UD-IQ4_XS | 13.4 | 14.7 | 95/100 | 121.7 | 14.9 | 95/115 | 114.9 | 14.9 | 75/190 | 96.1 |
| gemma-4-31B-it-UD-IQ3_XXS | 11.8 | 14.8 | 68/287 | 29.2 | 14.8 | 41/480 | 18.4 | 14.8 | 18/634 | 8.1 |
| GLM-4.7-Flash-IQ4_XS | 16.3 | 15.0 | 66/240 | 91.8 | 14.9 | 62/262 | 86.1 | 14.9 | 53/313 | 72.5 |
| GLM-4.7-Flash-REAP-23B IQ4_XS | 12.6 | 13.7 | 92/100 | 122.0 | 14.4 | 95/102 | 123.2 | 14.9 | 71/196 | 97.1 |
19K, 32K, and 64K are the context sizes; the VRAM columns are in GB.
The Load columns show GPU%/CPU% load.
A low GPU number in these columns means the model is running mostly on the CPU and cannot reach any decent speed on this hardware. That pattern matches what people see when too little of the model fits on the GPU, or when a large context pushes work back to the host.
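That rule of thumb can be written down directly. The 50% threshold below is my own guess, not anything llama.cpp or nvitop reports:

```python
def likely_cpu_bound(gpu_load_pct, cpu_load_pct, threshold=50):
    """Heuristic: low GPU load plus high CPU load (>100% means multiple
    cores busy, as nvitop/htop report it) suggests layers spilled to CPU."""
    return gpu_load_pct < threshold and cpu_load_pct > 100

# gemma-4-31B at 64K ran at 18% GPU / 634% CPU in the table above:
print(likely_cpu_bound(18, 634))   # True  -> CPU-bound
print(likely_cpu_bound(93, 100))   # False -> mostly on GPU
```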
Why context length changes tokens per second
As you move from 19K to 32K or 64K tokens, the KV cache grows and VRAM pressure rises. Some rows show a big drop in tokens per second at 64K while others stay flat, which is the signal to revisit quants, context limits, or layer offload rather than assuming the model is “slow” in general.
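The growth above can be sketched with a back-of-the-envelope formula: the KV cache holds K and V tensors for every layer. The model dimensions below (48 layers, 8 KV heads, head dim 128) are made-up placeholders for a 30B-class dense model, not the real dimensions of any model in the table; llama.cpp defaults to an f16 cache (2 bytes per element):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_el=2):
    """Estimate KV cache size: 2 tensors (K and V) per layer,
    n_kv_heads * head_dim elements per token, f16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_el

# Hypothetical 30B-class dense model with GQA:
for ctx in (19_000, 32_000, 64_000):
    gib = kv_cache_bytes(48, 8, 128, ctx) / 2**30
    print(f"{ctx // 1000}K context -> ~{gib:.1f} GiB KV cache")
# -> roughly 3.5, 5.9, and 11.7 GiB
```

Going from 32K to 64K roughly doubles the cache, which on a 16GB card can be the difference between fitting entirely in VRAM and spilling layers to the CPU.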
I chose these models and quants so I could run them myself and see whether they give a good cost/benefit gain on this hardware. So no Q8 quants with 200K context here :) ...
GPU/CPU load was measured with nvitop.
When llama.cpp auto-configures how many layers to offload to the GPU, it tries to keep about 1GB of VRAM free.
The number of offloaded layers can also be set manually via the -ngl command-line parameter, but I'm not fine-tuning it here.
The takeaway: if there is a significant performance drop when increasing the context window from 32K to 64K, we can try to recover speed at 64K by tuning the number of offloaded layers.
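If you want a starting point for a manual -ngl value, a crude estimate (my own sketch, not llama.cpp's actual allocation logic) is to divide the VRAM budget left after the KV cache and a ~1GB reserve by the average size of one layer:

```python
def estimate_ngl(vram_free_gib, kv_cache_gib, n_layers, model_gib, reserve_gib=1.0):
    """Guess how many layers fit on the GPU.

    Assumes layers are roughly equal in size (model_gib / n_layers each)
    and mirrors llama.cpp's habit of keeping ~1 GiB of VRAM free."""
    per_layer = model_gib / n_layers
    budget = vram_free_gib - kv_cache_gib - reserve_gib
    return max(0, min(n_layers, int(budget / per_layer)))

# Hypothetical numbers: 15.7 GiB free on a 16 GiB card, 3 GiB KV cache,
# a 13.6 GiB quantized model with 48 layers:
print(estimate_ngl(15.7, 3.0, 48, 13.6))  # 41
```

Start there, watch VRAM with nvitop, and nudge -ngl up or down until you stop seeing out-of-memory errors or CPU spill.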
Test hardware and llama.cpp setup
I tested the LLM speed on a PC with this config:
- CPU Intel i7-14700
- RAM 64GB DDR5-6000 (2x32GB)
- GPU RTX-4080
- Ubuntu with NVidia drivers
- llama.cpp/llama-cli, number of offloaded layers left unspecified (auto)
- Initial VRAM used, before starting llama-cli: 300MB
Extra runs at 128K context (Qwen3.5 27B and 122B)
| Model | 128K Load | 128K T/s |
|---|---|---|
| Qwen3.5-27B-UD-IQ3_XXS | 16/625 | 9.6 |
| Qwen3.5-122B-A10B-UD-IQ3_XXS | 27/496 | 19.2 |
Takeaways for 16 GB VRAM builds
- My current favorite, Qwen3.5-27B-UD-IQ3_XXS, looks good at its sweet spot of a 50K context (I get approximately 36 t/s).
- Qwen3.5-122B-A10B-UD-IQ3_XXS overtakes Qwen3.5 27B performance-wise at contexts above 64K.
- I can push Qwen3.5-35B-A3B-UD-IQ3_S to handle a 100K-token context, and it still fits into VRAM, so there is no performance drop.
- I will not use gemma-4-31B on 16GB VRAM, but gemma-4-26B might be medium-well... need to test.
- I still need to test how well Nemotron Cascade 2 and GLM-4.7 Flash REAP 23B work. Will they be better than Qwen3.5-35B Q3? I doubt it, but I might test anyway to confirm the suspicion.