vLLM vs llama.cpp vs Ollama: What Happens When Your Model Doesn't Fit in 24GB VRAM

#llm #homelab #vllm #ai

TL;DR

Benchmarked llama.cpp, Ollama, and vLLM across 5 models (1B to 116.8B params) on one RTX 3090 (24GB) + 128GB RAM home-lab box, with energy cost priced through HomeLab Monitor. Inside 24GB, vLLM's continuous batching scales aggregate throughput 3.9x-5.4x from concurrency 1 to 8; llama.cpp only manages 1.2x-1.9x, even with -np 8 explicitly set to match. Past 24GB, on two models deliberately chosen to force RAM-spill, llama.cpp and Ollama both degrade to single-digit tok/s and keep generating. vLLM OOMs outright on both, at the same ~22.1-22.2 GiB used / <700 MiB free ceiling, regardless of quantization scheme. Sub-plot: during RAM-spill, llama.cpp's manually tuned layer offload beats Ollama's automatic split on time-to-first-token by 37x on GPT-OSS-120B (1.7x on GLM-4.5-Air), while landing on nearly identical steady-state decode speed.

The roster

Model	Vendor	Type	Fits in 24GB?
Gemma 3 1B	Google	dense	yes
Qwen3-Coder 30B-A3B	Alibaba	MoE (~3.3B active)	yes
Gemma 4 26B-A4B	Google	MoE (~4B active)	yes
GLM-4.5-Air 106B-A12B	Zhipu	MoE (~12B active)	no, deliberately
GPT-OSS 120B-A5.1B	OpenAI	MoE (~5.1B active)	no, deliberately

(Gemma 4 is real — Google's newest release as of this writing, not a Gemma 3 typo.)

3 prompt tiers (short/medium/long), concurrency 1 and 8, 2 reps per cell, 15 backend×model pairs total. Caveat stated up front: the first three models ran against my production Ollama (OLLAMA_NUM_PARALLEL=1, serialized by default — real daily-use config); GLM and GPT-OSS ran against a separate isolated instance (OLLAMA_NUM_PARALLEL=4) since they needed a clean volume anyway. Ollama's concurrency=8 numbers for the first three models are not its concurrency ceiling — they're its actual default production behavior.

Concurrency, inside 24GB

Aggregate decode tok/s, concurrency 1 → concurrency 8:

Model	Ollama	llama.cpp	vLLM
Gemma 3 1B	125.6 → 71.4	294.1 → 400.6	235.5 → 1172.1
Qwen3-Coder 30B-A3B	129.3 → 108.4	157.2 → 183.9	172.0 → 677.9
Gemma 4 26B-A4B	84.5 → 78.5	118.8 → 220.6	133.8 → 723.4

vLLM's own c1→c8 scaling: 3.9x-5.4x (continuous batching over a paged KV cache, so new requests join the running batch instead of waiting for it to drain). llama.cpp's, even with -np 8 matched to the concurrency level: 1.2x-1.9x. It pre-declares a fixed KV-cache reservation per parallel slot before the server starts, so concurrency is a config decision, not a runtime one. Ollama's serialized default goes the other way: aggregate throughput at c8 is 0.57x-0.93x of its c1 figure. Head-to-head at c8: vLLM beats llama.cpp by 2.9x-3.7x and Ollama's serialized default by 6.3x-16.4x (caveat above applies).
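The ratios quoted above can be reproduced directly from the table. This is a small sketch of that arithmetic; the dictionary layout and function names are my own, only the numbers come from the benchmark.

```python
# Aggregate decode tok/s as (concurrency 1, concurrency 8), copied from the table.
RESULTS = {
    "Gemma 3 1B": {
        "ollama": (125.6, 71.4), "llama.cpp": (294.1, 400.6), "vllm": (235.5, 1172.1),
    },
    "Qwen3-Coder 30B-A3B": {
        "ollama": (129.3, 108.4), "llama.cpp": (157.2, 183.9), "vllm": (172.0, 677.9),
    },
    "Gemma 4 26B-A4B": {
        "ollama": (84.5, 78.5), "llama.cpp": (118.8, 220.6), "vllm": (133.8, 723.4),
    },
}


def scaling(backend):
    """c8 / c1 aggregate throughput for one backend, per model."""
    return {model: row[backend][1] / row[backend][0] for model, row in RESULTS.items()}


def head_to_head(a, b):
    """Backend a's c8 throughput divided by backend b's c8 throughput, per model."""
    return {model: row[a][1] / row[b][1] for model, row in RESULTS.items()}


def span(ratios):
    """(min, max) over the models, rounded to one decimal place."""
    return round(min(ratios.values()), 1), round(max(ratios.values()), 1)


print("vLLM c1->c8:       ", span(scaling("vllm")))
print("llama.cpp c1->c8:  ", span(scaling("llama.cpp")))
print("vLLM vs llama.cpp: ", span(head_to_head("vllm", "llama.cpp")))
print("vLLM vs Ollama:    ", span(head_to_head("vllm", "ollama")))
```

Running `scaling("ollama")` on the same data shows every ratio below 1.0, which is the serialized default showing up as lost aggregate throughput under load.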

The cliff, and vLLM's wall

GLM-4.5-Air (~52% of layers spilled to system RAM under llama.cpp's tuning) and GPT-OSS-120B (~67% spilled) were picked specifically to not fit. llama.cpp and Ollama both ran them — slow, single-digit tok/s, but real generation, no crash. vLLM failed outright on both:

# GPT-OSS-120B, native MXFP4, --cpu-offload-gb 45
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 533.69 MiB is free.
Process ... has 22.21 GiB memory in use.
RuntimeError: Engine core initialization failed.

# GLM-4.5-Air, pre-quantized AWQ, --cpu-offload-gb 36
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 685.69 MiB is free.
Process ... has 22.12 GiB memory in use.

Same shape, different model, different quantization path. I retried GLM at --gpu-memory-utilization 0.78 (down from 0.90, to force more declared headroom) — got the byte-for-byte identical error: 22.12 GiB used, 685.69 MiB free, 1.16 GiB requested. That rules out the utilization knob as the fix; the base weight + offload footprint is already pinned at the ceiling before profiling starts. Two models, two quant schemes, same ~22GB wall — reads as a real limit of vLLM's CPU-offload path for >100B-param MoE on one 24GB card on this stack, not a per-model quirk.
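To compare the two failures side by side, the OOM messages can be parsed and the gap between requested and free memory computed. A minimal sketch, assuming the message layout shown in the two log excerpts above; the regex and helper name are mine and may not match other PyTorch versions' wording.

```python
import re

# Matches the three fields quoted in the logs above. re.DOTALL lets ".*?"
# span the line breaks between them.
OOM_RE = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) GiB\."
    r".*?total capacity of (?P<total>[\d.]+) GiB"
    r" of which (?P<free>[\d.]+) MiB is free\."
    r".*?has (?P<used>[\d.]+) GiB memory in use",
    re.DOTALL,
)


def parse_oom(log):
    """Extract requested/free/used/total memory (GiB) from a CUDA OOM message."""
    match = OOM_RE.search(log)
    if match is None:
        raise ValueError("no CUDA OOM message found")
    requested = float(match["requested"])
    free_gib = float(match["free"]) / 1024  # the log reports free memory in MiB
    return {
        "requested_gib": requested,
        "free_gib": free_gib,
        "used_gib": float(match["used"]),
        "total_gib": float(match["total"]),
        "shortfall_gib": requested - free_gib,
    }


GPT_OSS_LOG = """OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 533.69 MiB is free.
Process ... has 22.21 GiB memory in use."""

GLM_LOG = """OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 685.69 MiB is free.
Process ... has 22.12 GiB memory in use."""

for name, log in (("GPT-OSS-120B", GPT_OSS_LOG), ("GLM-4.5-Air", GLM_LOG)):
    info = parse_oom(log)
    print(f"{name}: used {info['used_gib']:.2f} GiB, "
          f"short by {info['shortfall_gib']:.2f} GiB")
```

Both runs come up roughly half a GiB short on a single allocation, with used memory within 0.1 GiB of each other.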

TTFT: the 37x gap that steady-state doesn't show

On the two RAM-spill models, which ran only under llama.cpp and Ollama, steady-state decode is nearly a tie once warmed up. GPT-OSS-120B's longest tier: 7.65 tok/s (llama.cpp) vs 7.6 tok/s (Ollama). GLM: 4.58 vs 4.59. Time-to-first-token is a different story:

Model	Ollama TTFT	llama.cpp TTFT	Gap
GLM-4.5-Air	13.6s	8.1s	1.7x
GPT-OSS-120B	274.0s	7.3s	37x

llama.cpp's -ngl is a number I computed myself from the model's real config.json (layer count, per-layer size): -ngl 12 for GPT-OSS, deliberately placing ~21GB of weights on the GPU and leaving the rest in system RAM. Ollama figures the split out automatically at load time, and on a freshly pulled, partially RAM-resident 65GB model, that automatic path is expensive. Both backends end up at the same steady-state speed; they take very different routes to the first token.
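The -ngl arithmetic can be sketched as below. This is an illustration, not the exact calculation used for the benchmark: it assumes all layers are the same size, the 36-layer count is my recollection of the GPT-OSS-120B config (check num_hidden_layers in your own config.json), and the 2GB reserve for KV cache and compute buffers is a hypothetical value.

```python
import math


def estimate_ngl(n_layers, model_gb, vram_gb, reserve_gb):
    """Estimate how many layers fit on the GPU.

    Assumes every layer is the same size (model_gb / n_layers), which ignores
    embeddings and output heads. reserve_gb is VRAM held back for the KV
    cache, compute buffers, and the CUDA context.
    """
    per_layer_gb = model_gb / n_layers
    budget_gb = vram_gb - reserve_gb
    ngl = max(0, min(n_layers, math.floor(budget_gb / per_layer_gb)))
    return ngl, ngl * per_layer_gb


# 65GB model size is from the post; 36 layers and the 2GB reserve are assumptions.
ngl, gpu_gb = estimate_ngl(n_layers=36, model_gb=65.0, vram_gb=24.0, reserve_gb=2.0)
spilled = 1 - ngl / 36

print(f"-ngl {ngl}: ~{gpu_gb:.1f}GB on GPU, ~{spilled:.0%} of layers in system RAM")
```

Under those assumptions the estimate lands on -ngl 12 with about two thirds of the layers in system RAM, in line with the ~67% spill reported for GPT-OSS-120B.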

What it costs (BGN per 1M output tokens, real GPU energy)

Model	Ollama	llama.cpp	vLLM
Gemma 3 1B	0.19	0.05	~0*
Gemma 4 26B-A4B	0.25	0.14	0.04
Qwen3-Coder 30B-A3B	0.16	0.13	0.04
GLM-4.5-Air	2.61	1.95	OOM
GPT-OSS-120B	10.00	1.43	OOM

*vLLM's Gemma 3 1B run finished in 6s — too fast for the power sampler to catch a reading, recorded near-zero. A sampling limitation on short bursts, not a genuine free result.

GPT-OSS-120B on Ollama costs about 7x as much electricity per million tokens as llama.cpp for the identical model (10.00 vs 1.43 BGN): the TTFT convenience tax from above, showing up again in currency.
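For readers who want to reproduce a cost-per-token figure on their own hardware, here is one way to get from power samples to currency. This is a generic sketch, not HomeLab Monitor's actual implementation; the sample values, token count, and electricity price are all hypothetical.

```python
def energy_kwh(samples):
    """Integrate (time_s, watts) samples with the trapezoid rule, return kWh."""
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        joules += (w0 + w1) / 2 * (t1 - t0)
    return joules / 3.6e6  # 1 kWh = 3.6 million joules


def cost_per_million_tokens(samples, output_tokens, price_per_kwh):
    """Energy cost per 1M output tokens, in whatever currency the price is in."""
    return energy_kwh(samples) * price_per_kwh / output_tokens * 1_000_000


# Hypothetical run: a steady 300 W for 100 s producing 10,000 output tokens,
# priced at a made-up 0.30 per kWh.
steady = [(0.0, 300.0), (50.0, 300.0), (100.0, 300.0)]
print(cost_per_million_tokens(steady, output_tokens=10_000, price_per_kwh=0.30))

# A run so short that the sampler only fires once has nothing to integrate.
too_short = [(0.0, 300.0)]
print(cost_per_million_tokens(too_short, output_tokens=10_000, price_per_kwh=0.30))
```

The second call returns zero because a single sample spans no time, which is the same failure mode as the near-zero Gemma 3 1B reading in the table.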

Three disclosed vLLM checkpoint swaps

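The original plan was on-the-fly bitsandbytes 4-bit quant for every vLLM leg. It did not survive contact with the MoE models: the three below each needed a different checkpoint, for three distinct, verified reasons (two hard failures and one practicality call), not the same error copy-pasted three times: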

Qwen3-Coder-30B: ValueError: BitsAndBytes quantization with padded hidden_size ... Parameter shape (786432, 1) != checkpoint shape (2048, 768) — bnb can't dequantize this MoE's padded expert layout. Fix: pre-quantized AWQ checkpoint. Ran clean after (677.9 tok/s aggregate @ c8).
Gemma 4 26B-A4B: AttributeError: MoE Model Gemma4ForConditionalGeneration does not support BitsAndBytes quantization yet. A new architecture, bnb path not wired up yet. Fix: a different pre-quantized checkpoint — which then hit a pydantic error because its config.json says compressed-tensors, not AWQ, despite the repo name. Fixed by dropping the explicit --quantization flag entirely and letting vLLM auto-detect.
GLM-4.5-Air: not a failure — a practicality call. Skipped a 212GB native bf16 download to test a bnb+MoE+CPU-offload combo the vLLM community already flagged as shaky, went straight to a ~63GB pre-quantized AWQ checkpoint that tests the exact same question.
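The Gemma 4 mismatch is cheap to catch before launching a container by reading the checkpoint's config.json first. A minimal sketch, assuming the checkpoint follows the common Hugging Face layout where the method sits under quantization_config.quant_method; the example config fragment and helper names are hypothetical.

```python
import json


def declared_quant_method(config_text):
    """Return quantization_config.quant_method from a config.json string, or None."""
    config = json.loads(config_text)
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method")


def quantization_args(config_text, assumed):
    """Build the --quantization argument list for the server command.

    If the method assumed from the repo name disagrees with what the
    checkpoint declares, pass no flag at all and leave detection to vLLM.
    """
    declared = declared_quant_method(config_text)
    if declared is not None and declared != assumed:
        return []
    return ["--quantization", assumed]


# Hypothetical config.json fragment for a checkpoint whose repo name says "AWQ".
EXAMPLE_CONFIG = json.dumps({
    "architectures": ["Gemma4ForConditionalGeneration"],
    "quantization_config": {"quant_method": "compressed-tensors"},
})

print(declared_quant_method(EXAMPLE_CONFIG))
print(quantization_args(EXAMPLE_CONFIG, assumed="awq"))
```

With the fragment above, the helper reports compressed-tensors and returns an empty argument list, which corresponds to the fix of dropping the explicit flag.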

Every root cause above came from the actual container logs, not from assuming precedent carried over from the previous model's failure.

What wasn't tested

Only two --gpu-memory-utilization values before accepting the OOM as final, not a full --cpu-offload-gb sweep. No multi-GPU / tensor-parallel vLLM path — a different question from "does single-card CPU offload work." Ollama's c8 numbers for the first three models are its production default, not its concurrency ceiling. And one raw llama.cpp per-request timing (Gemma 4, medium tier, c8) self-reported an impossible 250,024 tok/s from a near-zero-duration completion — the aggregate figures used throughout are total-tokens-over-wall-time, which isn't corrupted by that, but it's a known rough edge in the raw per-request logs.
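The difference between the two throughput definitions is easy to see with a toy example. The numbers below are hypothetical and chosen only to show how one near-zero self-reported duration blows up a per-request rate while the wall-clock aggregate stays sane.

```python
def per_request_rate(tokens, duration_s):
    """Self-reported rate for one request: its tokens over its own duration."""
    return tokens / duration_s


def aggregate_rate(requests, wall_s):
    """Total tokens across all requests divided by wall-clock time for the batch."""
    return sum(tokens for tokens, _ in requests) / wall_s


# Hypothetical batch of 8 requests as (output_tokens, self_reported_duration_s).
# The last one reports a near-zero duration, as in the Gemma 4 glitch.
requests = [(256, 9.8)] * 7 + [(256, 0.001)]
wall_s = 10.0  # measured by the benchmark harness, not by the server

rates = [per_request_rate(tokens, duration) for tokens, duration in requests]
print(f"max per-request rate: {max(rates):,.0f} tok/s")
print(f"aggregate rate:       {aggregate_rate(requests, wall_s):.1f} tok/s")
```

The aggregate never divides by a per-request duration, so a bad timestamp in one request cannot inflate it.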

Full narrative version, with the RAM-spill mechanics and the redacted dashboard screenshot: on Medium.

Every number above was priced through HomeLab Monitor — open source, MIT licensed — against the RTX 3090's real power draw.

If you're already running one of these three backends: has yours ever tried to load something that just didn't fit — and did it fail loud or fail quiet?