Vasyl

Posted on May 13 • Edited on May 14 • Originally published at vasyl.blog

I put Ollama on a 4 GB mobile GPU and got 2.5×: here's the VRAM math

#devchallenge #gemmachallenge #gemma #ollama

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

📎 Companion piece to my earlier post: I shipped local LLM features two months ago — production never ran them once. Same gemma4:e2b, same box — this one is the GPU offload follow-up.

🔬 TL;DR

2.5× faster, 10 °C cooler, on a 4 GB laptop GPU that "shouldn't" fit the model.

Metric	CPU only	GPU hybrid
Tokens / sec	17	39
Per-call latency	~5.5 s	~2.0 s
CPU temp under burst	hot	−10 °C
Layers on GPU	0	35 / 36

Same prompt. Same model. Same hardware. The only thing that changed was whether Ollama was allowed to touch the card.

Honest take: I was hoping for more. The math at the end of this post explains exactly why 2.5× is the ceiling on 4 GB of VRAM with Gemma 4, and what it would take to push higher.

⚙️ Setup


Model	`gemma4:e2b` (2 B effective params, ~7.2 GB on disk)
CPU	AMD Ryzen 5 4600H, 6 cores / 12 threads
GPU	NVIDIA GTX 1650 Ti Mobile, 4 GB VRAM
OS / runtime	Ubuntu + Docker, Ollama 0.23.1
Prompt	Distractor + hint + explanation generator from my reader app — fixed across runs
Output budget	~60 tokens per call
Control	`num_gpu=0` → CPU only · `num_gpu=999` → let Ollama auto-split
Warm-up	One throwaway call per mode before the timed samples

Both modes ran after warm-up, so the numbers reflect steady-state inference, not first-load cost. Each /api/generate response came back as NDJSON, so I pulled eval_count, eval_duration, and total_duration straight from the engine's final record (durations are reported in nanoseconds), with no external timing noise.
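To make the metric extraction concrete, here is a minimal Python sketch. It assumes the streamed NDJSON body has already been captured as a string and that the timing fields sit on the last record. The helper name `final_metrics` and the sample payload are illustrative only; they are not taken from the app.

```python
import json


def final_metrics(ndjson_text: str) -> dict:
    """Return timing metrics from the last record of an NDJSON response body."""
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    if not records:
        raise ValueError("empty response body")
    last = records[-1]  # timing fields arrive on the final ("done": true) record
    eval_seconds = last["eval_duration"] / 1e9  # nanoseconds -> seconds
    return {
        "tokens": last["eval_count"],
        "eval_ms": last["eval_duration"] / 1e6,    # nanoseconds -> milliseconds
        "total_ms": last["total_duration"] / 1e6,  # nanoseconds -> milliseconds
        "tokens_per_sec": last["eval_count"] / eval_seconds,
    }


# Illustrative payload with made-up values (not a measured run).
sample = "\n".join([
    json.dumps({"response": "ware", "done": False}),
    json.dumps({"response": "", "done": True, "eval_count": 50,
                "eval_duration": 2_000_000_000, "total_duration": 3_000_000_000}),
])
metrics = final_metrics(sample)
```

Taking the last record rather than summing over chunks keeps the numbers identical to what the engine itself reports.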

🎯 Why I picked E2B

Gemma 4 ships in three flavours — the small E2B/E4B family, a 31B Dense model, and a 26B MoE. The model that runs in this benchmark is the smallest of those, and that wasn't accidental.

The work is a fire-and-forget enrichment step inside a vocabulary-save flow — distractors plus a hint plus a short explanation, all generated in one call. It has to feel synchronous on a save action, and it has to run on the same commodity laptop as the rest of the app. Anything bigger is the wrong tool.

The 31B Dense doesn't fit. The 26B MoE would, but its VRAM patterns on a 4 GB card are punishing. E4B is the obvious step up in quality from E2B, but its size pushes total memory over the line where Ollama has to keep more on CPU — slower for the same job at the latency profile a save action needs. E2B at Q4 lands the quality where I need it for distractor generation while leaving headroom for the KV cache and everything else.

The framing that matters here isn't "the biggest model I could fit" but "the smallest model that gave me the output I needed." On constrained hardware, that distinction is the whole game — and it's what made the GPU experiment below worth running at all.

📊 Results

Metric	CPU only	GPU hybrid (35/36 layers on GPU)	Δ
Avg output tokens / call	60	55	~same
Avg eval latency (token gen only)	3,506 ms	1,411 ms	2.49× faster
Avg total latency (prompt + gen)	5,390 ms	2,174 ms	2.48× faster
Tokens / sec	17	39	2.29× faster
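The Δ column can be rechecked from the table's own averages. The sketch below uses only the rounded figures shown above, so the last digit may differ slightly from ratios computed on the raw samples. It also shows why the throughput ratio comes out lower than the latency ratio: the GPU runs produced slightly fewer tokens per call.

```python
# Rounded averages copied from the results table above.
cpu = {"tokens": 60, "eval_ms": 3506, "total_ms": 5390}
gpu = {"tokens": 55, "eval_ms": 1411, "total_ms": 2174}


def tokens_per_sec(run: dict) -> float:
    """Generation throughput: output tokens divided by eval time in seconds."""
    return run["tokens"] / (run["eval_ms"] / 1000)


eval_speedup = cpu["eval_ms"] / gpu["eval_ms"]              # latency ratio, generation only
total_speedup = cpu["total_ms"] / gpu["total_ms"]           # latency ratio, prompt + generation
throughput_ratio = tokens_per_sec(gpu) / tokens_per_sec(cpu)

# Throughput ratio = eval speedup scaled by the token-count ratio (55 / 60).
token_ratio = gpu["tokens"] / cpu["tokens"]

print(f"eval speedup:     {eval_speedup:.2f}x")
print(f"total speedup:    {total_speedup:.2f}x")
print(f"throughput ratio: {throughput_ratio:.2f}x")
```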

ollama ps during the GPU run:

NAME          SIZE      PROCESSOR        CONTEXT   UNTIL
gemma4:e2b    7.8 GB    74%/26% CPU/GPU  4096      Forever

nvidia-smi during a generation:

NVIDIA GTX 1650 Ti, used 1998 MiB, free 1909 MiB, util 32 %

⚠️ ollama ps lies to you.
That "74%/26% CPU/GPU" string is a memory split, not a layer split: 26% of 7.8 GB is roughly 2 GB, which lines up with the 1998 MiB that nvidia-smi reports. The Ollama server logs are the only place that tells you which layers actually moved. Mine showed offloaded 35/36 layers to GPU. Almost the whole transformer, minus one layer that matters a lot. More on that in a second.
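If you want to check the layer split in a script rather than by eye, a small regex is enough. This is a sketch that matches only the message quoted above ("offloaded 35/36 layers to GPU"); the function name `offload_split` is hypothetical, and other Ollama versions may word the line differently.

```python
import re

# Pattern built from the log message quoted in this post.
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def offload_split(log_text: str):
    """Return (layers_on_gpu, total_layers) from server log text, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


on_gpu, total = offload_split("offloaded 35/36 layers to GPU")
layers_left_on_cpu = total - on_gpu
```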

🧠 Why 2.5× and not 10×

The load log counts 36 layers for this model, and Ollama put 35 of them on the GPU. The lone holdout is the output projection layer, the one that maps the final hidden state back into Gemma's vocabulary.

Gemma 4's vocab is enormous (~256k tokens). That output layer is dense, fat, and would happily swallow what's left of the 4 GB after the rest of the stack moves over. So Ollama leaves it on CPU.

The consequence is brutal in the steady state:

💡 Every single generated token has to round-trip through the CPU at the end. GPU is fast for the 35 layers it owns, then the pipeline stalls on the one layer the GPU couldn't take. Average across thousands of tokens and the CPU side becomes the floor.

That's the whole story of 2.5× instead of 10×. Hybrid inference is gated by the slower of the two devices, and on this card the slower device is doing real work on every token.
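A rough way to put numbers on "gated by the slower device" is an Amdahl's-law style model. This is a simplification added for illustration, not something measured in this post: it assumes each token's time splits into a part the GPU accelerates and a part that stays at CPU speed, and it ignores transfer overhead. The function names are hypothetical.

```python
import math


def hybrid_speedup(accelerated_fraction: float, gpu_factor: float) -> float:
    """Amdahl's law: overall speedup when only part of the work gets faster.

    accelerated_fraction: share of the original per-token time that moves to the GPU.
    gpu_factor: how many times faster that share runs on the GPU.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if gpu_factor <= 0:
        raise ValueError("gpu_factor must be positive")
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / gpu_factor
    return 1.0 / remaining


def max_cpu_bound_fraction(observed_speedup: float) -> float:
    """Upper bound on the share of original time still running at CPU speed."""
    return 1.0 / observed_speedup


# Even an infinitely fast GPU on 60% of the work cannot exceed 2.5x overall.
ceiling = hybrid_speedup(0.6, math.inf)
```

Read in reverse, an observed speedup of about 2.5× means that, under this model, at most roughly 40% of the original per-token time can still be running at CPU speed.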

The takeaway worth bolding: if you only ever look at ollama ps, you'll get the wrong picture of what your setup is doing. The server load logs are the source of truth for which layers went where.

💡 What 2.5× actually buys you

In the app, a single save — distractors + hint + short explanation, ~60 output tokens — used to take 5.5 s. Now it's just over 2 s.

That moves the action from the "is this hanging?" zone into the "yeah, it's working" zone. That's the threshold that actually matters for a save action.

Five saves in a row:

Before: ~30 seconds of full-tilt CPU
After: ~10 seconds, work split between CPU and GPU
Bonus: peak CPU temperature during that burst dropped ~10 °C

On a thin laptop in a small room, that last number is the difference between a fan you hear and a fan you don't.
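The burst figures follow from the per-call totals in the results table. A quick sketch, assuming the five saves run back to back with no idle time between them (the helper name is hypothetical):

```python
def burst_seconds(calls: int, total_ms_per_call: float) -> float:
    """Wall-clock time for sequential calls with no gaps between them."""
    return calls * total_ms_per_call / 1000


before = burst_seconds(5, 5390)  # CPU only, average total latency from the table
after = burst_seconds(5, 2174)   # GPU hybrid, average total latency from the table
saved = before - after

print(f"before: {before:.1f} s, after: {after:.1f} s, saved: {saved:.1f} s")
```

The results land in the same ballpark as the rounded ~30 s and ~10 s quoted above.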

🚀 What would push it higher

Three options, in order of how willing I am to do them:

Smaller quant on just the output layer. If that layer fit in the remaining ~1.9 GB, the whole model would run on GPU and you'd see the 10× numbers other writeups quote. The cost is real quality loss on the output distribution — worth measuring on your own prompt set rather than assuming.
A bigger GPU. A 16 GB card holds the whole thing with room to spare. The point of this exercise was specifically "what does a commodity laptop GPU do", so a $500 desktop card isn't really in scope.
Swap engines. llama.cpp direct, vLLM, etc. This one is last because two seconds is already inside budget for the action this model powers. Optimising past "fast enough" is how you end up with three benchmarks and zero users.

🛠️ Reproducing this

# 1. Pull the model
ollama pull gemma4:e2b

# 2. Force CPU only
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "Give me 5 distractors for the word \"warehouse\".",
  "stream": false,
  "options": { "num_gpu": 0 }
}' | jq '{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'

# 3. Let Ollama use the GPU
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "Give me 5 distractors for the word \"warehouse\".",
  "stream": false,
  "options": { "num_gpu": 999 }
}' | jq '{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'

# 4. Check what actually landed where (assumes the Ollama container is named "ollama")
docker logs ollama 2>&1 | grep -E "offloaded|layers"
nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv

Run each curl a handful of times, discard the first call as warm-up, then average eval_ms and total_ms. The interesting number is the ratio, not the absolute timings, which will vary with your CPU.
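The averaging step can also be scripted. Below is a minimal Python sketch using only the standard library; it mirrors the curl calls above (same endpoint, model, and num_gpu option). The function names are hypothetical, and `generate` is defined but not called here, since it needs a running Ollama server.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl calls


def generate(num_gpu: int, prompt: str, model: str = "gemma4:e2b") -> dict:
    """POST one non-streaming request and return the parsed JSON response."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(responses: list) -> dict:
    """Average eval and total latency in milliseconds (durations arrive in ns)."""
    return {
        "eval_ms": statistics.mean(r["eval_duration"] / 1e6 for r in responses),
        "total_ms": statistics.mean(r["total_duration"] / 1e6 for r in responses),
    }


def speedup(cpu_runs: list, gpu_runs: list) -> float:
    """Ratio of average total latency, CPU-only over GPU hybrid."""
    return summarize(cpu_runs)["total_ms"] / summarize(gpu_runs)["total_ms"]
```

A typical use would be `runs = [generate(0, prompt) for _ in range(6)][1:]` to drop the warm-up call, the same again with `num_gpu=999`, and then `speedup(cpu_runs, gpu_runs)`.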

✅ Takeaways

4 GB VRAM is enough to be useful, even on a model that "should" need more. Just don't expect 10×.
Hybrid inference is gated by the slower device. If one critical layer stays on CPU, that's your floor.
Trust the load logs, not ollama ps. The pretty CPU/GPU percentage is a memory split, not a layer count.
2.5× is the difference between a UX that feels broken and one that doesn't. That's enough.
Stop optimising once you're inside budget. "Fast enough" beats "fastest" every time.

📖 Full write-up with all the load-log spelunking on my blog: vasyl.blog — I put Ollama on a 4 GB mobile GPU and got 2.5×

⭐ The reader app this powers is open-source (AGPL-3.0): github.com/mrviduus/textstack

Built with gemma4:e2b for the Gemma 4 Challenge. If you're entering too, drop a link in the comments — happy to read yours.

Top comments (3)

HARD IN SOFT OUT • May 13

I didn't realize you were building this for the Gemma 4 Challenge — that actually makes the whole journey even more interesting. The first post wasn't a dead feature, it was your testing ground before the real submission. And now the GPU experiment proves you've pushed the model to a tangible 2.5× improvement on a tiny 4 GB card.

The "one specific layer caps it" bit intrigues me. If it's a memory-heavy layer, maybe partial offload (just that layer to CPU, keep the rest on GPU) could squeeze out a bit more speed for the challenge. Or is that already in your submission?

Either way, turning early production silence into a concrete VRAM math breakdown is a solid challenge entry. Good luck in the judging.

Vasyl • May 13

Appreciate the read — and "testing ground for the real submission" is a more generous framing than I gave it. The 35/36 split IS the partial offload you're describing: the lone CPU layer is the output projection (Gemma's 256k vocab makes it fat). The next move is quantising just that layer further, which I haven't measured the quality hit on yet.

HARD IN SOFT OUT • May 13

The 256k vocab output layer being the bottleneck is almost poetic — the model's biggest strength (vocabulary coverage) is also its heaviest part. Beyond quantizing further, I wonder if you could temporarily prune that layer's least‑used tokens at inference time, based on the input context. A dynamic, context‑aware "vocab filter" before the output projection could slim it down without permanent quality loss. Not trivial to implement, but on a 4 GB card every megabyte counts.