This is a submission for the Gemma 4 Challenge: Write About Gemma 4
Companion piece to my earlier post: "I shipped local LLM features two months ago - production never ran them once." Same gemma4:e2b, same box; this one is the GPU offload follow-up.
TL;DR
2.5× faster, 10°C cooler, on a 4 GB laptop GPU that "shouldn't" fit the model.
| | CPU only | GPU hybrid |
|---|---|---|
| Tokens / sec | 17 | 39 |
| Per-call latency | ~5.5 s | ~2.0 s |
| CPU temp under burst | hot | −10 °C |
| Layers on GPU | 0 | 35 / 36 |
Same prompt. Same model. Same hardware. The only thing that changed was whether Ollama was allowed to touch the card.
Honest take: I was hoping for more. The math at the end of this post explains exactly why 2.5× is the ceiling on 4 GB of VRAM with Gemma 4, and what it would take to push higher.
Setup
| Component | Details |
|---|---|
| Model | gemma4:e2b (2 B effective params, ~7.2 GB on disk) |
| CPU | AMD Ryzen 5 4600H, 6 cores / 12 threads |
| GPU | NVIDIA GTX 1650 Ti Mobile, 4 GB VRAM |
| OS / runtime | Ubuntu + Docker, Ollama 0.23.1 |
| Prompt | Distractor + hint + explanation generator from my reader app, fixed across runs |
| Output budget | ~60 tokens per call |
| Control | num_gpu=0 → CPU only · num_gpu=999 → let Ollama auto-split |
| Warm-up | One throwaway call per mode before the timed samples |
Both modes ran after warm-up, so the numbers reflect steady-state inference, not first-load cost. Each /api/generate response came back as NDJSON, so I pulled eval_count, eval_duration, and total_duration straight from the engine, with no external timing noise.
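As a rough sketch of that extraction in Python: Ollama reports eval_duration and total_duration in nanoseconds, so throughput is eval_count divided by eval_duration in seconds. The helper name and the sample values are illustrative (picked to sit close to the CPU-only averages in the results), not a captured response.

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput from one /api/generate response.

    Ollama reports durations in nanoseconds, so scale to seconds first.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Illustrative values close to the CPU-only averages in the results table,
# not a real captured response.
sample = {
    "eval_count": 60,
    "eval_duration": 3_506_000_000,
    "total_duration": 5_390_000_000,
}

print(f"{tokens_per_second(sample):.1f} tokens/sec")
print(f"{sample['total_duration'] / 1e6:.0f} ms total")
```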
Why I picked E2B
Gemma 4 ships in three flavours: the small E2B/E4B family, a 31B Dense model, and a 26B MoE. The model that runs in this benchmark is the smallest of those, and that wasn't accidental.
The work is a fire-and-forget enrichment step inside a vocabulary-save flow: distractors plus a hint plus a short explanation, all generated in one call. It has to feel synchronous on a save action, and it has to run on the same commodity laptop as the rest of the app. Anything bigger is the wrong tool.
The 31B Dense doesn't fit. The 26B MoE would, but its VRAM patterns on a 4 GB card are punishing. E4B is the obvious step up in quality from E2B, but its size pushes total memory over the line where Ollama has to keep more on CPU, which makes it slower for the same job at the latency profile a save action needs. E2B at Q4 lands the quality where I need it for distractor generation while leaving headroom for the KV cache and everything else.
The framing that matters here isn't "the biggest model I could fit" but "the smallest model that gave me the output I needed." On constrained hardware, that distinction is the whole game, and it's what made the GPU experiment below worth running at all.
Results
| Metric | CPU only | GPU hybrid (35/36 layers on GPU) | Δ |
|---|---|---|---|
| Avg output tokens / call | 60 | 55 | ~same |
| Avg eval latency (token gen only) | 3,506 ms | 1,411 ms | 2.49× faster |
| Avg total latency (prompt + gen) | 5,390 ms | 2,174 ms | 2.48× faster |
| Tokens / sec | 17 | 39 | 2.29× faster |
ollama ps during the GPU run:
NAME SIZE PROCESSOR CONTEXT UNTIL
gemma4:e2b 7.8 GB 74%/26% CPU/GPU 4096 Forever
nvidia-smi during a generation:
NVIDIA GTX 1650 Ti, used 1998 MiB, free 1909 MiB, util 32 %
ollama ps lies to you.
That "74%/26% CPU/GPU" string is a memory split, not a layer split. The Ollama server logs are the only place that tells you which layers actually moved. Mine showed "offloaded 35/36 layers to GPU": almost the whole transformer, minus one layer that matters a lot. More on that in a second.
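A quick sanity check on the memory-split reading, sketched in Python: 26% of the 7.8 GB that ollama ps reports lands close to the ~2 GB that nvidia-smi shows in use. This assumes ollama ps reports decimal gigabytes and that nearly all of the used VRAM belongs to the model; both are assumptions, not something the logs confirm.

```python
# Figures copied from the ollama ps and nvidia-smi output above.
total_size_gb = 7.8         # SIZE column from ollama ps (assumed decimal GB)
gpu_fraction = 0.26         # GPU share from "74%/26% CPU/GPU"
nvidia_smi_used_mib = 1998  # memory.used from nvidia-smi

gpu_bytes = total_size_gb * 1e9 * gpu_fraction
gpu_mib = gpu_bytes / 2**20

print(f"GPU share of model memory: {gpu_mib:.0f} MiB")
print(f"nvidia-smi reports:        {nvidia_smi_used_mib} MiB")

# If the percentage were a layer split, 35 of 36 layers would be about 97%, not 26%.
print(f"35/36 layers as a percentage: {35 / 36:.0%}")
```

The two memory figures agree to within a few percent, which is what you would expect if the percentage describes bytes rather than layers.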
Why 2.5× and not 10×
The model has 36 transformer layers. Ollama put 35 of them on the GPU. The lone holdout is the output projection layer, the one that maps the final hidden state back into Gemma's vocabulary.
Gemma 4's vocab is enormous (~256k tokens). That output layer is dense, fat, and would happily swallow what's left of the 4 GB after the rest of the stack moves over. So Ollama leaves it on CPU.
The consequence is brutal in the steady state:
Every single generated token has to round-trip through the CPU at the end. The GPU is fast for the 35 layers it owns, then the pipeline stalls on the one layer the GPU couldn't take. Average across thousands of tokens and the CPU side becomes the floor.
That's the whole story of 2.5× instead of 10×. Hybrid inference is gated by the slower of the two devices, and on this card the slower device is doing real work on every token.
The takeaway worth bolding: if you only ever look at ollama ps, you'll get the wrong picture of what your setup is doing. The server load logs are the source of truth for which layers went where.
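One way to put numbers on that floor is Amdahl's law, sketched here in Python. This is a simplified model assumed for illustration, not something Ollama reports: a fraction p of the CPU-only per-token time runs s times faster on the GPU, and the rest stays at CPU speed. Under that model, the measured end-to-end speedup puts an upper bound on how much of the original time can still be running at CPU speed. The p and s values in the last line are hypothetical.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work runs s times faster."""
    return 1.0 / ((1.0 - p) + p / s)


def max_unaccelerated_fraction(observed_speedup: float) -> float:
    """Upper bound on the fraction of time left at CPU speed.

    Even with an infinitely fast GPU (s -> infinity) the speedup tends to
    1 / (1 - p), so 1 - p can be at most 1 / observed_speedup.
    """
    return 1.0 / observed_speedup


observed = 5390 / 2174  # total latency ratio from the results table
print(f"observed speedup: {observed:.2f}x")
print(f"time left at CPU speed: at most {max_unaccelerated_fraction(observed):.0%}")

# Hypothetical illustration: if 65% of the time were accelerated 10x,
# the overall gain would still be far below 10x.
print(f"p=0.65, s=10 -> {amdahl_speedup(0.65, 10):.2f}x")
```

The design point is that the unaccelerated share, not the GPU's raw speed, sets the ceiling: making the GPU part faster barely moves the total once the CPU part dominates.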
What 2.5× actually buys you
In the app, a single save (distractors + hint + short explanation, ~60 output tokens) used to take 5.5 s. Now it's just over 2 s.
That moves the action from the "is this hanging?" zone into the "yeah, it's working" zone. That's the threshold that actually matters for a save action.
Five saves in a row:
- Before: ~30 seconds of full-tilt CPU
- After: ~10 seconds, work split between CPU and GPU
- Bonus: peak CPU temperature during that burst dropped ~10 °C
On a thin laptop in a small room, that last number is the difference between a fan you hear and a fan you don't.
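The burst figures follow from the per-call averages. A back-of-the-envelope check in Python, using the average total latencies from the results table and assuming the five saves run strictly back to back with no queueing or idle gaps:

```python
saves = 5
cpu_only_ms = 5390    # avg total latency, CPU only
gpu_hybrid_ms = 2174  # avg total latency, GPU hybrid

cpu_burst_s = saves * cpu_only_ms / 1000
gpu_burst_s = saves * gpu_hybrid_ms / 1000

print(f"CPU only:   {cpu_burst_s:.1f} s")
print(f"GPU hybrid: {gpu_burst_s:.1f} s")
print(f"saved:      {cpu_burst_s - gpu_burst_s:.1f} s per burst")
```

That works out to roughly 27 s versus 11 s, in line with the rounded ~30 s and ~10 s above.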
What would push it higher
Three options, in order of how willing I am to do them:
- Smaller quant on just the output layer. If that layer fit in the remaining ~1.9 GB, the whole model would run on GPU and you'd see the 10× numbers other writeups quote. The cost is real quality loss on the output distribution, so it is worth measuring on your own prompt set rather than assuming.
- A bigger GPU. A 16 GB card holds the whole thing with room to spare. The point of this exercise was specifically "what does a commodity laptop GPU do", so a $500 desktop card isn't really in scope.
- Swap engines. llama.cpp direct, vLLM, etc. Two seconds is already inside budget for the action this model powers. Optimising past "fast enough" is how you end up with three benchmarks and zero users.
Reproducing this
# 1. Pull the model
ollama pull gemma4:e2b
# 2. Force CPU only
curl -s http://localhost:11434/api/generate -d '{
"model": "gemma4:e2b",
"prompt": "Give me 5 distractors for the word \"warehouse\".",
"stream": false,
"options": { "num_gpu": 0 }
}' | jq '{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'
# 3. Let Ollama use the GPU
curl -s http://localhost:11434/api/generate -d '{
"model": "gemma4:e2b",
"prompt": "Give me 5 distractors for the word \"warehouse\".",
"stream": false,
"options": { "num_gpu": 999 }
}' | jq '{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'
# 4. Check what actually landed where
docker logs ollama 2>&1 | grep -E "offloaded|layers"
nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv
Run each curl a handful of times to flush warm-up effects, then average eval_ms and total_ms. The interesting number is the ratio, not the absolute timings, which will vary with your CPU.
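If you would rather script the repeat-and-average step than re-run curl by hand, here is a minimal Python sketch using only the standard library. It assumes the same local endpoint and options as the curl calls; the function names, the run count of 5, and the single warm-up call are illustrative choices, not the harness used for the numbers in this post.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl calls


def generate(num_gpu: int, prompt: str, model: str = "gemma4:e2b") -> dict:
    """Send one non-streaming request and return the parsed JSON response."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def summarize(responses: list[dict]) -> dict:
    """Average the engine-reported timings (nanoseconds) into milliseconds."""
    return {
        "eval_ms": statistics.mean(r["eval_duration"] for r in responses) / 1e6,
        "total_ms": statistics.mean(r["total_duration"] for r in responses) / 1e6,
    }


def benchmark(num_gpu: int, prompt: str, runs: int = 5) -> dict:
    """One throwaway warm-up call, then `runs` timed calls."""
    generate(num_gpu, prompt)  # warm-up, result discarded
    return summarize([generate(num_gpu, prompt) for _ in range(runs)])


def main() -> None:
    """Compare CPU-only against auto-split; needs a running Ollama server."""
    prompt = 'Give me 5 distractors for the word "warehouse".'
    cpu = benchmark(0, prompt)
    gpu = benchmark(999, prompt)
    print(f"eval speedup:  {cpu['eval_ms'] / gpu['eval_ms']:.2f}x")
    print(f"total speedup: {cpu['total_ms'] / gpu['total_ms']:.2f}x")
```

Call main() with the Ollama container up. Nothing runs on import, so summarize can be reused on responses you have already collected.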
Takeaways
- 4 GB VRAM is enough to be useful, even on a model that "should" need more. Just don't expect 10×.
- Hybrid inference is gated by the slower device. If one critical layer stays on CPU, that's your floor.
- Trust the load logs, not ollama ps. The pretty CPU/GPU percentage is a memory split, not a layer count.
- 2.5× is the difference between a UX that feels broken and one that doesn't. That's enough.
- Stop optimising once you're inside budget. "Fast enough" beats "fastest" every time.
Full write-up with all the load-log spelunking on my blog: vasyl.blog, "I put Ollama on a 4 GB mobile GPU and got 2.5×"
The reader app this powers is open-source (AGPL-3.0): github.com/mrviduus/textstack
Built with gemma4:e2b for the Gemma 4 Challenge. If you're entering too, drop a link in the comments. Happy to read yours.
Top comments (3)
I didn't realize you were building this for the Gemma 4 Challenge; that actually makes the whole journey even more interesting. The first post wasn't a dead feature, it was your testing ground before the real submission. And now the GPU experiment proves you've pushed the model to a tangible 2.5× improvement on a tiny 4 GB card.
The "one specific layer caps it" bit intrigues me. If it's a memory-heavy layer, maybe partial offload (just that layer to CPU, keep the rest on GPU) could squeeze out a bit more speed for the challenge. Or is that already in your submission?
Either way, turning early production silence into a concrete VRAM math breakdown is a solid challenge entry. Good luck in the judging.
Appreciate the read, and "testing ground for the real submission" is a more generous framing than I gave it. The 35/36 split IS the partial offload you're describing: the lone CPU layer is the output projection (Gemma's 256k vocab makes it fat). The next move is quantising just that layer further, which I haven't measured the quality hit on yet.
The 256k vocab output layer being the bottleneck is almost poetic: the model's biggest strength (vocabulary coverage) is also its heaviest part. Beyond quantizing further, I wonder if you could temporarily prune that layer's least-used tokens at inference time, based on the input context. A dynamic, context-aware "vocab filter" before the output projection could slim it down without permanent quality loss. Not trivial to implement, but on a 4 GB card every megabyte counts.