This is a submission for the Gemma 4 Challenge: Write About Gemma 4
Google released four Gemma 4 variants. Everyone's comparing them on synthetic benchmarks nobody actually cares about. I ran all four on my home lab hardware with real tasks. The results surprised me.
Test machine: Ryzen 7 5700X, RTX 1060 6GB, 32GB RAM. LM Studio, 4-bit quantization.
The Models
| Model | Effective Params | 4-bit Size | Architecture |
|---|---|---|---|
| E2B | ~2.3B | 1.5GB | Dense |
| E4B | ~4.5B | 2.1GB | Dense |
| 26B MoE | ~4B active / 26B total | 13GB | Mixture of Experts |
| 31B | ~31B | 16GB | Dense |
Test 1: Vision — Book Spine Reading
Point a camera at a bookshelf. Can it read the titles?
| Model | Time | Books Found | Quality |
|---|---|---|---|
| E2B | 83s | 0 — returned "NONE" | ❌ Can't read spines |
| E4B | 25s | 6 titles, correctly identified | ✅ Reliable |
| 26B MoE | OOM on 12GB | — | ❌ Doesn't fit |
| 31B | OOM on 12GB | — | ❌ Doesn't fit |
This is the whole story. For multimodal tasks, E2B is not a smaller version of E4B — it's a fundamentally less capable vision model. It couldn't read a single book spine. E4B found 6.
If you're building anything with images, E2B is not an option. Period.
Test 2: Text — Technical Explanation
"Explain TCP vs UDP in 3 sentences."
| Model | Time | Tokens | Speed | Answer Quality |
|---|---|---|---|---|
| E2B | 93s | 256 (hit limit) | 2.8 t/s | Mediocre — rambling |
| E4B | 20s | 113 | 5.7 t/s | Concise and accurate |
E4B was 4.6x faster and produced a better answer in fewer tokens. This flips the "smaller = faster" assumption — E4B's reasoning is more efficient, so it finishes sooner.
Test 3: Structured Output — JSON Generation
"Return a JSON array of 10 programming languages with year created and creator."
| Model | Valid JSON? | Correct fields? | Time |
|---|---|---|---|
| E2B | ✅ Yes | ❌ 3/10 wrong years | 45s |
| E4B | ✅ Yes | ✅ All correct | 12s |
E2B hallucinated creation dates. E4B nailed every one.
Test 4: Vision + Reasoning Shelfie Pipeline
The real test. Run my Shelfie app — detect books from a photo → enrich with metadata → generate recommendations.
| Model | Detection | Enrichment | Total | Works? |
|---|---|---|---|---|
| E2B | Found 0 books | N/A | — | ❌ |
| E4B | 16 books, 106s | 2 batches, 280s | ~8 min | ✅ |
| 26B/31B | OOM | — | — | ❌ |
Only E4B completes the full pipeline on consumer hardware. Eight minutes for a full shelf catalog with recommendations isn't instant — but it costs $0 and stays local.
The Memory Wall
Here's what "runs on consumer hardware" actually means for each model on my RTX 1060 6GB:
| Model | VRAM Needed (4-bit) | Fits 12GB? | Room for Context? |
|---|---|---|---|
| E2B | ~1.5GB | ✅ Yes | ✅ Ton of room |
| E4B | ~2.1GB | ✅ Yes | ✅ Plenty of room |
| 26B MoE | ~13GB | ❌ No | — |
| 31B | ~16GB | ❌ No | — |
The two big models literally don't fit on a 3200-class GPU. You need a 3090 (24GB) minimum for 31B, and even then you'll have barely any context window left.
For reference, the 31B dense model requires ~800MB more VRAM per million tokens of context. That 24GB 3090? It fits the model plus maybe 30K context. Not the advertised 256K.
The Decision Tree I Wish I'd Had
Ask yourself these questions in order:
1. Does it need to process images?
- Yes → E4B minimum. E2B's vision is unusably bad.
- No → Continue to Q2.
2. Does it fit in 6GB VRAM?
- Yes → E4B 4-bit (~2.1GB) gives you room for context.
- No → E2B or you need a bigger GPU.
3. Is it a one-off task or a repeated workload?
- One-off → Cloud API (OpenRouter free tier has E4B).
- Repeated → Local E4B. No per-token cost.
4. Do you need maximum reasoning quality?
- Yes → 31B dense, but you need 24GB+ VRAM.
- No → E4B is fine. I honestly couldn't tell the difference on book identification.
The Brutal Truth
E2B is marketing. "Runs on your phone!" Yeah, and it can't read a book spine. The gap between E2B and E4B for multimodal tasks isn't incremental — it's the difference between "works" and "doesn't work."
E4B is the model that makes local AI actually useful. It fits on a 3060, runs vision tasks reliably, generates structured output, and is faster than E2B because it reasons more efficiently.
26B MoE and 31B are for people with server GPUs. If you have a 4090 or an A100, they're incredible. If you have a gaming GPU, they're paperweights.
I picked E4B for Shelfie and it was the right call. Sixteen books, full metadata, personalized recommendations — all running on my home lab for free.
E4B is the unsung hero of the Gemma 4 family. The benchmarks won't tell you this. Real usage will.
Try Shelfie: github.com/scastile/shelfie
Top comments (0)