I Ran Every Gemma 4 Model on My Home Lab. E4B Crushes E2B. Here's the Data.

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Write about Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Google released four Gemma 4 variants. Everyone's comparing them on synthetic benchmarks nobody actually cares about. I ran all four on my home lab hardware with real tasks. The results surprised me.

Test machine: Ryzen 7 5700X, RTX 1060 6GB, 32GB RAM. LM Studio, 4-bit quantization.

The Models

Model	Effective Params	4-bit Size	Architecture
E2B	~2.3B	1.5GB	Dense
E4B	~4.5B	2.1GB	Dense
26B MoE	~4B active / 26B total	13GB	Mixture of Experts
31B	~31B	16GB	Dense

Test 1: Vision — Book Spine Reading

Point a camera at a bookshelf. Can it read the titles?

Model	Time	Books Found	Quality
E2B	83s	0 — returned "NONE"	❌ Can't read spines
E4B	25s	6 titles, correctly identified	✅ Reliable
26B MoE	OOM on 12GB	—	❌ Doesn't fit
31B	OOM on 12GB	—	❌ Doesn't fit

This is the whole story. For multimodal tasks, E2B is not a smaller version of E4B — it's a fundamentally less capable vision model. It couldn't read a single book spine. E4B found 6.

If you're building anything with images, E2B is not an option. Period.

Test 2: Text — Technical Explanation

"Explain TCP vs UDP in 3 sentences."

Model	Time	Tokens	Speed	Answer Quality
E2B	93s	256 (hit limit)	2.8 t/s	Mediocre — rambling
E4B	20s	113	5.7 t/s	Concise and accurate

E4B was 4.6x faster and produced a better answer in fewer tokens. This flips the "smaller = faster" assumption — E4B's reasoning is more efficient, so it finishes sooner.

Test 3: Structured Output — JSON Generation

"Return a JSON array of 10 programming languages with year created and creator."

Model	Valid JSON?	Correct fields?	Time
E2B	✅ Yes	❌ 3/10 wrong years	45s
E4B	✅ Yes	✅ All correct	12s

E2B hallucinated creation dates. E4B nailed every one.

Test 4: Vision + Reasoning Shelfie Pipeline

The real test. Run my Shelfie app — detect books from a photo → enrich with metadata → generate recommendations.

Model	Detection	Enrichment	Total	Works?
E2B	Found 0 books	N/A	—	❌
E4B	16 books, 106s	2 batches, 280s	~8 min	✅
26B/31B	OOM	—	—	❌

Only E4B completes the full pipeline on consumer hardware. Eight minutes for a full shelf catalog with recommendations isn't instant — but it costs $0 and stays local.

The Memory Wall

Here's what "runs on consumer hardware" actually means for each model on my RTX 1060 6GB:

Model	VRAM Needed (4-bit)	Fits 12GB?	Room for Context?
E2B	~1.5GB	✅ Yes	✅ Ton of room
E4B	~2.1GB	✅ Yes	✅ Plenty of room
26B MoE	~13GB	❌ No	—
31B	~16GB	❌ No	—

The two big models literally don't fit on a 3200-class GPU. You need a 3090 (24GB) minimum for 31B, and even then you'll have barely any context window left.

For reference, the 31B dense model requires ~800MB more VRAM per million tokens of context. That 24GB 3090? It fits the model plus maybe 30K context. Not the advertised 256K.

The Decision Tree I Wish I'd Had

Ask yourself these questions in order:

1. Does it need to process images?

Yes → E4B minimum. E2B's vision is unusably bad.
No → Continue to Q2.

2. Does it fit in 6GB VRAM?

Yes → E4B 4-bit (~2.1GB) gives you room for context.
No → E2B or you need a bigger GPU.

3. Is it a one-off task or a repeated workload?

One-off → Cloud API (OpenRouter free tier has E4B).
Repeated → Local E4B. No per-token cost.

4. Do you need maximum reasoning quality?

Yes → 31B dense, but you need 24GB+ VRAM.
No → E4B is fine. I honestly couldn't tell the difference on book identification.

The Brutal Truth

E2B is marketing. "Runs on your phone!" Yeah, and it can't read a book spine. The gap between E2B and E4B for multimodal tasks isn't incremental — it's the difference between "works" and "doesn't work."

E4B is the model that makes local AI actually useful. It fits on a 1060, runs vision tasks reliably, generates structured output, and is faster than E2B because it reasons more efficiently.

26B MoE and 31B are for people with server GPUs. If you have a 4090 or an A100, they're incredible. If you have a gaming GPU, they're paperweights.

I picked E4B for Shelfie and it was the right call. Sixteen books, full metadata, personalized recommendations — all running on my home lab for free.

E4B is the unsung hero of the Gemma 4 family. The benchmarks won't tell you this. Real usage will.

Try Shelfie: github.com/scastile/shelfie