This is a submission for the Gemma 4 Challenge: Write About Gemma 4
TL;DR: E4B is the model most developers should run locally. Here's why — tested on a GTX 1650 with real tasks, real numbers, and one bug it found that I didn't ask it to find.
A GTX 1650 is not an impressive GPU. 4GB of VRAM. A card that benchmarking sites politely describe as "entry-level." It's the kind of hardware that AI demos don't mention — because most AI demos are built for A100s or at least an RTX 4090.
I mention this upfront because it's the whole point of this post.
I ran Gemma 4 — two variants of it — on that GTX 1650. I gave it real tasks: a document to analyze, a bug to fix, a photo of handwritten notes to read. And somewhere between watching it handle a coding problem better than I'd planned to, and seeing it transcribe messy handwriting from a photo with no internet connection, I realized the story here isn't about benchmarks.
It's about who gets to build with capable AI now.
Why the Hardware Matters
Before I get into what each model does, I want to make the case for why I'm leading with a GTX 1650 instead of a shiny workstation.
Most local AI content is written for people who already have great hardware. "Runs on a single H100" is a spec that means nothing to 95% of developers. "Runs on your laptop's GPU" means everything — because that's the machine sitting on your desk right now.
Gemma 4's model family was designed around a specific philosophy: every size tier should be the best model of its kind for the hardware it targets. That's not marketing language. It's an architecture decision that shows up in the numbers when you actually run it.
Here's what the family looks like:
| Model | Effective Params | Context | Modalities | Targets |
|---|---|---|---|---|
| E2B | ~2B | 128K | Text, Image, Video, Audio | Phones, Pi, IoT |
| E4B | ~4B | 128K | Text, Image, Video, Audio | Laptops, dev machines |
| 26B MoE | ~4B active / 26B loaded | 256K | Text, Image, Video | Workstations, Apple Silicon |
| 31B Dense | 31B | 256K | Text, Image, Video | GPU servers, cloud |
Quick Comparison
| Feature | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| On-device Friendly | ✅ | ✅ | ⚠️ | ❌ |
| Audio Support | ✅ | ✅ | ❌ | ❌ |
| Long Context (256K) | ❌ | ❌ | ✅ | ✅ |
| High-end Reasoning | ⚠️ | ✅ | 🔥 | 🔥🔥 |
| Best Efficiency | ✅ | ✅ | 🔥 | ⚠️ |
| Cloud-scale Deployment | ⚠️ | ⚠️ | ✅ | 🔥 |
I ran E2B and E4B locally. The 26B and 31B I tested via Google AI Studio. Everything that follows is what actually happened.
Benchmark Performance: The Numbers Behind the Claims
I want to be upfront here: I didn't run standardized benchmarks myself — that would take days and dedicated hardware. What I'm sharing below comes from Google's official model card and the Arena AI leaderboard. But I've cross-referenced these with my own hands-on experience across the four tasks, and the numbers track with what I observed.
Arena AI Leaderboard (Real Human Votes)
This is the one I trust most. Arena AI ranks models through blind head-to-head comparisons voted on by real users — not automated scripts. You can't game it with careful prompt selection.
| Model | Arena Elo Score |
|---|---|
| Gemma 4 31B (thinking) | 1452 |
| Gemma 4 26B MoE (thinking) | 1441 |
| DeepSeek-V3.2 | ~1425 |
| Qwen 3.5 27B | 1403 |
| Gemma 3 27B | 1365 |
An 87-point Elo gap between Gemma 4 31B and Gemma 3 27B is not incremental — it's a generational jump in a single release cycle. The 26B MoE is only 11 points behind the full dense model despite activating a fraction of the parameters. That gap is where the MoE efficiency story lives.
Reasoning and Knowledge
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| MMLU Pro (multilingual Q&A) | 69.4% | 82.6% | 85.2% | 67.6% |
| GPQA Diamond (expert science) | 58.6% | 82.3% | 84.3% | 42.4% |
| AIME 2026 (competition math) | 42.5% | 88.3% | 89.2% | 20.8% |
| BigBench Extra Hard | 33.1% | 64.8% | 74.4% | 19.3% |
The AIME 2026 score deserves a moment. These are competition-level math problems that trip up most humans. Gemma 4 31B at 89.2% is extraordinary for any open model. The previous generation scored 20.8% — that's not an improvement, that's a different model category entirely.
GPQA Diamond tests PhD-level scientific reasoning. Gemma 4 nearly doubled Gemma 3's score. I saw a smaller version of this in Task 1 — the document analysis caught contradictions that required actual reasoning, not just keyword matching.
Coding
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| LiveCodeBench v6 | 52.0% | 77.1% | 80.0% | 29.1% |
| Codeforces Elo | 940 | 1718 | 2150 | 110 |
LiveCodeBench uses fresh competitive programming problems — the model hasn't seen them during training, so there's no memorization at play. Going from 29.1% to 80.0% is nearly a 3× improvement.
The Codeforces Elo puts this in human terms: Gemma 3's score of 110 was essentially beginner-level. Gemma 4 31B at 2150 is "Candidate Master" — a rank that takes human competitive programmers years to reach. The 26B MoE at 1718 ("Expert" rank) is impressive for a model that only fires 3.8B parameters per token.
This maps directly to what I saw in Task 2: E4B didn't just clean up my code, it found a better architecture and caught a bug I hadn't asked it to find. These benchmark numbers explain why.
Vision
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| MMMU Pro (multimodal reasoning) | 52.6% | 73.8% | 76.9% | 49.7% |
| MATH-Vision | 59.5% | 82.4% | 85.6% | 46.0% |
| OmniDocBench (error rate ↓) | 0.181 | 0.149 | 0.131 | 0.365 |
OmniDocBench measures document understanding accuracy — lower is better. Gemma 4 31B cut Gemma 3's error rate by nearly two-thirds. For E4B, the error rate of 0.181 still represents a massive improvement over the previous generation, and it's consistent with my handwriting transcription test: 90% accuracy on messy notes is real-world OmniDocBench territory.
Agentic Tool Use
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| τ2-bench (avg. 3 domains) | 42.2% | 68.2% | 76.9% | 16.2% |
τ2-bench simulates real agentic scenarios — retail, airlines, multi-step tool use — where the model must act, not just respond. Gemma 3 at 16.2% was essentially unusable for autonomous agents. Gemma 4 31B at 76.9% is a model you can actually build workflows around.
The Generational Leap at a Glance
| Benchmark | Gemma 3 27B | Gemma 4 31B | Jump |
|---|---|---|---|
| MMLU Pro | 67.6% | 85.2% | +17.6 pts |
| AIME 2026 | 20.8% | 89.2% | +68.4 pts |
| LiveCodeBench v6 | 29.1% | 80.0% | +50.9 pts |
| GPQA Diamond | 42.4% | 84.3% | +41.9 pts |
| MMMU Pro | 49.7% | 76.9% | +27.2 pts |
| τ2-bench | 16.2% | 76.9% | +60.7 pts |
These aren't incremental gains. Math, coding, and agentic benchmarks improved by 50–68 percentage points in a single generation. That's not a version bump — that's a new category of model wearing the same name.
Benchmark data sourced from Google's official Gemma 4 model card and the Arena AI leaderboard (April 2026).
Setting Up (Faster Than You Think)
# Install Ollama: https://ollama.com/download
# Then pull whichever model fits your hardware:
ollama pull gemma4:e2b # ~1.4 GB
ollama pull gemma4:e4b # ~2.5 GB
ollama run gemma4:e4b
That's it. No Python environment. No CUDA configuration rabbit hole. No API key. The first time I ran this I kept waiting for something to break. It didn't.
On my GTX 1650 with 4GB VRAM, Ollama automatically offloads layers between GPU and CPU. E2B fits mostly on the GPU. E4B splits across GPU and RAM. Neither one complained about the hardware — they just ran.
You can browse all available Gemma 4 variants on the Ollama model library.
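If you'd rather skip the Python client entirely, Ollama also exposes a plain HTTP API on localhost, which is handy for wiring it into existing tools. A minimal sketch with the requests package (the prompt is just a placeholder; 11434 is Ollama's default port):

import requests

# Ollama listens on localhost:11434 by default; no key, no cloud round-trip
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma4:e4b", "prompt": "Say hello in five words.", "stream": False},
)
print(resp.json()["response"])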
Results at a Glance
Before diving into each task, here's what I actually observed on my machine:
| Model | Token Speed | VRAM Used | Handwriting Accuracy | First-Token Latency |
|---|---|---|---|---|
| E2B | ~35 tok/s | ~2.5 GB | ~72% | <2s |
| E4B | ~22 tok/s | ~3.8 GB | ~90% | <3s |
E4B is slower but meaningfully smarter. Whether that trade-off is worth it depends entirely on your task — which is exactly what the next four sections are about.
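If you want to collect comparable numbers on your own machine, a rough sketch like this one is enough. It assumes the Python client's streaming mode and the eval_count / eval_duration fields Ollama reports on the final chunk; the prompt itself is arbitrary:

import time
import ollama

def rough_benchmark(model, prompt):
    start = time.perf_counter()
    first_token = None
    last = None
    # stream=True yields partial responses; the final chunk carries timing stats
    for chunk in ollama.chat(model=model,
                             messages=[{'role': 'user', 'content': prompt}],
                             stream=True):
        if first_token is None:
            first_token = time.perf_counter() - start
        last = chunk
    # eval_count is tokens generated, eval_duration is reported in nanoseconds
    tok_per_s = last['eval_count'] / last['eval_duration'] * 1e9
    print(f"{model}: first token in {first_token:.1f}s, ~{tok_per_s:.0f} tok/s")

for m in ('gemma4:e2b', 'gemma4:e4b'):
    rough_benchmark(m, 'Explain what a day-streak counter does, in two sentences.')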
Task 1: Analyzing a PDF Document
I had a lengthy technical specification document — the kind with dense paragraphs, tables, and section references that make your eyes glaze over. I needed a summary and a list of open questions the document raised but didn't answer.
I fed it to E4B using Ollama's API:
import ollama

with open("spec_document.txt", "r") as f:
    doc = f.read()

response = ollama.chat(
    model='gemma4:e4b',
    messages=[{
        'role': 'user',
        'content': f"""Here is a technical specification document:
{doc}
Please:
1. Summarize the key decisions made in this document in bullet points
2. List any open questions or ambiguities the document raises but doesn't resolve"""
    }]
)

print(response['message']['content'])
The summary was tight and accurate. But what stood out was the second part — the open questions. It didn't just list vague gaps. It identified specific contradictions between sections, places where a term was used inconsistently, and one assumption that was stated in the introduction but quietly abandoned midway through. Those were real issues. Issues I'd skimmed past.
That's not retrieval. That's reasoning over a document. On a GTX 1650.
E2B on the same task: Handled the summary well. The open questions were shallower — it caught the obvious gaps but missed the subtle cross-section contradiction. Useful, but the ceiling is lower.
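That comparison is easy to reproduce, because the only thing that changes between runs is the model name. A small sketch reusing the same file and prompt as above:

import ollama

with open("spec_document.txt", "r") as f:
    doc = f.read()

prompt = f"""Here is a technical specification document:
{doc}
Please:
1. Summarize the key decisions made in this document in bullet points
2. List any open questions or ambiguities the document raises but doesn't resolve"""

# Same prompt, two models: the only variable is the model size
for model in ('gemma4:e2b', 'gemma4:e4b'):
    response = ollama.chat(model=model, messages=[{'role': 'user', 'content': prompt}])
    print(f"\n===== {model} =====\n{response['message']['content']}")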
Task 2: The Coding Problem
I had a Python function that processed a list of user events and calculated streaks — consecutive days of activity. My implementation worked but felt clunky: nested loops, a flag variable, the kind of code that passes code review but makes you wince when you revisit it three months later.
I asked E4B to review it and suggest improvements.
It didn't just clean up my loops. It came back with a completely different approach using itertools.groupby combined with a date-differencing trick that collapsed the whole thing into a few clean lines. The logic was tighter, the intent was clearer, and — I checked — it handled edge cases my version had silently gotten wrong.
I hadn't asked it to find bugs. I'd asked for improvements. It found a bug anyway because a better structure made the bug visible.
My prompt:
Review this Python function and suggest improvements:
def calculate_streak(events):
    streak = 0
    max_streak = 0
    prev_date = None
    for event in sorted(events, key=lambda x: x['date']):
        date = event['date']
        if prev_date and (date - prev_date).days == 1:
            streak += 1
        else:
            streak = 1
        max_streak = max(max_streak, streak)
        prev_date = date
    return max_streak
E4B's response (key part):
from itertools import groupby
from datetime import timedelta

def calculate_streak(events):
    if not events:
        return 0
    dates = sorted({e['date'] for e in events})  # deduplicate dates
    # Consecutive dates share the same (date - index days) anchor,
    # so each unbroken run collapses into a single group
    runs = groupby(enumerate(dates), key=lambda p: p[1] - timedelta(days=p[0]))
    return max(sum(1 for _ in run) for _, run in runs)
It noted that my original didn't deduplicate dates, so if a user had two events on the same day, the streak count would break. That was a real bug I hadn't noticed.
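If you want to see the bug for yourself, one duplicate date mid-streak is enough to trip the original version. A quick check with made-up dates (run it with either version of calculate_streak from above defined):

from datetime import date

# Two events on Jan 2: the original's zero-day gap falls into the else branch
# and resets the count, so it reports 2. The revised version reports 3.
events = [
    {'date': date(2026, 1, 1)},
    {'date': date(2026, 1, 2)},
    {'date': date(2026, 1, 2)},  # duplicate day
    {'date': date(2026, 1, 3)},
]
print(calculate_streak(events))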
E2B on the same task: Suggested sensible variable renames and added a docstring. Didn't find the bug. Didn't suggest the architectural improvement. This is the clearest demonstration I found of where the extra effective parameters in E4B actually show up — not in speed, but in the depth of what it notices.
Task 3: Reading Handwritten Notes From a Photo
This is the one that made me stop and stare at the screen for a second.
I took a photo of handwritten notes — the kind of scrawled, uneven writing you do when you're thinking fast. Arrows connecting ideas. Words crossed out and rewritten. Abbreviations that made sense at the time.
import ollama

response = ollama.chat(
    model='gemma4:e4b',
    messages=[{
        'role': 'user',
        'content': 'Transcribe all the text in this image, including crossed-out words. Then summarize the main ideas.',
        'images': ['./notes_photo.jpg']
    }]
)

print(response['message']['content'])
It transcribed around 90% of the words correctly, including several that I would have described as illegible to a stranger. It correctly identified two crossed-out phrases and labeled them as such. The summary captured the actual ideas, not just the words.
This ran completely offline. No API call. No image being uploaded to a server somewhere. My notes — which contained half-formed ideas I wouldn't want indexed anywhere — stayed on my machine.
That's the detail I keep coming back to. The capability isn't new. Cloud OCR and vision APIs have done this for years. What's new is the location. It's here, on hardware that cost a few hundred dollars, with no ongoing cost and no data leaving the device.
E2B on the same task: Transcription accuracy dropped to around 70-75%. The summary was reasonable but missed one of the three main ideas entirely. For clean, printed documents E2B would be fine. For messy handwriting, E4B is meaningfully better.
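Because everything runs locally, batching a whole folder of photos is just a loop with no per-image cost. A sketch assuming the images sit in a notes/ directory (a path I made up for this example):

import ollama
from pathlib import Path

prompt = ('Transcribe all the text in this image, including crossed-out words. '
          'Then summarize the main ideas.')

# Walk a local folder of photos and transcribe each one on-device
for photo in sorted(Path('notes').glob('*.jpg')):
    response = ollama.chat(
        model='gemma4:e4b',
        messages=[{'role': 'user', 'content': prompt, 'images': [str(photo)]}],
    )
    photo.with_suffix('.txt').write_text(response['message']['content'])
    print(f"Transcribed {photo.name}")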
Task 4: Creative Writing
I asked both models to write the opening paragraph of a short story with a specific constraint: the main character's emotional state could only be shown through their physical actions, never stated directly.
My prompt:
Write the opening paragraph of a short story. Rule: never state
the character's emotions directly. Show them only through
physical actions and behaviour.
E4B's response:
She lined up the coffee mugs by handle direction before she'd even taken her coat off. Three mugs, all facing left, then she moved the middle one a quarter-inch to the right, then back. The kettle had already boiled. She didn't touch it.
That paragraph understood the constraint and served it. The anxiety is never named — it's in the compulsive rearranging, the boiled kettle she can't bring herself to use. That's craft, not just instruction-following.
E2B produced something more literal — actions listed in sequence, readable but without the subtext. Competent, not nuanced.
For tasks where tone and craft matter — marketing copy, story generation, user-facing text — that gap between the two models is real and worth knowing about before you choose.
The Real Comparison: When to Use Which
After running all four tasks, here's my honest take on the decision:
Choose E2B when:
- You're deploying to a device with under 4GB RAM
- You need audio input — it's exclusive to the edge models
- Your tasks are extraction, classification, summarization of clean text
- Offline, on-device operation is non-negotiable and you can't spare more resources
Choose E4B when:
- You're on a developer laptop or a GPU with 4–8GB VRAM (yes, a 1650 works)
- You need multimodal — images, handwriting, documents, audio
- Your tasks require actual reasoning: code review, document analysis, nuanced writing
- You want the best local model that runs on typical developer hardware without compromise
Choose 26B MoE when:
- You have 16GB+ RAM or Apple Silicon
- You need 256K context (full repos, long documents)
- You want near-31B quality at something close to E4B speed — the MoE architecture earns its place here
- Currently ranked #6 on the open model leaderboard, outperforming models far larger
Choose 31B Dense when:
- You're deploying server-side with dedicated GPU resources
- You need the absolute ceiling of open-model quality
- Currently ranked #3 on the open model leaderboard among all open models
What This Actually Changes
I want to end on something that isn't a spec or a benchmark.
There's a version of local AI that's been available for a while — open models that technically run on your hardware but require you to accept that you're getting a worse result than the cloud API. You'd use it for offline demos, for prototypes, for cases where privacy was mandatory and quality was a secondary concern.
Gemma 4 is not that. E4B caught a bug I missed. It transcribed handwriting I would have doubted it could read. It found a better architecture for my code than I was planning to write. These are not "good for a local model" results. These are good results.
The GTX 1650 on my desk is three or four GPU generations old. It's the kind of card that serious ML practitioners apologize for owning. And it ran a model that did genuinely useful work across every task I threw at it — with no internet connection, no API key, no monthly bill, and no copy of my documents sitting on someone else's server.
That's not a benchmark. That's a change in what's possible. And it's available right now, for free, to anyone with a halfway-decent laptop.
What I'm curious to explore next: building a local RAG pipeline with E4B as the backbone, and testing audio input on E2B for a voice-triggered assistant. The 128K context window makes both genuinely interesting.
ollama pull gemma4:e4b
ollama run gemma4:e4b
Pull it. Give it something real to do. See what happens.
If you run this on your own hardware, drop your token speeds and VRAM numbers in the comments — I'm curious how it performs across different setups.
All code from this post is available as a GitHub Gist if you want to run it directly.
Tested locally on Windows with a GTX 1650 (4GB VRAM) and 16GB system RAM using Ollama. 26B and 31B tested via Google AI Studio. Model specs from Google DeepMind and Hugging Face documentation. Leaderboard rankings from Arena AI at time of writing.