This is a submission for the Gemma 4 Challenge: Write About Gemma 4
TL;DR: E4B is the model most developers should run locally. Here's why — tested on a GTX 1650 with real tasks, real numbers, and one bug it found that I didn't ask it to find.
A GTX 1650 is not an impressive GPU. 4GB of VRAM. A card that benchmarking sites politely describe as "entry-level." It's the kind of hardware that AI demos don't mention — because most AI demos are built for A100s or at least an RTX 4090.
I mention this upfront because it's the whole point of this post.
I ran Gemma 4 — two variants of it — on that GTX 1650. I gave it real tasks: a document to analyze, a bug to fix, a photo of handwritten notes to read. And somewhere between watching it handle a coding problem better than I'd planned to, and seeing it transcribe messy handwriting from a photo with no internet connection, I realized the story here isn't about benchmarks.
It's about who gets to build with capable AI now.
Why the Hardware Matters
Before I get into what each model does, I want to make the case for why I'm leading with a GTX 1650 instead of a shiny workstation.
Most local AI content is written for people who already have great hardware. "Runs on a single H100" is a spec that means nothing to 95% of developers. "Runs on your laptop's GPU" means everything — because that's the machine sitting on your desk right now.
Gemma 4's model family was designed around a specific philosophy: every size tier should be the best model of its kind for the hardware it targets. That's not marketing language. It's an architecture decision that shows up in the numbers when you actually run it.
Here's what the family looks like:
| Model | Effective Params | Context | Modalities | Targets |
|---|---|---|---|---|
| E2B | ~2B | 128K | Text, Image, Video, Audio | Phones, Pi, IoT |
| E4B | ~4B | 128K | Text, Image, Video, Audio | Laptops, dev machines |
| 26B MoE | ~4B active / 26B loaded | 256K | Text, Image, Video | Workstations, Apple Silicon |
| 31B Dense | 31B | 256K | Text, Image, Video | GPU servers, cloud |
Quick Comparison
| Feature | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| On-device Friendly | ✅ | ✅ | ⚠️ | ❌ |
| Audio Support | ✅ | ✅ | ❌ | ❌ |
| Long Context (256K) | ❌ | ❌ | ✅ | ✅ |
| High-end Reasoning | ⚠️ | ✅ | 🔥 | 🔥🔥 |
| Best Efficiency | ✅ | ✅ | 🔥 | ⚠️ |
| Cloud-scale Deployment | ⚠️ | ⚠️ | ✅ | 🔥 |
I ran E2B and E4B locally. The 26B and 31B I tested via Google AI Studio. Everything that follows is what actually happened.
Benchmark Performance: The Numbers Behind the Claims
I want to be upfront here: I didn't run standardized benchmarks myself — that would take days and dedicated hardware. What I'm sharing below comes from Google's official model card and the Arena AI leaderboard. But I've cross-referenced these with my own hands-on experience across the four tasks, and the numbers track with what I observed.
Arena AI Leaderboard (Real Human Votes)
This is the one I trust most. Arena AI ranks models through blind head-to-head comparisons voted on by real users — not automated scripts. You can't game it with careful prompt selection.
| Model | Arena Elo Score |
|---|---|
| Gemma 4 31B (thinking) | 1452 |
| Gemma 4 26B MoE (thinking) | 1441 |
| DeepSeek-V3.2 | ~1425 |
| Qwen 3.5 27B | 1403 |
| Gemma 3 27B | 1365 |
An 87-point Elo gap between Gemma 4 31B and Gemma 3 27B is not incremental — it's a generational jump in a single release cycle. The 26B MoE is only 11 points behind the full dense model despite activating a fraction of the parameters. That gap is where the MoE efficiency story lives.
Reasoning and Knowledge
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| MMLU Pro (multilingual Q&A) | 69.4% | 82.6% | 85.2% | 67.6% |
| GPQA Diamond (expert science) | 58.6% | 82.3% | 84.3% | 42.4% |
| AIME 2026 (competition math) | 42.5% | 88.3% | 89.2% | 20.8% |
| BigBench Extra Hard | 33.1% | 64.8% | 74.4% | 19.3% |
The AIME 2026 score deserves a moment. These are competition-level math problems that trip up most humans. Gemma 4 31B at 89.2% is extraordinary for any open model. The previous generation scored 20.8% — that's not an improvement, that's a different model category entirely.
GPQA Diamond tests PhD-level scientific reasoning. Gemma 4 nearly doubled Gemma 3's score. I saw a smaller version of this in Task 1 — the document analysis caught contradictions that required actual reasoning, not just keyword matching.
Coding
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| LiveCodeBench v6 | 52.0% | 77.1% | 80.0% | 29.1% |
| Codeforces Elo | 940 | 1718 | 2150 | 110 |
LiveCodeBench uses fresh competitive programming problems — the model hasn't seen them during training, so there's no memorization at play. Going from 29.1% to 80.0% is nearly a 3× improvement.
The Codeforces Elo puts this in human terms: Gemma 3's score of 110 was essentially beginner-level. Gemma 4 31B at 2150 is "Candidate Master" — a rank that takes human competitive programmers years to reach. The 26B MoE at 1718 ("Expert" rank) is impressive for a model that only fires 3.8B parameters per token.
This maps directly to what I saw in Task 2: E4B didn't just clean up my code, it found a better architecture and caught a bug I hadn't asked it to find. These benchmark numbers explain why.
Vision
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| MMMU Pro (multimodal reasoning) | 52.6% | 73.8% | 76.9% | 49.7% |
| MATH-Vision | 59.5% | 82.4% | 85.6% | 46.0% |
| OmniDocBench (error rate ↓) | 0.181 | 0.149 | 0.131 | 0.365 |
OmniDocBench measures document understanding accuracy — lower is better. Gemma 4 31B cut Gemma 3's error rate by nearly two-thirds. For E4B, the error rate of 0.181 still represents a massive improvement over the previous generation, and it's consistent with my handwriting transcription test: 90% accuracy on messy notes is real-world OmniDocBench territory.
Agentic Tool Use
| Benchmark | E4B | 26B MoE | 31B Dense | Gemma 3 27B |
|---|---|---|---|---|
| τ2-bench (avg. 3 domains) | 42.2% | 68.2% | 76.9% | 16.2% |
τ2-bench simulates real agentic scenarios — retail, airlines, multi-step tool use — where the model must act, not just respond. Gemma 3 at 16.2% was essentially unusable for autonomous agents. Gemma 4 31B at 76.9% is a model you can actually build workflows around.
The Generational Leap at a Glance
| Benchmark | Gemma 3 27B | Gemma 4 31B | Jump |
|---|---|---|---|
| MMLU Pro | 67.6% | 85.2% | +17.6 pts |
| AIME 2026 | 20.8% | 89.2% | +68.4 pts |
| LiveCodeBench v6 | 29.1% | 80.0% | +50.9 pts |
| GPQA Diamond | 42.4% | 84.3% | +41.9 pts |
| MMMU Pro | 49.7% | 76.9% | +27.2 pts |
| τ2-bench | 16.2% | 76.9% | +60.7 pts |
These aren't incremental gains. Math, coding, and agentic benchmarks improved by 50–68 percentage points in a single generation. That's not a version bump — that's a new category of model wearing the same name.
Benchmark data sourced from Google's official Gemma 4 model card and the Arena AI leaderboard (April 2026).
Setting Up (Faster Than You Think)
# Install Ollama: https://ollama.com/download
# Then pull whichever model fits your hardware:
ollama pull gemma4:e2b # ~1.4 GB
ollama pull gemma4:e4b # ~2.5 GB
ollama run gemma4:e4b
That's it. No Python environment. No CUDA configuration rabbit hole. No API key. The first time I ran this I kept waiting for something to break. It didn't.
On my GTX 1650 with 4GB VRAM, Ollama automatically offloads layers between GPU and CPU. E2B fits mostly on the GPU. E4B splits across GPU and RAM. Neither one complained about the hardware — they just ran.
You can browse all available Gemma 4 variants on the Ollama model library.
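If you'd rather skip the Python client entirely, Ollama also exposes a plain HTTP API on localhost, which is handy for wiring it into existing tools. A minimal sketch with the requests package (the prompt is just a placeholder; 11434 is Ollama's default port):

import requests

# Ollama listens on localhost:11434 by default; no key, no cloud round-trip
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma4:e4b", "prompt": "Say hello in five words.", "stream": False},
)
print(resp.json()["response"])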
Results at a Glance
Before diving into each task, here's what I actually observed on my machine:
| Model | Token Speed | VRAM Used | Handwriting Accuracy | First-Token Latency |
|---|---|---|---|---|
| E2B | ~35 tok/s | ~2.5 GB | ~72% | <2s |
| E4B | ~22 tok/s | ~3.8 GB | ~90% | <3s |
E4B is slower but meaningfully smarter. Whether that trade-off is worth it depends entirely on your task — which is exactly what the next four sections are about.
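If you want to collect comparable numbers on your own machine, a rough sketch like this one is enough. It assumes the Python client's streaming mode and the eval_count / eval_duration fields Ollama reports on the final chunk; the prompt itself is arbitrary:

import time
import ollama

def rough_benchmark(model, prompt):
    start = time.perf_counter()
    first_token = None
    last = None
    # stream=True yields partial responses; the final chunk carries timing stats
    for chunk in ollama.chat(model=model,
                             messages=[{'role': 'user', 'content': prompt}],
                             stream=True):
        if first_token is None:
            first_token = time.perf_counter() - start
        last = chunk
    # eval_count is tokens generated, eval_duration is reported in nanoseconds
    tok_per_s = last['eval_count'] / last['eval_duration'] * 1e9
    print(f"{model}: first token in {first_token:.1f}s, ~{tok_per_s:.0f} tok/s")

for m in ('gemma4:e2b', 'gemma4:e4b'):
    rough_benchmark(m, 'Explain what a day-streak counter does, in two sentences.')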
Task 1: Analyzing a PDF Document
I had a lengthy technical specification document — the kind with dense paragraphs, tables, and section references that make your eyes glaze over. I needed a summary and a list of open questions the document raised but didn't answer.
I fed it to E4B using Ollama's API:
import ollama

with open("spec_document.txt", "r") as f:
    doc = f.read()

response = ollama.chat(
    model='gemma4:e4b',
    messages=[{
        'role': 'user',
        'content': f"""Here is a technical specification document:
{doc}
Please:
1. Summarize the key decisions made in this document in bullet points
2. List any open questions or ambiguities the document raises but doesn't resolve"""
    }]
)

print(response['message']['content'])
The summary was tight and accurate. But what stood out was the second part — the open questions. It didn't just list vague gaps. It identified specific contradictions between sections, places where a term was used inconsistently, and one assumption that was stated in the introduction but quietly abandoned midway through. Those were real issues. Issues I'd skimmed past.
That's not retrieval. That's reasoning over a document. On a GTX 1650.
E2B on the same task: Handled the summary well. The open questions were shallower — it caught the obvious gaps but missed the subtle cross-section contradiction. Useful, but the ceiling is lower.
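That comparison is easy to reproduce, because the only thing that changes between runs is the model name. A small sketch reusing the same file and prompt as above:

import ollama

with open("spec_document.txt", "r") as f:
    doc = f.read()

prompt = f"""Here is a technical specification document:
{doc}
Please:
1. Summarize the key decisions made in this document in bullet points
2. List any open questions or ambiguities the document raises but doesn't resolve"""

# Same prompt, two models: the only variable is the model size
for model in ('gemma4:e2b', 'gemma4:e4b'):
    response = ollama.chat(model=model, messages=[{'role': 'user', 'content': prompt}])
    print(f"\n===== {model} =====\n{response['message']['content']}")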
Task 2: The Coding Problem
I had a Python function that processed a list of user events and calculated streaks — consecutive days of activity. My implementation worked but felt clunky: nested loops, a flag variable, the kind of code that passes code review but makes you wince when you revisit it three months later.
I asked E4B to review it and suggest improvements.
It didn't just clean up my loops. It came back with a completely different approach using itertools.groupby combined with a date-differencing trick that collapsed the whole thing into a few clean lines. The logic was tighter, the intent was clearer, and — I checked — it handled edge cases my version had silently gotten wrong.
I hadn't asked it to find bugs. I'd asked for improvements. It found a bug anyway because a better structure made the bug visible.
My prompt:
Review this Python function and suggest improvements:
def calculate_streak(events):
    streak = 0
    max_streak = 0
    prev_date = None
    for event in sorted(events, key=lambda x: x['date']):
        date = event['date']
        if prev_date and (date - prev_date).days == 1:
            streak += 1
        else:
            streak = 1
        max_streak = max(max_streak, streak)
        prev_date = date
    return max_streak
E4B's response (key part):
from itertools import groupby
from datetime import timedelta

def calculate_streak(events):
    if not events:
        return 0
    dates = sorted({e['date'] for e in events})  # deduplicate dates
    # Consecutive dates share the same (date - index days) anchor,
    # so each unbroken run collapses into a single group
    runs = groupby(enumerate(dates), key=lambda p: p[1] - timedelta(days=p[0]))
    return max(sum(1 for _ in run) for _, run in runs)
It noted that my original didn't deduplicate dates, so if a user had two events on the same day, the streak count would break. That was a real bug I hadn't noticed.
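If you want to see the bug for yourself, one duplicate date mid-streak is enough to trip the original version. A quick check with made-up dates (run it with either version of calculate_streak from above defined):

from datetime import date

# Two events on Jan 2: the original's zero-day gap falls into the else branch
# and resets the count, so it reports 2. The revised version reports 3.
events = [
    {'date': date(2026, 1, 1)},
    {'date': date(2026, 1, 2)},
    {'date': date(2026, 1, 2)},  # duplicate day
    {'date': date(2026, 1, 3)},
]
print(calculate_streak(events))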
E2B on the same task: Suggested sensible variable renames and added a docstring. Didn't find the bug. Didn't suggest the architectural improvement. This is the clearest demonstration I found of where the extra effective parameters in E4B actually show up — not in speed, but in the depth of what it notices.
Task 3: Reading Handwritten Notes From a Photo
This is the one that made me stop and stare at the screen for a second.
I took a photo of handwritten notes — the kind of scrawled, uneven writing you do when you're thinking fast. Arrows connecting ideas. Words crossed out and rewritten. Abbreviations that made sense at the time.
import ollama

response = ollama.chat(
    model='gemma4:e4b',
    messages=[{
        'role': 'user',
        'content': 'Transcribe all the text in this image, including crossed-out words. Then summarize the main ideas.',
        'images': ['./notes_photo.jpg']
    }]
)

print(response['message']['content'])
It transcribed around 90% of the words correctly, including several that I would have described as illegible to a stranger. It correctly identified two crossed-out phrases and labeled them as such. The summary captured the actual ideas, not just the words.
This ran completely offline. No API call. No image being uploaded to a server somewhere. My notes — which contained half-formed ideas I wouldn't want indexed anywhere — stayed on my machine.
That's the detail I keep coming back to. The capability isn't new. Cloud OCR and vision APIs have done this for years. What's new is the location. It's here, on hardware that cost a few hundred dollars, with no ongoing cost and no data leaving the device.
E2B on the same task: Transcription accuracy dropped to around 70-75%. The summary was reasonable but missed one of the three main ideas entirely. For clean, printed documents E2B would be fine. For messy handwriting, E4B is meaningfully better.
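Because everything runs locally, batching a whole folder of photos is just a loop with no per-image cost. A sketch assuming the images sit in a notes/ directory (a path I made up for this example):

import ollama
from pathlib import Path

prompt = ('Transcribe all the text in this image, including crossed-out words. '
          'Then summarize the main ideas.')

# Walk a local folder of photos and transcribe each one on-device
for photo in sorted(Path('notes').glob('*.jpg')):
    response = ollama.chat(
        model='gemma4:e4b',
        messages=[{'role': 'user', 'content': prompt, 'images': [str(photo)]}],
    )
    photo.with_suffix('.txt').write_text(response['message']['content'])
    print(f"Transcribed {photo.name}")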
Task 4: Creative Writing
I asked both models to write the opening paragraph of a short story with a specific constraint: the main character's emotional state could only be shown through their physical actions, never stated directly.
My prompt:
Write the opening paragraph of a short story. Rule: never state
the character's emotions directly. Show them only through
physical actions and behaviour.
E4B's response:
She lined up the coffee mugs by handle direction before she'd even taken her coat off. Three mugs, all facing left, then she moved the middle one a quarter-inch to the right, then back. The kettle had already boiled. She didn't touch it.
That paragraph understood the constraint and served it. The anxiety is never named — it's in the compulsive rearranging, the boiled kettle she can't bring herself to use. That's craft, not just instruction-following.
E2B produced something more literal — actions listed in sequence, readable but without the subtext. Competent, not nuanced.
For tasks where tone and craft matter — marketing copy, story generation, user-facing text — that gap between the two models is real and worth knowing about before you choose.
The Real Comparison: When to Use Which
After running all four tasks, here's my honest take on the decision:
Choose E2B when:
- You're deploying to a device with under 4GB RAM
- You need audio input — it's exclusive to the edge models
- Your tasks are extraction, classification, summarization of clean text
- Offline, on-device operation is non-negotiable and you can't spare more resources
Choose E4B when:
- You're on a developer laptop or a GPU with 4–8GB VRAM (yes, a 1650 works)
- You need multimodal — images, handwriting, documents, audio
- Your tasks require actual reasoning: code review, document analysis, nuanced writing
- You want the best local model that runs on typical developer hardware without compromise
Choose 26B MoE when:
- You have 16GB+ RAM or Apple Silicon
- You need 256K context (full repos, long documents)
- You want near-31B quality at something close to E4B speed — the MoE architecture earns its place here
- Currently ranked #6 on the open model leaderboard, outperforming models far larger
Choose 31B Dense when:
- You're deploying server-side with dedicated GPU resources
- You need the absolute ceiling of open-model quality
- Currently ranked #3 on the open model leaderboard among all open models
What This Actually Changes
I want to end on something that isn't a spec or a benchmark.
There's a version of local AI that's been available for a while — open models that technically run on your hardware but require you to accept that you're getting a worse result than the cloud API. You'd use it for offline demos, for prototypes, for cases where privacy was mandatory and quality was a secondary concern.
Gemma 4 is not that. E4B caught a bug I missed. It transcribed handwriting I would have doubted it could read. It found a better architecture for my code than I was planning to write. These are not "good for a local model" results. These are good results.
The GTX 1650 on my desk is three or four GPU generations old. It's the kind of card that serious ML practitioners apologize for owning. And it ran a model that did genuinely useful work across every task I threw at it — with no internet connection, no API key, no monthly bill, and no copy of my documents sitting on someone else's server.
That's not a benchmark. That's a change in what's possible. And it's available right now, for free, to anyone with a halfway-decent laptop.
What I'm curious to explore next: building a local RAG pipeline with E4B as the backbone, and testing audio input on E2B for a voice-triggered assistant. The 128K context window makes both genuinely interesting.
ollama pull gemma4:e4b
ollama run gemma4:e4b
Pull it. Give it something real to do. See what happens.
If you run this on your own hardware, drop your token speeds and VRAM numbers in the comments — I'm curious how it performs across different setups.
All code from this post is available as a GitHub Gist if you want to run it directly.
Tested locally on Windows with a GTX 1650 (4GB VRAM) and 16GB system RAM using Ollama. 26B and 31B tested via Google AI Studio. Model specs from Google DeepMind and Hugging Face documentation. Leaderboard rankings from Arena AI at time of writing.