DEV Community

David

How to Run Google's Gemma 4 Locally with Ollama — All 4 Model Sizes Compared

Google dropped Gemma 4 two days ago and it's already everywhere — 1,700+ points on Hacker News, 80K+ downloads on HuggingFace. The benchmarks are genuinely insane: the E4B model (4.5B active parameters) beats Gemma 3 27B across the board. Math scores jumped from 20% to 89%, and agentic tasks from 6% to 86%.

I've been building a local AI desktop app (Locally Uncensored) and added Gemma 4 support on day one. Here's a quick guide to running it locally with Ollama, plus what I've learned about the different model sizes.

Install Gemma 4 with Ollama

If you have Ollama installed, it's one command:

```shell
# Default (E4B - best bang for buck)
ollama run gemma4

# All available variants
ollama run gemma4:e2b   # 2.3B effective, 7.2 GB download
ollama run gemma4:e4b   # 4.5B effective, 9.6 GB download
ollama run gemma4:26b   # 3.8B active (MoE, 128 experts), 18 GB download
ollama run gemma4:31b   # 30.7B dense, 20 GB download
```

All models support 128K-256K context, vision (image input), and native function calling.
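For the image-input side, Ollama's /api/chat accepts base64-encoded images as a list inside the user message. Here's a minimal payload builder — the file name is a stand-in, and you'd POST the result to http://localhost:11434/api/chat with Ollama running:

```python
import base64

def vision_payload(model, prompt, image_path):
    """Build a multimodal /api/chat payload: Ollama takes images as a
    list of base64-encoded strings inside the user message."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [img_b64]}],
        "stream": False,
    }

# Demo with a stand-in file; point image_path at a real JPEG/PNG in practice.
with open("pixel.bin", "wb") as f:
    f.write(b"\x00fake-image-bytes")
payload = vision_payload("gemma4:e4b", "Describe this image.", "pixel.bin")
print(payload["messages"][0]["images"][0][:12])
```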

Which Size Should You Run?

Here's what actually matters for picking the right variant:

gemma4:e2b — Runs on basically anything. 8GB RAM laptop, Raspberry Pi 5 with swap, old GPUs. Good for quick Q&A and lightweight tasks. Don't expect deep reasoning.

gemma4:e4b — The sweet spot. 6GB VRAM minimum. Beats Gemma 3 27B on benchmarks while being 6x smaller in active parameters. This is the one I'd recommend for most people running Ollama on a desktop.

gemma4:26b — MoE (Mixture of Experts) architecture. 128 experts but only 3.8B active at a time, so it's surprisingly fast for its quality. Needs ~8GB VRAM. 256K context. This is the one that makes you wonder why dense models still exist.

gemma4:31b — Dense 31B. Best raw quality but needs the most resources (~20GB). If you have an RTX 4090 or M-series Mac with 32GB+, this is the ceiling.

What's Actually New vs Gemma 3

The headlines say "better benchmarks" but the real changes are more interesting:

  1. Apache 2.0 license — Gemma 3 had a restrictive license. Gemma 4 is fully open. Commercial use, fine-tuning, redistribution — all fair game.

  2. Native function calling — Gemma 4 was trained with tool-use capabilities built in. You can send it function definitions and it returns structured tool calls. No prompt hacking needed.

  3. Built-in thinking mode — Configurable chain-of-thought reasoning. The model can "think" before responding, similar to what we've seen from DeepSeek and QwQ.

  4. MoE architecture — The 26B variant uses 128 experts with only 3.8B active per token. This is why it's fast despite being "26B" — your GPU only processes a fraction of the weights per inference step.
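To make the "only a fraction of the weights" point concrete, here's a generic top-k MoE routing sketch in NumPy — not Gemma's actual implementation, just the standard pattern: the router scores every expert, but only the top-k winners actually multiply against the activations:

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=2):
    """Generic top-k MoE routing sketch (not Gemma's real code):
    score all experts, run only the top_k, mix with softmax gates."""
    scores = x @ router_weights                    # one score per expert
    top = np.argsort(scores)[-top_k:]              # indices of chosen experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()
    # Only the selected experts' weight matrices are touched this step.
    return sum(g * (x @ expert_weights[e]) for g, e in zip(gates, top))

rng = np.random.default_rng(0)
d, num_experts = 8, 128
x = rng.standard_normal(d)
experts = rng.standard_normal((num_experts, d, d))
router = rng.standard_normal((d, num_experts))
y = moe_layer(x, experts, router, top_k=2)
print(y.shape)
print(f"experts touched per token: {2/num_experts:.1%}")
```

With 128 experts and 2 active, each token touches under 2% of the expert weights — which is the mechanism behind "26B size, small-model speed."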

  5. Audio input on edge models — E2B and E4B can process audio. Not just text and images anymore.

Running Gemma 4 with Agent Mode

Since Gemma 4 has native function calling, it works out of the box with agent frameworks. In Locally Uncensored, it's already in the compatibility list — the app auto-detects it and enables tool calling (web search, file I/O, code execution, image gen).

For the 26B MoE variant specifically, agent tasks work surprisingly well. The MoE architecture seems to help with structured output — the model is more consistent at producing valid JSON tool calls compared to dense models at similar effective parameter counts.

If you're building your own agent setup, the key thing to know is that Gemma 4 follows the standard OpenAI function calling format through Ollama's API. No special prompting or template hacks needed.
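A minimal sketch of what that looks like in practice. The get_weather tool and the sample reply are made up for illustration; with a local Ollama server running, you'd POST the payload to http://localhost:11434/api/chat:

```python
import json

# OpenAI-style tool schema, as accepted by Ollama's /api/chat "tools" field.
# get_weather is a made-up example tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "gemma4",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "stream": False,
}

# When the model decides to call a tool, the reply carries
# message.tool_calls. Illustrative shape of such a reply:
sample_reply = {
    "message": {
        "role": "assistant",
        "tool_calls": [{
            "function": {"name": "get_weather", "arguments": {"city": "Oslo"}}
        }],
    }
}
for call in sample_reply["message"].get("tool_calls", []):
    fn = call["function"]
    print(fn["name"], json.dumps(fn["arguments"]))
```

You run the named function yourself, append the result as a "tool" role message, and call the API again for the final answer.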

Quick Benchmark Numbers

On my setup (running through Ollama):

| Model | Size | tok/s | Quality impression |
| --- | --- | --- | --- |
| gemma4:e2b | 2.3B eff. | Fast | Good for chat, weak on complex reasoning |
| gemma4:e4b | 4.5B eff. | Solid | Surprisingly capable, great daily driver |
| gemma4:26b | 3.8B active | Fast (MoE) | Punches way above weight class |
| gemma4:31b | 31B | Slower | Best quality, needs beefy hardware |

The E4B scoring 80% on coding benchmarks (HumanEval) with 4.5B effective parameters is genuinely wild. For context, Gemma 3 27B scored 29% on the same test.

VRAM Requirements

  • E2B: 4-6 GB VRAM (or CPU-only with 8GB RAM)
  • E4B: 6-8 GB VRAM
  • 26B MoE: 8-12 GB VRAM (despite being "26B")
  • 31B Dense: 16-20 GB VRAM
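As a sanity check on these ranges, here's the back-of-envelope math. Both constants are my rough assumptions for ~4-bit quantization, not measured values, and long-context KV cache can push real usage well above this:

```python
def vram_estimate_gb(params_b, bytes_per_weight=0.6, overhead_gb=2.0):
    """Rough VRAM estimate for a ~4-bit quantized model: weight bytes
    plus a flat allowance for KV cache and runtime buffers.
    Both constants are ballpark assumptions."""
    return params_b * bytes_per_weight + overhead_gb

# Note: the 26B MoE activates only ~3.8B params per token, but the full
# expert set still has to be resident, so memory tracks total size.
for name, params in [("e2b", 2.3), ("e4b", 4.5), ("31b dense", 30.7)]:
    print(f"{name}: ~{vram_estimate_gb(params):.1f} GB")
```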

The MoE model is the efficiency story here. "26B quality at 8GB VRAM" is the pitch, and from initial testing it largely delivers.

Try It

If you want a UI on top of Ollama that handles Gemma 4 out of the box — chat, agent mode, image gen, A/B model comparison — check out Locally Uncensored. Single .exe/.AppImage/.dmg, MIT licensed.

Or just ollama run gemma4 and start chatting in the terminal. Either way, this is probably the best small model Google has shipped.


What model size are you running? Curious about real-world experiences, especially the 26B MoE on different hardware.
