DEV Community

Amit Raz

I Ran Google's New Gemma 4 Models Locally (26B and 31B) — Here's What I Found

Google dropped Gemma 4 a few days ago and I immediately wanted to know: can
you actually run these things locally on consumer hardware? Not for a research
project. For real use.

I had two machines to test with:

  • An i9 with 96GB RAM and an RTX 4090
  • A 64-core / 128-thread AMD machine (CPU-only)

I ran the 26B and 31B variants. Here's what happened.


A quick note on the architecture

Before the numbers, one thing worth knowing: these two models are
architecturally different.

The 26B is a Mixture-of-Experts (MoE) model with 128 experts, but only
~4B parameters are active for any given token. All 26B parameters still have
to be loaded, but each token only touches a small fraction of them — that's
why it's fast, and why (quantized, as Ollama serves it by default) it fits
comfortably in VRAM despite the 26B label.

The 31B is a dense model — all 31 billion parameters are active on every
token. That's why it hits the memory wall hard.

This distinction explains everything you're about to see in the benchmarks.
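To make that concrete, here's a rough back-of-envelope model of decode speed, assuming generation is purely memory-bandwidth-bound. The bandwidth figures (~1008 GB/s for a 4090, ~80 GB/s effective for system RAM) and the 4-bit quantization are my assumptions, not measurements:

```python
# Rough model: at decode time, tokens/s is roughly bounded by how many
# bytes of weights must be read per token. All numbers here are assumptions.

def est_decode_tok_s(active_params_b: float,
                     bytes_per_param: float,
                     mem_bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/s if decoding is purely memory-bandwidth-bound."""
    bytes_per_token_gb = active_params_b * bytes_per_param  # GB read per token
    return mem_bandwidth_gb_s / bytes_per_token_gb

# MoE 26B: ~4B active params, 4-bit weights (~0.5 bytes/param), VRAM ~1008 GB/s
moe = est_decode_tok_s(4, 0.5, 1008)    # ~500 tok/s ceiling

# Dense 31B spilled to system RAM: all 31B params, DDR5 ~80 GB/s effective
dense = est_decode_tok_s(31, 0.5, 80)   # ~5 tok/s ceiling

print(f"MoE ceiling:   ~{moe:.0f} tok/s")
print(f"Dense ceiling: ~{dense:.0f} tok/s")
```

These ceilings are loose upper bounds, but the shape matches the benchmarks: a small active set keeps the MoE in the fast regime, while a dense model spilling into system RAM drops an order of magnitude.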


Setup

I used Ollama to pull and run both models:

ollama run gemma4:26b
ollama run gemma4:31b

Both support a 256K context window and native function calling out of the box.
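Ollama also exposes a local HTTP API, which is handy for scripting runs like the benchmarks below. Here's a minimal sketch using only the standard library. The timing fields (`eval_count`, `eval_duration`, and so on, with durations in nanoseconds) come from Ollama's `/api/generate` response; the model names are the ones from this post and assume you've already pulled them:

```python
import json
import urllib.request

def throughput(resp: dict) -> dict:
    """Compute prompt-eval and generation rates from an Ollama /api/generate
    response. Ollama reports durations in nanoseconds."""
    return {
        "prompt_eval_tok_s": resp["prompt_eval_count"]
                             / (resp["prompt_eval_duration"] / 1e9),
        "gen_tok_s": resp["eval_count"] / (resp["eval_duration"] / 1e9),
    }

def ask(model: str, prompt: str,
        host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming generate request to a local Ollama server."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

# Usage (requires a running Ollama server with the model pulled):
# resp = ask("gemma4:26b", "Explain MoE routing in two sentences.")
# print(throughput(resp))
```

The same numbers are printed by `ollama run --verbose`; the API just makes them easy to collect across many prompts.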


Benchmarks

I ran a mix of prompts: simple factual questions, some reasoning tasks, and
something heavier — a complex trading algorithm that uses AI-based prediction.
I asked the models to explain the logic and suggest improvements.

I also compared the outputs directly against Claude Code on the same prompts.

26B (MoE) on RTX 4090

| Metric | Value |
| --- | --- |
| Prompt eval rate | 15.56 tokens/s |
| Eval duration | ~10.5s |
| Generation rate | 149.56 tokens/s |

This is fast. Like, actually fast. 149 tokens per second means you're not
sitting and watching a cursor blink. It feels close to real-time. The MoE
architecture earns its keep here — only 4B parameters are active, so the
4090's 24GB VRAM handles it cleanly with room to spare.

31B (Dense) on RTX 4090

| Metric | Value |
| --- | --- |
| Prompt eval rate | 26.30 tokens/s |
| Eval duration | ~3m 5s |
| Generation rate | 7.84 tokens/s |

Big drop. Unlike the 26B, the dense 31B has to load all its parameters for
every token. It doesn't fit cleanly into the 4090's VRAM and spills into
system RAM — you feel every bit of it. For interactive use, this is painful.

The screenshot below shows what that looks like in Task Manager: GPU memory
is maxed out at ~45.9GB of the ~49GB available — far more than the 4090's
24GB of dedicated VRAM, so the rest is shared system memory. GPU usage reads
low (around 24%), not because there's no work, but because the GPU spends
most of its time waiting on data coming from system RAM.

26B (MoE) on AMD 64-core / 128-thread (CPU only)

| Metric | Value |
| --- | --- |
| Prompt eval rate | 45.33 tokens/s |
| Eval duration | ~3m 20s |
| Generation rate | 8.80 tokens/s |

Generation is slower, but the prompt eval rate is actually higher than on the
4090 — all those cores chew through the context quickly. Generation at 8.80
tokens/s is too slow for interactive chat, but more usable than you'd expect
for background tasks.


Quality

All three runs handled the trading algorithm task well. The output was
structured, accurate, and included reasonable improvement suggestions.

I compared the responses directly against Claude Code on the same prompts.
They were practically identical. Not "close enough" — genuinely hard to tell
apart on this type of task.

That surprised me. A model running locally on your own hardware, for free,
producing output indistinguishable from a frontier cloud API on complex
reasoning tasks.


The part that surprised me most — and it applies to every setup

Here's the thing that changed how I think about local models, regardless of
whether you're running on a GPU or CPU:

A local model isn't subject to API limits. No token limits per minute, no cost
per call, no rate limiting. If you're running agents that need to process large
contexts, search through a codebase, analyze documents, or run long autonomous
tasks — you can just let them run overnight. The agent works while you sleep.

For agentic workflows specifically, this is a bigger deal than the raw token/s
numbers suggest. An 8.80 tok/s model running uninterrupted for 8 hours
processes a lot more work than a faster cloud model that hits rate limits every
few minutes.
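Here's the arithmetic behind that claim. The cloud-side numbers are loudly illustrative assumptions — a model ~10x faster per token, but rate-limited down to a ~5% duty cycle:

```python
# Back-of-envelope: sustained local throughput vs a faster cloud model
# that keeps hitting rate limits. Cloud numbers are illustrative assumptions.

local_tok_s = 8.80
hours = 8
local_total = local_tok_s * hours * 3600          # runs uninterrupted

cloud_tok_s = 80.0       # assumption: ~10x faster per token
duty_cycle = 0.05        # assumption: paused ~95% of the time by rate limits
cloud_total = cloud_tok_s * hours * 3600 * duty_cycle

print(f"Local (8.80 tok/s, no limits):  {local_total:,.0f} tokens")
print(f"Cloud (80 tok/s, 5% duty):      {cloud_total:,.0f} tokens")
```

Even at a tenth of the per-token speed, the local model comes out ahead simply because it never stops.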


Verdict

The 26B MoE on a 4090 is the sweet spot right now. It fits cleanly in VRAM,
generates at 149 tok/s, and produces quality that holds up against frontier
models on reasoning tasks. For most local development and agentic use cases,
you won't feel a meaningful gap.

The 31B dense needs more VRAM than most people have. Unless you have a
multi-GPU setup or an M-series Mac with 64GB+, the memory pressure kills the
speed advantage you'd expect from the larger model.

The CPU-only path is more viable than I expected for non-latency-sensitive
work. If you have a powerful server without a GPU, the 26B MoE is genuinely
runnable for batch tasks.


What's next

In the next post I'll show how to connect Cursor, VS Code, and Claude Code to
a locally running model like this. That's where it becomes practically useful
for day-to-day development.


I'm Amit Raz, a Software Architect specializing in AI and software
development. I build tools and apps at rzailabs.com.
