<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aman Bhargav</title>
    <description>The latest articles on DEV Community by Aman Bhargav (@aman_bhargav_f699ac83671a).</description>
    <link>https://dev.to/aman_bhargav_f699ac83671a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3917853%2F0dd261e1-e3b1-4e7d-8a6e-6959c72ce7ab.png</url>
      <title>DEV Community: Aman Bhargav</title>
      <link>https://dev.to/aman_bhargav_f699ac83671a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aman_bhargav_f699ac83671a"/>
    <language>en</language>
    <item>
      <title>After Testing Gemma 4 Locally, I Finally Understand Why MoE Models Matter</title>
      <dc:creator>Aman Bhargav</dc:creator>
      <pubDate>Fri, 08 May 2026 06:57:48 +0000</pubDate>
      <link>https://dev.to/aman_bhargav_f699ac83671a/after-testing-gemma-4-locally-i-finally-understand-why-moe-models-matter-ho0</link>
      <guid>https://dev.to/aman_bhargav_f699ac83671a/after-testing-gemma-4-locally-i-finally-understand-why-moe-models-matter-ho0</guid>
      <description>&lt;p&gt;I’ve tested enough local models at this point to stop trusting benchmark charts.&lt;/p&gt;

&lt;p&gt;Most of them look impressive until you actually give them a real project.&lt;/p&gt;

&lt;p&gt;Then things fall apart:&lt;/p&gt;

&lt;p&gt;context gets messy&lt;br&gt;
reasoning becomes inconsistent&lt;br&gt;
responses drift&lt;br&gt;
code suggestions start contradicting earlier answers&lt;/p&gt;

&lt;p&gt;So when Google released the Gemma 4 models, I wasn’t expecting much beyond another benchmark-heavy launch.&lt;/p&gt;

&lt;p&gt;But after spending a few days testing the 26B MoE model locally, I think this is the first open Mixture-of-Experts model that actually feels stable enough for real development work.&lt;/p&gt;

&lt;p&gt;Not perfect.&lt;/p&gt;

&lt;p&gt;But noticeably different.&lt;/p&gt;

&lt;h2&gt;My Test Was Simple&lt;/h2&gt;

&lt;p&gt;Instead of synthetic prompts, I used an actual Rails codebase I work on regularly.&lt;/p&gt;

&lt;p&gt;I fed the model:&lt;/p&gt;

&lt;p&gt;Sidekiq workers&lt;br&gt;
service objects&lt;br&gt;
serializers&lt;br&gt;
migrations&lt;br&gt;
API integrations&lt;br&gt;
ActiveRecord scopes&lt;br&gt;
some old messy business logic I never cleaned up&lt;/p&gt;

&lt;p&gt;Around 40 files in total.&lt;/p&gt;
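
&lt;p&gt;If you want to reproduce the setup, this is roughly how I fed the files in. A minimal sketch against Ollama’s local &lt;code&gt;/api/generate&lt;/code&gt; endpoint (the glob paths and the model tag are assumptions, not my exact project):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch: concatenate project files and send them to a local
# Ollama server. The model tag is an assumption -- use whatever
# `ollama list` shows on your machine.
require "net/http"
require "json"

files = Dir.glob("app/{workers,services,serializers}/**/*.rb") +
        Dir.glob("db/migrate/*.rb")

prompt = "Review this Rails code. Point out stale migrations, " \
         "duplicated validations, and redundant retry logic.\n\n" +
         files.map { |f| "# #{f}\n" + File.read(f) }.join("\n\n")

res = Net::HTTP.post(
  URI("http://localhost:11434/api/generate"),
  { model: "gemma4", prompt: prompt, stream: false }.to_json,
  "Content-Type" =&amp;gt; "application/json"
)
puts JSON.parse(res.body)["response"]
&lt;/code&gt;&lt;/pre&gt;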

&lt;p&gt;This is usually where smaller or poorly optimized models start losing track of relationships between files.&lt;/p&gt;

&lt;p&gt;Especially once the context gets large.&lt;/p&gt;

&lt;p&gt;Gemma 4 held up longer than I expected.&lt;/p&gt;

&lt;p&gt;At one point it pointed out:&lt;/p&gt;

&lt;p&gt;a stale migration&lt;br&gt;
duplicated validation logic&lt;br&gt;
an unnecessary retry pattern inside a Sidekiq worker&lt;/p&gt;
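
&lt;p&gt;The retry finding is worth making concrete. Reconstructed from memory (hypothetical names, same shape as the real worker):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical reconstruction, not the actual worker: Sidekiq already
# retries failed jobs on its own, so the manual rescue/retry below
# just multiplies the attempts.
class SyncInvoicesWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5

  def perform(account_id)
    attempts = 0
    begin
      InvoiceSyncService.new(account_id).call
    rescue StandardError
      attempts += 1
      retry if attempts &amp;lt; 3 # redundant: Sidekiq handles this
      raise
    end
  end
end
&lt;/code&gt;&lt;/pre&gt;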

&lt;p&gt;Not groundbreaking individually.&lt;/p&gt;

&lt;p&gt;But the interesting part was that it maintained consistency while discussing those files together.&lt;/p&gt;

&lt;p&gt;That’s normally where local models start hallucinating.&lt;/p&gt;

&lt;h2&gt;The “Thinking” Behavior Felt Different&lt;/h2&gt;

&lt;p&gt;I tested the reasoning mode with a few debugging tasks.&lt;/p&gt;

&lt;p&gt;Adding:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;|think|&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;changed the responses more than I expected.&lt;/p&gt;

&lt;p&gt;Instead of immediately generating code, the model started breaking problems into smaller steps internally first.&lt;/p&gt;

&lt;p&gt;Sometimes it even corrected its own assumptions midway through reasoning.&lt;/p&gt;

&lt;p&gt;That sounds small, but behavior like that makes the model feel far more usable during debugging sessions.&lt;/p&gt;

&lt;p&gt;Less autocomplete.&lt;br&gt;
More actual reasoning.&lt;/p&gt;

&lt;p&gt;Still not comparable to frontier cloud models.&lt;/p&gt;

&lt;p&gt;But much closer than I expected from an open local model.&lt;/p&gt;

&lt;h2&gt;The MoE Architecture Finally Clicked for Me&lt;/h2&gt;

&lt;p&gt;Before this, most MoE models I tried felt inconsistent.&lt;/p&gt;

&lt;p&gt;You’d get:&lt;/p&gt;

&lt;p&gt;one excellent response&lt;br&gt;
then one completely confused answer&lt;br&gt;
then another strong answer again&lt;/p&gt;

&lt;p&gt;Gemma 4 felt more stable across longer sessions.&lt;/p&gt;

&lt;p&gt;After reading more about the “Always-On Shared Expert” design, that behavior started making sense.&lt;/p&gt;
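
&lt;p&gt;My rough mental model of that design, as a toy sketch -- not Gemma 4’s actual routing code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy sketch of an MoE layer with an always-on shared expert:
# every token goes through the shared expert, and only the top-k
# routed experts are added on top.
def moe_layer(token, shared_expert:, experts:, router:, k: 2)
  output = shared_expert.call(token) # always on, every token
  scores = experts.map { |e| router.call(token, e) }
  top_k  = scores.each_index.max_by(k) { |i| scores[i] }
  top_k.each { |i| output += experts[i].call(token) }
  output
end
&lt;/code&gt;&lt;/pre&gt;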

&lt;p&gt;The responses felt less chaotic between prompts.&lt;/p&gt;

&lt;p&gt;For coding workflows, that honestly matters more than benchmark spikes.&lt;/p&gt;

&lt;p&gt;I care less about leaderboard numbers and more about whether the model stays coherent after 30 minutes of back-and-forth debugging.&lt;/p&gt;

&lt;h2&gt;The Context Window Is Actually Useful&lt;/h2&gt;

&lt;p&gt;A lot of models advertise huge context windows now.&lt;/p&gt;

&lt;p&gt;Very few stay reliable when you push them hard.&lt;/p&gt;

&lt;p&gt;I tested Gemma 4 with larger chunks of project structure and it handled repository-level understanding surprisingly well.&lt;/p&gt;

&lt;p&gt;Not perfectly.&lt;/p&gt;

&lt;p&gt;But well enough that I’d realistically use it for:&lt;/p&gt;

&lt;p&gt;onboarding into old codebases&lt;br&gt;
tracing Sidekiq flows&lt;br&gt;
understanding legacy service layers&lt;br&gt;
finding duplicated business logic&lt;br&gt;
reviewing migrations&lt;/p&gt;

&lt;p&gt;That’s the first time I’ve seriously considered using a local model regularly for repository analysis.&lt;/p&gt;

&lt;h2&gt;Where It Still Struggles&lt;/h2&gt;

&lt;p&gt;There are still obvious limitations.&lt;/p&gt;

&lt;p&gt;I noticed weaker performance with:&lt;/p&gt;

&lt;p&gt;deeply nested metaprogramming&lt;br&gt;
very niche gems&lt;br&gt;
long autonomous coding loops&lt;br&gt;
highly abstract architecture discussions&lt;/p&gt;

&lt;p&gt;And realistically, the larger variants still require more hardware than most developers have access to.&lt;/p&gt;

&lt;p&gt;The AI industry keeps saying “local AI for everyone,” but large models are still expensive to run comfortably.&lt;/p&gt;

&lt;p&gt;That part hasn’t magically changed.&lt;/p&gt;

&lt;h2&gt;The Apache 2.0 License Might Matter More Than the Model&lt;/h2&gt;

&lt;p&gt;Honestly, this may end up being the biggest long-term win.&lt;/p&gt;

&lt;p&gt;The Apache 2.0 licensing removes a lot of hesitation around enterprise adoption.&lt;/p&gt;

&lt;p&gt;A powerful model with unclear licensing is still difficult to use safely inside real products.&lt;/p&gt;

&lt;p&gt;Gemma 4 finally feels deployable without legal uncertainty hanging over it.&lt;/p&gt;

&lt;p&gt;That changes things for startups and internal tooling teams immediately.&lt;/p&gt;

&lt;h2&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;I don’t think Gemma 4 replaces GPT-5 or Claude for difficult engineering work.&lt;/p&gt;

&lt;p&gt;That’s not the point.&lt;/p&gt;

&lt;p&gt;The important shift is this:&lt;/p&gt;

&lt;p&gt;Open local models are finally becoming practical enough that developers can genuinely build around them instead of just experimenting with them.&lt;/p&gt;

&lt;p&gt;And honestly, this is the first open MoE model I’ve used where that future felt believable.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Ran Gemma 4 on an 8GB Laptop Expecting a Toy Model. I Was Completely Wrong.</title>
      <dc:creator>Aman Bhargav</dc:creator>
      <pubDate>Fri, 08 May 2026 06:54:52 +0000</pubDate>
      <link>https://dev.to/aman_bhargav_f699ac83671a/i-ran-gemma-4-on-an-8gb-laptop-expecting-a-toy-model-i-was-completely-wrong-42oo</link>
      <guid>https://dev.to/aman_bhargav_f699ac83671a/i-ran-gemma-4-on-an-8gb-laptop-expecting-a-toy-model-i-was-completely-wrong-42oo</guid>
      <description>&lt;p&gt;Every AI release claims to be “efficient now.”&lt;/p&gt;

&lt;p&gt;Most of the time, that translates to:&lt;/p&gt;

&lt;p&gt;still needs expensive hardware&lt;br&gt;
still feels slow locally&lt;br&gt;
still breaks on reasoning tasks&lt;/p&gt;

&lt;p&gt;So when Google released Gemma 4 E2B, I honestly assumed it would be another lightweight model that looked good in benchmarks and failed in real usage.&lt;/p&gt;

&lt;p&gt;I tested it anyway.&lt;/p&gt;

&lt;p&gt;And after a week of running it locally, I think small models just crossed an important line.&lt;/p&gt;

&lt;h2&gt;My Setup&lt;/h2&gt;

&lt;p&gt;Nothing fancy.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ollama run gemma4:2b
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Hardware:&lt;/p&gt;

&lt;p&gt;MacBook Air M1&lt;br&gt;
8GB RAM&lt;br&gt;
Ollama&lt;br&gt;
No external GPU&lt;/p&gt;

&lt;p&gt;Performance I saw:&lt;/p&gt;

&lt;p&gt;~40 tokens/sec average&lt;br&gt;
First pull took around 3 minutes&lt;br&gt;
RAM usage stayed around 5GB&lt;br&gt;
Fan noise was surprisingly manageable&lt;/p&gt;

&lt;p&gt;Most importantly:&lt;br&gt;
it actually felt responsive enough to use continuously.&lt;/p&gt;

&lt;p&gt;That’s rare for local models on weak hardware.&lt;/p&gt;

&lt;h2&gt;The Moment That Changed My Opinion&lt;/h2&gt;

&lt;p&gt;I tested a simple logic puzzle first.&lt;/p&gt;

&lt;p&gt;The kind of question smaller models usually get wrong because they rush straight to an answer.&lt;/p&gt;

&lt;p&gt;Without reasoning enabled:&lt;br&gt;
wrong answer instantly.&lt;/p&gt;

&lt;p&gt;Then I tested Gemma 4’s Thinking Mode.&lt;/p&gt;

&lt;p&gt;I added:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;|think|&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;before the task.&lt;/p&gt;

&lt;p&gt;And the behavior changed completely.&lt;/p&gt;

&lt;p&gt;Instead of rushing, the model started breaking the problem into steps internally before responding.&lt;/p&gt;

&lt;p&gt;It literally looked like the model was “thinking out loud.”&lt;/p&gt;

&lt;p&gt;That was the first moment where a 2B local model genuinely felt different.&lt;/p&gt;

&lt;p&gt;Not smarter in a benchmark sense.&lt;/p&gt;

&lt;p&gt;Smarter in behavior.&lt;/p&gt;
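
&lt;p&gt;If you want to reproduce it, this is roughly what I ran against Ollama’s local API. The puzzle here is a stand-in, and the raw-token handling is an assumption:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch: the same question with and without the thinking
# token, against a local Ollama server.
require "net/http"
require "json"

def ask(prompt)
  res = Net::HTTP.post(
    URI("http://localhost:11434/api/generate"),
    { model: "gemma4:2b", prompt: prompt, stream: false }.to_json,
    "Content-Type" =&amp;gt; "application/json"
  )
  JSON.parse(res.body)["response"]
end

puzzle = "A bat and a ball cost 1.10 together. The bat costs 1.00 " \
         "more than the ball. What does the ball cost?"

puts ask(puzzle)                       # rushes straight to an answer
puts ask("&amp;lt;|think|&amp;gt;\n" + puzzle)      # steps through it first
&lt;/code&gt;&lt;/pre&gt;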

&lt;h2&gt;The Most Underrated Feature: Native Audio&lt;/h2&gt;

&lt;p&gt;This honestly surprised me more than the reasoning.&lt;/p&gt;

&lt;p&gt;I tested raw audio input using a messy voice note where I explained a Rails debugging issue while walking outside.&lt;/p&gt;

&lt;p&gt;No Whisper pipeline.&lt;br&gt;
No speech-to-text preprocessing.&lt;br&gt;
No extra tooling.&lt;/p&gt;

&lt;p&gt;Gemma 4 understood the context directly from audio input.&lt;/p&gt;

&lt;p&gt;That matters more than people realize.&lt;/p&gt;

&lt;p&gt;Most “voice AI” stacks today are still multiple systems stitched together:&lt;/p&gt;

&lt;p&gt;speech recognition&lt;br&gt;
cleanup&lt;br&gt;
context formatting&lt;br&gt;
LLM inference&lt;/p&gt;

&lt;p&gt;Gemma 4 reducing that complexity is a huge deal for local privacy-first apps.&lt;/p&gt;

&lt;p&gt;Especially for:&lt;/p&gt;

&lt;p&gt;offline assistants&lt;br&gt;
internal enterprise tooling&lt;br&gt;
edge devices&lt;br&gt;
mobile AI workflows&lt;/p&gt;

&lt;h2&gt;I Tried Feeding It Real Code&lt;/h2&gt;

&lt;p&gt;Benchmarks are one thing.&lt;/p&gt;

&lt;p&gt;Real projects are another.&lt;/p&gt;

&lt;p&gt;So I gave it pieces of a Rails project:&lt;/p&gt;

&lt;p&gt;Sidekiq jobs&lt;br&gt;
service objects&lt;br&gt;
migrations&lt;br&gt;
serializers&lt;br&gt;
ActiveRecord scopes&lt;/p&gt;

&lt;p&gt;And honestly?&lt;/p&gt;

&lt;p&gt;It performed better than I expected on:&lt;/p&gt;

&lt;p&gt;debugging&lt;br&gt;
explaining legacy code&lt;br&gt;
identifying duplicated logic&lt;br&gt;
finding missing indexes&lt;/p&gt;
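
&lt;p&gt;The missing-index catch, reconstructed (hypothetical schema, but representative of what it flagged):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical reconstruction: a scope filtering on an unindexed
# column, plus the index migration the model suggested.
class Order &amp;lt; ApplicationRecord
  scope :pending_export, -&amp;gt; { where(export_status: "pending") }
end

class AddIndexToOrdersOnExportStatus &amp;lt; ActiveRecord::Migration[7.1]
  def change
    add_index :orders, :export_status
  end
end
&lt;/code&gt;&lt;/pre&gt;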

&lt;p&gt;Where it still struggles:&lt;/p&gt;

&lt;p&gt;large architectural refactors&lt;br&gt;
very deep Rails metaprogramming&lt;br&gt;
maintaining consistency across long sessions&lt;br&gt;
niche gems with poor documentation&lt;/p&gt;

&lt;p&gt;So no, this is not replacing larger cloud models yet.&lt;/p&gt;

&lt;p&gt;But that’s also the wrong comparison.&lt;/p&gt;

&lt;h2&gt;The Real Story Here&lt;/h2&gt;

&lt;p&gt;The important part isn’t that Gemma 4 beats massive cloud models.&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;The important part is this:&lt;/p&gt;

&lt;p&gt;For the first time, running capable AI locally feels practical without needing expensive hardware.&lt;/p&gt;

&lt;p&gt;That changes who gets access to AI development.&lt;/p&gt;

&lt;p&gt;Students.&lt;br&gt;
Indie hackers.&lt;br&gt;
Developers in low-resource environments.&lt;br&gt;
Privacy-focused teams.&lt;/p&gt;

&lt;p&gt;Small local models used to feel like demos.&lt;/p&gt;

&lt;p&gt;Gemma 4 is the first one I’ve used that feels like an actual tool.&lt;/p&gt;

&lt;p&gt;And honestly, I didn’t expect to say that.&lt;/p&gt;

</description>
      <category>gemmachallenge</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
