This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I've been thinking about AI wrong for the past two years.
Not completely wrong. But there's an assumption I'd quietly accepted without realizing it — that serious AI models require serious hardware. That if you wanted something capable enough to reason, handle images, work through multi-step problems, you needed a cloud API, or at minimum a machine with a real GPU. That the tradeoff between capability and accessibility was fixed: more of one meant less of the other.
Gemma 4 broke that assumption for me. And the thing that broke it wasn't a benchmark number or a blog post. It was one sentence buried in the release notes:
The E2B model runs on a Raspberry Pi 5.
What Gemma 4 Actually Is
Google DeepMind released Gemma 4 on April 2, 2026, under an Apache 2.0 license: fully open-weight and commercially usable, subject only to the license's standard attribution and notice terms. It's built from the same research as Gemini 3, which is a meaningful statement: this isn't Google's B-team effort. It's frontier-level research packaged for the open ecosystem.
The family ships as four distinct variants, each targeting a different tier of hardware:
| Model | Architecture | Active Params | Target Hardware |
|---|---|---|---|
| E2B | PLE (edge) | ~2.3B | Raspberry Pi, smartphones |
| E4B | PLE (edge) | ~4.5B | Mobile devices, laptops |
| 26B A4B | MoE (8 of 128 experts) | ~4B active | Consumer GPU |
| 31B | Dense | 30.7B | Workstation / multi-GPU |
Every single one of them is multimodal from the ground up: text, images, video. The smaller models also handle audio natively. The context window is 128K tokens for the edge models and 256K tokens for the larger ones. All support function calling, multi-step reasoning, and configurable thinking modes.
Those are the table stakes. Here's what's actually interesting.
The Part That Stopped Me
The E2B model runs in approximately 1.5 GB of RAM at INT4 quantization.
A Raspberry Pi 5 with 8 GB of RAM can run it. Google published numbers confirming this — it's not theoretical, it's a tested deployment target.
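To see roughly where a figure like 1.5 GB comes from, here is a back-of-the-envelope sketch. It assumes the ~2.3B parameter count from the table above and counts only the bytes needed to store the weights. The helper name `weight_memory_gb` is mine, not from any Gemma tooling, and whether that parameter count covers everything the runtime keeps in memory is not something the release material I read spells out.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumption: ~2.3B parameters, taken from the variant table in this post.
E2B_PARAMS = 2.3e9

for bits in (16, 8, 4):
    size = weight_memory_gb(E2B_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{size:.2f} GB")
```

At 4 bits the weights alone come to a bit over 1 GB. The gap between that and the quoted ~1.5 GB is plausibly runtime overhead such as the KV cache and activations, though that split is my guess rather than a published breakdown.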
I keep coming back to what that actually means. A Raspberry Pi 5 costs around $80. It's a credit-card-sized single-board computer. And it can now run a multimodal AI model that handles text, images, and audio, with a 128K context window, with reasoning capabilities, offline, with no API calls, no cloud dependency, no subscription.
For the past two years, "run AI locally" has been something that required either a powerful laptop or a desktop with a dedicated GPU. The conversation around local AI models has largely been for people with the hardware to run 7B+ parameter models comfortably. E2B changes who that conversation is for.
Why This Is a Bigger Deal Than the Benchmarks
I want to be honest about something: the benchmarks for Gemma 4 are genuinely impressive, but I don't think they're the most important part of this release.
The 31B model sits at #3 on Arena AI's open model leaderboard (Elo 1452 as of release), scores 89.2% on AIME 2026, and hits 80% on LiveCodeBench. Those are strong numbers that hold up against Qwen 3.5 and Llama 4 Scout in most categories.
But benchmark performance is something you read and nod at. The Raspberry Pi deployment is something that changes what you can build.
Think about the categories of projects that become possible when a capable multimodal model can run locally on $80 hardware:
Privacy-first applications. Medical data, personal journals, private documents — things you'd never send to a cloud API. A model that runs entirely on your own device means sensitive data never leaves it.
Offline-first tooling. Field work, remote locations, environments with unreliable connectivity. A capable AI model that works without internet is a different category of useful.
Embedded systems. NVIDIA Jetson devices, edge computing nodes, IoT hardware with more capability than a microcontroller. Gemma 4 E2B was specifically built with these targets in mind.
Projects for students and developers in contexts where cloud API costs are a real barrier. $80 hardware is still not nothing. But it's a different category of accessible than "pay per token forever."
How the Four Variants Actually Fit Together
After spending time with the model card and the community documentation, here's how I'd think about which variant to reach for:
E2B: for genuine edge deployment. Raspberry Pi, smartphones, embedded hardware. If your constraint is RAM and you need offline capability, this is the one. Don't reach for it if you're on a laptop; you'd be limiting yourself unnecessarily.
E4B: the sweet spot for most personal projects and local laptop experimentation. Fits in roughly 5 GB RAM. Strong enough for real tasks, accessible enough to run comfortably on most modern machines.
26B A4B (MoE): deceptively efficient. It has 26 billion total parameters but activates only around 4 billion per token thanks to Mixture-of-Experts routing. If you have a consumer GPU with 8-12 GB VRAM, this is where serious capability starts without the full cost of running a dense model. The MoE architecture means inference is faster than the parameter count suggests, because compute scales with the active parameters. Memory is a separate matter: all 26 billion weights still have to live somewhere, so fitting into that VRAM range depends on aggressive quantization or on offloading part of the model to system RAM.
31B Dense: for when you need the ceiling. All 30.7 billion parameters are active on every pass. Highest benchmark scores, highest hardware requirements. Realistically a multi-GPU or high-VRAM workstation setup.
One thing I appreciate about how Google structured this: these aren't just size variants of the same model. The E2B and E4B use Per-Layer Embeddings rather than traditional MoE routing — it's a genuinely different architectural approach to making models efficient at the edge, not just a quantized-down version of the larger model.
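For anyone who hasn't seen Mixture-of-Experts routing written down, here is a toy sketch of generic top-k gating using the "8 of 128 experts" figure from the table. This is an illustration of the general technique only: the material I read doesn't describe Gemma 4's actual router, and the names `route_token` and `moe_layer` and the toy scaling "experts" are all made up for the example.

```python
import math
import random

NUM_EXPERTS = 128  # from the variant table in this post
TOP_K = 8          # from the variant table in this post


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and return {expert_index: weight}.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    max_logit = max(router_logits[i] for i in chosen)  # for numerical stability
    exps = {i: math.exp(router_logits[i] - max_logit) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(token, experts, router_logits, top_k=TOP_K):
    """Run only the selected experts and combine their outputs by weight."""
    weights = route_token(router_logits, top_k)
    return sum(w * experts[i](token) for i, w in weights.items())


# Toy experts (hypothetical): expert i just multiplies its input by i + 1.
experts = [lambda x, s=i + 1: x * s for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route_token(logits)

print(f"experts evaluated: {len(weights)} of {NUM_EXPERTS}")
print(f"layer output for token 1.0: {moe_layer(1.0, experts, logits):.3f}")
```

The point of the sketch is the shape of the computation: every expert exists in memory, but only a handful run for any given token, which is why compute tracks the active parameter count rather than the total.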
What I'm Actually Thinking About Building
I'll be honest: my hardware situation right now is a mid-range laptop and a Raspberry Pi 4 I've had sitting around for two years mostly running Pi-hole. Not exactly a GPU workstation.
Before Gemma 4, my options for local AI were limited enough that I mostly reached for APIs. After reading through the E2B specs, I'm genuinely reconsidering that default.
What I want to explore: a local document analysis tool that processes my own notes and files without any of that data leaving my machine. The 128K context window on E2B means I could feed in reasonably sized documents. The multimodal support means I could include images and diagrams. The offline capability means it works whether I'm connected or not.
That's not a groundbreaking project. But it's exactly the kind of project I'd been mentally filing under "not possible without a paid API" — and Gemma 4 moves it into the "actually try it this weekend" category.
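A first pass at that weekend project could look something like the sketch below. It assumes a local Ollama server on its default port and uses Ollama's `/api/generate` endpoint with streaming turned off. The model tag `gemma4:e2b` is a placeholder I made up, so check what your install actually calls the model, and `num_ctx=8192` is just a conservative starting value, not a recommendation. The function names are mine.

```python
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:e2b"  # hypothetical tag; verify against your local model list


def build_payload(document_text, question, model=MODEL_TAG, num_ctx=8192):
    """Build the JSON body for a single non-streaming generate request."""
    prompt = (
        "Answer the question using only the document below.\n\n"
        f"--- DOCUMENT ---\n{document_text}\n--- END DOCUMENT ---\n\n"
        f"Question: {question}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object, not chunks
        "options": {"num_ctx": num_ctx},  # context length the runtime allocates
    }


def ask_local_model(payload, url=OLLAMA_URL, timeout=300.0):
    """POST the payload to the local server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def analyze_file(path, question):
    """Read a local text file and ask the local model a question about it."""
    text = Path(path).read_text(encoding="utf-8")
    return ask_local_model(build_payload(text, question))
```

With the server running, `analyze_file("notes.md", "Summarize the open action items.")` would return the model's answer as a string (`notes.md` being whatever file you point it at). Nothing in the request leaves localhost, which is the whole appeal.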
The Honest Caveat
The E2B running on a Raspberry Pi is real, but context matters: Google tested with INT4 quantization, which trades some precision for memory efficiency. Output quality at INT4 is lower than at higher precision. For tasks requiring nuanced reasoning or precise outputs, you'll notice the difference.
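To make "trades some precision" concrete, here is a minimal sketch of the simplest possible 4-bit scheme: one symmetric scale for the whole list of weights. Real INT4 formats typically use per-group scales and other refinements, and I don't know which scheme Google used, so treat this as an illustration of where rounding error comes from rather than a description of Gemma 4's quantization. The sample weights are invented.

```python
def quantize_int4(values):
    """Symmetric quantization to signed 4-bit codes using a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest value to +/-7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Made-up weights, purely for illustration.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.0, 0.61, -0.12]

codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("4-bit codes:", codes)
print(f"step size: {scale:.4f}, largest round-trip error: {max_error:.4f}")
```

Each weight can be off by up to half a step after the round trip. Spread across billions of weights, that is the precision being traded away for the smaller footprint.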
The 31B dense model's hardware requirements are also substantial — up to 19 GB RAM in some configurations. "Runs on consumer hardware" is true for the family, but not uniformly true across all variants.
And the MoE architecture of the 26B, while efficient, behaves differently in practice than a dense model of equivalent active parameter count. Worth benchmarking for your specific use case rather than assuming the numbers translate directly.
Why Open-Weight Matters Here
Gemma 4 ships under Apache 2.0. That's not just a licensing detail — it's what makes the edge deployment story meaningful for the long term.
A proprietary model that runs locally is still a dependency on the vendor's continued goodwill, roadmap, and pricing decisions. An Apache 2.0 model is something you can fork, fine-tune, redistribute, and build on without those constraints. The 100,000+ community variants already built on earlier Gemma models show how much experimentation open weights invite.
For developers building on top of Gemma 4 — fine-tuning it for a specific domain, integrating it into a product, deploying it at the edge — the license is as important as the capability.
Where I Land
I started this thinking about Gemma 4 primarily as another open model release in a busy year of open model releases. I'm ending it thinking about it as something slightly different: a signal that the capability-accessibility tradeoff in AI is more flexible than I'd assumed.
Running a capable multimodal model with a 128K context window offline on an $80 single-board computer is not the ceiling of what Gemma 4 can do. It's the floor.
That reframe matters for what I think about building next.
Wrote this after going through the Gemma 4 model card, release blog, and community documentation. To get started: Gemma 4 on Hugging Face, Ollama setup guide, or the official Google AI docs.