The Question That Started This
My internet went out for three hours last Tuesday. In the middle of a coding session.
For most of the last five years, that would have been the end of my AI-assisted workflow. No API call, no GPT-4, no Gemini nothing. But this time I just opened a terminal and kept working. Because Gemma 4 was already running on my own machine.
That three-hour outage became, unexpectedly, one of the most clarifying experiences I've had as a developer in 2026. Not because AI saved me but because local AI saved me. And it made me think hard about what it actually means that Google just released one of the most capable AI model families in the world under a license that lets anyone download it, run it, and build with it for free.
This is my attempt to share that experience starting with how to actually get Gemma 4 running on your hardware, and ending with the question I can't stop thinking about.
What Gemma 4 Actually Is (The Short Version)
Released on April 2, 2026, Gemma 4 is Google DeepMind's newest family of open-weight multimodal models, built under a fully permissive Apache 2.0 license. That last part Apache 2.0 is the headline that didn't get enough headlines. Previous Gemma releases used Google's custom Terms of Use. This time, Google went fully open-source. You can use these models commercially, modify them, integrate them into products, and redistribute them. No special permissions required.
Gemma 4 shipped with four model sizes, two architectural patterns (dense and MoE), and a clean split between edge and server tiers. Here's what the family looks like:
| Model | Parameters | Best For | Context Window |
|---|---|---|---|
| E2B | ~2.3B effective | Smartphones, Raspberry Pi, IoT | 128K |
| E4B | ~4.5B effective | Laptops, mid-range devices | 128K |
| 26B A4B (MoE) | 26B total / 4B active | Desktop GPUs, power users | 256K |
| 31B Dense | 31B | Workstations, GPU servers | 256K |
The magic in that table is the 26B A4B model. It uses a Mixture-of-Experts architecture, meaning only 4 billion parameters fire on any given forward pass, so latency and cost behave like a 4B model, while quality benefits from the full 26B parameter pool. You get near-flagship performance at mid-tier hardware cost. That is a genuinely big deal.
The Numbers That Matter
Before we get to setup, let me briefly share why this model is worth running at all.
The 31B Gemma 4 model scores 89.2% on AIME 2026, a mathematics benchmark where most models struggle to reach 60%. More impressively, it achieves 80% on LiveCodeBench v6 outperforming Llama 4 Maverick's 43.4% despite having 13x fewer parameters.
Read that again. 13x fewer parameters. Better coding results.
Benchmarks place the 31B model at #3 on Arena AI's open model leaderboard with an ELO of 1452. It also supports text and image inputs with variable aspect ratio and resolution (all models), plus video and audio natively on the E2B and E4B edge models, along with 140+ languages.
These aren't "good for an open model" numbers. These are just good numbers.
Getting Started: Running Gemma 4 Locally in Under 10 Minutes
Step 1: Install Ollama
Ollama is the easiest way to run Gemma 4 locally. It handles model downloads, quantization, and serving all from a single command-line tool.
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer directly from ollama.com
Verify the install:
ollama --version
# Should show 0.22.x or higher (Gemma 4 requires 0.20.0+)
Step 2: Pick Your Model Size
Choose based on your hardware. Here's the honest guide:
# 8GB RAM, no dedicated GPU — start here
ollama pull gemma4:2b
# 16GB RAM, mid-range GPU (GTX 1080 / M1 Mac) — best balance
ollama pull gemma4:4b
# 24GB+ VRAM (RTX 3090 / 4090 / M2 Max) — excellent quality
ollama pull gemma4:27b
My recommendation if you're unsure: Start with
gemma4:4b. It's the sweet spot fast enough to feel responsive, smart enough to actually be useful for real coding tasks. You can always pull a larger model later.
Download times vary. The 4B model is roughly 3GB and downloads in 5–10 minutes on a standard broadband connection.
Step 3: Run Your First Chat
Once downloaded, it's one command:
ollama run gemma4:4b
You'll drop into an interactive terminal session. Try prompting it with something real:
>>> Explain the difference between MoE and dense transformer architectures
in simple terms, then give me a practical example of when I'd choose one over the other.
You'll notice two things immediately: how fast the response starts, and how coherent the reasoning is. This is running entirely on your machine, with no network request involved.
Step 4: Fix the Context Window (Important!)
Ollama has a known quirk with Gemma 4 it silently defaults to a 4K context window instead of the model's full capacity. To unlock the full 128K or 256K context your model supports, you need to set num_ctx explicitly:
ollama run gemma4:4b --num_ctx 65536
Or, for a persistent configuration, create a Modelfile:
FROM gemma4:4b
PARAMETER num_ctx 65536
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful assistant with access to a large context window. Use it."
Then build your custom variant:
ollama create gemma4-fullctx -f Modelfile
ollama run gemma4-fullctx
This small change makes a significant practical difference especially for longer codebases, document analysis, or multi-file reasoning tasks.
Step 5: Connect to Your Code Editor
Running Gemma 4 in the terminal is fine for testing, but you'll want it in your actual workflow. With Ollama running, it exposes a local OpenAI-compatible API at http://localhost:11434.
Using Continue.dev in VS Code:
Install the Continue extension, then add this to your ~/.continue/config.json:
{
"models": [
{
"title": "Gemma 4 (Local)",
"provider": "ollama",
"model": "gemma4:4b",
"contextLength": 65536
}
]
}
Restart VS Code and you'll have Gemma 4 available as your inline code assistant Tab to autocomplete, Cmd+L to chat with zero API costs and zero data leaving your machine.
Bonus: Using the Multimodal Capability
Gemma 4 understands images natively. Here's a quick Python example that sends an image for analysis:
import ollama
response = ollama.chat(
model='gemma4:4b',
messages=[
{
'role': 'user',
'content': 'What is shown in this image? Describe any text you can see.',
'images': ['./screenshot.png'] # Path to your image
}
]
)
print(response['message']['content'])
I've been using this for debugging UI screenshots paste a broken layout image and ask Gemma 4 to identify the likely CSS issue. It's surprisingly accurate.
What Nobody Else Is Talking About
Here's the part I keep coming back to, beyond the setup guide.
Since the first Gemma models launched, developers around the world have downloaded them over 500 million times and created more than 100,000 custom variants. That level of adoption isn't accident it tells you something real about what the developer community has been hungry for: open models that are practical, fast, and deployable beyond the cloud.
Gemma 4 is the first time that hunger is fully satisfied.
Every AI model before this required a tradeoff. Want power? Pay for the API. Want privacy? Accept weaker models. Want no usage limits? Live with slow inference. Gemma 4 collapses that tradeoff in a way no previous open model has managed. The 31B model ranks #3 on open model leaderboards outperforming models twenty times its size and it runs on hardware that a mid-career developer could reasonably own.
Think about what that actually means:
- A developer in Lagos building a healthcare chatbot can use frontier AI without touching a cloud bill
- A student in Jakarta can fine-tune a model on their own data without sending that data anywhere
- A startup in Berlin can ship a product powered by Gemma 4 without negotiating API terms with a giant
Gemma 4 shifts the focus from pure scale to practical intelligence, making frontier-level capabilities accessible for local and edge deployment, significantly narrowing the gap between open and proprietary models in reasoning and multimodal tasks.
That's not a benchmark. That's a geopolitical shift in who gets access to cutting-edge AI.
My Honest Take: Where It Falls Short
Because a submission that's just praise isn't useful here's what Gemma 4 still doesn't fully nail:
Complex multi-file reasoning at scale. The 4B and even 27B models can struggle when given a large, messy codebase and asked to reason about architectural decisions across 30+ files simultaneously. GPT-4.5 and Gemini 3.5 still have an edge on truly sprawling projects.
The 256K context window is theoretical for most users. Actually running the 31B model with a full 256K context requires serious hardware. On consumer hardware, you'll realistically be working at 64–128K still excellent, but worth knowing going in.
Fine-tuning documentation is still catching up. The model is capable of fine-tuning for specific domains, but official Google documentation and tooling around PEFT/LoRA workflows for Gemma 4 is thinner than you'd want for production use.
None of these are dealbreakers. But if you're planning to deploy Gemma 4 in a serious production context, go in with clear eyes about the current state.
The Bigger Picture
When my internet came back on after those three hours, I didn't switch back to the cloud. I just kept working with Gemma 4.
That's the sentence I didn't expect to write six months ago.
We've spent the last few years talking about AI democratization as a future thing something that would happen when models got smaller or hardware got cheaper. Gemma 4 is the moment that future became present. From phones and Raspberry Pi to GPU workstations you keep your data on-device and get up to 256K tokens of context. No subscription. No data leaving your network. No dependency on a server farm 10,000 miles away.
The question this model leaves me with is the one I think the whole industry should be sitting with in 2026:
If frontier AI can now run locally, privately, and for free what does that do to the assumption that AI is a cloud service?
I don't have a clean answer. But I think the developers who ask that question seriously, and build accordingly, are going to be the ones who define what software looks like in five years.
Go pull gemma4:4b. Start there. See what you build.
Quick Reference Card
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull your model (choose one)
ollama pull gemma4:2b # 8GB RAM minimum
ollama pull gemma4:4b # 16GB RAM recommended
ollama pull gemma4:27b # 24GB+ VRAM
# Run with full context window
ollama run gemma4:4b --num_ctx 65536
# Local API endpoint (OpenAI-compatible)
http://localhost:11434/v1
# Multimodal: text + image in Python
import ollama
ollama.chat(model='gemma4:4b', messages=[{'role':'user','content':'describe this','images':['img.png']}])
This post was written for the Google I/O 2026 "Write About Gemma 4" community challenge. All setup steps were verified on Ubuntu 24.04 and macOS Sequoia in May 2026. Ollama version: 0.22.x.
Top comments (1)
Thank you for reading through!