This is a submission for the Gemma 4 Challenge: Write About Gemma 4
The single most common mistake developers make when picking a local model is choosing based on benchmark scores. The second most common mistake is choosing based on what fits in VRAM.
Both of those things matter. But neither one is the actual first question.
The actual first question is: where does your model need to live, and what does it need to do there?
Gemma 4 ships in four variants - E2B, E4B, 26B A4B (MoE), and 31B - and Google made very deliberate architectural choices for each one. If you understand those choices, picking the right variant takes about five minutes. If you skip that step and benchmark-shop, you'll end up either underbuilding (a phone-ready E4B doing work that needs 256K context) or overbuilding (a 31B model sitting on $80/month of cloud compute when an E4B running locally would have been fine).
This post is that five-minute decision guide.
What Gemma 4 Actually Is
Released on April 2, 2026 under Apache 2.0, Gemma 4 is Google DeepMind's latest open-weight model family. Every variant ships with multimodal understanding (text + image as baseline, audio natively on the two smallest models), native function calling, and support for over 140 languages.
The headline capability that separates Gemma 4 from previous generations isn't any single feature. It's the intelligence-per-parameter ratio. The 26B MoE model only activates roughly 4B parameters per forward pass. The E4B runs on a phone. The 31B scores 89.2% on AIME 2026 math benchmarks - a score that would have required a model several times larger just a year ago.
The architecture decisions that make this possible:
- Alternating local/global attention layers (local layers use sliding windows of 512-1024 tokens, global layers handle long-range context)
- Per-Layer Embeddings (PLE) on the edge variants, which keeps the parameter count low while maintaining expressivity
- Mixture-of-Experts on the 26B that routes each token through only the relevant expert layers, not the full network
This isn't just efficiency for efficiency's sake. It's what allows a 4-billion-parameter model to run offline on an Android phone with 4GB of RAM while still having a 128K context window. That combination didn't exist before.
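To make the local/global split concrete, here is a minimal sketch of the two attention masks in plain Python. The function names and the tiny sizes are illustrative assumptions, and the exact layer pattern Gemma 4 uses isn't reproduced here; real implementations build these masks as tensors inside the attention kernel.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    window=None models a global causal layer; an integer models a local
    sliding-window layer that only sees the most recent `window` tokens.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look ahead
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the query/key pairs a layer with this mask has to score."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n = 64  # toy sequence length, far smaller than a real context
    print("global layer pairs:", attended_pairs(causal_mask(n)))
    print("local layer pairs: ", attended_pairs(causal_mask(n, window=8)))
```

The local mask grows linearly with sequence length while the global one grows quadratically, which is why mixing the two keeps long contexts affordable.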
The Four Variants, Actually Explained
Gemma 4 E2B - The Phone Model
~2.3B effective parameters, ~5.1B total with PLE, 35 layers, 128K context
This is the model you reach for when the edge is the deployment target. It runs on Android 12+ via Google AICore, on Raspberry Pi, and on Jetson devices. It supports text, image, and audio natively.
The "E" in the name stands for effective - because PLE means the model has more total parameters than it activates per forward pass, similar to how MoE works at a different level of the architecture. The practical result is a 1.5GB footprint with capabilities that land well above what a raw 2B parameter count would suggest.
Use E2B when: you're building a mobile app, an edge inference pipeline, a device-local assistant, or anything where network latency or data privacy makes sending requests to a remote API unacceptable.
Real use case: a receipt-scanning expense tracker that runs fully offline, reads image input, parses line items, and categorizes spending - all on device, no API call, no data leaving the phone.
```python
# Running E2B locally with transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E2B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Extract the total amount and vendor name from this receipt text: ...",
    }
]

# add_generation_prompt=True appends the start of the model's turn,
# so generation begins with the reply instead of continuing the user text
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
Gemma 4 E4B - The Laptop Model
~4.5B effective parameters, ~8B total, 42 layers, 128K context
This is the everyday workhorse for developers who want to run a capable model locally without dedicated GPU hardware. It runs comfortably on a MacBook with 16GB unified memory, on a mid-range laptop with an integrated GPU, and on any machine where you'd rather not spin up a cloud instance.
The jump from E2B to E4B isn't just more parameters. The additional layers and parameter budget give it noticeably better instruction following, more reliable structured output, and stronger performance on tasks that require holding context across a long conversation.
It supports the same text, image, and audio modalities as E2B, which makes it genuinely multimodal in a way that matters for developer tooling - you can feed it screenshots, diagrams, or audio clips as part of a pipeline without needing a separate vision or speech model.
Use E4B when: local inference is the requirement, your hardware doesn't have a discrete GPU, or you're prototyping something you'll later scale to a larger model and want fast iteration cycles.
Real use case: a local code review tool that takes a screenshot of your editor alongside the diff, understands both, and gives context-aware feedback - all running on your laptop, no telemetry.
```python
# Quick Ollama setup for E4B (easiest local path)
# After installing Ollama: https://ollama.com
# In terminal:
#   ollama pull gemma4:e4b
import ollama

response = ollama.chat(
    model="gemma4:e4b",
    messages=[
        {
            "role": "user",
            "content": "Review this function for edge cases and suggest improvements:",
        }
    ],
    options={
        "temperature": 0.3,
        "num_ctx": 8192,  # can go up to 128K
    },
)
print(response["message"]["content"])
```
Gemma 4 26B A4B (MoE) - The Consumer GPU Model
25.2B total parameters, ~3.8B active per forward pass, ~30 layers, 256K context
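This is the one that makes the architecture story interesting. The 26B MoE sounds like it needs 26 billion parameters worth of compute. It doesn't. Only about 4 billion parameters activate for each token, so per-token compute is closer to that of a 4B dense model. Memory is a different story: all 25.2B parameters still have to be loaded, which is roughly 50GB at BF16. On a single RTX 3090 or RTX 4090 (24GB) that means running it 4-bit quantized, and it still delivers quality that competes with much larger dense models.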
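The sparse activation described above can be sketched as a toy top-k router. This is a generic Mixture-of-Experts illustration, not Gemma 4's actual router: the number of experts, `top_k=2`, and the scalar "experts" are all assumptions made for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(token, experts, router_scores, top_k=2):
    """Weighted sum of the selected experts only; the rest never run."""
    weights = route_token(router_scores, top_k)
    return sum(w * experts[i](token) for i, w in weights.items())


if __name__ == "__main__":
    # Four toy experts; each just scales its input by a different factor.
    experts = [lambda x, k=k: x * (k + 1) for k in range(4)]
    scores = [0.0, 2.0, 1.0, -1.0]  # hypothetical router logits for one token
    print("selected experts:", route_token(scores))
    print("layer output:", moe_layer(1.0, experts, scores))
```

Every expert's weights still have to sit in memory, because the router may pick any of them for the next token. Only the compute per token is reduced.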
The jump to 256K context window is significant for developers. At 128K you can fit roughly a medium-sized codebase or a very long document. At 256K you're fitting large repositories, multi-document research contexts, or full conversation histories in customer-facing applications.
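A quick way to check which side of that line a workload falls on is a rough token budget. The sketch below assumes about 4 characters per token and treats 128K and 256K as 128 × 1024 and 256 × 1024 tokens; both are approximations, and the model's own tokenizer gives the real count. The helper names are hypothetical.

```python
import math

# Context windows from the variant specs, read as multiples of 1024 tokens (assumption).
CONTEXT_WINDOWS = {
    "E2B": 128 * 1024,
    "E4B": 128 * 1024,
    "26B A4B": 256 * 1024,
    "31B": 256 * 1024,
}


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; use the real tokenizer before committing."""
    return math.ceil(len(text) / chars_per_token)


def variants_that_fit(prompt_tokens: int, reserved_for_output: int = 2048):
    """Variants whose context holds the prompt plus room for the reply."""
    needed = prompt_tokens + reserved_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


if __name__ == "__main__":
    repo_dump = "x" * 600_000  # stand-in for a concatenated codebase
    tokens = estimate_tokens(repo_dump)
    print(tokens, "estimated tokens ->", variants_that_fit(tokens))
```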
The 26B MoE also tends to hold up well under quantization: INT4 on the 26B MoE can look better than INT4 on a comparable dense model. How much better depends on the quantization method, so measure it on your own task before relying on it.
Use 26B A4B when: you have a consumer GPU (24GB VRAM), need 256K context, and want near-flagship quality without flagship hardware costs. Also the right choice for anything agentic where the model needs to reason across large amounts of context to plan multi-step tasks.
Real use case: an agentic document processor that ingests a full legal contract (or a full codebase) in a single prompt, reasons across the entire document, and extracts structured data or answers specific questions - running locally on a 4090.
```python
# Using the Gemma 4 26B with native function calling
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "google/gemma-4-26B-A4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # 4-bit weights via bitsandbytes so the model fits on 24GB
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

# Native function calling - define your tools as JSON-schema function specs
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_contracts",
            "description": "Search the contract database by clause type or party name",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "clause_type": {
                        "type": "string",
                        "enum": ["liability", "termination", "payment", "IP"],
                        "description": "Type of clause to filter by",
                    },
                },
                "required": ["query"],
            },
        },
    }
]

messages = [
    {
        "role": "user",
        "content": "Find all termination clauses across the Q1 vendor contracts and summarize the notice periods.",
    }
]

# The chat template renders the tool specs into the prompt;
# add_generation_prompt=True starts the model's turn
inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the tool call or the answer)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
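Once the model proposes a tool call, your code still has to check it and run it. The sketch below assumes the call has already been parsed into a dict shaped like `{"name": ..., "arguments": {...}}`; the raw output format is model-specific, so check the model card for how to parse it. `search_contracts` here is a hypothetical stub, and `validate_call` and `dispatch` are illustrative helpers, not part of any library.

```python
# Schema for the one tool we expose, keyed by tool name.
TOOL_SCHEMAS = {
    "search_contracts": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "clause_type": {
                "type": "string",
                "enum": ["liability", "termination", "payment", "IP"],
            },
        },
        "required": ["query"],
    }
}


def search_contracts(query, clause_type=None):
    """Hypothetical stub standing in for a real contract database lookup."""
    return {"query": query, "clause_type": clause_type, "matches": []}


TOOL_IMPLS = {"search_contracts": search_contracts}


def validate_call(call):
    """Return a list of problems with a proposed tool call (empty means OK)."""
    schema = TOOL_SCHEMAS.get(call.get("name"))
    if schema is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    errors = []
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")
    return errors


def dispatch(call):
    """Run a validated call, or hand the errors back so the model can retry."""
    errors = validate_call(call)
    if errors:
        return {"error": errors}
    return TOOL_IMPLS[call["name"]](**call.get("arguments", {}))


if __name__ == "__main__":
    print(dispatch({
        "name": "search_contracts",
        "arguments": {"query": "notice period", "clause_type": "termination"},
    }))
```

Validating before dispatch matters more as the agent loop gets longer: a malformed call caught here becomes a message the model can correct, not an exception in your pipeline.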
Gemma 4 31B - The Server Model
31 billion dense parameters, 256K context, full multimodal, thinking mode
This is the flagship. Every capability available in the family is present here. Thinking mode (chain-of-thought reasoning) is enabled. Math benchmark scores are serious: 89.2% on AIME 2026, compared to Gemma 3 27B's 20.8% on the same benchmark. It sits at #3 on the Arena open model leaderboard.
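At FP16/BF16 the weights alone take roughly 62GB (31B parameters × 2 bytes), or roughly 16GB with INT4 quantization, before KV cache and runtime overhead. A single A100 80GB handles it comfortably at full precision. Two RTX 4090s (48GB combined) with tensor parallelism also work once the weights are quantized. This is the model you deploy to a server, not run on a laptop.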
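A weights-only estimate is a useful sanity check on any VRAM figure. The sketch below uses the total parameter counts quoted in this post and simple bytes-per-parameter arithmetic. The 80% headroom factor is an assumption to leave room for KV cache, activations, and runtime overhead, which vary with context length and serving stack.

```python
# Total parameters in billions, as quoted for each variant in this post.
PARAMS_BILLIONS = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 31.0}
BITS = {"bf16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def fits(params_billions: float, bits_per_param: int, vram_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of the card's VRAM (assumed margin)."""
    return weight_memory_gb(params_billions, bits_per_param) <= vram_gb * headroom


if __name__ == "__main__":
    for name, params in PARAMS_BILLIONS.items():
        row = ", ".join(
            f"{fmt}: {weight_memory_gb(params, bits):.1f} GB" for fmt, bits in BITS.items()
        )
        print(f"{name:8s} {row}")
```

For the MoE variant, note the asymmetry: memory follows total parameters, while per-token compute follows the active ones.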
Use 31B when: benchmark quality matters for your application, you need thinking mode for reasoning-heavy tasks, you're building a production service that will handle requests from multiple users, or you need the best math and coding performance available in an open-weight model.
Real use case: a coding assistant API that developers on your team query through a self-hosted endpoint - one 31B instance serving your whole engineering org at a cost that's a fraction of equivalent proprietary API calls.
```python
# Serving 31B with vLLM for production throughput
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31B-it",
    # Shard across 2 GPUs. BF16 weights alone are ~62GB, so 2x RTX 4090
    # (48GB total) needs a quantized checkpoint; BF16 needs larger cards.
    tensor_parallel_size=2,
    dtype="bfloat16",
    max_model_len=65536,  # 64K for production balance
)

sampling_params = SamplingParams(
    temperature=0.2,
    top_p=0.9,
    max_tokens=2048,
)

# Step-by-step prompting for complex reasoning. llm.chat() applies the
# model's own chat template, so no turn markers are hard-coded here.
messages = [
    {
        "role": "user",
        "content": "Think step by step: Given this algorithm, what's the worst-case time complexity and where is the bottleneck?\n\n[your code here]",
    }
]

outputs = llm.chat(messages, sampling_params=sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
The Decision Matrix
Here's the five-minute version:
| Situation | Model |
|---|---|
| Mobile app, Raspberry Pi, offline-first | E2B |
| Laptop development, no GPU, fast iteration | E4B |
| Consumer GPU (24GB), 256K context needed | 26B A4B MoE |
| Server deployment, best quality, team-serving | 31B |
| Agentic pipeline with many tool calls | 26B A4B MoE (active param efficiency) |
| Math, coding, or reasoning-heavy production | 31B |
| Privacy-sensitive user data, no API calls | E4B or E2B |
| You have an A100 and want the best | 31B |
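The matrix can also be written down as a small function, which is handy if you want the choice encoded in a deployment script. This is a sketch: the `Requirements` fields, the target names, and the tie-breaking order (reasoning-heavy beats agentic on a server) are my own assumptions about how to read the table, not rules from Google.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    target: str                       # "mobile", "laptop", "consumer_gpu", or "server"
    needs_256k_context: bool = False
    agentic: bool = False             # many tool calls over large context
    reasoning_heavy: bool = False     # math, coding, multi-step reasoning


def pick_variant(req: Requirements) -> str:
    """Map deployment requirements onto a Gemma 4 variant, following the table."""
    if req.target in ("mobile", "laptop"):
        if req.needs_256k_context:
            raise ValueError("edge variants top out at 128K context; pick a GPU target")
        return "E2B" if req.target == "mobile" else "E4B"
    if req.target == "consumer_gpu":
        return "26B A4B"
    if req.target == "server":
        # Assumed priority: reasoning quality first, then active-parameter efficiency.
        if req.agentic and not req.reasoning_heavy:
            return "26B A4B"
        return "31B"
    raise ValueError(f"unknown target: {req.target!r}")


if __name__ == "__main__":
    print(pick_variant(Requirements(target="laptop")))
    print(pick_variant(Requirements(target="consumer_gpu", needs_256k_context=True)))
    print(pick_variant(Requirements(target="server", reasoning_heavy=True)))
```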
The Bigger Thing Happening Here
I want to step back from the specs for a second.
A model that scores 89.2% on a serious math benchmark, supports 256K context, runs multimodal inference, and has native function calling for agentic tasks... is now open-weight, Apache 2.0, and runs on hardware that a developer can actually own.
The E4B running on a laptop with 128K context and audio support isn't a "small model compromise." It's a capability that would have been frontier-level two years ago. The E2B running on a phone offline isn't a demo trick. It's a production-viable deployment target.
What that actually means is that the architectural question of "cloud or local?" is no longer primarily a capability question. It's a cost, latency, and privacy question. And for a lot of applications - the ones where user data is sensitive, where offline availability matters, where API costs compound at scale - local wins.
Gemma 4 doesn't make that argument. It just makes it very hard to argue against.
Getting Started in Under 5 Minutes
The fastest path to running any Gemma 4 variant locally is Ollama:
```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the variant you want
ollama pull gemma4:e4b   # ~5GB, laptop-ready
ollama pull gemma4:26b   # ~15GB, GPU-ready

# Run it
ollama run gemma4:e4b

# Or use the API directly
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    { "role": "user", "content": "Hello, what can you do?" }
  ]
}'
```
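If you'd rather call that same endpoint from Python without installing a client library, a standard-library sketch looks like this. It assumes Ollama is listening on its default port and sets `"stream": false` so the reply arrives as a single JSON object; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, user_message: str, url: str = OLLAMA_URL):
    """Build the POST request for a single-turn, non-streaming chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # one JSON response, not a stream of chunks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_message: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_message)) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# Usage (needs a running Ollama server with the model pulled):
# print(chat("gemma4:e4b", "Hello, what can you do?"))
```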
If you want Python with the full transformers ecosystem (function calling, thinking mode, multimodal), the Hugging Face model cards for each variant have complete working examples. Start with google/gemma-4-E4B-it - it's the most accessible entry point and covers most development use cases.
Quick Note on Licensing
Apache 2.0 means you can use Gemma 4 commercially, modify the weights, build products on top of it, and distribute your derivative work - without paying royalties or asking permission. That is not the case for every "open" model out there, and it matters a lot for anyone building a business on top of local inference.
The right Gemma 4 variant is the one that runs where your users are, fits the hardware you can actually provision, and has enough context to do the task you're designing for. Everything else is optimization.
Start with E4B if you're unsure. Scale up when the task demands it.