Complete Guide to Google Gemma 4 — The New Open Model Benchmark Under Apache 2.0, from PLE Architecture to Ollama Local Deployment
In April 2026, Google DeepMind released Gemma 4 — a family of open-weight models built on Gemini 3 research and distributed under the Apache 2.0 license. No MAU limits, no commercial restrictions. The 31B Dense model scores 89.2% on AIME 2026 math, 80.0% on LiveCodeBench v6 coding, and 84.3% on GPQA Diamond science — competing with 400B-class proprietary models on a parameter-efficiency basis.
Four key innovations stand out. First, Per-Layer Embeddings (PLE) architecture enables the E2B edge model to achieve 5.1B-class quality with only 2.3B active parameters. Second, all models support native multimodal input (vision + audio). Third, Function Calling is trained into the model from the ground up, optimized for multi-turn agentic workflows. Fourth, up to 256K context window handles entire codebases and long documents in a single prompt.
This guide covers Gemma 4's architectural innovations (PLE, MoE, alternating attention), model specifications and benchmark comparisons, the competitive landscape against Llama 4 and Qwen 3.5, practical local deployment with Ollama/vLLM, and fine-tuning strategies for production use.
The Gemma 4 Model Family — Four Models, Three Architectures
Full Specification Overview
Gemma 4 isn't a single model but a family of four models with distinct hardware targets and architectures. All share Gemini 3's training data and techniques but employ different strategies for inference efficiency.
| Model | Total Params | Active Params | Architecture | Context | Multimodal |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B | 31B | Dense | 256K | Vision |
| Gemma 4 26B MoE | 25.2B | 3.8B | MoE (128E/8A+1S) | 256K | Vision |
| Gemma 4 E4B | ~5B | ~4B | Dense + PLE | 128K | Vision + Audio |
| Gemma 4 E2B | ~5.1B | ~2.3B | Dense + PLE | 128K | Vision + Audio |
The 26B MoE model uses 128 small experts, eight of which are active per token, plus one always-on shared expert. Despite 25.2B total parameters, only 3.8B are activated during inference — achieving 97% of the 31B Dense model's quality at roughly one-eighth the compute. This contrasts sharply with Llama 4 Scout's approach of 16 large experts.
Apache 2.0 — Why the License Matters
The license change from Google's proprietary license to Apache 2.0 is as significant as the technical innovations. This means no MAU limits, full commercial freedom, fine-tuning with redistribution, and embedding in cloud services. While Llama 4's community license restricts apps with 700M+ monthly users, Gemma 4 can be adopted without constraints from startups to hyperscalers.
Architecture Deep Dive — PLE, MoE, and Alternating Attention
Per-Layer Embeddings (PLE) — The Edge Model Innovation
PLE is the novel architecture powering Gemma 4's E2B and E4B edge models. Traditional transformers generate a token vector once at the input embedding layer, then pass it identically through all decoder layers. PLE inverts this paradigm by providing dedicated embedding vectors for each decoder layer.
Specifically, PLE adds a parallel, lower-dimensional conditioning pathway alongside the main residual stream. For each token, it combines a token-identity component (from an embedding lookup) and a context-aware component (a learned projection of the main embeddings) to generate per-layer vectors. Each decoder layer then modulates hidden states via a lightweight residual block after attention and feed-forward.
Because the PLE dimension is much smaller than the main hidden size, parameter cost is modest while layer-specific specialization improves dramatically. The E2B model achieves the representational depth of its full 5.1B parameter count with only 2.3B active parameters — enabling execution under 1.5GB RAM on mobile devices via LiteRT-LM.
```python
# PLE architecture concept (pseudocode)

class PLEDecoderLayer:
    def forward(self, hidden_states, ple_vectors):
        # 1. Standard attention + FFN (same as a vanilla transformer)
        attn_out = self.attention(hidden_states)
        ffn_out = self.feed_forward(attn_out)

        # 2. PLE conditional modulation (the novel addition)
        ple_signal = self.ple_residual_block(ple_vectors[self.layer_idx])

        # 3. Inject the PLE signal into the main stream
        output = ffn_out + ple_signal  # lightweight residual connection
        return output


class PLEEmbedding:
    def generate_per_layer_vectors(self, token_ids, main_embeddings):
        identity = self.token_embed_lookup(token_ids)       # fixed per token
        context = self.context_projection(main_embeddings)  # context-dependent
        per_layer = self.layer_projection(identity + context)
        return per_layer  # shape: [num_layers, low_dim]
```
Mixture of Experts (MoE) — The 128 Small Experts Strategy
The 26B MoE model's expert structure differs dramatically from other MoE models. While Llama 4 Scout uses 16 large experts, Gemma 4 employs 128 small experts + 1 shared expert, activating 8 per token with the shared expert always on.
| Metric | Gemma 4 26B MoE | Llama 4 Scout |
|---|---|---|
| Expert count | 128 + 1 shared | 16 |
| Active experts/token | 8 + 1 | 1 |
| Total parameters | 25.2B | 109B |
| Active parameters | 3.8B | 17B |
| Context length | 256K | 10M |
The 128 small expert strategy enables finer-grained specialization — each expert covers a narrower knowledge domain, improving routing accuracy and maintaining high quality with fewer active parameters.
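To make the routing concrete, here is a minimal top-k router sketch in NumPy: 128 routed experts, 8 of them selected per token and combined with softmax weights, plus an always-on shared expert. The hidden size and the single-matrix "experts" are illustrative stand-ins, not Gemma 4's actual shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 128   # routed experts
TOP_K = 8           # experts activated per token
D_MODEL = 64        # illustrative hidden size, not Gemma 4's real dimension

# Illustrative weights: each "expert" is a single linear map here
experts = rng.normal(size=(NUM_EXPERTS, D_MODEL, D_MODEL)) * 0.02
shared_expert = rng.normal(size=(D_MODEL, D_MODEL)) * 0.02
router = rng.normal(size=(D_MODEL, NUM_EXPERTS)) * 0.02

def moe_forward(x):
    """Route one token vector through top-k of 128 experts plus the shared expert."""
    logits = x @ router                        # [NUM_EXPERTS] router scores
    top_idx = np.argsort(logits)[-TOP_K:]      # indices of the 8 best-scoring experts
    top_logits = logits[top_idx]
    weights = np.exp(top_logits - top_logits.max())
    weights /= weights.sum()                   # softmax over the selected experts only

    out = shared_expert.T @ x                  # shared expert is always active
    for w, i in zip(weights, top_idx):
        out += w * (experts[i].T @ x)          # weighted sum of routed experts
    return out, top_idx

token = rng.normal(size=D_MODEL)
out, chosen = moe_forward(token)
print(f"activated {len(chosen) + 1} of {NUM_EXPERTS + 1} experts")
# → activated 9 of 129 experts
```

With smaller experts, the router's top-8 choice selects a narrower slice of the parameter pool per token, which is exactly what makes the 3.8B active / 25.2B total split possible.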
Alternating Attention — Balancing Efficiency and Long-Range Understanding
All Gemma 4 models use alternating attention, where decoder layers alternate between local sliding-window attention (512–1024 tokens) and global full-context attention. Sliding-window layers use standard RoPE, while global layers use Proportional RoPE to enable the 256K context window.
This design bypasses the O(n²) complexity of full attention. Most layers process only local windows quickly, while periodic global layers maintain long-range dependencies — keeping memory manageable even at 256K tokens.
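The difference between the two layer types comes down to their attention masks. Here is a small sketch using an illustrative 12-layer stack with a 5:1 local-to-global ratio and a toy window of 4 tokens (the real windows are 512–1024; the ratio here is our assumption for illustration):

```python
import numpy as np

def causal_mask(seq_len):
    """Global layer: every token attends to all previous tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len, window):
    """Local layer: each token attends only to the last `window` tokens."""
    mask = causal_mask(seq_len)
    for i in range(seq_len):
        mask[i, :max(0, i - window + 1)] = False  # drop positions outside the window
    return mask

SEQ_LEN, WINDOW = 12, 4   # toy sizes for illustration
LOCAL_PER_GLOBAL = 5      # assumed ratio: 5 sliding-window layers per global layer

masks = [
    causal_mask(SEQ_LEN)
    if (layer + 1) % (LOCAL_PER_GLOBAL + 1) == 0
    else sliding_window_mask(SEQ_LEN, WINDOW)
    for layer in range(12)
]

# Attended positions per layer: local layers grow O(window * n), global O(n^2)
local_cost = masks[0].sum()   # a sliding-window layer
global_cost = masks[5].sum()  # a full causal layer
print(local_cost, global_cost)
# → 42 78
```

Only the KV cache of global layers has to span the full sequence; sliding-window layers can evict anything older than the window, which is where most of the 256K-token memory savings come from.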
Benchmark Deep Dive — Gemma 4 vs Llama 4 vs Qwen 3.5
Core Benchmark Results
| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 4 E4B | Gemma 3 27B |
|---|---|---|---|---|
| Arena AI (text) | 1452 | 1441 | — | 1365 |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | — |
| LiveCodeBench v6 | 80.0% | 77.1% | — | — |
| GPQA Diamond (science) | 84.3% | 82.3% | — | 42.4% |
| Codeforces ELO | 2150 | — | — | 110 |
The most striking number is the Codeforces ELO jump from 110 to 2150, a rating in Codeforces' Master tier. GPQA Diamond nearly doubled from 42.4% to 84.3%, demonstrating massive improvements in graduate-level science reasoning.
Competitive Comparison
| Metric | Gemma 4 31B | Llama 4 Scout | Qwen 3.5 27B |
|---|---|---|---|
| License | Apache 2.0 | Community (700M MAU limit) | Apache 2.0 |
| AIME 2026 math | 89.2% | — | ~49% |
| MMLU Pro | 85.2% | — | 86.1% |
| Context length | 256K | 10M | — |
| Multilingual | 140+ | — | 201 |
| Native audio | E2B/E4B | No | No |
| Native function calling | All models | Yes | Yes |
Gemma 4 dominates in math/coding reasoning, Qwen 3.5 edges ahead in general knowledge (MMLU Pro) and multilingual, and Llama 4 Scout remains unmatched in ultra-long context (10M tokens).
Native Function Calling and Agent Workflows
Gemma 4's function calling isn't prompt engineering — it's built on FunctionGemma research and trained into the model, optimized for multi-turn agentic flows with multiple tools.
```python
import google.generativeai as genai

model = genai.GenerativeModel("gemma-4-31b-it")

tools = [
    genai.Tool(function_declarations=[
        genai.FunctionDeclaration(
            name="search_database",
            description="Search the user database for information",
            parameters={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Result limit"}
                },
                "required": ["query"]
            }
        ),
        genai.FunctionDeclaration(
            name="send_notification",
            description="Send a notification to a user",
            parameters={
                "type": "object",
                "properties": {
                    "user_id": {"type": "string"},
                    "message": {"type": "string"},
                    "channel": {"type": "string", "enum": ["email", "slack", "sms"]}
                },
                "required": ["user_id", "message"]
            }
        )
    ])
]

chat = model.start_chat()
response = chat.send_message(
    "Find the 5 most recently registered users and send them a welcome message on Slack",
    tools=tools
)

# Gemma 4 automatically chains:
# 1. search_database(query="recent_users", limit=5)
# 2. Parses the results, then for each user:
# 3. send_notification(user_id=..., message=..., channel="slack")
```
The synergy with MCP (Model Context Protocol) is particularly powerful — map MCP server tools to Gemma 4 Function Declarations and a locally-running agent seamlessly interacts with external services. Under Apache 2.0, you can embed such agents in commercial services without restrictions.
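One plausible way to wire this up (a hypothetical sketch, since MCP client APIs differ by SDK) is to translate each tool from an MCP `tools/list` response into a function-declaration dict. MCP already describes tool inputs in JSON Schema, so the mapping is nearly one-to-one:

```python
# Hypothetical sketch: converting MCP tool metadata into function-declaration
# dicts. The tool shape follows MCP's tools/list response fields
# (name / description / inputSchema); the example payload is illustrative.

def mcp_tool_to_declaration(mcp_tool: dict) -> dict:
    """Map one MCP tool entry to a function-declaration dict."""
    return {
        "name": mcp_tool["name"],
        "description": mcp_tool.get("description", ""),
        # MCP input schemas are JSON Schema, so they carry over directly
        "parameters": mcp_tool.get("inputSchema", {"type": "object", "properties": {}}),
    }

# Example tools/list payload as an MCP server might return it
mcp_tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "inputSchema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

declarations = [mcp_tool_to_declaration(t) for t in mcp_tools]
print(declarations[0]["name"])  # → get_weather
```

From there, the resulting declarations can be passed to the model as tools, and the tool calls the model emits are dispatched back to the MCP server.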
Local Deployment Guide — Ollama and vLLM
Getting Started with Ollama in 5 Minutes
Ollama is currently the simplest way to run Gemma 4 locally.
```bash
# 1. Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Choose and run your model
ollama run gemma4:e2b   # Edge — mobile/laptop (8GB RAM is enough)
ollama run gemma4:e4b   # Mid — desktop (16GB+ RAM recommended)
ollama run gemma4:26b   # MoE — workstation (18GB+ VRAM)
ollama run gemma4:31b   # Dense — server-class (24GB+ VRAM)

# 3. Use as an OpenAI-compatible API server
ollama serve            # http://localhost:11434

# 4. Test with curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [
      {"role": "user", "content": "Explain the difference between Kubernetes HPA and VPA"}
    ]
  }'
```
Production Serving with vLLM
For multi-user environments and production pipelines, vLLM provides continuous batching and PagedAttention for maximum throughput.
```bash
# Install vLLM (uv recommended — 10x faster than pip)
uv pip install vllm

# Single GPU (E4B)
vllm serve google/gemma-4-E4B-it \
  --host 0.0.0.0 --port 8000 --max-model-len 32768

# Multi-GPU tensor parallel (31B)
vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 --host 0.0.0.0 --port 8000
```
⚠️ Note: As of April 2026, a known vLLM bug drops Gemma 4 to ~9 tok/s on RTX 4090, while Ollama achieves 40–60 tok/s on the same hardware. Ollama is recommended for single-user setups; use vLLM only when multi-user serving is required.
Hardware Requirements
| Model | Min VRAM/RAM | Recommended Hardware | Expected Speed |
|---|---|---|---|
| E2B | 1.5GB RAM | Smartphone, Raspberry Pi | Mobile-optimized |
| E4B | 8GB VRAM | Laptop, Apple Silicon | 40-60 tok/s (Ollama) |
| 26B MoE | 18GB+ VRAM | RTX 4090, A6000 | 30-50 tok/s (Ollama) |
| 31B Dense | 24GB+ VRAM | RTX 4090, A100, H100 | 20-35 tok/s (Ollama) |
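These VRAM figures follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per weight, before KV cache and activation overhead. A quick estimator (the formula is the standard back-of-the-envelope calculation, not a vendor-published one):

```python
def approx_weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate model weight memory in GB (ignores KV cache and activations)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 31B dense at common precisions
for bits, name in [(16, "fp16/bf16"), (8, "q8"), (4, "q4")]:
    print(f"31B @ {name}: ~{approx_weight_gb(31, bits):.0f} GB")
```

At 4-bit quantization the 31B weights alone come to about 16 GB, which is consistent with the table's 24GB+ recommendation once KV cache and runtime overhead are added on top.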
Kubernetes Deployment Architecture
For production Gemma 4 deployment on Kubernetes, the vLLM + OpenAI-compatible API + LiteLLM router combination is effective.
```yaml
# gemma4-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma4-moe-serving
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gemma4-moe
  template:
    metadata:
      labels:
        app: gemma4-moe
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=google/gemma-4-26B-MoE-it
        - --max-model-len=65536
        - --gpu-memory-utilization=0.9
        - --enable-chunked-prefill
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "40Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
      nodeSelector:
        gpu-type: a100
```
Fine-Tuning and Customization
Apache 2.0 permits free redistribution of fine-tuned models. With LoRA/QLoRA, you can create domain-specific models on consumer GPUs.
```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer

# Load with 4-bit quantization (saves VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-E4B-it",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA — works on an RTX 3090
)

# Add a LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
)

# Run supervised fine-tuning (your_dataset: an HF Dataset with a "text" column)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=your_dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        output_dir="gemma4-finetuned",
    ),
)
trainer.train()

# Convert to GGUF for Ollama
model.save_pretrained_gguf("gemma4-custom", tokenizer, quantization_method="q4_k_m")
```
Practical Recommendations
Gemma 4 sets a new benchmark for open models. Here are key production recommendations:
Quick start: Run E4B locally with Ollama for code review, document analysis, and architecture queries. 8GB VRAM is sufficient, and the OpenAI-compatible API integrates with existing toolchains instantly.
Cost optimization: The 26B MoE model delivers 31B-class quality with 3.8B active parameters — 8x compute savings with only 3% quality trade-off. Ideal for RAG pipelines and agent backends.
Edge AI strategy: E2B's PLE architecture runs under 1.5GB RAM, making it a game-changer for embedding AI in mobile apps or IoT devices. LiteRT-LM integration simplifies native Android/iOS deployment.
Caution: Monitor the vLLM performance issue and prefer Ollama in single-user environments. When using the 256K context window, watch memory consumption and consider capping it via --max-model-len.
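The 256K-context caution can be made concrete with a naive KV cache estimate. The layer and head counts below are assumed for illustration, not published Gemma 4 values, and the result is an upper bound, since sliding-window layers cache at most a window's worth of tokens:

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Naive KV cache size for one sequence: K and V per layer per token, fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Illustrative GQA dims for a ~31B-class model (assumed, not official)
full = kv_cache_gb(seq_len=256_000, n_layers=48, n_kv_heads=8, head_dim=128)
capped = kv_cache_gb(seq_len=32_768, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"256K context: ~{full:.1f} GB KV cache; 32K cap: ~{capped:.1f} GB")
# → 256K context: ~50.3 GB KV cache; 32K cap: ~6.4 GB
```

Alternating attention shrinks the real figure considerably, but --max-model-len remains the simplest guardrail when serving on a single GPU.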
The 2026 open model landscape has shifted decisively. Gemma 4's combination of "Apache 2.0 + PLE + native multimodal + Function Calling" provides a realistic alternative for reducing proprietary model dependency in enterprise AI infrastructure.
This article was written with the assistance of AI (Claude Opus 4.6). Technical facts were verified against official documentation and benchmark data. For the latest information, refer to the official Google Gemma documentation, DeepMind Gemma 4 page, and HuggingFace Gemma 4 blog.
Originally published at ManoIT Tech Blog.