TL;DR
Gemma 4 launched on April 2, 2026, with Ollama v0.20.0 adding support within 24 hours. You can pull and run the default `gemma4:e4b` model with a single command. This tutorial shows you how to set up Ollama, select a model variant, use the local API, and test your Gemma 4 endpoints using Apidog.
Introduction
Google released Gemma 4 on April 2, 2026. Ollama v0.20.0 shipped within 24 hours, supporting all four model variants.
Why should developers care? Gemma 4 is a significant upgrade: 89.2% on AIME 2026 (vs. Gemma 3's 20.8%) and a jump to a 2150 Elo rating on Codeforces for coding. It features native function calling, configurable thinking modes, and a 256K context window on larger variants—all running locally.
For API-powered app development, local setup means you get a fast, private AI layer. Use it for generating mock data, writing test scenarios, and validating API responses—no cloud dependency.
💡 Once Gemma 4 runs locally, Apidog's Smart Mock can generate realistic API response data from your schema using AI-backed inference. Define your API shape once; Apidog handles the mock data—ideal for consistent, schema-compliant test data in local experiments.
This guide covers installation, running models, using the API, and testing endpoints.
What's new in Gemma 4
Gemma 4 ships in four model variants, detailed in the next section.
Key improvements:
- Reasoning and coding: 31B model scores 80% on LiveCodeBench v6 (Gemma 3 27B: 29.1%).
- Mixture-of-Experts (MoE): 26B uses MoE (4B active params), giving high quality at lower compute.
- Longer context: E2B/E4B support 128K tokens; 26B/31B support 256K—enough for large codebases or specs.
- Native function calling: All models accept function schemas and return valid JSON—no prompt tricks.
- Audio and image input: E2B/E4B accept audio and images.
- Thinking modes: Enable/disable chain-of-thought per request as needed.
Gemma 4 model variants explained
Choose a model based on your hardware:
| Model | Size on disk | Context | Architecture | Best for |
|---|---|---|---|---|
| `gemma4:e2b` | 7.2 GB | 128K | Dense | Laptops, edge, audio/image |
| `gemma4:e4b` | 9.6 GB | 128K | Dense | Most developers |
| `gemma4:26b` | 18 GB | 256K | MoE (4B active) | Best quality per GB |
| `gemma4:31b` | 20 GB | 256K | Dense | Max quality |

- The `e4b` model is the default (`ollama run gemma4`). It fits most GPUs (10+ GB VRAM) and Apple Silicon.
- `26b` is MoE: only 4B parameters are active per token, giving fast inference with near-flagship quality on machines with 20+ GB of RAM.
Prerequisites
- Ollama v0.20.0 or later is required.
Check your version:

```shell
ollama --version
```

Upgrade if needed:

```shell
# macOS
brew upgrade ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh
```
For Windows, download the latest from ollama.com.
Hardware requirements:

- `gemma4:e2b`: 8 GB RAM minimum (16 GB recommended)
- `gemma4:e4b`: 10 GB VRAM or 16 GB unified memory
- `gemma4:26b`: 20+ GB RAM or unified memory
- `gemma4:31b`: 24 GB VRAM or 32 GB unified memory
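As a rough sizing heuristic, those requirements can be folded into a small helper that suggests a variant for the memory you have available. The thresholds mirror the list above; `pick_gemma4_variant` is an illustrative name, not part of Ollama:

```python
def pick_gemma4_variant(memory_gb: float) -> str:
    """Suggest a Gemma 4 variant for the available VRAM/unified memory in GB.

    Thresholds follow the hardware requirements listed above; treat this
    as a heuristic sketch, not an official sizing guide.
    """
    if memory_gb >= 24:
        return "gemma4:31b"  # max quality: 24 GB VRAM / 32 GB unified
    if memory_gb >= 20:
        return "gemma4:26b"  # MoE: best quality per GB
    if memory_gb >= 10:
        return "gemma4:e4b"  # default model: fits most GPUs
    return "gemma4:e2b"      # edge model: 8 GB RAM minimum

print(pick_gemma4_variant(16))  # a 16 GB MacBook -> gemma4:e4b
```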
Installing and running Gemma 4
Pull and run the default e4b model:

```shell
ollama run gemma4
```

This downloads ~9.6 GB and starts an interactive session. Try it:

```
>>> What are the HTTP status codes for client errors?
```

Run specific variants:

```shell
# Edge model, smallest
ollama run gemma4:e2b

# MoE for quality/size
ollama run gemma4:26b

# Full flagship
ollama run gemma4:31b
```

Pull without running:

```shell
ollama pull gemma4
ollama pull gemma4:26b
```

List installed models:

```shell
ollama list
```
Using the Gemma 4 API locally
Ollama exposes a REST API at `http://localhost:11434`.
Generate a completion
```shell
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "prompt": "Write a JSON response for a user profile API endpoint",
    "stream": false
  }'
```
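When `stream` is left at its default of `true`, the generate endpoint instead returns newline-delimited JSON chunks, each carrying a fragment of the reply in its `response` field, with a final `"done": true` marker. A minimal sketch of reassembling the full text from such a stream (the sample lines below are illustrative, not captured output):

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the 'response' fragments from Ollama's streaming output.

    Each line is a JSON object like {"response": "...", "done": false};
    the final line carries "done": true.
    """
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Simulated stream, shaped like what curl would print line by line:
sample = [
    '{"response": "200 OK", "done": false}',
    '{"response": " means success.", "done": true}',
]
print(join_stream(sample))  # 200 OK means success.
```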
Chat completion (OpenAI-compatible)
```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [
      {
        "role": "user",
        "content": "Generate a realistic JSON mock for an e-commerce order API response"
      }
    ]
  }'
```
Python client
```python
import requests

def ask_gemma4(prompt: str, model: str = "gemma4") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    response.raise_for_status()
    return response.json()["response"]

result = ask_gemma4("List the fields a payment API response should include")
print(result)
```
Using the OpenAI Python SDK
Ollama's API supports the OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the SDK, unused by Ollama
)

response = client.chat.completions.create(
    model="gemma4",
    messages=[
        {
            "role": "system",
            "content": "You generate realistic API response data in JSON format."
        },
        {
            "role": "user",
            "content": "Generate a sample response for a GET /users/{id} endpoint"
        }
    ]
)

print(response.choices[0].message.content)
```
Using function calling with Gemma 4
Gemma 4 supports native function calling—define a tool schema, get structured JSON matching your function signature.
Example:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_user",
            "description": "Retrieve a user by ID from the API",
            "parameters": {
                "type": "object",
                "properties": {
                    "user_id": {
                        "type": "integer",
                        "description": "The unique user ID"
                    },
                    "include_orders": {
                        "type": "boolean",
                        "description": "Whether to include order history"
                    }
                },
                "required": ["user_id"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gemma4",
    messages=[
        {"role": "user", "content": "Get user 42 with their order history"}
    ],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # get_user
print(tool_call.function.arguments)  # {"user_id": 42, "include_orders": true}
```
The model extracts parameters from natural language, returning valid JSON—no post-processing needed.
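Note that `function.arguments` arrives as a JSON string, and the call itself still has to be executed by your own code. A minimal dispatch sketch, where `get_user` is a hypothetical stand-in for a real API call:

```python
import json

def get_user(user_id: int, include_orders: bool = False) -> dict:
    # Hypothetical stand-in for a real API call.
    return {"id": user_id, "orders": [] if include_orders else None}

# Map tool names (as declared in the schema) to local functions.
TOOLS = {"get_user": get_user}

def dispatch(tool_name: str, arguments_json: str):
    """Parse the model's JSON argument string and call the matching function."""
    args = json.loads(arguments_json)
    return TOOLS[tool_name](**args)

# The argument string a tool call for the prompt above might carry:
result = dispatch("get_user", '{"user_id": 42, "include_orders": true}')
print(result)  # {'id': 42, 'orders': []}
```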
Enabling thinking mode
For complex tasks (e.g., writing test scenarios, analyzing API specs), enable chain-of-thought reasoning:
```python
response = client.chat.completions.create(
    model="gemma4",
    messages=[
        {
            "role": "user",
            "content": "Design a complete test scenario for a payment processing API with edge cases"
        }
    ],
    extra_body={"think": True}
)

print(response.choices[0].message.content)
```
Skip thinking mode for simple requests to reduce latency.
Testing Gemma 4 API responses with Apidog
With Gemma 4 running locally, use Apidog to test endpoints efficiently.
Steps:

1. Import the Ollama API spec: In Apidog, create a new project and set the base URL to `http://localhost:11434`.
2. Define endpoints: Add:
   - `POST /api/generate` (single-turn completions)
   - `POST /v1/chat/completions` (multi-turn chat)
   - `GET /api/tags` (list models)
3. Set up a Test Scenario: Chain requests with assertions:
   - Step 1: `GET /api/tags` — assert `gemma4` is listed.
   - Step 2: `POST /api/generate` — assert the `response` field is non-empty.
   - Step 3: `POST /v1/chat/completions` — assert the reply format.
   - Use Apidog's Extract Variable processor to pass responses between steps for multi-turn flow testing.
4. Validate schemas: Apidog Contract Testing validates API responses against your OpenAPI spec. Define expected response shapes and run contract tests after model updates.
5. Parallel development with Smart Mock: Apidog's Smart Mock generates schema-compliant responses from your API spec, letting frontend teams work without waiting for the local model.
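If you also want the same checks outside Apidog (in CI, for instance), the scenario's assertions can be sketched as plain functions over the endpoints' JSON shapes, exercised here against sample payloads rather than a live server:

```python
def model_is_listed(tags_json: dict, name: str = "gemma4") -> bool:
    """Step 1: GET /api/tags should list the model (names include a tag suffix)."""
    return any(m["name"].startswith(name) for m in tags_json.get("models", []))

def response_is_nonempty(gen_json: dict) -> bool:
    """Step 2: POST /api/generate should return a non-empty 'response' field."""
    return bool(gen_json.get("response", "").strip())

# Sample payloads shaped like Ollama's responses (not captured output):
tags = {"models": [{"name": "gemma4:e4b"}, {"name": "llama3:8b"}]}
gen = {"model": "gemma4", "response": '{"id": 1}', "done": True}

assert model_is_listed(tags)
assert response_is_nonempty(gen)
```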
Multimodal input with Gemma 4
E2B and E4B models accept images alongside text. Send images as base64-encoded strings:
```python
import base64

with open("api_diagram.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma4:e4b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the API flow shown in this diagram and identify potential error paths"
                }
            ]
        }
    ]
)
```
Use this to analyze architecture diagrams and screenshots, or to extract information from images that your API work depends on.
Common issues and fixes
- Model not found: Run `ollama pull gemma4` or verify with `ollama list`.
- Slow inference on CPU: Use `gemma4:e2b` for better performance.
- Out of memory: Check VRAM/unified memory with `ollama ps` and switch to a smaller model if needed.
- Apple Silicon issues: Update Ollama (v0.20.0+ adds MLX support).
- Port in use: Run `OLLAMA_HOST=0.0.0.0:11435 ollama serve` to use a different port.
- Cut-off responses: Increase the context window by adding `"options": {"num_ctx": 8192}` to your request body.
Gemma 4 vs other local models
| Model | Best size for most users | Context | Function calling | Coding benchmark |
|---|---|---|---|---|
| Gemma 4 | e4b (9.6 GB) | 128K-256K | Native | 80% LiveCodeBench |
| Llama 3.3 | 70B-Q4 (40 GB) | 128K | Native | ~60% LiveCodeBench |
| Qwen3.6-Plus | 72B-Q4 (44 GB) | 128K | Native | Strong |
| Mistral Small | 24B (14 GB) | 128K | Native | Moderate |
Gemma 4's MoE 26B (18 GB) delivers near-flagship quality with better tokens/sec than larger dense models.
- For coding, the 31B model is competitive with much larger models.
- For laptops and edge devices, `e2b` runs in under 8 GB.
Conclusion
Gemma 4 with Ollama is a powerful local AI setup. Installation is fast, the default model fits most developer machines, and the improvements over Gemma 3 are substantial.
Start with:
```shell
ollama run gemma4
```
Test the API using Apidog to validate endpoints, then select the right model variant for your needs.
For API-driven development, combining local inference with Apidog's Smart Mock and Test Scenarios delivers a complete, cloud-free workflow.
FAQ
How do I update Gemma 4 in Ollama when a new version comes out?
Run `ollama pull gemma4` to fetch the latest version.
Can I run Gemma 4 on a machine without a GPU?
Yes, but it's slow (1–3 tokens/sec). `gemma4:e2b` is the best choice for CPU-only machines.
What's the difference between gemma4:e2b and gemma4:e4b?
Both are dense multimodal models that accept audio and image input. E4B has more parameters and stronger reasoning; E2B is smaller and lighter on memory. For text work, `e4b` is the better default.
Does Gemma 4 work with LangChain and LlamaIndex?
Yes. Point the provider to `http://localhost:11434` and use `gemma4` as the model name.
Is the local Gemma 4 API compatible with OpenAI code?
Mostly yes. Ollama's `/v1/chat/completions` endpoint matches the OpenAI format. Set `base_url` to `http://localhost:11434/v1` and use any `api_key`.
How do I use Gemma 4's thinking mode?
Add `"think": true` via `extra_body` (OpenAI SDK) or at the top level of the JSON body in direct API calls.
Can I serve Gemma 4 to other machines on my network?
Yes. Start Ollama with `OLLAMA_HOST=0.0.0.0:11434 ollama serve` and use your machine's IP address.
What's the best Gemma 4 model for API development?
For mock data and tests, `e4b` balances speed and quality. For complex analysis, the `26b` MoE offers better results at lower resource cost.


