TL;DR
Gemma 4 launched on April 2, 2026, with Ollama v0.20.0 adding support within 24 hours. You can pull and run the default `gemma4:e4b` model with a single command. This tutorial shows you how to set up Ollama, select a model variant, use the local API, and test your Gemma 4 endpoints using Apidog.
Introduction
Google released Gemma 4 on April 2, 2026. Ollama v0.20.0 shipped within 24 hours, supporting all four model variants.
Why should developers care? Gemma 4 is a significant upgrade: 89.2% on AIME 2026 (vs. Gemma 3's 20.8%) and a jump to a 2150 Elo rating on Codeforces for coding. It features native function calling, configurable thinking modes, and a 256K context window on larger variants—all running locally.
For API-powered app development, local setup means you get a fast, private AI layer. Use it for generating mock data, writing test scenarios, and validating API responses—no cloud dependency.
💡 Once Gemma 4 runs locally, Apidog's Smart Mock can generate realistic API response data from your schema using AI-backed inference. Define your API shape once; Apidog handles the mock data—ideal for consistent, schema-compliant test data in local experiments.
This guide covers installation, running models, using the API, and testing endpoints.
What's new in Gemma 4
Gemma 4 ships in four model variants, detailed in the next section.
Key improvements:
- Reasoning and coding: 31B model scores 80% on LiveCodeBench v6 (Gemma 3 27B: 29.1%).
- Mixture-of-Experts (MoE): 26B uses MoE (4B active params), giving high quality at lower compute.
- Longer context: E2B/E4B support 128K tokens; 26B/31B support 256K—enough for large codebases or specs.
- Native function calling: All models accept function schemas and return valid JSON—no prompt tricks.
- Audio and image input: E2B/E4B accept audio and images.
- Thinking modes: Enable/disable chain-of-thought per request as needed.
Gemma 4 model variants explained
Choose a model based on your hardware:
| Model | Size on disk | Context | Architecture | Best for |
|---|---|---|---|---|
| `gemma4:e2b` | 7.2 GB | 128K | Dense | Laptops, edge, audio/image |
| `gemma4:e4b` | 9.6 GB | 128K | Dense | Most developers |
| `gemma4:26b` | 18 GB | 256K | MoE (4B active) | Best quality per GB |
| `gemma4:31b` | 20 GB | 256K | Dense | Max quality |

- The `e4b` model is the default (`ollama run gemma4`). It fits most GPUs (10+ GB VRAM) and Apple Silicon.
- `26b` is MoE: only 4B parameters are active per token, giving fast inference with near-flagship quality on machines with 20+ GB of RAM.
Prerequisites
- Ollama v0.20.0 or later is required.
Check your version:

```shell
ollama --version
```

Upgrade if needed:

```shell
# macOS
brew upgrade ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh
```
For Windows, download the latest from ollama.com.
Hardware requirements:

- `gemma4:e2b`: 8 GB RAM minimum (16 GB recommended)
- `gemma4:e4b`: 10 GB VRAM or 16 GB unified memory
- `gemma4:26b`: 20+ GB RAM or unified memory
- `gemma4:31b`: 24 GB VRAM or 32 GB unified memory
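As a rough sizing heuristic, those requirements can be folded into a small helper that suggests a variant for the memory you have available. The thresholds mirror the list above; `pick_gemma4_variant` is an illustrative name, not part of Ollama:

```python
def pick_gemma4_variant(memory_gb: float) -> str:
    """Suggest a Gemma 4 variant for the available VRAM/unified memory in GB.

    Thresholds follow the hardware requirements listed above; treat this
    as a heuristic sketch, not an official sizing guide.
    """
    if memory_gb >= 24:
        return "gemma4:31b"  # max quality: 24 GB VRAM / 32 GB unified
    if memory_gb >= 20:
        return "gemma4:26b"  # MoE: best quality per GB
    if memory_gb >= 10:
        return "gemma4:e4b"  # default model: fits most GPUs
    return "gemma4:e2b"      # edge model: 8 GB RAM minimum

print(pick_gemma4_variant(16))  # a 16 GB MacBook -> gemma4:e4b
```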
Installing and running Gemma 4
Pull and run the default e4b model:

```shell
ollama run gemma4
```

This downloads ~9.6 GB and starts an interactive session. Try it:

```
>>> What are the HTTP status codes for client errors?
```

Run specific variants:

```shell
# Edge model, smallest
ollama run gemma4:e2b

# MoE for quality/size
ollama run gemma4:26b

# Full flagship
ollama run gemma4:31b
```

Pull without running:

```shell
ollama pull gemma4
ollama pull gemma4:26b
```

List installed models:

```shell
ollama list
```
Using the Gemma 4 API locally
Ollama exposes a REST API at `http://localhost:11434`.
Generate a completion
```shell
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "prompt": "Write a JSON response for a user profile API endpoint",
    "stream": false
  }'
```
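When `stream` is left at its default of `true`, the generate endpoint instead returns newline-delimited JSON chunks, each carrying a fragment of the reply in its `response` field, with a final `"done": true` marker. A minimal sketch of reassembling the full text from such a stream (the sample lines below are illustrative, not captured output):

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the 'response' fragments from Ollama's streaming output.

    Each line is a JSON object like {"response": "...", "done": false};
    the final line carries "done": true.
    """
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Simulated stream, shaped like what curl would print line by line:
sample = [
    '{"response": "200 OK", "done": false}',
    '{"response": " means success.", "done": true}',
]
print(join_stream(sample))  # 200 OK means success.
```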
Chat completion (OpenAI-compatible)
```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [
      {
        "role": "user",
        "content": "Generate a realistic JSON mock for an e-commerce order API response"
      }
    ]
  }'
```
Python client
```python
import requests

def ask_gemma4(prompt: str, model: str = "gemma4") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    response.raise_for_status()
    return response.json()["response"]

result = ask_gemma4("List the fields a payment API response should include")
print(result)
```
Using the OpenAI Python SDK
Ollama's API supports the OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the SDK, unused by Ollama
)

response = client.chat.completions.create(
    model="gemma4",
    messages=[
        {
            "role": "system",
            "content": "You generate realistic API response data in JSON format."
        },
        {
            "role": "user",
            "content": "Generate a sample response for a GET /users/{id} endpoint"
        }
    ]
)

print(response.choices[0].message.content)
```
Using function calling with Gemma 4
Gemma 4 supports native function calling—define a tool schema, get structured JSON matching your function signature.
Example:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_user",
            "description": "Retrieve a user by ID from the API",
            "parameters": {
                "type": "object",
                "properties": {
                    "user_id": {
                        "type": "integer",
                        "description": "The unique user ID"
                    },
                    "include_orders": {
                        "type": "boolean",
                        "description": "Whether to include order history"
                    }
                },
                "required": ["user_id"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gemma4",
    messages=[
        {"role": "user", "content": "Get user 42 with their order history"}
    ],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # get_user
print(tool_call.function.arguments)  # {"user_id": 42, "include_orders": true}
```
The model extracts parameters from natural language, returning valid JSON—no post-processing needed.
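Note that `function.arguments` arrives as a JSON string, and the call itself still has to be executed by your own code. A minimal dispatch sketch, where `get_user` is a hypothetical stand-in for a real API call:

```python
import json

def get_user(user_id: int, include_orders: bool = False) -> dict:
    # Hypothetical stand-in for a real API call.
    return {"id": user_id, "orders": [] if include_orders else None}

# Map tool names (as declared in the schema) to local functions.
TOOLS = {"get_user": get_user}

def dispatch(tool_name: str, arguments_json: str):
    """Parse the model's JSON argument string and call the matching function."""
    args = json.loads(arguments_json)
    return TOOLS[tool_name](**args)

# The argument string a tool call for the prompt above might carry:
result = dispatch("get_user", '{"user_id": 42, "include_orders": true}')
print(result)  # {'id': 42, 'orders': []}
```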
Enabling thinking mode
For complex tasks (e.g., writing test scenarios, analyzing API specs), enable chain-of-thought reasoning:
```python
response = client.chat.completions.create(
    model="gemma4",
    messages=[
        {
            "role": "user",
            "content": "Design a complete test scenario for a payment processing API with edge cases"
        }
    ],
    extra_body={"think": True}
)

print(response.choices[0].message.content)
```
Skip thinking mode for simple requests to reduce latency.
Testing Gemma 4 API responses with Apidog
With Gemma 4 running locally, use Apidog to test endpoints efficiently.
Steps:

1. Import the Ollama API spec: In Apidog, create a new project and set the base URL to `http://localhost:11434`.
2. Define endpoints: Add:
   - `POST /api/generate` (single-turn completions)
   - `POST /v1/chat/completions` (multi-turn chat)
   - `GET /api/tags` (list models)
3. Set up a Test Scenario: Chain requests with assertions:
   - Step 1: `GET /api/tags` — assert `gemma4` is listed.
   - Step 2: `POST /api/generate` — assert the `response` field is non-empty.
   - Step 3: `POST /v1/chat/completions` — assert the reply format.
   - Use Apidog's Extract Variable processor to pass responses between steps for multi-turn flow testing.
4. Validate schemas: Apidog Contract Testing validates API responses against your OpenAPI spec. Define expected response shapes and run contract tests after model updates.
5. Parallel development with Smart Mock: Apidog's Smart Mock generates schema-compliant responses from your API spec, letting frontend teams work without waiting for the local model.
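If you also want the same checks outside Apidog (in CI, for instance), the scenario's assertions can be sketched as plain functions over the endpoints' JSON shapes, exercised here against sample payloads rather than a live server:

```python
def model_is_listed(tags_json: dict, name: str = "gemma4") -> bool:
    """Step 1: GET /api/tags should list the model (names include a tag suffix)."""
    return any(m["name"].startswith(name) for m in tags_json.get("models", []))

def response_is_nonempty(gen_json: dict) -> bool:
    """Step 2: POST /api/generate should return a non-empty 'response' field."""
    return bool(gen_json.get("response", "").strip())

# Sample payloads shaped like Ollama's responses (not captured output):
tags = {"models": [{"name": "gemma4:e4b"}, {"name": "llama3:8b"}]}
gen = {"model": "gemma4", "response": '{"id": 1}', "done": True}

assert model_is_listed(tags)
assert response_is_nonempty(gen)
```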
Multimodal input with Gemma 4
E2B and E4B models accept images alongside text. Send images as base64-encoded strings:
```python
import base64

with open("api_diagram.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma4:e4b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the API flow shown in this diagram and identify potential error paths"
                }
            ]
        }
    ]
)
```
Use this to analyze architecture diagrams and screenshots, or to extract information from images that your API work depends on.
Common issues and fixes
- Model not found: Run `ollama pull gemma4` or verify with `ollama list`.
- Slow inference on CPU: Use `gemma4:e2b` for better performance.
- Out of memory: Check VRAM/unified memory with `ollama ps` and switch to a smaller model if needed.
- Apple Silicon issues: Update Ollama (v0.20.0+ adds MLX support).
- Port in use: Run `OLLAMA_HOST=0.0.0.0:11435 ollama serve` to use a different port.
- Cut-off responses: Increase the context window by adding `"options": {"num_ctx": 8192}` to your request body.
Gemma 4 vs other local models
| Model | Best size for most users | Context | Function calling | Coding benchmark |
|---|---|---|---|---|
| Gemma 4 | e4b (9.6 GB) | 128K-256K | Native | 80% LiveCodeBench |
| Llama 3.3 | 70B-Q4 (40 GB) | 128K | Native | ~60% LiveCodeBench |
| Qwen3.6-Plus | 72B-Q4 (44 GB) | 128K | Native | Strong |
| Mistral Small | 24B (14 GB) | 128K | Native | Moderate |
Gemma 4's MoE 26B (18 GB) delivers near-flagship quality with better tokens/sec than larger dense models.
- For coding, the 31B model is competitive with much larger models.
- For laptops and edge devices, `e2b` runs in under 8 GB.
Conclusion
Gemma 4 with Ollama is a powerful local AI setup. Installation is fast, the default model fits most developer machines, and the improvements over Gemma 3 are substantial.
Start with:
```shell
ollama run gemma4
```
Test the API using Apidog to validate endpoints, then select the right model variant for your needs.
For API-driven development, combining local inference with Apidog's Smart Mock and Test Scenarios delivers a complete, cloud-free workflow.
FAQ
How do I update Gemma 4 in Ollama when a new version comes out?
Run `ollama pull gemma4` to fetch the latest version.
Can I run Gemma 4 on a machine without a GPU?
Yes, but it's slow (1–3 tokens/sec). `gemma4:e2b` is the best choice for CPU-only machines.
What's the difference between gemma4:e2b and gemma4:e4b?
Both are dense multimodal models that accept audio and image input. E4B has more parameters and stronger reasoning; E2B is smaller and lighter on memory. For text work, `e4b` is the better default.
Does Gemma 4 work with LangChain and LlamaIndex?
Yes. Point the provider to `http://localhost:11434` and use `gemma4` as the model name.
Is the local Gemma 4 API compatible with OpenAI code?
Mostly yes. Ollama's `/v1/chat/completions` endpoint matches the OpenAI format. Set `base_url` to `http://localhost:11434/v1` and use any `api_key`.
How do I use Gemma 4's thinking mode?
Add `"think": true` via `extra_body` (OpenAI SDK) or at the top level of the JSON body in direct API calls.
Can I serve Gemma 4 to other machines on my network?
Yes. Start Ollama with `OLLAMA_HOST=0.0.0.0:11434 ollama serve` and use your machine's IP address.
What's the best Gemma 4 model for API development?
For mock data and tests, `e4b` balances speed and quality. For complex analysis, the `26b` MoE offers better results at lower resource cost.


