Gemma 4 Local Setup Guide 2026 — Run Google's Best Open Model with Ollama + Open WebUI
Google DeepMind released Gemma 4 on April 2, 2026. Within 48 hours, the models had crossed 207,000 pulls on Ollama, hit the front page of Hacker News, and Ollama shipped v0.20.0 with same-day support for all four model variants (source).
The hype is justified. Gemma 4 is built from the same research behind Gemini 3, released under a fully permissive Apache 2.0 license, and the 31B instruction-tuned model ranks #3 on Arena AI's text leaderboard at 1452 Elo — outperforming models twenty times its size (source).
But what actually makes Gemma 4 different from yet another open model release is the range. Four model sizes span from a 2B-parameter edge model that runs on a Raspberry Pi to a 31B dense model that competes with frontier APIs. Every size handles text and images natively. The smaller models even process audio. And the Apache 2.0 license means you can use them commercially without restrictions.
This guide covers everything: choosing the right model size for your hardware, setting up Ollama, adding a browser-based chat interface with Open WebUI, and deploying larger models on a Hetzner GPU server when your laptop is not enough. No fluff, no placeholder commands — every step has been verified.
What Is Gemma 4 and Why It Matters
Gemma 4 is Google DeepMind's fourth-generation family of open-weight language models. "Open-weight" means you get the full model weights to run locally — not just API access. The models are derived from the same research and training pipeline as Gemini 3, Google's flagship commercial model (source).
Key Features
- Apache 2.0 license. Full commercial use. No usage restrictions, no registration required, no "open but not really" clauses.
- Multimodal by default. All four model sizes handle text and image input. The E2B and E4B models also support audio input.
- Up to 256K context window. The 26B and 31B models support 256K tokens. The E2B and E4B models support 128K tokens.
- Native function calling. Built-in tool use support for agentic workflows.
- Configurable thinking modes. Control whether the model shows its reasoning chain or responds directly.
- 140+ language support. Broad multilingual fluency across all model sizes.
The Four Model Sizes
Gemma 4 ships in four variants, each targeting different hardware and use cases:
| Model | Parameters | Active Parameters | Architecture | Context Window | Download Size (Ollama) |
|---|---|---|---|---|---|
| E2B | ~2.3B effective | 2B | Dense (edge-optimized) | 128K | ~7.2 GB |
| E4B | ~4.5B effective | 4B | Dense (edge-optimized) | 128K | ~9.6 GB |
| 26B A4B | 26B total | 3.8B active | Mixture of Experts (128 experts) | 256K | ~18 GB |
| 31B | 31B | 31B (all active) | Dense | 256K | ~20 GB |
The "E" in E2B and E4B stands for "effective" — these models are optimized to activate only their effective parameter count during inference, preserving RAM and battery life on edge devices. The 26B model uses a Mixture of Experts architecture where only 3.8 billion parameters activate per token, making inference speed comparable to a 4B model while quality approaches a much larger one.
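As a sanity check on the download sizes in the table: a 4-bit quantized model stores roughly half a byte per parameter, so you can estimate sizes with one line of arithmetic (a rule-of-thumb estimate, not Ollama's actual packing format):

```shell
# Rule of thumb: a 4-bit (Q4) quantized model needs ~0.5 bytes per parameter,
# plus overhead for embeddings, the tokenizer, and higher-precision layers.
awk 'BEGIN { printf "%.1f GB\n", 31e9 * 0.5 / 1e9 }'
# → 15.5 GB
```

The ~20 GB Ollama download for the 31B model is consistent with this estimate once mixed-precision layers and metadata are added on top.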
Hardware Requirements — What You Actually Need
This is the section most guides get wrong. They list theoretical minimums without telling you what the experience is actually like. Here is what we found:
Minimum and Recommended Hardware
| Model | Minimum RAM/VRAM | Recommended | CPU-Only Viable? | Speed Expectation |
|---|---|---|---|---|
| E2B | 4 GB RAM | 8 GB RAM | Yes | 5-15 tok/s on CPU, 30+ on GPU |
| E4B | 6 GB VRAM / 8 GB RAM | 10 GB VRAM / 16 GB RAM | Usable but slow | 3-10 tok/s on CPU, 25+ on GPU |
| 26B A4B | 8 GB VRAM / 16 GB RAM | 12+ GB VRAM / 24 GB RAM | Very slow | 1-3 tok/s on CPU, 15-25 on GPU |
| 31B | 20 GB VRAM / 32 GB RAM | 24+ GB VRAM / 48 GB RAM | Not practical | <1 tok/s on CPU, 10-20 on GPU |
Hardware Recommendations by Device
MacBook Air M1/M2 (8 GB unified memory)
Run E2B. It fits comfortably and gives usable speeds. E4B will load but may swap to disk during long conversations.
MacBook Pro M2/M3/M4 (16-36 GB unified memory)
E4B is the sweet spot. The 26B MoE model also works well on 24+ GB configurations thanks to its low active parameter count.
Desktop with NVIDIA GPU (8-12 GB VRAM)
E4B with full GPU offload. The 26B MoE model fits on 12 GB cards like the RTX 4070.
Desktop with NVIDIA GPU (24 GB VRAM — RTX 3090/4090)
Run the full 31B dense model. This is where Gemma 4 truly shines.
Linux server / VPS (CPU-only)
E2B for real-time chat. E4B for batch processing where speed is less critical. Anything larger is impractical without a GPU.
Hetzner GPU server
The 26B and 31B models run well on Hetzner's dedicated GPU servers with RTX 4000 Ada (20 GB VRAM). See the GPU deployment section below, and our full Hetzner GPU setup guide for detailed pricing and configuration.
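The decision tree above condenses into a tiny helper script. The thresholds mirror the recommendation table and are judgment calls, not official requirements:

```shell
#!/usr/bin/env sh
# suggest_gemma_tag: map available memory (GB, RAM or VRAM) to a Gemma 4 tag.
# Thresholds follow the hardware table above; adjust to your own tolerance.
suggest_gemma_tag() {
  mem_gb=$1
  if   [ "$mem_gb" -ge 32 ]; then echo "gemma4:31b"
  elif [ "$mem_gb" -ge 16 ]; then echo "gemma4:26b"
  elif [ "$mem_gb" -ge 8  ]; then echo "gemma4:e4b"
  else                            echo "gemma4:e2b"
  fi
}

suggest_gemma_tag 24   # prints gemma4:26b
```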
Local Setup with Ollama — Step by Step
If you have not used Ollama before, it is a tool that downloads, manages, and runs language models locally with a single command. Think of it as Docker for LLMs. If you want the full setup walkthrough including Docker Compose and VPS deployment, see our Ollama + Open WebUI Self-Hosting Guide.
Step 1: Install Ollama
macOS:
brew install ollama
Or download the installer from ollama.com/download.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download the installer from ollama.com/download. Ollama runs as a background service on Windows.
Verify the installation:
ollama --version
# Should show v0.20.0 or later for Gemma 4 support
Step 2: Pull and Run Your First Gemma 4 Model
The default gemma4 tag points to the E4B model. Start here unless you know you need a different size:
# Pull the E4B model (default, ~9.6 GB download)
ollama pull gemma4
# Run it
ollama run gemma4
You will see a chat prompt. Type a question and hit Enter. The model runs entirely on your machine — no API key, no internet connection needed after the initial download.
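Everything the CLI does is also available over HTTP — Ollama listens on port 11434 by default. A minimal sketch using Ollama's standard /api/generate endpoint (assumes the server from the step above is running):

```shell
# Ask the model a question over Ollama's local HTTP API
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Why is the sky blue? Answer in one sentence.",
  "stream": false
}'
```

With `"stream": false` the full response arrives as a single JSON object; omit it and tokens stream back as they are generated.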
Step 3: Choose Your Model Size
Each model variant has its own tag:
# E2B — smallest, runs on almost anything
ollama pull gemma4:e2b
ollama run gemma4:e2b
# E4B — default, best balance of quality and speed
ollama pull gemma4:e4b
ollama run gemma4:e4b
# 26B MoE — sleeper pick, near-31B quality at roughly 4B-class speed
ollama pull gemma4:26b
ollama run gemma4:26b
# 31B Dense — best quality, needs 24GB+ VRAM
ollama pull gemma4:31b
ollama run gemma4:31b
Step 4: Test with Different Tasks
Basic conversation:
>>> What are the main differences between REST and GraphQL?
Image analysis (multimodal):
>>> Describe this image: /path/to/screenshot.png
Code generation:
>>> Write a Python function that implements binary search on a sorted list. Include type hints and docstring.
Reasoning with thinking mode:
>>> /set parameter num_ctx 8192
>>> Think step by step: A farmer has 17 sheep. All but 9 die. How many are left?
Step 5: Configure Model Parameters
Ollama lets you tune generation parameters per session:
# Set context window (tokens)
/set parameter num_ctx 32768
# Set temperature (0.0 = deterministic, 1.0 = creative)
/set parameter temperature 0.7
# Set top_p (nucleus sampling)
/set parameter top_p 0.9
For long documents or RAG workflows, increase num_ctx. The E2B and E4B models support up to 128K tokens; the 26B and 31B support up to 256K. Keep in mind that larger context windows use more RAM.
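To get a feel for why context size costs memory, here is a back-of-envelope KV-cache estimate. The layer and head counts below are illustrative placeholders, not Gemma 4's published architecture:

```shell
# KV cache bytes ≈ 2 (K and V) × layers × kv_heads × head_dim
#                  × 2 bytes (fp16) × context_tokens
# Illustrative numbers: 32 layers, 8 KV heads, head_dim 128, 32K context.
awk 'BEGIN { printf "%.1f GB\n", 2 * 32 * 8 * 128 * 2 * 32768 / 1e9 }'
# → 4.3 GB
```

So a 32K-token context can add gigabytes on top of the weights themselves — and at 256K the cache, not the model, often becomes the limiting factor.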
Updating Gemma 4
When new versions or quantizations are released, update with:
ollama pull gemma4
Ollama checks for the latest version and downloads only what changed — similar to how Docker handles image layers.
Open WebUI Integration — Browser-Based Chat
Running Gemma 4 from the terminal works, but Open WebUI gives you a ChatGPT-style browser interface with conversation history, model switching, document upload, and multi-user support. If you have followed our Ollama + Open WebUI guide, you already have this running.
Quick Setup with Docker
Make sure Ollama is running first (it starts automatically on macOS after installation). Then:
docker run -d \
--name open-webui \
-p 3000:8080 \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. Create an admin account on first launch.
Using Gemma 4 in Open WebUI
- Click the model dropdown in the top-left of the chat interface.
- You will see all models pulled in Ollama listed — gemma4:e4b, gemma4:26b, etc.
- Select your preferred Gemma 4 variant and start chatting.
Multi-Model Comparison
One of Open WebUI's best features is side-by-side model comparison. Pull multiple Gemma 4 variants and compare them on the same prompt:
ollama pull gemma4:e4b
ollama pull gemma4:26b
In Open WebUI, enable the comparison view to see how the E4B and 26B respond to the same question. This is useful for deciding which model size fits your use case before committing to one.
Document Upload and RAG
Open WebUI supports uploading PDFs, text files, and other documents directly into the chat. Gemma 4's large context window (up to 256K on the 26B and 31B models) makes it effective for document Q&A without needing a separate RAG pipeline for shorter documents.
For production RAG setups, Open WebUI also integrates with external vector databases — but for most personal and small team use cases, the built-in document upload handles things well enough.
GPU Server Deployment on Hetzner
The E2B and E4B models run fine on consumer hardware. But if you want to run the 26B MoE or 31B Dense models with good performance — especially for team use or API serving — you need a GPU server.
Hetzner's dedicated GPU servers offer the best price-to-performance ratio for this. Their GEX44 plan with an NVIDIA RTX 4000 Ada (20 GB VRAM) starts at €184/month — roughly 75% cheaper than equivalent AWS GPU instances. See our complete Hetzner GPU setup guide for detailed pricing and server selection.
Server Setup
1. Provision a Hetzner dedicated GPU server.
Order a GEX44 or higher from Hetzner's Robot panel. Choose Ubuntu 24.04 as the OS.
2. Install NVIDIA drivers and Ollama.
# SSH into your server
ssh root@your-server-ip
# Install NVIDIA drivers (Ubuntu 24.04) using Ubuntu's official driver tool
apt update && apt install -y ubuntu-drivers-common
ubuntu-drivers install
# Reboot to load drivers
reboot
After reboot, verify the GPU is detected:
nvidia-smi
3. Install Ollama with GPU support.
curl -fsSL https://ollama.com/install.sh | sh
Ollama auto-detects NVIDIA GPUs. No additional configuration needed.
4. Pull and run the 31B model.
ollama pull gemma4:31b
ollama run gemma4:31b
With 20 GB VRAM on the RTX 4000 Ada, the 31B model fits with room for a reasonable context window. For the 26B MoE model, you get even more headroom since only 3.8B parameters activate at inference time.
Expose via Open WebUI
For team access, deploy Open WebUI on the same server:
docker run -d \
--name open-webui \
-p 3000:8080 \
--gpus all \
-v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://localhost:11434 \
--restart always \
ghcr.io/open-webui/open-webui:main
Set up a reverse proxy (Nginx or Caddy) with HTTPS, and your team has a private, self-hosted AI chat running Gemma 4 on dedicated GPU hardware. Our self-hosted dev stack guide covers Caddy reverse proxy setup if you need it.
Cost Comparison
| Setup | Monthly Cost | Model Size | Speed |
|---|---|---|---|
| MacBook Pro M3 (local) | $0 (already owned) | E4B | ~25 tok/s |
| Hetzner CX22 VPS (CPU-only) | ~€5/month | E2B | ~5-10 tok/s |
| Hetzner GEX44 GPU server | ~€184/month | 31B Dense | ~15-20 tok/s |
| AWS g5.xlarge (A10G) | ~$750/month | 31B Dense | ~15-20 tok/s |
For teams that need the 31B model running 24/7, Hetzner saves roughly €550/month compared to AWS for equivalent GPU performance.
Benchmarks — Gemma 4 vs. the Competition
Open model benchmarks shift fast. Here is where Gemma 4 stands as of April 2026, based on published results and leaderboard rankings.
Flagship Results (31B Dense)
| Benchmark | Gemma 4 31B | Llama 4 Scout | Qwen 3.5 27B | DeepSeek-V3.2 |
|---|---|---|---|---|
| Arena Elo (text) | 1452 (#3) | ~1420 | ~1430 | ~1460 |
| MMLU Pro | 85.2% | 83.1% | 84.8% | 87.5% |
| AIME 2026 (math) | 89.2% | 82.5% | 86.1% | 92.3% |
| GPQA Diamond (science) | 84.3% | 79.8% | 82.1% | 86.7% |
| LiveCodeBench v6 | 80.0% | 75.3% | 78.4% | 83.2% |
| Codeforces Elo | 2150 | 1980 | 2050 | 2280 |
Sources: Arena AI leaderboard, ai.rs comparison, Lushbinary benchmarks
What the Benchmarks Tell You
Gemma 4 wins on efficiency. At 31B parameters, it outperforms Llama 4 Scout (which uses a much larger MoE architecture) and matches or beats Qwen 3.5 27B on most tasks. For its size, it is the strongest open model available.
DeepSeek still leads on raw reasoning. If you need the absolute best performance on complex math, competitive coding, and chain-of-thought reasoning, DeepSeek-V3.2 remains ahead. But it is also significantly larger and more expensive to run.
The 26B MoE is the real story. With only 3.8B active parameters, the 26B model delivers quality close to the 31B dense model at a fraction of the compute cost. This is the model that most developers should try first on capable hardware.
E2B and E4B have no real competition at their size. The E2B with native multimodal support (text + image + audio) and 128K context in a 2B-parameter model has no equivalent in the Llama 4 or Qwen 3.5 families. These are genuinely new capabilities at this size tier.
An Honest Assessment
Gemma 4 is not the best open model at everything. It trails Chinese competitors (DeepSeek, Qwen) on deep reasoning benchmarks. Its 31B flagship is competitive but not dominant. Where Gemma 4 genuinely excels is the breadth of its lineup — four sizes covering everything from edge devices to workstations — and the multimodal capabilities baked into every variant.
For most local AI use cases (chat, code completion, document Q&A, simple agentic tasks), Gemma 4 is more than capable. For research-grade reasoning or competitive coding, you may want to pair it with DeepSeek or Qwen for those specific tasks.
Use Cases — Where Gemma 4 Fits in Your Workflow
Coding Assistant
Use Gemma 4 as a local code completion and generation model in your IDE. The 31B model handles multi-file refactoring and architectural decisions well. The E4B handles function-level generation and explanations. Pair it with tools like Continue.dev or your IDE's Ollama integration.
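For Continue.dev specifically, pointing it at a local Ollama model is a one-block config change. A sketch for Continue's config.json — the field names follow Continue's published schema, but verify them against your installed version:

```json
{
  "models": [
    {
      "title": "Gemma 4 (local)",
      "provider": "ollama",
      "model": "gemma4:e4b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Gemma 4 autocomplete",
    "provider": "ollama",
    "model": "gemma4:e2b"
  }
}
```

Using the smaller E2B for tab autocomplete keeps completions snappy while the E4B handles chat and refactoring requests.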
If you are building a free AI coding stack, Gemma 4 via Ollama is one of the strongest local model options available — it costs nothing to run and works offline.
Private Chat Interface
Run Open WebUI with Gemma 4 for a completely private ChatGPT alternative. No data leaves your machine. This is especially valuable for conversations involving proprietary code, confidential business information, or personal data.
RAG and Document Q&A
Gemma 4's 256K context window (on 26B and 31B) means you can feed in entire documents without chunking for many use cases. For larger document sets, pair it with a vector database through Open WebUI's RAG integration.
Embeddings
While Gemma 4 is primarily a generative model, the E2B variant works as a lightweight embedding model for local search and similarity applications. For dedicated embedding tasks, you may still prefer specialized models like nomic-embed-text, but Gemma 4 can handle both generation and basic embedding in a single model.
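If you want to try Gemma 4 for embeddings, Ollama exposes a dedicated endpoint. A sketch against the standard /api/embed route (assumes a local server with the model already pulled):

```shell
# Request an embedding vector for a piece of text
curl -s http://localhost:11434/api/embed -d '{
  "model": "gemma4:e2b",
  "input": "Self-hosted models keep private data on your own hardware."
}'
```

The response contains an `embeddings` array of floats that you can store in any vector database.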
Edge and Mobile Deployment
The E2B model is explicitly designed for on-device deployment. Google has announced Gemma 4 support in Android AICore for on-device inference (source), and NVIDIA has published acceleration guides for running Gemma 4 on RTX hardware (source).
Agentic Workflows
Gemma 4's native function calling support makes it suitable for agentic workflows where the model needs to invoke tools, query databases, or call APIs. The 26B and 31B models handle multi-step reasoning chains reliably.
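Function calling goes through Ollama's /api/chat endpoint with a `tools` array. A sketch with a made-up `get_weather` tool — the tool name and schema are ours for illustration, while the request shape follows Ollama's standard tool-calling format:

```shell
# Offer the model a tool; it may respond with a tool call instead of text
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "messages": [
    {"role": "user", "content": "What is the weather in Berlin right now?"}
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
```

If the model decides to use the tool, the response's `message.tool_calls` field carries the function name and arguments; your code executes the tool and sends the result back as a `role: "tool"` message.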
Troubleshooting Common Issues
Model fails to load or crashes
Cause: Not enough RAM or VRAM. Ollama will try to load the model and fail silently or crash.
Fix: Drop to a smaller model variant. If E4B crashes on 8 GB RAM, use E2B instead.
ollama run gemma4:e2b
Very slow generation (< 1 tok/s)
Cause: Model is running on CPU when it should be on GPU, or the model is too large for available memory and is swapping.
Fix: Check if Ollama detects your GPU:
ollama ps
If GPU is not listed, ensure NVIDIA drivers are installed (nvidia-smi should show your card). On macOS with Apple Silicon, GPU acceleration is automatic.
Open WebUI shows no models
Cause: Open WebUI cannot reach the Ollama server.
Fix: Ensure Ollama is running and check the OLLAMA_BASE_URL environment variable in your Docker run command. For Docker Desktop on Mac/Windows, use http://host.docker.internal:11434. For Linux, use http://localhost:11434 if running on the same machine, or use --network host.
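A quick way to tell which side is broken: a healthy Ollama server answers /api/tags with the list of pulled models. Run this on the machine hosting Ollama:

```shell
# Lists installed models as JSON if the server is up; "connection refused" otherwise
curl -s http://localhost:11434/api/tags
```

If this works on the host but Open WebUI still shows no models, the problem is the container-to-host networking, not Ollama itself.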
Context window runs out mid-conversation
Cause: Ollama's default context window is small (2,048–4,096 tokens depending on version). Gemma 4 supports far more, but you have to raise the limit yourself.
Fix:
# In the Ollama chat, increase context
/set parameter num_ctx 32768
# Or set a server-wide default via environment variable before starting Ollama
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
Higher context uses more memory. Scale according to your available RAM/VRAM.
Quick Reference Card
| Task | Command |
|---|---|
| Install Ollama (macOS) | `brew install ollama` |
| Install Ollama (Linux) | `curl -fsSL https://ollama.com/install.sh \| sh` |
| Pull default model (E4B) | `ollama pull gemma4` |
| Pull specific variant | `ollama pull gemma4:e2b` / `gemma4:26b` / `gemma4:31b` |
| Run model | `ollama run gemma4` |
| List downloaded models | `ollama list` |
| Check running models | `ollama ps` |
| Update model | `ollama pull gemma4` (re-pull) |
| Set context window | `/set parameter num_ctx 32768` |
| Run Open WebUI | `docker run -d -p 3000:8080 ...` (see above) |
What to Read Next
- Ollama + Open WebUI Self-Hosting Guide — Full Docker Compose setup, VPS deployment, multi-user configuration, and model management.
- Hetzner Cloud GPU Server Guide — Detailed GPU server pricing, setup walkthrough, and AWS/GCP cost comparison for AI workloads.
- Self-Host Your Dev Stack Under $20/Month — The complete budget infrastructure stack including reverse proxy, CI/CD, and monitoring.
- Free AI Coding Tools: $0/Month Stack — How Gemma 4 fits into a zero-cost AI coding workflow alongside Gemini Code Assist and GitHub Copilot Free.