
brian austin
How to run Gemma 4 26B locally in April 2026: Ollama setup that actually works


The HN post went up last week. 248 points. Everyone's asking the same thing: how do I actually get Gemma 4 26B running locally without it being a mess?

Here's what works in April 2026.

What you need

  • Mac mini M2 Pro or better (16GB RAM minimum, 32GB recommended for 26B)
  • Or: Linux machine with 24GB+ VRAM
  • Ollama 0.5+ (the quantization support is critical)
  • ~17GB free disk space for the Q4_K_M quantization
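Those numbers hang together if you do the arithmetic. A rough rule of thumb for quantized model size (a sketch with assumed figures, not official Gemma 4 numbers): Q4_K_M averages about 4.8 bits per weight, plus roughly 12% runtime overhead for the KV cache and buffers.

```python
# Rough size rule of thumb (assumed figures, not official Gemma 4 numbers):
# Q4_K_M averages ~4.8 bits per weight, plus ~12% runtime overhead
# (KV cache, buffers).
def estimate_model_gb(params_billion: float, bits_per_weight: float = 4.8,
                      overhead: float = 1.12) -> float:
    return round(params_billion * bits_per_weight / 8 * overhead, 1)

print(estimate_model_gb(26))  # 17.5 -- matches the ~17GB download
```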

Install Ollama

```shell
curl -fsSL https://ollama.ai/install.sh | sh
```

Verify:

```shell
ollama --version
# ollama version 0.5.x
```
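If you script the install, you can gate on that minimum in code. A small check that parses the `ollama --version` line, assuming the usual `ollama version X.Y.Z` output format:

```python
# Parse the output of `ollama --version` (e.g. "ollama version 0.5.7")
# and check it against the 0.5 minimum. Output format is an assumption.
def meets_minimum(version_line: str, minimum: str = "0.5") -> bool:
    version = version_line.strip().split()[-1]          # "0.5.7"
    have = [int(p) for p in version.split(".")]
    need = [int(p) for p in minimum.split(".")]
    return have[: len(need)] >= need                    # compare major.minor

print(meets_minimum("ollama version 0.5.7"))  # True
```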

Pull Gemma 4 26B

The 26B model in Q4_K_M quantization hits the sweet spot of quality vs RAM usage:

```shell
ollama pull gemma4:26b
```

This is ~17GB. Takes 10-15 minutes on fast internet. Go get coffee.
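The 10-15 minute figure is just bandwidth arithmetic; plug in your own link speed (the speeds below are assumptions, not measurements):

```python
# GB -> megabits -> minutes; link speed is whatever your connection does.
def download_minutes(size_gb: float, mbps: float) -> float:
    return round(size_gb * 8000 / mbps / 60, 1)

print(download_minutes(17, 200))  # 11.3 -- "fast internet"
print(download_minutes(17, 100))  # 22.7
```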

To pin the 4-bit instruct quantization explicitly (the build that fits in 16GB RAM):

```shell
ollama pull gemma4:26b-instruct-q4_K_M
```

Run it

```shell
ollama run gemma4:26b
```

You'll see:

```
>>> Send a message (/? for help)
```

That's it. You're talking to a 26B parameter model locally.

The API

Ollama exposes a REST API on port 11434:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "Write a Python function to parse JSON safely",
  "stream": false
}'
```
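The same call from Python using only the standard library. The `build_payload` helper is just for illustration; `/api/generate` and the `response` field are Ollama's documented request/response shapes:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def build_payload(prompt: str, model: str = "gemma4:26b",
                  stream: bool = False) -> dict:
    """Assemble the generate-request body."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt: str, model: str = "gemma4:26b") -> str:
    """POST a non-streaming request and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```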

OpenAI-compatible endpoint:

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
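Because the endpoint is OpenAI-compatible, the official `openai` Python client also works if you point `base_url` at `http://localhost:11434/v1` with any dummy API key. If you'd rather stay dependency-free, the reply text sits where the OpenAI chat-completions spec puts it:

```python
# The reply lives at choices[0].message.content per the OpenAI
# chat-completions response shape. A stdlib-only extractor:
def extract_reply(completion: dict) -> str:
    return completion["choices"][0]["message"]["content"]

sample = {"choices": [{"message": {"role": "assistant",
                                   "content": "Hi there!"}}]}
print(extract_reply(sample))  # Hi there!
```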

Use it with Claude Code

Here's the thing: local models are great for privacy, but they're slow and the quality gap with Claude Sonnet is real for complex coding tasks.

For local-first with Claude as fallback:

```shell
# .env
ANTHROPIC_BASE_URL=https://simplylouie.com
ANTHROPIC_API_KEY=your-key-here
```

This routes Claude Code through a proxy that removes rate limits. SimplyLouie runs at $2/month, which is less than the electricity a local GPU burns.

For tasks that need to stay local (proprietary code), use Ollama. For tasks where quality matters (production code, debugging), use Claude via proxy.
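That split can live in code, too. A hypothetical router sketch; the task names and `sensitive` flag are made up for illustration, not a real Claude Code or Ollama API:

```python
# Illustrative routing logic for the local/cloud split described above.
LOCAL_TASKS = {"explain", "commit_message", "rename", "readme", "review"}

def pick_backend(task: str, sensitive: bool) -> str:
    """Sensitive code never leaves the machine; otherwise prefer quality."""
    if sensitive or task in LOCAL_TASKS:
        return "ollama"   # gemma4:26b on localhost:11434
    return "claude"       # Claude Code via ANTHROPIC_BASE_URL proxy

print(pick_backend("explain", sensitive=False))  # ollama
print(pick_backend("debug", sensitive=True))     # ollama
print(pick_backend("debug", sensitive=False))    # claude
```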

Performance reality check

On Mac mini M2 Pro (16GB RAM):

  • Gemma 4 26B Q4_K_M: ~8-12 tokens/second
  • First token latency: 2-4 seconds
  • Context: 8K tokens comfortably, 16K with slowdown

On Linux with RTX 3090 (24GB VRAM):

  • Gemma 4 26B full precision: ~15-20 tokens/second
  • First token latency: <1 second
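To translate throughput into wall-clock time: one reply takes roughly first-token latency plus tokens divided by tokens/second. Using midpoints of the ranges above:

```python
# Wall-clock estimate for one reply: latency + tokens / throughput.
def reply_seconds(tokens: int, tok_per_sec: float,
                  first_token_s: float) -> float:
    return round(first_token_s + tokens / tok_per_sec, 1)

# A 400-token answer, midpoints of the measured ranges:
print(reply_seconds(400, 10, 3))    # M2 Pro:   43.0 s
print(reply_seconds(400, 18, 0.8))  # RTX 3090: 23.0 s
```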

The sweet spots for local Gemma 4

Good at:

  • Code explanation (reads your code, explains what it does)
  • Commit message generation
  • Simple refactoring (rename variables, extract functions)
  • README generation
  • Code review for obvious issues

Not great at:

  • Complex multi-file refactoring
  • Debugging subtle logic errors
  • Keeping context across a long session
  • Tasks requiring knowledge of recent libraries (2025+)

Monitor resource usage

```shell
# CPU/RAM usage ('[o]llama' keeps grep itself out of the results)
watch -n 1 'ps aux | grep [o]llama'

# On Mac, GPU usage
sudo powermetrics --samplers gpu_power -i 500 -n 1

# Temperature
osx-cpu-temp  # brew install osx-cpu-temp
```
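If you want those numbers from a script instead of `watch`, the RSS column of `ps aux` output (resident memory, in kilobytes, sixth field in the standard BSD format) converts directly. The sample line below is fabricated:

```python
# ps aux prints RSS in kilobytes as the sixth column; convert to GB.
def rss_gb(ps_line: str) -> float:
    return round(int(ps_line.split()[5]) / 1024 / 1024, 1)

sample = ("user 4242 85.0 92.1 401000000 17301504 ?? R "
          "9:01AM 12:34.56 /usr/local/bin/ollama serve")
print(rss_gb(sample))  # 16.5
```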

The CLAUDE.md integration

If you're using both Ollama and Claude Code, document the split in your CLAUDE.md:

```markdown
# AI Tool Configuration

## Local (Ollama/Gemma 4 26B)
Use for: sensitive code review, offline work, explain-this-code tasks
Command: `ollama run gemma4:26b`

## Claude Code (via proxy)
Use for: complex refactoring, debugging, new feature implementation
Set: ANTHROPIC_BASE_URL=https://simplylouie.com
```

This way any developer who opens the repo knows which tool to use for what.

April 2026 vs January 2026

The Ollama setup has gotten meaningfully better in the last 90 days:

  • Flash attention support is stable now
  • Quantization options are broader
  • The OpenAI-compatible API is much more reliable
  • GPU offloading for hybrid CPU/GPU machines works properly

If you tried this in Q4 2025 and hit issues, try again. It's a different experience.

What about Gemma 4 27B?

Google released both 26B and 27B variants. The 27B adds multimodal support (image input). If you don't need vision, stick with 26B — same quality for text tasks, smaller download.

For vision tasks:

```shell
ollama pull gemma4:27b
```

Then you can pass base64-encoded images in the API:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:27b",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_here"]
}'
```
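Building that payload from an actual image means base64-encoding the raw file bytes first. A minimal helper; the PNG bytes in the example are a stand-in, not a real image:

```python
import base64
import json

# The API takes base64-encoded image bytes, not URLs; build the body
# from raw bytes (read them with open(path, "rb").read() in practice).
def vision_payload(image_bytes: bytes, prompt: str,
                   model: str = "gemma4:27b") -> str:
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"model": model, "prompt": prompt,
                       "images": [encoded]})

payload = vision_payload(b"\x89PNG...", "What is in this image?")
print(json.loads(payload)["model"])  # gemma4:27b
```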

Bottom line

Gemma 4 26B is the best open model running locally right now. Setup is 10 minutes. The quality is genuinely good for code tasks.

For the heavy lifting (complex debugging sessions, multi-file refactors), pair it with Claude via ANTHROPIC_BASE_URL: $2/month, no rate limits, the same Claude Sonnet you'd use directly.

The local + proxy combination is the best developer AI setup in 2026.
