
brian austin
How to run Gemma 4 26B locally in April 2026: Ollama setup that actually works


The HN post went up last week. 248 points. Everyone's asking the same thing: how do I actually get Gemma 4 26B running locally without it being a mess?

Here's what works in April 2026.

What you need

  • Mac mini M2 Pro or better (16GB RAM minimum, 32GB recommended for 26B)
  • Or: Linux machine with 24GB+ VRAM
  • Ollama 0.5+ (the quantization support is critical)
  • ~17GB free disk space for the Q4_K_M quantization
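Those numbers hang together if you do the arithmetic. A rough rule of thumb for quantized model size (a sketch with assumed figures, not official Gemma 4 numbers): Q4_K_M averages about 4.8 bits per weight, plus roughly 12% runtime overhead for the KV cache and buffers.

```python
# Rough size rule of thumb (assumed figures, not official Gemma 4 numbers):
# Q4_K_M averages ~4.8 bits per weight, plus ~12% runtime overhead
# (KV cache, buffers).
def estimate_model_gb(params_billion: float, bits_per_weight: float = 4.8,
                      overhead: float = 1.12) -> float:
    return round(params_billion * bits_per_weight / 8 * overhead, 1)

print(estimate_model_gb(26))  # 17.5 -- matches the ~17GB download
```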

Install Ollama

```shell
curl -fsSL https://ollama.ai/install.sh | sh
```

Verify:

```shell
ollama --version
# ollama version 0.5.x
```
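If you script the install, you can gate on that minimum in code. A small check that parses the `ollama --version` line, assuming the usual `ollama version X.Y.Z` output format:

```python
# Parse the output of `ollama --version` (e.g. "ollama version 0.5.7")
# and check it against the 0.5 minimum. Output format is an assumption.
def meets_minimum(version_line: str, minimum: str = "0.5") -> bool:
    version = version_line.strip().split()[-1]          # "0.5.7"
    have = [int(p) for p in version.split(".")]
    need = [int(p) for p in minimum.split(".")]
    return have[: len(need)] >= need                    # compare major.minor

print(meets_minimum("ollama version 0.5.7"))  # True
```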

Pull Gemma 4 26B

The 26B model in Q4_K_M quantization hits the sweet spot of quality vs RAM usage:

```shell
ollama pull gemma4:26b
```

This is ~17GB. Takes 10-15 minutes on fast internet. Go get coffee.
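The 10-15 minute figure is just bandwidth arithmetic; plug in your own link speed (the speeds below are assumptions, not measurements):

```python
# GB -> megabits -> minutes; link speed is whatever your connection does.
def download_minutes(size_gb: float, mbps: float) -> float:
    return round(size_gb * 8000 / mbps / 60, 1)

print(download_minutes(17, 200))  # 11.3 -- "fast internet"
print(download_minutes(17, 100))  # 22.7
```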

To pin the 4-bit instruct quantization explicitly (the build that fits in 16GB RAM):

```shell
ollama pull gemma4:26b-instruct-q4_K_M
```

Run it

```shell
ollama run gemma4:26b
```

You'll see:

```
>>> Send a message (/? for help)
```

That's it. You're talking to a 26B parameter model locally.

The API

Ollama exposes a REST API on port 11434:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "Write a Python function to parse JSON safely",
  "stream": false
}'
```
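The same call from Python using only the standard library. The `build_payload` helper is just for illustration; `/api/generate` and the `response` field are Ollama's documented request/response shapes:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def build_payload(prompt: str, model: str = "gemma4:26b",
                  stream: bool = False) -> dict:
    """Assemble the generate-request body."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt: str, model: str = "gemma4:26b") -> str:
    """POST a non-streaming request and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```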

OpenAI-compatible endpoint:

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
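Because the endpoint is OpenAI-compatible, the official `openai` Python client also works if you point `base_url` at `http://localhost:11434/v1` with any dummy API key. If you'd rather stay dependency-free, the reply text sits where the OpenAI chat-completions spec puts it:

```python
# The reply lives at choices[0].message.content per the OpenAI
# chat-completions response shape. A stdlib-only extractor:
def extract_reply(completion: dict) -> str:
    return completion["choices"][0]["message"]["content"]

sample = {"choices": [{"message": {"role": "assistant",
                                   "content": "Hi there!"}}]}
print(extract_reply(sample))  # Hi there!
```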

Use it with Claude Code

Here's the thing: local models are great for privacy, but they're slow and the quality gap with Claude Sonnet is real for complex coding tasks.

For local-first with Claude as fallback:

```shell
# .env
ANTHROPIC_BASE_URL=https://simplylouie.com
ANTHROPIC_API_KEY=your-key-here
```

This routes Claude Code through a proxy that removes rate limits. SimplyLouie runs at $2/month, which is less than the electricity a local GPU burns.

For tasks that need to stay local (proprietary code), use Ollama. For tasks where quality matters (production code, debugging), use Claude via proxy.
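That split can live in code, too. A hypothetical router sketch; the task names and `sensitive` flag are made up for illustration, not a real Claude Code or Ollama API:

```python
# Illustrative routing logic for the local/cloud split described above.
LOCAL_TASKS = {"explain", "commit_message", "rename", "readme", "review"}

def pick_backend(task: str, sensitive: bool) -> str:
    """Sensitive code never leaves the machine; otherwise prefer quality."""
    if sensitive or task in LOCAL_TASKS:
        return "ollama"   # gemma4:26b on localhost:11434
    return "claude"       # Claude Code via ANTHROPIC_BASE_URL proxy

print(pick_backend("explain", sensitive=False))  # ollama
print(pick_backend("debug", sensitive=True))     # ollama
print(pick_backend("debug", sensitive=False))    # claude
```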

Performance reality check

On Mac mini M2 Pro (16GB RAM):

  • Gemma 4 26B Q4_K_M: ~8-12 tokens/second
  • First token latency: 2-4 seconds
  • Context: 8K tokens comfortably, 16K with slowdown

On Linux with RTX 3090 (24GB VRAM):

  • Gemma 4 26B full precision: ~15-20 tokens/second
  • First token latency: <1 second
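To translate throughput into wall-clock time: one reply takes roughly first-token latency plus tokens divided by tokens/second. Using midpoints of the ranges above:

```python
# Wall-clock estimate for one reply: latency + tokens / throughput.
def reply_seconds(tokens: int, tok_per_sec: float,
                  first_token_s: float) -> float:
    return round(first_token_s + tokens / tok_per_sec, 1)

# A 400-token answer, midpoints of the measured ranges:
print(reply_seconds(400, 10, 3))    # M2 Pro:   43.0 s
print(reply_seconds(400, 18, 0.8))  # RTX 3090: 23.0 s
```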

The sweet spots for local Gemma 4

Good at:

  • Code explanation (reads your code, explains what it does)
  • Commit message generation
  • Simple refactoring (rename variables, extract functions)
  • README generation
  • Code review for obvious issues

Not great at:

  • Complex multi-file refactoring
  • Debugging subtle logic errors
  • Keeping context across a long session
  • Tasks requiring knowledge of recent libraries (2025+)

Monitor resource usage

```shell
# CPU/RAM usage ('[o]llama' keeps grep itself out of the results)
watch -n 1 'ps aux | grep [o]llama'

# On Mac, GPU usage
sudo powermetrics --samplers gpu_power -i 500 -n 1

# Temperature
osx-cpu-temp  # brew install osx-cpu-temp
```
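If you want those numbers from a script instead of `watch`, the RSS column of `ps aux` output (resident memory, in kilobytes, sixth field in the standard BSD format) converts directly. The sample line below is fabricated:

```python
# ps aux prints RSS in kilobytes as the sixth column; convert to GB.
def rss_gb(ps_line: str) -> float:
    return round(int(ps_line.split()[5]) / 1024 / 1024, 1)

sample = ("user 4242 85.0 92.1 401000000 17301504 ?? R "
          "9:01AM 12:34.56 /usr/local/bin/ollama serve")
print(rss_gb(sample))  # 16.5
```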

The CLAUDE.md integration

If you're using both Ollama and Claude Code, document the split in your CLAUDE.md:

```markdown
# AI Tool Configuration

## Local (Ollama/Gemma 4 26B)
Use for: sensitive code review, offline work, explain-this-code tasks
Command: `ollama run gemma4:26b`

## Claude Code (via proxy)
Use for: complex refactoring, debugging, new feature implementation
Set: ANTHROPIC_BASE_URL=https://simplylouie.com
```

This way any developer who opens the repo knows which tool to use for what.

April 2026 vs January 2026

The Ollama setup has gotten meaningfully better in the last 90 days:

  • Flash attention support is stable now
  • Quantization options are broader
  • The OpenAI-compatible API is much more reliable
  • GPU offloading for hybrid CPU/GPU machines works properly

If you tried this in Q4 2025 and hit issues, try again. It's a different experience.

What about Gemma 4 27B?

Google released both 26B and 27B variants. The 27B adds multimodal support (image input). If you don't need vision, stick with 26B — same quality for text tasks, smaller download.

For vision tasks:

```shell
ollama pull gemma4:27b
```

Then you can pass base64-encoded images in the API:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:27b",
  "prompt": "What is in this image?",
  "images": ["base64_encoded_image_here"]
}'
```
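Building that payload from an actual image means base64-encoding the raw file bytes first. A minimal helper; the PNG bytes in the example are a stand-in, not a real image:

```python
import base64
import json

# The API takes base64-encoded image bytes, not URLs; build the body
# from raw bytes (read them with open(path, "rb").read() in practice).
def vision_payload(image_bytes: bytes, prompt: str,
                   model: str = "gemma4:27b") -> str:
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"model": model, "prompt": prompt,
                       "images": [encoded]})

payload = vision_payload(b"\x89PNG...", "What is in this image?")
print(json.loads(payload)["model"])  # gemma4:27b
```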

Bottom line

Gemma 4 26B is the best open model running locally right now. Setup is 10 minutes. The quality is genuinely good for code tasks.

For the heavy lifting (complex debugging sessions, multi-file refactors), pair it with Claude via ANTHROPIC_BASE_URL: $2/month, no rate limits, the same Claude Sonnet you'd use directly.

The local + proxy combination is the best developer AI setup in 2026.
