Max Quimby

Posted on • Originally published at computeleap.com

Run Claude Code for 99% Less With Ollama and OpenRouter

Cover image: terminal interface showing local model setup vs. cloud API pricing

At 12 PM Pacific today, Anthropic flipped the switch. Claude Max subscriptions — the $100/month and $200/month plans that gave you unlimited Opus 4.6 — no longer work with third-party tools like OpenClaw, Cline, or any harness outside Anthropic's own apps. If you were running Claude Code through a third-party client on your Max subscription, it stopped working this afternoon.

The announcement came from Boris Cherny, Claude Code's creator, and was confirmed across multiple channels. The reaction was immediate: two separate tutorial videos dropped within hours, the "free Claude Code" community mobilized, and Hugging Face's CEO started posting CLI commands to run open-source models as direct replacements.

⚠️ The April 4 cutoff is real. Starting today at 12 PM PT, Claude Max subscriptions no longer cover usage on third-party tools. You need an API key from here on — which means per-token billing instead of a flat monthly fee. For heavy users, this could mean going from $100/month to $500-2,000/month overnight.

But here's the thing: Claude Code's harness doesn't care which model powers it. The agent framework — the file reading, code writing, git integration, terminal execution — is separate from the language model underneath. Swap the model, keep the workflow. That's exactly what we're going to do.

This guide covers two approaches: Ollama (completely free, runs on your machine) and OpenRouter (pennies per request, cloud-hosted). Both work today. Both are tested. And both will save you 90-99% compared to API pricing.

What Actually Changed (And Why It Matters)

Let's be precise about what happened. Anthropic didn't shut down Claude Code. They didn't change the API. What they did was decouple the Max subscription from third-party tool access.

Previously, your $100/month Max plan gave you unlimited Claude Opus 4.6 usage — and that included any tool that could authenticate through your Anthropic account. Power users on OpenClaw were getting hundreds of dollars worth of API calls for a flat fee. From Anthropic's perspective, these users were "freeloading at scale," as one analyst put it.

Now, third-party tools require an API key with per-token billing:

  • Claude Opus 4.6: $15 per million input tokens, $75 per million output tokens
  • Claude Sonnet 4.5: $3 per million input tokens, $15 per million output tokens

For a typical coding session — 50,000 input tokens and 10,000 output tokens — that's roughly $1.50 per session with Opus or $0.30 with Sonnet. Do 10 sessions a day and you're looking at $450/month with Opus. Heavy users report $1,000+ monthly bills on the API.
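
That arithmetic is easy to sanity-check. A quick sketch, using the prices quoted above (the token counts are illustrative; real sessions vary widely):

```python
# Back-of-envelope session costs at the API prices quoted above.
# Token counts are illustrative; real sessions vary widely.

def session_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one session; prices are per million tokens."""
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# A "typical" session: 50K input tokens, 10K output tokens
opus = session_cost(50_000, 10_000, 15.00, 75.00)
sonnet = session_cost(50_000, 10_000, 3.00, 15.00)

print(f"Opus:   ${opus:.2f}/session, ${opus * 10 * 30:,.0f}/mo at 10 sessions/day")
print(f"Sonnet: ${sonnet:.2f}/session, ${sonnet * 10 * 30:,.0f}/mo at 10 sessions/day")
# Opus:   $1.50/session, $450/mo at 10 sessions/day
# Sonnet: $0.30/session, $90/mo at 10 sessions/day
```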

📊 The cost math that triggered the migration: Light users (5 sessions/day) go from $100/mo on Max to ~$225/mo on API Opus — or $0 with Ollama and ~$5/mo on OpenRouter. Heavy users (20+/day) face ~$900+/mo on API vs. still $0 locally. Power users who coded all day on Max? Looking at $2,000+/mo on the API.

| Usage Level | Max Plan (Before) | API Opus (After) | Ollama (Local) | OpenRouter |
| --- | --- | --- | --- | --- |
| Light (5 sessions/day) | $100/mo | ~$225/mo | $0 | ~$5/mo |
| Medium (10 sessions/day) | $100/mo | ~$450/mo | $0 | ~$10/mo |
| Heavy (20+ sessions/day) | $200/mo | ~$900+/mo | $0 | ~$25/mo |
| Power user (all day) | $200/mo | ~$2,000+/mo | $0 | ~$50/mo |

Ollama costs = electricity only. OpenRouter costs assume using capable free-tier or low-cost models like Qwen3.5, Gemma 4, or DeepSeek.

The community response has been swift. Nate Herk published two tutorials the same day. Clément Delangue (Hugging Face CEO) posted literal CLI commands to run Gemma 4 locally as a Claude replacement. The "free Claude Code" tutorial is becoming its own genre.

Approach 1: Ollama — Free, Local, Unlimited

Ollama is an open-source tool that runs large language models on your own hardware. No API keys. No billing. No data leaving your machine. You download a model, point Claude Code at it, and you're coding.

Prerequisites

  • macOS, Linux, or Windows (with WSL2)
  • 16GB+ RAM (32GB recommended for larger models)
  • ~20GB free disk space per model
  • A reasonably modern CPU — Apple Silicon (M1+) or a recent AMD/Intel with AVX2
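
The RAM and disk numbers above follow from model size. A rough rule of thumb, assuming the ~4-bit quantization Ollama builds typically use and ignoring KV-cache overhead:

```python
# Rough footprint: parameters x bytes per weight. Ollama builds are typically
# ~4-bit quantized, so a 35B model weighs in around 17.5 GB on disk, and you
# want headroom beyond that in RAM for the KV cache and runtime.

def approx_size_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    # 1B params at 8 bits is roughly 1 GB, so scale by bits/8
    return params_billion * bits_per_weight / 8

for model, params in [("qwen3.5:35b", 35), ("gemma4:26b", 26), ("qwen3.5:14b", 14)]:
    print(f"{model}: ~{approx_size_gb(params):.1f} GB at 4-bit")
# qwen3.5:35b: ~17.5 GB at 4-bit
# gemma4:26b: ~13.0 GB at 4-bit
# qwen3.5:14b: ~7.0 GB at 4-bit
```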

Step 1: Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows (via WSL2)
curl -fsSL https://ollama.com/install.sh | sh

Start the Ollama server:

ollama serve

This runs in the background and exposes a local API at http://localhost:11434.

Step 2: Pull a Coding Model

Not all models are equal for code generation. Here's what works well:

# Best overall coding model for local use (35B, needs 24GB+ RAM)
ollama pull qwen3.5:35b

# Great MoE option — only 4B active params, runs on 16GB (26B total)
ollama pull gemma4:26b

# Smaller but capable (needs 8GB+ RAM)
ollama pull qwen3.5:14b

# Budget option — runs on almost anything (needs 4GB+ RAM)
ollama pull qwen3.5:7b

⚡ Which model should you pick? If you have 32GB+ RAM (like a MacBook Pro M2/M3/M4), go with qwen3.5:35b — it's the closest to Claude Sonnet quality for code. If you're on 16GB, gemma4:26b is excellent thanks to its MoE architecture (only 4B parameters are active at any time, so it runs fast despite the large model size). On 8GB, stick to qwen3.5:14b.
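
If you want that guidance as code, here's a hypothetical helper; the thresholds and model tags simply encode this guide's recommendations, nothing from Ollama itself:

```python
# Hypothetical helper that just encodes the RAM guidance from this guide.
# The thresholds and tags are this article's recommendations, not Ollama's.

def pick_model(ram_gb: int) -> str:
    """Suggest an Ollama model tag for the available RAM in GB."""
    if ram_gb >= 32:
        return "qwen3.5:35b"   # closest to Claude Sonnet quality
    if ram_gb >= 16:
        return "gemma4:26b"    # MoE, only ~4B active params
    if ram_gb >= 8:
        return "qwen3.5:14b"   # smaller but capable
    return "qwen3.5:7b"        # runs on almost anything

print(pick_model(36))  # qwen3.5:35b
print(pick_model(16))  # gemma4:26b
```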

Step 3: Configure Claude Code to Use Ollama

Claude Code reads its model configuration from environment variables. Set these before launching:

# Point Claude Code at your local Ollama instance
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export ANTHROPIC_API_KEY="ollama"  # Ollama doesn't need a real key
export CLAUDE_CODE_MODEL="qwen3.5:35b"  # Match the model you pulled

# Now launch Claude Code normally
claude

To make this permanent, add those exports to your ~/.zshrc or ~/.bashrc:

echo 'export ANTHROPIC_BASE_URL="http://localhost:11434/v1"' >> ~/.zshrc
echo 'export ANTHROPIC_API_KEY="ollama"' >> ~/.zshrc
echo 'export CLAUDE_CODE_MODEL="qwen3.5:35b"' >> ~/.zshrc
source ~/.zshrc

Step 4: Verify It Works

claude

You should see Claude Code launch normally. Try a simple prompt:

Create a Python function that calculates the Fibonacci 
sequence using dynamic programming. Include type hints 
and docstring.

If it generates code, reads files, and executes commands — you're running Claude Code for free.

Ollama Troubleshooting

| Problem | Solution |
| --- | --- |
| "Connection refused" | Run ollama serve in a separate terminal |
| Slow generation | Try a smaller model or check RAM usage with htop |
| Model crashes mid-generation | You're out of RAM — switch to a smaller model |
| "Model not found" | Run ollama list to see installed models; the name must match exactly |

Approach 2: OpenRouter — Cloud Models, Pennies Per Request

If your machine can't run local models (or you want frontier-quality output without the $15/MTok Opus price), OpenRouter is the play. It's a unified API that routes to 100+ models from different providers — many of them free or near-free.

Step 1: Get an OpenRouter API Key

  1. Go to openrouter.ai
  2. Create an account (free)
  3. Generate an API key from your dashboard
  4. Add credits — $5 will last weeks for most users

Step 2: Configure Claude Code

# Point Claude Code at OpenRouter
export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
export ANTHROPIC_API_KEY="sk-or-v1-your-key-here"

# Pick your model — here are the best options:
export CLAUDE_CODE_MODEL="qwen/qwen3.5-coder-next"    # Strong coder, ~$0.50/MTok
# export CLAUDE_CODE_MODEL="google/gemma-4-31b"        # Free tier available
# export CLAUDE_CODE_MODEL="deepseek/deepseek-v3.2"    # Great reasoning, ~$0.27/MTok
# export CLAUDE_CODE_MODEL="anthropic/claude-sonnet-4.5" # Full Claude, but cheaper than direct API

claude

Step 3: Pick the Right Model for Your Task

OpenRouter's strength is model selection. Match the model to the work:

| Task | Recommended Model | Cost/MTok (Input) | Why |
| --- | --- | --- | --- |
| Quick edits, scripting | qwen/qwen3.5:14b | Free | Fast, good enough for simple tasks |
| Feature development | qwen/qwen3.5-coder-next | ~$0.50 | Optimized for code, strong reasoning |
| Complex architecture | deepseek/deepseek-v3.2 | ~$0.27 | Excellent reasoning at low cost |
| Production-critical code | anthropic/claude-sonnet-4.5 | $3.00 | When quality matters most |
| Budget unlimited | google/gemma-4-31b | Free tier | Apache 2.0, solid all-around |

💡 The hybrid strategy: Use a cheap model (Qwen 3.5 or Gemma 4) for routine coding, file exploration, and test writing. Switch to Sonnet 4.5 via OpenRouter only when you need frontier reasoning — complex refactors, subtle bugs, architecture decisions. This drops your average cost by 80-90% compared to running Opus for everything.
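
Here's the blended-cost math as a sketch. The 90/10 split is an assumed ratio, and the cheap model's output price is also an assumption, since the table only quotes its input price:

```python
# Blended cost sketch: 90% of sessions on a cheap coder model, 10% on Sonnet.
# Session shape as before (50K in / 10K out). The cheap model's output price
# is an assumption; only its ~$0.50/MTok input price is quoted above.

def session_cost(inp: int, out: int, in_price: float, out_price: float) -> float:
    return (inp / 1e6) * in_price + (out / 1e6) * out_price

cheap = session_cost(50_000, 10_000, 0.50, 0.50)     # assumed symmetric pricing
sonnet = session_cost(50_000, 10_000, 3.00, 15.00)
opus = session_cost(50_000, 10_000, 15.00, 75.00)

blended = 0.9 * cheap + 0.1 * sonnet
print(f"blended: ${blended:.3f}/session")
print(f"vs all-Sonnet: {1 - blended / sonnet:.0%} cheaper")
print(f"vs all-Opus:   {1 - blended / opus:.0%} cheaper")
```

Under these assumed prices the blend comes out roughly 80% cheaper than running Sonnet for everything, and the savings against all-Opus are larger still.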

The Tradeoffs: What You Gain and What You Lose

Let's be honest about what you're giving up. This isn't a free lunch — it's a different lunch at a different price point.

What You Keep ✅

  • The Claude Code harness — file reading, code writing, git operations, shell commands, the entire agent workflow
  • Multi-file editing — Claude Code's ability to work across your whole project
  • CLAUDE.md and hooks — your project context and automation rules still work
  • Terminal UI — same interface, same commands, same muscle memory

What You Lose ❌

With Ollama (local models):

  • Raw intelligence drops. Qwen 3.5 35B is ~85% of Claude Sonnet on coding benchmarks. For complex multi-step reasoning, you'll notice the gap. The hidden cost of cheaper reasoning models is real — they make more subtle mistakes.
  • Context window shrinks. Most local models max out at 32K-128K tokens vs. Claude's 1M. For large codebases, this means Claude Code can't hold your entire project in context simultaneously.
  • Speed varies wildly. On an M4 Max, Qwen 3.5 35B runs at ~25 tok/s. On an older Intel MacBook, you might get 3-5 tok/s. Opus via API gives you ~80 tok/s consistently.
  • Your machine is busy. Running a 35B model uses 20-30GB of RAM and significant CPU/GPU. Don't expect to be running other heavy workloads simultaneously.
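
To put the tok/s figures above in wall-clock terms (the 2,000-token response length is an illustrative assumption):

```python
# Wall-clock time for a single 2,000-token response at the quoted speeds.
# The response length is an illustrative assumption.

response_tokens = 2_000
for setup, tok_per_s in [("Qwen 3.5 35B, M4 Max", 25),
                         ("Qwen 3.5 35B, older Intel", 4),
                         ("Opus via API", 80)]:
    print(f"{setup}: {response_tokens / tok_per_s:.0f}s")
# Qwen 3.5 35B, M4 Max: 80s
# Qwen 3.5 35B, older Intel: 500s
# Opus via API: 25s
```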

With OpenRouter:

  • Latency is higher. Requests route through OpenRouter's proxy, adding 100-500ms per request compared to direct API calls.
  • Free models have rate limits. The free tier on models like Gemma 4 restricts requests per minute. Heavy sessions will hit these.
  • Model availability isn't guaranteed. If a provider goes down, that model goes down with it. OpenRouter's routing helps, but it's not immune.

The Pro Setup: Switching Models on the Fly

Power users don't pick one approach. They set up aliases to switch between models depending on the task:

# Add to ~/.zshrc or ~/.bashrc

# Free local model — for exploration, simple tasks
alias claude-local='ANTHROPIC_BASE_URL="http://localhost:11434/v1" ANTHROPIC_API_KEY="ollama" CLAUDE_CODE_MODEL="qwen3.5:35b" claude'

# Cheap cloud model — for feature development
alias claude-cheap='ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1" ANTHROPIC_API_KEY="sk-or-v1-YOUR-KEY" CLAUDE_CODE_MODEL="qwen/qwen3.5-coder-next" claude'

# Full Claude Sonnet — when quality matters
alias claude-sonnet='ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1" ANTHROPIC_API_KEY="sk-or-v1-YOUR-KEY" CLAUDE_CODE_MODEL="anthropic/claude-sonnet-4.5" claude'

# Direct Anthropic API — when you need Opus
alias claude-opus='ANTHROPIC_API_KEY="sk-ant-YOUR-KEY" CLAUDE_CODE_MODEL="claude-opus-4-6" claude'

Now you can type claude-local for free coding sessions, claude-cheap for daily work, and claude-opus only when you're tackling something that genuinely needs frontier intelligence.

# Exploring a new codebase? Free.
claude-local

# Building a feature? Pennies.
claude-cheap

# Debugging a race condition in your distributed system? Worth paying for.
claude-opus

What the Community Is Building

The "free Claude Code" movement isn't just about cost savings — it's about resilience. When your workflow depends on a single provider's pricing decisions, you're one announcement away from a 10x cost increase. Today proved that.

This is a pattern we've seen before. Every time a closed provider tightens access, the open-source alternative gets a growth spike. The difference now is that open-source coding models are genuinely competitive — Gemma 4's 31B dense model ranked #3 on Arena AI's text leaderboard, and Qwen 3.5's coding variants are approaching Sonnet-level quality on SWE-bench.

📊 The open-source quality gap is closing fast: Qwen 3.6-Plus hits SWE-bench 78.8 (vs. Claude Opus 4.5's 80.9). Gemma 4 31B ranks #3 open model globally at ELO ~1452. DeepSeek V3.2 delivers strong reasoning at $0.27/MTok. Six months ago, the best open model scored ~65 on SWE-bench. The gap went from 25% to 3%.
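
Those gap percentages are straightforward arithmetic on the quoted scores, measured as Claude's lead relative to the open model's score:

```python
# The quoted gap figures, computed as Claude's lead relative to the open
# model's score (all scores as cited above).
opus_45 = 80.9
qwen_36_plus = 78.8
best_open_six_months_ago = 65.0

gap_now = (opus_45 - qwen_36_plus) / qwen_36_plus
gap_then = (opus_45 - best_open_six_months_ago) / best_open_six_months_ago
print(f"now: {gap_now:.1%}, six months ago: {gap_then:.1%}")
# now: 2.7%, six months ago: 24.5%
```

Rounded, those are the ~3% and ~25% figures cited above.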

Which Approach Should You Pick?

Pick Ollama if:

  • You have 16GB+ RAM (32GB ideal)
  • Privacy matters — your code never leaves your machine
  • You do mostly routine coding (CRUD, scripts, tests, frontend)
  • You want zero ongoing costs
  • You're comfortable with ~85% of Claude's quality for most tasks

Pick OpenRouter if:

  • Your machine can't run large models (8GB laptop, Chromebook)
  • You want access to multiple model providers through one API
  • You need near-frontier quality but can't justify Opus pricing
  • You want the flexibility to switch models per task
  • You're OK with $5-25/month instead of $0

Pick both if:

  • You're a power user who wants the alias-switching setup above
  • Use local models for exploration and simple tasks (free)
  • Route to cloud models for complex work (cheap)
  • Only pay full Anthropic API rates for genuinely hard problems (rare)

The Bigger Picture

Today's announcement is a business decision, not a technical one. Anthropic is profitable on API usage and losing money on Max subscribers who use third-party tools heavily. The subsidy had to end.

But the unintended consequence is acceleration. Every developer who sets up Ollama today is one more developer who knows how to run local models. Every OpenRouter account created this week is one more developer who understands model routing and cost optimization. The lock-in weakens with every migration guide that gets published.

Claude Code as a harness is still excellent — arguably the best agent framework available. But the model powering it? That's now a commodity. Compare the options, pick the right tool for each task, and don't pay $15/MTok for work that a $0 local model handles just fine.

The 99% cost reduction is real. The tradeoffs are real too. Now you know both sides.


Running Claude Code with alternative models and want to share your setup? We're collecting community configurations — reach out via our GitHub.
