DEV Community

brian austin


Run Claude Code on Apple Silicon for free (Ollama + MLX local setup)


If you're on an M1/M2/M3 Mac, you're sitting on a GPU that can run serious AI models locally. With Ollama and MLX, you can run Claude-compatible models without paying for cloud APIs.

Here's the exact setup I use.

Why local models + Claude Code?

Claude Code reads ANTHROPIC_BASE_URL, which means you can point it at any endpoint that speaks a compatible API instead of Anthropic's servers. Local Ollama + MLX gives you:

  • Zero per-token cost for exploratory coding
  • Complete privacy (nothing leaves your machine)
  • No rate limits
  • Works offline

For production-quality work where you need the real Claude 3.5 Sonnet, I switch to a flat-rate proxy at simplylouie.com/developers ($2/month, no per-token billing).

Step 1: Install Ollama with MLX backend

# Install Ollama
brew install ollama

# Start the server
ollama serve

In a new terminal:

# Pull a coding-optimized model
ollama pull qwen2.5-coder:7b

# Test it
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Write a Python function to parse JSON"
}'
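By default, /api/generate streams its answer back as newline-delimited JSON, one chunk per line, each with a "response" fragment and a final "done": true object. A minimal sketch of stitching that stream back into a single string (field names follow Ollama's documented response format; the sample chunks are illustrative):

```python
import json

def join_ollama_stream(raw: str) -> str:
    """Concatenate the 'response' fields of an Ollama NDJSON stream."""
    parts = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk carries timing stats and done=true
            break
    return "".join(parts)

# Example: three chunks shaped like Ollama's streaming output
sample = "\n".join([
    '{"model":"qwen2.5-coder:7b","response":"def ","done":false}',
    '{"model":"qwen2.5-coder:7b","response":"parse","done":false}',
    '{"model":"qwen2.5-coder:7b","response":"_json","done":true}',
])
print(join_ollama_stream(sample))  # def parse_json
```

Pass `"stream": false` in the request body if you'd rather get one JSON object back and skip the stitching.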

Step 2: Enable the OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible API under /v1/, and that compatibility layer is what Claude Code talks to:

# Test the OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Step 3: Point Claude Code at local Ollama

Create or edit ~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434/v1",
    "ANTHROPIC_API_KEY": "ollama"
  },
  "model": "qwen2.5-coder:7b"
}

Now launch Claude Code — it'll hit your local Ollama instead of Anthropic's servers:

claude

Step 4: MLX acceleration for Apple Silicon

For M1/M2/M3 Macs, MLX gives you significant speed improvements:

# Install MLX LM
pip install mlx-lm

# Run a model with MLX
mlx_lm.server \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
  --port 8080

Then update your Claude Code settings to point at the MLX server:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8080/v1",
    "ANTHROPIC_API_KEY": "local"
  }
}

Benchmark: local vs cloud

On an M2 Pro (32GB):

Model                        Tokens/sec    Quality
Qwen2.5-Coder 7B (MLX)       ~45 tok/s     Good for boilerplate
Qwen2.5-Coder 14B (MLX)      ~22 tok/s     Solid for most tasks
Claude 3.5 Sonnet (cloud)    ~80 tok/s     Best for complex reasoning
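Tokens/sec maps directly to how long you stare at a spinner: generation time is roughly tokens divided by throughput. A quick back-of-envelope for a typical 500-token completion at the rates above (treat my numbers as ballpark):

```python
def seconds_for(tokens: int, tok_per_sec: float) -> float:
    """Naive generation-time estimate: tokens / throughput."""
    return tokens / tok_per_sec

for name, rate in [("7B MLX", 45), ("14B MLX", 22), ("Sonnet cloud", 80)]:
    print(f"{name}: ~{seconds_for(500, rate):.1f}s for 500 tokens")
```

So the 7B model's ~11 seconds per answer is perfectly workable interactively, while the 14B model's ~23 seconds starts to feel slow for autocomplete-style use.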

The hybrid workflow I actually use

Local Ollama for:

  • Autocomplete and quick fixes
  • Exploratory refactoring
  • Documentation generation
  • Private/sensitive code

Real Claude (via flat-rate API) for:

  • Architecture decisions
  • Complex debugging
  • Code review
  • Anything I need to ship

For the cloud piece, I use simplylouie.com/developers — it's $2/month flat rate (vs $20/month for Claude Pro or per-token billing on the official API). Works identically with ANTHROPIC_BASE_URL.
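The split above is mechanical enough to encode. A hypothetical routing sketch (the task categories and the choose_backend name are mine, not part of Claude Code or any tool):

```python
# Hypothetical routing table mirroring the local-vs-cloud split above
LOCAL_TASKS = {"autocomplete", "quick-fix", "refactor", "docs", "private"}
CLOUD_TASKS = {"architecture", "debugging", "review", "ship"}

def choose_backend(task: str) -> str:
    """Return the base URL to export as ANTHROPIC_BASE_URL for a task type."""
    if task in CLOUD_TASKS:
        return "https://api.simplylouie.com"
    # Anything else defaults to the free local model
    return "http://localhost:11434/v1"

print(choose_backend("refactor"))  # http://localhost:11434/v1
print(choose_backend("review"))    # https://api.simplylouie.com
```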

CLAUDE.md for local model optimization

Local models are more verbose than Claude; they love over-explaining. Add this to your CLAUDE.md:

## Response style
- Code only, minimal explanation
- No preamble ("Certainly!" "Of course!" = instant fail)
- No summary after code blocks
- If I asked for a function, give me the function

This alone cuts output by ~40% and makes local models actually usable.

Switching between local and cloud

I use a simple shell alias:

# ~/.zshrc
alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434/v1 ANTHROPIC_API_KEY=ollama claude'
alias claude-cloud='ANTHROPIC_BASE_URL=https://api.simplylouie.com ANTHROPIC_API_KEY=your_key claude'

Local for exploration, cloud for shipping.

Models worth trying

# Best for coding (7B, fast on M1)
ollama pull qwen2.5-coder:7b

# Better quality (14B, needs 16GB+ RAM)
ollama pull qwen2.5-coder:14b

# Deepseek Coder alternative
ollama pull deepseek-coder-v2:16b

# Code Llama (Meta's dedicated coding model)
ollama pull codellama:13b
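A rough rule of thumb for whether a model fits your Mac's unified memory: weight size ≈ parameter count × bits-per-weight / 8, plus a few GB of headroom for the KV cache and the OS. A quick sanity check (the 2 GB overhead figure is my approximation, not a published number):

```python
def approx_ram_gb(params_billion: float, bits: int, overhead_gb: float = 2.0) -> float:
    """Rough memory footprint: quantized weights plus fixed overhead."""
    weights_gb = params_billion * bits / 8  # 1B params at 8-bit ≈ 1 GB
    return weights_gb + overhead_gb

print(f"7B  @ 4-bit: ~{approx_ram_gb(7, 4):.1f} GB")   # ~5.5 GB
print(f"14B @ 4-bit: ~{approx_ram_gb(14, 4):.1f} GB")  # ~9.0 GB
print(f"16B @ 4-bit: ~{approx_ram_gb(16, 4):.1f} GB")  # ~10.0 GB
```

That's why the 7B model runs comfortably on an 8GB base-model Mac while the 14B and 16B models want 16GB or more.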

Conclusion

Apple Silicon makes local AI genuinely useful for the first time. The Ollama + MLX combo on M-series Macs is fast enough for real work.

But local models aren't a full replacement for Claude 3.5 Sonnet — they lack the reasoning depth for hard problems. The hybrid approach (local for speed/privacy + real Claude for quality) is what I actually ship with.

If you want Claude at flat-rate pricing without per-token anxiety: simplylouie.com/developers


What local models are you running on Apple Silicon? Drop your setup in the comments.
