# Run Claude Code on Apple Silicon for free (Ollama + MLX local setup)
If you're on an M1/M2/M3 Mac, you're sitting on a GPU that can run serious AI models locally. With Ollama and MLX, you can run Claude-compatible models without paying for cloud APIs.
Here's the exact setup I use.
## Why local models + Claude Code?
Claude Code reads ANTHROPIC_BASE_URL, which means you can point it at a self-hosted endpoint instead of Anthropic's servers. (One caveat: Claude Code speaks the Anthropic Messages API, while Ollama's /v1/ endpoint speaks the OpenAI format, so depending on your Claude Code version you may need a small translation proxy such as LiteLLM in between.) Local Ollama + MLX gives you:
- Zero per-token cost for exploratory coding
- Complete privacy (nothing leaves your machine)
- No rate limits
- Works offline
For production-quality work where you need the real Claude 3.5 Sonnet, I switch to a flat-rate proxy at simplylouie.com/developers — $2/month, no per-token billing.
## Step 1: Install Ollama

Ollama runs models through llama.cpp's Metal backend out of the box; the MLX server comes in Step 4:

```shell
# Install Ollama
brew install ollama

# Start the server
ollama serve
```
In a new terminal:
```shell
# Pull a coding-optimized model
ollama pull qwen2.5-coder:7b

# Test it
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Write a Python function to parse JSON"
}'
```
## Step 2: Verify the OpenAI-compatible endpoint

Ollama exposes an OpenAI-compatible API under /v1/ by default; this is the endpoint we'll point Claude Code at:

```shell
# Test the OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
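The same request works from plain Python, too. A minimal sketch using only the standard library; `chat_payload` is a hypothetical helper of my own, not part of any SDK:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions body for Ollama's /v1 endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # single JSON response instead of a token stream
    }

if __name__ == "__main__":
    body = json.dumps(chat_payload("qwen2.5-coder:7b", "Hello")).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
```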
## Step 3: Point Claude Code at local Ollama

Create or edit ~/.claude/settings.json:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434/v1",
    "ANTHROPIC_API_KEY": "ollama"
  },
  "model": "qwen2.5-coder:7b"
}
```
Now launch Claude Code and it will hit your local Ollama instead of Anthropic's servers:

```shell
claude
```
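If you switch endpoints often, editing the file by hand gets tedious. A small sketch that merges the env block into settings.json without clobbering other keys; `write_claude_settings` is my own helper, not part of Claude Code:

```python
import json
from pathlib import Path

def write_claude_settings(path, base_url, api_key, model=None):
    """Merge ANTHROPIC_* overrides into a Claude Code settings.json,
    preserving any other keys already in the file."""
    p = Path(path)
    settings = json.loads(p.read_text()) if p.exists() else {}
    settings.setdefault("env", {}).update({
        "ANTHROPIC_BASE_URL": base_url,
        "ANTHROPIC_API_KEY": api_key,
    })
    if model:
        settings["model"] = model
    p.write_text(json.dumps(settings, indent=2) + "\n")

# Example: point at local Ollama
# write_claude_settings(Path.home() / ".claude" / "settings.json",
#                       "http://localhost:11434/v1", "ollama",
#                       model="qwen2.5-coder:7b")
```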
## Step 4: MLX acceleration for Apple Silicon

On M1/M2/M3 Macs, MLX's dedicated server can be noticeably faster than Ollama's default backend:

```shell
# Install MLX LM
pip install mlx-lm

# Serve a model with MLX (OpenAI-compatible API)
mlx_lm.server \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
  --port 8080
```
Then update your Claude Code settings to point at the MLX server:
```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8080/v1",
    "ANTHROPIC_API_KEY": "local"
  }
}
```
## Benchmark: local vs cloud
On an M2 Pro (32GB):
| Model | Tokens/sec | Quality |
|---|---|---|
| Qwen2.5-Coder 7B (MLX) | ~45 tok/s | Good for boilerplate |
| Qwen2.5-Coder 14B (MLX) | ~22 tok/s | Solid for most tasks |
| Claude 3.5 Sonnet (cloud) | ~80 tok/s | Best for complex reasoning |
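You can reproduce the local numbers yourself: with `"stream": false`, Ollama's /api/generate response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), and throughput falls out directly. A sketch:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from Ollama's generation metrics:
    eval_count tokens produced over eval_duration_ns nanoseconds."""
    return eval_count / (eval_duration_ns / 1_000_000_000)

# e.g. 450 tokens generated in 10 seconds:
# tokens_per_second(450, 10_000_000_000) -> 45.0
```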
## The hybrid workflow I actually use
Local Ollama for:
- Autocomplete and quick fixes
- Exploratory refactoring
- Documentation generation
- Private/sensitive code
Real Claude (via flat-rate API) for:
- Architecture decisions
- Complex debugging
- Code review
- Anything I need to ship
For the cloud piece, I use simplylouie.com/developers — it's $2/month flat rate (vs $20/month for Claude Pro or per-token billing on the official API). Works identically with ANTHROPIC_BASE_URL.
## CLAUDE.md for local model optimization
Local models are more verbose than Claude — they love over-explaining. Add this to your CLAUDE.md:
```markdown
## Response style
- Code only, minimal explanation
- No preamble ("Certainly!" "Of course!" = instant fail)
- No summary after code blocks
- If I asked for a function, give me the function
```
In my experience this alone cuts output by roughly 40% and makes local models genuinely usable.
## Switching between local and cloud
I use a simple shell alias:
```shell
# ~/.zshrc
alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434/v1 ANTHROPIC_API_KEY=ollama claude'
alias claude-cloud='ANTHROPIC_BASE_URL=https://api.simplylouie.com ANTHROPIC_API_KEY=your_key claude'
```
Local for exploration, cloud for shipping.
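The same switch can live in a tiny launcher instead of two aliases. A sketch; the URLs and the `your_key` placeholder mirror the aliases above:

```python
import os
import subprocess

PROFILES = {
    "local": {"ANTHROPIC_BASE_URL": "http://localhost:11434/v1",
              "ANTHROPIC_API_KEY": "ollama"},
    "cloud": {"ANTHROPIC_BASE_URL": "https://api.simplylouie.com",
              "ANTHROPIC_API_KEY": "your_key"},  # placeholder, not a real key
}

def claude_env(profile: str) -> dict:
    """Full environment for launching `claude` against the chosen backend."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return {**os.environ, **PROFILES[profile]}

if __name__ == "__main__":
    import sys
    profile = sys.argv[1] if len(sys.argv) > 1 else "local"
    subprocess.run(["claude"], env=claude_env(profile))
```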
## Models worth trying
```shell
# Best for coding (7B, fast on M1)
ollama pull qwen2.5-coder:7b

# Better quality (14B, needs 16GB+ RAM)
ollama pull qwen2.5-coder:14b

# DeepSeek Coder alternative
ollama pull deepseek-coder-v2:16b

# Code Llama (Meta's dedicated coding model)
ollama pull codellama:13b
```
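The RAM notes in the comments above can be folded into a quick picker. A rough heuristic of my own, not an official sizing guide:

```python
def pick_model(ram_gb: int) -> str:
    """Suggest an Ollama coding model for a given amount of unified memory.
    Threshold is my rule of thumb from the list above (14B wants 16GB+)."""
    if ram_gb >= 16:
        return "qwen2.5-coder:14b"
    return "qwen2.5-coder:7b"
```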
## Conclusion
Apple Silicon makes local AI genuinely useful for the first time. The Ollama + MLX combo on M-series Macs is fast enough for real work.
But local models aren't a full replacement for Claude 3.5 Sonnet — they lack the reasoning depth for hard problems. The hybrid approach (local for speed/privacy + real Claude for quality) is what I actually ship with.
If you want Claude at flat-rate pricing without per-token anxiety: simplylouie.com/developers
What local models are you running on Apple Silicon? Drop your setup in the comments.