I've been running Claude Code locally with Ollama for about a week now, using Unsloth's guide as the setup reference. The setup works — but there are three things I learned the hard way that the guide only touches on, and I want to share them so you don't hit the same walls.
The KV Cache Issue Is Real and It's Brutal
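Unsloth's guide mentions this, but I didn't understand the magnitude until I measured it myself. Claude Code prepends an attribution header to every request, and with a local model that header invalidates the KV cache, so the prompt gets reprocessed from scratch each time. With the header enabled, my tasks took roughly 9x as long (put the other way, disabling it cut the time by about 90%). Not a little slower: effectively unusable for any real task.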
The fix is in ~/.claude/settings.json:
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
But here's what the guide doesn't tell you: this setting only works when it's in the JSON file. Trying to set it via environment variable (export CLAUDE_CODE_ATTRIBUTION_HEADER=0) does nothing. I spent two hours debugging why my inference was still slow before I found this in a GitHub issue.
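If you already have other settings in that file, you probably don't want to overwrite them by hand. Here is a minimal Python sketch that merges the key into an existing settings file and leaves everything else alone. The function name `disable_attribution_header` is my own, not part of Claude Code, and it assumes the file is plain JSON with an optional top-level "env" object as shown above.

```python
import json
from pathlib import Path


def disable_attribution_header(settings_path):
    """Set env.CLAUDE_CODE_ATTRIBUTION_HEADER to "0", keeping all other keys."""
    path = Path(settings_path).expanduser()
    # Treat a missing or empty file as an empty settings object.
    text = path.read_text() if path.exists() else ""
    settings = json.loads(text) if text.strip() else {}
    # Create the "env" object if it is absent, then set only the one key.
    env = settings.setdefault("env", {})
    env["CLAUDE_CODE_ATTRIBUTION_HEADER"] = "0"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Calling `disable_attribution_header("~/.claude/settings.json")` would apply it to the file mentioned above. Back the file up first if it matters to you.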
After disabling the header, my inference times went from ~45 seconds per task to ~5 seconds. The difference between "this is fine" and "I'm going to switch back to the API."
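If you want to check the numbers on your own machine, a rough sketch like the one below is enough. Both helper names are mine, and `task` is a placeholder for whatever callable triggers one of your own local-model tasks. The arithmetic at the end is just the 45 s and 5 s figures from above.

```python
import statistics
import time


def median_seconds(task, runs=3):
    """Run `task` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(before_s, after_s):
    """Express a before/after timing pair as a speedup factor and a percent saved."""
    return {
        "speedup": before_s / after_s,
        "time_saved_pct": 100 * (1 - after_s / before_s),
    }


# The timings from this post: ~45 s with the header, ~5 s without it.
result = compare(45, 5)
```

For those two timings, `result["speedup"]` is 9.0 and the time saved is just under 89%, which is where the "about 90%" figure comes from.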
Context Length Is the Real Constraint
When you're running a local model, you're not just running a model — you're running a model with whatever context window your hardware can support without hallucinating or choking.
With Qwen3.5 (8B) on 24GB of unified memory, I get about 32K context before performance degrades noticeably. With Gemma 4 (9B), it's closer to 24K.
For most coding tasks, that's fine. But if you're working on a large codebase and want the model to take in all of a 50K-line project, which runs to far more tokens than a 32K window can hold, local models hit a ceiling much sooner than cloud APIs do.
The practical implication: you need to think about what your "unit of work" is for a local model session. A single file, a single module, a single PR review. Not the entire repo.
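One way to sanity-check a "unit of work" before starting a session is a quick token estimate. This is a sketch under loud assumptions: the 4-characters-per-token ratio is a rough heuristic (real tokenizers vary by model and by language), and the 8192-token `reserve` for instructions and output is a number I picked for illustration, not something Claude Code or Ollama defines.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers vary


def estimate_tokens(text):
    """Very rough token count derived from character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(texts, num_ctx=32768, reserve=8192):
    """Return (fits, used): whether the texts fit while leaving `reserve` tokens free.

    `reserve` is a hypothetical allowance for the system prompt, tool
    definitions, and the model's own output.
    """
    used = sum(estimate_tokens(t) for t in texts)
    return used <= num_ctx - reserve, used
```

Feeding it the contents of the files you plan to touch gives a quick yes/no on whether the session is likely to stay inside the window. Treat the answer as a ballpark, not a guarantee.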
The Ollama Config Matters More Than You Think
Here's my Ollama config (~/.ollama/config.json) that finally made things work properly:
{
  "gpu_overrides": [
    {
      "num_gpu": "auto"
    }
  ],
  "model_options": {
    "num_ctx": 32768
  }
}
The num_ctx setting is what controls context window size. Default is often 2048, which is useless for code. Set it explicitly.
Also: num_gpu: "auto" is important if you're on a Mac with unified memory — it lets Ollama automatically use GPU acceleration. Without it, I was running on CPU only, which was 3-4x slower.
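Separately from the config file, Ollama's REST API also accepts `num_ctx` per request under `options`, which is handy for checking that a given context size actually loads on your hardware. The sketch below is not how Claude Code talks to Ollama. It is just a direct call to the `/api/generate` endpoint on Ollama's default port, and the model tag you pass is whatever you pulled locally (the one in the usage note is a placeholder).

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model, prompt, num_ctx=32768):
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_ctx=32768):
    """Send the request to a running Ollama server and return the response text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running, something like `generate("your-model-tag", "Say hi")` should return text. If the model fails to load or slows to a crawl at 32768, step `num_ctx` down until it is comfortable.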
The Setup I Actually Use Now
After a week of trial and error, here's what works for me:
- Model: Qwen3.5 (8B) via Ollama. Good agentic and coding performance, runs on 24GB unified memory
- Interface: Claude Code pointed at Ollama, with the attribution header disabled
- Context window: 32K (explicitly set in Ollama config)
- Use case: single-file refactors, dependency updates, code review on specific modules, small feature implementations
For multi-file tasks that need broader repo context, I still use the API. For narrowly scoped tasks, a local model works well and has zero per-token cost.
The Real Question: Is It Worth It?
For me, the answer is yes, but with a narrow use case. If you:
- Have a machine that can run 8-9B models at acceptable speed
- Have tasks that fit within 24-32K context
- Care about data privacy (code never leaves your machine)
- Do enough of these tasks that the API cost adds up
Then local models are worth setting up. If your tasks regularly need broader context, the setup overhead isn't worth the frustration.
Setup: Mac Mini M4 Pro, 24GB unified memory, Ollama 0.5, Claude Code with attribution header disabled. Model: Qwen3.5 (8B) via Ollama.
Unsloth guide reference: unsloth.ai/docs/basics/claude-code