My API bill last month had a line I couldn't ignore.
Not the expensive reasoning tasks — those I expected. It was the small stuff. The "what does this error mean" questions. The quick refactors. The five-line test I asked Claude Code to write at 11pm. A thousand tiny requests, all billed like they mattered.
Meanwhile, I had Ollama running on my machine with qwen2.5-coder loaded. Fast. Free. Already sitting there.
The problem was that my CLI tools had no idea it existed.
The Wiring Problem
Claude Code speaks Anthropic's protocol. Codex CLI speaks OpenAI's. Gemini CLI speaks Google's. And Ollama? It speaks its own thing — but it also exposes an OpenAI-compatible endpoint at http://localhost:11434.
So the question isn't "can Ollama do this" — it clearly can. The question is: how do you get your tools to talk to it without rewriting your entire config every time you switch between local and cloud?
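To see that compatibility concretely: the exact payload you'd send to OpenAI's chat-completions endpoint works against Ollama, only the base URL changes. A minimal sketch (model name assumes the one loaded earlier in this post):

```typescript
// A standard OpenAI chat-completions payload. Ollama accepts the same shape
// on its /v1/chat/completions endpoint -- no Ollama-specific fields needed.
const payload = {
  model: "qwen2.5-coder:7b", // whatever `ollama run` loaded
  messages: [{ role: "user", content: "What does ECONNREFUSED mean?" }],
  stream: false,
};

// Same call shape you would aim at api.openai.com, different base URL.
async function askOllama(): Promise<string> {
  const res = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

That sameness is what makes a proxy-based answer possible at all.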
That's what I spent the last week solving, and I've now shipped it as part of CliGate.
How It Works
CliGate is a local proxy that already handles routing Claude Code, Codex CLI, and Gemini CLI to cloud providers. The new local model support adds Ollama as a first-class routing target alongside OpenAI, Anthropic, and Google.
When local model routing is enabled, CliGate intercepts requests from your CLI tools and — depending on your config — sends them to Ollama instead of the cloud. Protocol translation happens in the proxy layer: Claude Code's Anthropic-formatted request gets adapted to whatever Ollama expects, the response gets adapted back.
Your tool never knows the difference.
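I don't know CliGate's internals, but the core of that adaptation step can be sketched as a pure mapping between the two request shapes. Field names below follow the public Anthropic and OpenAI APIs; the function name is mine:

```typescript
type AnthropicRequest = {
  model: string;
  system?: string;
  max_tokens: number;
  messages: { role: "user" | "assistant"; content: string }[];
};

// Map an Anthropic /v1/messages body onto an OpenAI /v1/chat/completions body.
// Anthropic keeps the system prompt in a top-level field; OpenAI expects it
// as the first message. The model name is swapped for the local one.
function anthropicToOpenAI(req: AnthropicRequest, localModel: string) {
  const messages = req.system
    ? [{ role: "system", content: req.system }, ...req.messages]
    : [...req.messages];
  return { model: localModel, max_tokens: req.max_tokens, messages };
}
```

The reverse direction (Ollama's response back into Anthropic's response envelope) is the mirror image of the same idea.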
The 3-Minute Setup
Step 1 — Make sure Ollama is running with a model
ollama run qwen2.5-coder:7b
Or any model you prefer. CliGate auto-discovers whatever's loaded.
# Verify Ollama is accessible
curl http://localhost:11434/api/version
# {"version":"0.6.x"}
Step 2 — Start CliGate
npx cligate@latest start
Dashboard opens at http://localhost:8081.
Step 3 — Add your Ollama instance
Go to Settings → Local Models. Add your Ollama URL:
http://localhost:11434
CliGate runs a health check and then fetches your model list via /v1/models. You'll see your loaded models appear automatically — no manual entry.
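Since /v1/models follows the OpenAI list format, discovery reduces to one GET plus a small parse. A sketch of what that looks like (helper names are mine, not CliGate's):

```typescript
// OpenAI-style model list: { "object": "list", "data": [{ "id": "..." }, ...] }
type ModelList = { data: { id: string }[] };

function modelIds(list: ModelList): string[] {
  return list.data.map((m) => m.id);
}

// Discovery against a local Ollama instance.
async function discoverModels(baseUrl = "http://localhost:11434"): Promise<string[]> {
  const res = await fetch(`${baseUrl}/v1/models`);
  return modelIds((await res.json()) as ModelList);
}
```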
Step 4 — Enable local routing
Toggle on "Local Model Routing". At this point, any request that would normally go to a cloud provider will check local models first.
You can also configure this per-app. For example:
- Claude Code → qwen2.5-coder:7b (your local coding model)
- Codex CLI → cloud (when you need the full thing)
- Gemini CLI → cloud
That's it. No ANTHROPIC_BASE_URL juggling. No re-exporting env vars. One dashboard toggle.
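Conceptually, per-app routing is just a mapping from tool to target. I don't know CliGate's actual config shape, but the idea looks something like this (names and structure hypothetical):

```typescript
type Target = { kind: "local"; model: string } | { kind: "cloud" };

// Hypothetical routing table mirroring the per-app choices above.
const routes: Record<string, Target> = {
  "claude-code": { kind: "local", model: "qwen2.5-coder:7b" },
  "codex-cli": { kind: "cloud" },
  "gemini-cli": { kind: "cloud" },
};

function resolveTarget(app: string): Target {
  // Unknown apps fall back to cloud, the pre-existing behavior.
  return routes[app] ?? { kind: "cloud" };
}
```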
Step 5 — Test it
Go to the Chat tab, pick "Local Model" as the source, and send a message. If it comes back, the routing is working. Then go to your terminal and use Claude Code normally — the proxy handles the rest.
# Claude Code is already pointed at CliGate from the one-click setup
claude "explain what this function does"
# → routes to your local Ollama model
The Part That Surprised Me
I expected the basic routing to be the hard part. It wasn't.
The interesting problem was streaming. Claude Code expects streaming responses in Anthropic's SSE format. Ollama streams in its own format. Getting those two to handshake correctly without garbling the output took longer than everything else combined.
The solution is a dedicated SSE bridge in the proxy layer that reads Ollama's stream chunk-by-chunk and re-emits it in the format the requesting tool expects. Claude Code sees a normal Anthropic streaming response. It never touches Ollama directly.
Claude Code
└─→ POST /v1/messages (Anthropic format, streaming)
└─→ CliGate proxy
└─→ detects: local routing enabled
└─→ sends to Ollama /v1/chat/completions
└─→ re-streams response as Anthropic SSE
←─ Claude Code receives: normal streaming response
Same pattern for Codex CLI (OpenAI Responses format) and any other tool you route through the proxy.
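The bridge is essentially a per-chunk rewrite. Here's a simplified sketch of one direction, turning an OpenAI-style streaming delta (what Ollama's compatible endpoint emits) into the Anthropic content_block_delta event Claude Code expects. A real Anthropic stream also carries message_start/message_stop framing, which is omitted here:

```typescript
// One chunk from an OpenAI-compatible stream:
//   data: {"choices":[{"delta":{"content":"Hel"}}]}
type OpenAIChunk = { choices: { delta: { content?: string } }[] };

// Re-emit it as an Anthropic SSE event:
//   event: content_block_delta
//   data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hel"}}
function toAnthropicSSE(chunk: OpenAIChunk): string | null {
  const text = chunk.choices[0]?.delta.content;
  if (!text) return null; // role-only or empty chunks emit nothing
  const event = {
    type: "content_block_delta",
    index: 0,
    delta: { type: "text_delta", text },
  };
  return `event: content_block_delta\ndata: ${JSON.stringify(event)}\n\n`;
}
```

The tricky part in practice isn't the mapping itself, it's doing this incrementally on partial network reads without buffering the whole response, which is where most of the "garbled output" bugs live.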
What This Is Actually Good For
I'm not suggesting you replace GPT-4 or Claude Sonnet with a local 7B model. There's a real capability difference.
But a lot of what I actually use Claude Code for in a normal day doesn't need the best model:
- "What does this stacktrace mean?"
- "Generate a unit test for this function"
- "Rename these variables to be more descriptive"
- "Does this SQL query look right?"
For tasks like these, qwen2.5-coder:7b is fast, accurate enough, and free. Saving the cloud calls for the harder problems — complex refactors, architecture questions, multi-file changes — drops my monthly API bill significantly without changing my workflow.
The toggle in CliGate makes it easy to switch back when you need to.
What's Your Local Model Setup?
Are you running Ollama (or LM Studio, or anything else) for coding tasks? I'm curious what models people are finding useful for day-to-day dev work — especially anything that runs well on a laptop.
GitHub: github.com/codeking-ai/cligate
npx cligate@latest start
Top comments (9)
oh this is the exact thing I wish I'd figured out 6 months ago. the local ollama + Claude Code combo is weirdly underrated — you get the Claude Code ergonomics/agent loop but keep context on-device for the stuff you don't want sitting in cloud logs.
one tip: after you set this up, make a CLAUDE.md at repo root with your local model's quirks (context window, tool-calling reliability, latency). it's different from the cloud model, so the agent needs to know. I keep these CLAUDE.md templates in tokrepo.com and copy-paste them into new projects — saves me rediscovering the same workarounds every time.

also curious — are you running this in the CC terminal or through the SDK? the streaming behavior is different depending on which path you take and it matters for longer loops.
The CLAUDE.md tip is genuinely underrated. Local models have different context limits, tool-calling reliability, and latency than cloud models — the agent behaves better when it knows the constraints upfront. I've been doing this manually per project; templates would save a lot of rediscovery.

Running through the CC terminal, not the SDK directly. Spent some time getting SSE chunk forwarding right through the proxy — if you don't handle that correctly, the agent loop gets sluggish on longer tasks. The terminal gives you the interactive feedback that makes the agent feel natural; the SDK path is cleaner for programmatic use.
The per-app routing config is a smart middle ground. One thing I'd push further: within the same app, the gap between "rename this variable" and "redesign this module" is huge. Have you thought about request-level classification, like estimating complexity from prompt length or keyword patterns, to auto-switch between local and cloud on a per-request basis?
You're right that per-app routing is a coarse approximation. Per-request classification is more powerful but harder to get right — prompt length is a decent proxy but misses context (a short "refactor this function" can be surprisingly hard if it touches 10 files).
One heuristic I've been thinking about: context token count as the routing signal. If the request includes more than N context tokens, route to cloud. Simpler than NLP classification, and it correlates reasonably well with actual task complexity.
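As a sketch, with a crude chars/4 token estimate (the threshold is arbitrary, and this is a thought experiment rather than anything shipped):

```typescript
// Rough token estimate: ~4 characters per token for English text and code.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Route on total context size: large contexts go to cloud, small ones stay local.
function chooseRoute(
  prompt: string,
  contextFiles: string[],
  threshold = 4000
): "local" | "cloud" {
  const total =
    estimateTokens(prompt) +
    contextFiles.reduce((sum, f) => sum + estimateTokens(f), 0);
  return total > threshold ? "cloud" : "local";
}
```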
Curious what signals you'd trust most in practice — keyword patterns or something else?
The 3-minute setup is great, but what I find most interesting here is the shift toward local-first AI workflows. We have been running local LLMs on edge devices (Raspberry Pi, Jetson) for a robotics project, and the latency difference between a local 7B model vs. API call can make or break real-time decision making.
One thing that bit us: model loading time on resource-constrained devices. On a Pi 5 with 8GB RAM, just getting the model into memory takes longer than the actual inference. Curious if you have hit any similar constraints with Ollama — like cold start time or memory pressure when running other services alongside it?
The Claude Code + Ollama combo is clever though. Basically turning your editor into an AI pair programmer with full privacy. That is the kind of setup that actually changes how you work day to day.
Cold start is definitely noticeable on a laptop too — Ollama takes 5–15s to load a 7B model into memory from cold. The trick I use: send a dummy request to Ollama on startup so the model stays warm. Once loaded, inference latency is fine for interactive use.
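For reference, the warm-up ping can use Ollama's native /api/generate endpoint: calling it with no prompt just loads the model, and keep_alive controls how long it stays resident (both are documented Ollama behaviors; the helper name is mine):

```typescript
// Warm-up request body: no prompt means "just load the model into memory".
// keep_alive: "30m" keeps it resident for 30 minutes after last use.
const warmup = {
  model: "qwen2.5-coder:7b",
  keep_alive: "30m",
};

async function keepWarm(baseUrl = "http://localhost:11434"): Promise<void> {
  await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(warmup),
  });
}
```

Run it once on startup (or on a timer shorter than keep_alive) and the first real request skips the cold load.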
For Pi-class hardware, memory is the real ceiling. You'd probably want a smaller quantized variant — a Q4_K_M of a 3B model sits around 2GB and still handles code tasks well. CliGate itself is just a lightweight Node.js proxy, so it adds almost no overhead on top of what Ollama already needs.

The privacy angle you mentioned is actually one of the main reasons people go this route. Claude Code ergonomics with nothing leaving the machine — that's a real value prop for certain codebases.