My API bill last month had a line I couldn't ignore.
Not the expensive reasoning tasks — those I expected. It was the small stuff. The "what does this error mean" questions. The quick refactors. The five-line test I asked Claude Code to write at 11pm. A thousand tiny requests, all billed like they mattered.
Meanwhile, I had Ollama running on my machine with qwen2.5-coder loaded. Fast. Free. Already sitting there.
The problem was that my CLI tools had no idea it existed.
The Wiring Problem
Claude Code speaks Anthropic's protocol. Codex CLI speaks OpenAI's. Gemini CLI speaks Google's. And Ollama? It speaks its own thing — but it also exposes an OpenAI-compatible endpoint at http://localhost:11434.
So the question isn't "can Ollama do this" — it clearly can. The question is: how do you get your tools to talk to it without rewriting your entire config every time you switch between local and cloud?
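"OpenAI-compatible" is concrete here: the same chat-completions request body an OpenAI-style client would send to api.openai.com works unchanged against localhost:11434. A minimal sketch (the model name and prompt are just examples; this only builds and prints the request, it doesn't require Ollama to be running):

```python
import json

# Ollama's OpenAI-compatibility endpoint. Swap the base URL and an
# OpenAI-style client doesn't need any other changes (model name aside).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "qwen2.5-coder:7b",  # whatever `ollama run` has loaded
    "messages": [{"role": "user", "content": "What does ECONNREFUSED mean?"}],
    "stream": False,
}

body = json.dumps(payload)
print(OLLAMA_URL)
print(body)
```

POST that body to the URL above with any HTTP client and Ollama answers in the standard chat-completions response shape.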
That's what I spent the last week solving, and I've now shipped it as part of CliGate.
How It Works
CliGate is a local proxy that already handles routing Claude Code, Codex CLI, and Gemini CLI to cloud providers. The new local model support adds Ollama as a first-class routing target alongside OpenAI, Anthropic, and Google.
When local model routing is enabled, CliGate intercepts requests from your CLI tools and — depending on your config — sends them to Ollama instead of the cloud. Protocol translation happens in the proxy layer: Claude Code's Anthropic-formatted request gets adapted to whatever Ollama expects, the response gets adapted back.
Your tool never knows the difference.
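To make "adapted" less hand-wavy, here's an illustrative sketch of the request-side mapping, Anthropic's /v1/messages body to an OpenAI-style /v1/chat/completions body. This is my simplified reconstruction, not CliGate's actual code; a real mapping also has to cover tool use, image blocks, stop sequences, and so on:

```python
def anthropic_to_openai(req: dict) -> dict:
    """Adapt an Anthropic /v1/messages request body into an
    OpenAI-style /v1/chat/completions body (simplified sketch)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI-style APIs expect it as the first message in the list.
    if req.get("system"):
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req.get("messages", []))
    return {
        "model": req["model"],  # a proxy may remap this to a local model id
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
        "stream": req.get("stream", False),
    }

adapted = anthropic_to_openai({
    "model": "qwen2.5-coder:7b",
    "system": "You are a concise coding assistant.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Explain this stacktrace"}],
})
print(adapted["messages"][0]["role"])
```

The response travels the reverse path: the OpenAI-shaped reply gets repackaged into an Anthropic-shaped one before it reaches the CLI.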
The 3-Minute Setup
Step 1 — Make sure Ollama is running with a model
ollama run qwen2.5-coder:7b
Or any model you prefer. CliGate auto-discovers whatever's loaded.
# Verify Ollama is accessible
curl http://localhost:11434/api/version
# {"version":"0.6.x"}
Step 2 — Start CliGate
npx cligate@latest start
Dashboard opens at http://localhost:8081.
Step 3 — Add your Ollama instance
Go to Settings → Local Models. Add your Ollama URL:
http://localhost:11434
CliGate runs a health check and then fetches your model list via /v1/models. You'll see your loaded models appear automatically — no manual entry.
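Discovery is not magic: Ollama's /v1/models endpoint returns an OpenAI-style model list, and reading the ids out of it is all there is to it. A sketch with a sample response (the ids depend on what you've pulled locally):

```python
import json

# Sample of what Ollama's OpenAI-compatible /v1/models returns
# (trimmed; real entries also carry "created" timestamps).
sample = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "qwen2.5-coder:7b", "object": "model", "owned_by": "library"},
    {"id": "llama3.2:3b", "object": "model", "owned_by": "library"}
  ]
}
""")

# "Auto-discovery" = collect the ids from that list.
model_ids = [m["id"] for m in sample["data"]]
print(model_ids)
```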
Step 4 — Enable local routing
Toggle on "Local Model Routing". At this point, any request that would normally go to a cloud provider will check local models first.
You can also configure this per-app. For example:
- Claude Code → qwen2.5-coder:7b (your local coding model)
- Codex CLI → cloud (when you need the full thing)
- Gemini CLI → cloud

That's it. No ANTHROPIC_BASE_URL juggling. No re-exporting env vars. One dashboard toggle.
Step 5 — Test it
Go to the Chat tab, pick "Local Model" as the source, and send a message. If it comes back, the routing is working. Then go to your terminal and use Claude Code normally — the proxy handles the rest.
# Claude Code is already pointed at CliGate from the one-click setup
claude "explain what this function does"
# → routes to your local Ollama model
The Part That Surprised Me
I expected the basic routing to be the hard part. It wasn't.
The interesting problem was streaming. Claude Code expects streaming responses in Anthropic's SSE format. Ollama streams in its own format. Getting those two to handshake correctly without garbling the output took longer than everything else combined.
The solution is a dedicated SSE bridge in the proxy layer that reads Ollama's stream chunk-by-chunk and re-emits it in the format the requesting tool expects. Claude Code sees a normal Anthropic streaming response. It never touches Ollama directly.
Claude Code
└─→ POST /v1/messages (Anthropic format, streaming)
└─→ CliGate proxy
└─→ detects: local routing enabled
└─→ sends to Ollama /v1/chat/completions
└─→ re-streams response as Anthropic SSE
←─ Claude Code receives: normal streaming response
Same pattern for Codex CLI (OpenAI Responses format) and any other tool you route through the proxy.
What This Is Actually Good For
I'm not suggesting you replace GPT-4 or Claude Sonnet with a local 7B model. There's a real capability difference.
But a lot of what I actually use Claude Code for in a normal day doesn't need the best model:
- "What does this stacktrace mean?"
- "Generate a unit test for this function"
- "Rename these variables to be more descriptive"
- "Does this SQL query look right?"
For tasks like these, qwen2.5-coder:7b is fast, accurate enough, and free. Saving the cloud calls for the harder problems — complex refactors, architecture questions, multi-file changes — drops my monthly API bill significantly without changing my workflow.
The toggle in CliGate makes it easy to switch back when you need to.
What's Your Local Model Setup?
Are you running Ollama (or LM Studio, or anything else) for coding tasks? I'm curious what models people are finding useful for day-to-day dev work — especially anything that runs well on a laptop.
GitHub: github.com/codeking-ai/cligate
npx cligate@latest start