My API bill last month had a line I couldn't ignore.
Not the expensive reasoning tasks — those I expected. It was the small stuff. The "what does this error mean" questions. The quick refactors. The five-line test I asked Claude Code to write at 11pm. A thousand tiny requests, all billed like they mattered.
Meanwhile, I had Ollama running on my machine with qwen2.5-coder loaded. Fast. Free. Already sitting there.
The problem was that my CLI tools had no idea it existed.
The Wiring Problem
Claude Code speaks Anthropic's protocol. Codex CLI speaks OpenAI's. Gemini CLI speaks Google's. And Ollama? It speaks its own thing — but it also exposes an OpenAI-compatible endpoint at http://localhost:11434.
So the question isn't "can Ollama do this" — it clearly can. The question is: how do you get your tools to talk to it without rewriting your entire config every time you switch between local and cloud?
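To see that compatibility concretely: the exact payload you'd send to OpenAI's chat-completions endpoint works against Ollama, only the base URL changes. A minimal sketch (model name assumes the one loaded earlier in this post):

```typescript
// A standard OpenAI chat-completions payload. Ollama accepts the same shape
// on its /v1/chat/completions endpoint -- no Ollama-specific fields needed.
const payload = {
  model: "qwen2.5-coder:7b", // whatever `ollama run` loaded
  messages: [{ role: "user", content: "What does ECONNREFUSED mean?" }],
  stream: false,
};

// Same call shape you would aim at api.openai.com, different base URL.
async function askOllama(): Promise<string> {
  const res = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

That sameness is what makes a proxy-based answer possible at all.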
That's what I spent the last week solving, and I've now shipped it as part of CliGate.
How It Works
CliGate is a local proxy that already handles routing Claude Code, Codex CLI, and Gemini CLI to cloud providers. The new local model support adds Ollama as a first-class routing target alongside OpenAI, Anthropic, and Google.
When local model routing is enabled, CliGate intercepts requests from your CLI tools and — depending on your config — sends them to Ollama instead of the cloud. Protocol translation happens in the proxy layer: Claude Code's Anthropic-formatted request gets adapted to whatever Ollama expects, the response gets adapted back.
Your tool never knows the difference.
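I don't know CliGate's internals, but the core of that adaptation step can be sketched as a pure mapping between the two request shapes. Field names below follow the public Anthropic and OpenAI APIs; the function name is mine:

```typescript
type AnthropicRequest = {
  model: string;
  system?: string;
  max_tokens: number;
  messages: { role: "user" | "assistant"; content: string }[];
};

// Map an Anthropic /v1/messages body onto an OpenAI /v1/chat/completions body.
// Anthropic keeps the system prompt in a top-level field; OpenAI expects it
// as the first message. The model name is swapped for the local one.
function anthropicToOpenAI(req: AnthropicRequest, localModel: string) {
  const messages = req.system
    ? [{ role: "system", content: req.system }, ...req.messages]
    : [...req.messages];
  return { model: localModel, max_tokens: req.max_tokens, messages };
}
```

The reverse direction (Ollama's response back into Anthropic's response envelope) is the mirror image of the same idea.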
The 3-Minute Setup
Step 1 — Make sure Ollama is running with a model
ollama run qwen2.5-coder:7b
Or any model you prefer. CliGate auto-discovers whatever's loaded.
# Verify Ollama is accessible
curl http://localhost:11434/api/version
# {"version":"0.6.x"}
Step 2 — Start CliGate
npx cligate@latest start
Dashboard opens at http://localhost:8081.
Step 3 — Add your Ollama instance
Go to Settings → Local Models. Add your Ollama URL:
http://localhost:11434
CliGate runs a health check and then fetches your model list via /v1/models. You'll see your loaded models appear automatically — no manual entry.
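Since /v1/models follows the OpenAI list format, discovery reduces to one GET plus a small parse. A sketch of what that looks like (helper names are mine, not CliGate's):

```typescript
// OpenAI-style model list: { "object": "list", "data": [{ "id": "..." }, ...] }
type ModelList = { data: { id: string }[] };

function modelIds(list: ModelList): string[] {
  return list.data.map((m) => m.id);
}

// Discovery against a local Ollama instance.
async function discoverModels(baseUrl = "http://localhost:11434"): Promise<string[]> {
  const res = await fetch(`${baseUrl}/v1/models`);
  return modelIds((await res.json()) as ModelList);
}
```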
Step 4 — Enable local routing
Toggle on "Local Model Routing". At this point, any request that would normally go to a cloud provider will check local models first.
You can also configure this per-app. For example:
- Claude Code → qwen2.5-coder:7b (your local coding model)
- Codex CLI → cloud (when you need the full thing)
- Gemini CLI → cloud
That's it. No ANTHROPIC_BASE_URL juggling. No re-exporting env vars. One dashboard toggle.
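Conceptually, per-app routing is just a mapping from tool to target. I don't know CliGate's actual config shape, but the idea looks something like this (names and structure hypothetical):

```typescript
type Target = { kind: "local"; model: string } | { kind: "cloud" };

// Hypothetical routing table mirroring the per-app choices above.
const routes: Record<string, Target> = {
  "claude-code": { kind: "local", model: "qwen2.5-coder:7b" },
  "codex-cli": { kind: "cloud" },
  "gemini-cli": { kind: "cloud" },
};

function resolveTarget(app: string): Target {
  // Unknown apps fall back to cloud, the pre-existing behavior.
  return routes[app] ?? { kind: "cloud" };
}
```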
Step 5 — Test it
Go to the Chat tab, pick "Local Model" as the source, and send a message. If it comes back, the routing is working. Then go to your terminal and use Claude Code normally — the proxy handles the rest.
# Claude Code is already pointed at CliGate from the one-click setup
claude "explain what this function does"
# → routes to your local Ollama model
The Part That Surprised Me
I expected the basic routing to be the hard part. It wasn't.
The interesting problem was streaming. Claude Code expects streaming responses in Anthropic's SSE format. Ollama streams in its own format. Getting those two to handshake correctly without garbling the output took longer than everything else combined.
The solution is a dedicated SSE bridge in the proxy layer that reads Ollama's stream chunk-by-chunk and re-emits it in the format the requesting tool expects. Claude Code sees a normal Anthropic streaming response. It never touches Ollama directly.
Claude Code
└─→ POST /v1/messages (Anthropic format, streaming)
└─→ CliGate proxy
└─→ detects: local routing enabled
└─→ sends to Ollama /v1/chat/completions
└─→ re-streams response as Anthropic SSE
←─ Claude Code receives: normal streaming response
Same pattern for Codex CLI (OpenAI Responses format) and any other tool you route through the proxy.
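The bridge is essentially a per-chunk rewrite. Here's a simplified sketch of one direction, turning an OpenAI-style streaming delta (what Ollama's compatible endpoint emits) into the Anthropic content_block_delta event Claude Code expects. A real Anthropic stream also carries message_start/message_stop framing, which is omitted here:

```typescript
// One chunk from an OpenAI-compatible stream:
//   data: {"choices":[{"delta":{"content":"Hel"}}]}
type OpenAIChunk = { choices: { delta: { content?: string } }[] };

// Re-emit it as an Anthropic SSE event:
//   event: content_block_delta
//   data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hel"}}
function toAnthropicSSE(chunk: OpenAIChunk): string | null {
  const text = chunk.choices[0]?.delta.content;
  if (!text) return null; // role-only or empty chunks emit nothing
  const event = {
    type: "content_block_delta",
    index: 0,
    delta: { type: "text_delta", text },
  };
  return `event: content_block_delta\ndata: ${JSON.stringify(event)}\n\n`;
}
```

The tricky part in practice isn't the mapping itself, it's doing this incrementally on partial network reads without buffering the whole response, which is where most of the "garbled output" bugs live.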
What This Is Actually Good For
I'm not suggesting you replace GPT-4 or Claude Sonnet with a local 7B model. There's a real capability difference.
But a lot of what I actually use Claude Code for in a normal day doesn't need the best model:
- "What does this stacktrace mean?"
- "Generate a unit test for this function"
- "Rename these variables to be more descriptive"
- "Does this SQL query look right?"
For tasks like these, qwen2.5-coder:7b is fast, accurate enough, and free. Saving the cloud calls for the harder problems — complex refactors, architecture questions, multi-file changes — drops my monthly API bill significantly without changing my workflow.
The toggle in CliGate makes it easy to switch back when you need to.
What's Your Local Model Setup?
Are you running Ollama (or LM Studio, or anything else) for coding tasks? I'm curious what models people are finding useful for day-to-day dev work — especially anything that runs well on a laptop.
GitHub: github.com/codeking-ai/cligate
npx cligate@latest start
Top comments (9)
oh this is the exact thing I wish I'd figured out 6 months ago. the local ollama + Claude Code combo is weirdly underrated — you get the Claude Code ergonomics/agent loop but keep context on-device for the stuff you don't want sitting in cloud logs.
one tip: after you set this up, make a CLAUDE.md at repo root with your local model's quirks (context window, tool-calling reliability, latency). it's different from the cloud model, so the agent needs to know. I keep these CLAUDE.md templates in tokrepo.com and copy-paste them into new projects — saves me rediscovering the same workarounds every time.

also curious — are you running this in the CC terminal or through the SDK? the streaming behavior is different depending on which path you take and it matters for longer loops.
The CLAUDE.md tip is genuinely underrated. Local models have different context limits, tool-calling reliability, and latency than cloud models — the agent behaves better when it knows the constraints upfront. I've been doing this manually per project; templates would save a lot of rediscovery.

Running through the CC terminal, not the SDK directly. Spent some time getting SSE chunk forwarding right through the proxy — if you don't handle that correctly, the agent loop gets sluggish on longer tasks. The terminal gives you the interactive feedback that makes the agent feel natural; the SDK path is cleaner for programmatic use.
The per-app routing config is a smart middle ground. One thing I'd push further: within the same app, the gap between "rename this variable" and "redesign this module" is huge. Have you thought about request-level classification, like estimating complexity from prompt length or keyword patterns, to auto-switch between local and cloud on a per-request basis?
You're right that per-app routing is a coarse approximation. Per-request classification is more powerful but harder to get right — prompt length is a decent proxy but misses context (a short "refactor this function" can be surprisingly hard if it touches 10 files).
One heuristic I've been thinking about: context token count as the routing signal. If the request includes more than N context tokens, route to cloud. Simpler than NLP classification, and it correlates reasonably well with actual task complexity.
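As a sketch, with a crude chars/4 token estimate (the threshold is arbitrary, and this is a thought experiment rather than anything shipped):

```typescript
// Rough token estimate: ~4 characters per token for English text and code.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Route on total context size: large contexts go to cloud, small ones stay local.
function chooseRoute(
  prompt: string,
  contextFiles: string[],
  threshold = 4000
): "local" | "cloud" {
  const total =
    estimateTokens(prompt) +
    contextFiles.reduce((sum, f) => sum + estimateTokens(f), 0);
  return total > threshold ? "cloud" : "local";
}
```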
Curious what signals you'd trust most in practice — keyword patterns or something else?
The 3-minute setup is great, but what I find most interesting here is the shift toward local-first AI workflows. We have been running local LLMs on edge devices (Raspberry Pi, Jetson) for a robotics project, and the latency difference between a local 7B model vs. API call can make or break real-time decision making.
One thing that bit us: model loading time on resource-constrained devices. On a Pi 5 with 8GB RAM, just getting the model into memory takes longer than the actual inference. Curious if you have hit any similar constraints with Ollama — like cold start time or memory pressure when running other services alongside it?
The Claude Code + Ollama combo is clever though. Basically turning your editor into an AI pair programmer with full privacy. That is the kind of setup that actually changes how you work day to day.
Cold start is definitely noticeable on a laptop too — Ollama takes 5–15s to load a 7B model into memory from cold. The trick I use: send a dummy request to Ollama on startup so the model stays warm. Once loaded, inference latency is fine for interactive use.
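For reference, the warm-up ping can use Ollama's native /api/generate endpoint: calling it with no prompt just loads the model, and keep_alive controls how long it stays resident (both are documented Ollama behaviors; the helper name is mine):

```typescript
// Warm-up request body: no prompt means "just load the model into memory".
// keep_alive: "30m" keeps it resident for 30 minutes after last use.
const warmup = {
  model: "qwen2.5-coder:7b",
  keep_alive: "30m",
};

async function keepWarm(baseUrl = "http://localhost:11434"): Promise<void> {
  await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(warmup),
  });
}
```

Run it once on startup (or on a timer shorter than keep_alive) and the first real request skips the cold load.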
For Pi-class hardware, memory is the real ceiling. You'd probably want a smaller quantized variant — a Q4_K_M of a 3B model sits around 2GB and still handles code tasks well. CliGate itself is just a lightweight Node.js proxy, so it adds almost no overhead on top of what Ollama already needs.

The privacy angle you mentioned is actually one of the main reasons people go this route. Claude Code ergonomics with nothing leaving the machine — that's a real value prop for certain codebases.