DEV Community

Dor Amir

You're Already Routing Claude Code. You're Just Doing It Manually.

There's a popular post making the rounds today: a developer explaining how they separate "planning mode" from "execution mode" in Claude Code. Use one model for exploring and reading, another for actually writing code. Manage the switch manually. It works. The author is right.

Here's the uncomfortable follow-up: you shouldn't have to do that manually.


What you're actually doing when you "separate planning from execution"

Read the workflow carefully. The insight isn't about planning vs execution as a concept. The insight is that different types of prompts deserve different model tiers.

  • "Read this file and tell me what it does" → doesn't need Sonnet
  • "What's the signature of this function?" → doesn't need Sonnet
  • "Refactor the entire auth module to support OAuth and maintain backward compat" → needs Sonnet

The developer's solution is: be disciplined about which mode you're in. Switch manually. Think before each prompt about whether it's a planning question or an execution question.

That's a lot of cognitive overhead for something a classifier can do in 10 milliseconds.


The numbers on why this matters

Across typical Claude Code sessions, 60-70% of prompts are low-complexity:

  • File reads
  • Short questions about what something does
  • Asking for variable names or signatures
  • Quick grep-style queries ("does this import already exist?")
  • Short continuations ("add a docstring")

These go to Sonnet. They cost Sonnet prices. In a session where you're actively working, you might fire 80-120 prompts. 50 of those could be Gemini Flash-Lite at $0.10/M tokens. Instead they're all going to Sonnet at $3/M tokens.

The math on 50 simple prompts at 300 tokens each:

  • Sonnet: 50 × 300 × $3/1M ≈ $0.045
  • Flash-Lite: 50 × 300 × $0.10/1M ≈ $0.0015
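The arithmetic above, as a few lines of Python (prices are the per-million-token rates quoted in this post):

```python
# Back-of-envelope cost check: 50 simple prompts at ~300 tokens each.
PRICE_SONNET = 3.00 / 1_000_000      # $3 per 1M tokens
PRICE_FLASH_LITE = 0.10 / 1_000_000  # $0.10 per 1M tokens

def session_cost(prompts: int, tokens_per_prompt: int, price_per_token: float) -> float:
    """Total cost of routing `prompts` prompts of a given size to one model."""
    return prompts * tokens_per_prompt * price_per_token

sonnet = session_cost(50, 300, PRICE_SONNET)          # 0.045
flash_lite = session_cost(50, 300, PRICE_FLASH_LITE)  # 0.0015
print(f"Sonnet: ${sonnet:.4f}  Flash-Lite: ${flash_lite:.4f}  ratio: {sonnet / flash_lite:.0f}x")
```

Thirty times cheaper per simple prompt, which is where the compounding comes from.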

Per session that sounds small. Run Claude Code for a few hours a day? It compounds fast.


What routing actually looks like

NadirClaw is an OpenAI-compatible proxy. You run it locally:

pip install nadirclaw
nadirclaw setup   # configure simple and complex providers once
nadirclaw serve   # starts on localhost:8856

Point Claude Code at localhost:8856 instead of the Anthropic API. Every prompt goes through a 10ms classifier before it goes anywhere. The classifier embeds the prompt, compares it to learned complexity centroids, and routes:

  • High confidence it's simple → cheap model (you configure which one)
  • High confidence it's complex → your premium model
  • Borderline → premium model (better to over-serve than under-serve)
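The mechanic can be sketched with a toy classifier. To be clear, none of this is NadirClaw's actual code: the "embedder" below is a hand-rolled word-count stand-in and the centroids and margin are invented. It only illustrates the shape of the idea: embed the prompt, compare it to a simple centroid and a complex centroid, and send anything borderline to the premium model.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: counts of a few signal words.
    signals = ["refactor", "implement", "design", "read", "what", "signature"]
    words = text.lower().split()
    return [float(sum(w.startswith(s) for w in words)) for s in signals]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical "learned" centroids in the same toy space.
COMPLEX_CENTROID = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
SIMPLE_CENTROID = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
MARGIN = 0.2  # borderline band: anything inside it goes premium

def route(prompt: str) -> str:
    v = embed(prompt)
    gap = cosine(v, SIMPLE_CENTROID) - cosine(v, COMPLEX_CENTROID)
    if gap > MARGIN:
        return "cheap"
    return "premium"  # complex or borderline: over-serve, don't under-serve

print(route("What does this function's signature look like?"))  # cheap
print(route("Refactor the auth module to support OAuth"))       # premium
```

The key design choice is the asymmetric default: a misclassified complex prompt sent to a cheap model costs you quality, while a misclassified simple prompt sent to the premium model only costs you cents, so ties break toward premium.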

No prompt engineering. No mode switching. No manual discipline. The separation of planning and execution that the workflow post describes is exactly what NadirClaw does automatically.


The "but my local model is worse" objection

It's valid for execution. It doesn't apply to planning.

When you're reading a file to understand it, you don't need Opus-level reasoning. You need a model that can read. Gemini Flash-Lite reads. GPT-4o-mini reads. Qwen 3.5 running on Ollama reads.

The borderline cases go to your premium model by default. You're not gambling on the wrong call; you're just not paying Sonnet prices for "what does line 47 do."


What you configure

# Example: Gemini Flash-Lite for simple, Claude Sonnet for complex
nadirclaw setup \
  --simple-provider gemini \
  --simple-model gemini-flash-lite \
  --complex-provider anthropic \
  --complex-model claude-sonnet-4-5

Or if you want to go fully local for simple:

nadirclaw setup \
  --simple-provider ollama \
  --simple-model qwen3.5:8b \
  --complex-provider anthropic \
  --complex-model claude-sonnet-4-5

Same drop-in pattern. Claude Code doesn't know anything changed.
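Because the proxy speaks the OpenAI-compatible wire format, any client that can change its base URL can talk to it. A minimal standard-library sketch, assuming the usual /v1/chat/completions endpoint path (implied by "OpenAI-compatible"; auth headers omitted):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request against a local proxy."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8856", "claude-sonnet-4-5",
                         "What does line 47 do?")
# urllib.request.urlopen(req)  # uncomment with the proxy running
```

The client always asks for the same model name; the router decides behind the endpoint whether that prompt actually reaches the premium model or a cheap one.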


Quota exhaustion as a symptom, not a root cause

Most "Claude Code quota" problems aren't because you're doing too much complex work. They're because 60% of your prompts are buying Sonnet for tasks that don't need it.

The planning/execution separation post is solving the symptom correctly but manually. Routing solves the root cause automatically.


Try it

pip install nadirclaw
nadirclaw setup
nadirclaw serve
# Point Claude Code at http://localhost:8856
# Check nadirclaw status for real-time routing decisions

After a session: nadirclaw savings shows what routed where and what it saved.

The separation of planning and execution is a real insight. Automating it is just the next step.


NadirClaw is open source: github.com/doramirdor/NadirClaw

Top comments (1)

Hamza KONTE

Routing decisions get a lot cleaner when the prompts going to each model are structurally consistent. If you're sending unstructured blobs, two "identical" prompts can hit the router differently just because of word order.

That's part of what I built flompt to solve — it decomposes prompts into 12 semantic blocks (role, objective, constraints, output format, etc.) and compiles them to structured XML. Consistent structure means your routing logic can rely on something stable, not just vibes from a freeform paragraph.

flompt.dev / github.com/Nyrok/flompt