
Dor Amir

Stop Hitting Your Claude Code Quota. Route Around It Instead.

Every week there's a new HN thread about it: "Claude Code: connect to a local model when your quota runs out."

The advice is always the same: set up Ollama as a fallback, accept that the local model is worse, keep using Claude Code for the important stuff. Treat quota exhaustion like a power outage. Have a backup.

Here's the problem with that framing: it treats quota exhaustion as inevitable.

It isn't.


What's actually eating your quota

If you've used Claude Code for more than a day, you know the pattern. You start a session, work on something complex, and most of the exchange is fine. But then there's this steady trickle of smaller requests:

  • "Read this file and tell me what it does"
  • "What's the signature of this function?"
  • "Show me lines 47 to 53 of config.py"
  • "Does this import already exist?"

These prompts go to Sonnet. Or Opus. They cost the same per token as your hardest architectural questions.

Roughly 60-70% of Claude Code requests in a typical session are simple tasks. Reading files, short summaries, formatting checks, continuations. None of them need a frontier model. All of them burn quota at frontier model rates.


Fallback is the wrong mental model

The conversation keeps landing on fallback because that's how people think about resource exhaustion: you have a primary, and when it fails, you have a backup.

But you're not running out of Claude because the model broke. You're running out because you're over-spending on prompts that don't need it.

The real solution isn't fallback. It's routing.


NadirClaw: route before you exhaust, not after

NadirClaw sits between your AI tool and your LLM provider as an OpenAI-compatible proxy. Before every request goes out, it classifies the prompt in about 10ms using sentence embeddings and asks: does this actually need a premium model?

If the answer is no, it routes to something cheap: Gemini Flash-Lite, Ollama on your local machine, GPT-4o-mini, whatever you configure. If the answer is yes, it passes through to Claude Opus or Sonnet, GPT-5, your premium model of choice.

The result: the 60-70% of requests that don't need Claude don't hit Claude's quota. The requests that do need it get all the headroom they need.

pip install nadirclaw
nadirclaw setup
nadirclaw serve

Then point Claude Code at http://localhost:8856 instead of the Anthropic API. That's it. No code changes, no prompt engineering, no new SDK.


How it classifies

The router uses sentence embeddings to build a semantic fingerprint of each prompt, then scores it against complexity signals: reasoning depth required, ambiguity, length of expected output, presence of multi-step instructions.

Simple prompts (file reads, short factual questions, formatting tasks) score low and go to the cheap model. Complex prompts (architecture decisions, debugging hard failures, multi-step implementation) score high and go to your premium model.
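NadirClaw's actual classifier works on sentence embeddings, but the scoring idea can be illustrated with a simplified heuristic stand-in. Everything below (the signal names, weights, and threshold) is invented for this sketch and is not NadirClaw's API:

```python
# Simplified stand-in for NadirClaw's embedding-based classifier.
# The real router scores semantic fingerprints; these keyword and
# length heuristics are invented for illustration only.

SIMPLE_MARKERS = ("read", "show", "what's", "does", "list", "format")
COMPLEX_MARKERS = ("architect", "debug", "refactor", "design", "implement")

def complexity_score(prompt: str) -> float:
    """Score from 0.0 (trivial) to 1.0 (needs a frontier model)."""
    text = prompt.lower()
    score = 0.0
    # Longer prompts tend to carry multi-step instructions.
    score += min(len(text) / 2000, 0.4)
    # Explicit multi-step structure is a strong complexity signal.
    if any(step in text for step in ("first", "then", "finally", "step")):
        score += 0.3
    if any(m in text for m in COMPLEX_MARKERS):
        score += 0.4
    # Prompts that open with a read/show verb are usually lookups.
    if any(text.startswith(m) for m in SIMPLE_MARKERS):
        score -= 0.3
    return max(0.0, min(score, 1.0))

def route(prompt: str, threshold: float = 0.5) -> str:
    """Pick a model tier based on the complexity score."""
    return "premium" if complexity_score(prompt) >= threshold else "cheap"
```

A file-read prompt like "Read this file and tell me what it does" scores near zero and routes cheap; a multi-step architecture prompt clears the threshold and routes premium.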

Classification happens locally. Your prompts never leave your machine before routing. The router adds ~10ms overhead, which disappears into network latency.

For Claude Code specifically, NadirClaw also detects agentic loops: when Claude Code is running a multi-step task with tools, it forces the complex model for that entire conversation to avoid the quality degradation that comes from switching mid-task.
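That pinning behavior amounts to a small piece of per-conversation state. Here is a hypothetical sketch of the idea (class and method names are invented; NadirClaw's internals may differ):

```python
class ConversationRouter:
    """Sketch of per-conversation model pinning for agentic loops.

    Hypothetical illustration: once a conversation shows tool use,
    every later request in it is pinned to the premium model so
    quality doesn't degrade by switching mid-task.
    """

    def __init__(self):
        self.pinned = set()  # conversation ids pinned to premium

    def pick_model(self, conv_id: str, prompt: str, uses_tools: bool) -> str:
        if uses_tools:
            self.pinned.add(conv_id)
        if conv_id in self.pinned:
            return "premium"
        # Otherwise fall back to per-prompt classification (stubbed here
        # as a length check for brevity).
        return "cheap" if len(prompt) < 200 else "premium"
```

Note that the pin is one-way: once a tool call appears, the conversation stays on the premium model even if later prompts look simple in isolation.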


Numbers that matter

With default routing config on a typical Claude Code session:

  • ~65% of requests route to cheap/local models
  • Cost reduction: 40-60% compared to routing everything to Sonnet
  • Quota consumption: drops proportionally, so you can run longer sessions before hitting limits

The 150x price spread between Gemini Flash-Lite ($0.10/M input tokens) and Claude Opus ($15/M input tokens) is enormous. You're paying somewhere in that range right now for every file-read request. Routing closes most of that gap automatically.
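The back-of-envelope math is easy to check. This sketch assumes a Sonnet-class premium price of $3/M input tokens and that every request uses the same token count, which overstates savings somewhat since complex requests tend to be longer:

```python
# Back-of-envelope cost check using the figures from this post.
# Prices are example input-token rates and will drift over time.
CHEAP_PRICE = 0.10    # Gemini Flash-Lite, $ per 1M tokens
PREMIUM_PRICE = 3.00  # Sonnet-class rate, $ per 1M tokens (assumed)
CHEAP_SHARE = 0.65    # fraction of requests routed to the cheap model

def blended_cost(tokens_millions: float) -> float:
    """Cost when 65% of traffic goes to the cheap model."""
    cheap = CHEAP_SHARE * tokens_millions * CHEAP_PRICE
    premium = (1 - CHEAP_SHARE) * tokens_millions * PREMIUM_PRICE
    return cheap + premium

all_premium = 10 * PREMIUM_PRICE  # 10M tokens, everything on premium: $30.00
routed = blended_cost(10)         # same traffic through the router: $11.15
savings = 1 - routed / all_premium
```

Under these assumptions the routed session costs about 63% less than sending everything to the premium model, in the same ballpark as the 40-60% figure above once you account for complex requests being token-heavier.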


Hybrid routing beats going fully local

The HN discussions usually end up at one of two positions: pay for API, or go fully local. Fully local means hardware investment and accepting that even the best local models are meaningfully behind frontier models on coding tasks.

NadirClaw takes the middle path: frontier quality when you need it, cheap/local when you don't. The routing is automatic. You don't have to decide per-request.

With Ollama integration (nadirclaw setup asks if you want to add Ollama), simple prompts can route to your local model at zero API cost. Complex ones still go to Claude. You get the quality of frontier models where it matters, the cost (and privacy) of local where it doesn't.
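A hybrid setup boils down to a two-tier mapping plus a threshold. The shape below is a hypothetical illustration of that mapping; the real config that `nadirclaw setup` writes, and all field and model names here, may differ:

```python
# Hypothetical routing-config shape for a hybrid local/frontier setup.
# Field names and model ids are invented for illustration.
ROUTING_CONFIG = {
    "cheap": {
        "provider": "ollama",
        "model": "qwen2.5-coder:7b",   # local, zero API cost
        "base_url": "http://localhost:11434",
    },
    "premium": {
        "provider": "anthropic",
        "model": "claude-sonnet-4-5",  # frontier quality when needed
    },
    "threshold": 0.5,  # complexity scores at or above this go premium
}

def tier_for(score: float) -> dict:
    """Resolve a complexity score to a provider/model entry."""
    key = "premium" if score >= ROUTING_CONFIG["threshold"] else "cheap"
    return ROUTING_CONFIG[key]
```

The point of the two-tier shape: the privacy and cost properties come from the cheap tier being local, while the quality floor comes from the premium tier being a frontier model.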


Setup takes about 3 minutes

# Install
pip install nadirclaw

# Interactive setup: pick providers, enter API keys, set routing thresholds
nadirclaw setup

# Start the router
nadirclaw serve

Configure Claude Code with ANTHROPIC_BASE_URL=http://localhost:8856. Or for Cursor/Codex: point the OpenAI base URL at the same port.

That's it. nadirclaw dashboard gives you a live view of what's routing where, your cost breakdown, and budget alerts when you're approaching limits.


The quota exhaustion problem is real. But the fix isn't a fallback. It's spending your quota on things that need it.

github.com/doramirdor/NadirClaw
