
Dor Amir

Anthropic Just Pushed Everyone to API Billing. Here's How to Cut Your Costs.

Anthropic quietly updated their docs this week with a policy that affects a lot of developers:

"Using OAuth tokens obtained through Claude Free, Pro, or Max accounts in any other product, tool, or service, including the Agent SDK, is not permitted and constitutes a violation of the Consumer Terms of Service."

Translation: if you've been using your $200/month Claude Max subscription in the Agent SDK, OpenCode, Cursor, or any third-party tool, that's no longer allowed. The only place those tokens work is Claude Code CLI itself.

For everyone else? You're on API billing now. Every token costs money.

The hidden problem with per-token billing

When you had a flat-rate subscription, optimization didn't matter. Send everything to Claude Sonnet; who cares? But on per-token pricing, every prompt counts.

Here's what most developers don't realize: the majority of their prompts are simple. "Explain this error." "Reformat this JSON." "Write a docstring for this function." "Convert this to TypeScript."

These are all tasks that a smaller, cheaper model handles perfectly fine. But without any routing layer, they all hit your most expensive model at full price.

I tracked my own usage for a few weeks. Roughly 60-70% of my prompts fell into the "simple" category. That's a lot of wasted spend.

What I built

NadirClaw is an open source Python proxy that sits between your application and your LLM providers. It classifies each incoming prompt in about 10ms using sentence embeddings and routes it to the appropriate model tier.

Simple prompts (summaries, reformatting, basic code generation) go to a cheap model like Gemini Flash or a local Ollama instance. Complex prompts (multi-file refactors, architectural decisions, agentic tool-use loops) go to your premium model.

It exposes an OpenAI-compatible API, so you just change your base URL to localhost:8856 and everything works. No code changes needed in your application.
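Since the proxy speaks the OpenAI chat-completions format, pointing a client at it really is just a base-URL change. A minimal stdlib sketch of what that request looks like (the model name, prompt, and endpoint path here are illustrative assumptions, not NadirClaw specifics):

```python
import json
from urllib import request

# The only change from talking to a provider directly: the base URL
# now points at the local NadirClaw proxy.
NADIRCLAW_BASE_URL = "http://localhost:8856/v1"

payload = {
    "model": "auto",  # illustrative; the proxy decides the real tier
    "messages": [{"role": "user", "content": "Convert this JSON to YAML"}],
}

req = request.Request(
    f"{NADIRCLAW_BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# request.urlopen(req) would send it once `nadirclaw serve` is running
print(req.full_url)
```

The same swap works with the official OpenAI SDK by passing the proxy address as the client's base URL.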

How the classification works

The classifier considers several signals:

  • Vocabulary complexity of the prompt
  • Code structure: single file vs multi-file, presence of imports and dependencies
  • System prompt patterns that indicate agentic workflows
  • Conversation context: whether the thread involves chain-of-thought reasoning
  • Tool-use detection: if the prompt contains function calls or multi-step loops, it always routes to the complex tier

This isn't just checking token count. A short prompt like "refactor the auth module to use JWT" is complex despite being brief. A long prompt that's just "here's my JSON, convert it to YAML" is simple despite being lengthy.
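To make that concrete, here is a toy version of such a router. NadirClaw's actual classifier uses sentence embeddings; this stdlib sketch only illustrates the kinds of signals listed above (vocabulary, code structure, tool use), and the signal words are made up for the example:

```python
import re

# Illustrative "complex" vocabulary; a real classifier would get this
# from embeddings, not a hand-written list.
COMPLEX_SIGNALS = ("refactor", "architecture", "migrate", "redesign")

def classify(prompt: str, has_tools: bool = False) -> str:
    """Return 'simple' or 'complex' for a prompt (toy heuristic)."""
    if has_tools:  # tool-use loops always route to the complex tier
        return "complex"
    lowered = prompt.lower()
    score = sum(word in lowered for word in COMPLEX_SIGNALS)  # vocabulary
    score += len(re.findall(r"^(?:import|from) ", prompt, re.M))  # deps
    return "complex" if score >= 1 else "simple"

# Length alone is not the signal:
print(classify("refactor the auth module to use JWT"))   # complex
print(classify("here's my JSON, convert it to YAML"))    # simple
```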

What surprised me during development

Session persistence matters. Without it, you start a deep conversation on Claude Sonnet, then the next message gets classified as "simple" and goes to Gemini Flash, which has no context for the thread. NadirClaw pins conversations to their initial model.
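The pinning idea fits in a few lines. A hedged sketch (function and variable names are mine, not NadirClaw's): the first routed turn records the model, and later turns reuse it even when they would classify differently on their own.

```python
# Maps a conversation id to the model tier it was first routed to.
_pinned: dict[str, str] = {}

def route(conversation_id: str, tier_for_prompt: str) -> str:
    """Pin a conversation to whatever tier its first turn got."""
    if conversation_id not in _pinned:
        _pinned[conversation_id] = tier_for_prompt
    return _pinned[conversation_id]

print(route("thread-1", "complex"))  # first turn pins the tier
print(route("thread-1", "simple"))   # follow-up stays on "complex"
```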

Rate limit fallback is essential. When your primary model returns a 429, NadirClaw falls back to the other tier instead of just failing. During peak hours, this alone saves a lot of frustration.
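The fallback logic is conceptually a try/except around the primary call. A sketch, with a stand-in exception playing the role of the provider's 429:

```python
class RateLimited(Exception):
    """Stand-in for a provider returning HTTP 429."""

def call_with_fallback(primary, fallback, prompt: str) -> str:
    # Try the assigned tier first; on a rate limit, retry on the
    # other tier instead of surfacing the failure to the caller.
    try:
        return primary(prompt)
    except RateLimited:
        return fallback(prompt)

def flaky_premium(prompt: str) -> str:
    raise RateLimited  # simulate peak-hour throttling

print(call_with_fallback(flaky_premium, lambda p: f"cheap: {p}", "hi"))
```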

Context window awareness. Some conversations grow beyond what the assigned model supports. NadirClaw detects this and auto-migrates to a model with a larger context window.
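A sketch of that migration check, assuming per-model window sizes are known up front (the numbers below are illustrative, not real provider limits):

```python
# Illustrative context limits, not real provider numbers.
MODEL_CONTEXT = {"flash": 32_000, "premium": 200_000}

def pick_model(current: str, conversation_tokens: int) -> str:
    """Keep the current model until the thread outgrows its window,
    then migrate to the smallest model that still fits."""
    if conversation_tokens <= MODEL_CONTEXT[current]:
        return current
    for name, limit in sorted(MODEL_CONTEXT.items(), key=lambda kv: kv[1]):
        if conversation_tokens <= limit:
            return name
    raise ValueError("conversation exceeds every configured window")

print(pick_model("flash", 10_000))    # flash
print(pick_model("flash", 100_000))   # premium
```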

Setup

pip install nadirclaw
nadirclaw setup    # interactive wizard for providers and models
nadirclaw serve    # starts the proxy on localhost:8856

The setup wizard walks you through picking your providers and models for each tier. You can also configure it manually via ~/.nadirclaw/config.yaml.

Results

In practice, my API costs dropped about 60%. The classification adds roughly 10ms of latency per request, which is unnoticeable. Quality on complex tasks stayed the same because those still hit the premium model.

The timing of Anthropic's policy change makes this more relevant than ever. Thousands of developers are about to move from flat-rate to per-token billing without any cost optimization in place.

Links

If you're making the switch to API billing, take a look. And if you find it useful, a star on GitHub helps others find it too.
