There are at least six tools right now that will show you exactly how much money your LLM calls cost. Helicone gives you dashboards. Arize gives you traces. SigNoz plugs into OpenTelemetry. They're all good at the same thing: showing you the bill.
None of them make it smaller.
The Observability Trap
Here's the pattern I keep seeing. Team ships an AI feature. Costs creep up. Someone sets up an observability layer. Now you have gorgeous charts showing your Claude Sonnet spend going up and to the right. Everyone nods seriously in the meeting. Nothing changes.
Observation without action is just a nicer way to watch money leave.
The problem isn't visibility. You already know LLM calls are expensive. The problem is that every single prompt hits the same model, regardless of complexity. Your "what's the weather in Tokyo" query runs on the same $15/million-token model as your "analyze this contract for liability risks" query.
That's not an observability problem. That's an architecture problem.
What If the Proxy Just Fixed It?
NadirClaw is an open-source LLM proxy. It sits between your application and the LLM API. OpenAI-compatible, drop-in replacement. One line to install:
pip install nadirclaw && nadirclaw serve
Here's what it does differently from every observability tool out there: it classifies each prompt's complexity before it hits the API. Simple prompts (lookups, formatting, extraction) route to cheap models like Gemini Flash or Claude Haiku. Complex prompts (reasoning, analysis, generation) stay on Claude Sonnet or Opus.
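The actual classifier is internal to NadirClaw, but the routing idea can be sketched with a toy heuristic. Everything below — the keyword list, the length threshold, the model names — is an illustrative assumption, not NadirClaw's implementation:

```python
# Toy sketch of complexity-based model routing.
# Heuristic, threshold, and model names are illustrative only;
# NadirClaw's real classifier is more sophisticated than this.

REASONING_HINTS = ("analyze", "explain why", "compare", "prove", "design", "debug")

def classify(prompt: str) -> str:
    """Crude complexity guess: long prompts or reasoning verbs => complex."""
    text = prompt.lower()
    if len(text) > 500 or any(hint in text for hint in REASONING_HINTS):
        return "complex"
    return "simple"

def route(prompt: str) -> str:
    """Map the complexity class to a model tier."""
    return {
        "simple": "claude-haiku",    # cheap tier: lookups, formatting, extraction
        "complex": "claude-sonnet",  # expensive tier: reasoning, analysis
    }[classify(prompt)]

print(route("What's the weather in Tokyo?"))
print(route("Analyze this contract for liability risks."))
```

The point of the sketch is the shape of the decision, not the heuristic: classification happens per prompt, before the API call, so the expensive model only sees work that needs it.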
You don't configure rules. You don't write routing logic. The classifier handles it.
The result: 40-70% cost reduction on real workloads. Not theoretical. Measured.
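The arithmetic behind that range is simple to check yourself. With hypothetical round numbers (the prices and traffic split below are illustrative, not measured NadirClaw data):

```python
# Back-of-the-envelope blended cost with routing.
# All numbers here are hypothetical, chosen only to show the shape of the math.

EXPENSIVE = 15.0    # $/million tokens, flagship model
CHEAP = 1.0         # $/million tokens, small model
SIMPLE_SHARE = 0.6  # fraction of prompts the classifier deems simple

baseline = EXPENSIVE  # everything on the flagship model
blended = SIMPLE_SHARE * CHEAP + (1 - SIMPLE_SHARE) * EXPENSIVE
savings = 1 - blended / baseline

print(f"blended cost: ${blended:.2f}/Mtok, savings: {savings:.0%}")
```

With these numbers the blended rate is $6.60 per million tokens, a 56% reduction — and the savings scale directly with how much of your traffic is simple.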
"But I Need Visibility Too"
You get it. NadirClaw ships with a built-in dashboard that shows every routing decision in real time. Which prompts went to which model, why, and what it cost. You see the savings as they happen, not after the invoice arrives.
The difference is that visibility is a byproduct of the thing actually saving you money. Not the other way around.
Compare that to the observability-first approach:
- Install observability tool
- See that costs are high
- Manually figure out which calls could use cheaper models
- Write routing logic yourself
- Maintain it as models and pricing change
- Set up the observability tool again to monitor your routing logic
Or:
- Install NadirClaw
- Done
Who This Is For
If you're running any AI application that makes more than a handful of LLM calls per day, you're overpaying. Agents are the worst offenders because they run loops of tool calls, memory lookups, and planning steps. Most of those intermediate calls are simple. They don't need your most expensive model.
But it's not just agents. RAG pipelines, chatbots, content generation workflows, code assistants. Anything with volume.
If you're already using an observability tool, NadirClaw doesn't replace it. It just makes the numbers on your dashboards less painful to look at.
The Part Where I Get Opinionated
The LLM observability space is crowded because it's easy to build. Wrap the API, log the calls, render some charts. Useful, sure. But it's a vitamin, not a painkiller.
Cost reduction is the painkiller. And right now, NadirClaw is the only open-source proxy that does it automatically.
I'd rather have a tool that saves me $500/month with a basic dashboard than a tool that shows me a beautiful breakdown of the $500 I just spent.
Try It
pip install nadirclaw
nadirclaw setup # pick your models
nadirclaw serve # localhost:8000, OpenAI-compatible
Point your app at localhost:8000 instead of the OpenAI or Anthropic API. Everything else stays the same.
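Because the proxy speaks the OpenAI-compatible chat-completions format, the switch is only a base-URL change. A minimal stdlib sketch (the model name is a placeholder for whatever your app already sends):

```python
import json
import urllib.request

# Same request your app already makes -- only the host changes.
BASE_URL = "http://localhost:8000/v1"  # was: https://api.openai.com/v1

payload = {
    "model": "gpt-4o",  # placeholder; the proxy may route this to a cheaper model
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with the proxy running
```

If you use the official OpenAI or Anthropic SDK instead of raw HTTP, the same idea applies: override the client's base URL and leave the rest of your code alone.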
Open source. No account needed. No vendor lock-in.