You're Flying Blind With Your AI Agents. Here's How to Fix It.
Last Tuesday at 2 AM, I woke up to a $340 bill from OpenAI. My coding agent had been running all evening. I thought it was just refactoring some tests. Turns out it had hit an infinite retry loop on a malformed API response and burned through 8 million tokens.
I had no idea until the bill arrived.
If you're building with AI agents (coding assistants, autonomous task runners, chatbots), you're probably flying blind too. Here's the problem and how to fix it.

The Problem: You Have No Idea What Your Agents Are Doing
When you spin up a coding agent like Aider, Cursor, or a custom LangChain workflow, you see the final output. The code it wrote. The answer it gave. The task it completed.
What you don't see:
- How many LLM calls it made to get there
- Which models it used (did it really need GPT-4, or would 3.5 have worked?)
- What the actual prompts and responses were
- How long each call took
- Which calls failed and got retried
- What you're paying per task
You get a monthly bill from OpenAI or Anthropic, but you can't trace it back to specific tasks or prompts. It's like running a web service with no logs and no monitoring. You wouldn't do that. So why are we doing it with AI?
Why This Matters More Than You Think
Three things happen when you have no visibility:
1. Surprise bills
Your agent uses way more tokens than you expected. Maybe it's re-reading the same file 15 times. Maybe it's sending your entire codebase as context on every call. You won't know until you see the bill.
2. Silent performance issues
Your agent feels slow, but is it the LLM latency? Network issues? A bad prompt causing long generation times? Without traces, you're guessing.
3. No way to optimize
You can't make it better if you can't measure it. Could you use a cheaper model for some calls? Are you over-prompting? Is caching working? No clue.
The Standard Answer (And Why It's Not Enough)
The usual advice is to use an observability platform. LangSmith, Weights & Biases, Arize, Langfuse, whatever. These are great tools. But they have two problems:
You have to instrument everything. Every agent, every framework, every custom script. If you're mixing LangChain with raw OpenAI calls with Anthropic SDK calls, good luck getting consistent traces.
They only see what you send them. If you forget to wrap a call, it's invisible. If a library makes a direct API call, you miss it.
What you actually want is a single chokepoint that sees every LLM call, automatically, without you having to remember to instrument it.
The Fix: Put a Router in Front of Everything
Here's the approach that actually works: route all your LLM traffic through a single proxy that logs everything by default.
Instead of this:
Your Agent → OpenAI API
Your Agent → Anthropic API
Your Agent → Local Model
Do this:
Your Agent → Router → OpenAI/Anthropic/Local Model
The router sees every request and response. It logs them. It tracks timing. It calculates costs. It shows you what's actually happening.
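The whole idea fits in a few lines: wrap every outbound LLM call in one function that records timing, tokens, and an estimated cost. This is an illustrative sketch, not NadirClaw's actual implementation; the price table, field names, and `llm_fn` callback are all made up for the example:

```python
import time

# Illustrative per-1K-token prices; real prices vary by model and date.
PRICES_PER_1K = {"gpt-4": 0.03, "gpt-3.5-turbo": 0.001}

CALL_LOG = []  # a real router would write to a database or log file


def logged_call(model, prompt, llm_fn, tags=None):
    """Forward a call through llm_fn and record what actually happened."""
    start = time.monotonic()
    response, tokens_used = llm_fn(model, prompt)
    elapsed = time.monotonic() - start
    cost = tokens_used / 1000 * PRICES_PER_1K.get(model, 0.0)
    CALL_LOG.append({
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_s": elapsed,
        "tokens": tokens_used,
        "cost_usd": cost,
        "tags": tags or {},
    })
    return response
```

Because every call funnels through `logged_call`, nothing is invisible: a library making "direct" calls still shows up, as long as its traffic goes through the chokepoint.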
This is how we built NadirClaw (full disclosure: I maintain it, it's open source at github.com/doramirdor/NadirClaw). It started as a cost-saving tool (route expensive calls to cheaper models when possible). But the observability piece ended up being way more valuable.
What You Get With a Router-Based Approach
When every LLM call flows through a central point, you automatically get:
Full request/response logs
See the actual prompts your agent is sending. See the raw responses. Debug weird behavior by reading the actual conversation, not guessing.
Cost tracking per task
Tag requests by agent, task, or user. See exactly what each task costs. Find the expensive outliers.
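Once each log record carries a tag, per-task cost is a small aggregation. A sketch, assuming records shaped like the hypothetical `{"tags": {"task": ...}, "cost_usd": ...}` format above (not NadirClaw's real schema):

```python
from collections import defaultdict


def cost_by_task(call_log):
    """Sum cost_usd per task tag, most expensive tasks first."""
    totals = defaultdict(float)
    for record in call_log:
        task = record.get("tags", {}).get("task", "untagged")
        totals[task] += record["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The "untagged" bucket matters in practice: if it dominates, your agents aren't labeling their calls and the per-task numbers are lying to you.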
Latency metrics
p50, p95, p99 latency for every model and provider. See which calls are slow. Spot timeouts before they become a problem.
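Percentiles are more honest than averages for latency, and you don't need a metrics vendor to compute them. A first pass using only the standard library's `statistics.quantiles`, on the same illustrative log shape as above:

```python
import statistics


def latency_percentiles(latencies_s):
    """Return p50/p95/p99 from a list of per-call latencies in seconds."""
    # quantiles with n=100 yields the 1st..99th percentile cut points
    cuts = statistics.quantiles(latencies_s, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

A p99 far above the p50 usually points at a specific model or provider having bad tail behavior, which is exactly the kind of thing averages hide.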
Error rates and retries
How often are calls failing? Which models have the highest error rates? Are you retrying intelligently or just burning money?
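A hard attempt cap plus exponential backoff is what would have stopped my $340 retry loop. The names here are invented for illustration; the point is the cap and the backoff, not any particular library:

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when we give up instead of retrying forever."""


def call_with_retries(fn, max_attempts=4, base_delay_s=0.5, sleep=time.sleep):
    """Retry fn with exponential backoff, never more than max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                break
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the policy is testable without actually waiting, and the final exception surfaces the failure instead of silently burning tokens all night.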
Provider comparison
If you're routing between OpenAI, Anthropic, and local models, you can compare them head-to-head on cost, speed, and reliability. Make informed decisions about which to use when.
No instrumentation required
Point your app at the router instead of directly at the API. That's it. Everything gets logged automatically.
How This Looks in Practice
Here's a real example from last week. We had a coding agent that was supposed to write unit tests. It worked fine, but felt slow.
Looking at NadirClaw's dashboard, we saw:
- Average task: 12 LLM calls (way more than expected)
- 8 of those calls were hitting GPT-4
- 6 of the GPT-4 calls had identical prompts (wtf?)
Turns out the agent was re-analyzing the same file on every iteration because of a caching bug. We fixed it in 10 minutes once we could see the actual call pattern.
Before: ~90 seconds per task, $0.40 in API costs
After: ~25 seconds per task, $0.08 in API costs
We only found this because we could see the trace of every call. Without that, we'd still be wondering why it was slow and expensive.
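The fix had the usual shape for this bug class: cache analysis results keyed on file content so identical prompts never go out twice. A minimal sketch of the idea, not our agent's actual code:

```python
import hashlib

_analysis_cache = {}


def analyze_file(path, contents, llm_fn):
    """Call the LLM only when this exact file content hasn't been seen."""
    key = hashlib.sha256(contents.encode()).hexdigest()
    if key not in _analysis_cache:
        _analysis_cache[key] = llm_fn(f"Analyze this file ({path}):\n{contents}")
    return _analysis_cache[key]
```

Keying on a content hash rather than the path means an edited file correctly misses the cache, while repeated iterations over an unchanged file hit it.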
You Don't Need Another SaaS Tool
The nice thing about the router approach is you're not adding another vendor to your stack. NadirClaw runs locally (or in your VPC if you want). Your prompts and responses stay on your infrastructure. You're not sending sensitive data to a third-party observability platform.
If you already have an observability stack (Datadog, New Relic, whatever), you can export traces from the router via OpenTelemetry. Or just use the built-in dashboard. For most teams, the built-in stuff is plenty.
How to Get Started
If you want to try this:
- Spin up NadirClaw locally (it's a single Docker container or npm install)
- Point your agent at http://localhost:3000 instead of api.openai.com
- Add your API keys to the router config
- Open the dashboard at http://localhost:3000/dashboard
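In Python, that redirection is one argument on the client. This sketch assumes NadirClaw exposes an OpenAI-compatible endpoint under /v1 on port 3000 and handles provider keys itself; check the project README for the actual paths and auth behavior:

```python
from openai import OpenAI

# Same agent code as before; only base_url changes. The router forwards
# the request to the real provider using the keys in its own config.
client = OpenAI(
    base_url="http://localhost:3000/v1",  # assumed router endpoint
    api_key="router-key",                 # assumption: router may ignore this
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a unit test for add()"}],
)
print(response.choices[0].message.content)
```

Anything built on the OpenAI SDK, including most agent frameworks, accepts a base URL override like this, which is why no further instrumentation is needed.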
You'll immediately see every call, every response, costs, and timing. No instrumentation, no SDK changes, nothing.
Then start poking around. Look for patterns. Find the expensive calls. See where retries are happening. Optimize based on actual data instead of guessing.
The Takeaway
You wouldn't run a production service without logs and metrics. Don't run AI agents without them either.
The router approach gives you observability by default. Every call gets logged. You can trace problems back to specific prompts. You can optimize based on real usage patterns.
And when you get a surprise bill at 2 AM, you'll know exactly what caused it.
Amir Dor maintains NadirClaw, an open-source LLM router focused on observability and cost optimization. Find it at github.com/doramirdor/NadirClaw.