You're Flying Blind With Your AI Agents. Here's How to Fix It.
Last Tuesday at 2 AM, I woke up to a $340 bill from OpenAI. My coding agent had been running all evening. I thought it was just refactoring some tests. Turns out it had hit an infinite retry loop on a malformed API response and burned through 8 million tokens.
I had no idea until the bill arrived.
If you're building with AI agents (coding assistants, autonomous task runners, chatbots), you're probably flying blind too. Here's the problem and how to fix it.

The Problem: You Have No Idea What Your Agents Are Doing
When you spin up a coding agent like Aider, Cursor, or a custom LangChain workflow, you see the final output. The code it wrote. The answer it gave. The task it completed.
What you don't see:
- How many LLM calls it made to get there
- Which models it used (did it really need GPT-4, or would 3.5 have worked?)
- What the actual prompts and responses were
- How long each call took
- Which calls failed and got retried
- What you're paying per task
You get a monthly bill from OpenAI or Anthropic, but you can't trace it back to specific tasks or prompts. It's like running a web service with no logs and no monitoring. You wouldn't do that. So why are we doing it with AI?
Why This Matters More Than You Think
Three things happen when you have no visibility:
1. Surprise bills
Your agent uses way more tokens than you expected. Maybe it's re-reading the same file 15 times. Maybe it's sending your entire codebase as context on every call. You won't know until you see the bill.
2. Silent performance issues
Your agent feels slow, but is it the LLM latency? Network issues? A bad prompt causing long generation times? Without traces, you're guessing.
3. No way to optimize
You can't make it better if you can't measure it. Could you use a cheaper model for some calls? Are you over-prompting? Is caching working? No clue.
The Standard Answer (And Why It's Not Enough)
The usual advice is to use an observability platform. LangSmith, Weights & Biases, Arize, Langfuse, whatever. These are great tools. But they have two problems:
You have to instrument everything. Every agent, every framework, every custom script. If you're mixing LangChain with raw OpenAI calls with Anthropic SDK calls, good luck getting consistent traces.
They only see what you send them. If you forget to wrap a call, it's invisible. If a library makes a direct API call, you miss it.
What you actually want is a single chokepoint that sees every LLM call, automatically, without you having to remember to instrument it.
The Fix: Put a Router in Front of Everything
Here's the approach that actually works: route all your LLM traffic through a single proxy that logs everything by default.
Instead of this:
Your Agent → OpenAI API
Your Agent → Anthropic API
Your Agent → Local Model
Do this:
Your Agent → Router → OpenAI/Anthropic/Local Model
The router sees every request and response. It logs them. It tracks timing. It calculates costs. It shows you what's actually happening.
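The whole idea fits in a few lines: wrap every outbound LLM call in one function that records timing, tokens, and an estimated cost. This is an illustrative sketch, not NadirClaw's actual implementation; the price table, field names, and `llm_fn` callback are all made up for the example:

```python
import time

# Illustrative per-1K-token prices; real prices vary by model and date.
PRICES_PER_1K = {"gpt-4": 0.03, "gpt-3.5-turbo": 0.001}

CALL_LOG = []  # a real router would write to a database or log file


def logged_call(model, prompt, llm_fn, tags=None):
    """Forward a call through llm_fn and record what actually happened."""
    start = time.monotonic()
    response, tokens_used = llm_fn(model, prompt)
    elapsed = time.monotonic() - start
    cost = tokens_used / 1000 * PRICES_PER_1K.get(model, 0.0)
    CALL_LOG.append({
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_s": elapsed,
        "tokens": tokens_used,
        "cost_usd": cost,
        "tags": tags or {},
    })
    return response
```

Because every call funnels through `logged_call`, nothing is invisible: a library making "direct" calls still shows up, as long as its traffic goes through the chokepoint.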
This is how we built NadirClaw (full disclosure: I maintain it, it's open source at github.com/doramirdor/NadirClaw). It started as a cost-saving tool (route expensive calls to cheaper models when possible). But the observability piece ended up being way more valuable.
What You Get With a Router-Based Approach
When every LLM call flows through a central point, you automatically get:
Full request/response logs
See the actual prompts your agent is sending. See the raw responses. Debug weird behavior by reading the actual conversation, not guessing.
Cost tracking per task
Tag requests by agent, task, or user. See exactly what each task costs. Find the expensive outliers.
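Once each log record carries a tag, per-task cost is a small aggregation. A sketch, assuming records shaped like the hypothetical `{"tags": {"task": ...}, "cost_usd": ...}` format above (not NadirClaw's real schema):

```python
from collections import defaultdict


def cost_by_task(call_log):
    """Sum cost_usd per task tag, most expensive tasks first."""
    totals = defaultdict(float)
    for record in call_log:
        task = record.get("tags", {}).get("task", "untagged")
        totals[task] += record["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The "untagged" bucket matters in practice: if it dominates, your agents aren't labeling their calls and the per-task numbers are lying to you.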
Latency metrics
p50, p95, p99 latency for every model and provider. See which calls are slow. Spot timeouts before they become a problem.
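Percentiles are more honest than averages for latency, and you don't need a metrics vendor to compute them. A first pass using only the standard library's `statistics.quantiles`, on the same illustrative log shape as above:

```python
import statistics


def latency_percentiles(latencies_s):
    """Return p50/p95/p99 from a list of per-call latencies in seconds."""
    # quantiles with n=100 yields the 1st..99th percentile cut points
    cuts = statistics.quantiles(latencies_s, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

A p99 far above the p50 usually points at a specific model or provider having bad tail behavior, which is exactly the kind of thing averages hide.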
Error rates and retries
How often are calls failing? Which models have the highest error rates? Are you retrying intelligently or just burning money?
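A hard attempt cap plus exponential backoff is what would have stopped my $340 retry loop. The names here are invented for illustration; the point is the cap and the backoff, not any particular library:

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when we give up instead of retrying forever."""


def call_with_retries(fn, max_attempts=4, base_delay_s=0.5, sleep=time.sleep):
    """Retry fn with exponential backoff, never more than max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                break
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the policy is testable without actually waiting, and the final exception surfaces the failure instead of silently burning tokens all night.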
Provider comparison
If you're routing between OpenAI, Anthropic, and local models, you can compare them head-to-head on cost, speed, and reliability. Make informed decisions about which to use when.
No instrumentation required
Point your app at the router instead of directly at the API. That's it. Everything gets logged automatically.
How This Looks in Practice
Here's a real example from last week. We had a coding agent that was supposed to write unit tests. It worked fine, but felt slow.
Looking at NadirClaw's dashboard, we saw:
- Average task: 12 LLM calls (way more than expected)
- 8 of those calls were hitting GPT-4
- 6 of the GPT-4 calls had identical prompts (wtf?)
Turns out the agent was re-analyzing the same file on every iteration because of a caching bug. We fixed it in 10 minutes once we could see the actual call pattern.
Before: ~90 seconds per task, $0.40 in API costs
After: ~25 seconds per task, $0.08 in API costs
We only found this because we could see the trace of every call. Without that, we'd still be wondering why it was slow and expensive.
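The fix had the usual shape for this bug class: cache analysis results keyed on file content so identical prompts never go out twice. A minimal sketch of the idea, not our agent's actual code:

```python
import hashlib

_analysis_cache = {}


def analyze_file(path, contents, llm_fn):
    """Call the LLM only when this exact file content hasn't been seen."""
    key = hashlib.sha256(contents.encode()).hexdigest()
    if key not in _analysis_cache:
        _analysis_cache[key] = llm_fn(f"Analyze this file ({path}):\n{contents}")
    return _analysis_cache[key]
```

Keying on a content hash rather than the path means an edited file correctly misses the cache, while repeated iterations over an unchanged file hit it.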
You Don't Need Another SaaS Tool
The nice thing about the router approach is you're not adding another vendor to your stack. NadirClaw runs locally (or in your VPC if you want). Your prompts and responses stay on your infrastructure. You're not sending sensitive data to a third-party observability platform.
If you already have an observability stack (Datadog, New Relic, whatever), you can export traces from the router via OpenTelemetry. Or just use the built-in dashboard. For most teams, the built-in stuff is plenty.
How to Get Started
If you want to try this:
- Spin up NadirClaw locally (it's a single Docker container or npm install)
- Point your agent at http://localhost:3000 instead of api.openai.com
- Add your API keys to the router config
- Open the dashboard at http://localhost:3000/dashboard
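In Python, that redirection is one argument on the client. This sketch assumes NadirClaw exposes an OpenAI-compatible endpoint under /v1 on port 3000 and handles provider keys itself; check the project README for the actual paths and auth behavior:

```python
from openai import OpenAI

# Same agent code as before; only base_url changes. The router forwards
# the request to the real provider using the keys in its own config.
client = OpenAI(
    base_url="http://localhost:3000/v1",  # assumed router endpoint
    api_key="router-key",                 # assumption: router may ignore this
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a unit test for add()"}],
)
print(response.choices[0].message.content)
```

Anything built on the OpenAI SDK, including most agent frameworks, accepts a base URL override like this, which is why no further instrumentation is needed.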
You'll immediately see every call, every response, costs, and timing. No instrumentation, no SDK changes, nothing.
Then start poking around. Look for patterns. Find the expensive calls. See where retries are happening. Optimize based on actual data instead of guessing.
The Takeaway
You wouldn't run a production service without logs and metrics. Don't run AI agents without them either.
The router approach gives you observability by default. Every call gets logged. You can trace problems back to specific prompts. You can optimize based on real usage patterns.
And when you get a surprise bill at 2 AM, you'll know exactly what caused it.
Amir Dor maintains NadirClaw, an open-source LLM router focused on observability and cost optimization. Find it at github.com/doramirdor/NadirClaw.