Six weeks ago, a LangChain agent we'd deployed for a B2B client started failing on roughly 30% of its sessions. No exceptions. No 500s. Nothing in the logs that looked wrong. The agent kept running, kept returning responses, kept looking completely healthy from the outside.
We found out because the client called asking why their numbers looked off.
Two weeks of silent failures. About $2,400 in wasted LLM spend before we even knew there was a problem. And the frustrating part wasn't that we lacked visibility — we had LangSmith running. Every LLM call was traced. Every tool call was logged. We could see exactly what the agent did, step by step.
We just couldn't see that it was wrong.
The gap between "what happened" and "was it correct"
This is the distinction that took us embarrassingly long to articulate clearly: trace-level observability tells you what an agent did. It does not tell you whether what it did was right.
Our agent was calling the correct tools. It was getting 200 responses back. Nothing in the execution path threw an error. The problem was that it had started retrieving the wrong context for one specific client's data, and it was confidently generating plausible-sounding, subtly incorrect answers based on that wrong context. At the trace level, a successful run and a failed run looked identical. Same tool calls, same response codes, same latency profile. The only difference was semantic — and semantic correctness isn't something a request/response trace can tell you.
This matters because most of the tooling in this space (LangSmith, Langfuse, Helicone — all genuinely solid tools) is built around the developer's question: what did the call cost, how long did it take, did it error. That's the right question for debugging integration issues. It's the wrong question for catching the failure mode where the agent runs perfectly and does the wrong thing.
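To make the distinction concrete, here is a minimal sketch of two runs that share an identical trace but differ in correctness. Every name here (`Trace`, `Run`, `requested_client`, `context_client`) is hypothetical and chosen for illustration. The sketch assumes the failure is one where the retrieved context belonged to the wrong client, which is one reading of what happened to us.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """What a request/response trace records (hypothetical shape)."""
    tool_calls: tuple[str, ...]
    status_codes: tuple[int, ...]
    raised: bool


@dataclass(frozen=True)
class Run:
    trace: Trace
    requested_client: str  # who the session was for
    context_client: str    # whose data the retriever actually returned


def healthy_by_trace(run: Run) -> bool:
    # "Healthy" as derived from the trace alone: no exception, all 2xx.
    return not run.trace.raised and all(
        200 <= code < 300 for code in run.trace.status_codes
    )


def is_correct(run: Run) -> bool:
    # Correctness needs information that lives outside the trace.
    return run.context_client == run.requested_client


same_trace = Trace(("retrieve_context", "generate_answer"), (200, 200), False)
good_run = Run(same_trace, requested_client="client_a", context_client="client_a")
bad_run = Run(same_trace, requested_client="client_a", context_client="client_b")
```

Both runs pass `healthy_by_trace`, and their traces compare equal. Only `is_correct` tells them apart, and it depends on a field the trace never carried.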
What we changed
We ended up building a small internal tool to close this gap, which eventually became AgentWatch. The core design decisions were mostly about treating three things as first-class, rather than as afterthoughts bolted onto request logs:
- Outcome as an explicit field, not an inference
Instead of deriving "success" from the absence of an error, we made outcome a field you set deliberately per session — success, error, or unknown if you haven't gotten around to labeling it yet. This sounds trivial. In practice it forces a decision at write-time about what "correct" even means for a given task, which is usually where the actual thinking happens. It's manual labeling in the early days, sometimes automated later, but the field exists and gets populated regardless.
```python
import agentwatch

aw = agentwatch.init(
    api_url="https://agentwatch-api.up.railway.app",
    api_key="your-api-key",
)

# Wrap an existing LangChain agent so its calls are captured.
chain = aw.wrap(your_langchain_agent)
```
The SDK wraps the agent and captures LLM calls, tool calls, latency, and cost automatically. Outcome is set either by your own evaluation logic or, where you don't have any yet, by hand.
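As a rough illustration of "outcome as an explicit field", the sketch below keeps a session at `unknown` until something deliberately labels it. This is not AgentWatch's actual schema or API: `Outcome`, `SessionRecord`, and `label_outcome` are hypothetical names, and the evaluator is assumed to be any function you write that returns `True` for a correct output.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(str, Enum):
    SUCCESS = "success"
    ERROR = "error"
    UNKNOWN = "unknown"


@dataclass
class SessionRecord:
    session_id: str
    output: str
    raised_exception: bool = False
    # Defaults to UNKNOWN: the absence of an error is not a success.
    outcome: Outcome = Outcome.UNKNOWN


def label_outcome(
    session: SessionRecord,
    evaluator: Optional[Callable[[str], bool]] = None,
) -> Outcome:
    """Set the outcome deliberately instead of inferring it from 'no error'."""
    if session.raised_exception:
        session.outcome = Outcome.ERROR
    elif evaluator is not None:
        passed = evaluator(session.output)
        session.outcome = Outcome.SUCCESS if passed else Outcome.ERROR
    # With no evaluator, the session stays UNKNOWN until labeled by hand.
    return session.outcome
```

The design choice worth noting is the default: a clean run with no evaluator stays `unknown` rather than quietly becoming `success`.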
- Retry count as signal, not noise
An agent that silently retries a tool call four times before succeeding is telling you something useful, and most trace viewers bury that information three levels deep in a nested tree. We made retry_count a first-class field per event specifically so it's queryable and alertable — "sessions where any event retried more than twice" is a one-line filter instead of a manual trace review.
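The filter described above might look like the following once `retry_count` is stored per event. This is a sketch over in-memory data, not AgentWatch's query interface. `ToolEvent`, `AgentSession`, and `sessions_with_heavy_retries` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToolEvent:
    name: str
    retry_count: int = 0


@dataclass
class AgentSession:
    session_id: str
    events: list[ToolEvent] = field(default_factory=list)


def sessions_with_heavy_retries(
    sessions: list[AgentSession], threshold: int = 2
) -> list[AgentSession]:
    # "Any event retried more than twice" becomes a one-line filter.
    return [s for s in sessions if any(e.retry_count > threshold for e in s.events)]
```

Because the count sits on the event itself, the same predicate can back an alert rule as easily as an ad hoc query.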
- Per-client cost attribution, designed in from the start
If you're running agents for multiple B2B clients on shared infrastructure — which is the situation most agencies are actually in — "what did this cost" is meaningless without "for which client." Retrofitting this after the fact means reconstructing attribution from incomplete data, which is exactly as painful as it sounds. We tag workspace_id at the API-key level so every session and event is attributable from the moment it's created, not from whenever someone remembers to add a client ID column.
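A minimal sketch of attribution at write time, assuming each API key maps to exactly one workspace. The key names, the `API_KEY_TO_WORKSPACE` table, and both functions are hypothetical and stand in for whatever store actually holds the mapping.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical mapping: one API key per client workspace.
API_KEY_TO_WORKSPACE = {
    "key-acme": "ws_acme",
    "key-globex": "ws_globex",
}


def record_event(ledger: list[dict], api_key: str, cost_usd: str) -> None:
    # The workspace is resolved from the key when the event is created,
    # so attribution never has to be reconstructed later.
    # An unrecognized key raises KeyError instead of writing an untagged row.
    workspace_id = API_KEY_TO_WORKSPACE[api_key]
    ledger.append({"workspace_id": workspace_id, "cost_usd": Decimal(cost_usd)})


def cost_by_workspace(ledger: list[dict]) -> dict[str, Decimal]:
    totals: defaultdict[str, Decimal] = defaultdict(Decimal)
    for event in ledger:
        totals[event["workspace_id"]] += event["cost_usd"]
    return dict(totals)
```

Costs are passed as strings and held as `Decimal` so that summing many small per-call amounts does not accumulate floating-point error.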
The part that surprised us: clients don't want a dashboard
We initially assumed clients would want access to the same trace-level data we had. We gave one client access to raw session traces. They never opened it once.
What they actually wanted was something closer to a monthly statement: how many sessions ran, what it cost, whether it worked, and — this was the specific ask — something they could screenshot or forward to their own boss without needing to understand what a "trace" is. That's a fundamentally different artifact than a dashboard. It's a report.
This is the part of the problem that's genuinely underserved by existing tools, as far as we've found. LangSmith and Langfuse are built for the team running the agent, not for that team's clients. There's no white-label PDF workflow because that's not the use case they're solving for — and it's not a criticism, just a different target user. If you're a dev team monitoring your own agents, you probably don't need this. If you're an agency delivering agents to other companies, "what do I show my client" turns out to be one of the first questions you have to answer, and there wasn't a good off-the-shelf answer for it.
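The statement clients asked for can be sketched as a plain roll-up of session records. This is an illustration only: the record fields (`outcome`, `cost_usd`) and the function name are assumptions, and a real report would be rendered to PDF rather than returned as text.

```python
from decimal import Decimal


def monthly_statement(client_name: str, month: str, sessions: list[dict]) -> str:
    """Summarize a month of sessions into something forwardable."""
    total = len(sessions)
    succeeded = sum(1 for s in sessions if s["outcome"] == "success")
    unlabeled = sum(1 for s in sessions if s["outcome"] == "unknown")
    cost = sum((Decimal(s["cost_usd"]) for s in sessions), Decimal("0"))
    lines = [
        f"{client_name}: agent statement for {month}",
        f"Sessions run: {total}",
        f"Succeeded: {succeeded} of {total}",
        f"Not yet labeled: {unlabeled}",
        f"Total cost: ${cost:.2f}",
    ]
    return "\n".join(lines)
```

Unlabeled sessions get their own line so the success figure is not read as covering sessions nobody has checked yet.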
What's still unsolved
A few things we haven't cracked yet, in case anyone's further along on these:
Automated outcome scoring. Manually labeling outcome works but doesn't scale past a certain volume. Using another model to score outputs has its own failure modes — you're now trusting a judge model to catch what the original model missed, which is a different flavor of the same problem.
Data-layer monitoring. A few people pointed out in discussion that garbage input data produces confidently garbage output that looks like a perfectly healthy run. We think this deserves the same first-class treatment as outcome tracking, but haven't built it out yet.
Cost-saved attribution across enforcement layers. We integrate with FiGuard for budget enforcement at the tool-call boundary — when a request gets denied for exceeding a budget threshold, we surface that as "cost saved" in the client report. Getting this reporting right across two systems that weren't originally designed together has been its own small adventure in webhook signature verification and shared vocabulary.
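For readers who haven't done webhook signature verification before, the usual pattern is an HMAC over the raw request body, compared in constant time. The sketch below assumes HMAC-SHA256 with a hex-encoded signature and a shared secret. That is a common convention, not a description of FiGuard's actual scheme, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    # Hex-encoded HMAC-SHA256 of the raw request body.
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    expected = sign_payload(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, received_signature)
```

Verification has to run against the exact bytes received, before any JSON parsing or re-serialization, or a valid signature can fail to match.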
If you're running LangChain or LangGraph agents in production for clients and have run into versions of these problems, I'd genuinely like to compare notes — feel free to leave a comment or reach out. We built a free tier (500 sessions/month) if anyone wants to see whether this approach to the problem is useful for their setup: agentwatch-two.vercel.app