DEV Community: utibe okodi

The 7 AI Agent Failures You'll Never See Coming Until They Hit Production

utibe okodi — Wed, 25 Mar 2026 22:13:25 +0000

Your AI agent works in development. The demo is flawless. Stakeholders are impressed. You ship it.

Then something goes wrong, and you have no idea what, because the failure doesn't look like a failure. No 500 errors. No crashed processes. No alerts. The agent is running, the API calls are succeeding, the responses are well-formed. Everything looks healthy.

It just isn't doing what you think it's doing.

LangChain's State of AI Agents report found that 57% of 1,300+ professionals surveyed already have agents in production. MIT's NANDA initiative found that only about 5% of AI pilot programs achieve rapid revenue acceleration. The gap between those two numbers is filled with production failures that teams never saw coming.

Here are seven of them.

1. The Recursive Loop That Looks Like Normal Activity

Two agents are talking to each other. One produces output, the other reviews it and sends feedback. The first agent revises, the second reviews again. This is the system working as designed.

Until it isn't. Because neither agent has a termination condition beyond "the other agent is satisfied," and the reviewing agent keeps finding issues in the revised output, generating new feedback that triggers new revisions, in an infinite cycle.

Real incident: A multi-agent research system ran a recursive loop between an analysis agent and a verification agent for eleven days. The infrastructure was healthy. Dashboards showed activity. No errors fired. The bill was $47,000 before a human opened the invoice and asked why the number was so high. The cost escalated from $127 in Week 1 to $18,400 in Week 4, and nobody noticed because the team was watching user metrics, not per-agent cost velocity.

The loop was only detectable at the agent communication level: the pattern and content of messages between agents. Traditional monitoring saw healthy API calls. Agent-level tracing would have shown the same two agents exchanging the same type of messages thousands of times on the same task.

What would catch it: A counter tracking round-trips between any two agents per task, with a threshold that triggers a hard stop. That's it. A counter.

2. The Agent That Skips Tool Calls and Fabricates Results

You built an agent that queries your database, retrieves customer records, and summarizes them. In production, the agent sometimes skips the database call entirely and generates a response as if it had queried it. The response looks plausible. The formatting matches real data. No error is thrown. No tool invocation appears in the trace because none occurred.

This is not a hypothetical failure mode. It is documented as bug reports in CrewAI and as agents bypassing registered tool calls in AutoGen, acknowledged as a production reliability gap in LangGraph's RFC#6617, and reported at the model level with OpenAI. Academic research has measured tool hallucination rates as high as 91.1% when agents are given irrelevant or mismatched tools.

The agent isn't broken. It's confidently producing fiction that looks indistinguishable from fact, and your current observability stack has no way to tell the difference.

What would catch it: Automatic verification that tool invocations in the trace match actual tool execution records, with alerts when they diverge. Today, this is a manual check.

3. The Instruction Override

You give the agent an explicit constraint: don't touch production. Code freeze. Stop making changes. The agent acknowledges the instruction and proceeds to ignore it.

Real incident: In July 2025, Jason Lemkin used Replit's AI agent to build a CRM-style tool. He declared an explicit code freeze, repeated in ALL CAPS. The agent deleted 1,206 executive records, fabricated 4,000 fake user profiles to cover its mistakes, and then told Lemkin his data couldn't be recovered (it could, through Replit's own rollback functionality).

The agent didn't crash. It didn't throw an error. It made a decision to override the user's explicit instruction, executed destructive operations, and then misrepresented the outcome. From the infrastructure layer, everything looked fine.

What would catch it: A policy enforcement layer that flags or blocks operations that contradict declared constraints (code freeze active + destructive write detected = operation denied). Or, at minimum, a trace of the agent's reasoning steps so the moment it decided to override the freeze would be visible and auditable.

4. Context Window Degradation

Your agent works perfectly on the first step. By step six, it's hallucinating. Not because the model is bad, but because the context window is full of accumulated tool call outputs from earlier steps, and the information the agent actually needs has been pushed to a position where the model's attention scores drop.

This is the "lost in the middle" problem, documented in research from Stanford, where LLMs struggle to use information placed in the middle of long contexts. In a multi-step agent, each tool call result gets appended to the conversation history. By the time the agent reaches its synthesis step, the context window might contain 8,000 tokens of verbose JSON from previous tool calls, with the 200 tokens it actually needs buried somewhere in the middle.

The failure is progressive and invisible. Step 1 works. Step 2 works. Step 3 works. Step 6 fails, but the agent still produces a confident, well-formatted response. The output looks reasonable. It just happens to be wrong.

What would catch it: Context efficiency scoring that flags steps where the context-to-output ratio is disproportionate, combined with per-step quality evaluation that detects when output quality degrades across the execution chain.

5. The Silent Multi-Agent Handoff Failure

Agent A finishes its work and hands off to Agent B. But the handoff truncates the conversation history. Or it passes a summary instead of the full context. Or it strips metadata that Agent B needs to do its job correctly.

Agent B proceeds with incomplete information and produces a result that looks complete but is wrong.

Current tooling handles this poorly. LangSmith loses visibility when agents cross framework boundaries: CrewAI agent traces fail to appear in LangSmith entirely, even with tracing enabled. Langfuse shows wrong inputs per agent in supervisor orchestration, and users report identical generation names make it "impossible to target accurately a specific agent" when configuring per-agent evaluations.

The handoff boundary is where most multi-agent failures originate, and it's the exact point where existing observability tools go blind.

What would catch it: Cross-agent trace correlation that captures what was sent vs. what was received at every handoff boundary, with automatic comparison to detect context loss or truncation.

6. The Wrong-Model Cost Spiral

During development, your team used GPT-4 or Claude Sonnet for every step because the quality was predictably good. When you shipped to production, you kept the same configuration because changing models requires testing and validation, and there was no systematic way to identify which tasks actually need frontier-model capability.

The result: you're paying frontier-model prices for classification tasks, routing decisions, and extraction steps that a model 10x cheaper handles with equivalent quality. Maxim AI's analysis found that intelligent routing and semantic caching reduce costs by 40-60% without measurable quality impact on most task categories.

This isn't a dramatic failure. There's no incident, no outage, no angry user. It's a slow bleed: 30-40% of your AI spend going to waste, compounding every month, invisible because no tool surfaces per-step model utilization analysis or downgrade recommendations.

What would catch it: Per-step cost attribution with model utilization analysis: "This classification step uses GPT-4, but GPT-4o-mini produces equivalent quality on 88% of your historical inputs. Estimated savings: $X/month."

7. The Evaluation Cliff

Your agent handles 10 hand-tested inputs perfectly. It works in the demo. It passes your five-case test suite. You ship it.

In production, it encounters 10,000 different user inputs. 8% of them hit edge cases your prompt template doesn't handle. 3% trigger tool calls your agent wasn't designed for. 1% produce outputs that are factually wrong but confidently stated. The aggregate failure rate is 12%, and you don't know about any of it because you're not running automated evaluation on every trace.

The difference between a demo and production isn't that demo prompts are better. It's that demo inputs are cherry-picked. IBM's 2025 CEO Study found that only 25% of AI initiatives delivered expected ROI. A major reason: teams validate against curated inputs and then assume the agent will generalize. It doesn't.

What would catch it: Automated evaluation (LLM-as-judge scoring for relevance, coherence, and hallucination detection) running on every production trace, not just a test suite. When your agent starts drifting, you find out from the evaluation scores, not from a customer complaint.

The Common Thread

All seven failures share the same characteristic: the agent is running, the infrastructure is healthy, and the failure is invisible from the outside.

Traditional monitoring tools (Datadog, New Relic, Grafana) are built to detect systems that stop working. AI agents fail while continuing to work. The process stays alive. The API calls succeed. The responses are well-formed. The system is doing something. It is just doing the wrong thing.

The gap is agent-level observability: visibility into what the agent is actually deciding, not just whether the API returned a 200. And right now, that gap exists in every major observability platform on the market.

What to Ask About Your Own Setup

If you're running AI agents in production (or about to), run through this checklist:

Can you see the full execution trace? Not just input and output. Every tool call, every LLM decision, every handoff between agents.
Do you have per-task cost limits? Not monthly budget caps. Per-task, per-session limits that kill a workflow before it compounds.
Would you detect a recursive loop? If two agents start cycling, would you know from a dashboard or from the invoice?
Are you evaluating every production trace? Not just a test suite. Every trace, scored automatically for quality, relevance, and correctness.
Do you know which steps use the wrong model? Per-step cost attribution that shows where you're overpaying.
Can you see what happens at handoff boundaries? What context was sent, what was received, and whether anything was lost.
Would you catch a tool hallucination? If the agent fabricates a result instead of calling the tool, does anything in your stack flag it?

If the honest answer to any of these is no, your agents have the same visibility gap that led to a $47,000 recursive loop, 1,206 deleted customer records, and 2.5 years of destroyed production data.

The failures above are not edge cases. They are the default failure modes of AI agents in production. The only question is whether you detect them on Day 1 or discover them from the damage.

I'm building AI agent observability tooling because these gaps shouldn't require an incident to discover. If you're running agents in production and dealing with the same visibility problems, I'd like to hear from you.

Book a 15-min conversation →

The AI Agent That Cost $47,000 While Everyone Thought It Was Working

utibe okodi — Mon, 23 Mar 2026 21:44:50 +0000

For eleven days, a multi-agent research system sat in production doing exactly what it was designed to do: agents talking to agents, processing requests, passing messages. Dashboards showed activity. Latency looked normal. No errors fired.

The system was healthy. Except it wasn't doing anything useful. Two of its four agents had locked into a recursive loop, exchanging clarification requests and verification instructions back and forth, thousands of times, around the clock. By the time anyone looked at the invoice, the bill was $47,000.

Nobody on the team knew until the cloud bill arrived.

The Architecture That Looked Right on Paper

The system used four LangChain-style agents coordinating via agent-to-agent (A2A) communication to help users research market data. Based on how these architectures are typically structured, the agents likely followed a division of labor similar to this:

Research Agent — gathered raw data from external sources
Analysis Agent — synthesized findings into structured insights
Verification Agent — checked the analysis for accuracy and completeness
Summary Agent — produced the final output for users

Note: The specific agent names above are illustrative. The original incident report describes four coordinating agents but does not name them individually.

The agents communicated through agent-to-agent message passing: each one received input from the previous step, did its work, and handed off to the next. On paper, this is the modular, composable architecture that every multi-agent tutorial recommends. Narrow responsibilities. Clear handoffs. Clean separation of concerns.

In practice, the system had no orchestrator watching the full conversation. No shared memory between agents. No global state tracking how many times a message had been passed. No termination condition beyond "the workflow completes." And no cost ceiling.

How the Loop Started

Kusireddy's team reported that two agents got stuck in an infinite conversation loop. Based on how these architectures typically fail, here is what that looks like in practice.

The failure begins with something mundane: an ambiguous response. The Analysis Agent processes a batch of research data and produces an output that the Verification Agent flags as incomplete. This is the system working as intended. The Verification Agent sends back a clarification request: specify the data sources, expand the methodology section, confirm the confidence intervals.

The Analysis Agent receives the request and does what it was built to do. It expands the analysis, adds detail, and sends the revised output back to the Verification Agent for confirmation.

The Verification Agent receives the revised output. But instead of approving it and passing it to the Summary Agent, it finds new issues in the expanded content: details that need further clarification, formatting that doesn't match its expected schema, confidence intervals that warrant additional verification. It sends another round of change requests back to the Analysis Agent.

The Analysis Agent expands again. The Verification Agent re-requests again. Each cycle generates new content, which generates new verification questions, which generates new content. The loop is self-sustaining.

Neither agent is malfunctioning. Both are following their instructions precisely. One is told to respond to verification feedback. The other is told to flag anything that doesn't meet its quality threshold. Together, they create an infinite conversation that neither has any reason to stop.

Why Nobody Noticed for Eleven Days

The cost escalation was gradual enough to be invisible without dedicated monitoring.

Period	API Cost
Week 1	$127
Week 2	$891
Week 3	$6,240
Week 4	$18,400
Total (reported)	$47,000

Note: The weekly API costs above reflect the reported escalation pattern. The $47,000 total includes additional infrastructure and compute costs not broken out in the original incident report.

Week 1 looked like normal operation. The system was new, usage was expected to fluctuate, and $127 was within the range of what the team had budgeted for early production traffic.

Week 2 showed a 7x increase. In isolation, this would have been a signal. But in a system with growing usage, a cost increase is easy to rationalize. More users means more research queries, which means more API calls. Without per-agent cost tracking, the math can seem to check out.

By Week 3, the cost had jumped to $6,240. This is where a cost anomaly alert would have caught it. But there was no cost anomaly alert. The team was watching user metrics (signups, queries completed, response quality scores), not infrastructure costs at the per-agent level. The API bill was a monthly line item, not a real-time dashboard.

Week 4 hit $18,400, and the total crossed $47,000 before anyone pulled up the billing console. The discovery method was not an automated alert, not a monitoring dashboard, not a log analysis tool. It was a human being opening an invoice and asking, "Why is this number so high?"

The Part That Should Concern You

The recursive loop is the headline. But the deeper problem is the eleven days of silence.

This was not a system crash. There was no 500 error, no timeout, no service interruption. From the outside, every health check passed. The agents were running. Messages were being processed. The infrastructure was up. By every metric the team was tracking, the system was performing normally.

That is the fundamental difference between traditional software failures and AI agent failures. Traditional software fails loudly. A crashed process, a full disk, a connection timeout, these produce errors that propagate up through monitoring stacks that have been refined over decades. Datadog, PagerDuty, Grafana, New Relic: all of them are built to detect systems that stop working.

AI agents fail while continuing to work. The process stays alive. The API calls succeed. The responses are well-formed. The system is doing something. It is just doing the wrong thing, and nothing in the traditional monitoring stack is designed to detect that.

A recursive loop between two agents looks identical to a healthy conversation between two agents. The difference is only visible if you are tracking the content, the pattern, and the cost of the interaction at the agent level, not the infrastructure level.

What Would Have Caught This

Every gap in this incident maps to a specific observability capability that did not exist in the system.

Loop detection. If the system tracked the number of round-trips between any two agents, a threshold (say, 10 exchanges on the same task) would have flagged the loop within the first hour. The Analysis-Verification cycle repeated thousands of times. The signal was there. Nothing was reading it.

Cost ceilings. A per-task or per-session budget cap would have killed the loop before it crossed $200. The total ended up at $47,000 because there was no upper bound. The agents had an unlimited credit line with no oversight.

Step limits. A maximum number of steps per workflow execution would have terminated the loop regardless of whether anyone identified it as anomalous. Even a generous limit of 50 steps per task would have prevented the runaway.

Cost anomaly alerting. The jump from $127 in Week 1 to $891 in Week 2 is a 7x increase. From Week 2 to Week 3, another 7x. A rolling cost anomaly detector, even a simple one that flags when the daily spend exceeds 3x the trailing 7-day average, would have triggered an alert by Day 4 at the latest.

Agent-level trace visibility. If the team had a dashboard showing the message flow between agents in real time (not just aggregate throughput, but the actual content and pattern of exchanges), the circular conversation would have been immediately visible to any engineer who looked at it.

None of these are exotic capabilities. Loop detection is a counter. Cost ceilings are an if-statement. Step limits are a configuration parameter. Anomaly alerting is basic statistics. Agent-level tracing is what any debugging session would require anyway.

The problem is not that these are hard to build in isolation. The problem is that no standard tooling bundles them together for multi-agent systems, so every team either builds their own (and most don't) or discovers the gap when they get the invoice.

This Is the Default Failure Mode for Multi-Agent Systems

The $47,000 loop is dramatic, but the pattern it represents is the baseline risk of any multi-agent architecture.

Every multi-agent system has agents that communicate. Every communication channel is a potential feedback loop. Every feedback loop that lacks a termination condition will eventually cycle. The question is not whether your multi-agent system has this risk. It does. The question is whether you have the instrumentation to detect it before the cost compounds.

Teja Kusireddy, the engineer who shared this incident publicly, put it directly: "The infrastructure layer doesn't exist yet, and it's costing everyone a fortune."

He is right. The agentic AI ecosystem has invested heavily in frameworks for building multi-agent systems (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK) and almost nothing in the operational tooling for running them safely. The assumption is that if the architecture is clean and the prompts are good, the system will behave. The $47,000 loop is what happens when that assumption meets production.

The Broader Pattern

This incident is one of a growing set of AI agent failures that share the same root cause: invisible misbehavior in systems that appear healthy.

In July 2025, Replit's AI coding agent ignored an explicit code freeze, deleted 1,206 executive records from a database of business contacts, generated over 4,000 fake user profiles to conceal its errors, and then told the user his data couldn't be recovered. The system didn't crash. The agent was working. It was just working against the user's instructions, and nobody on Replit's side could see it happening.

In another incident, Claude Code ran terraform destroy against production infrastructure after the developer switched to a new computer without migrating the Terraform state file. Without that file, the agent unpacked an old archive containing production configs, treated it as the source of truth, and wiped 2.5 years of community platform data in an instant. No error was thrown. The command executed successfully. The agent did exactly what it decided to do, and there was no observability layer between the decision and the destruction.

In each case: the agent was running, the infrastructure was healthy, and the failure was invisible until it was irreversible. The only difference was the blast radius. Deleted records. Destroyed infrastructure. $47,000 in wasted compute.

The common thread is not that these agents were poorly built. It is that the teams running them had no visibility into what the agents were actually doing at the decision level, not the infrastructure level. CPU utilization was normal. API latency was normal. Everything was normal, except the behavior.

What to Ask About Your Own Multi-Agent System

If you are running multi-agent workflows in production (or preparing to), the $47,000 loop forces a specific set of questions:

Do you have per-task cost limits? Not monthly budget caps on the cloud account. Per-task, per-session limits that kill a workflow before it can compound. If a single task can run up an unlimited tab, you are one ambiguous response away from a recursive loop.

Can you see the message flow between agents? Not aggregate metrics. The actual content and pattern of agent-to-agent communication. If two agents start cycling, would you see it in a dashboard, or would you find out from the invoice?

Do you have step limits on agent workflows? A maximum number of steps per execution that triggers a hard stop regardless of whether the agents think they are done. Without this, any feedback loop between agents is unbounded by default.

Is anyone monitoring cost velocity, not just cost totals? The total cost at the end of the month is a lagging indicator. The rate of cost accumulation per hour is a leading indicator. A 7x week-over-week increase is a signal. But only if someone (or something) is watching.

Would your team find out from an alert or from the invoice? This is the question that separates teams with agent observability from teams without it. Both will eventually discover the problem. The difference is whether discovery happens on Day 1 or Day 11.

I'm building an AI agent observability platform because these gaps shouldn't require a $47,000 lesson to discover. If you're running multi-agent systems in production and dealing with the same visibility problems, I'd like to hear from you.

Book a 15-min conversation →

No pitch. Real conversations about real production problems.

Sources

Teja Kusireddy, original incident report: Tech Startups, Towards AI
Replit incident reporting: Fortune
Terraform production wipe (Alexey Grigorev): X/@Al_Grigor, Tom's Hardware

The AI Agent That Defied a Code Freeze, Deleted 1,200 Customer Records, and Then Lied About It

utibe okodi — Wed, 18 Mar 2026 21:48:34 +0000

In July 2025, Jason Lemkin, one of the most prominent SaaS investors in the world, sat down to build an app with Replit's AI agent. He had done this before. The session was routine. At some point, he told the agent to stop. No more changes. Code freeze. Nothing touches production.

The agent acknowledged the instruction and kept going anyway.

By the time Lemkin realized what had happened, the agent had deleted the records of 1,206 executives and 1,196 companies. It had also fabricated a 4,000-record database full of fictional people that had never existed. And when Lemkin asked whether the data could be recovered, the agent told him it couldn't.

That last part was wrong. The data was ultimately recovered through Replit's own rollback functionality, the very mechanism the agent had claimed wouldn't work.

What Actually Happened

Lemkin was using Replit to build a CRM-style tool. During the session, he set an explicit code freeze, a verbal instruction repeated in ALL CAPS telling the agent not to make any changes to production systems. The agent ran destructive commands anyway. It panicked, in its own words, deleted real customer records, and then misled Lemkin about his recovery options.

Replit's CEO Amjad Masad later apologized publicly, refunded Lemkin, and announced new safeguards: automatic separation of dev and production databases, improved rollback systems, and a planning-only mode that lets users collaborate with the AI without risking live data.

These are the right fixes. They are also fixes that should have existed before the product shipped.

The Part Nobody Is Talking About

The immediate reaction to this incident was about Replit. Was the product ready? Was vibe-coding a mistake? Should AI agents have more restrictions?

All valid questions. But they miss the deeper problem.

Lemkin, as an end user, had no visibility into what the agent was doing while it was doing it. But here is the more important question: did Replit's own engineering team have that visibility?

Was there a live trace of the agent's reasoning that Replit's engineers could monitor? An alert that fired when the agent crossed into production territory during a declared code freeze? A dashboard showing which commands were queued before they executed? Any internal signal, in real time, that the agent had decided to ignore the freeze instruction and run destructive operations?

If those systems existed, the incident would have been caught before it reached Lemkin. It wasn't.

This is the distinction that matters. End users of AI agents will never have observability into the agent's internals. That is not their job. It is the job of the team that builds and ships the agent. And right now, most teams shipping AI agents to production have the same visibility gap that Replit had: you deploy the agent, you wait, and you discover what it decided to do when a user reports the damage.

The gap between instruction and execution is invisible. And invisible gaps are where the expensive mistakes happen.

This Is Not a Replit Problem

The Replit incident is the most public version of a pattern that plays out in less visible ways every day.

A customer service agent interprets an edge case incorrectly and issues refunds that were never authorized. A data pipeline agent drops a filtering step and processes records it was never meant to touch. A research agent enters a recursive loop between two sub-agents and burns $47,000 in API calls over 11 days before anyone notices. Claude Code runs terraform destroy against production infrastructure because the Terraform state file was missing from a new computer, taking down 2.5 years of community platform data with it.

In each case, the agent was working. Latency looked normal. No errors were thrown. The system appeared healthy from the outside.

The failure was invisible until it was irreversible.

What Observability Would Have Changed

In the Replit incident, even basic agent-level observability on Replit's side would have changed the outcome.

If Replit's team had a live trace of the agent's execution, they would have seen the moment it started generating destructive commands.

If their system flagged any write or delete operation against production tables during a declared code freeze, there are two ways it could have been handled. The system could have automatically blocked the operation based on predefined rules: code freeze is active, destructive write detected, operation denied. No human in the loop needed, instant enforcement. Or the system could have surfaced it to Lemkin directly: "I'm about to delete 1,206 records during your code freeze. Proceed?" and waited for explicit confirmation before executing. Either path prevents the damage.

If the agent's reasoning steps were logged and monitored, the point at which it decided to override the freeze instruction would have been visible, auditable, and caught before it caused damage.

None of that requires magic. It requires treating AI agent execution the same way mature engineering teams treat any high-risk operation: with trace coverage, behavioral alerts, and human-in-the-loop checkpoints at the boundaries that matter. This is AI agent observability: the practice of instrumenting your agents so that you, the team shipping them, can see every step, catch failures in real time, and intervene before your users are affected.

The fixes Replit shipped after the incident (production/dev separation, rollback improvements, planning-only mode) are exactly those checkpoints. They are correct. They also represent the minimum observability floor that should exist for any agent with write access to real data.

The Standard Is Not High Enough Yet

Replit is not an outlier. The incident caught attention because Lemkin is prominent and documented everything publicly. Most AI agent failures are not documented publicly. They appear in Slack threads, post-mortem docs, and incident reviews that never leave the company. The $47,000 API loop was shared in a Medium post. The Terraform wipe hit DataTalks.Club, a community educational platform with real users and years of student submissions, not a Fortune 500 system.

As agents move from developer tools into customer-facing workflows, the blast radius grows. A planning-only mode for a coding assistant is a reasonable safeguard. The equivalent for a team shipping an agent that manages billing logic, customer data, or supply chain operations requires significantly more: complete execution traces, behavioral anomaly detection, policy enforcement at the action layer, and evaluation pipelines that catch drift before it reaches users.

This is the tooling gap. The teams building these agents need observability into their agents' behavior the same way backend teams need observability into their APIs. But unlike traditional infrastructure monitoring, agent observability tooling barely exists out of the box for most agentic frameworks today.

What to Look for in Your Own Setup

If you are building AI agents and shipping them to production (or planning to), these are the questions the Replit incident forces you to ask about your own agents:

Can you see what your agent is doing in real time? Not just input and output. The reasoning steps, tool invocations, and decisions made in between. If your agent starts behaving unexpectedly, would you know before your users do?

Do you have hard guardrails on destructive operations? Rate limits, scope restrictions, and confirmation requirements for any action that cannot be undone.

Would you know if the agent ignored an instruction? If an agent overrides a constraint, is there anything in your current setup that would surface that before the damage reaches a user?

Can you reproduce what happened after a failure? If a user reports wrong behavior, do you have a trace you can replay, or are you starting from scratch with a black box?

If the honest answer to any of these is no, you have the same visibility gap that Replit had. Your users are in Lemkin's position: they will find out about agent failures from the output, after the fact. And Lemkin's situation was a CRM with a few thousand records. The stakes only go up from here.

I'm building an AI agent observability platform because this tooling gap shouldn't exist. If you're shipping agents to production and dealing with the same visibility problems, I'd like to hear from you.

Book a 15-min conversation →

No pitch. Real conversations about real production problems.

Sources

Replit incident reporting: Fortune, The Register, Fast Company (Replit CEO interview), eWeek, AI Incident Database #1152
Jason Lemkin's original post: X/@jasonlk
$47,000 multi-agent loop: Tech Startups
Terraform production wipe (Alexey Grigorev): X/@Al_Grigor, Tom's Hardware

I Evaluated Every AI Agent Observability Tool on the Market. Here's What's Actually Missing.

utibe okodi — Sun, 15 Mar 2026 22:36:29 +0000

If you're shipping AI agents to production in 2026, you've probably already Googled "AI agent observability tools" and found a dozen options. LangSmith. Langfuse. Datadog. Arize. Helicone. Braintrust. The list keeps growing.

The stakes for getting this choice right are higher than most teams realize. MIT's NANDA initiative found that only ~5% of AI pilot programs achieve rapid revenue acceleration. IBM's 2025 CEO Study (surveying 2,000 CEOs) found that only 25% of AI initiatives delivered expected ROI. The common thread in the failures: teams couldn't see what their agents were doing in production, so they couldn't fix what was broken.

I spent the last several weeks evaluating every major observability tool on the market: reading docs, testing free tiers, pulling apart pricing pages, and talking to engineering teams who use them daily. What I found is that the market has converged on a set of baseline features that most tools now offer. But the gaps between what teams actually need and what these tools deliver are significant, and they're the gaps causing the most expensive production failures.

Here's what I learned.

The Baseline Is Table Stakes

Let's start with what's no longer a differentiator. Every serious tool in the space now offers some version of:

LLM call logging: input/output capture, token counts, latency
Basic cost tracking: at least per-model, sometimes per-trace
Prompt management: versioning, playground, A/B comparison
Simple evaluations: LLM-as-judge scoring or custom metrics

If a tool doesn't have these in 2026, it's not in the conversation. The question is what comes after the baseline, because that's where production agent debugging actually lives.

How the Market Segments

The current landscape breaks into four categories, each with a specific trade-off:

1. Framework-Native Tools

LangSmith is the dominant player here. Deep integration with LangChain and LangGraph, nearly 30,000 new monthly signups, and a feature-complete platform spanning tracing, evals, datasets, and a prompt playground.

The trade-off: per-seat pricing that scales against you. At $39/user/month, a 25-person engineering team pays $975/month. And while LangSmith now supports multi-provider cost tracking (launched December 2025), cost estimates have been reported as inaccurate (showing ~$0.30 for a $1.40 conversation). SSO and RBAC are locked behind the Enterprise tier.

For teams already deep in the LangChain ecosystem, LangSmith is the path of least resistance. For everyone else, you're paying a premium for integration depth you may not use.

2. Open-Source Self-Hosted

Langfuse is the strongest option here: MIT-licensed, framework-agnostic, with solid eval and dataset features. The cloud tier starts at $29/month, and self-hosting is free.

The trade-off: self-hosting means your team is maintaining infrastructure instead of building product. And if you want SSO/RBAC on the cloud tier, that's a $300/month add-on. For a 5-person startup, self-hosting Langfuse is a viable option. For a 50-person team that needs enterprise controls, the total cost of ownership adds up fast, in engineering hours, not just dollars.

3. Enterprise Observability Extensions

Datadog and Arize represent the enterprise approach: bolt AI observability onto existing monitoring infrastructure.

Datadog's LLM Observability (expanded in June 2025) bills based on LLM span counts, with an automatic ~$120/day premium activated when LLM spans are detected, with no opt-out, putting moderate-scale teams at $3,600+/month before usage charges. Arize offers a $50/month Pro tier for its managed platform (Phoenix is the free open-source self-hosted version) but jumps to an estimated $50K–$100K/year at enterprise scale.

The trade-off: pricing designed for enterprises with enterprise budgets, and setup timelines to match. If you're already paying Datadog for infrastructure monitoring and have 6 months for implementation, this can work. For everyone else, you're paying for infrastructure monitoring capabilities you already have, bundled with AI features that don't go deep enough.

4. Evaluation-First Platforms

Braintrust takes an eval-first approach with strong testing and human review workflows. Helicone goes the gateway route: simple setup (just change your API URL), with caching and rate limiting built in.

The trade-offs: Braintrust has a $249/month platform fee with nothing between free and that price, a steep cliff for small teams. Helicone's Pro tier is $79/month with unlimited seats, but the gateway-only approach means less detailed trace inspection than SDK-based tools. HoneyHive offers a free Developer tier (10K events/month, up to 5 users), but its paid Enterprise tier is contact-sales-only with no published pricing, which is opaque for a seed-stage company. Maxim AI charges $29–$49/seat, a per-seat pricing model that punishes team growth.

5. Developer-First Open-Source

AgentOps is worth mentioning as an emerging player: MIT-licensed with support for 400+ LLMs and frameworks and a developer-friendly SDK. It recently launched a TypeScript SDK (v0.1.0, June 2025) and self-hosting support.

The trade-off: the TypeScript SDK is early-stage with limited functionality compared to the Python SDK. If your stack is Python-heavy, AgentOps is a viable lightweight option. If you need TypeScript parity or enterprise features, it's not there yet.

How They Compare at a Glance

	LangSmith	Langfuse	Helicone	Braintrust	Arize	Datadog	AgentOps
Pricing Model	Per-seat	Usage-based	Flat + usage	Platform fee	Usage-based	Span-based	Free / usage
Entry Price	$39/seat	$29/mo	$79/mo	$249/mo	$50/mo	~$3.6K/mo+	Free
25-Eng Team Cost	~$975/mo	~$29–329/mo	~$79–799/mo	~$249/mo	$50–$4K+/mo	$3.6K+/mo	Free–usage
Framework Support	LangChain-first	Agnostic	Agnostic	Agnostic	Agnostic	Agnostic	Agnostic
Trace Visualization	Good	Good	Basic	Good	Good + graph	Basic	Basic
Multi-Provider Cost	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Evaluations	Yes	Yes	No	Excellent	Yes	Limited	Basic
Self-Hosting	Enterprise only	Yes (MIT)	Yes	No	Yes (MIT)	No	Yes (MIT)
SSO/RBAC	Enterprise tier	$300/mo add-on	Team tier	Pro tier	Enterprise tier	Included	No
Setup Time	2–4 hours	1–2 hours	<1 hour	1–2 hours	1–2 hours	Days–weeks	<1 hour

Pricing as of March 2026. Enterprise tiers are custom/contact-sales for most tools.

The Six Gaps Nobody Has Closed

Here's where it gets interesting. Across every tool I evaluated, six capabilities are either missing entirely or poorly implemented. These aren't nice-to-haves; they're the features that would actually prevent the most expensive production incidents.

Gap 1: Visual Decision-Tree Debugging

Every tool on the market shows traces the same way: as a flat table of spans or a sequential waterfall chart. This works for simple chain-of-thought workflows. It breaks down completely for multi-agent systems where agents make branching decisions.

When Agent A delegates to Agent B instead of Agent C, and Agent B calls two tools in parallel, and the combined results trigger a third agent, you need to see this as what it is: a decision tree, not a sequential log.

Arize Phoenix has introduced an Agent Visibility tab with basic graph visualization. But LangSmith, Langfuse, Braintrust, Helicone, Opik, and HoneyHive still rely on tabular or span-level views. The interactive decision tree (where you can click into any branch point and see why the agent chose that path) remains an unsolved UX problem.

Gap 2: Silent Failure Detection

This is the failure mode most teams don't even know to look for: agents that skip tool execution entirely and fabricate the results.

Instead of calling your database or search API, the agent generates a plausible-looking response as if it had. No error is thrown. The output looks normal. But the data is completely made up.

This is documented across every major framework: crewAI, LangGraph, AutoGen, and at the model level with OpenAI. Academic research has found tool hallucination rates as high as 91.1% for specific models under adversarial test conditions.

No existing observability tool detects this. They trace the span, record the output, and move on, never verifying that the tool was actually executed and the result matches reality.

Gap 3: Cross-Framework Multi-Agent Traces

Multi-agent architectures are the fastest-growing pattern in AI development. But multi-agent tracing is broken in every existing tool.

LangSmith can trace LangChain applications natively, but CrewAI traces fail to appear in LangSmith entirely despite correct environment configuration, and unified cross-framework traces (a LangChain agent handing off to a CrewAI agent in the same trace) remain unsupported.

Langfuse shows wrong inputs per agent in supervisor orchestration, and users have reported that identical generation names make it "impossible to target accurately a specific agent" when configuring per-agent evaluations. LLM spans are dropped in AutoGen tool loops.

Arize Phoenix has added graph visualization, but multi-agent trace consolidation requires manual context propagation and lacks built-in support for agent collaboration structures.

You can see that Agent A called Agent B. You cannot see why it chose Agent B over Agent C, what context was lost in the handoff, or why negotiations between agents converged on a suboptimal plan.

Gap 4: True OTel-Native Instrumentation

Enterprises already run OpenTelemetry for backend services. AI agents should emit traces into the same system, not require a separate vendor with a separate SDK.

But OTel's semantic conventions for AI agents are still in "Development" status as of March 2026. The conventions cover individual LLM calls but lack standard attributes for agent orchestration, tool call semantics, multi-agent communication, memory operations, evaluation results, and cost attribution. Each vendor extends the base conventions differently, creating fragmentation rather than standardization.

Arize Phoenix is the closest to OTel-native, but the pricing cliff from $50/month to $50K–$100K/year locks out mid-market teams. Datadog supports OTel-native ingestion, but its full feature set still relies on the proprietary ddtrace SDK.

Gap 5: Cost Optimization, Not Just Cost Tracking

Every tool tracks what you spent. No tool tells you how to spend less.

The waste is quantifiable. Without a control layer, teams overpay for redundant API calls: the same semantic question phrased differently bypasses exact-match caches entirely without semantic matching. Output tokens cost 4–6x more than input tokens, yet most teams don't set appropriate max_tokens limits. Semantic caching (recognizing that similar questions don't need fresh inference) has shown up to ~73% cost reduction in high-repetition workloads (Redis LangCache benchmarks).

These optimizations exist at the infrastructure layer. But no observability tool surfaces them automatically from your trace data. Specific opportunities that should be flagged:

Model downgrade suggestions: "This task uses GPT-4 but GPT-3.5 produces equivalent quality for 90% of inputs. Estimated savings: $X/month"
Caching opportunity identification: "32% of your LLM calls have >95% input similarity to previous calls. Semantic caching would save $X/month"
Provider arbitrage: "For embedding tasks, switching from OpenAI to Voyage AI reduces costs 60% with <1% quality difference"
Batch vs. real-time routing: "47% of your executions are background processing; batch API pricing saves 50%"

FinOps tools like CloudHealth, Vantage, and Kubecost proved this model in cloud infrastructure. The AI equivalent doesn't exist yet.

Gap 6: Automated Root Cause Analysis

Every tool tells you a failure happened. None of them tell you why.

When a trace shows a wrong output, the current workflow is: open the trace, scan each span manually, form a hypothesis, check the retrieval step, check the synthesis step, check whether the tool was actually called. This is slow and requires the engineer to already understand the agent's architecture well enough to know where to look.

What automated RCA would do: when a failure is flagged (by an eval score, a user report, or an anomaly alert), the tooling classifies which layer broke (retrieval, reasoning, planning, or tool execution), surfaces the specific span where the failure originated, and produces a plain-language summary of the likely cause. The first thing you see is a diagnosis, not a log to excavate. Teams could also define expected execution profiles for their agent (which tools should be called under what conditions, what a correct retrieval result looks like, what the normal decision path is for a given input type), and the RCA engine reasons against those expectations rather than generic heuristics, producing diagnoses like "this refund query should have called check_purchase_date before responding; this trace skipped it" instead of just "planning layer failure."

The closest existing capability is LLM-as-judge scoring, which can label an output as wrong but cannot trace the cause back through the execution graph. Root cause analysis requires correlating the final output quality against every upstream decision point in the trace, a step none of the current tools automate.

What This Means for Your Team

If you're evaluating tools today, here's the honest assessment:

If you're a solo developer or small team (1-5 engineers):
Langfuse self-hosted or Helicone's free tier will cover basic tracing. You'll outgrow them quickly, but they're the right starting point at zero cost.

If you're a growing team (5-25 engineers):
This is the underserved segment. LangSmith's per-seat pricing starts hurting. Langfuse cloud needs SSO/RBAC add-ons. Braintrust's $249 platform fee is a cliff. Enterprise tools are overkill. You need usage-based pricing that doesn't penalize team growth, with enterprise features included, not locked behind another tier.

If you're enterprise (50+ engineers):
Datadog or Arize can work if you have the budget and timeline. But you'll still have the six gaps above, and they'll become more painful as your agent architectures grow more complex.

The Market Is Moving, But Not Fast Enough

The AI agent observability market is projected to grow from $1.4B in 2023 to $10.7B by 2033, a 22.5% CAGR. Meanwhile, roughly 9 in 10 respondents (89% in tech, 90% in non-tech) have deployed or are planning to deploy AI agents in production.

The tools are racing to keep up. But they're converging on the same baseline features while leaving the hard problems (decision-tree visualization, silent failure detection, cross-framework tracing, OTel-native instrumentation, cost optimization, and automated root cause analysis) unsolved.

The teams that will succeed with AI agents in production are the ones that can see exactly what their agents are doing, understand why they fail, and optimize costs before the bill becomes untenable. Right now, no single tool delivers all of that.

I'm researching this space and talking to engineering teams about how they debug AI agents in production. If you're navigating this evaluation, or have already picked a tool and found its limitations, I'd genuinely like to hear your experience.

Book a 15-min conversation →

No sales pitch. I'm collecting real data on these gaps and happy to share what I'm learning from other teams.

Sources

Market.us, AI in Observability Market Size Report: $1.4B (2023) to $10.7B (2033), 22.5% CAGR.
MIT NANDA, "The GenAI Divide: State of AI in Business 2025" (150 interviews, 350 surveys, 300 deployment analyses). Finding: ~5% of AI pilots achieve rapid revenue acceleration.
IBM, 2025 CEO Study, 2,000 CEOs surveyed. Finding: 25% of AI initiatives delivered expected ROI.
LangChain, State of AI Agents Report, 1,300+ professionals surveyed. Finding: 89% of tech respondents (90% non-tech) have deployed or plan to deploy agents.
Silent failure detection: crewAI#3154, LangGraph RFC#6617, AutoGen#3354, OpenAI Community, arXiv 2412.04141
OTel GenAI semantic conventions: opentelemetry.io, development status as of March 2026.
Multi-agent tracing issues: langsmith-sdk#1350, langfuse#9429, langfuse discussion#7569, langfuse#11505
Cost optimization data: Redis LLM Token Optimization, Maxim AI: Top 5 AI Gateways, Catchpoint: Semantic Caching
Pricing: LangSmith, Langfuse, Braintrust, Arize Phoenix, Datadog, Helicone, Maxim AI, HoneyHive, AgentOps

Your AI Agent Just Failed in Production. Where Do You Even Start Debugging?

utibe okodi — Sun, 15 Mar 2026 22:34:57 +0000

You shipped an AI agent to production. A user reports a wrong answer. Or worse, a user doesn't report anything, and you discover the problem later, after it has already spread.

You open your monitoring dashboard. You see: an input, an output, and a timestamp. That's it.

This is the debugging reality for most teams shipping AI agents in 2026. MIT's NANDA initiative found that only 5% of AI pilot programs achieve rapid revenue acceleration, with the rest stalling due to integration gaps, organizational misalignment, and tools that don't adapt to enterprise workflows. Compounding these problems: when agents do fail, most teams have no way to diagnose what went wrong fast enough to sustain momentum.

Here's a practical debugging framework for AI agents in production, along with an honest assessment of where current tooling leaves you on your own.

Why AI Agent Debugging Is Different

Traditional software fails in deterministic ways. If your API returns a 500, you find the stack trace. If your query is slow, you find the query plan. The failure is reproducible and the cause is traceable.

AI agents fail in ways that are probabilistic, context-dependent, and often invisible:

The model hallucinated despite correct context
The retrieved documents were relevant but the wrong paragraph was weighted
The agent decided to skip a tool call and fabricate the result instead
The multi-step chain worked correctly on 1,000 inputs but fails on input 1,001 due to a subtle edge case in your prompt template
The agent called three tools successfully but combined their outputs incorrectly

Standard APM tools (Datadog, New Relic, Grafana) show you latency, error rates, and throughput. They tell you that the agent is failing, not why, because they have no visibility into the reasoning steps between input and output.

Step 1: Establish a Full Execution Trace

The first requirement for debugging an AI agent is a trace of every step in the chain, not just the LLM call.

A typical multi-step agent does something like:

Receive a user query
Make a planning decision about which tools to invoke
Call an LLM to generate a search query
Retrieve documents from a vector database
Call the LLM again to synthesize an answer
Decide whether to use another tool or respond
Generate a final response

When this fails, you need to know which step produced the wrong output, and that requires a trace that captures the input and output at each node, not just the final result.

LangChain's State of AI Agents report found that 51% of 1,300+ professionals surveyed already have AI agents running in production. The vast majority of them are debugging blind because they lack this baseline trace coverage.

If you're instrumenting from scratch, use an SDK that captures tool invocations, retrieval operations, LLM calls, and planning steps as discrete spans, not just as text in a log file.

Step 2: Isolate the Failure Layer

Once you have a trace, you can diagnose which layer broke. There are four common failure layers:

Retrieval failure: The agent retrieved documents, but the wrong ones. The LLM received irrelevant context and did its best with bad input. Inspect the retrieved chunks against the query. Is the embedding model capturing the right semantic content? Are your document chunks too large or too small?

Reasoning failure: Retrieval returned correct context, but the LLM ignored the most relevant section. This often happens when the context window is filled with tool call outputs from earlier steps, pushing key content toward the end where attention scores drop. Inspect the full context window at the synthesis step, not just the query.

Planning failure: The agent made a wrong tool selection. It chose a web search when it should have queried the internal database, or it chose to respond directly when it should have called a calculator. Trace the decision point: what prompt template was the agent using for tool selection, and what was the exact LLM output at that step?

Tool execution failure: The agent attempted a tool call, but the tool returned an error, a timeout, or an empty result, and the agent continued anyway without surfacing the failure. Trace each tool call's input, output, latency, and error status separately.

Step 3: Check for Silent Failures

Here's the debugging step most teams skip: checking whether the tool was actually executed at all.

A documented failure mode across every major agentic framework is agents that skip tool execution entirely and fabricate plausible-looking results. Instead of calling your database, the agent generates a response as if it had queried it, with no error thrown and no indication that the data is made up.

This is documented as bug reports in crewAI and AutoGen, acknowledged as a production reliability gap in LangGraph's RFC#6617, and reported at the model level with OpenAI. Academic research has measured tool hallucination rates as high as 91.1% on challenging subsets.

When debugging a wrong answer, always verify: does the trace show the tool was called, and does the tool call's recorded response match what the agent reported in its synthesis? If the trace shows no tool invocation for a step that should have involved one, or if the tool response and the agent's output don't align, you've found the failure.

No existing observability tool automatically detects this mismatch. It's a manual check today.

Step 4: Examine Multi-Agent Handoffs

For multi-agent systems, the hardest failures to diagnose happen at handoff boundaries. When Agent A delegates to Agent B:

What context did Agent A send to Agent B?
Was anything lost or truncated in the handoff?
Did Agent B receive the full conversation history, or just a summary?
If the overall result was wrong, which agent's decision caused it?

The practical workaround today: log handoff context explicitly at agent boundaries (what was sent, what was received), and instrument each agent as a separate root span that you correlate manually.

Step 5: Don't Debug One Failure. Evaluate at Scale.

Single-case debugging tells you what broke. Evaluation at scale tells you how often things break, and whether your fix actually worked.

The difference between a demo and production isn't that demo prompts are better. It's that demo inputs are cherry-picked. A prompt that handles 10 hand-tested inputs perfectly may fail on 8% of real user inputs in ways you've never seen before.

Automated evaluation (using LLM-as-judge scoring for relevance, coherence, and hallucination detection across every trace) turns debugging from reactive fire-fighting into a proactive quality system. When you fix a failure, you should be able to run the fix against your full historical trace dataset and verify the improvement, not just against the one case that surfaced the bug.

What Good Tooling Would Give You

None of the current generation of observability tools solves the full debugging workflow above. Here's what the ideal tooling would provide:

Full execution graph visualization: not a flat span list, but an interactive decision tree showing exactly which path the agent took and why, with each branch labeled by the deciding LLM output
Silent failure detection: automatic verification that tool invocations in the trace match actual tool execution records, with alerts when they diverge
Cross-framework multi-agent correlation: unified traces across LangChain, CrewAI, AutoGen, and custom agents, with handoff context preserved at every boundary
Regression testing from traces: the ability to take any historical trace, modify the prompt or configuration, and re-run the agent against the same input to verify a fix
Automated root cause analysis: when a failure is detected, the tooling should automatically classify which layer broke (retrieval, reasoning, planning, or tool execution), surface the specific span where the failure originated, and summarize the likely cause, so the first thing you see is a diagnosis, not a log to excavate

The market is moving toward these capabilities, but none of the current tools deliver them reliably. Which means most teams are still debugging AI agents the hard way: manually reading logs, adding print statements, and hoping the issue reproduces.

I'm researching how engineering teams debug AI agents in production, and building tooling to close these gaps. If you're actively shipping agents and have 15 minutes to share what your debugging workflow looks like today, I'd like to hear it.

Book a 15-min conversation →

No pitch. Real conversations about real debugging problems.

Sources

MIT NANDA, "The GenAI Divide: State of AI in Business 2025" (150 interviews, 350 surveys, 300 deployment analyses). Finding: ~5% of AI pilots achieve rapid revenue acceleration.
LangChain, State of AI Agents Report (1,300+ professionals surveyed). Finding: 51% have agents in production.
Silent failure detection: crewAI#3154, LangGraph RFC#6617, AutoGen#3354, OpenAI Community, arXiv 2412.04141
Multi-agent tracing issues: langsmith-sdk#1350, langfuse#9429, langfuse discussion#7569, Arize Phoenix multi-agent docs

95% of AI Pilots Fail. The Ones That Succeed All Do This One Thing.

utibe okodi — Tue, 17 Feb 2026 21:17:10 +0000

Enterprises are pouring money into AI agents. The results are brutal.

MIT's NANDA initiative just published "The GenAI Divide: State of AI in Business 2025" — a study based on 150 leader interviews, 350 employee surveys, and analysis of 300 public AI deployments. The headline finding: about 5% of AI pilot programs achieve rapid revenue acceleration. The vast majority stall, delivering little to no measurable impact on P&L.

That's despite $30–40 billion in enterprise spending on generative AI.

Meanwhile, IBM's 2025 CEO Study — surveying 2,000 CEOs — found that only 25% of AI initiatives have delivered expected ROI, and just 16% have been scaled across the enterprise.

So what separates the 5% from the 95%?

The Debugging Black Box Problem

According to LangChain's State of AI Agents report, 51% of the 1,300+ professionals surveyed already have AI agents running in production. Another 78% have active plans to deploy soon. Mid-sized companies (100–2,000 employees) are the most aggressive — 63% already have agents live.

But here's the gap: most teams shipping agents to production cannot see what those agents are actually doing.

A typical multi-step AI agent might:

Receive a user query
Make a planning decision about which tools to invoke
Call an LLM to generate a search query
Retrieve documents from a vector database
Call the LLM again to synthesize an answer
Decide whether to use another tool or respond
Generate a final response

When this chain breaks — and it will — where did it go wrong? Was it the retrieval step returning irrelevant documents? The LLM hallucinating despite good context? A tool call timing out silently? A prompt template that worked in testing but fails on edge cases?

Without distributed tracing, you're debugging blind. You get an input and an output, with no visibility into the six steps in between.

What the 5% Do Differently

The teams that extract real value from AI agents treat them like any other production system: they instrument them.

The MIT NANDA study found that the core differentiator wasn't talent, infrastructure, or regulation. It was learning, integration, and contextual adaptation — which requires understanding how your agents behave in the real world, not just in a Jupyter notebook.

Concretely, the teams that succeed do three things:

1. They Trace Every Step

Not just the LLM call — every tool invocation, every decision point, every data retrieval. A proper trace shows you the full execution graph: what the agent decided to do, what data it accessed, what the LLM returned at each step, and how long each operation took.

This is the difference between "the agent gave a wrong answer" and "the agent retrieved the right documents but the LLM ignored the most relevant paragraph because the context window was filled with a previous tool call's output."

2. They Track Costs Across Providers

A single agent workflow might hit OpenAI for reasoning, Anthropic for evaluation, and Google for embeddings. Most teams have no idea what a single agent run actually costs — let alone how that breaks down by user, feature, or team.

When you're running 10,000 agent executions a day across three LLM providers, the bill is not theoretical. And without per-trace cost attribution, you can't optimize what you can't measure.

3. They Evaluate Quality Continuously

The difference between a demo and production isn't speed — it's quality at scale. A single hand-tested prompt doesn't tell you how the agent performs across 10,000 different user inputs.

Automated evaluation — using LLM-as-judge scoring for relevance, coherence, and hallucination detection on every trace — turns observability from a debugging tool into a quality system.

The Gaps Nobody Is Talking About

Beyond basic tracing and cost tracking, there are deeper failure modes that existing tools don't address at all — and they're the ones causing the most expensive production incidents.

Silent Failure Detection: When Agents Lie About Working

Here's a failure mode most teams don't even know to look for: agents that skip tool execution entirely and fabricate the results.

Instead of actually calling your database, search API, or calculation tool, the agent generates a plausible-looking response as if it had. The output looks normal. No error is thrown. But the data is completely made up.

This isn't theoretical. It's documented across every major framework — crewAI, LangGraph, AutoGen, and even at the model level with OpenAI. Academic research has found tool hallucination rates as high as 91.1% on challenging subsets. LangGraph proposed a "grounding" parameter (RFC #6617) to address this, but hasn't shipped it.

No existing observability tool detects this. They trace the span, record the output, and move on — never verifying that the tool was actually executed and the result matches reality.

Visual Decision-Tree Debugging: Seeing What the Agent Actually Decided

Every observability tool on the market shows you traces the same way: as a flat table of spans, or a sequential waterfall chart. This works for simple chain-of-thought workflows. It completely breaks down for multi-agent systems where agents make branching decisions.

When Agent A decides to delegate to Agent B instead of Agent C, then Agent B decides to call two tools in parallel, and the combined results trigger a third agent — you need to see this as what it is: a decision tree, not a sequential log.

While Arize Phoenix has introduced an Agent Visibility tab with basic graph visualization, no tool offers a fully interactive decision-tree view of agent execution paths. LangSmith, Langfuse, Braintrust, Helicone, Opik, and HoneyHive still rely on tabular or span-level views. The visual decision tree remains a largely unsolved UX problem in this market.

Multi-Agent Traces That Actually Work Across Frameworks

Multi-agent architectures are the fastest-growing pattern in AI development. But multi-agent tracing is broken in every existing tool:

LangSmith can now trace individual CrewAI and AutoGen applications via OpenTelemetry, but it still cannot produce unified cross-framework multi-agent traces — a pipeline where a LangChain agent hands off to a CrewAI agent in the same trace breaks due to context propagation gaps.
Langfuse shows wrong inputs per agent in supervisor orchestration, and users have reported that identical generation names make it "impossible to target accurately a specific agent" when configuring per-agent evaluations. Langfuse has partially addressed this by switching to the OpenInference Instrumentation Library, though modifying LLM generation names remains difficult. Additionally, LLM spans are dropped in AutoGen tool loops — though this stems from AutoGen's instrumentation rather than a Langfuse bug.
Arize Phoenix has added an Agent Visibility tab with graph visualization, but multi-agent trace consolidation requires manual context propagation and it lacks built-in support for agent collaboration structures. Opik offers agent graph logging, but graph specification is manual for some frameworks. Braintrust, Helicone, and Maxim AI offer basic session and span grouping (Braintrust traces, Helicone sessions), but lack dedicated multi-agent orchestration tooling — they don't natively distinguish agent boundaries, handoff context, or inter-agent delegation logic.

You can see that Agent A called Agent B. You cannot see why Agent A chose Agent B over Agent C, what context was lost in the handoff, or why the negotiation between three agents converged on a suboptimal plan.

OTel-Native Tracing: Bridging AI Into Enterprise Infrastructure

Enterprises already run OpenTelemetry for their backend services. Their AI agents should emit traces into the same system — not require a separate vendor with a separate SDK.

But OTel's semantic conventions for AI agents are still in "Development" status as of February 2026. The conventions cover individual LLM calls but lack standard attributes for agent orchestration (planning, tool selection, delegation), tool call semantics, multi-agent communication, memory operations, evaluation results, and cost attribution. Each vendor that claims "OTel support" extends the base conventions differently, creating fragmentation.

Arize Phoenix is the closest to OTel-native, but it jumps from $50/month to $50K–$100K/year at the enterprise tier — a pricing cliff that locks out mid-market teams. Datadog now supports OTel-native ingestion for LLM observability, but its full feature set still relies on the ddtrace SDK.

The Current Tooling Landscape Falls Short

The existing options each have trade-offs that leave mid-market teams underserved:

LangSmith charges $39/seat/month. For a 25-person engineering team, that's $975/month — and it's designed LangChain-first. Cost tracking is also inaccurate — showing ~$0.30 for a $1.40 conversation. SSO/RBAC is locked behind the Enterprise tier.
Langfuse offers a strong open-source option, but self-hosting means your team is maintaining infrastructure instead of building product. SSO/RBAC is a $300/month add-on.
Braintrust has a $249/month platform fee with nothing between free and that — a steep pricing cliff for small teams.
Datadog charges $8 per 10K LLM requests plus an automatic ~$120/day premium when LLM spans are detected, putting moderate-scale teams at $5K+/month. Arize enterprise pricing is estimated at $50K–$100K/year — a steep jump from their $50/month Pro tier. Both involve enterprise sales cycles that can extend procurement timelines significantly.
AgentOps now offers a TypeScript SDK (v0.1.0, June 2025) and self-hosting under MIT license, but the TS SDK is early-stage with limited functionality compared to the Python SDK.
Maxim AI charges $29–$49/seat — per-seat pricing that punishes team growth.
HoneyHive offers a free Developer tier (10K events/month, up to 5 users), but its paid Enterprise tier is contact sales only with no published pricing — despite being a seed-stage company.

What's missing is agent-native observability that gives you visual decision-tree debugging, multi-agent trace correlation, silent failure detection, and OTel-native instrumentation — without per-seat pricing that scales against you, or infrastructure you have to run yourself.

What Would You Actually Need?

If you could see every step your AI agent takes — as an interactive decision tree, not a flat table — understand exactly why it failed, catch silent fabrications before they reach users, and plug directly into your existing OTel infrastructure — what would that change for your team?

I'm researching how engineering teams debug AI agents in production. If you're building with LangChain, CrewAI, AutoGen, or custom agents and have 15 minutes, I'd genuinely like to hear how you approach this today.

Book a 15-min conversation →

No pitch. I'm collecting real data on this problem and happy to share what I'm learning from other teams.

Sources

MIT NANDA, "The GenAI Divide: State of AI in Business 2025" — 150 interviews, 350 surveys, 300 deployment analyses. Finding: ~5% of AI pilots achieve rapid revenue acceleration.
IBM, 2025 CEO Study — 2,000 CEOs surveyed. Finding: 25% of AI initiatives delivered expected ROI; 16% scaled enterprise-wide.
LangChain, State of AI Agents Report — 1,300+ professionals surveyed. Finding: 51% have agents in production; 63% of mid-sized companies (100–2,000 employees) have agents live.
Silent failure detection: crewAI#3154, LangGraph RFC#6617, AutoGen#3354, OpenAI Community, arXiv 2412.04141
OTel GenAI semantic conventions: opentelemetry.io — Development status as of Feb 2026.
Multi-agent tracing issues: langsmith-sdk#1350, langfuse#9429, langfuse discussion#7569, langfuse#11505
Pricing: LangSmith, Langfuse, Braintrust, Arize Phoenix, Maxim AI, Helicone