Enterprises are pouring money into AI agents. The results are brutal.
MIT's NANDA initiative just published "The GenAI Divide: State of AI in Business 2025" — a study based on 150 leader interviews, 350 employee surveys, and analysis of 300 public AI deployments. The headline finding: about 5% of AI pilot programs achieve rapid revenue acceleration. The vast majority stall, delivering little to no measurable impact on P&L.
That's despite $30–40 billion in enterprise spending on generative AI.
Meanwhile, IBM's 2025 CEO Study — surveying 2,000 CEOs — found that only 25% of AI initiatives have delivered expected ROI, and just 16% have been scaled across the enterprise.
So what separates the 5% from the 95%?
The Debugging Black Box Problem
According to LangChain's State of AI Agents report, 51% of the 1,300+ professionals surveyed already have AI agents running in production. Another 78% have active plans to deploy soon. Mid-sized companies (100–2,000 employees) are the most aggressive — 63% already have agents live.
But here's the gap: most teams shipping agents to production cannot see what those agents are actually doing.
A typical multi-step AI agent might:
- Receive a user query
- Make a planning decision about which tools to invoke
- Call an LLM to generate a search query
- Retrieve documents from a vector database
- Call the LLM again to synthesize an answer
- Decide whether to use another tool or respond
- Generate a final response
When this chain breaks — and it will — where did it go wrong? Was it the retrieval step returning irrelevant documents? The LLM hallucinating despite good context? A tool call timing out silently? A prompt template that worked in testing but fails on edge cases?
Without distributed tracing, you're debugging blind. You get an input and an output, with no visibility into the five steps in between.
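The chain above can be instrumented with one span per step. A minimal sketch in plain Python — the tracer here is a hypothetical stand-in (a real deployment would use the OpenTelemetry SDK), and the planner, retriever, and synthesizer are stubs:

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-process tracer: one span per agent step,
# so a failure can be localized to a specific operation.
TRACE = []

@contextmanager
def span(name, **attrs):
    s = {"id": uuid.uuid4().hex[:8], "name": name, "attrs": attrs}
    start = time.perf_counter()
    try:
        yield s
        s["status"] = "ok"
    except Exception as exc:
        s["status"] = f"error: {exc}"
        raise
    finally:
        s["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        TRACE.append(s)

def run_agent(query):
    with span("plan", query=query):
        tools = ["vector_search"]          # stubbed planner output
    with span("retrieve", tool=tools[0]):
        docs = ["doc-1", "doc-2"]          # stubbed retrieval
    with span("synthesize", n_docs=len(docs)):
        answer = f"answer based on {len(docs)} docs"
    return answer

run_agent("What is our refund policy?")
for s in TRACE:
    print(s["name"], s["status"], s["duration_ms"], "ms")
```

When the retrieval step returns junk or the synthesis step times out, the trace pinpoints which span failed and how long it ran — instead of leaving you with an opaque input/output pair.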
What the 5% Do Differently
The teams that extract real value from AI agents treat them like any other production system: they instrument them.
The MIT NANDA study found that the core differentiator wasn't talent, infrastructure, or regulation. It was learning, integration, and contextual adaptation: capabilities that require understanding how your agents behave in the real world, not just in a Jupyter notebook.
Concretely, the teams that succeed do three things:
1. They Trace Every Step
Not just the LLM call — every tool invocation, every decision point, every data retrieval. A proper trace shows you the full execution graph: what the agent decided to do, what data it accessed, what the LLM returned at each step, and how long each operation took.
This is the difference between "the agent gave a wrong answer" and "the agent retrieved the right documents but the LLM ignored the most relevant paragraph because the context window was filled with a previous tool call's output."
2. They Track Costs Across Providers
A single agent workflow might hit OpenAI for reasoning, Anthropic for evaluation, and Google for embeddings. Most teams have no idea what a single agent run actually costs — let alone how that breaks down by user, feature, or team.
When you're running 10,000 agent executions a day across three LLM providers, the bill is not theoretical. And without per-trace cost attribution, you can't optimize what you can't measure.
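Per-trace cost attribution is mostly bookkeeping once spans carry token counts. A sketch, with placeholder prices — the figures below are illustrative, not current provider rates:

```python
# (input, output) USD per 1K tokens — placeholder numbers, not real rates.
PRICE_PER_1K = {
    ("openai", "gpt-4o"): (0.0025, 0.010),
    ("anthropic", "claude-sonnet"): (0.003, 0.015),
    ("google", "embedding"): (0.0001, 0.0),
}

def span_cost(provider, model, tokens_in, tokens_out):
    p_in, p_out = PRICE_PER_1K[(provider, model)]
    return tokens_in / 1000 * p_in + tokens_out / 1000 * p_out

def trace_cost(spans):
    # Sum every LLM span in a single agent run; tag by user or
    # feature upstream to get the per-dimension breakdown.
    return round(sum(span_cost(*s) for s in spans), 6)

run = [
    ("openai", "gpt-4o", 1200, 300),           # reasoning
    ("anthropic", "claude-sonnet", 800, 150),  # evaluation
    ("google", "embedding", 500, 0),           # embeddings
]
print(trace_cost(run))  # → 0.0107
```

Multiply that per-run figure by 10,000 daily executions and the optimization targets (oversized contexts, redundant evaluation calls) become obvious.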
3. They Evaluate Quality Continuously
The difference between a demo and production isn't speed — it's quality at scale. A single hand-tested prompt doesn't tell you how the agent performs across 10,000 different user inputs.
Automated evaluation — using LLM-as-judge scoring for relevance, coherence, and hallucination detection on every trace — turns observability from a debugging tool into a quality system.
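The scoring loop itself is simple. A sketch of judging every trace — `call_judge_llm` is a stub standing in for a real provider client, and the prompt and threshold are illustrative:

```python
# LLM-as-judge scoring on every trace. The judge call is stubbed;
# swap in a real client (OpenAI, Anthropic, etc.) in production.
JUDGE_PROMPT = (
    "Rate the answer 0-10 for relevance to the question, penalizing "
    "any claim not supported by the context. Reply with a number only."
)

def call_judge_llm(prompt):
    return "8"  # stub — a real call would send `prompt` to a model

def score_trace(question, context, answer):
    prompt = f"{JUDGE_PROMPT}\n\nQ: {question}\nContext: {context}\nA: {answer}"
    score = int(call_judge_llm(prompt).strip())
    # Route low-scoring traces to human review instead of silently passing.
    return {"score": score, "flagged": score < 5}

result = score_trace("Refund window?", "Refunds within 30 days.", "30 days.")
print(result)  # → {'score': 8, 'flagged': False}
```

Run on every production trace rather than a hand-picked sample, this turns "the agent seems fine" into a measurable quality signal with an alertable threshold.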
The Gaps Nobody Is Talking About
Beyond basic tracing and cost tracking, there are deeper failure modes that existing tools don't address at all — and they're the ones causing the most expensive production incidents.
Silent Failure Detection: When Agents Lie About Working
Here's a failure mode most teams don't even know to look for: agents that skip tool execution entirely and fabricate the results.
Instead of actually calling your database, search API, or calculation tool, the agent generates a plausible-looking response as if it had. The output looks normal. No error is thrown. But the data is completely made up.
This isn't theoretical. It's documented across every major framework — crewAI, LangGraph, AutoGen, and even at the model level with OpenAI. Academic research has found tool hallucination rates as high as 91.1% on challenging subsets. LangGraph proposed a "grounding" parameter (RFC #6617) to address this, but hasn't shipped it.
No existing observability tool detects this. They trace the span, record the output, and move on — never verifying that the tool was actually executed and the result matches reality.
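One way to catch this is to treat the instrumentation layer — not the model's claims — as ground truth for tool execution. A minimal sketch; the tool names and `verify` helper are hypothetical:

```python
# Record every *real* tool execution, then cross-check the agent's
# claimed tool calls against that log. Claims with no matching
# execution are fabrications.
executed = set()

def tracked(name, fn):
    def wrapper(*args, **kwargs):
        executed.add(name)  # ground truth: the tool actually ran
        return fn(*args, **kwargs)
    return wrapper

db_lookup = tracked("db_lookup", lambda key: {"key": key, "value": 42})

def verify(claimed_calls):
    return [c for c in claimed_calls if c not in executed]

db_lookup("orders")  # the agent really invoked this tool
fabricated = verify(["db_lookup", "web_search"])
print(fabricated)  # → ['web_search']
```

The same cross-check can run as a post-processing step on traces: parse the tool calls the agent's final answer implies, and flag any that never produced a span.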
Visual Decision-Tree Debugging: Seeing What the Agent Actually Decided
Every observability tool on the market shows you traces the same way: as a flat table of spans, or a sequential waterfall chart. This works for simple chain-of-thought workflows. It completely breaks down for multi-agent systems where agents make branching decisions.
When Agent A decides to delegate to Agent B instead of Agent C, then Agent B decides to call two tools in parallel, and the combined results trigger a third agent — you need to see this as what it is: a decision tree, not a sequential log.
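Reconstructing that branching run from trace data is straightforward when spans carry parent links — the rendering, not the data model, is what most tools get wrong. A sketch with hypothetical span records:

```python
# Parent-linked spans from a hypothetical multi-agent run:
# Agent A delegates to Agent B, which calls two tools, and the
# combined results trigger Agent C.
spans = [
    {"id": "a",  "parent": None, "name": "agent_a.plan"},
    {"id": "b",  "parent": "a",  "name": "delegate: agent_b"},
    {"id": "t1", "parent": "b",  "name": "tool: search"},
    {"id": "t2", "parent": "b",  "name": "tool: calc"},
    {"id": "c",  "parent": "a",  "name": "agent_c.synthesize"},
]

def render(parent=None, depth=0, out=None):
    # Depth-first walk turns the flat span list into an indented tree.
    out = [] if out is None else out
    for s in spans:
        if s["parent"] == parent:
            out.append("  " * depth + s["name"])
            render(s["id"], depth + 1, out)
    return out

print("\n".join(render()))
```

Even this toy tree makes the delegation structure legible at a glance — which is exactly what a flat span table or waterfall chart obscures.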
While Arize Phoenix has introduced an Agent Visibility tab with basic graph visualization, no tool offers a fully interactive decision-tree view of agent execution paths. LangSmith, Langfuse, Braintrust, Helicone, Opik, and HoneyHive still rely on tabular or span-level views. The visual decision tree remains a largely unsolved UX problem in this market.
Multi-Agent Traces That Actually Work Across Frameworks
Multi-agent architectures are the fastest-growing pattern in AI development. But multi-agent tracing is broken in every existing tool:
- LangSmith can now trace individual CrewAI and AutoGen applications via OpenTelemetry, but it still cannot produce unified cross-framework multi-agent traces — a pipeline where a LangChain agent hands off to a CrewAI agent in the same trace breaks due to context propagation gaps.
- Langfuse shows wrong inputs per agent in supervisor orchestration, and users have reported that identical generation names make it "impossible to target accurately a specific agent" when configuring per-agent evaluations. Langfuse has partially addressed this by switching to the OpenInference Instrumentation Library, though modifying LLM generation names remains difficult. Additionally, LLM spans are dropped in AutoGen tool loops — though this stems from AutoGen's instrumentation rather than a Langfuse bug.
- Arize Phoenix has added an Agent Visibility tab with graph visualization, but multi-agent trace consolidation requires manual context propagation and it lacks built-in support for agent collaboration structures. Opik offers agent graph logging, but graph specification is manual for some frameworks. Braintrust, Helicone, and Maxim AI offer basic session and span grouping (Braintrust traces, Helicone sessions), but lack dedicated multi-agent orchestration tooling — they don't natively distinguish agent boundaries, handoff context, or inter-agent delegation logic.
You can see that Agent A called Agent B. You cannot see why Agent A chose Agent B over Agent C, what context was lost in the handoff, or why the negotiation between three agents converged on a suboptimal plan.
OTel-Native Tracing: Bridging AI Into Enterprise Infrastructure
Enterprises already run OpenTelemetry for their backend services. Their AI agents should emit traces into the same system — not require a separate vendor with a separate SDK.
But OTel's semantic conventions for AI agents are still in "Development" status as of February 2026. The conventions cover individual LLM calls but lack standard attributes for agent orchestration (planning, tool selection, delegation), tool call semantics, multi-agent communication, memory operations, evaluation results, and cost attribution. Each vendor that claims "OTel support" extends the base conventions differently, creating fragmentation.
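In practice, that fragmentation looks like this: the draft conventions give you `gen_ai.*` attributes for the LLM call itself, and everything agent-specific is a vendor invention. The `agent.*` keys below are made-up illustrations, not part of any standard:

```python
# Span attributes for one agent step. The gen_ai.* keys come from
# OTel's draft GenAI semantic conventions (still in Development);
# the agent.* keys are hypothetical extensions — exactly the kind of
# non-standard attribute each vendor currently invents on its own.
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    # Non-standard, illustrative extensions:
    "agent.step": "tool_selection",
    "agent.delegated_to": "retriever_agent",
    "agent.tool.name": "vector_search",
    "agent.cost.usd": 0.0042,
}
print(sorted(span_attributes))
```

Until orchestration, handoff, and cost attributes are standardized, traces emitted by one vendor's SDK won't be interpretable by another vendor's backend.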
Arize Phoenix is the closest to OTel-native, but it jumps from $50/month to $50K–$100K/year at the enterprise tier — a pricing cliff that locks out mid-market teams. Datadog now supports OTel-native ingestion for LLM observability, but its full feature set still relies on the ddtrace SDK.
The Current Tooling Landscape Falls Short
The existing options each have trade-offs that leave mid-market teams underserved:
- LangSmith charges $39/seat/month. For a 25-person engineering team, that's $975/month, and the product is LangChain-first by design. Cost tracking is also inaccurate, showing ~$0.30 for a $1.40 conversation, and SSO/RBAC is locked behind the Enterprise tier.
- Langfuse offers a strong open-source option, but self-hosting means your team is maintaining infrastructure instead of building product. SSO/RBAC is a $300/month add-on.
- Braintrust has a $249/month platform fee with nothing between free and that — a steep pricing cliff for small teams.
- Datadog charges $8 per 10K LLM requests plus an automatic ~$120/day premium when LLM spans are detected, putting moderate-scale teams at $5K+/month. Arize enterprise pricing is estimated at $50K–$100K/year — a steep jump from their $50/month Pro tier. Both involve enterprise sales cycles that can extend procurement timelines significantly.
- AgentOps now offers a TypeScript SDK (v0.1.0, June 2025) and self-hosting under MIT license, but the TS SDK is early-stage with limited functionality compared to the Python SDK.
- Maxim AI charges $29–$49/seat — per-seat pricing that punishes team growth.
- HoneyHive offers a free Developer tier (10K events/month, up to 5 users), but its paid Enterprise tier is contact sales only with no published pricing — despite being a seed-stage company.
What's missing is agent-native observability that gives you visual decision-tree debugging, multi-agent trace correlation, silent failure detection, and OTel-native instrumentation — without per-seat pricing that scales against you, or infrastructure you have to run yourself.
What Would You Actually Need?
If you could see every step your AI agent takes as an interactive decision tree rather than a flat table, understand exactly why it failed, catch silent fabrications before they reach users, and plug directly into your existing OTel infrastructure, what would that change for your team?
I'm researching how engineering teams debug AI agents in production. If you're building with LangChain, CrewAI, AutoGen, or custom agents and have 15 minutes, I'd genuinely like to hear how you approach this today.
No pitch. I'm collecting real data on this problem and happy to share what I'm learning from other teams.
Sources
- MIT NANDA, "The GenAI Divide: State of AI in Business 2025" — 150 interviews, 350 surveys, 300 deployment analyses. Finding: ~5% of AI pilots achieve rapid revenue acceleration.
- IBM, 2025 CEO Study — 2,000 CEOs surveyed. Finding: 25% of AI initiatives delivered expected ROI; 16% scaled enterprise-wide.
- LangChain, State of AI Agents Report — 1,300+ professionals surveyed. Finding: 51% have agents in production; 63% of mid-sized companies (100–2,000 employees) have agents live.
- Silent failure detection: crewAI#3154, LangGraph RFC#6617, AutoGen#3354, OpenAI Community, arXiv 2412.04141
- OTel GenAI semantic conventions: opentelemetry.io — Development status as of Feb 2026.
- Multi-agent tracing issues: langsmith-sdk#1350, langfuse#9429, langfuse discussion#7569, langfuse#11505
- Pricing: LangSmith, Langfuse, Braintrust, Arize Phoenix, Maxim AI, Helicone