Ian Parent

Posted on Mar 17 • Edited on May 2 • Originally published at iris-eval.com

The State of MCP Agent Observability (March 2026)

#mcp #aiagents #observability #opensource

Editor's note (2026-04): This post was written when Iris framed itself as observability-first. Iris has since repositioned as "the agent eval standard for MCP" — scoring outputs, not just watching them. The market analysis below remains accurate; the framing of where Iris fits has sharpened. See Closing the Eval Gap for the current thesis.

The State of MCP Agent Observability (March 2026)

The gap between deploying AI agents and understanding what they're doing.

1. The Observability Gap

The infrastructure world solved observability a decade ago. Prometheus, Grafana, Datadog, Sentry — production systems have mature monitoring, tracing, and alerting. But AI agents sit in a blind spot.

Traditional APM sees HTTP 200 and 143ms latency. It calls that a success. It has no idea the agent just:

Leaked a Social Security number in its response
Hallucinated a citation that doesn't exist
Burned $0.47 on a single query that should have cost $0.03
Followed an injected prompt that overrode its system instructions

Industry surveys suggest approximately 89% of teams deploying software have production observability in place. But when it comes to AI agents specifically — evaluating the quality of what agents produce, not just whether the HTTP request succeeded — that number drops to roughly half. There's a gap between "is the service up?" and "is the agent actually doing what it should?"

The market is responding. The observability market is projected to reach $3.35B in 2026, growing to $6.93B by 2031. MLOps spending — the adjacent budget line — was $2.19B in 2024 and is projected to reach $16.61B by 2030. These are not speculative numbers. The money is already moving.

2. MCP: From Specification to Industry Standard

The Model Context Protocol has moved from specification to industry standard faster than most developer protocols in recent memory. The milestone that cemented this: in December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF) under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI, with backing from Google, Microsoft, AWS, Cloudflare, and Bloomberg. MCP is no longer one company's protocol. It's governed by an open foundation with the major players at the table.

The adoption numbers reflect this:

97M+ monthly SDK downloads across the MCP ecosystem
20,470+ GitHub stars on Langfuse alone (one observability platform in the space)
Remote MCP servers up 4x since May 2025
Gartner predicts 40% of enterprise applications will include AI agents by end of 2026
75% of API gateway vendors expected to ship MCP features by 2026

Claude Desktop, Cursor, and dozens of agent frameworks support MCP natively. The protocol defines how agents discover and invoke tools — and with the recent additions of MCP Apps (tool calls returning interactive UI components) and elicitation (servers requesting structured input mid-task), the surface area is expanding significantly. Agents are no longer just calling functions. They're rendering interfaces and conducting multi-turn dialogues with users through the protocol itself.

Production implications are significant:

Tool call chains are complex. An agent may invoke 5-10 tools in a single execution. Any one of those calls can fail, return garbage, or introduce latency.
Costs compound silently. Each LLM call has a token cost. When agents chain multiple calls, costs multiply in ways that aren't visible without per-execution tracking.
Security surface expands. Every tool call is a potential vector for data leakage or prompt injection. The more tools an agent uses, the more attack surface exists.
Interactive surface adds new observability dimensions. MCP Apps and elicitation mean observability now extends beyond tool inputs and outputs to rendered UI state and mid-task user interactions. Traditional trace models don't capture this.

The Counter-Perspective

Adoption is not universal, and intellectual honesty requires acknowledging the pushback.

Perplexity CTO Denis Yarats publicly moved away from MCP toward direct APIs and CLIs (March 11, 2026), citing context window consumption and auth friction as practical blockers. His argument: MCP's abstraction layer adds overhead that doesn't justify itself for every use case, particularly when context windows are a finite resource and authentication across tool providers remains inconsistent.

This is a legitimate critique. MCP adds a layer. That layer has a cost — in tokens, in latency, in complexity. For single-tool integrations where the agent always calls the same API, a direct integration may be simpler. But for agents that discover and compose tools dynamically — the multi-tool, multi-provider pattern that's becoming standard in production — the protocol abstraction pays for itself. The AAIF governance and the 97M+ monthly downloads suggest the industry agrees, but the tradeoff is real and teams should evaluate it for their specific architecture.

3. Security: The Adoption Bottleneck

Security is the number one blocker for MCP adoption in production, and the data is stark.

According to Zuplo's survey of MCP adopters:

50% of respondents cite security and access control as their top challenge
25% of MCP servers have no authentication whatsoever
38% say security concerns are actively blocking their adoption

This isn't surprising. MCP was designed protocol-first, with authentication and authorization as concerns to be layered on by implementors. In practice, many implementors haven't layered it on. The result: production MCP deployments where any client can invoke any tool with no identity verification.

Cloudflare's launch of MCP Server Portals addresses part of this — providing managed infrastructure with built-in auth, rate limiting, and access control for remote MCP servers. But for self-hosted and local deployments, security remains the responsibility of each server implementation.

For observability specifically, this means the logging layer itself needs to be secure. An observability tool that accepts unauthenticated trace data is a data integrity problem waiting to happen.

4. What's Missing: Protocol-Native Observability

Most existing observability approaches for AI agents fall into two categories:

SDK-based instrumentation. You import a library, wrap your agent calls, and push traces to a vendor. This works, but it requires code changes per agent, per framework. It's bolted on, not built in.

Generic logging. Teams pipe stdout to a log aggregator and grep for problems. This captures text but loses structure — no span trees, no tool call attribution, no cost aggregation.

What's missing is protocol-native observability: an observability layer that speaks the same protocol agents already use. Instead of wrapping agent code, the observability tool registers as a peer in the agent's tool discovery. The agent discovers it and uses it automatically — no SDK integration, no code changes.

This is the approach Iris takes. As an MCP server, any MCP-compatible agent discovers Iris's tools (log_trace, evaluate_output, get_traces) and can invoke them as part of its normal execution. The observability layer is a first-class participant in the protocol, not an afterthought.

5. The Competitive Landscape (March 2026)

The past six months have seen significant movement in the agent observability space. A summary of what's happened:

ClickHouse acquired Langfuse (January 16, 2026). The leading open-source LLM observability platform — 26M+ SDK installs per month, 19 of the Fortune 50 as users — is now backed by a database company with a $400M Series D at a $15B valuation. This is the clearest signal yet that agent observability has enterprise gravity. A database company buying an observability platform tells you where the data volume is heading.

Braintrust raised $80M Series B at $800M valuation (February 2026). An eval-focused platform reaching near-unicorn status on the strength of agent evaluation alone. The market is pricing agent eval as a standalone category, not a feature of something else.

Galileo open-sourced Agent Control Plane (March 11, 2026). Another entrant open-sourcing their agent management layer, validating that open-source is the trust mechanism for this category.

New Relic launched Agentic Platform (February 2026). Traditional APM is moving into agent-specific observability, bridging the gap between infrastructure monitoring and agent behavior tracking.

IBM Instana added MCP-specific observability capabilities. Datadog shipped an MCP server bridge. Arize Phoenix builds on OpenTelemetry's GenAI semantic conventions for trace-level agent observability. The incumbents are all moving.

The pattern is clear: every major observability vendor is adding agent capabilities, and several agent-native startups are reaching meaningful scale. What remains underserved is the protocol-native approach — observability that lives inside the agent protocol rather than wrapping it from outside.

6. The Eval Gap: Heuristic vs. Semantic

Evaluation is where the observability story gets interesting — and where the most work remains.

Heuristic evaluation uses deterministic rules: regex for PII detection, keyword overlap for relevance, token ratios for cost efficiency. It's fast, reproducible, and cheap. It catches real problems — an SSN pattern in agent output is an SSN pattern regardless of context.

Semantic evaluation uses an LLM to judge another LLM's output. It can assess nuance: "Is this response actually helpful?" "Does this summary capture the key points?" But it's slow, expensive, and introduces its own failure modes (the judge model can hallucinate too).

The practical reality for most teams today:

	Heuristic	Semantic
Speed	< 1ms per rule	500ms–3s per evaluation
Cost	$0	$0.01–0.10 per evaluation
Reproducibility	Deterministic	Varies between runs
PII detection	Strong (regex patterns)	Unreliable
Quality assessment	Limited (keyword overlap)	Strong
Hallucination detection	Markers only ("As an AI...")	Contextual analysis

Most production systems will need both. Heuristic rules as the always-on safety net — catching PII, injection, cost overruns — with semantic evaluation for periodic quality audits. We break down this tradeoff in detail in Heuristic vs Semantic Eval.

7. OpenTelemetry GenAI Semantic Conventions: The Standard Arrives

A prediction from earlier versions of this report has materialized: OpenTelemetry's GenAI semantic conventions are now published, including dedicated specifications for agent spans and MCP semantic conventions.

This matters because OpenTelemetry is the lingua franca of infrastructure observability. GenAI semantic conventions mean agent traces can flow through the same pipelines, dashboards, and alerting systems that teams already use for their HTTP services and databases. The "two worlds" problem — infrastructure observability in one tool, agent observability in another — has a standards-based path to convergence.

Arize Phoenix and ContextForge are already building on these conventions. Any observability tool that doesn't align with OTel's GenAI semantics is building on a proprietary foundation that will require migration later. We explore this convergence further in MCP Meets OpenTelemetry.

Iris's span structure has been OpenTelemetry-compatible since v0.1.0. As the GenAI conventions mature and tooling catches up, traces logged through Iris will integrate with the broader OTel ecosystem without transformation.

8. Cost Visibility: The Hidden Expense

Agent costs are a blind spot for most teams. Individual API calls are cheap. But agents don't make individual calls — they make chains.

A typical agent execution might involve:

1 system prompt (500 tokens input)
3-5 tool calls, each with its own LLM round-trip
Context that grows with each step (previous results accumulate)
A final synthesis call with the full conversation

What looks like a $0.01 API call becomes $0.15-0.50 per execution when you trace the full chain. At 1,000 executions per day, that's $150-500/day — potentially $4,500-15,000/month. Most teams don't discover this until the invoice arrives.

Per-execution cost tracking isn't a nice-to-have. It's the difference between "our AI feature is viable" and "our AI feature is burning cash and we don't know why."

9. Regulatory Context

The EU AI Act's traceability requirements take effect in August 2026. Article 14 requires "human oversight" measures for high-risk AI systems, including the ability to understand and trace system behavior.

For teams building AI agents that make consequential decisions — customer support, financial analysis, healthcare triage — the ability to log every execution with full traceability isn't just good engineering. It's becoming a legal requirement.

This doesn't mean every agent needs enterprise compliance tooling today. But it does mean that teams building logging and eval infrastructure now are building on the right side of the regulatory curve.

10. Predictions

Near-term (2026):

Protocol-native observability becomes a recognized category. More tools will build on MCP directly rather than as SDK wrappers.
Cost tracking becomes table stakes. Teams will demand per-execution cost visibility before approving agent deployments.
Heuristic eval adoption accelerates as teams realize they need automated safety nets, not just manual review. The cost of not evaluating — what we call the eval tax — is becoming impossible to ignore.
Security tooling catches up to protocol adoption. The 25% of MCP servers with no auth will shrink as managed hosting (Cloudflare, etc.) and security-first server implementations become the default.
MCP Apps and elicitation expand the observability surface. Tools that only trace request/response pairs will miss the interactive dimension.

Medium-term (2026-2027):

Hybrid eval systems (heuristic + semantic) become the standard pattern. Heuristic for always-on safety, semantic for periodic quality audits.
Agent observability platforms integrate with existing APM tools via OpenTelemetry GenAI semantic conventions, bridging the infrastructure and AI observability worlds. This is no longer a prediction — the conventions exist. Adoption is the remaining variable.
Compliance-driven adoption increases as EU AI Act enforcement begins.
Consolidation accelerates. The ClickHouse-Langfuse acquisition is the first of many. Expect database companies, APM vendors, and cloud providers to acquire agent observability startups.

Longer-term (2027+):

Agent observability data feeds back into agent improvement loops. Eval results inform prompt optimization, model selection, and agent architecture decisions.
Standardized eval benchmarks emerge for common agent patterns (RAG, tool-use, multi-agent orchestration).
The distinction between "traditional APM" and "agent observability" blurs as platforms converge.

What Iris Does About This

Iris is the first MCP-native eval and observability tool for AI agents. Open-source core, MIT licensed. Any MCP-compatible agent discovers Iris automatically — no SDK, no code changes. Log traces, evaluate output quality, detect PII and prompt injection, track costs per execution.

If you're building agents and want to see what they're actually doing:

GitHub: github.com/iris-eval/mcp-server

Waitlist (cloud tier): iris-eval.com

Star the repo if this report was useful. The roadmap is public.

Top comments (1)

Raju Dandigam • Jun 30

The distinction between application observability and agent observability is important. Traditional monitoring can tell us whether a request succeeded and how long it took, but it cannot explain whether the agent produced a useful result, followed policy correctly, or made a poor decision. That creates a completely new category of operational visibility requirements. The challenge feels less like extending APM and more like defining a new layer that sits above it.