Observability for AI agents running through MCP has a new failure point: the MCP tool call.
Good. The broad version of this conversation has already been beaten to death. Agents need traces. Agents need evals. Agents need feedback loops. Fine. The sharper production question is what happens when the agent leaves the planner and crosses into a tool server owned by another team, another vendor, another runtime, or another cloud account.
That boundary is where the trace disappears.
Honeycomb is running O11yCon in San Francisco this week. Christine Yen's line in the announcement gets at the issue: agents are writing code, agents are triaging incidents, agents are running production through orchestration, and engineering has little visibility into what the agents did, let alone whether they added value. The visibility gap lies along the path from the model's decision, through the tool server, to the downstream services affected by the action.
The production shape is distributed tracing with a model in the loop.
A planner says "tool failed." An MCP server just sees an unrelated tools/call. A database sees a single query. A payment API sees a single request. The observability backend sees all these individual pieces and, operationally, has no idea what to do with them. Nobody can say whether the model chose the wrong tool, the planner's MCP client lost context somewhere along the line, the server failed to accept the call, or the downstream service simply timed out.
Logs inside a single service are comfortable because the stream is local and easy to interpret. As soon as an incident spans multiple services or tools, that comfort disappears.
MCP made tool integration portable. It did not magically make tool behavior observable.
Focused has been pushing this shape for a while. In Developing AI Agency, the point was that useful agents need real engineering systems around them. In Streaming agent state with LangGraph, the point was that intermediate state matters while long-running work is happening. MCP adds a protocol boundary to that same production story. If the trace cannot cross it, the agent becomes opaque at the exact moment it starts doing useful work.
MCP gave us the carrier
MCP made tool integration for production tool calls easier. Making the behavior of those calls observable is a different job.
This brings us to a simple and useful place: SEP-414 reserves the W3C trace context keys for propagation through MCP. An MCP tools/call request can therefore carry trace context in params._meta, next to the tool name and arguments.
MCP typically wants _meta keys that start with a DNS-prefixed name. SEP-414 makes an exception for the three W3C trace keys so existing OpenTelemetry propagation can work without creating twelve slightly different names for the same thing. traceparent stays traceparent, tracestate stays tracestate, and baggage stays baggage.
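To make the request shape concrete, here is a minimal sketch of a tools/call envelope carrying the three reserved keys in params._meta. The helper names (new_traceparent, build_tools_call) and the example tool are illustrative assumptions, not part of any MCP SDK; the traceparent layout follows the W3C version-traceid-parentid-flags format.

```python
import json
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent string: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def build_tools_call(request_id, tool_name, arguments, traceparent,
                     tracestate=None, baggage=None):
    """Hypothetical helper: a JSON-RPC tools/call request with trace context."""
    meta = {"traceparent": traceparent}
    # tracestate and baggage are optional, so only send them when present.
    if tracestate:
        meta["tracestate"] = tracestate
    if baggage:
        meta["baggage"] = baggage
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments, "_meta": meta},
    }


if __name__ == "__main__":
    request = build_tools_call(1, "get_weather", {"city": "Berlin"}, new_traceparent())
    print(json.dumps(request, indent=2))
```

The key names stay unprefixed, which is the whole point of the exception: a propagator that already understands traceparent does not need an MCP-specific alias.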
Tiny standardization, huge operational consequence.
A universal set of properties for W3C trace context is a small thing to request. Without SEP-414, every agent stack invents its own set of properties in params: io.modelcontextprotocol.traceparent, otel_trace_parent, correlation IDs encoded in a vendor envelope, plus the special shape required by a proprietary monitoring stack. The resulting observability swamp would be indistinguishable from what exists today with services and their HTTP traces.
First, the agent runtime starts a new span or continues an existing one. Then the MCP client for that runtime injects W3C trace context into params._meta for the call. When the MCP server processes the call, it extracts the W3C trace context from params._meta. Then the server creates a new server span. Tool code invoked by that server, including API calls to databases, queues, workflow engines, and other services, runs under the same trace context.
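The server half of that flow can be sketched in plain Python. This is an illustration under stated assumptions, not an SDK implementation: SpanContext, extract_from_meta, and start_server_span are hypothetical names, only the fixed-length traceparent layout is handled, and a real server would more likely hand params._meta to its tracing library's propagator instead of parsing by hand.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    sampled: bool

    def to_traceparent(self) -> str:
        return f"00-{self.trace_id}-{self.span_id}-{'01' if self.sampled else '00'}"


def extract_from_meta(params: dict) -> Optional[SpanContext]:
    """Read the caller's context from params._meta, or None if absent or invalid."""
    value = (params.get("_meta") or {}).get("traceparent")
    if not isinstance(value, str):
        return None
    match = _TRACEPARENT.match(value)
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # W3C marks version ff and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, None, bool(int(flags, 16) & 0x01))


def start_server_span(params: dict) -> SpanContext:
    """Continue the caller's trace with a new span ID, or start a fresh trace."""
    parent = extract_from_meta(params)
    if parent is None:
        # Assumption: with no usable context, begin a new sampled trace.
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), None, True)
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.span_id, parent.sampled)
```

The server span keeps the caller's trace ID and records the caller's span ID as its parent, which is what lets a backend draw the client and server work as one tree.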
The tool boundary is where agent observability either survives or dies.
HTTP spans will not save the agent loop
A tempting shortcut is to assume the transport already has tracing. The MCP server runs over HTTP. The ingress span exists. The collector sees requests. Done.
Nope.
That is why OpenTelemetry's MCP semantic conventions matter: HTTP spans only contain information about transport. Streamable MCP transports can contain more than one request, and one MCP operation can spread across retries and transports. The transport context and MCP context are related, but different.
A single streamable HTTP request can carry multiple MCP messages. A retry can create multiple transport-level attempts for one logical operation. Stdio has no HTTP request to hang a trace on at all. If instrumentation stops at the transport layer, the team is just looking at plumbing. The production question lives one layer up: what MCP method was called, what tool was called, what session was involved, what error type was returned, and which downstream spans received the trace context.
A trace is useful when it follows the boundary. In the simplest case, a single trace starts with a span created by the agent runtime. The span name should be boring and low-cardinality, with names like tools/call get_weather, tools/call query_customer, or tools/call create_ticket. The attributes carry the information that matters in production: mcp.method.name, gen_ai.tool.name, mcp.session.id, mcp.protocol.version, network.transport, and error.type. OpenTelemetry warns against adding high-cardinality resource URIs to span names by default. That creates backend cardinality problems for no benefit.
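As a rough illustration of that naming rule, the sketch below builds a span name from the MCP method plus a stable tool name and keeps everything else in attributes. The function name mcp_span_fields is an assumption for this example; the attribute keys are the ones listed above, and the transport value passed in the usage lines is only an example.

```python
from typing import Dict, Optional, Tuple


def mcp_span_fields(
    method: str,
    tool_name: Optional[str] = None,
    *,
    session_id: Optional[str] = None,
    protocol_version: Optional[str] = None,
    transport: Optional[str] = None,
    error_type: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (span_name, attributes) for one MCP operation."""
    # Name: method plus stable tool name only. Session IDs, arguments, and
    # resource URIs stay out of the name so it remains low-cardinality.
    name = f"{method} {tool_name}" if tool_name else method
    candidates = {
        "mcp.method.name": method,
        "gen_ai.tool.name": tool_name,
        "mcp.session.id": session_id,
        "mcp.protocol.version": protocol_version,
        "network.transport": transport,
        "error.type": error_type,
    }
    # Drop unset attributes instead of emitting empty values.
    attributes = {key: value for key, value in candidates.items() if value is not None}
    return name, attributes


if __name__ == "__main__":
    span_name, attrs = mcp_span_fields(
        "tools/call", "query_customer", session_id="sess-1", transport="tcp"
    )
    print(span_name)
    print(attrs)
```

Grouping by span name then yields one row per tool, while the session ID remains available as a filter.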
The same thing is true for baggage. Baggage is useful for correlation. It is also an attractive nuisance. A tenant hint here, a route class there, an evaluation cohort for a particular set of runs. Fine. But prompts, secrets, user emails, access tokens, and customer data do not belong in baggage because trace context is supposed to cross service boundaries.
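One way to enforce that policy is an allowlist applied before context is injected. The sketch below is an assumption-laden illustration: the allowed key names are hypothetical, and the parser only handles the comma-separated key=value member layout of the W3C baggage header.

```python
# Hypothetical allowlist: coarse routing and cohort hints only.
ALLOWED_BAGGAGE_KEYS = frozenset({"tenant.tier", "route.class", "eval.cohort"})


def filter_baggage(header: str, allowed=ALLOWED_BAGGAGE_KEYS) -> str:
    """Keep only allowlisted members of a W3C baggage header value."""
    kept = []
    for member in header.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue  # skip empty or malformed members
        key = member.split("=", 1)[0].strip()
        if key in allowed:
            kept.append(member)  # keep the member, including any ;properties
    return ",".join(kept)


if __name__ == "__main__":
    raw = "tenant.tier=gold, user.email=a%40b.com,eval.cohort=c7"
    print(filter_baggage(raw))
```

An allowlist fails closed: a key nobody reviewed never leaves the process, which is the safer default for data that is designed to travel.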
Google's Cloud Trace documentation treats tracing through remote MCP request metadata as an implementation detail. A remote server can accept traceparent in headers or _meta. Once that tracing information is accepted and the trace is sampled, the server emits spans for the requested operation, including failures caused by the agent or by the tool, and latency caused by the client, network, or server processing.
Sampling policy therefore matters for the agent's tool work. If that work is not sampled, it cannot be reconstructed later by whoever wired up the chat UI.
Fragmented truth still loses the incident
Separate traces can be valid. A vendor-operated MCP server may want a clean service boundary. A client team may not own the server. Langfuse's MCP tracing docs make that distinction directly. But default separation is awful for incident management when the agent itself is causing a user-visible problem.
The agent chooses a tool. The MCP server executes the request. The database locks. The tool returns a timeout. The planner retries with slightly different arguments. The user waits. Each system can tell the truth from inside its own box. The operator still has to stitch together causality by timestamps, request IDs, Slack screenshots, and vibes (the official fourth pillar, apparently).
Without propagation, every system tells the truth in isolation.
In production flow, agent traces should form a chain that represents both the decision process for a request and the execution process carried out by services. The tool spans from an agent trace should link to the corresponding service spans. Having the agent's processing stages with nothing from subsequent services is model theater. Service spans without the corresponding tool decision are classic APM with no agent-specific information.
Honeycomb has been going down a similar route. Their Innovation Week writeup describes agent workflows that branch, retry, call tools, hand off, and trigger services. They frame Agent Timeline as a view that places the agent's work inside the incident loop and shows the causal chain behind a prompt log.
The implementation surface is small
Here is a concise specification for adding distributed tracing to an agent-enabled workflow:
- inject traceparent into MCP params._meta
- extract it on the MCP server
- name spans by MCP method plus stable tool or prompt name
- attach MCP and GenAI attributes with low cardinality
- propagate trace context to following API and database calls
- keep sensitive data out of baggage
- send the result to a backend that can show agent and service work together
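The downstream step in that list is the one most often skipped, so here is a small sketch of it: tool code deriving a child traceparent from the server span and attaching it to an outbound HTTP request. The function names and the URL are hypothetical, and an instrumented HTTP client would normally do this automatically.

```python
import secrets
import urllib.request
from typing import Optional


def child_traceparent(parent: str) -> str:
    """Same trace ID and flags as the parent, with a fresh span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def build_downstream_request(
    url: str, server_traceparent: str, tracestate: Optional[str] = None
) -> urllib.request.Request:
    """Prepare an outbound request that stays inside the agent's trace."""
    headers = {"traceparent": child_traceparent(server_traceparent)}
    if tracestate:
        headers["tracestate"] = tracestate
    # No network call happens here; the request object is only constructed.
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    server_span = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    request = build_downstream_request("http://billing.internal.example/charge", server_span)
    # urllib stores header names capitalized; HTTP header names are case-insensitive.
    print(request.get_header("Traceparent"))
```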
The ecosystem around the MCP contract already does a decent amount of the heavy lifting. Grafana's MCP server docs include attributes such as gen_ai.tool.name, mcp.method.name, and mcp.session.id, with W3C trace context propagation from _meta. MCP Toolbox telemetry docs cover attributes for MCP method, transport, protocol, toolset, tool name, and error type. LangSmith accepts OpenTelemetry ingestion, which means MCP spans do not have to sit in an observability island away from LangChain or LangGraph applications.
In practice, agent systems run across different runtimes, including planners, graphs, model gateways, tool registries, MCP clients and servers, legacy APIs, databases, queues, approval steps, and eval jobs. Evidence of proper orchestration cannot scatter across architecture components and still be reviewable by team members from AI, platform, service, and business functions. We discussed the tradeoffs in Multi-Agent Orchestration in LangGraph. For trace propagation, the same reasoning applies. Architecture can be decomposed into modular components. Evidence for correct runtime behavior cannot.
A decent review checklist is simple:
- Can an operator start from a failed agent run and find the corresponding MCP tools/call span for the tool that failed?
- Can they see the exact tool name without exploding cardinality?
- Can they jump from the client span to the server span?
- Can they see the downstream API, database, or queue work under the same trace?
- Can they distinguish model/tool selection failure from tool/server failure?
- Can they see error.type, latency, tokens, and quality signals near the same workflow?
- Can they prove no secrets or PII are leaking through baggage or span attributes?
Call it what it is: a pull request.
The owner is the team that owns the boundary
What keeps observability in agent systems stuck is handing MCP observability to the AI team, or the MCP tools team, or the database platform team, while claiming the boundary itself is too hard for any one team to own.
There are four parties involved here: the vendor of the tool, the platform team, the AI team, and the service team. The vendor exposes spans. The platform team runs a collector to gather those spans. The AI team creates a planner span that is passed as context to tools. The service team instruments downstream API and database calls made by tools within an agent run. Someone has to own the boundary between those groups.
Own the boundary.
For an internal MCP server, trace propagation belongs in the server template for all calls. It should not be left to individual tools. For vendor-provided MCP servers, test the contract by sending traceparent in params._meta and verifying that the backend receives the linked span. Test trace propagation from the agent runtime for every tool call after context injection, without needing to chase separate dashboards. Baggage should have a clear policy before developers discover it as a convenient place to add sensitive information.
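The vendor contract test can be reduced to one assertion: after sending a known traceparent, does the backend hold a span whose parent is the span that was sent? The sketch below assumes exported spans are available as dictionaries with trace_id and parent_span_id fields; that shape is hypothetical and will differ by backend.

```python
from typing import List, Optional


def find_linked_server_span(sent_traceparent: str, exported_spans: List[dict]) -> Optional[dict]:
    """Return the first span that continues the trace context we sent, if any."""
    _version, trace_id, span_id, _flags = sent_traceparent.split("-")
    for span in exported_spans:
        same_trace = span.get("trace_id") == trace_id
        child_of_sent = span.get("parent_span_id") == span_id
        if same_trace and child_of_sent:
            return span
    return None


if __name__ == "__main__":
    sent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    # Hypothetical spans, as they might look after querying a tracing backend.
    spans = [
        {"name": "GET /health", "trace_id": "a" * 32, "parent_span_id": None},
        {
            "name": "tools/call query_customer",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "parent_span_id": "00f067aa0ba902b7",
        },
    ]
    linked = find_linked_server_span(sent, spans)
    print(linked["name"] if linked else "propagation contract broken")
```

A server that starts its own trace instead of continuing the caller's fails this check even though it emits spans, which is exactly the case worth catching before an incident.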
AI agent observability will continue to sound mysterious when production monitoring means staring at transcripts of model dialogs. A transcript is one artifact. It will never show the intent behind a command, the tools used to execute it, the side effects, the latency, the errors, or the downstream work required by systems that had to deal with the output of those tools.
MCP made tools portable. SEP-414 and the OpenTelemetry MCP conventions make the tool boundary traceable. The work is wonderfully unglamorous: pass the context, name the spans, control the attributes to keep cardinality low, protect baggage from sensitive information, and then follow the tool calls as the trace crosses the same boundary as the agent.
Follow the trace, follow the agent.


Top comments (1)
This is the observability piece that makes MCP feel production-grade rather than just convenient.
For database tools specifically, the trace should not stop at tools/call query_customer. The useful audit path is more like:
Without that chain, teams can see that “a tool failed” or “a query ran,” but not whether the model chose the wrong tool, the MCP server applied the wrong policy, or the downstream data source returned stale/partial data.
Related: https://conexor.io/blog/query-provenance-for-ai-database-agents?utm_source=devto&utm_medium=comment&utm_campaign=engagement
The baggage warning is important too. Trace context should cross boundaries; secrets and customer data should not hitch a ride.