Your LLM agent calls a tool via MCP. The tool fails. Your trace shows tools/call search — error. That's it.
Not why it failed. Not how long it took. Not what arguments were passed. Not whether it was a timeout, a validation error, or a rate limit from a downstream API. Because nobody instruments the server side.
Every MCP observability tool watches the client. We built the first middleware that watches from inside the server. One import, one function call:
```ts
import { toadEyeMiddleware } from "toad-eye/mcp";

const server = new McpServer({ name: "my-server", version: "1.0.0" });
toadEyeMiddleware(server);
// Every tool, resource, and prompt handler is now instrumented.
```
Here's why this was harder than it sounds, and what we learned building it.
The black box
Here's what an MCP tool call trace looks like today — from the client's perspective:
```
invoke_agent orchestrator 200ms
├── chat gpt-4o 1.2s
├── tools/call web-search ???
│   └── (nothing — the server is a black box)
└── chat gpt-4o 800ms
```
The LLM decided to call web-search. The client sent the JSON-RPC request. Something happened on the server. The client got a response — or an error.
The gap between "sent the request" and "got the response" is completely invisible.
MCP adoption is exploding. Claude Desktop, Cursor, Windsurf, Zed — every AI IDE supports it. Thousands of servers on npm. And every single one is a black box. When your tool breaks in production, you're debugging with nothing.
Why you can't just "add logging"
Your first instinct: console.log in the tool handler. Three reasons that doesn't work for MCP:
MCP uses JSON-RPC 2.0, not HTTP. No Express middleware. No Hono middleware. No request/response cycle to hook into. The SDK gives you a McpServer class where you register handlers — .tool(), .resource(), .prompt() — and it routes JSON-RPC messages internally.
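For concreteness, this is what registration looks like with the official TypeScript SDK — a minimal sketch; the tool name and handler body are made up, but the .tool() signature is the SDK's:

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "my-server", version: "1.0.0" });

// The SDK files this handler away in a private map. There is no hook
// between this call and the internal JSON-RPC dispatch.
server.tool("calculate", { expression: z.string() }, async ({ expression }) => ({
  content: [{ type: "text", text: `result of ${expression}` }], // stub handler
}));
```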
The SDK internals are private. You can't monkey-patch McpServer._requestHandlers — it's a private Map, and TypeScript strict mode won't let you touch it.
And the worst part — stdio transport. In stdio mode, stdout IS the JSON-RPC wire. Every byte on stdout is parsed by the client as a JSON-RPC message:
```
$ node my-mcp-server.js
debug                          ← your console.log
{"jsonrpc":"2.0","id":1,...}   ← actual MCP response
```
The client tries to parse debug as JSON. Can't. Connection dead. Your debugging broke the thing you were trying to debug.
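The safe channel in stdio mode is stderr, which the client never parses:

```ts
// Safe: stderr is not part of the JSON-RPC stream.
process.stderr.write("debug: handling tools/call\n");

// console.error also writes to stderr in Node.
console.error("debug: handling tools/call");
```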
The Wrapper Pattern
Can't patch internals. Can't hook HTTP. Can't use stdout. But we can intercept the public API.
When you call server.tool(), the SDK stores the handler internally. If we replace .tool() before any handlers are registered, we wrap every handler transparently:
```ts
const originalTool = server.tool.bind(server);

server.tool = function wrappedTool(name, ...rest) {
  // The handler is the last function-typed argument; descriptions and
  // schemas come before it. If there's no handler, don't touch anything.
  const handlerIndex = rest.findIndex((arg) => typeof arg === "function");
  if (handlerIndex === -1) return originalTool(name, ...rest);

  const originalHandler = rest[handlerIndex];
  rest[handlerIndex] = async function wrappedHandler(...args) {
    const span = startToolSpan(name);
    const start = performance.now();
    try {
      const result = await originalHandler(...args);
      endSpanSuccess(span);
      recordMcpToolCall(name, performance.now() - start, "success");
      return result;
    } catch (error) {
      endSpanError(span, error);
      recordMcpToolCall(name, performance.now() - start, "error");
      throw error;
    }
  };
  return originalTool(name, ...rest);
};
```
Same approach for .resource() and .prompt(). The handler is wrapped before it enters the SDK's private map. The SDK never knows. Your code never changes.
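That generalization can be sketched as a loop over the three registration methods, where instrumentHandler is a hypothetical helper applying the same span-and-metrics wrapping shown above:

```ts
for (const method of ["tool", "resource", "prompt"]) {
  const original = server[method].bind(server);
  server[method] = (name, ...rest) => {
    const i = rest.findIndex((arg) => typeof arg === "function");
    if (i !== -1) rest[i] = instrumentHandler(method, name, rest[i]);
    return original(name, ...rest);
  };
}
```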
What you see after
Before: tools/call search — error.
After:
```
invoke_agent orchestrator [client-side]
├── chat gpt-4o 1.2s
├── tools/call calculate 45ms ✅ [server-side]
├── tools/call web-search 2.3s ❌ RateLimitError
├── resources/read file:///config.json 3ms ✅
└── chat gpt-4o 800ms
```
Each operation gets a span with standard OTel attributes:
```
gen_ai.operation.name = "tools/call"
gen_ai.tool.name      = "calculate"
mcp.server.name       = "my-server"
mcp.session.id        = "a1b2c3d4"
network.transport     = "pipe"
```
Any OTel-compatible backend — Jaeger, Grafana Tempo, Datadog, Arize Phoenix — recognizes them without configuration.
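For reference, startToolSpan and the two end helpers from the wrapper example could look roughly like this with @opentelemetry/api (a sketch, not toad-eye's exact internals):

```ts
import { trace, SpanKind, SpanStatusCode, type Span } from "@opentelemetry/api";

const tracer = trace.getTracer("toad-eye");

function startToolSpan(name: string): Span {
  return tracer.startSpan(`tools/call ${name}`, {
    kind: SpanKind.SERVER,
    attributes: {
      "gen_ai.operation.name": "tools/call",
      "gen_ai.tool.name": name,
    },
  });
}

function endSpanSuccess(span: Span) {
  span.setStatus({ code: SpanStatusCode.OK });
  span.end();
}

function endSpanError(span: Span, error: unknown) {
  span.recordException(error as Error);
  span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
  span.end();
}
```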
Metrics — patterns, not just incidents
Spans tell you about individual requests. Metrics tell you about patterns:
| Metric | Type | What it tells you |
|---|---|---|
| gen_ai.mcp.tool.duration | Histogram | Which tools are slow? Latency trending up? |
| gen_ai.mcp.tool.calls | Counter | Which tools are popular? Agent over-using one? |
| gen_ai.mcp.tool.errors | Counter | Which tools are unreliable? What error types? |
| gen_ai.mcp.resource.reads | Counter | Access patterns for resources |
| gen_ai.mcp.session.active | UpDownCounter | How many MCP sessions right now? |
"The search tool has an 8% error rate and P95 latency of 2.3 seconds." That's actionable. "A tool failed" is not.
The STDIO trap
This is the part we learned the hard way.
OpenTelemetry's SDK writes diagnostic messages to stdout by default. In an HTTP server, you'd never notice. In a stdio MCP server, those diagnostics are catastrophic.
OTel SDK initializes → writes "DiagAPI initialized" to stdout → MCP client parses it as JSON → parse fails → connection dead. Before a single tool call.
We spent 3 hours on "why does the connection die when I import toad-eye" before finding this. The fix is ten lines:
```ts
import { diag, DiagLogLevel } from "@opentelemetry/api";

function ensureStdioSafe() {
  // Send every diagnostic level to stderr. DiagConsoleLogger is not enough:
  // its info/debug/verbose levels print to stdout — the JSON-RPC wire.
  const logToStderr = (...args: unknown[]) =>
    process.stderr.write(args.map(String).join(" ") + "\n");
  diag.setLogger(
    { verbose: logToStderr, debug: logToStderr, info: logToStderr, warn: logToStderr, error: logToStderr },
    DiagLogLevel.WARN
  );
}
```
Redirects all OTel diagnostics to stderr. The MCP connection survives.
If you're building any MCP server with any OTel instrumentation: redirect diagnostics to stderr first. Before spans, before metrics, before anything.
Privacy by default
Tool arguments can contain anything. API keys. User data. File contents. Database credentials. The default must be safe:
```ts
toadEyeMiddleware(server); // arguments NOT recorded
```
Opt in explicitly:
```ts
toadEyeMiddleware(server, {
  recordInputs: true,
  recordOutputs: true,
  redactKeys: ["apiKey", "token", "password"],
  maxPayloadSize: 4096,
});
```
Sensitive fields become [REDACTED] in spans. Large payloads get truncated. Compare with tools that record everything by default and leave privacy as "your problem."
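One plausible shape for the redaction step (illustrative, not the library's actual code):

```ts
function sanitize(payload: unknown, redactKeys: string[], maxPayloadSize: number) {
  // Replace sensitive keys during serialization, then cap the size.
  const json = JSON.stringify(payload, (key, value) =>
    redactKeys.includes(key) ? "[REDACTED]" : value
  );
  return json.length > maxPayloadSize
    ? json.slice(0, maxPayloadSize) + "…[TRUNCATED]"
    : json;
}
```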
Context propagation
The most powerful thing: linking client and server traces. One trace tree, complete picture.
HTTP does this with traceparent headers. MCP stdio has no headers. But it has _meta in JSON-RPC params:
```json
{
  "method": "tools/call",
  "params": {
    "name": "calculate",
    "arguments": { "expression": "2+2" },
    "_meta": {
      "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    }
  }
}
```
Our middleware extracts it. When the host injects _meta.traceparent, server spans become children of client spans. When it doesn't, the fallback is graceful: the span starts as a root. No crash, no error. It works with whatever context is available.
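In code, the extraction can be as small as this (a sketch; assumes a W3C trace-context propagator is registered globally, which the Node OTel SDK does by default):

```ts
import { context, propagation, trace } from "@opentelemetry/api";

function startSpanWithClientContext(name: string, meta?: Record<string, string>) {
  // _meta uses the same "traceparent" key as the W3C header, so the
  // default text-map propagator can read the object directly.
  const parentCtx = meta?.traceparent
    ? propagation.extract(context.active(), meta)
    : context.active(); // no _meta → span starts as a root
  return trace.getTracer("toad-eye").startSpan(name, undefined, parentCtx);
}
```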
The landscape
We looked for MCP server-side observability before building this. We couldn't find any — not "nothing good," but nothing at all.
| | Client-side tracing | Server-side middleware | Privacy controls | OTel-native |
|---|---|---|---|---|
| Datadog | ✅ | ❌ | ❌ | ❌ |
| Langfuse | ✅ | ❌ | ❌ | ❌ |
| AgentOps | ✅ | ❌ | ❌ | ❌ |
| toad-eye | ✅ | ✅ | ✅ | ✅ |
Every tool watches the client. Nobody watches the server. We built this because our own bot's MCP tools kept failing and we had no way to diagnose why.
Quick checklist
If you're building or maintaining MCP servers:
- Never console.log in stdio servers — use stderr or a logger
- Redirect OTel diagnostics to stderr before initializing anything
- Don't record tool arguments by default — they may contain secrets
- Use standard span names: tools/call {name}, resources/read {uri}
- Test with both stdio and SSE transports — they break differently
- Check if your MCP host injects _meta.traceparent for trace linking
Implementation: Phase 1: Core middleware · Phase 2: Metrics + privacy · Phase 3: STDIO isolation
Previous articles:
- #4: Your LLM streaming traces are lying to you
- #5: Your AI agent re-sends 80% of your budget every loop
- #6: Your LLM traces are write-only
toad-eye — open-source LLM observability, OTel-native: GitHub · npm
🐸👁️