Your LLM agent calls a tool via MCP. The tool fails. Your trace shows tools/call search — error. That's it.
Not why it failed. Not how long it took. Not what arguments were passed. Not whether it was a timeout, a validation error, or a rate limit from a downstream API. Because nobody instruments the server side.
Every MCP observability tool watches the client. We built the first middleware that watches from inside the server. One import, one function call:
```ts
import { toadEyeMiddleware } from "toad-eye/mcp";

const server = new McpServer({ name: "my-server", version: "1.0.0" });
toadEyeMiddleware(server);
// Every tool, resource, and prompt handler is now instrumented.
```
Here's why this was harder than it sounds, and what we learned building it.
The black box
Here's what an MCP tool call trace looks like today — from the client's perspective:
```
invoke_agent orchestrator 200ms
├── chat gpt-4o 1.2s
├── tools/call web-search ???
│   └── (nothing — the server is a black box)
└── chat gpt-4o 800ms
```
The LLM decided to call web-search. The client sent the JSON-RPC request. Something happened on the server. The client got a response — or an error.
The gap between "sent the request" and "got the response" is completely invisible.
MCP adoption is exploding. Claude Desktop, Cursor, Windsurf, Zed — every AI IDE supports it. Thousands of servers on npm. And every single one is a black box. When your tool breaks in production, you're debugging with nothing.
Why you can't just "add logging"
Your first instinct: console.log in the tool handler. Three reasons that doesn't work for MCP:
MCP uses JSON-RPC 2.0, not HTTP. No Express middleware. No Hono middleware. No request/response cycle to hook into. The SDK gives you a McpServer class where you register handlers — .tool(), .resource(), .prompt() — and it routes JSON-RPC messages internally.
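For concreteness, this is what registration looks like with the official TypeScript SDK — a minimal sketch; the tool name and handler body are made up, but the .tool() signature is the SDK's:

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "my-server", version: "1.0.0" });

// The SDK files this handler away in a private map. There is no hook
// between this call and the internal JSON-RPC dispatch.
server.tool("calculate", { expression: z.string() }, async ({ expression }) => ({
  content: [{ type: "text", text: `result of ${expression}` }], // stub handler
}));
```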
The SDK internals are private. You can't monkey-patch McpServer._requestHandlers — it's a private Map, and TypeScript strict mode won't let you touch it.
And the worst part — stdio transport. In stdio mode, stdout IS the JSON-RPC wire. Every byte on stdout is parsed by the client as a JSON-RPC message:
```
$ node my-mcp-server.js
debug                          ← your console.log
{"jsonrpc":"2.0","id":1,...}   ← actual MCP response
```
The client tries to parse debug as JSON. Can't. Connection dead. Your debugging broke the thing you were trying to debug.
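The safe channel in stdio mode is stderr, which the client never parses:

```ts
// Safe: stderr is not part of the JSON-RPC stream.
process.stderr.write("debug: handling tools/call\n");

// console.error also writes to stderr in Node.
console.error("debug: handling tools/call");
```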
The Wrapper Pattern
Can't patch internals. Can't hook HTTP. Can't use stdout. But we can intercept the public API.
When you call server.tool(), the SDK stores the handler internally. If we replace .tool() before any handlers are registered, we wrap every handler transparently:
```ts
const originalTool = server.tool.bind(server);

server.tool = function wrappedTool(name, ...rest) {
  // The handler is the last function-typed argument; descriptions and
  // schemas come before it. If there's no handler, don't touch anything.
  const handlerIndex = rest.findIndex((arg) => typeof arg === "function");
  if (handlerIndex === -1) return originalTool(name, ...rest);

  const originalHandler = rest[handlerIndex];
  rest[handlerIndex] = async function wrappedHandler(...args) {
    const span = startToolSpan(name);
    const start = performance.now();
    try {
      const result = await originalHandler(...args);
      endSpanSuccess(span);
      recordMcpToolCall(name, performance.now() - start, "success");
      return result;
    } catch (error) {
      endSpanError(span, error);
      recordMcpToolCall(name, performance.now() - start, "error");
      throw error;
    }
  };
  return originalTool(name, ...rest);
};
```
Same approach for .resource() and .prompt(). The handler is wrapped before it enters the SDK's private map. The SDK never knows. Your code never changes.
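That generalization can be sketched as a loop over the three registration methods, where instrumentHandler is a hypothetical helper applying the same span-and-metrics wrapping shown above:

```ts
for (const method of ["tool", "resource", "prompt"]) {
  const original = server[method].bind(server);
  server[method] = (name, ...rest) => {
    const i = rest.findIndex((arg) => typeof arg === "function");
    if (i !== -1) rest[i] = instrumentHandler(method, name, rest[i]);
    return original(name, ...rest);
  };
}
```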
What you see after
Before: tools/call search — error.
After:
```
invoke_agent orchestrator [client-side]
├── chat gpt-4o 1.2s
├── tools/call calculate 45ms ✅ [server-side]
├── tools/call web-search 2.3s ❌ RateLimitError
├── resources/read file:///config.json 3ms ✅
└── chat gpt-4o 800ms
```
Each operation gets a span with standard OTel attributes:
```
gen_ai.operation.name = "tools/call"
gen_ai.tool.name      = "calculate"
mcp.server.name       = "my-server"
mcp.session.id        = "a1b2c3d4"
network.transport     = "pipe"
```
Any OTel-compatible backend — Jaeger, Grafana Tempo, Datadog, Arize Phoenix — recognizes them without configuration.
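For reference, startToolSpan and the two end helpers from the wrapper example could look roughly like this with @opentelemetry/api (a sketch, not toad-eye's exact internals):

```ts
import { trace, SpanKind, SpanStatusCode, type Span } from "@opentelemetry/api";

const tracer = trace.getTracer("toad-eye");

function startToolSpan(name: string): Span {
  return tracer.startSpan(`tools/call ${name}`, {
    kind: SpanKind.SERVER,
    attributes: {
      "gen_ai.operation.name": "tools/call",
      "gen_ai.tool.name": name,
    },
  });
}

function endSpanSuccess(span: Span) {
  span.setStatus({ code: SpanStatusCode.OK });
  span.end();
}

function endSpanError(span: Span, error: unknown) {
  span.recordException(error as Error);
  span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
  span.end();
}
```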
Metrics — patterns, not just incidents
Spans tell you about individual requests. Metrics tell you about patterns:
| Metric | Type | What it tells you |
|---|---|---|
| gen_ai.mcp.tool.duration | Histogram | Which tools are slow? Latency trending up? |
| gen_ai.mcp.tool.calls | Counter | Which tools are popular? Agent over-using one? |
| gen_ai.mcp.tool.errors | Counter | Which tools are unreliable? What error types? |
| gen_ai.mcp.resource.reads | Counter | Access patterns for resources |
| gen_ai.mcp.session.active | UpDownCounter | How many MCP sessions right now? |
"The search tool has an 8% error rate and P95 latency of 2.3 seconds." That's actionable. "A tool failed" is not.
The STDIO trap
This is the part we learned the hard way.
OpenTelemetry's SDK writes diagnostic messages to stdout by default. In an HTTP server, you'd never notice. In a stdio MCP server, those diagnostics are catastrophic.
OTel SDK initializes → writes "DiagAPI initialized" to stdout → MCP client parses it as JSON → parse fails → connection dead. Before a single tool call.
We spent 3 hours on "why does the connection die when I import toad-eye" before finding this. The fix is ten lines:
```ts
import { diag, DiagLogLevel } from "@opentelemetry/api";

function ensureStdioSafe() {
  // Send every diagnostic level to stderr. DiagConsoleLogger is not enough:
  // its info/debug/verbose levels print to stdout — the JSON-RPC wire.
  const logToStderr = (...args: unknown[]) =>
    process.stderr.write(args.map(String).join(" ") + "\n");
  diag.setLogger(
    { verbose: logToStderr, debug: logToStderr, info: logToStderr, warn: logToStderr, error: logToStderr },
    DiagLogLevel.WARN
  );
}
```
Redirects all OTel diagnostics to stderr. The MCP connection survives.
If you're building any MCP server with any OTel instrumentation: redirect diagnostics to stderr first. Before spans, before metrics, before anything.
Privacy by default
Tool arguments can contain anything. API keys. User data. File contents. Database credentials. The default must be safe:
```ts
toadEyeMiddleware(server); // arguments NOT recorded
```
Opt in explicitly:
```ts
toadEyeMiddleware(server, {
  recordInputs: true,
  recordOutputs: true,
  redactKeys: ["apiKey", "token", "password"],
  maxPayloadSize: 4096,
});
```
Sensitive fields become [REDACTED] in spans. Large payloads get truncated. Compare with tools that record everything by default and leave privacy as "your problem."
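One plausible shape for the redaction step (illustrative, not the library's actual code):

```ts
function sanitize(payload: unknown, redactKeys: string[], maxPayloadSize: number) {
  // Replace sensitive keys during serialization, then cap the size.
  const json = JSON.stringify(payload, (key, value) =>
    redactKeys.includes(key) ? "[REDACTED]" : value
  );
  return json.length > maxPayloadSize
    ? json.slice(0, maxPayloadSize) + "…[TRUNCATED]"
    : json;
}
```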
Context propagation
The most powerful thing: linking client and server traces. One trace tree, complete picture.
HTTP does this with traceparent headers. MCP stdio has no headers. But it has _meta in JSON-RPC params:
```json
{
  "method": "tools/call",
  "params": {
    "name": "calculate",
    "arguments": { "expression": "2+2" },
    "_meta": {
      "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    }
  }
}
```
Our middleware extracts it. When the host injects _meta.traceparent, server spans become children of client spans. When it doesn't, the fallback is graceful: the span starts as a root. No crash, no error. It works with whatever context is available.
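In code, the extraction can be as small as this (a sketch; assumes a W3C trace-context propagator is registered globally, which the Node OTel SDK does by default):

```ts
import { context, propagation, trace } from "@opentelemetry/api";

function startSpanWithClientContext(name: string, meta?: Record<string, string>) {
  // _meta uses the same "traceparent" key as the W3C header, so the
  // default text-map propagator can read the object directly.
  const parentCtx = meta?.traceparent
    ? propagation.extract(context.active(), meta)
    : context.active(); // no _meta → span starts as a root
  return trace.getTracer("toad-eye").startSpan(name, undefined, parentCtx);
}
```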
The landscape
We looked for MCP server-side observability before building this. We couldn't find any — not "nothing good," but nothing at all.
| | Client-side tracing | Server-side middleware | Privacy controls | OTel-native |
|---|---|---|---|---|
| Datadog | ✅ | ❌ | ❌ | ❌ |
| Langfuse | ✅ | ❌ | ❌ | ❌ |
| AgentOps | ✅ | ❌ | ❌ | ❌ |
| toad-eye | ✅ | ✅ | ✅ | ✅ |
Every tool watches the client. Nobody watches the server. We built this because our own bot's MCP tools kept failing and we had no way to diagnose why.
Quick checklist
If you're building or maintaining MCP servers:
- Never console.log in stdio servers — use stderr or a logger
- Redirect OTel diagnostics to stderr before initializing anything
- Don't record tool arguments by default — they may contain secrets
- Use standard span names: tools/call {name}, resources/read {uri}
- Test with both stdio and SSE transports — they break differently
- Check if your MCP host injects _meta.traceparent for trace linking
Implementation: Phase 1: Core middleware · Phase 2: Metrics + privacy · Phase 3: STDIO isolation
Previous articles:
- #4: Your LLM streaming traces are lying to you
- #5: Your AI agent re-sends 80% of your budget every loop
- #6: Your LLM traces are write-only
toad-eye — open-source LLM observability, OTel-native: GitHub · npm
🐸👁️