Last article we opened the MCP black box. One line of middleware, and every tool call gets a span, metrics, and privacy controls. Problem solved.
Except it wasn't. We had traces. We had metrics. But we couldn't see them — no dashboard, no demo, and the most interesting MCP feature was completely invisible.
MCP servers don't just receive tool calls. They can also call LLMs themselves — through a feature called sampling. Your server asks the client's LLM to generate a response. The request goes out. The response comes back. And the trace? Silent.
This article is about the four follow-ups that turned "we have observability" into "here's what it actually looks like — try it yourself in 5 minutes."
The missing piece: sampling
Most MCP tutorials show tools as pure functions. Input goes in, output comes out. But the MCP spec has a feature called sampling/createMessage — the server can ask the client to run an LLM call on its behalf.
Why? Because MCP servers don't have API keys. They don't talk to OpenAI directly. But sometimes a tool needs an LLM — to summarize a document, to classify an input, to decide the next step. Sampling lets the server delegate back to the client:
[Agent / Client]                        [MCP Server]
invoke_agent orchestrator
├── chat gpt-4o →
├── tools/call summarize →              receives tool call
│                                       needs LLM to summarize
│       ← sampling/createMessage
│   chat gpt-4o (sampling) →
│                                       ← gets LLM response
│                                       returns tool result
│   ←
└── chat gpt-4o →
The tool call triggers an LLM call which is invisible in the trace. The middleware from article #7 traces tools/call summarize — but the sampling call inside it? Ghost. No span, no duration, no model name. A 2-second tool call where 1.8 seconds was the LLM and 200ms was the actual tool logic — and you can't tell.
traceSampling() — manual wrap for an unwrappable call
Sampling can't be auto-intercepted. The .tool() wrapper catches handler registration — but ctx.mcpReq.requestSampling() is a method call inside the handler body. There's no registration to intercept.
So we made it explicit:
import { toadEyeMiddleware, traceSampling } from "toad-eye/mcp";
const server = new McpServer({ name: "my-server", version: "1.0.0" });
toadEyeMiddleware(server);
server.tool("summarize", { text: z.string() }, async ({ text }, ctx) => {
// traceSampling wraps the sampling call with an OTel span
const response = await traceSampling(
() => ctx.mcpReq.requestSampling({
messages: [{ role: "user", content: { type: "text", text } }],
maxTokens: 500,
}),
{ model: "gpt-4o", maxTokens: 500 }
);
return {
content: [{ type: "text", text: response.content.text }],
};
});
The traceSampling() wrapper creates a chat {model} span — SpanKind.CLIENT, because the server is requesting an LLM call from the client. The span captures model, duration, and status:
startSamplingSpan("gpt-4o") → span "chat gpt-4o"
gen_ai.operation.name = "chat"
gen_ai.request.model = "gpt-4o"
gen_ai.mcp.sampling.duration_ms = 1834
mcp.server.name = "my-server"
SpanKind = CLIENT
Now the trace tells the full story:
tools/call summarize 2.1s
└── chat gpt-4o (sampling) 1.8s ← this was invisible before
The tool took 2.1 seconds. 1.8 of those were the LLM call. 300ms was the actual summarization logic. Without this span, you'd optimize the wrong thing.
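The wrapper itself is small. Here is the pattern in isolation, a sketch with the tracer reduced to a minimal interface; the real traceSampling() builds its span through the OTel API (startActiveSpan, SpanKind.CLIENT), so treat the names below as illustrative rather than toad-eye's actual internals:

```typescript
// Sketch of the pattern behind traceSampling(), with the tracer reduced to a
// minimal interface so the timing/status logic stands alone. Illustrative
// only — not toad-eye's actual implementation.
interface SpanLike {
  setAttribute(key: string, value: string | number): void;
  setStatus(status: { code: "OK" | "ERROR"; message?: string }): void;
  end(): void;
}

async function traceSamplingSketch<T>(
  fn: () => Promise<T>,
  opts: { model: string },
  span: SpanLike,
): Promise<T> {
  const start = Date.now();
  // GenAI semconv attributes, matching the span dump above
  span.setAttribute("gen_ai.operation.name", "chat");
  span.setAttribute("gen_ai.request.model", opts.model);
  try {
    const result = await fn();
    span.setStatus({ code: "OK" });
    return result;
  } catch (err) {
    // a failed sampling call still produces a span, marked as an error
    span.setStatus({ code: "ERROR", message: String(err) });
    throw err;
  } finally {
    span.setAttribute("gen_ai.mcp.sampling.duration_ms", Date.now() - start);
    span.end();
  }
}
```

The useful property: the span ends in the finally block, so even a sampling call that throws still leaves a timed, error-flagged span in the trace.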
The Grafana dashboard — from metrics to answers
Having metrics in Prometheus is step one. Knowing what to ask is step two. We built an MCP Server dashboard that answers the questions you actually have:
Top row — the four numbers
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Tool Call │ Avg Tool │ Error Rate │ Resource │
│ Rate │ Duration │ │ Reads │
│ 12.4 req/s │ 45.2 ms │ 2.3% │ 3.1 req/s │
└──────────────┴──────────────┴──────────────┴──────────────┘
Four stats. Glanceable. Green/yellow/red thresholds. If the error rate is red — you know immediately.
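The stat panels are plain PromQL over the same metrics the rest of the article uses. For example, the average-duration stat is the classic histogram mean, sum over count (a sketch; the exact panel queries live in the provisioned dashboard JSON):

```promql
# Sketch: mean tool duration across all tools (histogram _sum / _count).
# Assumes the Prometheus-exported metric names shown in this article.
sum(rate(gen_ai_mcp_tool_duration_sum[5m]))
/
sum(rate(gen_ai_mcp_tool_duration_count[5m]))
```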
Middle row — the trends
Tool Call Rate by Tool — timeseries, broken down by gen_ai_tool_name:
sum by (gen_ai_tool_name) (rate(gen_ai_mcp_tool_calls_total[5m]))
Is your agent hammering one tool? Is traffic shifting from search to calculate over time? The line chart shows the pattern.
Tool Duration p50 / p95 — two lines per tool:
histogram_quantile(0.50, sum by (gen_ai_tool_name, le) (
rate(gen_ai_mcp_tool_duration_bucket[5m])
))
histogram_quantile(0.95, sum by (gen_ai_tool_name, le) (
rate(gen_ai_mcp_tool_duration_bucket[5m])
))
When your search tool's P95 jumps from 200ms to 2 seconds — you see it before users complain.
Bottom row — errors and resources
Errors by Tool — stacked bars by tool + error type:
sum by (gen_ai_tool_name, error_type) (rate(gen_ai_mcp_tool_errors_total[5m]))
Not just "errors are up" — which tool and what kind. RateLimitError on search? ValidationError on calculate? The stacked bars tell you instantly.
Resource Reads by URI — which resources are popular:
sum by (gen_ai_data_source_id) (rate(gen_ai_mcp_resource_reads_total[5m]))
The table — one-screen overview
The bottom is a table that merges all metrics per tool:
| Tool | Call Rate (req/s) | Avg Duration (ms) | p95 Duration (ms) | Error Rate |
|---|---|---|---|---|
| calculate | 8.2 | 12.3 | 24.1 | 0% |
| get-weather | 3.1 | 145.2 | 312.8 | 3.2% |
| search | 1.1 | 890.4 | 2,134 | 8.7% |
Error rate cells are color-coded: green below 5%, yellow 5-10%, red above 10%. You see the problem tool in one glance.
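The error-rate column itself is just a ratio of the two counters, per tool (a sketch using the same metric names):

```promql
# Sketch: per-tool error rate as a fraction of calls (0.087 = 8.7%)
sum by (gen_ai_tool_name) (rate(gen_ai_mcp_tool_errors_total[5m]))
/
sum by (gen_ai_tool_name) (rate(gen_ai_mcp_tool_calls_total[5m]))
```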
The dashboard is auto-provisioned — npx toad-eye init scaffolds it into your infra/toad-eye/grafana/dashboards/ directory. No manual Grafana setup.
The demo server — try it yourself
Theory is nice. Running code is better.
# 1. Start the observability stack
npx toad-eye up
# 2. Run the demo MCP server via MCP Inspector
npx @modelcontextprotocol/inspector npx tsx demo/src/mcp-server/index.ts
MCP Inspector opens in your browser. You see three tools:
- calculate — safe math evaluation (2 + 2 * 3 → 8)
- get-weather — mock weather API with simulated latency (50-250ms)
- timestamp — current time in multiple formats
Plus a resource (server-info) and a prompt (weather-report).
Call a few tools. Then open the dashboards:
- Jaeger http://localhost:16686 — find service toad-eye-mcp-demo, see individual spans
- Grafana http://localhost:3100 — MCP Server dashboard, see the metrics in aggregate
- Prometheus http://localhost:9090 — raw queries, autocomplete gen_ai_mcp
The demo server is intentionally simple — three tools, mock data, no external dependencies. The point isn't the tools. The point is seeing what the observability looks like in practice.
Here's the heart of the server (the timestamp tool, resource, and prompt are omitted for brevity) — roughly 50 lines of actual logic:
import { initObservability } from "toad-eye";
import { toadEyeMiddleware } from "toad-eye/mcp";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
initObservability({
serviceName: "toad-eye-mcp-demo",
endpoint: "http://localhost:4318",
});
const server = new McpServer({ name: "toad-eye-mcp-demo", version: "1.0.0" });
toadEyeMiddleware(server, { recordInputs: true, recordOutputs: true });
server.tool(
"calculate",
"Evaluate a math expression",
{ expression: z.string() },
async ({ expression }) => {
const sanitized = expression.replace(/[^0-9+\-*/().% ]/g, "");
if (sanitized !== expression) {
throw new Error(`Invalid characters in expression: ${expression}`);
}
const result = new Function(`return (${sanitized})`)() as number;
return { content: [{ type: "text", text: `${expression} = ${result}` }] };
},
);
server.tool(
"get-weather",
"Get current weather for a city (mock data)",
{ city: z.string() },
async ({ city }) => {
await new Promise((r) => setTimeout(r, 50 + Math.random() * 200));
const conditions = ["sunny", "cloudy", "rainy", "snowy", "windy"] as const;
const condition = conditions[Math.floor(Math.random() * conditions.length)]!;
const tempC = Math.round(-10 + Math.random() * 45);
return {
content: [{ type: "text", text: JSON.stringify({ city, condition, tempC }) }],
};
},
);
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("MCP demo server running — open Grafana at http://localhost:3100");
Notice console.error on the last line — not console.log. Because stdout is the JSON-RPC wire. We learned this the hard way (article #7).
Metrics in the public API
One detail that bit us: MCP metrics existed in code but were invisible to library users. The GEN_AI_METRICS constant — the public interface for all toad-eye metric names — didn't include MCP metrics. Users writing custom dashboards or alerts had no way to discover them.
Fixed:
export const GEN_AI_METRICS = {
// ... existing LLM metrics ...
// MCP Server
MCP_TOOL_DURATION: "gen_ai.mcp.tool.duration",
MCP_TOOL_CALLS: "gen_ai.mcp.tool.calls",
MCP_TOOL_ERRORS: "gen_ai.mcp.tool.errors",
MCP_RESOURCE_READS: "gen_ai.mcp.resource.reads",
MCP_SESSION_ACTIVE: "gen_ai.mcp.session.active",
} as const;
Now you can reference GEN_AI_METRICS.MCP_TOOL_DURATION in your code instead of hardcoding the string "gen_ai.mcp.tool.duration". Small thing, but it's the difference between a library and a collection of code.
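One practical note: the constant holds the OTel metric name, while the Prometheus queries above use the exporter-normalized form (dots become underscores, counters gain a _total suffix). A tiny helper can bridge the two; this is a sketch assuming the standard OTLP-to-Prometheus normalization, and promName is a hypothetical helper, not part of toad-eye:

```typescript
// Sketch: map an OTel metric name (as stored in GEN_AI_METRICS) to the
// series name you'd query in Prometheus. Assumes the default exporter
// normalization: dots -> underscores, counters get a "_total" suffix;
// histograms keep the base name (Prometheus adds _bucket/_sum/_count).
const GEN_AI_METRICS = {
  MCP_TOOL_CALLS: "gen_ai.mcp.tool.calls",
  MCP_TOOL_DURATION: "gen_ai.mcp.tool.duration",
} as const;

function promName(otelName: string, kind: "counter" | "histogram"): string {
  const base = otelName.replace(/\./g, "_");
  return kind === "counter" ? `${base}_total` : base;
}

console.log(promName(GEN_AI_METRICS.MCP_TOOL_CALLS, "counter"));
// -> "gen_ai_mcp_tool_calls_total", the name used in the dashboard queries
```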
Session tracking was also added — MCP_SESSION_ACTIVE is an UpDownCounter that increments when middleware initializes. In a multi-server deployment, you can see how many MCP sessions are active across your fleet.
The full trace tree
With all four follow-ups shipped, here's what a complete MCP trace looks like:
invoke_agent orchestrator [client process]
├── chat gpt-4o 1.2s client LLM call
├── tools/call calculate 12ms ✅ [MCP server]
├── tools/call get-weather 187ms ✅ [MCP server]
├── tools/call summarize 2.1s ✅ [MCP server]
│ └── chat gpt-4o (sampling) 1.8s server → client LLM
├── resources/read toad-eye://info 3ms ✅ [MCP server]
└── chat gpt-4o 800ms client LLM call
Client-side agent spans. Server-side tool spans. Server-initiated LLM spans. One trace, complete story. From "the agent decided to call a tool" to "the tool asked the LLM for help" to "the result came back" — every step is visible.
This is what MCP observability looks like when it's done.
Quick start — 5 minutes to your first MCP trace
# Install toad-eye (if not already)
npm install toad-eye
# Start the stack
npx toad-eye init
npx toad-eye up
# Run the demo MCP server with Inspector
npx @modelcontextprotocol/inspector npx tsx demo/src/mcp-server/index.ts
# Call some tools in Inspector, then check:
# Jaeger: http://localhost:16686 (service: toad-eye-mcp-demo)
# Grafana: http://localhost:3100 (dashboard: MCP Server)
# Prometheus: http://localhost:9090 (query: gen_ai_mcp_tool_calls_total)
Five minutes. Real traces. Real metrics. Real dashboard. No mock data.
Implementation: Follow-up 1: Demo · Follow-up 2: Dashboard · Follow-up 3: Sampling · Follow-up 4: Metrics API
Previous articles:
- #5: Your AI agent is re-sending 80% of your budget every loop
- #6: Your LLM traces are write-only
- #7: MCP servers are a black box
toad-eye — open-source LLM observability, OTel-native: GitHub · npm
🐸👁️