Last article we opened the MCP black box. One line of middleware, and every tool call gets a span, metrics, and privacy controls. Problem solved.
Except it wasn't. We had traces. We had metrics. But we couldn't see them — no dashboard, no demo, and the most interesting MCP feature was completely invisible.
MCP servers don't just receive tool calls. They can also call LLMs themselves — through a feature called sampling. Your server asks the client's LLM to generate a response. The request goes out. The response comes back. And the trace? Silent.
This article is about the four follow-ups that turned "we have observability" into "here's what it actually looks like — try it yourself in 5 minutes."
The missing piece: sampling
Most MCP tutorials show tools as pure functions. Input goes in, output comes out. But the MCP spec has a feature called sampling/createMessage — the server can ask the client to run an LLM call on its behalf.
Why? Because MCP servers don't have API keys. They don't talk to OpenAI directly. But sometimes a tool needs an LLM — to summarize a document, to classify an input, to decide the next step. Sampling lets the server delegate back to the client:
[Agent / Client]                        [MCP Server]
invoke_agent orchestrator
├── chat gpt-4o →
├── tools/call summarize →              receives tool call
│                                       needs LLM to summarize
│       ← sampling/createMessage
│   chat gpt-4o (sampling) →
│                                       ← gets LLM response
│                                       returns tool result
│   ←
└── chat gpt-4o →
The tool call triggers an LLM call which is invisible in the trace. The middleware from article #7 traces tools/call summarize — but the sampling call inside it? Ghost. No span, no duration, no model name. A 2-second tool call where 1.8 seconds was the LLM and 200ms was the actual tool logic — and you can't tell.
traceSampling() — manual wrap for an unwrappable call
Sampling can't be auto-intercepted. The .tool() wrapper catches handler registration — but ctx.mcpReq.requestSampling() is a method call inside the handler body. There's no registration to intercept.
So we made it explicit:
import { toadEyeMiddleware, traceSampling } from "toad-eye/mcp";
const server = new McpServer({ name: "my-server", version: "1.0.0" });
toadEyeMiddleware(server);
server.tool("summarize", { text: z.string() }, async ({ text }, ctx) => {
// traceSampling wraps the sampling call with an OTel span
const response = await traceSampling(
() => ctx.mcpReq.requestSampling({
messages: [{ role: "user", content: { type: "text", text } }],
maxTokens: 500,
}),
{ model: "gpt-4o", maxTokens: 500 }
);
return {
content: [{ type: "text", text: response.content.text }],
};
});
The traceSampling() wrapper creates a chat {model} span — SpanKind.CLIENT, because the server is requesting an LLM call from the client. The span captures model, duration, and status:
startSamplingSpan("gpt-4o") → span "chat gpt-4o"
gen_ai.operation.name = "chat"
gen_ai.request.model = "gpt-4o"
gen_ai.mcp.sampling.duration_ms = 1834
mcp.server.name = "my-server"
SpanKind = CLIENT
Now the trace tells the full story:
tools/call summarize 2.1s
└── chat gpt-4o (sampling) 1.8s ← this was invisible before
The tool took 2.1 seconds. 1.8 of those were the LLM call. 300ms was the actual summarization logic. Without this span, you'd optimize the wrong thing.
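The wrapper itself is small. Here is the pattern in isolation, a sketch with the tracer reduced to a minimal interface; the real traceSampling() builds its span through the OTel API (startActiveSpan, SpanKind.CLIENT), so treat the names below as illustrative rather than toad-eye's actual internals:

```typescript
// Sketch of the pattern behind traceSampling(), with the tracer reduced to a
// minimal interface so the timing/status logic stands alone. Illustrative
// only — not toad-eye's actual implementation.
interface SpanLike {
  setAttribute(key: string, value: string | number): void;
  setStatus(status: { code: "OK" | "ERROR"; message?: string }): void;
  end(): void;
}

async function traceSamplingSketch<T>(
  fn: () => Promise<T>,
  opts: { model: string },
  span: SpanLike,
): Promise<T> {
  const start = Date.now();
  // GenAI semconv attributes, matching the span dump above
  span.setAttribute("gen_ai.operation.name", "chat");
  span.setAttribute("gen_ai.request.model", opts.model);
  try {
    const result = await fn();
    span.setStatus({ code: "OK" });
    return result;
  } catch (err) {
    // a failed sampling call still produces a span, marked as an error
    span.setStatus({ code: "ERROR", message: String(err) });
    throw err;
  } finally {
    span.setAttribute("gen_ai.mcp.sampling.duration_ms", Date.now() - start);
    span.end();
  }
}
```

The useful property: the span ends in the finally block, so even a sampling call that throws still leaves a timed, error-flagged span in the trace.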
The Grafana dashboard — from metrics to answers
Having metrics in Prometheus is step one. Knowing what to ask is step two. We built an MCP Server dashboard that answers the questions you actually have:
Top row — the four numbers
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Tool Call │ Avg Tool │ Error Rate │ Resource │
│ Rate │ Duration │ │ Reads │
│ 12.4 req/s │ 45.2 ms │ 2.3% │ 3.1 req/s │
└──────────────┴──────────────┴──────────────┴──────────────┘
Four stats. Glanceable. Green/yellow/red thresholds. If the error rate is red — you know immediately.
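The stat panels are plain PromQL over the same metrics the rest of the article uses. For example, the average-duration stat is the classic histogram mean, sum over count (a sketch; the exact panel queries live in the provisioned dashboard JSON):

```promql
# Sketch: mean tool duration across all tools (histogram _sum / _count).
# Assumes the Prometheus-exported metric names shown in this article.
sum(rate(gen_ai_mcp_tool_duration_sum[5m]))
/
sum(rate(gen_ai_mcp_tool_duration_count[5m]))
```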
Middle row — the trends
Tool Call Rate by Tool — timeseries, broken down by gen_ai_tool_name:
sum by (gen_ai_tool_name) (rate(gen_ai_mcp_tool_calls_total[5m]))
Is your agent hammering one tool? Is traffic shifting from search to calculate over time? The line chart shows the pattern.
Tool Duration p50 / p95 — two lines per tool:
histogram_quantile(0.50, sum by (gen_ai_tool_name, le) (
rate(gen_ai_mcp_tool_duration_bucket[5m])
))
histogram_quantile(0.95, sum by (gen_ai_tool_name, le) (
rate(gen_ai_mcp_tool_duration_bucket[5m])
))
When your search tool's P95 jumps from 200ms to 2 seconds — you see it before users complain.
Bottom row — errors and resources
Errors by Tool — stacked bars by tool + error type:
sum by (gen_ai_tool_name, error_type) (rate(gen_ai_mcp_tool_errors_total[5m]))
Not just "errors are up" — which tool and what kind. RateLimitError on search? ValidationError on calculate? The stacked bars tell you instantly.
Resource Reads by URI — which resources are popular:
sum by (gen_ai_data_source_id) (rate(gen_ai_mcp_resource_reads_total[5m]))
The table — one-screen overview
The bottom is a table that merges all metrics per tool:
| Tool | Call Rate (req/s) | Avg Duration (ms) | p95 Duration (ms) | Error Rate |
|---|---|---|---|---|
| calculate | 8.2 | 12.3 | 24.1 | 0% |
| get-weather | 3.1 | 145.2 | 312.8 | 3.2% |
| search | 1.1 | 890.4 | 2,134 | 8.7% |
Error rate cells are color-coded: green below 5%, yellow 5-10%, red above 10%. You see the problem tool in one glance.
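The error-rate column itself is just a ratio of the two counters, per tool (a sketch using the same metric names):

```promql
# Sketch: per-tool error rate as a fraction of calls (0.087 = 8.7%)
sum by (gen_ai_tool_name) (rate(gen_ai_mcp_tool_errors_total[5m]))
/
sum by (gen_ai_tool_name) (rate(gen_ai_mcp_tool_calls_total[5m]))
```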
The dashboard is auto-provisioned — npx toad-eye init scaffolds it into your infra/toad-eye/grafana/dashboards/ directory. No manual Grafana setup.
The demo server — try it yourself
Theory is nice. Running code is better.
# 1. Start the observability stack
npx toad-eye up
# 2. Run the demo MCP server via MCP Inspector
npx @modelcontextprotocol/inspector npx tsx demo/src/mcp-server/index.ts
MCP Inspector opens in your browser. You see three tools:
- calculate — safe math evaluation (2 + 2 * 3 → 8)
- get-weather — mock weather API with simulated latency (50-250ms)
- timestamp — current time in multiple formats
Plus a resource (server-info) and a prompt (weather-report).
Call a few tools. Then open the dashboards:
- Jaeger http://localhost:16686 — find service toad-eye-mcp-demo, see individual spans
- Grafana http://localhost:3100 — MCP Server dashboard, see the metrics in aggregate
- Prometheus http://localhost:9090 — raw queries, autocomplete gen_ai_mcp
The demo server is intentionally simple — three tools, mock data, no external dependencies. The point isn't the tools. The point is seeing what the observability looks like in practice.
Here's the heart of the server (the timestamp tool, resource, and prompt are omitted for brevity) — roughly 50 lines of actual logic:
import { initObservability } from "toad-eye";
import { toadEyeMiddleware } from "toad-eye/mcp";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
initObservability({
serviceName: "toad-eye-mcp-demo",
endpoint: "http://localhost:4318",
});
const server = new McpServer({ name: "toad-eye-mcp-demo", version: "1.0.0" });
toadEyeMiddleware(server, { recordInputs: true, recordOutputs: true });
server.tool(
"calculate",
"Evaluate a math expression",
{ expression: z.string() },
async ({ expression }) => {
const sanitized = expression.replace(/[^0-9+\-*/().% ]/g, "");
if (sanitized !== expression) {
throw new Error(`Invalid characters in expression: ${expression}`);
}
const result = new Function(`return (${sanitized})`)() as number;
return { content: [{ type: "text", text: `${expression} = ${result}` }] };
},
);
server.tool(
"get-weather",
"Get current weather for a city (mock data)",
{ city: z.string() },
async ({ city }) => {
await new Promise((r) => setTimeout(r, 50 + Math.random() * 200));
const conditions = ["sunny", "cloudy", "rainy", "snowy", "windy"] as const;
const condition = conditions[Math.floor(Math.random() * conditions.length)]!;
const tempC = Math.round(-10 + Math.random() * 45);
return {
content: [{ type: "text", text: JSON.stringify({ city, condition, tempC }) }],
};
},
);
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("MCP demo server running — open Grafana at http://localhost:3100");
Notice console.error on the last line — not console.log. Because stdout is the JSON-RPC wire. We learned this the hard way (article #7).
Metrics in the public API
One detail that bit us: MCP metrics existed in code but were invisible to library users. The GEN_AI_METRICS constant — the public interface for all toad-eye metric names — didn't include MCP metrics. Users writing custom dashboards or alerts had no way to discover them.
Fixed:
export const GEN_AI_METRICS = {
// ... existing LLM metrics ...
// MCP Server
MCP_TOOL_DURATION: "gen_ai.mcp.tool.duration",
MCP_TOOL_CALLS: "gen_ai.mcp.tool.calls",
MCP_TOOL_ERRORS: "gen_ai.mcp.tool.errors",
MCP_RESOURCE_READS: "gen_ai.mcp.resource.reads",
MCP_SESSION_ACTIVE: "gen_ai.mcp.session.active",
} as const;
Now you can reference GEN_AI_METRICS.MCP_TOOL_DURATION in your code instead of hardcoding the string "gen_ai.mcp.tool.duration". Small thing, but it's the difference between a library and a collection of code.
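One practical note: the constant holds the OTel metric name, while the Prometheus queries above use the exporter-normalized form (dots become underscores, counters gain a _total suffix). A tiny helper can bridge the two; this is a sketch assuming the standard OTLP-to-Prometheus normalization, and promName is a hypothetical helper, not part of toad-eye:

```typescript
// Sketch: map an OTel metric name (as stored in GEN_AI_METRICS) to the
// series name you'd query in Prometheus. Assumes the default exporter
// normalization: dots -> underscores, counters get a "_total" suffix;
// histograms keep the base name (Prometheus adds _bucket/_sum/_count).
const GEN_AI_METRICS = {
  MCP_TOOL_CALLS: "gen_ai.mcp.tool.calls",
  MCP_TOOL_DURATION: "gen_ai.mcp.tool.duration",
} as const;

function promName(otelName: string, kind: "counter" | "histogram"): string {
  const base = otelName.replace(/\./g, "_");
  return kind === "counter" ? `${base}_total` : base;
}

console.log(promName(GEN_AI_METRICS.MCP_TOOL_CALLS, "counter"));
// -> "gen_ai_mcp_tool_calls_total", the name used in the dashboard queries
```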
Session tracking was also added — MCP_SESSION_ACTIVE is an UpDownCounter that increments when middleware initializes. In a multi-server deployment, you can see how many MCP sessions are active across your fleet.
The full trace tree
With all four follow-ups shipped, here's what a complete MCP trace looks like:
invoke_agent orchestrator [client process]
├── chat gpt-4o 1.2s client LLM call
├── tools/call calculate 12ms ✅ [MCP server]
├── tools/call get-weather 187ms ✅ [MCP server]
├── tools/call summarize 2.1s ✅ [MCP server]
│ └── chat gpt-4o (sampling) 1.8s server → client LLM
├── resources/read toad-eye://info 3ms ✅ [MCP server]
└── chat gpt-4o 800ms client LLM call
Client-side agent spans. Server-side tool spans. Server-initiated LLM spans. One trace, complete story. From "the agent decided to call a tool" to "the tool asked the LLM for help" to "the result came back" — every step is visible.
This is what MCP observability looks like when it's done.
Quick start — 5 minutes to your first MCP trace
# Install toad-eye (if not already)
npm install toad-eye
# Start the stack
npx toad-eye init
npx toad-eye up
# Run the demo MCP server with Inspector
npx @modelcontextprotocol/inspector npx tsx demo/src/mcp-server/index.ts
# Call some tools in Inspector, then check:
# Jaeger: http://localhost:16686 (service: toad-eye-mcp-demo)
# Grafana: http://localhost:3100 (dashboard: MCP Server)
# Prometheus: http://localhost:9090 (query: gen_ai_mcp_tool_calls_total)
Five minutes. Real traces. Real metrics. Real dashboard. No mock data.
Implementation: Follow-up 1: Demo · Follow-up 2: Dashboard · Follow-up 3: Sampling · Follow-up 4: Metrics API
Previous articles:
- #5: Your AI agent is re-sending 80% of your budget every loop
- #6: Your LLM traces are write-only
- #7: MCP servers are a black box
toad-eye — open-source LLM observability, OTel-native: GitHub · npm
🐸👁️