Tracing Tool Calls in MCP Workflows: Per-Tool Latency, Cost, and Failure Modes

MCP agents are opaque by default. You call generate(), the model decides to invoke four tools, and three seconds later you get an answer. Which tool took 2.8 of those seconds? Was it the database lookup? The file read? The search API that's been flaky since Tuesday?

Without per-tool tracing, you're debugging a black box.

This article builds on three earlier pieces — AI Observability: Logging, Tracing, and Monitoring Your AI Calls, Building AI Agents with MCP and TypeScript in 2026, and OpenTelemetry for AI: Tracing Every Token Through Your Pipeline — and goes one level deeper: tracing individual tool invocations in multi-step MCP workflows, not just the outer LLM call.


Why Tool-Level Tracing Matters

A typical MCP workflow looks like this:

User prompt
  → search (github.search_code)         ~200ms
  → read × 3 (github.read_file)         ~150ms × 3
  → analyze (code-analyzer.check)       ~2,800ms  ← the real bottleneck
  → write (github.create_issue)         ~300ms
Total: ~3,750ms

If you only trace the outer generate() call, you see 3.75 seconds and conclude "the LLM is slow." It isn't. The analyzer tool is slow, and it's probably doing something fixable — a missing index, a cold lambda, an API without keepalive.

A "slow agent" is almost always one slow tool. The rest is noise.


The Tracing Architecture

The pattern has three parts:

  1. NeuroLink's transformParams hook — fires before each tool invocation, starts an OTEL span
  2. NeuroLink's transformResponse hook — fires after each tool returns, ends the span with latency + outcome
  3. onError hook — fires on tool failures, records the error type and whether it's retryable

You wrap the MCP tool layer with spans, not the entire generate() call. This gives you one span per tool call, nested under the parent trace.


Setup: OpenTelemetry + NeuroLink

npm install @juspay/neurolink
npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http

Initialize your tracer once at startup:

// tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { trace, SpanStatusCode } from "@opentelemetry/api";

export const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
  }),
});

sdk.start();

export const tracer = trace.getTracer("neurolink-mcp-agent", "1.0.0");
export { SpanStatusCode };
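
One operational detail: the OTLP exporter batches spans, so spans recorded just before the process exits can be dropped unless the SDK is flushed. A small addition to tracing.ts covers this:

// tracing.ts (continued)
// Flush any buffered spans before exit so the final tool spans
// of an agent run actually reach the collector.
process.on("SIGTERM", () => {
  sdk.shutdown().catch((err) => console.error("OTEL shutdown failed", err));
});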

Wrapping Each Tool Call in a Span

NeuroLink's middleware system exposes transformParams and transformResponse. For MCP tool tracing, you write middleware that instruments the tool dispatch layer.

Here's the core pattern:

import { tracer, SpanStatusCode } from "./tracing";
import { context } from "@opentelemetry/api";

// Track in-flight spans by tool call ID so we can close them on response
const activeSpans = new Map<string, ReturnType<typeof tracer.startSpan>>();

const mcpTracingMiddleware = {
  name: "mcp-tool-tracer",
  priority: 115, // Runs before analytics, after transform

  transformParams: async (params: any, ctx: any) => {
    // Only instrument MCP tool calls, not direct generate() calls
    if (!params.toolCall) return params;

    const toolName = params.toolCall.name;
    const serverId = params.toolCall.serverId ?? "unknown";
    const callId = params.toolCall.id ?? `${toolName}-${Date.now()}`;

    // Start a span as a child of the current active trace
    const parentContext = ctx.traceContext ?? context.active();
    const span = tracer.startSpan(
      `mcp.tool.${toolName}`,
      {
        attributes: {
          "mcp.tool.name": toolName,
          "mcp.server.id": serverId,
          "mcp.call.id": callId,
          "mcp.input.size": JSON.stringify(params.toolCall.input ?? {}).length,
        },
      },
      parentContext
    );

    activeSpans.set(callId, span);

    // Attach call start time for latency calculation
    return { ...params, _spanCallId: callId, _spanStartMs: Date.now() };
  },

  transformResponse: async (response: any, params: any) => {
    const callId = params._spanCallId;
    if (!callId) return response;

    const span = activeSpans.get(callId);
    if (!span) return response;

    const latencyMs = Date.now() - (params._spanStartMs ?? Date.now());

    span.setAttributes({
      "mcp.latency_ms": latencyMs,
      "mcp.output.size": JSON.stringify(response?.result ?? {}).length,
      "mcp.success": true,
    });
    span.setStatus({ code: SpanStatusCode.OK });
    span.end();
    activeSpans.delete(callId);

    return response;
  },
};

Handling Failure Modes

MCP tool failures come in three distinct flavors, and they need different handling:

const mcpErrorMiddleware = {
  name: "mcp-error-classifier",
  onError: (error: Error, metadata: any) => {
    const callId = metadata._spanCallId;
    const span = callId ? activeSpans.get(callId) : undefined;

    // Classify the failure
    let errorClass: string;
    let retryable: boolean;

    if (error.message.includes("ETIMEDOUT") || error.message.includes("timeout")) {
      // Tool server didn't respond in time
      errorClass = "timeout";
      retryable = true;
    } else if (error.message.includes("rate_limit") || error.message.includes("429")) {
      // Tool's upstream API is rate-limited
      errorClass = "rate_limit";
      retryable = true; // After backoff
    } else if (
      error.message.includes("SyntaxError") ||
      error.message.includes("JSON") ||
      error.message.includes("schema")
    ) {
      // Tool returned malformed output
      // This is almost always a tool implementation bug, not a transient error
      errorClass = "malformed_output";
      retryable = false;
    } else {
      errorClass = "unknown";
      retryable = metadata.recoverable ?? false;
    }

    if (span) {
      span.setAttributes({
        "mcp.error.class": errorClass,
        "mcp.error.message": error.message,
        "mcp.error.retryable": retryable,
        "mcp.latency_ms": Date.now() - (metadata._spanStartMs ?? Date.now()),
      });
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: `${errorClass}: ${error.message}`,
      });
      span.end();

      if (callId) activeSpans.delete(callId);
    }

    // Don't swallow — NeuroLink re-throws after onError fires
    console.error("mcp_tool_error", {
      tool: metadata.toolCall?.name,
      errorClass,
      retryable,
      message: error.message,
    });
  },
};

The malformed output case deserves special attention. When a tool returns JSON that doesn't match the schema the LLM expected, the model may retry the tool call silently — or hallucinate an answer without the data. Both are bad. Tag these errors distinctly so you can find them in your trace backend.
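
You can also catch malformed output before the model sees it by validating tool results in a transformResponse middleware. Here's a sketch using zod; the analyzer schema is a made-up example, and it reuses the activeSpans map from the tracing middleware above:

import { z } from "zod";

// Hypothetical output schema for the code-analyzer tool.
const analyzerOutputSchema = z.object({
  findings: z.array(z.object({ file: z.string(), message: z.string() })),
});

const schemaCheckMiddleware = {
  name: "mcp-schema-check",
  transformResponse: async (response: any, params: any) => {
    if (params.toolCall?.name !== "code-analyzer.check") return response;

    const parsed = analyzerOutputSchema.safeParse(response?.result);
    if (!parsed.success) {
      // Tag the open span so malformed output is queryable in the trace
      // backend, instead of surfacing as a silent retry or a hallucination.
      activeSpans.get(params._spanCallId)?.setAttributes({
        "mcp.error.class": "malformed_output",
        "mcp.schema.issues": parsed.error.issues.length,
      });
    }
    return response;
  },
};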


Trace ID Propagation Across the Agent Run

Each generate() call should share a single trace ID so you can reconstruct the full tool chain in your trace viewer. Pass the trace context explicitly:

import { NeuroLink } from "@juspay/neurolink";
import { tracer, SpanStatusCode } from "./tracing";
import { context, trace } from "@opentelemetry/api";

async function runAgent(prompt: string, agentRunId: string) {
  // Create the root span for the entire agent run
  const rootSpan = tracer.startSpan("agent.run", {
    attributes: {
      "agent.run_id": agentRunId,
      "agent.prompt_length": prompt.length,
    },
  });

  // All tool spans created during this generate() will be children of rootSpan
  const agentContext = trace.setSpan(context.active(), rootSpan);

  const ai = new NeuroLink({
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    apiKey: process.env.ANTHROPIC_API_KEY,
    middleware: [mcpTracingMiddleware, mcpErrorMiddleware],
  });

  await ai.addExternalMCPServer("github", {
    command: "npx",
    args: ["-y", "@modelcontextprotocol/server-github"],
    transport: "stdio",
    env: { GITHUB_TOKEN: process.env.GITHUB_TOKEN },
  });

  let result: any;

  await context.with(agentContext, async () => {
    result = await ai.generate({
      input: { text: prompt },
    });
  });

  rootSpan.setAttributes({
    "agent.tool_calls": result?.toolCalls?.length ?? 0,
    "agent.tokens.total": result?.usage?.totalTokens ?? 0,
    "agent.success": true,
  });
  rootSpan.setStatus({ code: SpanStatusCode.OK });
  rootSpan.end();

  return result;
}

In Jaeger, Grafana Tempo, or Honeycomb, this renders as a flame graph: the root span contains the LLM call, the LLM call contains all the tool spans, each tool span shows latency and outcome. The slow tool stands out immediately.


What to Track Per Tool

Once you have spans, you want these attributes on every tool call:

  • mcp.tool.name: which tool was invoked; the primary grouping key
  • mcp.server.id: which MCP server served the call; distinguishes tools with the same name on different servers
  • mcp.latency_ms: the single most useful number for debugging slow agents
  • mcp.input.size: large inputs often cause timeouts; correlate with latency
  • mcp.output.size: unexpected output sizes suggest a schema mismatch
  • mcp.error.class: timeout / rate_limit / malformed_output; drives alerting strategy
  • mcp.error.retryable: whether the agent should attempt this tool call again

For cost-bearing tools (ones that hit paid APIs internally), you can add mcp.estimated_cost_usd from your internal pricing table. The tool cost is often larger than the LLM cost for data-heavy workflows.
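
A sketch of that attribution, assuming a hypothetical flat-rate pricing table keyed by tool name (the rates here are invented) and reusing the activeSpans map from earlier:

// Hypothetical per-call rates from an internal pricing table.
const TOOL_COST_USD: Record<string, number> = {
  "web.search": 0.005,         // paid search API
  "code-analyzer.check": 0.02, // metered analysis service
};

const costMiddleware = {
  name: "mcp-cost-attribution",
  transformResponse: async (response: any, params: any) => {
    const cost = TOOL_COST_USD[params.toolCall?.name];
    if (cost !== undefined) {
      // Attach the estimate to the span the tracing middleware opened.
      activeSpans.get(params._spanCallId)?.setAttribute("mcp.estimated_cost_usd", cost);
    }
    return response;
  },
};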

A note on where this data lives: NeuroLink's generate() result exposes a toolCalls array (each entry is {toolCallId, toolName, args}) plus toolExecutions ({name, input, output}) and toolResults. The middleware sees the full call/response inside transformParams / wrapGenerate — that's where you do the rich span work — but the post-generation summary is also available on the result object for ad-hoc analytics.
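
For quick ad-hoc analysis, that result shape is enough to get per-tool call counts without touching OTEL at all. A small helper, taking the field names above as given:

// Tally tool invocations from a finished generate() result.
function toolCallCounts(result: any): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const call of result?.toolCalls ?? []) {
    counts[call.toolName] = (counts[call.toolName] ?? 0) + 1;
  }
  return counts;
}

// e.g. { "github.search_code": 1, "github.read_file": 3, "github.create_issue": 1 }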


Per-Tool Latency Budgets

You can extend the tracing middleware to enforce latency budgets per tool class — failing fast instead of waiting for a timeout:

const TOOL_TIMEOUT_MS: Record<string, number> = {
  "github.read_file": 5_000,
  "github.search_code": 8_000,
  "database.query": 3_000,
  "web.search": 10_000,
};

const timeoutMiddleware = {
  name: "tool-timeout",
  priority: 125,
  transformParams: async (params: any) => {
    if (!params.toolCall) return params;

    const toolName = params.toolCall.name;
    const budget = TOOL_TIMEOUT_MS[toolName];

    if (budget) {
      // Attach deadline for downstream middleware/provider to enforce
      return { ...params, toolCall: { ...params.toolCall, timeoutMs: budget } };
    }
    return params;
  },
};

A practical note on timeoutMs: NeuroLink's tool registration accepts a per-tool timeoutMs (see ToolRegistrationOptions in the SDK) and forwards it to the tool executor for MCP-level enforcement — so registering a tool with { execute, timeoutMs: 3000 } is the cleanest way to set a hard per-tool deadline. The middleware pattern above mutates params.toolCall.timeoutMs, which is only useful if you have a downstream middleware that reads it — pick one approach, not both.
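
For illustration, registration with a hard deadline might look like this. The registerTool method name and the runQuery helper are assumptions for the sketch; check ToolRegistrationOptions in the SDK for the real signature:

// Hypothetical registration call; the method name is an assumption, but the
// { execute, timeoutMs } options follow the ToolRegistrationOptions note above.
ai.registerTool("database.query", {
  description: "Run a read-only SQL query",
  execute: async (input: { sql: string }) => runQuery(input.sql), // runQuery: your own data-access helper
  timeoutMs: 3_000, // hard per-tool deadline enforced by the tool executor
});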


Putting It All Together

import { randomUUID } from "node:crypto";
import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink({
  provider: "anthropic",
  model: "claude-sonnet-4-6",
  apiKey: process.env.ANTHROPIC_API_KEY,
  middleware: [
    timeoutMiddleware,      // Priority 125: set per-tool deadlines
    mcpTracingMiddleware,   // Priority 115: open/close OTEL spans
    mcpErrorMiddleware,     // Priority 115: classify errors (onError hook)
    // Priority 100: NeuroLink's built-in analytics middleware
  ],
});

await ai.addExternalMCPServer("github", {
  command: "npx",
  args: ["-y", "@modelcontextprotocol/server-github"],
  transport: "stdio",
  env: { GITHUB_TOKEN: process.env.GITHUB_TOKEN },
});

// Run with trace propagation
const result = await runAgent(
  "Find all files using processPayment() and create a security issue if any lack input validation",
  "run-" + crypto.randomUUID()
);

Your trace now shows:

  • Root span: total agent execution time
  • LLM span: actual model inference time (usually the smallest component)
  • Tool spans: github.search_code, github.read_file × N, github.create_issue with individual latencies
  • Error spans: any tools that failed, with error class and retryable flag

The model is usually not your bottleneck. The tools are. Now you can prove it.


