Jay
The MCP Evaluation Framework Nobody Talks About (But Should)

Your agent worked fine in staging. It called the right MCP tools, returned clean outputs, passed the test suite. Then it hit production, a user sent a slightly different query, and it picked the wrong tool, passed malformed arguments, and chained three unnecessary calls before returning garbage.

I've watched this happen more times than I'd like. The model isn't the problem. The missing piece is an evaluation system that matches how MCP actually behaves at runtime.

Why MCP Changes the Eval Problem

Before MCP, most agents had hardcoded tools. You could write deterministic tests: "Given this input, the agent should call search_docs with these parameters." That worked.

MCP flips that model. An MCP-connected agent discovers tools at runtime from one or more MCP servers. The available tools can change between requests. The agent decides what to call, in what order, with what arguments, based on the user's prompt and context injected through MCP resources.

Anthropic open-sourced MCP in late 2024. Within a year it had 97 million monthly SDK downloads and 10,000+ published servers. In December 2025, Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation (AAIF), with OpenAI, Google, Microsoft, and AWS backing the move.

This creates three evaluation problems that didn't exist before:

Dynamic tool selection is non-deterministic. The same query can produce different tool call sequences depending on which MCP servers are connected and what they expose at that moment. You can't assert "the agent must call this specific tool." You evaluate whether the choice was reasonable given the available options.

Context injection needs validation. MCP servers inject resources that shape the agent's reasoning. If a resource returns stale data or an unexpected format, the agent reasons incorrectly. Your eval needs to cover whether that injected context was used correctly, not just whether the final output looked reasonable.

Chains need end-to-end tracing. A single request can trigger 5 to 10 MCP tool calls across different servers, each with its own latency, failure mode, and output quality. Evaluating only the final response misses every intermediate failure.

Five Dimensions to Evaluate

1. Tool Selection Accuracy

Did the agent pick the right tool? Measure this against labeled examples where humans identified the optimal tools for a given query. Two sub-metrics:

  • Precision: Out of all tools called, how many were necessary?
  • Recall: Out of all tools that should have been called, how many were?

High precision with low recall means the agent is too conservative and missing useful tools. Low precision with high recall means it's calling unnecessary tools, burning tokens, and increasing latency.
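Scoring this is straightforward set arithmetic. A minimal sketch, assuming you have human-labeled optimal tool sets per query (the tool names here are illustrative):

```python
# Sketch: per-request tool selection precision/recall against a labeled example.

def tool_selection_scores(expected: set[str], called: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one request."""
    if not expected or not called:
        return 0.0, 0.0
    necessary = expected & called
    precision = len(necessary) / len(called)   # of all calls made, how many were needed
    recall = len(necessary) / len(expected)    # of all needed tools, how many were called
    return precision, recall

p, r = tool_selection_scores(
    expected={"search_docs", "get_ticket"},
    called={"search_docs", "get_ticket", "list_users"},  # one unnecessary call
)
# precision ~0.67, recall 1.0: the low-precision/high-recall pattern described above
```

Aggregate these per-request scores over your labeled set to get the headline numbers.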

2. Argument Correctness

Even when the agent picks the right tool, it can pass wrong arguments. Validate that:

  • Arguments match the MCP tool's JSON schema
  • Types are correct (no string where an integer belongs)
  • Required fields are present and populated
  • Semantic accuracy holds: did it pass the correct document ID for this specific task, not a random one?
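The first three checks are deterministic. A minimal stdlib sketch along these lines (a real pipeline would run a full JSON Schema validator against the tool's advertised input schema; the schema here is hypothetical):

```python
def check_arguments(args: dict, schema: dict) -> list[str]:
    """Minimal structural check against a JSON-Schema-like tool schema.
    Covers required fields and basic types only; semantic accuracy needs
    a separate, task-aware check."""
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    problems = []
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")
    for name, spec in schema.get("properties", {}).items():
        if name in args and not isinstance(args[name], type_map[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems

# Hypothetical schema, shaped like what an MCP server advertises for a tool.
tool_schema = {
    "type": "object",
    "properties": {
        "document_id": {"type": "string"},
        "max_results": {"type": "integer"},
    },
    "required": ["document_id"],
}

problems = check_arguments({"max_results": "five"}, tool_schema)
# flags both the missing document_id and the string-where-integer-belongs
```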

3. Task Completion Rate

This is the bottom-line metric. Did the agent actually accomplish what the user asked? I use LLM-as-a-judge evaluators here because they catch cases where every individual tool call succeeded but the agent failed to synthesize the results correctly.

4. Chain Efficiency

MCP agents can make far more tool calls than necessary. Track:

  • Total tool calls per request
  • Redundant calls (same tool, same arguments called twice)
  • Calls whose outputs never appeared in the final response
  • Total chain latency

An agent that calls 8 tools when 2 would do isn't just slow. It's expensive and significantly harder to debug.
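Most of these metrics fall out of the trace directly. A sketch, assuming each call is reduced to a (tool name, canonicalized-arguments) pair and that a human or judge has labeled the minimum calls needed:

```python
from collections import Counter

def chain_efficiency(calls: list[tuple[str, str]], min_needed: int) -> dict:
    """calls: (tool_name, canonical_args_json) per invocation, in chain order."""
    counts = Counter(calls)
    redundant = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "total_calls": len(calls),
        "redundant_calls": redundant,
        "efficiency_ratio": min_needed / len(calls) if calls else 1.0,
    }

stats = chain_efficiency(
    [("search_docs", '{"q":"refund"}'),
     ("search_docs", '{"q":"refund"}'),   # same tool, same arguments: redundant
     ("get_ticket", '{"id":"T-12"}')],
    min_needed=2,
)
# efficiency_ratio ~0.67: below the 0.7 baseline suggested later in this post
```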

5. Context Utilization

MCP servers expose resources that influence the agent's reasoning. Evaluate whether the agent used that context accurately or hallucinated information that contradicted it. The key metrics are groundedness and context relevance.

Here are the thresholds I use as a starting baseline:

  • Tool Selection Precision: >85%
  • Tool Selection Recall: >90%
  • Argument Schema Compliance: >98%
  • Task Completion: >80%
  • Chain Efficiency Ratio (min needed calls / actual calls): >0.7
  • Groundedness: >85%
  • P95 Latency: <5s

You Can't Eval What You Can't See

Tracing is the foundation. The standard approach is OpenTelemetry-based instrumentation, where each MCP tool call becomes a span recording: tool name, server name, arguments, response, latency, and status code. These spans nest under a parent trace representing the full user request.

A well-instrumented MCP trace captures:

  • Root span: User query received, final response returned
  • LLM decision span: Model reasoning, tool selection decision
  • MCP tool call spans: One per invocation, with full arguments and response
  • Context retrieval spans: MCP resource fetches
  • Synthesis span: Final response generation from tool outputs
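To make the tool-call span concrete, here is a stdlib sketch of the fields one of those spans should carry and how a wrapper captures them. The attribute names are illustrative; in a real setup this record would be emitted as an OpenTelemetry span nested under the parent trace:

```python
import time
from dataclasses import dataclass

@dataclass
class MCPToolSpan:
    """Per-call span fields (names are illustrative, not a fixed convention)."""
    tool_name: str
    server_name: str
    arguments: dict
    status: str = "OK"
    response: object = None
    latency_ms: float = 0.0

def traced_call(tool_name: str, server_name: str, arguments: dict, fn) -> MCPToolSpan:
    """Wrap a tool invocation and record its outcome instead of losing it."""
    span = MCPToolSpan(tool_name, server_name, arguments)
    start = time.perf_counter()
    try:
        span.response = fn(**arguments)
    except Exception as exc:  # record the failure on the span, don't swallow the trace
        span.status = f"ERROR: {exc}"
    span.latency_ms = (time.perf_counter() - start) * 1000
    return span

span = traced_call("search_docs", "docs-server", {"q": "refund policy"},
                   lambda q: f"3 results for {q!r}")
```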

TraceAI is an open-source library that extends OpenTelemetry with AI-specific semantic conventions. It supports 20+ frameworks including OpenAI, Anthropic, LangChain, and CrewAI. Setup is under 10 lines:

```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="mcp_agent_prod"
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
```

Once traces are flowing, you can visualize every LLM call, tool invocation, and retrieval step as nested timelines on the Future AGI Observe dashboard, with latency, cost, and evaluation scores side-by-side.

Building the Pipeline

Step 1: Instrument your agent.
Set up auto-instrumentation with TraceAI or a compatible library. Capture the MCP-specific attributes too: which server the tool came from, schema version, and whether the call was a retry. That context is critical when debugging failures at 2am.

Step 2: Define your evaluation criteria.
Pick metrics from the five pillars based on your use case. A support agent should prioritize task completion and groundedness. A code generation agent should prioritize argument correctness and chain efficiency.

Step 3: Set up automated evaluators.
For subjective measurements like task completion and response quality, use LLM-as-a-judge. For objective checks like schema compliance and latency thresholds, use deterministic validators.

The evaluation SDK ships with 60+ pre-built templates covering factual accuracy, groundedness, tone, conciseness, and more:

```python
from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key="your_api_key",
    fi_secret_key="your_secret_key"
)

result = evaluator.evaluate(
    eval_templates="groundedness",
    inputs={
        "context": retrieved_context,
        "output": agent_response
    },
    config={"model": "turing_flash"}
)
```

Step 4: Sample and score production traffic.
Don't eval every request. A 10-20% sampling rate works for most teams. For finance or healthcare, push toward 100%. Future AGI's Eval Tasks let you schedule scoring on live or historical traffic with configurable sampling rates.

Step 5: Alert on regression.
Threshold-based alerts are what turn passive monitoring into an actual feedback loop:

  • Task completion drops below 80%? Alert.
  • Average tool calls per request spikes above 6? Alert.
  • Argument schema compliance dips below 95%? Alert.

Route these to Slack, PagerDuty, or your CI/CD pipeline.
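A minimal sketch of that rule set, with `notify` standing in for whatever webhook you route to (the metric names and thresholds mirror the bullets above):

```python
# Hypothetical alert rules over a window of aggregated eval metrics.
ALERT_RULES = [
    ("task_completion", lambda v: v < 0.80, "Task completion below 80%"),
    ("avg_tool_calls", lambda v: v > 6, "Avg tool calls per request above 6"),
    ("schema_compliance", lambda v: v < 0.95, "Schema compliance below 95%"),
]

def check_alerts(window_metrics: dict, notify=print) -> list[str]:
    """Evaluate each rule against the current window; return the messages that fired."""
    fired = []
    for metric, breached, message in ALERT_RULES:
        value = window_metrics.get(metric)
        if value is not None and breached(value):
            fired.append(message)
            notify(f"[ALERT] {message} (current: {value})")
    return fired

check_alerts({"task_completion": 0.74, "avg_tool_calls": 4.2, "schema_compliance": 0.99})
# fires only the task-completion alert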

Failure Patterns Worth Flagging

A few I keep running into:

Testing only the happy path. Dev and staging MCP servers have limited tool sets. Mirror production MCP server configs in your test environment, or you're not actually testing the surface area that breaks.

Evaluating calls in isolation. Evaluating each tool call without considering the chain misses ordering failures. Evaluate full sequences and flag when order affects correctness.

LLM-as-a-judge without deterministic checks. LLM evaluators are inconsistent on their own. Pair them with schema validation, not instead of it.

No established baseline. If you don't record baseline metrics in the first week, you can't detect degradation. Track deltas. Absolute scores lie.

No cost tracking. Tool calls compound fast in MCP chains. Include token and call costs in every trace. Set spike alerts before the bill does it for you.

Evaluating post-ship only. Running evals only after deployment means you're always reacting. Enable tracing in experiment mode during development and surface failure patterns before they reach production.

Closing the Loop

Evaluation without action is just monitoring. The actual cycle:

  1. Trace every MCP tool call with OpenTelemetry-compatible instrumentation
  2. Evaluate sampled traces across the five metrics automatically
  3. Identify failure patterns through clustering: which tools fail most, which queries produce the worst task completion scores
  4. Iterate on prompts, tool descriptions, and MCP server configurations based on evaluation feedback
  5. Verify improvements by comparing eval scores across deployment versions
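Step 5 can be as simple as a per-metric delta with a regression tolerance. A sketch (the tolerance value is an assumption, tune it to your noise floor):

```python
def regression_report(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Compare mean eval scores across two deployment versions; flag drops
    larger than the tolerance as regressions."""
    return {
        metric: {
            "delta": round(candidate[metric] - baseline[metric], 4),
            "regressed": candidate[metric] < baseline[metric] - tolerance,
        }
        for metric in baseline
        if metric in candidate
    }

report = regression_report(
    baseline={"task_completion": 0.84, "groundedness": 0.88},
    candidate={"task_completion": 0.78, "groundedness": 0.89},
)
# task_completion regressed (delta -0.06); groundedness held
```

Wire this into CI so a candidate version that regresses a headline metric never ships.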

The teams shipping reliable MCP-connected agents aren't the ones with the best models. They're the ones with the best evaluation pipelines. Start there.
