MCP servers are easy to wire up. You export a tool, define a schema, connect a client, done. Then Claude generates the wrong arguments, the handler silently misroutes an edge case, and the judge returns 0.9 for an empty string response.
The protocol layer rarely gets tested properly. Here's the three-layer test framework I use. Each layer catches a different class of failure.
## 1. Schema Contract — Did someone rename a field?
What breaks: A developer renames groundTruth to ground_truth in the tool schema. Every MCP caller breaks silently: no compile error, no runtime warning, just wrong behaviour.
How to test: Import the exported schema object directly and assert on its shape. No server boot, no network call, no mocks. Pure property assertions on the contract.
```typescript
expect(schema.type).toBe('object');
expect(schema.properties).toHaveProperty('llm_response');
expect(schema.properties).toHaveProperty('groundTruth');
expect(schema.properties.criteria.type).toBe('array');
expect(schema.required).toContain('llm_response');
expect(schema.required).toContain('groundTruth');
expect(schema.required).toContain('criteria');
```
If a field is renamed or retyped, this fails before any integration test wastes time booting. It's the cheapest regression guard in the stack.
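For context, the exported contract those assertions run against might look like this. Only the three field names come from the tests above; everything else is an illustrative sketch:

```typescript
// Sketch of the exported tool schema, in the JSON Schema shape MCP tool
// definitions use. Descriptions and types beyond the three field names
// are assumptions, not taken from the original server.
export const analyzeResponseQualitySchema = {
  type: "object",
  properties: {
    llm_response: { type: "string", description: "The answer being judged" },
    groundTruth: { type: "string", description: "Reference answer to compare against" },
    criteria: {
      type: "array",
      items: { type: "string" },
      description: "Dimensions to score, e.g. accuracy, relevance",
    },
  },
  required: ["llm_response", "groundTruth", "criteria"],
};
```

Renaming `groundTruth` here without updating the contract test is exactly the silent break those assertions exist to catch.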
## 2. Tool Behaviour — Does the protocol layer handle edge cases?
What breaks: The MCP server handler has routing or edge-case bugs independent of the LLM. An empty string input shouldn't score 0.9. A judge crash shouldn't return garbage — it should surface as a structured error.
How to test: Wire MCP Client → Server through InMemoryTransport (no network). Replace the real Claude judge with a vi.fn() mock. This isolates the protocol layer completely from LLM variability.
```
Test 1 — empty string input
  llm_response: ""
  Expected: score < 0.2
  Got:      overall: 0.9   ✗ FAIL
```

Root cause: the assertion threshold was `> 1` — impossible on a 0–1 scale.
Fix: `expect(result.overall).toBeLessThan(0.2)`

```
Test 2 — judge throws "Claude API is down"
  Expected: { error: "Claude API is down", isError: true }
  Got:      { error: "Claude API is down", isError: true }   ✓ PASS
```
The failing test here isn't a protocol bug. It's a broken assertion — the threshold was set to > 1, which no valid score can ever satisfy. This test suite found it.
## 3. LLM in Loop — Does Claude actually generate the right arguments?
What breaks: The schema tests pass, the transport tests pass, but Claude generates wrong tool arguments when it sees the real tool list. Or the judge is miscalibrated and scores a clearly wrong answer at 0.9. The only way to catch this is to run the full round-trip with real inputs.
How to test: Send Claude a message via the Anthropic API. Claude reads the tool list from the MCP server, decides to call analyze_response_quality, and generates its own input arguments. The test doesn't control what Claude sends. Then assert on the scores.
```
Known-good: "Photosynthesis is the process by which plants use sunlight,
             water, and CO2 to produce glucose and oxygen."
  → overall: 0.95 (accuracy: 1.0, relevance: 0.9)   assert ≥ 0.8  ✓

Known-bad:  "Photosynthesis is when plants absorb soil nutrients to grow."
  → overall: 0.05 (accuracy: 0.0, relevance: 0.1)   assert ≤ 0.4  ✓
```
The full path: user message → Claude → tool_use: analyze_response_quality → MCP Client → InMemoryTransport → MCP Server → callClaudeJudge → score back to Claude.
This catches two things at once: whether Claude generates valid arguments, and whether the judge is calibrated correctly.
I ran this on a real MCP server. Here's what came back.
9 tests total. 1 failed. 89% pass rate.
| Test | Result |
|---|---|
| Schema Contract | 5/5 |
| Tool Behaviour | 1/2 |
| LLM in Loop | 2/2 |
The single failure is in Tool Behaviour, and it's not a server bug. The assertion for the empty string case expects overall > 1 — a threshold that's impossible to satisfy on a 0–1 scale. The mock judge returned 0.9, which is actually wrong behavior (empty input should score near zero), but the test would have failed regardless of what score came back. Two bugs in one: an over-optimistic mock and a broken assertion.
The fix is two lines:
```typescript
// wrong
expect(result.overall).toBeGreaterThan(1);

// right — empty input should score low
expect(result.overall).toBeLessThan(0.2);
```
## Why three separate layers?
Each test catches a different class of bug. Run all three before shipping.
| Layer | What it tests | Needs API? |
|---|---|---|
| Schema Contract | Tool definition shape | No |
| Tool Behaviour | MCP protocol + handler | No (mock judge) |
| LLM in Loop | Claude's argument generation + judge calibration | Yes |
Schema fails fast and cheap. Tool Behaviour catches protocol bugs without burning API credits. LLM in Loop is the only one that validates actual end-to-end behaviour — Claude reading the tool list, generating arguments, and getting a meaningful score back.
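In practice that ordering maps onto two commands (the test file names here are illustrative): the first two layers run on every commit with no key, and the third gates a release.

```shell
# Layers 1–2: schema + transport tests — no API key, fast, free
npx vitest run tests/schema.test.ts tests/tool-behavior.test.ts

# Layer 3: real round-trip — needs ANTHROPIC_API_KEY, run before shipping
npx vitest run tests/llm-loop.test.ts
```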