<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rajat Sharma</title>
    <description>The latest articles on DEV Community by Rajat Sharma (@rajat_sharma_370a62b67b15).</description>
    <link>https://dev.to/rajat_sharma_370a62b67b15</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3465732%2Fd3c69abe-a4c7-41f1-b1e0-d8b444e8509f.jpg</url>
      <title>DEV Community: Rajat Sharma</title>
      <link>https://dev.to/rajat_sharma_370a62b67b15</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rajat_sharma_370a62b67b15"/>
    <language>en</language>
    <item>
      <title>The 3 test framework I use for MCP servers</title>
      <dc:creator>Rajat Sharma</dc:creator>
      <pubDate>Thu, 05 Mar 2026 09:06:15 +0000</pubDate>
      <link>https://dev.to/rajat_sharma_370a62b67b15/the-3-test-framework-i-use-for-mcp-servers-1c98</link>
      <guid>https://dev.to/rajat_sharma_370a62b67b15/the-3-test-framework-i-use-for-mcp-servers-1c98</guid>
      <description>&lt;p&gt;MCP servers are easy to wire up. You export a tool, define a schema, connect a client, done. Then Claude generates the wrong arguments, the handler silently misroutes an edge case, and the judge returns 0.9 for an empty string response.&lt;/p&gt;

&lt;p&gt;The protocol layer rarely gets tested properly. Here's the 3-test framework I use. Each one catches a different class of failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Schema Contract — Did someone rename a field?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; A developer changes &lt;code&gt;groundTruth&lt;/code&gt; to &lt;code&gt;ground_truth&lt;/code&gt; in the tool schema. Every MCP caller breaks silently. No type error at runtime, no warning, just wrong behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Import the exported schema object directly and assert on its shape. No server boot, no network call, no mocks. Pure property assertions on the contract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llm_response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;groundTruth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llm_response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;groundTruth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;criteria&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a field is renamed or retyped, this fails before any integration test wastes time booting. It's the cheapest regression guard in the stack.&lt;/p&gt;
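&lt;p&gt;For reference, here's a sketch of a tool schema that would satisfy those assertions. The descriptions and any extra fields are up to your server; only the shape is what the contract test pins down:&lt;/p&gt;

```typescript
// Hypothetical sketch of the exported tool input schema that the
// assertions above run against. Only the shape matters here.
const inputSchema = {
  type: "object",
  properties: {
    llm_response: { type: "string" },
    groundTruth: { type: "string" },
    criteria: { type: "array", items: { type: "string" } },
  },
  required: ["llm_response", "groundTruth", "criteria"],
};

// The contract test is pure property assertions on this object:
if (inputSchema.type !== "object") throw new Error("schema.type changed");
for (const field of inputSchema.required) {
  if (!(field in inputSchema.properties)) {
    throw new Error(`required field "${field}" missing from properties`);
  }
}
```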

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdy9nnwughi0gf2z63lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdy9nnwughi0gf2z63lq.png" alt="Schema Contract Test Results" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Tool Behaviour — Does the protocol layer handle edge cases?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; The MCP server handler has routing or edge-case bugs independent of the LLM. An empty string input shouldn't score 0.9. A judge crash shouldn't return garbage — it should surface as a structured error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Wire MCP Client → Server through &lt;code&gt;InMemoryTransport&lt;/code&gt; (no network). Replace the real Claude judge with a &lt;code&gt;vi.fn()&lt;/code&gt; mock. This isolates the protocol layer completely from LLM variability.&lt;br&gt;
&lt;/p&gt;
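&lt;p&gt;Stripped of transport plumbing, the handler logic under test reduces to two guards. This is a hypothetical sketch, not the server's actual code; the score shape and judge signature are assumptions for illustration:&lt;/p&gt;

```typescript
// Hypothetical handler sketch: the two edge cases the suite exercises.
// The score shape and judge signature are assumptions for illustration.
type JudgeResult = { overall: number };

function handleAnalyze(
  llmResponse: string,
  judge: (response: string) => JudgeResult,
) {
  // Guard 1: an empty response scores near zero without reaching the judge.
  if (llmResponse.trim() === "") {
    return { overall: 0 };
  }
  // Guard 2: a judge crash surfaces as a structured error, not a score.
  try {
    return judge(llmResponse);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { error: message, isError: true };
  }
}
```

&lt;p&gt;In the real suite the &lt;code&gt;judge&lt;/code&gt; argument is the &lt;code&gt;vi.fn()&lt;/code&gt; mock, so both guards run without an API call.&lt;/p&gt;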

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test 1 — empty string input
llm_response: ""
Expected: score &amp;lt; 0.2
Got:      overall: 0.9  ✗ FAIL

Root cause: assertion threshold was &amp;gt; 1 — impossible on a 0–1 scale.
Fix: expect(result.overall).toBeLessThan(0.2)

Test 2 — judge throws "Claude API is down"
Expected: { error: "Claude API is down", isError: true }
Got:      { error: "Claude API is down", isError: true }  ✓ PASS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failing test here isn't a protocol bug. It's a broken assertion — the threshold was set to &lt;code&gt;&amp;gt; 1&lt;/code&gt;, which no valid score can ever satisfy. This test suite found it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fippz5w402zluxczjpouf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fippz5w402zluxczjpouf.png" alt="Tool Behavior Test Results" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. LLM in Loop — Does Claude actually generate the right arguments?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; The schema tests pass, the transport tests pass, but Claude generates wrong tool arguments when it sees the real tool list. Or the judge is miscalibrated and scores a clearly wrong answer at 0.9. The only way to catch this is to run the full round-trip with real inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Send Claude a message via the Anthropic API. Claude reads the tool list from the MCP server, decides to call &lt;code&gt;analyze_response_quality&lt;/code&gt;, and generates its own input arguments. The test doesn't control what Claude sends. Then assert on the scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Known-good: "Photosynthesis is the process by which plants use sunlight,
             water, and CO2 to produce glucose and oxygen."
→ overall: 0.95  (accuracy: 1.0, relevance: 0.9)  assert ≥ 0.8  ✓

Known-bad: "Photosynthesis is when plants absorb soil nutrients to grow."
→ overall: 0.05  (accuracy: 0.0, relevance: 0.1)  assert ≤ 0.4  ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full path: user message → Claude → &lt;code&gt;tool_use: analyze_response_quality&lt;/code&gt; → MCP Client → InMemoryTransport → MCP Server → &lt;code&gt;callClaudeJudge&lt;/code&gt; → score back to Claude.&lt;/p&gt;

&lt;p&gt;This catches two things at once: whether Claude generates valid arguments, and whether the judge is calibrated correctly.&lt;/p&gt;
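&lt;p&gt;The closing assertions reduce to a calibration band per fixture. A sketch, using the thresholds from the transcript above (they're choices for this suite, not universal constants):&lt;/p&gt;

```typescript
// Sketch of the calibration assertions: a known-good answer must clear
// a floor, a known-bad answer must stay under a ceiling. The 0.8 / 0.4
// thresholds are choices for this suite, not universal constants.
type Fixture = { kind: "good" | "bad"; overall: number };

function isCalibrated(fixtures: Fixture[]): boolean {
  return fixtures.every((f) =>
    f.kind === "good" ? f.overall >= 0.8 : 0.4 >= f.overall,
  );
}
```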

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F614anyrkwiutpp2md810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F614anyrkwiutpp2md810.png" alt="LLM in Loop Test Results" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  I ran this on a real MCP server. Here's what came back.
&lt;/h2&gt;

&lt;p&gt;9 tests total. 1 failed. 89% pass rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema Contract&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Behaviour&lt;/td&gt;
&lt;td&gt;1/2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM in Loop&lt;/td&gt;
&lt;td&gt;2/2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The single failure is in Tool Behaviour, and it's not a server bug. The assertion for the empty string case expects &lt;code&gt;overall &amp;gt; 1&lt;/code&gt; — a threshold that's impossible to satisfy on a 0–1 scale. The mock judge returned 0.9, which is actually wrong behavior (empty input should score near zero), but the test would have failed regardless of what score came back. Two bugs in one: an over-optimistic mock and a broken assertion.&lt;/p&gt;

&lt;p&gt;The fix is two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// wrong&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;overall&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeGreaterThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// right — empty input should score low&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;overall&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeLessThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why three separate layers?
&lt;/h2&gt;

&lt;p&gt;Each test catches a different class of bug. Run all three before shipping.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it tests&lt;/th&gt;
&lt;th&gt;Needs API?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema Contract&lt;/td&gt;
&lt;td&gt;Tool definition shape&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Behaviour&lt;/td&gt;
&lt;td&gt;MCP protocol + handler&lt;/td&gt;
&lt;td&gt;No (mock judge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM in Loop&lt;/td&gt;
&lt;td&gt;Claude's argument generation + judge calibration&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Schema fails fast and cheap. Tool Behaviour catches protocol bugs without burning API credits. LLM in Loop is the only one that validates actual end-to-end behavior — Claude reading the tool list, generating arguments, and getting a meaningful score back.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mcp</category>
      <category>testing</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Your RAG pipeline is probably lying to you (here's how to test it)</title>
      <dc:creator>Rajat Sharma</dc:creator>
      <pubDate>Tue, 03 Mar 2026 11:43:26 +0000</pubDate>
      <link>https://dev.to/rajat_sharma_370a62b67b15/your-rag-pipeline-is-probably-lying-to-you-heres-how-to-test-it-51nf</link>
      <guid>https://dev.to/rajat_sharma_370a62b67b15/your-rag-pipeline-is-probably-lying-to-you-heres-how-to-test-it-51nf</guid>
      <description>&lt;p&gt;I've seen a lot of RAG setups that look great in demos. Clean answers, fast responses, confident tone. Then something breaks silently and nobody notices until a user gets a wrong answer served with full confidence.&lt;/p&gt;

&lt;p&gt;The problem isn't the LLM. It's that the RAG layer rarely gets tested properly.&lt;/p&gt;

&lt;p&gt;Here's the 5-test framework I use. Each one catches a different class of failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Faithfulness — Is the answer grounded in your docs?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; The model answers correctly... but pulls from its training memory, not your knowledge base. You can't audit it. You can't trust it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Use an LLM-as-judge to map every claim in the answer back to a retrieved chunk. Score = &lt;code&gt;supported_claims / total_claims&lt;/code&gt;. Set a threshold and fail anything below it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Answer: "Minimum password length is 16 characters"
Chunk:  "Passwords must be a minimum of 16 characters..."
→ Claim supported ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the answer says "12 characters (industry standard)" and your doc says 16, that's hallucination. Even if 12 is technically reasonable.&lt;/p&gt;
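&lt;p&gt;Once the judge has labelled each claim, the score itself is a one-liner. A sketch (the hard part, claim extraction and the support check, lives in the judge prompt, not here):&lt;/p&gt;

```typescript
// Sketch of the faithfulness score: the LLM judge labels each claim in
// the answer as supported-by-a-chunk or not; the score is the fraction.
function faithfulness(claimSupported: boolean[]): number {
  if (claimSupported.length === 0) return 0;
  const supported = claimSupported.filter(Boolean).length;
  return supported / claimSupported.length;
}

// Gate on a threshold of your choosing, e.g. 0.9:
function passesFaithfulness(claimSupported: boolean[], threshold: number): boolean {
  return faithfulness(claimSupported) >= threshold;
}
```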

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnt15mr1cnnv923w9cnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnt15mr1cnnv923w9cnz.png" alt="Test Result Faithfulness" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Context Precision — Are you even retrieving the right chunks?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; Garbage in, garbage out. Your retriever pulls irrelevant docs and the LLM does its best with bad context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; For each query, score the relevance of every retrieved chunk (0 to 1) using an LLM judge. If fewer than 2 of the top 3 chunks score above your threshold, your embedding/retrieval setup has alignment issues.&lt;/p&gt;

&lt;p&gt;This one catches problems that faithfulness testing misses. It checks &lt;em&gt;before&lt;/em&gt; the answer is generated.&lt;/p&gt;
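&lt;p&gt;The pass rule is a few lines once the judge has scored each chunk. A sketch, with the "2 of 3" quorum as a parameter:&lt;/p&gt;

```typescript
// Sketch of the context-precision gate: count retrieved chunks whose
// judge-assigned relevance clears the threshold, and require a quorum.
function precisionPasses(
  chunkScores: number[],
  threshold: number,
  minRelevant: number,
): boolean {
  const relevant = chunkScores.filter((s) => s >= threshold).length;
  return relevant >= minRelevant;
}
```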

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh181gd3t9svi6cmsutq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh181gd3t9svi6cmsutq2.png" alt="Test Result Precision" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Negative Testing — Does it know what it &lt;em&gt;doesn't&lt;/em&gt; know?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; Someone asks about something not in your KB. The model fills the gap with training data and answers confidently. In compliance, legal, or medical contexts, this is genuinely dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Write a list of questions that are deliberately &lt;em&gt;outside&lt;/em&gt; your knowledge base. For each one, check whether the response contains a refusal phrase like "I don't have information about...". If it doesn't, you've caught a live hallucination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "How many vacation days do employees get?" (not in KB)

✓ Pass: "I don't have that information in the knowledge base."
✗ Fail: "Employees typically receive 15 days per year."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple string matching. No LLM needed. Fast and deterministic.&lt;/p&gt;
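&lt;p&gt;A sketch of the check. The refusal phrases here are examples; match whatever wording your system prompt actually instructs the model to produce:&lt;/p&gt;

```typescript
// Sketch of the negative test: deterministic substring matching against
// known refusal phrases. The phrases are examples, not a standard; use
// the wording your system prompt tells the model to say for KB misses.
const REFUSAL_PHRASES = [
  "i don't have information",
  "i don't have that information",
  "not in the knowledge base",
];

function isRefusal(answer: string): boolean {
  const lower = answer.toLowerCase();
  return REFUSAL_PHRASES.some((phrase) => lower.includes(phrase));
}
```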

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4mwqifj8xgu30ycwkd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4mwqifj8xgu30ycwkd9.png" alt="Test Result Negative Case" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Retrieval Unit Test — Did your re-index break anything?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; You swap your embedding model, change chunk size, or re-index. Now a query that used to find the right doc doesn't anymore. No errors thrown. Pipeline looks fine. It's just wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Maintain a ground-truth lookup: &lt;code&gt;query -&amp;gt; expected doc ID&lt;/code&gt;. After every infrastructure change, run this. Check that the expected doc ID appears somewhere in your top-K results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "What is the minimum password length?"
Expected: doc-003-security in top-3
Got: [doc-003-security (0.94), doc-001 (0.41), doc-002 (0.38)] ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No LLM needed. Pure regression testing for your retrieval layer.&lt;/p&gt;
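&lt;p&gt;The whole check is a membership test on the top-K IDs. A sketch, with an illustrative ground-truth map (the queries and doc IDs are examples, not a real index):&lt;/p&gt;

```typescript
// Sketch of the retrieval regression check: for each ground-truth pair,
// the expected doc ID must appear in the top-K result IDs. The queries
// and doc IDs here are illustrative.
const groundTruth: { [query: string]: string } = {
  "What is the minimum password length?": "doc-003-security",
};

function retrievalPasses(query: string, topKIds: string[]): boolean {
  const expected = groundTruth[query];
  return expected !== undefined ? topKIds.includes(expected) : false;
}
```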

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1836vgzmpihbixho9y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1836vgzmpihbixho9y6.png" alt="Test Result Regression" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Stale Data — Did your update actually propagate?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; You update a policy doc and re-ingest it. But unless you delete or upsert by ID, many vector stores don't replace old embeddings. They &lt;em&gt;add&lt;/em&gt; the new one alongside the old one. Now both exist. Queries return either version non-deterministically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to test:&lt;/strong&gt; Two phases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1:&lt;/strong&gt; Ingest v1, query, confirm old value ("90 days") is returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2:&lt;/strong&gt; Clear the collection, ingest v2, query, confirm new value ("60 days") is returned AND old value is absent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you skip the clear step in phase 2, you'll reproduce the bug right in your test suite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v2Answer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;60&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;b&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;stale&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;assertion&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;b`&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjpse4pa0m8n77r5zj4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjpse4pa0m8n77r5zj4n.png" alt="Test Result Stale" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  I ran this on a real KB. Here's what came back.
&lt;/h2&gt;

&lt;p&gt;16 tests total. 3 failed. 81% pass rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Negative&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval Unit&lt;/td&gt;
&lt;td&gt;4/4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stale Data&lt;/td&gt;
&lt;td&gt;1/1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's what the failures actually look like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context precision failed all 3 tests with the same pattern.&lt;/strong&gt; Every query retrieved exactly 1 relevant chunk out of 3. The right document always ranked first, but the similarity scores were too close together. For a password policy query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc-003-security      0.374  relevant
doc-002-remote-work   0.209  irrelevant
doc-005-reimbursement 0.191  irrelevant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 0.165 gap between the right doc and the wrong ones isn't confidence, it's noise. The retriever is finding the right doc by a slim margin. The answers look fine because the LLM is working around weak retrieval. That won't hold as the KB grows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Most RAG failures aren't spectacular crashes. They're silent. Confident wrong answers. Stale policy info. Hallucinated numbers. A retriever that quietly regressed after a model swap.&lt;/p&gt;

&lt;p&gt;These 5 tests cover the whole pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval quality&lt;/td&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-scope handling&lt;/td&gt;
&lt;td&gt;Negative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval regression&lt;/td&gt;
&lt;td&gt;Unit Test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update propagation&lt;/td&gt;
&lt;td&gt;Stale Data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
    </item>
  </channel>
</rss>
