pengspirit

Posted on May 5 • Originally published at github.com

Schema descriptions are load-bearing: why missing parameter descriptions break MCP clients

#mcp #claude #devtools #testing

I shipped mcp-probe — a CLI that points at any MCP server, enumerates every tool, resource, and prompt, calls each with auto-generated arguments, validates against declared schemas, prints a pass/fail scorecard, and exits 0/1 for CI.

The plan for launch week: run it against the official Node MCP servers and post results. The first run made me look like I'd broken half the ecosystem. The second, after I read my own output, told a different story — most failures were bugs in my client, not the servers. The rest collapsed into one finding about schema design.

This post is the corrected version. Three sections: what mcp-probe does, what the scorecards say, and the three bugs I fixed in my own client first.

1. What mcp-probe does

One command. stdio, SSE, or Streamable HTTP transport. No config file required.

npx @incultnitollc/mcp-probe test "npx -y @modelcontextprotocol/server-memory"

Output is a scorecard:

Tools callable:      9/9
Resources readable:  n/a
Prompts callable:    n/a
Schema warnings:     4
ALL CHECKS PASSED

Exit code 0 if everything passes, 1 if anything fails. Drop it in CI:

- run: npx -y @incultnitollc/mcp-probe test "node dist/index.js"

Install globally if you'd rather not npx every time:

npm install -g @incultnitollc/mcp-probe

The mental model is curl for MCP servers. You don't open Claude Desktop, hand-write a config, restart the app, and stare at the tool list to see whether anything broke. You run one command and get a scorecard.

2. What I found across the four official Node servers

Here is the actual scorecard from docs/scorecards/SUMMARY.md, re-run on @incultnitollc/mcp-probe@1.0.1:

Server	Tools	Resources	Prompts	Schema warns	Status
`@modelcontextprotocol/server-memory`	9 / 9	n/a	n/a	4	PASS
`@modelcontextprotocol/server-sequential-thinking`	1 / 1	n/a	n/a	0	PASS
`@modelcontextprotocol/server-everything`	12 / 13	7 / 7	3 / 4	1	partial
`@modelcontextprotocol/server-filesystem`	8 / 14	n/a	n/a	18	partial

Aggregate: 30 of 37 tools callable across four servers, 81%. Two servers fully pass. The other two have a single failure pattern between them.
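The aggregate can be re-derived from the table. A small sketch, using only the tool counts shown above (the type and function names are made up for illustration):

```typescript
// Tool counts copied from the scorecard table above.
interface ToolCount {
  server: string;
  callable: number;
  total: number;
}

const toolCounts: ToolCount[] = [
  { server: "server-memory", callable: 9, total: 9 },
  { server: "server-sequential-thinking", callable: 1, total: 1 },
  { server: "server-everything", callable: 12, total: 13 },
  { server: "server-filesystem", callable: 8, total: 14 },
];

// Sum both columns and round the ratio to a whole percent.
function aggregate(rows: ToolCount[]): { callable: number; total: number; percent: number } {
  const callable = rows.reduce((sum, r) => sum + r.callable, 0);
  const total = rows.reduce((sum, r) => sum + r.total, 0);
  return { callable, total, percent: Math.round((callable / total) * 100) };
}

const summary = aggregate(toolCounts);
console.log(`${summary.callable} of ${summary.total} tools callable, ${summary.percent}%`);
// prints "30 of 37 tools callable, 81%"
```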

A scope note before the finding, because I got this wrong the first time: Anthropic's fetch MCP server is Python-only, installed via uvx mcp-server-fetch. It has never been published to npm. mcp-probe runs against any stdio MCP server regardless of language — only this scorecard is scoped to the official Node servers. Earlier launch copy of mine that called server-fetch "broken on npm" was wrong, and I want to flag it explicitly here because I almost shipped that draft.

Now the real finding. Every remaining failure on the partial-pass servers traces to the same root cause: missing description fields on schema properties.

On server-filesystem, six of the fourteen tools fail because mcp-probe doesn't know which arguments are supposed to be file paths versus directory paths versus arbitrary strings. The path parameter on read_file, read_text_file, read_media_file, edit_file, and write_file has no description in the schema, so my client defaults to the allowed sandbox directory itself. The server correctly returns EISDIR (you tried to read a directory as a file) or EACCES (you tried to write to one). move_file fails the same way — both source and destination resolve to the same directory, and the server correctly refuses the no-op rename. The server is doing its job. The schema is the gap.

On server-everything, one prompt fails because the resourceType argument has no description. The server treats it as an enum and accepts only "Text" or "Blob", but the schema declares neither the allowed values nor a description nor examples, so my client passes the literal string "test" and the server correctly returns Invalid resourceType: test. The schema validator inside mcp-probe even raises a warning on this property before the call fires:

WARN  get-resource-reference — Property "resourceType" missing description

That warning is the diagnostic working as intended — mcp-probe still attempts the call, then surfaces both the warning and the resulting failure side-by-side so you can see the connection.
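A check of this kind is small. The following is a minimal sketch, not mcp-probe's actual implementation: the interfaces are a cut-down slice of JSON Schema, missingDescriptions is a hypothetical helper, and the sample schema (including the resourceId property and its description) is invented for illustration.

```typescript
// Cut-down slice of JSON Schema: only the fields this check reads.
interface PropertySchema {
  type?: string;
  description?: string;
}

interface InputSchema {
  type: "object";
  properties?: Record<string, PropertySchema>;
}

interface SchemaWarning {
  target: string; // tool or prompt name
  property: string;
  message: string;
}

// Hypothetical helper: one warning per property with no usable description.
function missingDescriptions(target: string, schema: InputSchema): SchemaWarning[] {
  const props = schema.properties ?? {};
  const found: SchemaWarning[] = [];
  for (const property of Object.keys(props)) {
    const prop = props[property];
    const text = prop && prop.description ? prop.description.trim() : "";
    if (text === "") {
      found.push({ target, property, message: "missing description" });
    }
  }
  return found;
}

// Invented schema: one undescribed property, one described property.
const warnings = missingDescriptions("get-resource-reference", {
  type: "object",
  properties: {
    resourceType: { type: "string" },
    resourceId: { type: "string", description: "ID of the resource to reference" },
  },
});
console.log(warnings.map((w) => `${w.target}: "${w.property}" ${w.message}`));
```

An empty or whitespace-only description is treated the same as a missing one, since neither gives a client anything to read.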

The substantive insight, and the line I'll repeat at every MCP-related event for the next year: when an MCP server ships parameter properties without descriptions, no automated tool can reliably guess valid arguments. Not mcp-probe. Not your IDE's autocomplete. Not an LLM trying to call the tool from Claude Desktop. Schema descriptions aren't documentation polish. They're the instruction manual the model is reading every time it picks an argument. They're load-bearing.

If you maintain an MCP server and you want a quick win, add "description" to every property in every input schema. The 18 schema warnings on server-filesystem are not 18 separate problems — they're 18 instances of the same one-line fix.
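To make the one-line fix concrete, here is a sketch of a path property before and after, plus a toy argument generator that reads the description. Everything in it is an assumption for illustration: the description wording is not the filesystem server's, and samplePath is a hypothetical stand-in for whatever heuristic a client uses.

```typescript
interface PropertySchema {
  type: string;
  description?: string;
}

// Before: the shape the post describes, a path with no description.
const pathBefore: PropertySchema = { type: "string" };

// After: the one-line fix. The wording is illustrative, not the server's.
const pathAfter: PropertySchema = {
  type: "string",
  description: "Absolute path to an existing file inside an allowed directory, e.g. /tmp/notes.txt",
};

// Hypothetical sample-argument generator for a path-shaped property.
function samplePath(prop: PropertySchema, sandboxDir: string): string {
  const hint = (prop.description ?? "").toLowerCase();
  if (hint.indexOf("file") !== -1) {
    // The description says "file", so point at a file inside the sandbox.
    return sandboxDir + "/mcp-probe-sample.txt";
  }
  // No hint: the only path known to be allowed is the sandbox directory itself.
  return sandboxDir;
}

console.log(samplePath(pathBefore, "/private/tmp")); // /private/tmp
console.log(samplePath(pathAfter, "/private/tmp")); // /private/tmp/mcp-probe-sample.txt
```

Without the description, the generator falls back to the directory, which is the argument that produces EISDIR on a file-reading tool. A real client would also need the sample file to exist before calling a read tool.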

3. The three bugs I fixed in my own client first

Here's the part I want to be honest about. The first time I ran mcp-probe against server-filesystem, I got 2 of 14 tools passing and a scorecard that screamed FAIL. My instinct was to write a launch post saying "the official filesystem server is broken." I almost did.

Then I actually read my own output. Most of those failures were because my client was sending arguments the server had no way to accept. A diagnostic tool is only credible if it can distinguish "your server is broken" from "I sent garbage." Stress-testing forced that distinction, and three commits came out of it before I trusted the scorecard.

Commit 3825170 — show the args we sent on every failure. When a tool or prompt call fails, mcp-probe now prints the exact JSON it sent alongside the server's error response. Before this, a failure looked like MCP error -32603: Invalid resourceType: test with no indication that "test" was something my client had auto-generated. After this, you can read the failure and immediately tell whether the server rejected something reasonable or something nonsense. This is the smallest of the three changes and the most important one for the trust story.
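The idea behind that change can be sketched in a few lines. This is an illustration of the technique, not the code from the commit: CallFailure and formatFailure are hypothetical names, and the output layout is invented.

```typescript
interface CallFailure {
  kind: "tool" | "prompt";
  name: string;
  argsSent: Record<string, unknown>;
  error: string;
}

// Hypothetical formatter: put the request next to the server's reply.
function formatFailure(f: CallFailure): string {
  return [
    `FAIL  ${f.kind} ${f.name}`,
    `  sent:  ${JSON.stringify(f.argsSent)}`,
    `  error: ${f.error}`,
  ].join("\n");
}

const report = formatFailure({
  kind: "prompt",
  name: "get-resource-reference",
  argsSent: { resourceType: "test" },
  error: "MCP error -32603: Invalid resourceType: test",
});
console.log(report);
```

With the sent arguments on the line above the error, a reader can see that "test" came from the client rather than from anything a user would plausibly pass.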

Commit ce4f55e — sandbox-aware paths. server-filesystem enforces an allowed-directory sandbox. mcp-probe now calls list_allowed_directories before generating sample arguments and uses one of those directories as the default for any path-shaped parameter. On macOS, where /tmp is a symlink to /private/tmp, it normalizes via realpath so the path the server receives matches what the sandbox check expects. This single commit moved server-filesystem from 2 of 14 passing to 8 of 14. The remaining 6 are the missing-description cases I already covered — the bugs that aren't mine.
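A rough sketch of the path-selection step, under stated assumptions: the list of allowed directories has already been parsed out of the list_allowed_directories response, defaultSandboxPath is a hypothetical name, and the symlink resolver is injected so the example stays free of I/O (a real client would presumably pass something like Node's fs.realpathSync).

```typescript
// Resolver is injected so the sketch does no file-system access.
type Realpath = (path: string) => string;

// Hypothetical: choose the default value for any path-shaped parameter.
function defaultSandboxPath(allowedDirs: string[], realpath: Realpath): string | undefined {
  const first = allowedDirs[0];
  if (first === undefined) return undefined; // server reported no allowed directories
  // Normalize so the path sent matches what the sandbox check compares against.
  return realpath(first);
}

// Fake resolver that mimics the macOS /tmp -> /private/tmp symlink.
const macosLikeRealpath: Realpath = (p) =>
  p === "/tmp" || p.indexOf("/tmp/") === 0 ? "/private" + p : p;

console.log(defaultSandboxPath(["/tmp"], macosLikeRealpath)); // /private/tmp
```

Returning undefined when no directory is reported leaves the caller to decide whether to skip path-shaped tools or fail loudly.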

Prompt-argument enum extractor. When a prompt argument is described in prose like "one of: Text, Blob" instead of as a JSON Schema enum, mcp-probe now tries to parse the allowed values out of the description string and pick one. Partial — it works on the prompts that have prose-level documentation, and it does nothing for arguments like resourceType on server-everything that have neither schema enum nor prose description. This is why the schema-description finding above isn't theoretical: I built the workaround, and the workaround can't help when there's no text to read.
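A minimal sketch of that kind of prose parsing, assuming descriptions phrased as "one of: A, B" or "one of A or B". The function name and the regular expression are illustrative guesses, not what mcp-probe ships.

```typescript
// Hypothetical heuristic: pull allowed values out of prose such as
// "one of: Text, Blob". Returns [] when there is nothing to parse.
function extractEnumFromProse(description: string | undefined): string[] {
  if (!description) return [];
  const match = /one of:?\s*([^.;]+)/i.exec(description);
  const list = match ? match[1] : undefined;
  if (!list) return [];
  return list
    .split(/,|\bor\b/)
    .map((v) => v.trim().replace(/^["']|["']$/g, "")) // strip surrounding quotes
    .filter((v) => v.length > 0);
}

console.log(extractEnumFromProse("Type of resource, one of: Text, Blob")); // Text and Blob
console.log(extractEnumFromProse(undefined)); // empty: no text to read
```

The second call is the resourceType case: with no description at all, the extractor returns an empty list and the client is back to guessing.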

The loop, in one sentence: I had to make my client honest about what it was sending before I could call any server's failure a server bug.

Try it

npm install -g @incultnitollc/mcp-probe
mcp-probe test "npx -y @modelcontextprotocol/server-memory"

Repo: github.com/incultnitollc/mcp-probe
npm: @incultnitollc/mcp-probe
Raw scorecards from this post: docs/scorecards/
Pre-publish checklist for MCP server maintainers: docs/checklist.md

If you maintain an MCP server and you want a scorecard run against it, open an issue with the test-my-server template and I'll post the results as a comment. If mcp-probe reports something that looks like a server bug and isn't, open an issue against mcp-probe instead — that's the loop that produced commits 3825170 and ce4f55e, and it's the only way the diagnostic gets more trustworthy.

Top comments (7)

Mads Hansen • May 5

Strong point. “Schema descriptions are load-bearing” is probably one of the most practical MCP lessons right now.

The part I like is that the failure is not dramatic. Nothing explodes. The client just guesses badly:

path vs directory path
enum intent vs arbitrary string
search query vs identifier
read operation vs mutating operation

That is exactly the kind of ambiguity agents turn into retries, wrong calls, or quiet confidence in the wrong result.

I’d add one production habit: treat tool schemas like API contracts, not generated documentation. Every parameter should explain:

what kind of value it expects
what constraints apply
what not to pass
whether the value changes data or only narrows a read
one good example when ambiguity is likely

If the model has to infer meaning from a parameter name alone, the tool interface is under-specified. This matters even more for database tools, where a vague schema can become a real access-control or blast-radius problem.

pengspirit • May 5

Shipped — commit 9b6f5e3 on main.

Section 1: your 5-item list is the new top-of-section paragraph, plus a bullet on mutation-vs-read legibility. Section 6: mutation-tool blast radius as an access-control surface, using your harder-version framing. The Acknowledgments section at the bottom credits you with a link back to this comment.

Used your dev.to handle as the placeholder. Drop a GH / personal / domain handle if you'd rather be credited under that and I'll push a one-line edit.

Live: github.com/incultnitollc/mcp-probe...

pengspirit • May 5

The mutation-vs-read axis is the dimension my checklist Section 1 missed, and it's the safety-critical one. When an LLM is choosing between two tools that could both plausibly match a user's intent, "this one writes" vs "this one only narrows a read" is exactly the disambiguator that should be load-bearing. Schema property descriptions aren't quite the right place to put that signal; a sibling annotation (mutating: true, or convention-based read_* / update_* name prefixes) might be cleaner. But the model can't act on a distinction that isn't surfaced anywhere, so something has to carry it.

Treating tool schemas as API contracts rather than generated documentation is the framing I want to push too. The access-control / blast-radius point on database tools is the harder version of the same argument: a vague query parameter isn't just a usability issue, it's a privilege-escalation surface.
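A possible carrier for that signal: recent revisions of the MCP specification describe optional tool annotations such as readOnlyHint and destructiveHint, which are hints rather than guarantees. The sketch below assumes that shape with a locally defined interface (not the SDK's types) and hypothetical tool names, and treats any tool that does not declare itself read-only as mutating.

```typescript
// Local mirror of two annotation fields; check the MCP specification for the
// authoritative definition before relying on these names.
interface ToolAnnotations {
  readOnlyHint?: boolean;
  destructiveHint?: boolean;
}

interface ToolSummary {
  name: string;
  description: string;
  annotations?: ToolAnnotations;
}

// Conservative rule: mutating unless the tool explicitly says it is read-only.
function isMutating(tool: ToolSummary): boolean {
  return !(tool.annotations !== undefined && tool.annotations.readOnlyHint === true);
}

// Hypothetical database tools, named with the read_* / update_* convention.
const tools: ToolSummary[] = [
  { name: "read_rows", description: "Run a SELECT. Never changes data.", annotations: { readOnlyHint: true } },
  { name: "update_rows", description: "Run an UPDATE. Changes data.", annotations: { readOnlyHint: false, destructiveHint: true } },
  { name: "run_query", description: "Run SQL." }, // carries no signal at all
];

const mutating = tools.filter(isMutating).map((t) => t.name);
console.log(mutating); // update_rows and run_query
```

The undescribed run_query tool lands in the mutating bucket, which is the safe default when nothing surfaces the distinction.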

Going to PR your 5-item list (with the mutating/read axis) into github.com/incultnitollc/mcp-probe... Section 1, with attribution. If you have a public handle / GH name you'd want credited, drop it.

Thomas Landgraf • May 5

The thesis is right but I'd push it one step further: descriptions can also be wrong, and that failure mode is harder to catch than missing ones. Hit this when a parameter described as "ISO 8601 date string" turned out to also accept Unix epoch milliseconds at runtime - silent type coercion in the server, the LLM started passing epochs because that's what the upstream tool returned, and the description was technically a lie for weeks before anyone noticed.

mcp-probe already catches schema-vs-runtime gaps (the EISDIR / EACCES path stuff is exactly that). Wonder if there's a way to also flag description-vs-runtime drift - round-trip a generated argument that matches the description, then a deliberately off-spec one, and check whether the server rejects the off-spec form like the description claims it should. Probably can't get to 100% without server cooperation, but a heuristic pass on common shapes (regex-ish hints in description + JSON Schema type) would catch a lot of the silent-coercion cases.

pengspirit • May 7

This is the sharper version of the thesis — thank you for pushing it.

You're right that "wrong" beats "missing" for damage potential, because tooling and reviewers both pattern-match on presence/absence. A description that lies fluently passes every existing check.

A heuristic pass is feasible and I think there are three round-trips worth wiring up:

1. Format-hint extraction: regex over the description for common shape claims (ISO 8601, epoch, UUID, email, URL, base64, hex, enumerated literals in quotes). Combine with the JSON Schema type for a derived expectation.
2. Conformant probe: generate an argument matching the strictest reading (e.g. 2026-05-06T12:00:00Z for ISO 8601), call the tool, expect success.
3. Off-spec probe: generate a deliberately violating argument that still passes the JSON Schema type check (e.g. the epoch value 1746547200000 where the description says ISO 8601, sent as a string if the schema declares type: "string", or as a number if the schema leaves the type open). If the server accepts it without complaint, flag DESCRIPTION_DRIFT_SUSPECTED with the conformant/off-spec pair as evidence.

Won't catch deep semantic drift without server cooperation, agreed. But the silent-coercion family — epoch-vs-ISO, naive-vs-aware datetimes, base64-vs-hex, locale-mixed numbers — is exactly the shape that regex-able description hints already encode.
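The three steps above can be sketched as follows. This is an assumption-heavy illustration, not the v1.1 design: the hint library has a single seed entry, the accepts callback stands in for a real tool call, and every verdict name except DESCRIPTION_DRIFT_SUSPECTED is invented here.

```typescript
type Verdict = "OK" | "DESCRIPTION_DRIFT_SUSPECTED" | "CONFORMANT_REJECTED" | "NO_HINT";

interface FormatHint {
  name: string;
  pattern: RegExp; // matched against the description text
  conformant: string; // value that follows the claimed format
  offSpec: string; // value that violates it but is still a string
}

// Seed library with one entry; the sample values come from the thread above.
const hints: FormatHint[] = [
  {
    name: "iso8601",
    pattern: /iso[ -]?8601/i,
    conformant: "2026-05-06T12:00:00Z",
    offSpec: "1746547200000",
  },
];

// `accepts` stands in for a real tool call: true when the server took the value.
function probeDrift(description: string, accepts: (value: string) => boolean): Verdict {
  let hint: FormatHint | undefined;
  for (const h of hints) {
    if (h.pattern.test(description)) {
      hint = h;
      break;
    }
  }
  if (!hint) return "NO_HINT"; // step 1 found no format claim
  if (!accepts(hint.conformant)) return "CONFORMANT_REJECTED"; // step 2 failed
  return accepts(hint.offSpec) ? "DESCRIPTION_DRIFT_SUSPECTED" : "OK"; // step 3
}

// Two fake servers: one enforces the format, one silently coerces anything.
const strictServer = (v: string) => /^\d{4}-\d{2}-\d{2}T/.test(v);
const lenientServer = (_v: string) => true;

console.log(probeDrift("ISO 8601 date string", strictServer)); // OK
console.log(probeDrift("ISO 8601 date string", lenientServer)); // DESCRIPTION_DRIFT_SUSPECTED
```

Running the conformant probe first keeps the off-spec result interpretable: a server that rejects both values is failing for some other reason, not drifting from its description.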

Filed it as a v1.1 issue with a seed heuristic library: github.com/incultnitollc/mcp-probe... — if you've got the canonical pathological cases from your incident, drop them on the issue and I'll seed the library with them.

arun rajkumar • May 7

The "load-bearing" framing is the right shape — the same observation applies one level up at the tool level. Most MCP catalogues we've audited had perfectly described parameters but no description of when not to call this tool, which is the bit that actually decides whether an agent reaches for the right surface. The half-hour we spent adding "anti-purpose" descriptions to about a dozen of our internal tools cut the wrong-tool-selected rate roughly in half. Arguably the parameter case in this post is just the most visible instance of a broader rule: every field of every schema an agent reads is doing structural work whether you specified it or not.

pengspirit • May 7

Yes — anti-purpose is the missing axis. Parameter descriptions tell the agent 'how' to call a tool; tool descriptions (and especially "when not to") tell it 'whether' to. Both are schema, both are load-bearing, both are usually under-specified.

The roughly 50% drop in wrong-tool selection tracks with what I see auditing public MCP servers: most tool descriptions read like marketing copy, not selection criteria. "Searches the web" doesn't help an agent choose between three search tools. "Use for current events; do not use for code lookup or internal docs" does.
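A crude lint for this could look like the sketch below. It is a keyword heuristic under obvious assumptions: hasAntiPurpose and the phrase list are hypothetical, the tool names are invented, and a description can pass the check while still being unhelpful.

```typescript
interface ToolDescription {
  name: string;
  description: string;
}

// Hypothetical heuristic: does the description say when NOT to call the tool?
function hasAntiPurpose(tool: ToolDescription): boolean {
  return /\b(do not use|don't use|not for|avoid)\b/i.test(tool.description);
}

// The two descriptions are the examples quoted above; the names are made up.
const vague: ToolDescription = { name: "web_search", description: "Searches the web" };
const scoped: ToolDescription = {
  name: "news_search",
  description: "Use for current events; do not use for code lookup or internal docs",
};

console.log(hasAntiPurpose(vague)); // false
console.log(hasAntiPurpose(scoped)); // true
```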

Going to thread this into the next post — tool-level disambiguation deserves its own treatment.