If you are building MCP servers, chances are your system does not fail the way you expect.
It starts up.
Tools register successfully.
The LLM connects.
Responses come back.
And yet, something feels off.
Certain tools are barely used.
Arguments look almost correct.
The agent retries, rephrases, or quietly works around missing behaviour.
Nothing crashes. Nothing alerts. Nothing obviously breaks.
That is exactly the problem.
MCP failures are logical, not technical
In most backend systems, failure is explicit.
You get an exception.
A timeout.
A non-200 response.
An alert at 3 a.m.
MCP systems fail differently.
A tool is callable but unusable.
A schema is technically valid but semantically wrong.
A required precondition is never satisfied.
A downstream tool never receives the data it assumes exists.
The system still produces output.
Those are the failures that cause the most damage, because they look like success.
Tool contracts exist, but they are not enforced
MCP tools are defined by contracts: schemas, descriptions, and behavioural assumptions.
In practice, those contracts drift constantly.
Fields that are required in reality are marked optional.
Descriptions encode behaviour that schemas never enforce.
Breaking changes slip in without versioning.
Assumptions live in comments instead of checks.
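Here is a minimal sketch of that drift, using a hypothetical create_ticket tool and plain JSON Schema validation. The tool, its fields, and the handler are illustrative, not taken from any real server:

```python
from jsonschema import validate  # pip install jsonschema

# Hypothetical tool: "project_id" is required in practice,
# but the published schema marks it optional.
CREATE_TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "project_id": {"type": "string"},
    },
    "required": ["title"],  # the contract only promises "title"
}

def create_ticket(arguments: dict) -> dict:
    # The real behavioural contract lives here, not in the schema.
    if "project_id" not in arguments:
        raise ValueError("project_id is required")
    return {"ticket_id": f"{arguments['project_id']}-101"}

args = {"title": "Fix login flow"}
validate(instance=args, schema=CREATE_TICKET_SCHEMA)  # passes: the schema is satisfied
create_ticket(args)  # raises: the real requirement was never written down
```

Nothing here is broken in the traditional sense. It only fails once someone relies on the schema.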
Most teams only discover these issues when an LLM tries to use the tool.
That is not validation. That is delegation.
An LLM is a probabilistic consumer, not a contract checker.
Hidden dependency chains shape execution
Very few MCP servers are flat.
One tool prepares input for another.
Another assumes a previous tool has already run.
A third only works if both succeed.
None of this is explicit.
There is no dependency graph.
No execution guarantees.
No visibility into which tools are structurally required for others to function.
The result:
- Order-sensitive behaviour
- Partial execution paths that look valid
- Agents hallucinating glue logic to compensate
Your MCP server already has a dependency graph.
You just cannot see it.
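Making it visible does not require much. A minimal sketch, with hypothetical tool names, that turns the silent assumptions into something checkable:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical tools and the prerequisites they silently assume.
# None of this is declared anywhere in the server itself.
TOOL_DEPENDENCIES = {
    "search_customers": set(),
    "load_customer":    {"search_customers"},   # assumes a search result exists
    "update_customer":  {"load_customer"},      # assumes the record is loaded
    "send_summary":     {"load_customer", "update_customer"},
}

def check_dependency_graph(deps: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order, or raise if the graph is broken."""
    # Catch references to tools that do not exist.
    known = set(deps)
    for tool, required in deps.items():
        missing = required - known
        if missing:
            raise ValueError(f"{tool} depends on unknown tools: {missing}")
    # Catch cycles and produce an order an agent could actually follow.
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        raise ValueError(f"cyclic dependency: {err.args[1]}") from err

print(check_dependency_graph(TOOL_DEPENDENCIES))
# ['search_customers', 'load_customer', 'update_customer', 'send_summary']
```

Even a checked-in dictionary like this is more than most servers have today. It turns an implicit assumption into something a reviewer or a CI job can reject.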
Prompt-based testing does not surface these failures
The usual response is "we'll catch it in testing".
In practice, that means:
- Trying a few prompts
- Watching the agent’s behaviour
- Tweaking descriptions
- Retrying when something looks off
This is not testing. It is sampling.
Prompt-based testing is non-deterministic.
LLMs smooth over structural problems instead of exposing them.
Retries and self-corrections hide the real failure modes.
If your test strategy is “try a few prompts and see what happens”, you are observing behaviour, not validating the system.
By the time a failure is visible, it has already passed through an LLM filter.
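A deterministic alternative is to exercise the contract directly, with no model in the loop. This sketch reuses the hypothetical create_ticket tool from earlier, repeated so the test stands alone:

```python
from jsonschema import validate  # pip install jsonschema

# Same hypothetical tool as above.
SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "project_id": {"type": "string"}},
    "required": ["title"],
}

def create_ticket(arguments: dict) -> dict:
    if "project_id" not in arguments:
        raise ValueError("project_id is required")
    return {"ticket_id": f"{arguments['project_id']}-101"}

def test_minimal_valid_payload_is_actually_usable():
    # The smallest payload the schema claims is valid...
    minimal = {"title": "Fix login flow"}
    validate(instance=minimal, schema=SCHEMA)
    # ...must be accepted by the handler. No prompts, no retries, no LLM
    # smoothing over the gap: the contract either holds or it does not.
    create_ticket(minimal)

test_minimal_valid_payload_is_actually_usable()  # raises: the schema over-promises
```

The failure is the same one the agent was quietly working around. The difference is that it now shows up as a failing test instead of odd behaviour.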
Why this is specific to MCP-style systems
This is not a criticism of MCP.
MCP exposes capabilities, not workflows.
Execution is mediated by an LLM, not a deterministic controller.
Control is probabilistic.
Failures are negotiated, not thrown.
MCP did not introduce these problems.
It removed the illusion that software is always in control.
Once you accept that, traditional debugging and testing approaches start to look insufficient.
What this means if you are building an MCP server today
A few practical conclusions fall out of this:
- Treat tool schemas as real contracts, not documentation (see the sketch below)
- Assume prompt-based testing will miss structural failures
- Make tool dependencies explicit, even if only conceptually
- Aim to surface failures before an LLM is involved, not after
If you only discover problems through agent behaviour, you are debugging too late.
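That first point can be mechanical rather than aspirational. A minimal sketch, assuming plain JSON Schema argument definitions: wrap each handler so its schema is enforced before any tool logic runs.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

def enforce_contract(schema: dict, handler):
    """Wrap a tool handler so its schema is enforced, not just published."""
    def wrapped(arguments: dict):
        try:
            validate(instance=arguments, schema=schema)
        except ValidationError as err:
            # Fail at the boundary, before any tool logic runs, with an error
            # the caller can act on instead of negotiating around a bad call.
            raise ValueError(f"contract violation: {err.message}") from err
        return handler(arguments)
    return wrapped

# Usage, with the hypothetical tool from the earlier sketches:
# create_ticket = enforce_contract(CREATE_TICKET_SCHEMA, create_ticket)
```

This does not fix a drifted schema, but it makes the drift impossible to ignore.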
Finding failures earlier
After running into these issues repeatedly, one pattern became clear:
By the time an LLM interacts with your MCP server, structural problems are already baked in.
Contracts should fail before execution.
Dependencies should be visible before prompts are written.
Unsafe changes should surface before deployment.
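A rough sketch of that last point, separate from the tooling linked below: diff two versions of a tool schema and flag anything that breaks existing callers. The schemas here are hypothetical.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Flag schema changes that can break callers relying on the old contract."""
    problems = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    # A field existing callers send has disappeared.
    for field in old_props.keys() - new_props.keys():
        problems.append(f"removed field: {field}")

    # A field that used to be optional is now required.
    for field in set(new.get("required", [])) - set(old.get("required", [])):
        problems.append(f"newly required field: {field}")

    # A field changed type under callers' feet.
    for field in old_props.keys() & new_props.keys():
        if old_props[field].get("type") != new_props[field].get("type"):
            problems.append(f"type changed: {field}")

    return problems

old = {"properties": {"title": {"type": "string"}}, "required": ["title"]}
new = {"properties": {"title": {"type": "string"}, "project_id": {"type": "string"}},
       "required": ["title", "project_id"]}
print(breaking_changes(old, new))  # ['newly required field: project_id']
```

Run against the previous release in CI, a check like this turns "the agent started behaving strangely" into a failed build.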
That line of thinking led us to experiment with analysing MCP servers statically, instead of discovering issues indirectly through LLM behaviour.
That work is documented here: https://docs.syrin.dev/
