Ursula Harrell

Posted on Jun 30

Building Reliable Claude Code Workflows with MCP Servers

#claudecode #mcp #devtools #ai

TL;DR

MCP servers unlock powerful agent workflows in Claude Code, but they fail in subtle ways that can silently kill long-running sessions. This article covers the three most common failure modes, practical debugging steps, and workflow patterns that make multi-MCP setups actually reliable in production.

What MCP Servers Are and Why They Matter

Model Context Protocol (MCP) is Anthropic's open standard for giving language models structured access to external tools and data sources. Instead of stuffing context into a prompt, you expose a server that Claude Code can call — read a file, query a database, search GitHub, run a browser action. The model decides when to call which tool, and the results feed back into the conversation.

For agent workflows, this is a big deal. A well-configured MCP setup lets Claude Code autonomously research a codebase, open pull requests, run queries, and write results back to disk — all in one session. I've used this pattern to automate tasks that used to take me 45 minutes of copy-pasting between terminals.

The catch is that MCP servers are separate processes (or remote services) that Claude Code has to coordinate with over a transport layer. When that coordination breaks down — and it does — the failure modes are confusing and the error messages are often useless. Let me walk through what I've learned debugging these in production.

The Three Most Common MCP Failure Modes

1. stdio Transport Dies Silently

Most local MCP servers use stdio transport: Claude Code spawns a child process and communicates via stdin/stdout. If that process crashes — out of memory, unhandled exception, bad config — Claude Code doesn't always surface a clean error. It just stops getting responses. The agent will either hang, retry into a loop, or return a confusing "tool returned empty" message.

I've seen this most often with filesystem servers pointed at directories with tens of thousands of files, and with custom Python MCP servers that have unhandled exceptions on edge-case inputs.
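For the custom Python server case, one mitigation is to catch handler exceptions at the dispatch boundary and turn them into JSON-RPC error responses, so the process stays alive and the client sees a real error. Here is a minimal sketch; `safe_dispatch` and the handler signature are hypothetical names for illustration, not part of any MCP SDK.

```python
import traceback
from typing import Any, Callable, Dict

# JSON-RPC 2.0 reserves -32603 for "Internal error".
INTERNAL_ERROR = -32603


def safe_dispatch(handler: Callable[[Dict[str, Any]], Any],
                  request: Dict[str, Any]) -> Dict[str, Any]:
    """Run a (hypothetical) tool handler and always return a JSON-RPC response."""
    try:
        result = handler(request.get("params", {}))
        return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    except Exception as exc:  # deliberately broad: the process must stay alive
        # Traceback goes to stderr; stdout is reserved for protocol messages.
        traceback.print_exc()
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {
                "code": INTERNAL_ERROR,
                "message": f"{type(exc).__name__}: {exc}",
            },
        }
```

If you use an MCP SDK, it may already do something similar for you. The point of the sketch is that an edge-case input should produce an error response rather than a dead child process.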

2. HTTP/SSE Transport Times Out on Slow Routes

Remote MCP servers (browser tools, hosted database connectors, third-party APIs) typically use HTTP with Server-Sent Events. These connections are sensitive to network latency and intermediate proxy behavior. A tool call that takes 8 seconds on a slow network route will frequently time out before the response arrives, especially if your Claude API connection itself is going through a congested path.

This one is insidious because the first tool call in a session often succeeds on the freshly opened connection, and then later calls in the same session fail. The SSE stream has dropped and neither side notices immediately.
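If you control the client side of a remote server, a small watchdog that tracks the time since the last received event can make a dropped stream visible. This is a sketch under assumptions: `StreamWatchdog` is a hypothetical helper, and the 15-second threshold in the usage note is an arbitrary example value, not something taken from the MCP spec.

```python
import time
from typing import Callable


class StreamWatchdog:
    """Tracks when the last SSE event or keep-alive arrived."""

    def __init__(self, max_silence_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_seen = clock()

    def touch(self) -> None:
        """Call on every received event, including keep-alive comments."""
        self._last_seen = self._clock()

    def silence(self) -> float:
        """Seconds elapsed since the last event was seen."""
        return self._clock() - self._last_seen

    def is_stale(self) -> bool:
        """True when the stream has been quiet longer than allowed."""
        return self.silence() > self.max_silence_s
```

A client loop would create `StreamWatchdog(15.0)`, call `touch()` for each event it reads, and reconnect or raise when `is_stale()` returns true before issuing the next tool call.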

3. Tool Call Returns null Because the Server Hit a Resource Limit

The MCP spec does not require a server to signal an error when it has nothing to return, so an empty or null result is a valid response. In my experience, GitHub's MCP server can return nothing when you've hit a rate limit. A database MCP server with a query timeout may return an empty result set instead of an error. Your filesystem MCP may return truncated output if you try to read a 50MB log file.

Claude Code sees an empty tool result and tries to continue reasoning from it, which produces nonsense. This is the hardest failure mode to debug because the session looks healthy — no crashes, no timeouts — just wrong outputs.
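One way to catch this in your own tooling is to classify each tool result before trusting it. The sketch below assumes the MCP tool result shape of a `content` list plus an optional `isError` flag; the function name and the set of strings treated as "empty" are my own illustrative choices.

```python
from typing import Optional


def classify_tool_result(result: Optional[dict]) -> str:
    """Return 'error', 'empty', or 'ok' for an MCP-style tool result."""
    if result is None:
        return "empty"
    if result.get("isError"):
        return "error"
    for item in result.get("content") or []:
        if item.get("type") != "text":
            return "ok"  # non-text content (image, resource) counts as data
        text = (item.get("text") or "").strip()
        # Treat serialized "nothing" values as empty too (assumed convention).
        if text and text not in ("null", "[]", "{}"):
            return "ok"
    return "empty"
```

A wrapper or proxy in front of a flaky server could log every `empty` result with the tool name and arguments, which gives you something to grep when a session produces wrong output without any visible failure.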

Practical Debugging: Reading MCP Logs and Testing in Isolation

Start with claude mcp list, which shows every configured MCP server and its transport type. Then claude mcp get <name> dumps the full config, including the command or URL being used. Run both before you assume your config is correct. I've wasted hours debugging a server that was pointing at the wrong binary.

# List all configured MCP servers
$ claude mcp list
filesystem   stdio   npx @modelcontextprotocol/server-filesystem /workspace
github       stdio   npx @modelcontextprotocol/server-github
websearch    http    http://localhost:8080/mcp

# Inspect a specific server's config
$ claude mcp get filesystem
Name: filesystem
Type: stdio
Command: npx @modelcontextprotocol/server-filesystem /workspace
Status: connected

# Check Claude Code MCP debug logs (macOS/Linux; the log location can differ between versions)
$ tail -f ~/.claude/logs/mcp-*.log

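# Test a stdio MCP server manually outside Claude Code.
# MCP expects the handshake (initialize, then notifications/initialized) before tools/list.
$ printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"manual-test","version":"0.0.1"}}}' \
  '{"jsonrpc":"2.0","method":"notifications/initialized"}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}' | \
  npx @modelcontextprotocol/server-filesystem /workspace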

The manual pipe test is underused. Send raw JSON-RPC directly to the server process and see what comes back. If it hangs or errors here, the problem is the server itself, not Claude Code.
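The same isolation test can be scripted so it fails fast instead of hanging your terminal. The following is a sketch, not a full MCP client: `probe_stdio_server` is a hypothetical helper, the protocol version string is an assumption about what the server accepts, and the timeout applies per line read rather than to the whole exchange.

```python
import json
import queue
import subprocess
import threading
from typing import List, Optional


def probe_stdio_server(cmd: List[str], timeout_s: float = 10.0) -> Optional[dict]:
    """Do the MCP handshake over stdio; return the tools/list reply or None."""
    messages = [
        {"jsonrpc": "2.0", "id": 1, "method": "initialize",
         "params": {"protocolVersion": "2024-11-05", "capabilities": {},
                    "clientInfo": {"name": "probe", "version": "0.0.1"}}},
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}},
    ]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    lines: "queue.Queue[Optional[str]]" = queue.Queue()

    def pump() -> None:
        # Reader thread: forward stdout lines so the main thread can time out.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: stdout closed, server exited

    threading.Thread(target=pump, daemon=True).start()
    try:
        for msg in messages:
            proc.stdin.write(json.dumps(msg) + "\n")
        proc.stdin.flush()
        while True:
            line = lines.get(timeout=timeout_s)
            if line is None:
                return None  # server exited without answering tools/list
            try:
                reply = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON noise on stdout
            if isinstance(reply, dict) and reply.get("id") == 2:
                return reply
    except (queue.Empty, BrokenPipeError):
        return None  # the server hung or crashed
    finally:
        proc.kill()
        proc.wait()
```

For the filesystem server from the listing above, a call might look like `probe_stdio_server(["npx", "@modelcontextprotocol/server-filesystem", "/workspace"])`. A `None` result means the server hung, crashed, or never answered, which points at the server rather than at Claude Code.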

For HTTP/SSE servers, use curl -N to open the SSE stream and watch it stay alive (or drop) over 30+ seconds. That tells you whether your network route is stable enough for long agent sessions.

Workflow Patterns That Reduce MCP Fragility

After enough production debugging, I've settled on a few patterns that make a real difference:

Keep MCP servers stateless. Don't design a workflow where tool call #4 depends on in-memory state set by tool call #1. If the server restarts between calls, that state is gone. Write intermediate results to disk via the filesystem MCP instead.
Don't chain more than 4-5 MCP tool calls in a single agent turn. Long chains give more opportunities for a mid-chain failure to corrupt the whole result. Break complex tasks into explicit checkpoints where Claude Code writes a summary to a file before continuing.
Prefer filesystem writes over in-memory state across tools. This is the same principle as above, but worth repeating: the filesystem is your most reliable MCP server. Use it as a scratchpad.
Set explicit timeouts in your MCP server configs where the protocol allows it. A 30-second timeout that surfaces a real error is better than a 5-minute hang.
Isolate unstable servers. If your custom web search MCP is flaky, don't run it in the same session as your critical filesystem + GitHub workflow. Test it separately first.
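To illustrate the timeout pattern for a custom server you write yourself, here is a minimal sketch that puts a deadline around a slow tool function and raises a clear error when it is exceeded. `call_with_timeout` and `ToolTimeout` are hypothetical names. Note the limitation: Python cannot kill the worker thread, so the stuck call keeps running in the background until it finishes on its own.

```python
import concurrent.futures
from typing import Any, Callable


class ToolTimeout(Exception):
    """Raised when a tool call exceeds its deadline."""


def call_with_timeout(fn: Callable[..., Any], timeout_s: float, *args: Any) -> Any:
    """Run fn(*args) in a worker thread and give up after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        name = getattr(fn, "__name__", "tool")
        raise ToolTimeout(f"{name} exceeded {timeout_s}s") from None
    finally:
        # Do not block on a stuck worker; the thread itself cannot be killed.
        pool.shutdown(wait=False)
```

Inside a tool handler, something like `call_with_timeout(run_query, 30.0, sql)` (with `run_query` standing in for your own function) turns a multi-minute hang into an error the agent can react to.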

A Concrete Example: Research-Then-Code Workflow

Here's a workflow I use regularly: Claude Code researches a topic using a web search MCP, writes notes to disk via the filesystem MCP, then opens a PR via the GitHub MCP. The key is structuring it so a failure in any one server doesn't kill the whole session.

In CLAUDE.md at the project root:

# Agent Workflow Instructions

## MCP Tool Usage Rules
- After each research tool call, write findings to `./scratch/research-<topic>.md` before proceeding
- If websearch MCP returns empty, log the failure to `./scratch/errors.log` and continue with cached results
- Never chain more than 3 MCP tool calls without writing a checkpoint file
- GitHub PR creation is the final step only — do not attempt it if previous steps produced errors

## Checkpoint Protocol
1. Research phase: use websearch MCP, write to ./scratch/
2. Code phase: use filesystem MCP to read/write source files
3. Review phase: read ./scratch/ notes + source files, no external calls
4. Publish phase: use github MCP to open PR

The CLAUDE.md instructions explicitly tell the agent to write checkpoints and handle partial failures gracefully. Without this, Claude Code will try to recover from a failed tool call by improvising, which usually makes things worse.

When the websearch MCP returns empty (rate limit, network issue), the agent logs it and continues rather than spinning. When the filesystem MCP is the only server that's reliably up, the agent can still complete the code phase and defer the GitHub step to a separate session.
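The checkpoint protocol can also be sketched in code, which is useful if you drive the phases from a script rather than relying only on CLAUDE.md. This is an illustration under assumptions: `run_phases`, `save_checkpoint`, and the JSON state layout are hypothetical, and each phase is modeled as a plain function returning a short summary string.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def save_checkpoint(path: Path, state: Dict) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run_phases(phases: List[Tuple[str, Callable[[], str]]],
               checkpoint: Path) -> Dict:
    """Run named phases in order, skipping those already recorded as done."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": {}, "errors": {}}
    for name, fn in phases:
        if name in state["done"]:
            continue  # resumed session: this phase finished earlier
        if name == "publish" and state["errors"]:
            break  # mirror the CLAUDE.md rule: no PR if earlier steps failed
        try:
            state["done"][name] = fn()
            state["errors"].pop(name, None)  # a successful retry clears the error
        except Exception as exc:
            state["errors"][name] = str(exc)
        save_checkpoint(checkpoint, state)
    return state
```

Running the same phase list again after a failure picks up where the previous session stopped: completed phases are skipped, the failed one is retried, and the publish step only runs once no errors remain.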

The Foundation: A Stable Connection to Anthropic's API

Here's something that took me a while to internalize: MCP tool call failures are often not MCP problems at all. They're Claude API connection problems.

When Claude Code makes a tool call, it sends the result back to the model and waits for the next response. If the underlying HTTP connection to Anthropic's API is flaky — packet loss, routing instability, regional congestion — that follow-up model call fails, and the whole tool chain collapses. You see it as an MCP failure, but the actual problem is network instability between your machine and api.anthropic.com.

I've seen this pattern repeatedly with developers on unstable ISPs or in regions with inconsistent routing to Anthropic's infrastructure. The MCP servers themselves are fine; the model just never gets the tool result because the API connection dropped mid-session.
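A rough way to check whether the route itself is unstable is to time a handful of TCP connections to the API host. The sketch below only measures TCP connect time, so it is a coarse signal rather than a full health check of the API; `connect_times` and `summarize` are hypothetical helpers, and the attempt count and timeout are arbitrary defaults.

```python
import socket
import time
from typing import List, Optional


def connect_times(host: str, port: int = 443, attempts: int = 5,
                  timeout_s: float = 3.0) -> List[Optional[float]]:
    """Return TCP connect time in seconds per attempt, or None on failure."""
    results: List[Optional[float]] = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                results.append(time.monotonic() - start)
        except OSError:
            results.append(None)  # timeout, refused, DNS failure, unreachable
    return results


def summarize(results: List[Optional[float]]) -> str:
    """Condense the attempts into a one-line report."""
    ok = [r for r in results if r is not None]
    failed = len(results) - len(ok)
    if not ok:
        return f"{failed}/{len(results)} attempts failed"
    return f"{failed}/{len(results)} failed, slowest connect {max(ok) * 1000:.0f} ms"
```

Calling `print(summarize(connect_times("api.anthropic.com")))` a few times during a long session gives a quick read on whether failures line up with connection trouble. Intermittent failures or large swings in connect time suggest looking at the network before the MCP servers.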

For developers who hit this consistently, NasaCode is worth looking at. It's a dedicated connection layer for IDE agent workflows (Claude Code, Cursor, Copilot) that aims to stabilize the route to Anthropic's API. If you've ruled out your MCP servers as the problem and you're still seeing cascading failures in long sessions, unstable API routing is a likely culprit and a dedicated tunnel is one practical fix.

Conclusion

MCP servers are powerful but they fail in ways that are easy to misdiagnose — silent stdio crashes, SSE timeouts, and empty returns from resource-limited servers all look similar from the outside. The practical fixes are: test servers in isolation before trusting them in agent workflows, use filesystem checkpoints to make sessions resumable, don't chain too many tool calls without a pause, and make sure the foundation — your connection to Anthropic's API — is actually stable. Get those basics right and multi-MCP agent workflows become genuinely reliable rather than a source of constant debugging.
