DEV Community

The Real Problems Start After Your MCP Server Works

I've been spending a lot of time building and deploying MCP servers, experimenting with tool orchestration, agent workflows, and different ways to make LLM systems interact with external systems more reliably.

[Figure: MCP support evolution chart]

At first, MCPs feel simple:

  • Expose tools
  • Connect an agent
  • Call functions
  • Done

But once MCP servers start becoming useful in real workflows, a completely different class of engineering problems begins appearing:

  • Context explosion
  • Unreliable tool selection
  • Hallucinated tool calls
  • Scaling bottlenecks
  • Permission boundaries
  • Non-deterministic agent behavior

This post is not a "how to build an MCP server" guide.

Instead, it's a quick breakdown of the real engineering problems that start appearing after MCPs move to production.

Problem #1 — Too Many Tools Make Agents Worse

One of the biggest MCP engineering problems today is tool overload.

At small scale, adding tools feels powerful.

At larger scale:

Agent
 ├── 120 tools
 ├── Massive context
 ├── Tool confusion
 └── Lower reliability
     └── Similar tools compete

This is where tool grouping becomes extremely important. Some tools must stay partitioned and selectively loaded. More tools do not automatically create smarter agents; sometimes they create noisier ones.

Fix direction: Group tools by domain and load them selectively per request context. See how GitHub MCP handles tool discovery and scoping (github/github-mcp-server/pkg/tooldiscovery). Atlassian Rovo takes a similar approach for Jira and Confluence tool scoping (atlassian/atlassian-mcp-server).
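A minimal sketch of the grouping idea, in Python. The toolset names, tool names, and the `tools_for_request` helper are all hypothetical and are not taken from the GitHub or Atlassian servers; the point is only that the agent is handed the tools for the domains enabled on this request rather than the whole catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Hypothetical toolsets keyed by domain; every name here is illustrative.
TOOLSETS: dict[str, list[Tool]] = {
    "repos": [
        Tool("get_file", "Read one file from a repository."),
        Tool("search_code", "Search code across repositories."),
    ],
    "issues": [
        Tool("create_issue", "Open a new issue."),
        Tool("list_issues", "List issues for a repository."),
    ],
    "actions": [
        Tool("list_workflow_runs", "List CI workflow runs."),
    ],
}


def tools_for_request(enabled_domains: list[str]) -> list[Tool]:
    """Return only the tools that belong to the domains enabled for this request."""
    selected: list[Tool] = []
    for domain in enabled_domains:
        if domain not in TOOLSETS:
            # Fail loudly instead of silently exposing nothing.
            raise KeyError(f"unknown toolset: {domain}")
        selected.extend(TOOLSETS[domain])
    return selected
```

With this shape, a request scoped to issue triage would call `tools_for_request(["issues"])` and the agent would see two tools instead of five.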

Problem #2 — Context Windows Become Infrastructure Problems

One thing that becomes obvious very quickly after deploying MCPs is that token usage is no longer just an LLM problem. Every MCP design decision adds to the token bill:

  • Tool description
  • Schema
  • API payload
  • Output tokens

Fix direction: Design lean schemas and surface only what the agent needs. See how Stripe's agent toolkit keeps tool payloads focused and minimal (stripe/agent-toolkit). The official MCP memory server is also a clean reference for output-efficient tool design (modelcontextprotocol/servers/src/memory).
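One way to picture "surface only what the agent needs" is to trim upstream API responses before they become tool output. The sketch below is an assumption-laden illustration, not how Stripe's toolkit or the memory server works: the field allowlist `ISSUE_FIELDS`, the helper names, and the default limit of 20 are all made up for the example.

```python
import json

# Hypothetical allowlist: the only fields the agent needs from an upstream issue record.
ISSUE_FIELDS = ("number", "title", "state")


def compact_issue(raw: dict) -> dict:
    """Keep only allowlisted fields and drop everything else (bodies, URLs, metadata)."""
    return {key: raw[key] for key in ISSUE_FIELDS if key in raw}


def compact_issue_list(raw_issues: list[dict], limit: int = 20) -> str:
    """Serialize a capped, trimmed list and tell the agent whether it was truncated."""
    trimmed = [compact_issue(issue) for issue in raw_issues[:limit]]
    result = {"issues": trimmed, "truncated": len(raw_issues) > limit}
    # Compact separators avoid spending tokens on whitespace.
    return json.dumps(result, separators=(",", ":"))
```

The `truncated` flag is a design choice worth keeping: it lets the agent know more results exist without paying for them up front.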

Problem #3 — Deterministic Tool Calls Are Hard

One of the biggest misconceptions in MCP systems is:

"If the tool exists, the agent will use it correctly."

In practice, overlapping descriptions, ambiguous naming, and similar tools cause agents to pick incorrect tools surprisingly often.

The real challenge is not making tools callable; it's making the correct tool callable at the correct time.

The tighter and clearer the tool description:

  • The more deterministic the agent behavior becomes
  • The fewer hallucinated calls happen
  • The more reliable orchestration becomes

Fix direction: Write tool descriptions like API contracts: single responsibility, zero ambiguity. Google Gemini CLI's tool definitions folder is a strong reference for well-scoped, clearly named tools (google-gemini/gemini-cli/src/tools). Cloudflare MCP also demonstrates clean, single-purpose tool design (cloudflare/mcp-server-cloudflare).
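Overlapping descriptions can be caught mechanically before an agent ever sees them. The following is a rough sketch that flags tool pairs whose descriptions share many words, using Jaccard similarity on word sets. This is a crude proxy chosen for illustration, not a technique any of the projects above is known to use, and the 0.5 threshold and function names are arbitrary assumptions.

```python
import re
from itertools import combinations


def _words(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def description_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two descriptions (0.0 to 1.0)."""
    words_a, words_b = _words(a), _words(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_ambiguous_pairs(
    tools: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return (tool, tool, score) for every pair at or above the threshold."""
    flagged = []
    for (name1, desc1), (name2, desc2) in combinations(sorted(tools.items()), 2):
        score = description_overlap(desc1, desc2)
        if score >= threshold:
            flagged.append((name1, name2, round(score, 2)))
    return flagged
```

Run as a lint step in CI, a check like this would flag "Search issues in a repository" against "Find issues in a repository" and prompt someone to merge the tools or rewrite one description.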

Problem #4 — MCP Servers Become Distributed Systems

Once MCP traffic grows, MCP servers stop behaving like simple integrations. They start behaving like distributed backend systems.

At production scale:

  • Retries matter
  • State management matters
  • Observability matters
  • Routing matters
  • Horizontal scaling matters

Client Request
      ↓
Load Balancer
      ↓
Stateless MCP Instance
      ↓
Redis / Session Store
      ↓
External APIs

Fix direction: Build stateless MCP handlers from day one and wire in observability early. Cloudflare Workers MCP shows how to run stateless, edge-deployed MCP instances at scale (cloudflare/workers-mcp). GitHub MCP's observability layer is worth studying before you scale (github/github-mcp-server/pkg/observability).
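To make "stateless handler plus session store" concrete, here is a small Python sketch. It assumes a hypothetical `SessionStore` interface and uses an in-memory class as a stand-in for Redis so the example runs without a server; the handler name and the state it tracks are invented for illustration.

```python
import json
from typing import Optional, Protocol


class SessionStore(Protocol):
    """Hypothetical minimal interface that a Redis-backed store could satisfy."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for Redis so the sketch runs locally; not for production use."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def handle_tool_call(store: SessionStore, session_id: str, tool: str) -> dict:
    """Stateless handler: nothing survives on the instance between calls."""
    key = f"session:{session_id}"
    raw = store.get(key)
    state = json.loads(raw) if raw else {"calls": 0}
    state["calls"] += 1
    state["last_tool"] = tool
    # Note: this read-modify-write is not atomic; a shared store would need
    # an atomic increment or a transaction to be safe under concurrency.
    store.set(key, json.dumps(state))
    return {"tool": tool, "call_number": state["calls"]}
```

Because the handler keeps no state of its own, any instance behind the load balancer can serve the next call for the same session, which is what makes horizontal scaling straightforward.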

A Walkthrough of GitHub's MCP Evolution

GitHub discussed how their MCP server gradually evolved while solving:

  • Huge tool surfaces
  • Context overload
  • Scaling problems
  • Auth workflows
  • Distributed infrastructure concerns

Their architecture eventually moved toward:

  • Stateless MCP servers
  • Redis-backed session handling
  • Grouped tool sets
  • Dynamic tooling concepts
  • OAuth-based auth flows
  • Scoped tool visibility
  • Aggressive token optimization

Client
   ↓
MCP Server
   ↓
Tool Sets
   ├── Repo tools
   ├── PR tools
   ├── Actions tools
   └── Issue tools

Final Thoughts

MCP engineering will increasingly depend on:

  1. Strong authentication gateways
  2. Scoped permissions
  3. Tighter tool descriptions
  4. TTL caches
  5. Deterministic orchestration
  6. Context-efficient architectures
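As one illustration of scoped permissions, the sketch below enforces scopes at the server boundary in two places: when listing tools and again when a call arrives. The scope strings, the `REQUIRED_SCOPE` mapping, and both function names are hypothetical; a real deployment would derive the granted scopes from its own auth layer.

```python
class PermissionDenied(Exception):
    """Raised when a caller invokes a tool outside its granted scopes."""


# Hypothetical mapping from tool name to the scope it requires.
REQUIRED_SCOPE = {
    "list_issues": "issues:read",
    "create_issue": "issues:write",
    "delete_repo": "repo:admin",
}


def visible_tools(granted: set[str]) -> list[str]:
    """Only advertise tools the caller is allowed to use."""
    return sorted(
        tool for tool, scope in REQUIRED_SCOPE.items() if scope in granted
    )


def authorize_call(tool: str, granted: set[str]) -> None:
    """Re-check at call time; hiding a tool from the list is not enough."""
    scope = REQUIRED_SCOPE.get(tool)
    if scope is None or scope not in granted:
        raise PermissionDenied(f"{tool} is not permitted for this caller")
```

Checking twice is deliberate: filtering the list keeps the agent's context small, while the call-time check means a hallucinated or injected call is refused by the server rather than left to the model's judgment.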

Top comments (9)

James O'Connor

The real after-launch headache for us was tool-schema versioning. Once we wrote MCP tool schemas as a DDD-style ubiquitous-language artifact, drift between client and server stopped silently breaking calls. We version the schema in a shared package both sides import. A schema change requires a coordinated bump and a migration test. Adds 15 minutes of process per change, removes the agent-silently-called-the-wrong-tool class of bug entirely.

Madhavi Pasumarthi(#madhaviai) • Edited

@txdesk really liked your point around deterministic intent-based filtering before the orchestration loop. I still personally feel MCP should stay tightly scoped and simple at the core, while larger-scale routing, retries, wait states, and orchestration are handled outside it as LLM orchestration intelligence keeps improving.

@james_oconnor_dev this is a very strong production insight. Schema/version drift between clients and MCP servers is a real problem. Your shared-contract + migration-testing approach is honestly a very solid fix. I also think the upcoming MCP TTL/cache validation direction could help clients better manage tool/version refresh cycles and reduce unnecessary revalidation overhead as adoption matures.
modelcontextprotocol.io/seps/2549-...

TxDesk

Thanks Madhavi. Agree on keeping MCP tightly scoped at the core, the moment the protocol tries to absorb orchestration logic, the contract surface explodes and every client implementation diverges. The complexity belongs in the layer above, where it can be opinionated per use case.

SEP-2549 is interesting. The TTL/cache validation direction would solve a real operational pain point I hit early on (clients hammering schema endpoints on every tool call). My current workaround is in-memory client-side caching with manual invalidation on tool-update webhooks. The spec'd version would let me delete a meaningful chunk of glue code.

Max Quimby

The tool-overload point is the one I wish more people internalized before shipping. We had an MCP server with ~40 tools and tool-call accuracy collapsed past about 25 — the model would confidently invent plausible-looking tool names that didn't exist. Pruning to a focused 12 fixed it overnight.

A couple of additions from production wear-and-tear:

  • Description ambiguity is the silent killer. Two tools whose descriptions overlap by even 30% will get confused. We started writing tool descriptions like SDK docstrings — explicit "use this when…" and "do NOT use this for…" sections. Selection accuracy jumped measurably.
  • Argument schemas need to be punishing. Optional fields invite hallucinated values. Make everything required that can be, and let the host pass null explicitly.
  • Observability on tool_call → tool_result latency caught more bugs than any other signal — slow tools cause the model to retry or abandon mid-plan, which looks like a "model quality" problem until you look at the trace.

Stateless + Redis sessions is the right backbone. Curious if you've settled on a pattern for streaming tool results back during long-running calls — that's the one I'm still iterating on.

Madhavi Pasumarthi(#madhaviai)

Thanks for resonating with my post. You are right: for a lot of the successful public MCP servers out there, accuracy and reliability have improved as the number of tools went down and the tool descriptions got tighter.

MCP has made improvements for long-running tasks, which help with async agent orchestration. Please check MCP Tasks, which let a server return a taskId instead of waiting for a long-running result. The task includes ttlMs and pollIntervalMs, so the client knows how long the task is retained and how often to poll.
MCP also supports progress updates through progressToken and notifications/progress.

TxDesk

Max's "do NOT use this for" framing is the one I'd reach for first. It's the cheapest fix and it goes further than people expect. One thing I'd add from running a larger surface in production: description tightening eventually hits a ceiling, and the next move is pre-filtering the tool list outside the model.

What worked for me: a cheap intent-classification call upstream of the agent loop, which narrows the available tool set per turn before the model ever sees it. The agent gets a slice of the surface that matches the intent, not the whole catalog. Selection accuracy on the larger tool set went from "needs constant description tuning to stay viable" to "stable across a much wider set" because the model isn't choosing from N tools, it's choosing from 5-8 that the upstream call already qualified as relevant.

It also helps with the cache-stability point on Madhavi's #2. Static tool sets cache well; dynamically filtered ones do too, as long as the filter is deterministic per intent class. The thing that breaks caching is per-request dynamic tool availability with no stable boundary, which is what naively-pruned tool lists end up being.

On Max's open question about streaming long-running results: we ended up doing the same Tasks-style pattern Madhavi mentions, taskId + progress updates. The thing that surprised us was that even short tool calls benefit from the progress-update channel, because the model's behavior on "no signal for 800ms" was to assume failure and retry, which compounded the latency problem. Heartbeating non-streaming tools at a fixed interval fixed it.

Harjot Singh

"Getting an MCP server to respond is the easy part" is the line every MCP tutorial should open with, because the demo is a tool returning JSON and the product is everything on your list. The highest-stakes item is the one you put last: stopping an agent from calling a destructive tool it shouldn't. The key insight is that this cannot live in the prompt. An agent can be argued, confused, or injection-tricked into trying anything, so "don't delete prod" is a wish, not a control. It has to be structural at the server boundary: scoped permissions per tool, with the destructive ones gated behind explicit approval or simply not exposed to that agent, so a bad call is refused by the system rather than declined by a model.

Auth and rate limiting are the same shape of lesson: the server has to assume the caller is non-deterministic and occasionally adversarial, which is a different threat model than a normal API with a human or fixed client behind it. Tool versioning is the sneaky one people learn the hard way: an agent that learned v1's tool contract breaks silently when you ship v2 with a changed schema. There is no error, just subtly wrong calls, so versioning and back-compat matter more here than in a normal API.

Make the boundary enforce what the agent can do, because the agent won't reliably restrain itself. That enforce-at-the-tool-server instinct is core to how I think about MCP in Moonshift. For the destructive-tool problem, are you gating with per-tool permissions/approvals, or keeping dangerous tools off the agent's surface entirely?

James O'Connor

@madhaviai agreed on keeping MCP's core tight. The intent-based filtering belongs in the orchestration layer, not inside MCP. We split it: MCP tools stay shape-typed and stateless, while the orchestrator owns intent classification (a small DistilBERT model running in 30ms) and routes to one of three tool subsets. The classifier gets retrained when intent drift shows up in traces. MCP itself never knows about the intent layer. That keeps your composability promise intact and removes the silent-call class of bug at the same time.
