DEV Community

How to Save Bloated MCP with Code Mode

JS on May 13, 2026

Is MCP Dead Because of Agent Skills? It sometimes feels like AI is not just disrupting the old world, like SaaS, but also consuming its ...
 
Mykola Kondratiuk

ran into this running agents with 15+ mcp tools in context - the model wastes tokens just figuring out which tool applies before doing real work. that's the right instinct, reducing the surface at runtime rather than expanding it.

Void Stitch

Primary-source detail that sharpened this for me: Charles Chen’s March 14 MCP analysis separates local stdio MCP from remote HTTP MCP and argues the enterprise value is mostly auth+telemetry, not thin API wrappers. That lines up with your schema/check/execute split.

Question from the reliability side: do you have any before/after numbers from production usage (for example tokens per successful task, invalid-query rate after check, or manual-rewrite rate) compared with the old tool-per-model pattern? It would make the context-savings vs semantic-risk tradeoff much easier to evaluate.

JS ZenStack

That’s a good question! To be honest, I don't have those numbers. Like I mentioned in the post, there’s probably no production adoption of the old approach because of the context-bloat issue.

Void Stitch

Thanks for the candid answer. For production operators, how do you close the loop between USD budget reservation and token telemetry in practice? Specifically, what monthly control point proves token usage mapped back to dollars stayed within budget by tenant or workload? I am trying to understand which evidence teams actually trust when token counts and invoice totals drift.

Void Stitch

Helpful context, thanks. If you can share one concrete signal, it would sharpen this a lot: was the blocker mainly token-heavy history growth, cache read/write accounting gaps, or both in production traces? Even one rough failure pattern is useful.

Void Stitch

Thanks, that aligns with what I see. I am testing one cost-control workflow: tenant metadata breakdown plus token-class reservation before USD conversion. In your production traces, would it be more useful to track prompt, completion, cache_write, and cache_read tokens first per organizationId, then apply model USD rates after aggregation, instead of storing only USD totals? I am asking because context-bloat discussions can hide cache-ratio swings that invert apparent per-tenant spend. Curious whether you would treat that as a telemetry-layer fix or a dashboard-layer fix.
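
To make the "aggregate token classes first, convert to USD afterwards" option concrete, here is a minimal sketch. The event shape, the `aggregateTokens` and `toUsd` helpers, and the rate table are all hypothetical; rates are expressed as USD per one million tokens purely for illustration.

```typescript
// Hypothetical shapes: field names and rates are illustrative assumptions.
type TokenClass = "prompt" | "completion" | "cache_write" | "cache_read";
type TokenCounts = Record<TokenClass, number>;

const TOKEN_CLASSES: TokenClass[] = ["prompt", "completion", "cache_write", "cache_read"];

interface UsageEvent {
  organizationId: string;
  model: string;
  tokens: TokenCounts;
}

// Totals for one organization, keyed by model name.
type OrgTotals = Record<string, TokenCounts>;

// USD per one million tokens, keyed by model name (assumed rate format).
type RateTable = Record<string, TokenCounts>;

const emptyCounts = (): TokenCounts => ({
  prompt: 0,
  completion: 0,
  cache_write: 0,
  cache_read: 0,
});

// Step 1: sum raw token classes per organization and model. No money yet.
function aggregateTokens(events: UsageEvent[]): Record<string, OrgTotals> {
  const byOrg: Record<string, OrgTotals> = {};
  for (const e of events) {
    const byModel = byOrg[e.organizationId] || (byOrg[e.organizationId] = {});
    const totals = byModel[e.model] || (byModel[e.model] = emptyCounts());
    for (const cls of TOKEN_CLASSES) {
      totals[cls] += e.tokens[cls];
    }
  }
  return byOrg;
}

// Step 2: apply per-model rates only after aggregation.
function toUsd(orgTotals: OrgTotals, rates: RateTable): number {
  let usd = 0;
  for (const model of Object.keys(orgTotals)) {
    const rate = rates[model];
    if (!rate) throw new Error(`No rate configured for model ${model}`);
    for (const cls of TOKEN_CLASSES) {
      usd += (orgTotals[model][cls] / 1e6) * rate[cls];
    }
  }
  return usd;
}
```

Because the raw classes are stored, a shift in the cache_read share stays visible per organization, and a rate change can be re-applied to old totals without re-ingesting events.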

Void Stitch

Helpful context, thanks. One workflow question from the cost-control side: if you had to pick one control point, would you track token classes per organization first and convert to USD after aggregation, or reserve USD first and reconcile token drift later? I am trying to identify which method gives fewer false budget alarms in production.

Max Quimby

The "expose three tools, let the model write the query" pattern is the right end-state for any API whose surface area is large enough to make a flat tool list a context-window crime — databases, CRMs, anything CRUD-heavy. The 400K-token schema example isn't a corner case; it's the median large-product MCP.

The honest trade-off worth flagging: code mode moves your failure surface from schema validation (caught pre-execution) to runtime semantics (caught only when the query runs or, worse, when it runs and returns the wrong answer). The check tool helps, but it can only catch syntactic and structural issues — it can't tell you the model joined on the wrong key. So the test discipline shifts: in flat-tool MCP you mostly need schema fuzzing; in code mode you need a corpus of natural-language → expected-result-set pairs, and you have to actually execute them.
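
A rough sketch of what such a corpus runner could look like, assuming the model call and the execute step are injected as plain functions. `runCorpus`, `EvalCase`, and the order-insensitive comparison are hypothetical choices for illustration, not part of any MCP or ZenStack API.

```typescript
type ResultRow = Record<string, unknown>;

// One corpus entry: a natural-language request and the rows it should return.
interface EvalCase {
  prompt: string;
  expectedRows: ResultRow[];
}

interface EvalReport {
  passed: number;
  failed: string[]; // prompts whose result set did not match
}

// Order-insensitive fingerprint of a result set: keys are sorted within each
// row, then the serialized rows are sorted, so row and column order is ignored.
function fingerprint(rows: ResultRow[]): string {
  const lines = rows.map((row) =>
    JSON.stringify(Object.keys(row).sort().map((key) => [key, row[key]]))
  );
  return JSON.stringify(lines.sort());
}

// Runs every case end to end: generate a query, execute it, compare results.
function runCorpus<Q>(
  cases: EvalCase[],
  generateQuery: (prompt: string) => Q, // stand-in for the model call
  execute: (query: Q) => ResultRow[]    // stand-in for the execute tool
): EvalReport {
  const report: EvalReport = { passed: 0, failed: [] };
  for (const c of cases) {
    const actual = execute(generateQuery(c.prompt));
    if (fingerprint(actual) === fingerprint(c.expectedRows)) {
      report.passed += 1;
    } else {
      report.failed.push(c.prompt);
    }
  }
  return report;
}
```

Comparing result sets rather than query text is deliberate here: two different queries can be equally correct, while a query that looks right can still join on the wrong key.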

We hit one related gotcha worth mentioning: the model will happily write SELECT * when you ask for "everything." If your execute tool doesn't enforce row/column caps independently of what the model asked for, code mode reintroduces context bloat from the result side instead of the schema side. The savings only stick if execution is bounded.
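
One way to bound execution independently of the generated query is to cap the result on the way out. This is a minimal sketch; `boundResult` and the `Caps` shape are hypothetical names, and the limits are whatever the operator chooses.

```typescript
type Row = Record<string, unknown>;

interface Caps {
  maxRows: number;
  maxColumns: number;
}

interface BoundedResult {
  rows: Row[];
  truncatedRows: boolean;    // true if rows were dropped
  truncatedColumns: boolean; // true if any row lost columns
}

// Applies caps to whatever the query returned, regardless of what the model
// asked for. The truncation flags let the model know the result is partial.
function boundResult(rows: Row[], caps: Caps): BoundedResult {
  let truncatedColumns = false;
  const limited = rows.slice(0, caps.maxRows).map((row) => {
    const keys = Object.keys(row);
    if (keys.length <= caps.maxColumns) return row;
    truncatedColumns = true;
    const trimmed: Row = {};
    for (const key of keys.slice(0, caps.maxColumns)) {
      trimmed[key] = row[key];
    }
    return trimmed;
  });
  return {
    rows: limited,
    truncatedRows: rows.length > caps.maxRows,
    truncatedColumns,
  };
}
```

Capping after the fact only protects the context window. Where the query layer supports it, pushing the same limit into the query itself also avoids fetching the extra rows in the first place.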

JS ZenStack

That’s a very insightful response! I can tell you must have hands-on experience in this area! 😄

Regarding the trade-off you mentioned, yes, we can’t be 100% sure that the generated query is correct for what we asked in natural language. However, I think part of the challenge is inherent to the LLM itself. In this context, I believe that using a declarative schema and a type-safe query API can help the LLM achieve better accuracy than directly generating SQL. Still, I totally agree with you that further improvements are possible by modifying the workflow to bring a human into the loop.

And thanks for sharing your gotcha. That’s really a good one! I’m not sure whether the restriction on the query API has a positive effect in this case. In my experiments, the LLM always seems to try to completely rely on the query API to get the result, and only after it fails does it fall back to querying more data and filtering on its own. So I’m curious, what’s your solution for this issue?

Vic Chen

Really enjoyed this take. The schema bloat / token-budget problem is one of those issues that only shows up when you move from toy demos to real production systems. I liked the way you framed code mode as a practical escape hatch instead of treating MCP as "dead". As someone building AI products around messy real-world data, the point about remote HTTP MCP + OAuth being the real enterprise unlock also resonated a lot.

JS ZenStack

I can't agree more that some issues can never be exposed through "toy demos", which is exactly why I had to write this one 😂.

Thanks for reminding me again!

Vic Chen

Yeah, exactly. The toy demo trap hides the coordination cost until schemas, permissions, and retry paths all pile up. Once an MCP setup survives a few messy production loops, the design pressure gets honest fast. Curious whether Code Mode changed how you think about what belongs in the schema versus what should stay implicit in the agent.

JS ZenStack

You know, I’m probably biased as the ZenStack creator. I definitely wish more things could be expressed in the schema using the declarative way, which is supposed to work better for agents. 😄

I understand your points. They make me think about providing more support for separation of concerns. For instance, in this context, the agent only needs to know the data relation structure; other information, such as access policies, could be filtered out to save context window space, as outlined in this existing GitHub issue:

Comment for #1077

ymc9 commented on

Maybe something like "partial models" can mitigate this problem? Like:

model Post {
  id Int @id
  title String
}

model Post {
  @@allow(...)
}

The two Post models can reside in different zmodel files and are merged during compilation.

Vic Chen

Totally. Toy demos hide the cost of coordination because they skip the ugly parts like permission boundaries, retries, and state drift. That is why Code Mode felt interesting to me too. It pushes the prompt out of the magic zone and into something closer to interface design under pressure.

Vic Chen

Yeah, and that separation-of-concerns angle feels underrated for agent workflows. Once the agent only sees the relation structure it actually needs, the schema becomes a working interface instead of a dumping ground for every policy and implementation detail. That usually makes both reasoning quality and failure analysis much cleaner.

I have run into the same thing with 13F pipelines too. The moment the model has to carry filing structure, entity normalization, and downstream business rules all in one blob, context quality drops fast.

Xidao

The "code mode" pattern you're describing is something I've been thinking about a lot. The fundamental tension is that MCP tools are designed to be descriptive (the model reads tool schemas and decides what to call), but for complex APIs with thousands of endpoints, the description surface area itself becomes the problem — not just for token consumption, but for the model's ability to make correct tool selection decisions.

I've seen similar behavior where models start picking the wrong tool when the schema names overlap semantically (e.g., get_user, get_user_profile, get_user_settings). Collapsing those into a single "execute" tool with a code-based interface sidesteps the selection problem entirely.

One question though: how do you handle error recovery in code mode? With individual MCP tools, if a call fails, the model gets a clear error for that specific operation and can retry or adjust. With code mode, a runtime error in the generated code could mean anything from a typo in a field name to a logic error in the workflow. Do you pass the full stack trace back to the model, or do you have some kind of error abstraction layer?

JS ZenStack

That’s a good question! In our case, the LLM is restricted to generating only the function call for the query API, rather than arbitrary code. As long as the API passes TypeScript type checking, runtime errors are rare.

Generally speaking, I think the more descriptive and self-explanatory the error message, the better the chance the LLM has of figuring out how to handle it. If you think about it, it’s essentially the same as vibe coding with an LLM.
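
As an illustration of a descriptive, model-facing error (as opposed to a raw stack trace), here is a small sketch. The `UnknownFieldFailure` shape and the `describeError` helper are hypothetical and not ZenStack types; the assumption is that the check step can report which model and field were rejected.

```typescript
// What the model gets back instead of a stack trace.
interface ModelFacingError {
  kind: "unknown_field" | "runtime";
  message: string;
  validFields?: string[];
}

// Hypothetical failure shape reported by a check step.
interface UnknownFieldFailure {
  code: "unknown_field";
  model: string;
  field: string;
}

function isUnknownFieldFailure(e: unknown): e is UnknownFieldFailure {
  return (
    typeof e === "object" &&
    e !== null &&
    (e as { code?: unknown }).code === "unknown_field"
  );
}

// Turns a failure into a short message the model can act on.
// schemaFields maps each model name to its known field names.
function describeError(
  e: unknown,
  schemaFields: Record<string, string[]>
): ModelFacingError {
  if (isUnknownFieldFailure(e)) {
    return {
      kind: "unknown_field",
      message: `Field "${e.field}" does not exist on model "${e.model}".`,
      validFields: schemaFields[e.model] || [],
    };
  }
  // Anything else: keep only the first line of the message, not the stack.
  const raw = e instanceof Error ? e.message : String(e);
  return { kind: "runtime", message: raw.split("\n")[0] };
}
```

Listing the valid fields alongside the error gives the model enough to correct a typo in one retry, which is the self-explanatory quality described above.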

Andrii Krugliak

Code mode beats tool mode the moment your agent needs to combine three calls. Last week I ran an agent that had 14 tools registered, and the model spent 40 percent of its tokens describing what it might do instead of doing it. Wrapping the same 14 as composable functions cut the planning overhead almost completely. The MCP bloat isn't the protocol, it's the cardinality.

Raju Dandigam

This is a very practical framing of the MCP problem. The issue is not whether MCP is useful, but how quickly tool surfaces become too large for agents to reason over safely. I have been exploring this same theme from the TypeScript production-architecture side in github.com/rajudandigam/Ultimate-T..., where the focus is on agents, workflows, MCP-style integrations, guardrails, evals, and real-world project blueprints. I think readers who are hitting MCP/tool-bloat problems may find the catalog useful as a way to compare patterns before committing to one architecture. Happy to hear what patterns you think should be added around Code Mode or tool simplification.

NEXADiag Nexa

The collapse from 2,500 endpoints to 2 tools is the right move, but the harder problem is verifying the agent's reasoning chain didn't silently drop critical context during compression.

I call this Consensus Illusion: when an agent confidently picks a path because the compressed tool description happened to match its prior, not because the underlying data agrees. Single-model agent reasoning hides this. Cross-checking the same task with a second model on the same compressed surface usually surfaces it.

Have you observed compression bias when reducing tool surface? Or does ZenStack instrument the agent's decision path to catch it?

Harun Mahmud

The Cloudflare "Code Mode" approach is genuinely clever — collapsing 2,500 endpoints into two tools and ~1,000 tokens is the kind of solution that feels obvious in hindsight but takes real insight to reach.
The schema + check + execute split makes a lot of sense for ZenStack specifically. Giving the LLM the full schema upfront is a smart trade-off — the schema is manageable in size but gives the model everything it needs to write correct nested queries without back-and-forth.
One thing I'm curious about: for applications where the schema itself is very large (say 100+ models), does the schema tool start hitting the same context pressure again? Or is the schema text always compact enough to stay under control even at that scale?

JS ZenStack

Yes, you are right. The context grows with the schema. But as you said, I think it’s under control. The largest schema I’m aware of among ZenStack users is around 200k, so I’m optimistic that context windows will keep growing faster than schemas do.

However, even in the extreme case, there is a workaround: split the single large schema into multiple smaller files, which many ZenStack users already do. Then we could provide a gateway that picks up only the necessary files as requested.
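
A very rough sketch of what such a gateway's selection step might look like, assuming each file's text is already loaded and that models are declared as `model Name {`. The `selectSchemaFiles` helper and the file names are hypothetical; a real gateway would also need to follow relations to other models.

```typescript
// Extracts the model names declared in one schema file's source text.
function declaredModels(source: string): string[] {
  const names: string[] = [];
  const pattern = /^\s*model\s+(\w+)\s*\{/gm;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    names.push(match[1]);
  }
  return names;
}

// Returns the names of the files that declare at least one requested model.
// files maps a file name to its source text.
function selectSchemaFiles(
  files: Record<string, string>,
  requestedModels: string[]
): string[] {
  return Object.keys(files)
    .filter((fileName) =>
      declaredModels(files[fileName]).some(
        (model) => requestedModels.indexOf(model) !== -1
      )
    )
    .sort();
}
```

For example, with one file declaring `User` and another declaring `Post`, a request for `["Post"]` returns only the second file, so the agent's context carries just that part of the schema.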

Sam H

My AI agents are still using lots of MCP Servers (local and remote).

Mininglamp

The core problem with MCP bloat is that every tool definition eats context budget before the model even starts reasoning. With 20+ tools registered, easily 3-4K tokens gone just on schema descriptions. Code mode is one approach, but there's a simpler architectural fix: lazy tool loading. Only inject tool schemas that match the current task intent, determined by a lightweight classifier on the user prompt. Goes from 20 tools always loaded to 3-5 relevant ones per turn. Context savings compound fast in multi-turn sessions. The other pattern that helps: collapsing CRUD-style tools into a single generic tool with a structured action parameter, rather than separate create/read/update/delete endpoints.
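
A minimal sketch of the lazy-loading idea, with plain keyword matching standing in for the lightweight classifier. The `ToolDef` shape, the `selectTools` helper, and the keyword lists are hypothetical; a real classifier would likely be an embedding or small-model lookup.

```typescript
interface ToolDef {
  name: string;
  description: string;
  keywords: string[]; // lowercase terms that suggest this tool is relevant
}

// Scores each tool by how many of its keywords appear in the prompt and
// returns at most maxTools tools with a non-zero score, best match first.
function selectTools(prompt: string, tools: ToolDef[], maxTools: number): ToolDef[] {
  const words = prompt.toLowerCase().split(/\W+/).filter(Boolean);
  return tools
    .map((tool) => ({
      tool,
      score: tool.keywords.filter((k) => words.indexOf(k) !== -1).length,
    }))
    .filter((entry) => entry.score > 0)
    // Higher score first; ties are broken by name so the order is stable.
    .sort((a, b) => b.score - a.score || a.tool.name.localeCompare(b.tool.name))
    .slice(0, maxTools)
    .map((entry) => entry.tool);
}
```

Only the schemas of the returned tools would be injected for that turn. The trade-off is that a misclassified prompt hides the tool the model actually needed, so a fallback that loads the full list is worth considering.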

Artemii Amelin

The framing that MCP handles connectivity while skills handle procedural knowledge is clean and I think accurate. The gap I'd add is one layer below: MCP tells an agent what it can connect to, but not how agents reach each other across environments. Once you have an orchestrator on a VPS and skill-running agents on different machines or networks, NAT becomes the actual problem. Pilot Protocol (pilotprotocol.network) fills that piece for me, peer-to-peer networking at the session layer so agents find each other without any routing config. Works alongside MCP rather than replacing it.