DEV Community

How to Save Bloated MCP with Code Mode

JS on May 13, 2026

Is MCP Dead Because of Agent Skills? It sometimes feels like AI is not just disrupting the old world, like SaaS, but also consuming its ...


Mykola Kondratiuk • May 14

ran into this running agents with 15+ mcp tools in context - the model wastes tokens just figuring out which tool applies before doing real work. that's the right instinct, reducing the surface at runtime rather than expanding it.

Sol • May 19

Primary-source detail that sharpened this for me: Charles Chen’s March 14 MCP analysis separates local stdio MCP from remote HTTP MCP and argues the enterprise value is mostly auth+telemetry, not thin API wrappers. That lines up with your schema/check/execute split.

Question from the reliability side: do you have any before/after numbers from production usage (for example tokens per successful task, invalid-query rate after check, or manual-rewrite rate) compared with the old tool-per-model pattern? It would make the context-savings vs semantic-risk tradeoff much easier to evaluate.

JS ZenStack • May 21

That’s a good question! To be honest, I don't have those numbers. Like I mentioned in the post, there’s probably no production adoption of the old approach because of the context bloating issue.

Sol • May 21

Helpful context, thanks. One workflow question from the cost-control side: if you had to pick one control point, would you track token classes per organization first and convert to USD after aggregation, or reserve USD first and reconcile token drift later? I am trying to identify which method gives fewer false budget alarms in production.

Sol • May 21

Helpful context, thanks. If you can share one concrete signal, it would sharpen this a lot: was the blocker mainly token-heavy history growth, cache read/write accounting gaps, or both in production traces? Even one rough failure pattern is useful.

Sol • May 21

Thanks, that aligns with what I see. I am testing one cost-control workflow: tenant metadata breakdown plus token-class reservation before USD conversion. In your production traces, would it be more useful to track prompt, completion, cache_write, and cache_read tokens first per organizationId, then apply model USD rates after aggregation, instead of storing only USD totals? I am asking because context-bloat discussions can hide cache-ratio swings that invert apparent per-tenant spend. Curious whether you would treat that as a telemetry-layer fix or a dashboard-layer fix.
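To make the "aggregate token classes first, price later" idea in the comment above concrete, here is a minimal sketch. The event shape, the function names, and the rates in `EXAMPLE_RATES` are all hypothetical, chosen only for illustration; they do not come from any provider's price list or from the post.

```typescript
type TokenClass = "prompt" | "completion" | "cache_write" | "cache_read";
type TokenCounts = Record<TokenClass, number>;

interface UsageEvent {
  organizationId: string;
  tokens: TokenCounts;
}

const TOKEN_CLASSES: TokenClass[] = ["prompt", "completion", "cache_write", "cache_read"];

// Sum raw token counts per organization; no pricing is applied at this stage.
function aggregateByOrg(events: UsageEvent[]): Map<string, TokenCounts> {
  const totals = new Map<string, TokenCounts>();
  for (const event of events) {
    const current: TokenCounts =
      totals.get(event.organizationId) ??
      { prompt: 0, completion: 0, cache_write: 0, cache_read: 0 };
    for (const cls of TOKEN_CLASSES) current[cls] += event.tokens[cls];
    totals.set(event.organizationId, current);
  }
  return totals;
}

// Convert aggregated counts to USD using per-million-token rates.
function toUsd(counts: TokenCounts, ratesPerMillion: TokenCounts): number {
  let usd = 0;
  for (const cls of TOKEN_CLASSES) usd += (counts[cls] * ratesPerMillion[cls]) / 1_000_000;
  return usd;
}

// Hypothetical USD rates per million tokens, for illustration only.
const EXAMPLE_RATES: TokenCounts = { prompt: 3, completion: 15, cache_write: 4, cache_read: 1 };
```

Because the per-class counts are stored rather than only a USD total, a rate change or a shift in the cache_read share can be re-priced later without losing the underlying usage.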

Sol • May 21

Thanks for the candid answer. For production operators, how do you close the loop between USD budget reservation and token telemetry in practice? Specifically, what monthly control point proves token usage mapped back to dollars stayed within budget by tenant or workload? I am trying to understand which evidence teams actually trust when token counts and invoice totals drift.

Max Quimby • May 13

The "expose three tools, let the model write the query" pattern is the right end-state for any API whose surface area is large enough to make a flat tool list a context-window crime — databases, CRMs, anything CRUD-heavy. The 400K-token schema example isn't a corner case; it's the median large-product MCP.

The honest trade-off worth flagging: code mode moves your failure surface from schema validation (caught pre-execution) to runtime semantics (caught only when the query runs or, worse, when it runs and returns the wrong answer). The check tool helps, but it can only catch syntactic and structural issues — it can't tell you the model joined on the wrong key. So the test discipline shifts: in flat-tool MCP you mostly need schema fuzzing; in code mode you need a corpus of natural-language → expected-result-set pairs, and you have to actually execute them.
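A corpus of natural-language to expected-result-set pairs could be checked with something like the sketch below. The names `EvalCase`, `sameResultSet`, and `runCorpus` are hypothetical, and `answer` stands in for the whole "model writes a query, execute tool runs it" path, which is assumed rather than shown.

```typescript
type Row = Record<string, string | number>;

interface EvalCase {
  prompt: string;  // natural-language request
  expected: Row[]; // rows a correct query must return
}

// Order-insensitive form: serialize each row with sorted keys, then sort the rows.
function canonical(rows: Row[]): string[] {
  return rows
    .map((row) => JSON.stringify(Object.keys(row).sort().map((key) => [key, row[key]])))
    .sort();
}

function sameResultSet(a: Row[], b: Row[]): boolean {
  const ca = canonical(a);
  const cb = canonical(b);
  return ca.length === cb.length && ca.every((value, i) => value === cb[i]);
}

// Returns the prompts whose executed result differs from the expected rows.
function runCorpus(cases: EvalCase[], answer: (prompt: string) => Row[]): string[] {
  return cases
    .filter((c) => !sameResultSet(answer(c.prompt), c.expected))
    .map((c) => c.prompt);
}
```

Comparing result sets rather than query text means a query that is structurally valid but joins on the wrong key still shows up as a failure.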

We hit one related gotcha worth mentioning: the model will happily write SELECT * when you ask for "everything." If your execute tool doesn't enforce row/column caps independently of what the model asked for, code mode reintroduces context bloat from the result side instead of the schema side. The savings only stick if execution is bounded.
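One way to bound execution independently of the query is to cap the result after it comes back. This is a minimal sketch; `boundResult`, the `BoundedResult` shape, and the cap values are assumptions for illustration, not part of any MCP SDK or of the approach described in the post.

```typescript
type Row = Record<string, unknown>;

interface BoundedResult {
  rows: Row[];
  truncated: boolean; // true if any rows or columns were dropped
  totalRows: number;  // how many rows the query actually returned
}

// Hypothetical caps; tune them to the context budget you can afford.
const MAX_ROWS = 50;
const MAX_COLUMNS = 8;

// Apply caps after the query runs, regardless of what the model asked for.
function boundResult(rows: Row[], maxRows = MAX_ROWS, maxColumns = MAX_COLUMNS): BoundedResult {
  let truncated = rows.length > maxRows;
  const limited: Row[] = [];
  for (const row of rows.slice(0, maxRows)) {
    const keys = Object.keys(row);
    if (keys.length > maxColumns) truncated = true;
    const kept: Row = {};
    for (const key of keys.slice(0, maxColumns)) kept[key] = row[key];
    limited.push(kept);
  }
  return { rows: limited, truncated, totalRows: rows.length };
}
```

Returning `truncated` and `totalRows` tells the model that more data exists, so it can narrow the query instead of assuming it saw everything.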

JS ZenStack • May 14

That’s a very insightful response! I can tell you must have hands-on experience in this area! 😄

Regarding the trade-off you mentioned, yes, we can’t be 100% sure that the generated query is correct for what we asked in natural language. However, I think part of the challenge is inherent to the LLM itself. In this context, I believe that using a declarative schema and a type-safe query API can help the LLM achieve better accuracy than directly generating SQL. Still, I totally agree with you that further improvements are possible by modifying the workflow to bring a human into the loop.

And thanks for sharing your gotcha. That’s really a good one! I’m not sure whether the restriction on the query API has a positive effect in this case. In my experiments, the LLM always seems to try to completely rely on the query API to get the result, and only after it fails does it fall back to querying more data and filtering on its own. So I’m curious, what’s your solution for this issue?

Xidao • May 14

The "code mode" pattern you're describing is something I've been thinking about a lot. The fundamental tension is that MCP tools are designed to be descriptive (the model reads tool schemas and decides what to call), but for complex APIs with thousands of endpoints, the description surface area itself becomes the problem — not just for token consumption, but for the model's ability to make correct tool selection decisions.

I've seen similar behavior where models start picking the wrong tool when the schema names overlap semantically (e.g., get_user, get_user_profile, get_user_settings). Collapsing those into a single "execute" tool with a code-based interface sidesteps the selection problem entirely.

One question though: how do you handle error recovery in code mode? With individual MCP tools, if a call fails, the model gets a clear error for that specific operation and can retry or adjust. With code mode, a runtime error in the generated code could mean anything from a typo in a field name to a logic error in the workflow. Do you pass the full stack trace back to the model, or do you have some kind of error abstraction layer?

JS ZenStack • May 15

That’s a good question! In our case, the LLM is restricted to generating only the function call for the query API, rather than arbitrary code. As long as the API passes TypeScript type checking, runtime errors are rare.

Generally speaking, I think the more descriptive and self-explanatory the error message, the better the chance the LLM has of figuring out how to handle it. If you think about it, it’s essentially the same as vibe coding with an LLM.
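As an illustration of a descriptive, self-explanatory error, a check step could return a structured message that names the bad field and lists the valid ones. The `ModelFacingError` shape and `checkFields` helper below are hypothetical and are not ZenStack's actual error format.

```typescript
interface ModelFacingError {
  kind: "unknown_field";
  message: string;
  validFields: string[]; // gives the model what it needs to retry
}

// Compare the fields a generated query selects against the fields the model defines.
function checkFields(model: string, requested: string[], known: string[]): ModelFacingError[] {
  return requested
    .filter((field) => !known.includes(field))
    .map((field) => ({
      kind: "unknown_field" as const,
      message: `Field "${field}" does not exist on model "${model}".`,
      validFields: known,
    }));
}
```

For example, `checkFields("Post", ["id", "titel"], ["id", "title"])` yields one error whose message points at `titel`, which is usually enough for the model to correct the typo on the next attempt.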

Vic Chen • May 13

Really enjoyed this take. The schema bloat / token-budget problem is one of those issues that only shows up when you move from toy demos to real production systems. I liked the way you framed code mode as a practical escape hatch instead of treating MCP as "dead". As someone building AI products around messy real-world data, the point about remote HTTP MCP + OAuth being the real enterprise unlock also resonated a lot.

JS ZenStack • May 14

I can't agree more that some issues can never be exposed through "toy demos", which is exactly why I had to write this one 😂.

Thanks for reminding me again!

Vic Chen • May 14

Yeah, exactly. The toy demo trap hides the coordination cost until schemas, permissions, and retry paths all pile up. Once an MCP setup survives a few messy production loops, the design pressure gets honest fast. Curious whether Code Mode changed how you think about what belongs in the schema versus what should stay implicit in the agent.

JS ZenStack • May 14

You know, I’m probably biased as the ZenStack creator. I definitely wish more things could be expressed declaratively in the schema, which is supposed to work better for agents. 😄

I understand your points. They make me think about providing more support for Separation of Concerns. For instance, in this context, the agent only needs to know the data relation structure; other information, such as access policy, could be filtered out to save context window space, as outlined in this existing GitHub issue:

Comment for #1077

ymc9 commented on Mar 04, 2024

Maybe something like "partial models" can mitigate this problem? Like:

model Post {
  id Int @id
  title String
}

model Post {
  @@allow(...)
}

The two Post models can reside in different zmodel files and are merged during compilation.


Vic Chen • May 14

Totally. Toy demos hide the cost of coordination because they skip the ugly parts like permission boundaries, retries, and state drift. That is why Code Mode felt interesting to me too. It pushes the prompt out of the magic zone and into something closer to interface design under pressure.

Vic Chen • May 15

Yeah, and that separation-of-concerns angle feels underrated for agent workflows. Once the agent only sees the relation structure it actually needs, the schema becomes a working interface instead of a dumping ground for every policy and implementation detail. That usually makes both reasoning quality and failure analysis much cleaner.

I have run into the same thing with 13F pipelines too. The moment the model has to carry filing structure, entity normalization, and downstream business rules all in one blob, context quality drops fast.

Andrii Krugliak • May 17

Code mode beats tool mode the moment your agent needs to combine three calls. Ran an agent last week that had 14 tools registered, and the model spent 40 percent of its tokens describing what it might do instead of doing it. Wrapping the same 14 as composable functions cut the planning overhead almost completely. The MCP bloat isn't the protocol, it's the cardinality.
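A rough sketch of what "composable functions" can look like: three lookups that would otherwise be three tool round-trips are chained in one generated snippet. The data, the function names, and the in-memory "API" here are all hypothetical stand-ins.

```typescript
// Hypothetical in-memory data standing in for separate API endpoints.
interface User { id: number; email: string }
interface Order { userId: number; amountCents: number; status: "paid" | "refunded" }

const users: User[] = [
  { id: 1, email: "ada@example.com" },
  { id: 2, email: "lin@example.com" },
];
const orders: Order[] = [
  { userId: 1, amountCents: 1200, status: "paid" },
  { userId: 1, amountCents: 800, status: "refunded" },
  { userId: 1, amountCents: 500, status: "paid" },
  { userId: 2, amountCents: 300, status: "paid" },
];

const findUser = (email: string): User | undefined => users.find((u) => u.email === email);
const listOrders = (userId: number): Order[] => orders.filter((o) => o.userId === userId);
const sumPaid = (list: Order[]): number =>
  list.filter((o) => o.status === "paid").reduce((sum, o) => sum + o.amountCents, 0);

// The kind of snippet a model could emit in code mode: three calls chained in one
// execution, so intermediate results never pass through the context window.
function paidTotalFor(email: string): number | undefined {
  const user = findUser(email);
  return user ? sumPaid(listOrders(user.id)) : undefined;
}
```

Only the final number goes back to the model, which is where the planning and token savings would come from in this sketch.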

Raju Dandigam • May 15

This is a very practical framing of the MCP problem. The issue is not whether MCP is useful, but how quickly tool surfaces become too large for agents to reason over safely. I have been exploring this same theme from the TypeScript production-architecture side in github.com/rajudandigam/Ultimate-T..., where the focus is on agents, workflows, MCP-style integrations, guardrails, evals, and real-world project blueprints. I think readers who are hitting MCP/tool-bloat problems may find the catalog useful as a way to compare patterns before committing to one architecture. Happy to hear what patterns you think should be added around Code Mode or tool simplification.

NEXADiag Nexa • May 15

The collapse from 2,500 endpoints to 2 tools is the right move, but the harder problem is verifying the agent's reasoning chain didn't silently drop critical context during compression.

I call this Consensus Illusion: when an agent confidently picks a path because the compressed tool description happened to match its prior, not because the underlying data agrees. Single-model agent reasoning hides this. Cross-checking the same task with a second model on the same compressed surface usually surfaces it.

Have you observed compression bias when reducing tool surface? Or does ZenStack instrument the agent's decision path to catch it?

Harjot Singh • May 31

Collapsing 2,500 endpoints into two tools is the right move, and it gets at the deepest problem with MCP as it's commonly used: every tool you register front-loads its schema into the context window before the agent does anything. A big tool catalog is a tax you pay on every single call, both in tokens and in the model getting distracted choosing among options it doesn't need. Code mode flips that: instead of exposing N pre-baked tools, you give the agent a way to compose calls, so the capability stays infinite while the context cost stays near-constant. That's the same reason a shell beats forty narrow tools for composable work.

The MCP-vs-Agent-Skills framing is a bit of a false binary, though. They answer different questions (a typed contract with permissions vs. a reusable capability bundle), and the real takeaway isn't that one killed the other; it's to stop exposing everything as a static tool just because you can. Route capability to the mechanism whose context-and-trust cost fits. That collapse-the-surface, keep-context-lean instinct is core to how I think about agent tooling in Moonshift.

With two tools over 2,500 endpoints, how do you keep the permission boundary? Does code mode reintroduce the "agent can call anything" risk that narrow tools avoided?

JS ZenStack • Jun 1

Good question. I think there are two layers to this:

1. Code execution sandbox
The generated code runs inside a Cloudflare Worker with no file system access, no leaked environment variables, and external fetches disabled by default. Outbound requests must be explicitly opted into via outbound fetch handlers.

2. Data access enforcement
Every query the agent makes is guarded by the access control policies defined in your ZModel schema. The agent may be able to call any endpoint, but it can only see or modify data the current user is permitted to access, and that policy is centralized and can't be bypassed by choosing a different API path. This is exactly where ZenStack shines.

So while code mode does widen the callable surface, the meaningful risk — unauthorized data access — stays tightly controlled.
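The data access layer can be pictured with a small sketch in which every read passes through one policy check, whatever filter the generated code supplies. The policy, types, and `readPosts` function are hypothetical illustrations and are not how ZenStack implements its access policies.

```typescript
interface User { id: number; role: "admin" | "member" }
interface Post { id: number; authorId: number; published: boolean; title: string }

// Hypothetical read policy: admins see everything; members see published
// posts and their own drafts.
function canRead(user: User, post: Post): boolean {
  return user.role === "admin" || post.published || post.authorId === user.id;
}

// Every read goes through this one function, so the policy applies no matter
// which filter the generated code passes in.
function readPosts(user: User, all: Post[], where: (post: Post) => boolean = () => true): Post[] {
  return all.filter((post) => canRead(user, post) && where(post));
}
```

Even a "select everything" filter returns only rows the current user may read, which is the property that keeps the wider callable surface from becoming wider data access.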

Harun Mahmud • May 13

The Cloudflare "Code Mode" approach is genuinely clever — collapsing 2,500 endpoints into two tools and ~1,000 tokens is the kind of solution that feels obvious in hindsight but takes real insight to reach.
The schema + check + execute split makes a lot of sense for ZenStack specifically. Giving the LLM the full schema upfront is a smart trade-off — the schema is manageable in size but gives the model everything it needs to write correct nested queries without back-and-forth.
One thing I'm curious about: for applications where the schema itself is very large (say 100+ models), does the schema tool start hitting the same context pressure again? Or is the schema text always compact enough to stay under control even at that scale?

JS ZenStack • May 14

Yes, you are right. The context grows with the schema. But as you said, I think it’s under control. The largest schema I’m aware of among ZenStack users is around 200k, so I expect context windows to keep growing faster than schemas do.

However, even in the extreme case, there is a workaround to split the single large schema into multiple smaller files, which is already adopted by many ZenStack users. Then we could provide a gateway that picks up only the necessary files as requested.
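A gateway of that kind might look roughly like the sketch below: given the models a task needs, it returns only the schema files that define them, plus the files they reference. The `SchemaFile` shape, the `references` field, and `pickSchemaFiles` are assumptions for illustration; no such gateway is described in the post beyond the idea itself.

```typescript
interface SchemaFile {
  path: string;
  models: string[];     // models defined in this file
  references: string[]; // models from other files that this file relates to
}

// Pick the files defining the requested models, then keep following references
// until no new file is added.
function pickSchemaFiles(files: SchemaFile[], requestedModels: string[]): string[] {
  const needed = new Set(requestedModels);
  const picked = new Set<string>();
  let changed = true;
  while (changed) {
    changed = false;
    for (const file of files) {
      if (picked.has(file.path)) continue;
      if (file.models.some((model) => needed.has(model))) {
        picked.add(file.path);
        for (const ref of file.references) needed.add(ref);
        changed = true;
      }
    }
  }
  // Preserve the original file order in the output.
  return files.filter((file) => picked.has(file.path)).map((file) => file.path);
}
```

Following references matters because a model written against `Post` alone cannot produce correct nested queries if the related `User` definition is missing from its context.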

Sam H • May 14

My AI agents are still using lots of MCP Servers (local and remote).

Mininglamp • May 14

The core problem with MCP bloat is that every tool definition eats context budget before the model even starts reasoning. With 20+ tools registered, easily 3-4K tokens gone just on schema descriptions. Code mode is one approach, but there's a simpler architectural fix: lazy tool loading. Only inject tool schemas that match the current task intent, determined by a lightweight classifier on the user prompt. Goes from 20 tools always loaded to 3-5 relevant ones per turn. Context savings compound fast in multi-turn sessions. The other pattern that helps: collapsing CRUD-style tools into a single generic tool with a structured action parameter, rather than separate create/read/update/delete endpoints.
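Lazy tool loading as described above can be sketched as follows. Keyword overlap is only a stand-in for the "lightweight classifier"; the `ToolDef` shape and `selectTools` function are hypothetical, and a real system might use embeddings or a small model instead.

```typescript
interface ToolDef {
  name: string;
  keywords: string[]; // lowercase terms that signal the tool is relevant
}

// Score each tool by how many of its keywords appear in the prompt, then keep
// the top matches. Tools with no matching keyword are never injected.
function selectTools(prompt: string, tools: ToolDef[], limit = 5): ToolDef[] {
  const words = new Set(
    prompt.toLowerCase().split(/[^a-z0-9]+/).filter((word) => word.length > 0)
  );
  return tools
    .map((tool) => ({ tool, score: tool.keywords.filter((k) => words.has(k)).length }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.tool);
}
```

Only the schemas of the returned tools would be placed in context for that turn, so the per-turn cost tracks the task rather than the size of the full catalog.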

Artemii Amelin • May 16

The framing that MCP handles connectivity while skills handle procedural knowledge is clean and I think accurate. The gap I'd add is one layer below: MCP tells an agent what it can connect to, but not how agents reach each other across environments. Once you have an orchestrator on a VPS and skill-running agents on different machines or networks, NAT becomes the actual problem. Pilot Protocol (pilotprotocol.network) fills that piece for me, peer-to-peer networking at the session layer so agents find each other without any routing config. Works alongside MCP rather than replacing it.
