DEV Community

Cover image for How to Save Bloated MCP with Code Mode
JS for ZenStack

Posted on • Edited on • Originally published at zenstack.dev

How to Save Bloated MCP with Code Mode

Collapsing 2,500 endpoints into two tools

Is MCP Dead Because of Agent Skills?

It sometimes feels like AI is not just disrupting the old world, like SaaS, but also consuming its own children in the new one. Less than a year ago, the Model Context Protocol (MCP) was the industry's "golden child", with every vendor and platform scrambling to integrate it. Then, just 6 months later, the spotlight shifted again—this time to Agent Skills, the younger sibling that seems to have stolen much of the thunder from MCP.

Suddenly, "MCP is dead" is all over social media, just like "SaaS is dead" before. However, this is a typical social media's nature to use extreme words to trigger emotions and manufacture virality. Let's first hear what their parents, Anthropic, said when they donated MCP to the Linux Foundation and released Agent Skills as an open standard at the end of 2025:

We've also seen how complementary skills and MCP servers are. MCP provides secure connectivity to external software and data, while skills provide the procedural knowledge for using those tools effectively. Partners who've invested in strong MCP integrations were a natural starting point.

— Mahesh Murag, Product manager at Anthropic

Here's the real-world industry experience on how that actually works:

Skills vs MCP tools for agents: when to use what

If you want to explore more about the double-edged sword of the MCP, here is a good article that got hyped on HackerNews:

@chrlschn - MCP is Dead; Long Live MCP!

Understanding the social media zeitgeist around CLIs and the premature death of MCP

favicon chrlschn.dev

I agree with the author's point of view that the remote http MCP server is actually the game-changer and will be a key linchpin in organizational and enterprise adoption, shifting from vibe-coding to agentic engineering.

As a consequence, the built-in support for OAuth is a good feature that was ever added, if not the best. This is definitely one of the most important things for organizational and enterprise adaptation, which could be well illustrated by the example below in the above article:

An engineer leaves your team? Revoke their OAuth token and access to the MCP server; they never had access to other keys and secrets to start with.

That's why I also wrote a post to demonstrate how to create an HTTP MCP server with OAuth support from scratch to enable AI access to the database safely:

MCP Is Bloated

My blog post gained quite a few attention within the ZenStack community. Given how neat the ZenStack solution is, I expected to see a variety of MCP servers with integrated Auth released.

Unfortunately, it's never happened. I didn't realize the reason until the following GitHub issue was created by one user:

[Feature Request] Optimized Schema Generation for MCP Servers #2277

Hey there 👋

First of all, thank you for the awesome work you’re doing with ZenStack — it’s been really inspiring to follow, especially your blog post about connecting databases to MCP. That article motivated me to try building an MCP server using ZenStack for access-controlled tool functions.

Context

I’ve been experimenting with integrating ZenStack into an MCP server setup to enable AI access to a complex database through controlled tool functions. The idea was to leverage ZenStack’s model-based access control to safely expose data operations like User_findFirst, etc.

However, I ran into a major limitation when working with our database, which has a very large and complex schema. When I tried adding even a single tool (e.g., User_findFirst), the MCP console threw an error indicating that the token length was around 410k, while the available token limit was only about 130k.

Investigation

After some testing, it became clear that the issue stems from the schema size — both the Zod and JSON schemas generated by ZenStack are simply too large for MCP’s context limits.

Manually defining smaller or simplified schemas for every tool would be technically possible but not practical or scalable for large projects. This makes it currently infeasible to use ZenStack with MCP for any non-trivial database schema.

Feature Request / Discussion

Would it be possible to introduce a way to generate "optimized" or "compressed" schemas specifically for MCP servers?

For example:

  • Partial schema generation: Include only fields used by the selected operation.
  • Simplified schema export: Replace detailed type definitions with minimal references or summaries.
  • Configurable schema depth: Allow developers to specify how deep related models should be expanded.
  • Schema reuse / referencing: Use shared references to avoid repetition in Zod/JSON schemas.
  • Minified JSON schema output: Optionally generate schemas in a minified format (no whitespace, reduced metadata) to significantly decrease payload size and reduce token usage during LLM interactions.

Why This Matters

Such optimization would:

  • Make ZenStack much more practical for MCP-based AI integrations.
  • Reduce token usage drastically.
  • Maintain the benefits of access control and type safety.
  • Enable developers with large or complex schemas to actually use ZenStack-powered MCP servers in production.

Summary

  • Issue: ZenStack-generated schemas are too large for MCP context limits when working with big databases.
  • Goal: Add support for schema optimization (e.g., depth limiting, partial exports, or simplified schemas).
  • Benefit: Makes ZenStack viable for MCP + AI applications at scale.

Would love to hear your thoughts on whether this could fit into ZenStack’s roadmap or if there are any existing workarounds you’d recommend.

Thanks again for all your work - it’s an amazing project! 🙌

TLDR, the context window is bloated to 400K even when the user tried to load a single tool that the MCP server provided, making it infeasible to use. This "context bloating" issue is one of the most criticized issues of MCP, and it is why people are switching to Skill instead. For example, here is what the official Playwright GitHub says about Playwright CLI vs Playwright MCP

Modern coding agents increasingly favor CLI–based workflows exposed as SKILLs over MCP because CLI invocations are more token-efficient: they avoid loading large tool schemas and verbose accessibility trees into the model context, allowing agents to act through concise, purpose-built commands.

ZenStack's approach actually amplified this issue to the greatest extent. The reason is that the tools exposed by the ZenStack MCP server are the ORM's query API, which is a superset of Prisma ORM's query API. One of its strengths is that it allows nested relation queries. This means you can traverse all the models in your entire database from a single function call, something like this:

const users = await db.user.findMany({
  select: {
    id: true,
    name: true,
    posts: {
      // relation-level filter 
      where: {
        tags: { some: { name: "typescript" } },
      },
      select: {
        id: true,
        title: true,

        tags: {
          // deep filter 
          where: { name: "typescript" },
          select: { name: true },  
        },

        comments: {
          where: {
            body: { contains: "great" },
          },
          // switch to include here — we want all Comment columns
          include: {
            author: {
              select: { id: true, name: true },
            },
          },
        },
      },
    },
  },
});
Enter fullscreen mode Exit fullscreen mode

While water can carry a boat, it can also overturn it. The JSON schema generated for even a single MCP server tool becomes bloated, encompassing the entire application's schema.

A simple fix is to limit the depth of nested relation traversals to simplify the JSON schema. However, I consider this more of a workaround than a real solution:

  • It is neither practical nor scalable, as it can only select a limited portion of the model, making it infeasible for large projects.
  • It loses the strength of the query API. What could have been accomplished with a single query now needs several request/response cycles, impacting performance and leading to higher token usage.

Advent of Code Mode MCP

One good thing in the software development world is that, whatever problem you encounter, you are not the first to face it in most cases. Even better, people are likely to share their problems and solutions. Here comes the one from Cloudflare:

Code Mode: give agents an entire API in 1,000 tokens

The Cloudflare API has over 2,500 endpoints. Exposing each one as an MCP tool would consume over 2 million tokens. With Code Mode, we collapsed all of it into two tools and roughly 1,000 tokens of context.

favicon blog.cloudflare.com

Just from the title, you can tell it's exactly what I need. 😄 The solution they adopted is to, instead of describing every operation as a separate tool, let the LLM write the code directly, which they named it as Code Mode. To be more specific, the MCP server for the entire Cloudflare API only exposes two tools: search and execute. Then the footprint stays fixed, no matter how many API endpoints exist. As a matter of result, they managed to reduce the number of input tokens used by 99.9% to around 1k tokens.

Let's see how we can use this approach to address our problem!

ZenStack's Solution

We could have used the same methods as Cloudflare to expose both search and execute tools. But we could do better than that because of the two intrinsic characteristics of ZenStack's case:

  • One model is technically connected to all the other models in the entire database. As mentioned above, this is the culprit of the context-bloating issue.
  • LLM is already quite familiar with how to write the Query API because it has encountered innumerable examples in GitHub and blog posts in its training data set, thanks to Prisma.

As a result, it provides 3 tools:

schema

Instead of providing a search tool, we would offer a schema tool that sends the entire schema to the LLM, giving it a whole picture of the application with some simple examples. Although this approach means the tool's footprint is no longer fixed, the size of the entire schema text file is manageable by most LLMs nowadays.

execute

The execute is nothing more. Actually, it's even simpler, because it would only run a fixed number of function calls corresponding to the query API, like findMany , createMany , updateMany etc. instead of arbitrary code.

check

This tool is used for LLM to check if the parameters of the function call it composed are valid before execute. This is kind of icing on the cake, because I believe the LLM could still figure out the correct parameters from the returned error message of execute . In fact, after playing around with the MCP server in Claude and getting several successful responses, I just realized it never actually called the check . Here is the response I got when I asked why he didn't call check as prompted:

You're right, and that's a fair callout. The gym:check tool description explicitly says:
"Use the 'check' tool first to validate the call, then execute it."
I skipped it entirely — which means I was running queries without type-checking them first. I got lucky that the queries happened to be valid, but that's not the right approach. The check step exists to catch invalid field names, wrong argument shapes, or unsupported operations before they hit the database.

I have to add a MUST to the prompt of execute to make sure it gets called.

You MUST use the "check" tool first to validate the query before calling this tool
Enter fullscreen mode Exit fullscreen mode

The reason for adding this check is to adhere to the software engineering principle of "High cohesion and low coupling," which benefits both people and AI. For example, execute only runs valid function calls, so any error it returns is a runtime issue that the user definitely needs to be aware of.

Test with Claude

I simulated a complex gym application with over 50 models and AI-generated mock data, and played using the Claude desktop using Sonnet 4.6. Honestly, the result is looking pretty solid.

It could generate a very complex nested query with more than 10 models involved, and the rendered result is quite good:

complex-nested-query

If the task can't be achieved within one operation, it knows to use multiple operations to get it done.

multiple-operations

The more impressive thing is that throughout my entire experiment, there was only one time that the initial function call created by LLM is incorrect, which was caught by the 'check' tool. And it simply takes another route to get the job done.

check-error

Try It for Your Own Application

Here is a simple project you can run locally:

ZenStack Code Mode MCP Server with Authorization

This repository contains a sample implementation of a remote MCP server with authentication and authorization for ZenStack v3. It demonstrates how to set up an MCP server that can be accessed remotely, with user authentication and role-based access control.

Features

  • Remote MCP server accessible with credential Authentication support
  • ZenStack query API exposed through the MCP server using Code Mode with only 3 tools to improve context efficiency

Tools

  • schema

    Get the complete zmodel schema, available query APIs, and rules for complex query patterns.

  • check

    Validate a query function call before execution. Performs TypeScript type checking to ensure the operation and arguments are valid for the given model.

  • execute

    Execute a query operation on the database. Use the "check" tool first to validate the call, then execute it.

Quick Start

  1. Install dependencies:

    pnpm install
    Enter fullscreen mode Exit fullscreen mode
  2. Start the server:

    pnpm run dev
    Enter fullscreen mode Exit fullscreen mode

Testing the

You can simply test for your own application by changing the schema.zmodel to fit your application. The lessons I have learned make me pretty sure you would encounter some issues with it. I would appreciate it if you could let me know so we could iterate on it to make it more useful.

Top comments (30)

Collapse
 
itskondrat profile image
Mykola Kondratiuk

ran into this running agents with 15+ mcp tools in context - the model wastes tokens just figuring out which tool applies before doing real work. that's the right instinct, reducing the surface at runtime rather than expanding it.

Collapse
 
void_stitch profile image
Void Stitch

Primary-source detail that sharpened this for me: Charles Chen’s March 14 MCP analysis separates local stdio MCP from remote HTTP MCP and argues the enterprise value is mostly auth+telemetry, not thin API wrappers. That lines up with your schema/check/execute split.

Question from the reliability side: do you have any before/after numbers from production usage (for example tokens per successful task, invalid-query rate after check, or manual-rewrite rate) compared with the old tool-per-model pattern? It would make the context-savings vs semantic-risk tradeoff much easier to evaluate.

Collapse
 
jiasheng profile image
JS ZenStack

That’s a good question! To be honest, I don't have those numbers. Like I mentioned in the post, there’s probably no production adoption for the old approach because the context bloating issue.

Collapse
 
void_stitch profile image
Void Stitch

Thanks for the candid answer. For production operators, how do you close the loop between USD budget reservation and token telemetry in practice? Specifically, what monthly control point proves token usage mapped back to dollars stayed within budget by tenant or workload? I am trying to understand which evidence teams actually trust when token counts and invoice totals drift.

Collapse
 
void_stitch profile image
Void Stitch

Helpful context, thanks. If you can share one concrete signal, it would sharpen this a lot: was the blocker mainly token-heavy history growth, cache read/write accounting gaps, or both in production traces? Even one rough failure pattern is useful.

Collapse
 
void_stitch profile image
Void Stitch

Thanks, that aligns with what I see. I am testing one cost-control workflow: tenant metadata breakdown plus token-class reservation before USD conversion. In your production traces, would it be more useful to track prompt, completion, cache_write, and cache_read tokens first per organizationId, then apply model USD rates after aggregation, instead of storing only USD totals? I am asking because context-bloat discussions can hide cache-ratio swings that invert apparent per-tenant spend. Curious whether you would treat that as a telemetry-layer fix or a dashboard-layer fix.

Collapse
 
void_stitch profile image
Void Stitch

Helpful context, thanks. One workflow question from the cost-control side: if you had to pick one control point, would you track token classes per organization first and convert to USD after aggregation, or reserve USD first and reconcile token drift later? I am trying to identify which method gives fewer false budget alarms in production.

Collapse
 
void_stitch profile image
Void Stitch

Helpful context, thank you. One boundary I’m testing is token-class reservation before spend: cache-write, cache-read, and uncached input/output split separately. Even a rough 1-week split would show whether USD-only reservation under-reserves. Is that split observable in your current logs?

Collapse
 
max_quimby profile image
Max Quimby

The "expose three tools, let the model write the query" pattern is the right end-state for any API whose surface area is large enough to make a flat tool list a context-window crime — databases, CRMs, anything CRUD-heavy. The 400K-token schema example isn't a corner case; it's the median large-product MCP.

The honest trade-off worth flagging: code mode moves your failure surface from schema validation (caught pre-execution) to runtime semantics (caught only when the query runs or, worse, when it runs and returns the wrong answer). The check tool helps, but it can only catch syntactic and structural issues — it can't tell you the model joined on the wrong key. So the test discipline shifts: in flat-tool MCP you mostly need schema fuzzing; in code mode you need a corpus of natural-language → expected-result-set pairs, and you have to actually execute them.

We hit one related gotcha worth mentioning: the model will happily write SELECT * when you ask for "everything." If your execute tool doesn't enforce row/column caps independently of what the model asked for, code mode reintroduces context bloat from the result side instead of the schema side. The savings only stick if execution is bounded.

Collapse
 
jiasheng profile image
JS ZenStack

That’s a very insightful response! I can tell you must have hands-on experience in this area! 😄

Regarding the trade-off you mentioned, yes, we can’t be 100% sure that the generated query is correct for what we asked in natural language. However, I think part of the challenges is inherent to the LLM itself. In this context, I believe that using a declarative schema and a type-safe query API can help the LLM achieve better accuracy than directly generating SQL. Still, I totally agree with you that further improvements are possible by modifying the workflow to bring in the human-in-the-loop.

And thanks for sharing your gotcha. That’s really a good one! I’m not sure whether the restriction on the query API has a positive effect in this case. In my experiments, the LLM always seems to try to completely rely on the query API to get the result, and only after it fails does it fall back to querying more data and filtering on its own. So I’m curious, what’s your solution for this issue?

Collapse
 
xidao profile image
Xidao

The "code mode" pattern you're describing is something I've been thinking about a lot. The fundamental tension is that MCP tools are designed to be descriptive (the model reads tool schemas and decides what to call), but for complex APIs with thousands of endpoints, the description surface area itself becomes the problem — not just for token consumption, but for the model's ability to make correct tool selection decisions.

I've seen similar behavior where models start picking the wrong tool when the schema names overlap semantically (e.g., get_user, get_user_profile, get_user_settings). Collapsing those into a single "execute" tool with a code-based interface sidesteps the selection problem entirely.

One question though: how do you handle error recovery in code mode? With individual MCP tools, if a call fails, the model gets a clear error for that specific operation and can retry or adjust. With code mode, a runtime error in the generated code could mean anything from a typo in a field name to a logic error in the workflow. Do you pass the full stack trace back to the model, or do you have some kind of error abstraction layer?

Collapse
 
jiasheng profile image
JS ZenStack

That’s a good question! In our case, the LLM is restricted to generating only the function call for the query API, rather than arbitrary code. As long as the API passes TypeScript type checking, runtime errors are rare.

Generally speaking, I think the more descriptive and self-explanatory the error message, the better the chance the LLM has of figuring out how to handle it. If you think about it, it’s essentially the same as vibe coding with an LLM.

Collapse
 
vicchen profile image
Vic Chen

Really enjoyed this take. The schema bloat / token-budget problem is one of those issues that only shows up when you move from toy demos to real production systems. I liked the way you framed code mode as a practical escape hatch instead of treating MCP as "dead". As someone building AI products around messy real-world data, the point about remote HTTP MCP + OAuth being the real enterprise unlock also resonated a lot.

Collapse
 
jiasheng profile image
JS ZenStack

I can't agree more that some issues can never be exposed through "toy demos", which is exactly why I had to write this one 😂.

Thanks for reminding me again!

Collapse
 
vicchen profile image
Vic Chen

Yeah, exactly. The toy demo trap hides the coordination cost until schemas, permissions, and retry paths all pile up. Once an MCP setup survives a few messy production loops, the design pressure gets honest fast. Curious whether Code Mode changed how you think about what belongs in the schema versus what should stay implicit in the agent.

Thread Thread
 
jiasheng profile image
JS ZenStack

You know, I’m probably biased as the ZenStack creator. I definitely wish more things could be expressed in the schema using the declarative way, which is supposed to work better for agents. 😄

I understand your points. They make me think about providing more support for Separation of Concerns For instance, in this context, the agent only needs to know the data relation structure; other information, such as access policy, could be filtered out to save context window space, as outlined in this existing GitHub issue:

Comment for #1077

ymc9 avatar
ymc9 commented on

Maybe something like "partial models" can mitigate this problem? Like:

model Post {
  id Int @id
  title String
}

model Post {
  @@allow(...)
}
Enter fullscreen mode Exit fullscreen mode

The two Post models can reside in different zmodel files and are merged during compilation.

Collapse
 
vicchen profile image
Vic Chen

Totally. Toy demos hide the cost of coordination because they skip the ugly parts like permission boundaries, retries, and state drift. That is why Code Mode felt interesting to me too. It pushes the prompt out of the magic zone and into something closer to interface design under pressure.

Collapse
 
vicchen profile image
Vic Chen

Yeah, and that separation-of-concerns angle feels underrated for agent workflows. Once the agent only sees the relation structure it actually needs, the schema becomes a working interface instead of a dumping ground for every policy and implementation detail. That usually makes both reasoning quality and failure analysis much cleaner.

I have run into the same thing with 13F pipelines too. The moment the model has to carry filing structure, entity normalization, and downstream business rules all in one blob, context quality drops fast.

Collapse
 
theuniverseson profile image
Andrii Krugliak

Code mode beats tool mode the moment your agent needs to combine three calls. Ran an agent last week that had 14 tools registered the model spent 40 percent of its tokens describing what it might do instead of doing it. Wrapping the same 14 as composable functions cut the planning overhead almost completely. The MCP bloat isn't the protocol, it's the cardinality.

Collapse
 
raju_dandigam profile image
Raju Dandigam

This is a very practical framing of the MCP problem. The issue is not whether MCP is useful, but how quickly tool surfaces become too large for agents to reason over safely. I have been exploring this same theme from the TypeScript production-architecture side in github.com/rajudandigam/Ultimate-T..., where the focus is on agents, workflows, MCP-style integrations, guardrails, evals, and real-world project blueprints. I think readers who are hitting MCP/tool-bloat problems may find the catalog useful as a way to compare patterns before committing to one architecture. Happy to hear what patterns you think should be added around Code Mode or tool simplification.

Collapse
 
nexadiag_nexa_312a4b5f603 profile image
NEXADiag Nexa

The collapse from 2,500 endpoints to 2 tools is the right move, but the harder problem is verifying the agent's reasoning chain didn't silently drop critical context during compression.

I call this Consensus Illusion: when an agent confidently picks a path because the compressed tool description happened to match its prior, not because the underlying data agrees. Single-model agent reasoning hides this. Cross-checking the same task with a second model on the same compressed surface usually surfaces it.

Have you observed compression bias when reducing tool surface? Or does ZenStack instrument the agent's decision path to catch it?

Collapse
 
harjjotsinghh profile image
Harjot Singh

Collapsing 2,500 endpoints into two tools is the right move and it gets at the deepest problem with MCP-as-it's-commonly-used: every tool you register front-loads its schema into the context window before the agent does anything, so a big tool catalog is a tax you pay on every single call, in tokens and in the model getting distracted choosing among options it doesn't need. Code mode flips it, instead of exposing N pre-baked tools, you give the agent a way to compose calls, so the capability stays infinite while the context cost stays near-constant. That's the same reason a shell beats forty narrow tools for composable work. The MCP-vs-Agent-Skills framing is a bit of a false binary though, they're answering different questions (a typed contract with permissions vs a reusable capability bundle), and the real takeaway isn't one killed the other, it's stop exposing everything as a static tool just because you can. Route capability to the mechanism whose context-and-trust cost fits. That collapse-the-surface, keep-context-lean instinct is core to how I think about agent tooling in Moonshift. With two tools over 2,500 endpoints, how do you keep the permission boundary, does code mode reintroduce the the-agent-can-call-anything risk that narrow tools avoided?

Collapse
 
jiasheng profile image
JS ZenStack

Good question. I think there are two layers to this:

  1. Code execution sandbox
    The generated code runs inside a Cloudflare Worker with no file system access, no leaked environment variables, and external fetches disabled by default. Outbound requests must be explicitly opted into via outbound fetch handlers.

  2. Data access enforcement
    Every query the agent makes is guarded by the access control policies defined in your ZModel schema. The agent may be able to call any endpoint, but it can only see or modify data the current user is permitted to access — and that policy is centralized and can't be bypassed by choosing a different API path. This is exactly where ZenStack shines.

So while code mode does widen the callable surface, the meaningful risk — unauthorized data access — stays tightly controlled.

Collapse
 
harun_mahmud_88 profile image
Harun Mahmud

The Cloudflare "Code Mode" approach is genuinely clever — collapsing 2,500 endpoints into two tools and ~1,000 tokens is the kind of solution that feels obvious in hindsight but takes real insight to reach.
The schema + check + execute split makes a lot of sense for ZenStack specifically. Giving the LLM the full schema upfront is a smart trade-off — the schema is manageable in size but gives the model everything it needs to write correct nested queries without back-and-forth.
One thing I'm curious about: for applications where the schema itself is very large (say 100+ models), does the schema tool start hitting the same context pressure again? Or is the schema text always compact enough to stay under control even at that scale?

Collapse
 
jiasheng profile image
JS ZenStack

Yes, you are right. The context grows with the schema. But as you said, I think it’s under control. The largest schema I’m aware of for the ZenStack user is around 200k. Therefore, I’m more optimistic about the growth of the context window than about the size of the schema.

However, even in the extreme case, there is a workaround to split the single large schema into multiple smaller files, which is already adopted by many ZenStack users. Then we could provide a gateway that picks up only the necessary files as requested.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.