Ryosuke Tsuji

Cutting Self-Built MCP Server Token Usage by 90% — The Parking Pattern

Hi, I'm Tsuji, CTO at airCloset.

In my previous posts I introduced the full picture of our 17 internal MCP servers, an MCP server that lets you search 991 internal tables in natural language, a Graph RAG MCP for measuring initiative impact, and the Sandbox MCP that lets non-engineers publish AI-built apps safely.

This time I want to share something that came out of running those in production — a small trick we use to cut token consumption on self-built MCP servers.

The Annoyance: MCPs Eat More Tokens Than You'd Think

The first surprise when extending an AI agent with MCP is that token consumption is higher than expected.

An MCP tool call is, at the end of the day, JSON-RPC (over stdio or HTTP, depending on the transport). Both the arguments the AI sends and the result the tool returns land directly in the conversation context. If you implement things naively:

  • Sending whole files as arguments → thousands of lines of source code stick to the context
  • Returning all DB query rows → a multi-thousand-row × multi-column table sticks to the context

A single tool call can easily consume tens of thousands of tokens, putting the Claude Code session straight into compaction.

It's worse than just inefficiency: above a certain row count, the response simply fails to come back at all because it exceeds MCP's payload size limit.
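
For concreteness, here is roughly what a naive query tool puts on the wire — a standard MCP tools/call result whose single text block carries every row verbatim (values abridged here), all of which lands in the conversation context:

{
  "jsonrpc": "2.0",
  "id": 7,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "[{\"id\":1,\"email\":\"...\"}, {\"id\":2,\"email\":\"...\"}, ...thousands more rows...]"
      }
    ]
  }
}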

(Figure: a naive implementation bloats the context)

When we were ramping up our internal MCP fleet, this mismatch — bulk data squeezed through a context-bound channel — was reliably making the tool experience worse.

The Pattern: Park the Big Stuff Elsewhere, Pass Only a Key

The fix is embarrassingly simple:

Take the parts that tend to grow and move them off the MCP wire. Pass only a reference key (or URL) through MCP itself.

Both the request side and the response side benefit from the same idea.

Direction   What to remove                    Where to park it
Request     Large files / source code         GitHub, Drive, or any object store
Response    Large list data / query results   Spreadsheet / GCS / BigQuery

(Figure: the parking pattern)
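
Stripped to its core, the response side of the pattern is a single guard. A minimal sketch in TypeScript, where park is a hypothetical stand-in for "write to Spreadsheet / GCS / wherever and return the URL":

type Parked = { url: string; rows: number };

// Small results go back inline; big ones get parked and only a
// reference crosses the MCP wire.
async function returnOrPark<T>(
  rows: T[],
  park: (rows: T[]) => Promise<string>, // hypothetical: writes data, returns URL
  limit = 500,
): Promise<T[] | Parked> {
  if (rows.length <= limit) return rows;               // small: inline is fine
  return { url: await park(rows), rows: rows.length }; // big: pass only a key
}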

Two examples from airCloset.

Example 1: Lighter Requests — Sandbox MCP × Self-Hosted Git Server

Last time I wrote about Sandbox MCP, the platform that lets non-engineers publish AI-built apps internally. The first iteration did file uploads entirely through MCP tool calls:

sandbox_write_file(app_name: "todo-app", path: "index.html", content: "<html>...")
sandbox_write_file(app_name: "todo-app", path: "app.js", content: "import ...")
sandbox_publish(app_name: "todo-app")

The moment apps got slightly bigger, this collapsed:

  • Constant chunking: hitting the payload size limit, the AI looped through "first half of file A → second half → first half of file B → ..."
  • Tokens going up in flames: full source code landed in the conversation context — a single deploy of a few-thousand-line app could burn tens of thousands of tokens
  • Retries made it worse: the AI would "verify after sending" by re-reading the same file with sandbox_read_file. Write → read → write loops

So we changed the contract: MCP only returns a URL; the actual content moves over git push.

# 1. MCP returns a git URL — no payload involved
sandbox_init_repo(app_name: "todo-app")
# → https://mcp-sandbox.example.com/git/sandbox/ryan/todo-app.git

# 2. AI runs git in the background — MCP isn't involved
git init && git add . && git commit -m "init"
git remote add sandbox <returned URL>
git push sandbox main

# 3. Only the deploy command goes through MCP
sandbox_publish(app_name: "todo-app")

git push gives us:

  • No file size limit
  • Differential transfer — second-time pushes are fast
  • Source code never lands in the MCP conversation context

From the AI's point of view, it's just "I got handed a git URL; I push to it." Fundamentally different in token economics.
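
Server-side, the tool shrinks to almost nothing. A minimal sketch with the official TypeScript MCP SDK, where ensureRepo is a hypothetical stand-in for whatever provisions a bare repo on your git server:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// Hypothetical helper: ensure a bare repo exists on the self-hosted git
// server and return its clone URL, e.g.
// https://mcp-sandbox.example.com/git/sandbox/ryan/todo-app.git
declare function ensureRepo(appName: string): Promise<string>;

const server = new McpServer({ name: "sandbox-mcp", version: "1.0.0" });

server.tool(
  "sandbox_init_repo",
  { app_name: z.string() },
  async ({ app_name }) => {
    const url = await ensureRepo(app_name);
    // Only this URL crosses the MCP wire; the source tree moves via `git push`.
    return { content: [{ type: "text", text: url }] };
  },
);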

By the way, we don't use GitHub Organizations here. Issuing GitHub seats for every employee wasn't worth the cost or operational overhead, and we already had a self-hosted Git Server on GCE for a different purpose, so we just added one repo (sandbox-apps). The "park" doesn't have to be something you build from scratch.

Example 2: Lighter Responses — DB Graph MCP × Spreadsheet

DB Graph MCP is the MCP that lets us search and query 991 internal tables in natural language.

The annoying-but-common case here is "give me everything"-style queries:

SELECT * FROM service_main.user WHERE created_at >= '2026-01-01'

When the result is several thousand to tens of thousands of rows, you get either:

  • A multi-million-token response that triggers immediate session compaction
  • An MCP error because the payload exceeds the size limit

Or both. The "right" AI behavior is to do LIMIT 100 and analyze a sample — but if the user actually wanted the full list as a CSV, that doesn't help them.

So we built an "export to spreadsheet, return only the URL" mode into DB Graph MCP. You can opt in explicitly, but the MCP also auto-falls back to this mode whenever the result exceeds a row-count threshold. Even if the AI forgets to add a LIMIT and the query is about to return 10,000 rows, the server decides "this is too big to return inline," exports to a spreadsheet, and hands back the URL.

// Conceptual call (the real shape is documented in the tool description)
sql_query_database({
  query: "SELECT * FROM ...",
  output: "spreadsheet"  // ← explicit export mode
})

// Without `output`, the server still auto-falls back over a threshold (e.g. 500 rows)
sql_query_database({
  query: "SELECT * FROM ..."
})
// → server detects row count → spreadsheet export + URL response

// Either way, the response shape is the same
{
  url: "https://docs.google.com/spreadsheets/d/{...}/edit",
  rows: 12483,
  columns: ["id", "email", "created_at", ...],
  exported_reason: "row_count_exceeded"  // set on auto-fallback
}

The response is just a URL plus metadata. The real data never enters the context. "Light if you're careful" becomes "light even when you're not" — and that's what makes it feel safe in day-to-day operation.
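
A sketch of that server-side decision, with runQuery and exportToSheet as hypothetical stand-ins for the real database and Sheets plumbing:

type Row = Record<string, unknown>;

// Hypothetical stand-ins for the real DB client and spreadsheet exporter.
declare function runQuery(sql: string): Promise<Row[]>;
declare function exportToSheet(rows: Row[]): Promise<string>; // returns the sheet URL

const ROW_THRESHOLD = 500; // example threshold from above

async function sqlQueryDatabase(query: string, output?: "spreadsheet") {
  const rows = await runQuery(query);
  if (output === "spreadsheet" || rows.length > ROW_THRESHOLD) {
    return {
      url: await exportToSheet(rows),
      rows: rows.length,
      columns: rows[0] ? Object.keys(rows[0]) : [],
      // only set when the server decided on its own
      exported_reason: output ? undefined : "row_count_exceeded",
    };
  }
  return { rows }; // small enough to return inline
}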

This pattern works because a surprisingly large fraction of real use cases are just "I want this data somewhere I can use it later" — not "let's analyze this in chat with AI." Things like:

  • Save it to a spreadsheet I can stare at later
  • Share it with another team
  • VLOOKUP it against another sheet

For those, MCP's job ends at "write the query, drop the result somewhere." That's enough.

If the user genuinely does want AI-side analysis, you do still need the data in context. The standard workflow becomes a two-step: LIMIT 100 for sample analysis, then output: spreadsheet for the full export once the conclusion is clear.
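
In the same conceptual notation as above, the two-step looks like this:

// 1. Sample in-context — cheap, and enough to reach a conclusion
sql_query_database({ query: "SELECT * FROM ... LIMIT 100" })

// 2. Full export once the conclusion is clear
sql_query_database({ query: "SELECT * FROM ...", output: "spreadsheet" })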

How Much Did It Save?

Every MCP we run logs every tool call. After rolling these patterns out, total token consumption across all tools dropped 70–90%.

Bonus: Google Workspace OAuth Pairs Beautifully With This

A note on choosing where to "park" data: if your MCP authenticates via Google Workspace OAuth, this whole design becomes much easier.

The reason is that you get two things from a single OAuth flow — two birds with one stone:

  1. Authentication for MCP itself — figuring out who's using the tool
  2. Authorization for Workspace apps — scoped access to Spreadsheet / Drive / Gmail / Calendar

(Figure: two birds with one stone)

Once the user has logged into the MCP, you don't have to ask for any additional permissions to write to the park location. Which means you can:

  • Use the operating user's own permissions
  • To save files to that user's My Drive
  • Without the MCP itself owning a write-anywhere service account

Files end up in the user's drive, not on a shared service account. "Accidentally world-readable" or "visible to people who shouldn't see it" stops being a realistic accident — it's structurally prevented.

You also dodge the operational cost of issuing a separate GCP service account, storing its key safely, and managing its IAM policy out of band. The safety property genuinely comes for free.
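
As a sketch of what the export code can look like under this setup — assuming the MCP server already holds the user's OAuth access token from login (granted a scope like https://www.googleapis.com/auth/drive.file), and using the googleapis Node client:

import { google } from "googleapis";

// Create the spreadsheet *as the operating user*, so it lands in their
// My Drive with their permissions — no service account involved.
async function exportToUserSheet(accessToken: string, values: string[][]) {
  const auth = new google.auth.OAuth2();
  auth.setCredentials({ access_token: accessToken });
  const sheets = google.sheets({ version: "v4", auth });

  const created = await sheets.spreadsheets.create({
    requestBody: { properties: { title: "DB Graph export" } },
  });
  await sheets.spreadsheets.values.update({
    spreadsheetId: created.data.spreadsheetId!,
    range: "A1",
    valueInputOption: "RAW",
    requestBody: { values },
  });
  return created.data.spreadsheetUrl!; // this URL is what the MCP returns
}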

There's one catch though:

The AI agent has to be able to read the spreadsheet behind the URL it got back.

Returning a URL alone doesn't give the AI access to the underlying data. Stock tooling in Claude Code can't read a Google Spreadsheet directly, so you need a separate Workspace-operating MCP.

At airCloset we run a dedicated MCP that wraps the Google Workspace APIs (Drive / Sheets / Gmail / Calendar). Combined with the export pattern above, it gives us a clean flow: "drop results into a spreadsheet → call into the Workspace MCP later if the AI wants to actually read them."

DB Graph MCP → exports to Spreadsheet → returns URL
                                          ↓
              Workspace MCP ← invoked when the AI decides it needs to read the data

From the user's side, this naturally produces the rhythm of "dump it into a spreadsheet first, ask AI to analyze only when needed."
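
Concretely, that later read is just another tool call (the tool name and shape here are illustrative, not a real API):

// Hypothetical Workspace MCP call
sheets_read_range(url: "https://docs.google.com/spreadsheets/d/{...}/edit", range: "A1:C100")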

Wrap-Up

A few small tricks for keeping self-built MCP server token consumption under control:

  • Move the parts that tend to grow off the MCP wire
  • Park them somewhere — Git server, Spreadsheet, GCS — and only pass keys/URLs through MCP
  • Pick a park that pairs well with Google Workspace OAuth — you get safety almost for free
  • If you want the AI to read parked data later, run a Workspace-style MCP alongside

It's an unflashy design move, but the difference in MCP usability before and after is dramatic.

If you're running self-built MCP servers internally and feeling the token squeeze, give it a try.


At airCloset, we're looking for engineers who want to build a new development experience together with AI. If you're interested, please check out our careers page at airCloset Quest.
