Scott Lepper

Posted on May 19

Build MCP Servers that don't suck...tokens.

#ai #mcp #typescript #opensource

First-generation MCP servers were great. They gave AI agents access to external apps and data: Jira, Confluence, GitHub, Linear, you name it. But most of them just wrapped REST APIs, and that causes context bloat, hallucinations, and wasted tokens.

Combining a few strategies from the ultra-mcp-toolkit, you can reduce that bloat dramatically — and save money.

Generating a cost-efficient MCP server is easy. Just install the skill and off you go.

Here's what "dramatically" looks like

Real benchmark, live Jira instance, reproducible:

Per-call response size

scenario	naive	with toolkit	savings
fetch 1 simple ticket	20.3KB	1.2KB	17.5×
investigate rich ticket	270.7KB	15.5KB	17.5×
JQL search ~10 tickets	20.5KB	3.5KB	5.8×

That rich-ticket row is the one that hurts. 270 KB → 15.5 KB. ~67k tokens down to ~3.9k tokens. Same content; the full payload still lands on disk and the agent can fetch it via a ref: path only if it actually needs the detail.
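The byte and token figures here line up with a rough four-bytes-per-token rule of thumb. The post doesn't say how tokens were counted, so the divisor below is an assumption, and `estimateTokens` is a hypothetical helper rather than anything from the toolkit:

```typescript
// Rough token estimate from payload size.
// Assumption: ~4 bytes per token; real tokenizers vary by model and content.
const BYTES_PER_TOKEN = 4;

function estimateTokens(bytes: number): number {
  return Math.round(bytes / BYTES_PER_TOKEN);
}

// Rich-ticket sizes from the table above, treating 1 KB as 1000 bytes.
const naiveTokens = estimateTokens(270_700);   // 67675
const trimmedTokens = estimateTokens(15_500);  // 3875
console.log(naiveTokens, trimmedTokens);
```

Good enough for back-of-the-envelope budgeting; for billing-grade numbers you'd want the model's actual tokenizer.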

Tool-list cost (paid every conversation)

approach	bytes	~tokens	savings
naive (one tool per op)	38.9KB	9,947	1×
consolidated tools	25.1KB	6,427	1.5×
consolidated + filtered	~6 KB	~1,600	5×
code-api mode	401B	100	99×

You read that right. Tool listings drop from ~10k tokens to ~100 tokens. On every. single. conversation.

Why MCP servers leak tokens

Four anti-patterns show up almost everywhere:

Returning raw API JSON. A Jira issue carries iconUrls, nested self URLs, schema metadata, expand hints, three different shapes of the same status field. The agent needs none of it.
One MCP tool per endpoint. A typical CRM has ~80 endpoints → 80 tool descriptions in the listing → ~10k tokens before the user types anything.
Asking the LLM to filter or paginate. The model can't reliably page through huge structures, and the chunking logic itself costs tokens. Filtering belongs server-side.
No discipline on what gets kept. Denylist trimming (delete result.iconUrl) silently breaks the day the API adds a new noisy field. Allowlists keep the contract stable.
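To make that last anti-pattern concrete, here's a minimal sketch of what happens to each style when the upstream API grows a field. The payload and the helper names (`denylistTrim`, `allowlistTrim`) are made up for illustration; `avatarUrls` stands in for a field added after the trims were written:

```typescript
type Json = Record<string, unknown>;

// Denylist: remove the fields known to be noisy today.
function denylistTrim(raw: Json, drop: string[]): Json {
  const out: Json = { ...raw };
  for (const key of drop) delete out[key];
  return out;
}

// Allowlist: copy only the fields the agent is meant to see.
function allowlistTrim(raw: Json, keep: string[]): Json {
  const out: Json = {};
  for (const key of keep) {
    if (key in raw) out[key] = raw[key];
  }
  return out;
}

// Hypothetical payload, not a real Jira response.
const response: Json = {
  summary: "Login fails",
  status: "Open",
  iconUrl: "https://example.invalid/icon.png",
  avatarUrls: { "16x16": "https://example.invalid/a.png" },
};

const viaDenylist = denylistTrim(response, ["iconUrl"]);
const viaAllowlist = allowlistTrim(response, ["summary", "status"]);

console.log(Object.keys(viaDenylist));  // summary, status, avatarUrls (the new field leaks)
console.log(Object.keys(viaAllowlist)); // summary, status
```

The denylist version keeps working, which is exactly the problem: nothing fails, the response just gets fatter.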

The fix, in three strategies

1. Allowlist-style trim projections

import { pick } from "ultra-mcp-toolkit/trim";

// Allowlist projection: only the listed fields reach the model.
const issueSummary = (raw: unknown) => {
  const r = raw as { key: string; fields: Record<string, unknown> };
  return {
    key: r.key,
    ...pick(r.fields, ["summary", "status", "priority", "assignee"]),
  };
};

Register the trim once. Every response routes through it. New API fields default to dropped. The model sees what it needs; the full response lives on disk behind a ref: path that the agent can dereference on demand.
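The "register once, route everything" idea can be sketched roughly like this. Treat it as an illustration only: `registerTrim`, `respond`, and `pickKeys` are hypothetical names, and the toolkit's real registry API may look quite different.

```typescript
type Trim = (raw: unknown) => unknown;

// Hypothetical registry keyed by operation name.
const trims = new Map<string, Trim>();

function registerTrim(operation: string, trim: Trim): void {
  trims.set(operation, trim);
}

// Stand-in for an allowlist helper: copies only the listed keys.
function pickKeys(obj: Record<string, unknown>, keys: string[]): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const k of keys) {
    if (k in obj) out[k] = obj[k];
  }
  return out;
}

// Design choice in this sketch: an operation with no registered trim
// returns an empty object instead of leaking the raw payload.
function respond(operation: string, raw: unknown): unknown {
  const trim = trims.get(operation);
  return trim ? trim(raw) : {};
}

registerTrim("issue.get", (raw) => {
  const r = raw as { key: string; fields: Record<string, unknown> };
  return { key: r.key, ...pickKeys(r.fields, ["summary", "status"]) };
});

// Hypothetical payload with some noise mixed in.
const trimmedIssue = respond("issue.get", {
  key: "PROJ-1",
  self: "https://example.invalid/rest/api/issue/1",
  fields: { summary: "Login fails", status: "Open", iconUrl: "x" },
});
console.log(JSON.stringify(trimmedIssue));
// {"key":"PROJ-1","summary":"Login fails","status":"Open"}
```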

2. Consolidated tools (action-discriminated)

Instead of 80 tools, expose ~15 — each taking an action arg:

{ action: "get", issueIdOrKey: "PROJ-1" }
{ action: "create", projectKey: "PROJ", summary: "..." }
{ action: "transition", issueIdOrKey: "PROJ-1", transition: "Done" }

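Same operations at a fraction of the tool-list cost: about 1.5× smaller from consolidation alone, and about 5× once the listing is also filtered. The toolkit's dispatcher handles per-action Zod validation, manifest routing, and a full: true escape hatch when the model genuinely needs the raw response.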
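Here's a minimal, dependency-free sketch of action-discriminated dispatch for the three calls above. It swaps in a hand-rolled required-field check where the toolkit uses Zod, and `parseIssueAction` and `routeIssueAction` are hypothetical names, so read it as the shape of the idea rather than the toolkit's dispatcher:

```typescript
// One consolidated "issue" tool; the action field selects the operation.
type IssueAction =
  | { action: "get"; issueIdOrKey: string }
  | { action: "create"; projectKey: string; summary: string }
  | { action: "transition"; issueIdOrKey: string; transition: string };

// Required string fields per action (a stand-in for real schema validation).
const requiredFields: Record<IssueAction["action"], string[]> = {
  get: ["issueIdOrKey"],
  create: ["projectKey", "summary"],
  transition: ["issueIdOrKey", "transition"],
};

function parseIssueAction(input: unknown): IssueAction {
  if (typeof input !== "object" || input === null) {
    throw new Error("arguments must be an object");
  }
  const args = input as Record<string, unknown>;
  const action = args.action;
  // Own-property check so inherited names like "toString" are rejected.
  if (
    typeof action !== "string" ||
    !Object.prototype.hasOwnProperty.call(requiredFields, action)
  ) {
    throw new Error(`unknown action: ${String(action)}`);
  }
  for (const field of requiredFields[action as IssueAction["action"]]) {
    if (typeof args[field] !== "string") {
      throw new Error(`${action} requires string field "${field}"`);
    }
  }
  return args as unknown as IssueAction;
}

// Route a validated call; a real server would invoke the API here.
function routeIssueAction(call: IssueAction): string {
  switch (call.action) {
    case "get":
      return `GET issue ${call.issueIdOrKey}`;
    case "create":
      return `CREATE in ${call.projectKey}: ${call.summary}`;
    case "transition":
      return `TRANSITION ${call.issueIdOrKey} -> ${call.transition}`;
  }
}

console.log(routeIssueAction(parseIssueAction({ action: "get", issueIdOrKey: "PROJ-1" })));
// GET issue PROJ-1
```

The model only ever sees one tool description with an action enum, instead of one description per operation.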

3. Code-api mode (the 99× lever)

Expose a single MCP tool that hands the agent a path to a bundled CLI plus a socket address:

node <cli-path> issue.get --issueIdOrKey=PROJ-1
# stdout: trimmed summary as JSON
# final line: `ref: /path/to/full-response.json`

The agent drives the whole API from its shell. Tool list stays at one tool forever, no matter how many operations exist. For shell-capable agents (Claude Code, Cursor, anything with bash), it's a pure win.
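If you're scripting against that output yourself, splitting the summary from the trailing ref line is simple. A small sketch, assuming the format shown above (JSON on stdout, then an optional final `ref: <path>` line); `parseCliOutput` is a hypothetical helper, not part of the toolkit:

```typescript
interface CliResult {
  summary: unknown;    // parsed trimmed JSON
  ref: string | null;  // path to the full response, if one was printed
}

function parseCliOutput(stdout: string): CliResult {
  const lines = stdout.trimEnd().split("\n");
  const last = lines[lines.length - 1] ?? "";
  let ref: string | null = null;
  if (last.startsWith("ref: ")) {
    ref = last.slice("ref: ".length).trim();
    lines.pop(); // everything before the ref line is the JSON summary
  }
  // Throws if the remaining text is not valid JSON.
  return { summary: JSON.parse(lines.join("\n")), ref };
}

// Sample output in the assumed format; the path is a placeholder.
const sample = '{"key":"PROJ-1","summary":"Login fails"}\nref: /path/to/full-response.json\n';
const parsed = parseCliOutput(sample);
console.log(parsed.ref); // /path/to/full-response.json
```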

Quick start

npm install ultra-mcp-toolkit

The toolkit ships a Claude Code skill that auto-loads when you work on an MCP server. Install it:

npm run install-skill

That's it. The skill walks the agent through manifest design, trim projections, dispatcher wiring, and server boot — the patterns that produce the numbers above.

Working from a non-Claude agent (Codex CLI, Cursor, Aider, Continue, Zed)? Point it at the skill markdown directly — AGENTS.md shows you how.

What's in the box

Operation manifest — declare endpoints as pure data; powers MCP tools, CLI, and code-api bridge from one source of truth.
Trim registry — type-safe allowlist projections.
Content-addressed sandbox — full responses land on disk; the model sees a ref: only.
Page cache — versioned-id disk cache for stable keys (PR diffs by SHA, Confluence pages by version).
Pooled retry-aware HTTP transport — undici + 429-aware retry honoring Retry-After.
Atomic streaming downloads — sha256-verified, path-traversal-safe.
Consolidated tool dispatcher — Zod-validated, action-discriminated.
CLI scaffolding — bridge mode + direct mode, free with createCli.
Bundled Claude Code skill — installs in one command.
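On the 429-aware retry piece: the interesting bit is interpreting Retry-After, which can be either a number of seconds or an HTTP date. The sketch below covers only that header math under those assumptions; `retryAfterMs` and its fallback default are hypothetical, and the toolkit's transport isn't shown here.

```typescript
// Convert a Retry-After header value into a delay in milliseconds.
// The header is either delta-seconds ("120") or an HTTP-date.
function retryAfterMs(header: string | null, nowMs: number, fallbackMs = 1000): number {
  if (header === null) return fallbackMs;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs; // unparseable: use the fallback
  return Math.max(0, dateMs - nowMs);          // never wait a negative time
}

console.log(retryAfterMs("120", Date.now())); // 120000
console.log(retryAfterMs(null, Date.now()));  // 1000
```

A real retry loop would wait that long (usually with a cap and some jitter) before reissuing the request.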

Production proof

Used in ultra-jira-mcp and ultra-bitbucket-mcp. The benchmark numbers above come from the Jira server running against a real Jira Cloud instance — every byte measured is one a production agent would actually receive.

If you're building an MCP server for any enterprise API — Jira, Confluence, GitHub, Linear, Notion, ServiceNow, Salesforce, whatever — and your token bill or context window is starting to bite, give it a try.

⭐ github.com/scottlepp/ultra-mcp-toolkit — issues, PRs, and benchmark contributions welcome.

What's the most token-bloated MCP server you've shipped or seen? Drop it in the comments — I'm collecting horror stories.

Top comments (1)

Gilder Miller • May 19

A Salesforce MCP wrapper I saw returned the full object schema on every tool call. We're talking 80KB responses for a simple get contact by ID query. The agent spent half its context window just parsing metadata it never used.
The allowlist projection approach is the right call. Most APIs return way more than any agent actually needs. 🫠