I routed 60 MCP tools through a single proxy — here's what I learned about token waste and security

I've been building MCP servers for Claude Desktop for a few months now. At one point I had five servers running: filesystem, GitHub, SQLite, a knowledge graph, and Brave Search. Sixty tools total, all piped into one LLM.

It worked. But three things kept going wrong.

The token problem

Every request to the model carries the full JSON schema of every available tool in the context window. Sixty tools means sixty schema definitions, every single request. I measured it: over 4,800 tokens of schema overhead per request, before Claude even starts thinking about your question.

That's money. At API rates, those wasted tokens add up fast across a workday of tool calls.
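You can ballpark this yourself. Here's a minimal sketch using the common ~4-characters-per-token heuristic (not a real tokenizer) against hypothetical tool schemas; the numbers will differ from your actual tokenizer, but the shape of the problem is the same:

```python
import json

def estimate_schema_tokens(tools, chars_per_token=4):
    """Rough token estimate for tool schemas using the common
    ~4-characters-per-token heuristic (not a real tokenizer)."""
    total_chars = sum(len(json.dumps(t, separators=(",", ":"))) for t in tools)
    return total_chars // chars_per_token

# Hypothetical tool schemas standing in for real MCP definitions.
tools = [
    {
        "name": f"tool_{i}",
        "description": "Reads a file from the workspace and returns its contents.",
        "inputSchema": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Absolute path"}},
            "required": ["path"],
        },
    }
    for i in range(60)
]

print(estimate_schema_tokens(tools))  # per-request overhead before any user text
```

Even with these toy schemas, which are far terser than real MCP tool definitions, sixty of them land in the thousands of tokens per request.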

The security problem

I found out the hard way that my claude_desktop_config.json was passing environment variables to child processes, and a bug in how I was merging env vars meant the entire process environment, tokens and API keys included, was getting passed through. One of my GitHub tokens ended up in a log file. Twice.

MCP servers run as child processes with whatever permissions your user account has. There's no audit trail, no rate limiting, no secret scrubbing. If a tool call returns sensitive data, it goes straight into the LLM context with no filtering.

The context rot problem

Claude would read a file, modify it three tool calls later, then reference the stale version from its context. The file had changed on disk but Claude was still working with the old content. I called this "context rot" — the LLM's view of the world drifts from reality over a long session.

So I built a proxy

MCP Spine sits between Claude Desktop and all your MCP servers. One proxy, one connection, all traffic flows through it.

Claude Desktop ◄──stdio──► MCP Spine ◄──stdio──► filesystem
                                      ◄──stdio──► GitHub
                                      ◄──stdio──► SQLite
                                      ◄──stdio──► memory
                                      ◄──stdio──► Brave Search

Here's what it does at each layer:

Security proxy — validates every JSON-RPC message, scrubs secrets from tool outputs (AWS keys, GitHub tokens, bearer tokens, private keys, connection strings), rate limits tool calls, blocks command injection and path traversal, and writes an HMAC-fingerprinted audit trail to SQLite.
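The scrubbing layer is conceptually simple: pattern-match tool output before it reaches the LLM context or the audit log. Here's a minimal sketch of the idea; the patterns are illustrative, not Spine's actual rule set:

```python
import re

# Illustrative patterns, not Spine's actual rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                   # AWS access key IDs
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                # GitHub personal access tokens
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]+=*"),  # Bearer tokens
]

def scrub(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace anything matching a known secret pattern before the
    tool output reaches the LLM context or the audit log."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(scrub("token: ghp_" + "a" * 36))  # token: [REDACTED]
```

The real thing needs more care (entropy checks, connection strings, private key blocks), but the principle is the same: the proxy is the one place every byte of tool output passes through, so it's the right place to filter.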

Schema minifier — strips verbose descriptions, defaults, and metadata from tool schemas before they reach the LLM. The type information and required fields stay intact. Real measured savings on 12 representative tools:

Level         Savings
0 (off)       0%
1 (light)     11%
2 (default)   32%

The best individual tool (read_file) went from 586 characters down to 242 — a 59% reduction. The savings compound: with 60 tools, Level 2 saves roughly 1,500 tokens per request.
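The minification itself is a recursive walk over the JSON Schema. This sketch assumes a drop list of description/default/examples at the parameter level (Spine's exact rules may differ) while keeping the tool name, top-level description, types, and required fields intact:

```python
import json

DROP_KEYS = {"description", "default", "examples", "title"}  # assumed drop list

def minify_schema(node):
    """Recursively strip verbose keys from a JSON Schema while keeping
    structure, types, and required fields intact."""
    if isinstance(node, dict):
        return {k: minify_schema(v) for k, v in node.items() if k not in DROP_KEYS}
    if isinstance(node, list):
        return [minify_schema(v) for v in node]
    return node

def minify_tool(tool):
    # Keep the tool's name and top-level description; minify only inputSchema.
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "inputSchema": minify_schema(tool["inputSchema"]),
    }

tool = {
    "name": "read_file",
    "description": "Read a file from disk.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path to read", "default": "."},
        },
        "required": ["path"],
    },
}
before = len(json.dumps(tool))
after = len(json.dumps(minify_tool(tool)))
print(before, after)
```

Because the output is still standard JSON Schema, any MCP client that could call the original tool can call the minified one.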

State guard — watches files on disk with SHA-256 hashes. When Claude references a file that's changed since it last read it, Spine injects a version pin into the response: "this file has changed since you last saw it." No more context rot.
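The core of the state guard fits in a few lines. This is a simplified sketch, not Spine's implementation: record a SHA-256 hash when a file is read, re-hash when it's referenced again, and flag the drift if they disagree:

```python
import hashlib
import os
import tempfile
from pathlib import Path

class StateGuard:
    """Track SHA-256 hashes of files the LLM has read; flag staleness."""

    def __init__(self):
        self._seen = {}

    @staticmethod
    def _digest(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def record_read(self, path):
        self._seen[path] = self._digest(path)

    def is_stale(self, path):
        # True when the file changed on disk since the last recorded read.
        return path in self._seen and self._seen[path] != self._digest(path)

# Demo: read a file, change it on disk, detect the drift.
fd, tmp = tempfile.mkstemp()
os.write(fd, b"version 1")
os.close(fd)

guard = StateGuard()
guard.record_read(tmp)
Path(tmp).write_bytes(b"version 2")
stale = guard.is_stale(tmp)
print(stale)  # True
os.remove(tmp)
```

When `is_stale` fires, the proxy can annotate the tool response before the LLM sees it, which is exactly the version-pin injection described above.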

Semantic router — uses local embeddings (ChromaDB + MiniLM) to figure out which tools are relevant to the current task. Instead of showing all 60 tools, it shows the 5-10 that matter. This is optional and currently experimental — the ML dependencies add startup time, so I made them lazy-loading.

What I learned building it

Environment variable handling is a minefield. The biggest bug I hit was env=self.config.env or None in the subprocess spawn. When a server config had custom env vars (like GITHUB_TOKEN), this replaced the entire process environment instead of extending it. Every server that needed a custom env var was silently missing PATH, HOME, and everything else. The fix was one line: {**os.environ, **self.config.env}. But it took hours to diagnose because the error messages were about missing executables, not missing env vars.
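Here's a minimal reproduction of that bug and the fix, with a dummy token standing in for a real secret. The buggy form replaces the whole environment; the fixed form extends it:

```python
import os
import subprocess
import sys

config_env = {"GITHUB_TOKEN": "dummy-token-for-illustration"}

# Buggy: env=config_env or None replaces the entire environment,
# so the child silently loses PATH, HOME, and everything else.
# subprocess.run(cmd, env=config_env or None)

# Fixed: extend the inherited environment with the server's custom vars.
child_env = {**os.environ, **config_env}

proc = subprocess.run(
    [sys.executable, "-c",
     "import os; print('PATH' in os.environ, os.environ['GITHUB_TOKEN'])"],
    env=child_env,
    capture_output=True,
    text=True,
)
print(proc.stdout.strip())  # True dummy-token-for-illustration
```

The symptom of the buggy version is exactly what made it hard to diagnose: the failure shows up as "executable not found" in the child, nowhere near the env-merging code.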

Windows is a different world. Python's asyncio on Windows uses a Proactor event loop that can't do connect_read_pipe / connect_write_pipe on stdio handles from piped processes. The workaround is raw binary I/O with run_in_executor for reads. I also had to handle paths with spaces and parentheses (my project lives in MCP (The Spine)), UNC paths, and the MSIX sandbox that Claude Desktop runs in.
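The fallback pattern looks roughly like this (a sketch, not Spine's exact code): push the blocking read onto the default thread-pool executor so the event loop stays responsive even though the Proactor loop can't wrap the pipe handle.

```python
import asyncio
import os

async def read_message(fd: int) -> bytes:
    """Blocking os.read pushed to the default thread-pool executor,
    the fallback when the Proactor loop can't wrap a pipe handle."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, os.read, fd, 65536)

# Demo: an os.pipe stands in for a child MCP server's stdout.
r, w = os.pipe()
os.write(w, b'{"jsonrpc": "2.0", "id": 1, "result": {}}\n')
data = asyncio.run(read_message(r))
print(data)
os.close(r)
os.close(w)
```

It's not free: every read burns a thread-pool slot while it blocks, but for one proxy multiplexing a handful of servers that's a fine trade.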

npx is slow, node is fast. Spawning MCP servers via npx @modelcontextprotocol/server-github takes 10-15 seconds because npx checks for updates every time. Switching to node C:\path\to\node_modules\...\dist\index.js connects in under a second. This matters because MCP clients have handshake timeouts.

Thread safety in audit logging is easy to get wrong. The semantic router runs a background thread for model loading. That thread calls the audit logger, which tries to use a SQLite connection created in the main thread. SQLite doesn't allow cross-thread connection sharing. Fix: check_same_thread=False plus a threading.Lock() around all DB operations.
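A minimal version of that fix, one shared connection opened with `check_same_thread=False` and every database operation guarded by a single lock:

```python
import sqlite3
import threading

class AuditLog:
    """SQLite audit logger safe to call from any thread: one shared
    connection (check_same_thread=False) guarded by a lock."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute("CREATE TABLE IF NOT EXISTS audit (event TEXT)")
            self._conn.commit()

    def log(self, event):
        with self._lock:
            self._conn.execute("INSERT INTO audit VALUES (?)", (event,))
            self._conn.commit()

    def count(self):
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]

# Demo: hammer the logger from several threads at once.
log = AuditLog()
threads = [threading.Thread(target=log.log, args=(f"call_{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log.count())  # 8
```

`check_same_thread=False` alone is not enough; it only disables sqlite3's safety check. The lock is what actually prevents two threads from interleaving statements on the shared connection.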

The numbers

Running on Windows with Python 3.14 and Claude Desktop:

  • 6 MCP servers connected through one proxy
  • 60 tools total, routed and minified
  • 32% average schema token savings (up to 59% on verbose tools)
  • 135+ tests, CI green on Windows + Linux
  • Sub-second server connections (with node direct path)

Try it

pip install mcp-spine

Configure your servers in a TOML file, point Claude Desktop at Spine, and all your MCP traffic gets security hardening, token savings, and an audit trail.

GitHub: github.com/Donnyb369/mcp-spine
PyPI: pypi.org/project/mcp-spine

It's open source, local-first, and works on Windows and Linux. No cloud, no accounts, no telemetry.


I'm an independent developer building open-source MCP tooling. If you're using MCP servers with Claude Desktop or any other LLM client, I'd love to hear what problems you're hitting. Drop a comment or open an issue on GitHub.

Top comments (2)

Global Chat:

The 4,800-token schema overhead is the part most posts skip. The proxy solves it, but only if downstream orchestrators can actually parse the aggregated schema you emit. On our own MCP server we watch LangGraph and CrewAI probe the endpoint, pull the schema once, and never place a tool call. A proxy hides the individual servers but still has to present one spec the orchestrator will accept. Did you standardize the shape your proxy emits (tool_name conventions, example invocations) or pass through whatever the upstream servers gave you? Curious whether a unified output shape moved the probe-to-invoke ratio.

Donnyb369 • Edited:

Great question. Spine passes through the exact tool schemas from upstream servers — no renaming, no reshaping. Each server's tools keep their original names and schemas, so read_file from the filesystem server stays read_file, not spine_filesystem_read_file.

The reason: Claude Desktop (and most MCP clients) already expect the standard MCP tool format — name, description, inputSchema with JSON Schema. If I reshaped them into a unified convention, I'd break compatibility with any client that knows how to call the upstream tools directly.

What the minifier does is strip the verbosity without changing the shape. Level 2 removes parameter descriptions, defaults, and metadata but keeps type, required, and the schema structure intact. The tool name and top-level description stay untouched. So the schema Claude sees is structurally identical to what the upstream server emits — just lighter.

On the probe-to-invoke ratio — I haven't tested with LangGraph or CrewAI specifically. Spine was built for Claude Desktop's stdio transport where there's a single client and no probe/retry cycle. If orchestrators are pulling the schema and then bailing, I'd guess the issue is either the tool count (60 tools is a lot to reason about) or the descriptions not being specific enough for the orchestrator to match against. The semantic router helps with that — it filters to 5-10 relevant tools per task context instead of presenting all 60 at once.

Would be curious to hear what the orchestrators are choking on in your case — is it the schema size, the tool count, or something about the format?