Debby McKinney

MCP Code Mode: How We Can Cut Token Costs By Writing Fewer Prompts and More TypeScript

Every week, more MCP servers pop up. More tools. More "connect everything to your LLM" demos.

Then you actually plug 8-10 MCP servers into a real product and hit the wall:

  • Requests drag
  • Bills spike
  • The model forgets what the user asked because it's busy reading 150 tool definitions

maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

With Bifrost MCP Code Mode, we asked a simple question:

What if, instead of sending all tools to the model, we just sent three?

The Problem with MCP at Scale

When you start with MCP, it feels great. You declare tools, the LLM calls them. Hook in anything - GitHub, Notion, Google Drive, internal APIs.

The trouble starts when you go from 2-3 servers to 8-10+.

A typical setup:

  • 5-10 MCP servers (YouTube, web, Gmail, calendar, docs, internal APIs)
  • 10-30 tools per server

You quickly end up with well over 100 tools exposed to the model on every single request. Even for a simple "hi".

Three concrete problems

1. Tool definitions overload the context

Most MCP clients send on every request:

  • The user prompt
  • System instructions
  • All tool definitions (name, description, parameters, schemas) for every MCP server

Cloudflare called this out directly: most agents "use MCP by exposing the tools directly to the LLM," which means the model reads a massive JSON catalog before it even looks at the user's question.

As you add more MCP servers, this catalog dominates your prompt.
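
To make that concrete, here's a rough sketch of what the model actually receives when a handful of MCP servers are connected. The shapes below are illustrative only, not any specific client's request format:

const request = {
  model: "openai/gpt-4o-mini",
  // The actual user ask is one short line...
  messages: [{ role: "user", content: "hi" }],
  // ...but the tool catalog is repeated in full on every request:
  tools: [
    { type: "function", function: { name: "youtube_listVideos", description: "List videos for a channel", parameters: { /* full JSON schema */ } } },
    { type: "function", function: { name: "gmail_searchThreads", description: "Search Gmail threads", parameters: { /* full JSON schema */ } } },
    // ...another ~148 entries like these, each with its own schema
  ],
};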

2. Intermediate results burn tokens twice

Anthropic showed a second issue: the way we chain tools forces every intermediate result to travel back through the model even when it's just being passed from tool A to tool B.

Fetch a document, summarize it, use the summary to query another system - each step involves large data blobs going through the LLM again, burning context and latency.
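
A rough sketch of what that looks like at the message level (generic names and shapes, not a specific client's types): the fetched document lands in the conversation once, then rides along as input on every later turn.

const fullDocumentText = "<tens of thousands of tokens of report text>";

const messages = [
  { role: "user", content: "Summarize the Q3 report and file it in Notion" },
  // Turn 1: the model asks a tool to fetch the document.
  { role: "assistant", tool_call: { name: "drive_fetchDocument", arguments: { id: "doc-123" } } },
  // Turn 2: the entire document body is appended to the conversation...
  { role: "tool", content: fullDocumentText },
  // ...and from here on it is re-sent as input on every subsequent turn,
  // even though the model only needs it to hand a summary to the next tool.
];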

3. LLMs are better at code than JSON tool calls

Both Cloudflare's "Code Mode" and Anthropic's "Code execution with MCP" point at the same reality:

LLMs are very good at writing TypeScript/JavaScript against typed APIs. They're noticeably worse at emitting pristine JSON tool calls and managing multi-step workflows with dozens of separate hops.

So the current pattern is backwards:

We drown the model in tool JSON, then ask it to manually orchestrate everything using the one thing it's worse at.

What Inspired Us

We didn't invent "code-first agents for MCP." Two big pieces influenced Bifrost MCP Code Mode:

Cloudflare's Code Mode

Instead of exposing every MCP tool directly, they:

  • Expose a single code execution tool
  • Give the sandbox an environment with bindings to MCP servers
  • Let the model write TypeScript that calls those bindings directly

Result: The LLM focuses on writing code. Code runs in a sandbox and talks to MCP servers. Tool definitions no longer dominate the prompt.

Anthropic's "Code execution with MCP"

Anthropic's team described a similar approach: present MCP servers as code APIs in a filesystem, then let the model:

  • List directories
  • Read TypeScript files that describe tools
  • Write code that imports and uses those APIs in a code execution environment

This tackles both issues:

  • Agents can load only the tools they need, on demand
  • They can process data inside the execution environment before sending anything back to the model

What we were hearing from users

Bifrost users kept saying:

  • "Can Bifrost just handle all my MCP connections?"
  • "I don't want to tune tool lists per model/agent"
  • "Just give me a gateway that's fast and cheap"

We were already positioning Bifrost as a low-overhead LLM gateway with minimal latency and a full feature set (routing, observability, policies, MCP).

So the thought was:

If we're already the MCP/LLM gateway, why not bake Code Mode into the gateway itself?

That's what Bifrost MCP Code Mode is: Code Mode as a gateway feature that works across all your MCP servers.

How Bifrost MCP Code Mode Works

Say you have these MCP servers wired into Bifrost:

  • youtube
  • web
  • gmail
  • gmeet
  • notion

Each exposes ~20 tools. Normal MCP flow means ~100 tools in context on every request.

With Bifrost MCP Code Mode enabled, the model only sees three tools:

  1. mcp_listFiles
  2. mcp_readFile
  3. mcp_executeCode

Everything else is "hidden behind" those.

A virtual file system of your MCP servers

Internally, Bifrost builds a virtual file system (VFS) representing all code-mode MCP servers.

By default, server-level binding gives each MCP server a single .d.ts file:

servers/
  youtube.d.ts
  web.d.ts
  gmail.d.ts
  gmeet.d.ts
  notion.d.ts

Each *.d.ts file is a TypeScript declaration file describing that server's tools.

For example, servers/youtube.d.ts might look like:


export declare function listChannels(params: {
  search: string;
  limit?: number;
}): Promise<Channel[]>;

export declare function listVideos(params: {
  channelId: string;
  limit?: number;
}): Promise<Video[]>;

export declare function getVideoSummary(params: {
  videoId: string;
}): Promise<VideoSummary>;

The model now sees MCP tools as functions instead of opaque "tool JSON."

Alternatively, tool-level binding gets one file per tool:

servers/
  youtube/
    listChannels.d.ts
    listVideos.d.ts
    getVideoSummary.d.ts
  web/
    search.d.ts

The three generic tools

mcp_listFiles()

  • Returns a directory tree of the VFS
  • Lets the model discover which servers exist

mcp_readFile(path, fromLine, toLine)

  • Reads tool signatures for only the servers actually needed
  • Paginates through large files without blowing context

mcp_executeCode(code)

  • Runs TypeScript in a sandboxed environment
  • Provides bindings that line up with the .d.ts files

Inside the sandbox:

import * as youtube from "servers/youtube";
import * as web from "servers/web";
import * as gdocs from "servers/gdocs";

// Model writes code against these imports
// Bifrost wires them to real MCP tool calls
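
For reference, here's roughly what that three-tool surface looks like as declarations. The parameter names match the descriptions above; the return shapes are assumptions for illustration.

export declare function mcp_listFiles(): Promise<{ files: string[] }>;

export declare function mcp_readFile(
  path: string,
  fromLine: number,
  toLine: number
): Promise<{ content: string }>;

export declare function mcp_executeCode(code: string): Promise<{
  result: unknown;
  logs?: string[];
}>;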

The actual loop: what the LLM does

From the model's perspective:

1. Inspect the VFS

Call mcp_listFiles() to see what's available:


{
  "files": [
    "servers/youtube.d.ts",
    "servers/web.d.ts",
    "servers/gdocs.d.ts"
  ]
}

2. Load relevant APIs

Call mcp_readFile("servers/web.d.ts", 0, 400) to learn how to search. Call mcp_readFile("servers/youtube.d.ts", 0, 400) for YouTube APIs.

3. Write code that orchestrates everything

import * as web from "servers/web";
import * as youtube from "servers/youtube";
import * as gdocs from "servers/gdocs";

export default async function main() {
  const companyResult = await web.search({
    query: "Which company launched the Bifrost LLM Gateway?",
    limit: 1,
  });

  const companyName = companyResult[0]?.name;
  if (!companyName) return { error: "No company found" };

  const channels = await youtube.listChannels({
    search: companyName,
    limit: 1,
  });

  if (!channels.length) {
    return { error: `No channels found for ${companyName}` };
  }

  const videos = await youtube.listVideos({
    channelId: channels[0].id,
    limit: 5,
  });

  const doc = await gdocs.createDoc({
    title: `${companyName} - Latest YouTube Report`,
    data: videos.map(v => ({
      id: v.id,
      title: v.title,
      thumbnail: v.thumbnail,
    })),
  });

  return { companyName, docUrl: doc.url };
}

4. We execute the code

Bifrost runs this in a sandbox where web.search, youtube.listChannels, etc. are backed by real MCP tool calls.

The full fan-out to MCP servers happens inside the sandbox, not through dozens of LLM turns.

5. Compact result back

The model sees:

{
  "companyName": "Example Corp",
  "docUrl": "https://docs.google.com/..."
}

And answers the user naturally.

Choosing your binding level

Server-Level (Default)

  • One .d.ts per MCP server
  • Best for: moderate tool counts (5-20 per server)
  • Trade-off: larger files, simpler discovery

Tool-Level

  • One .d.ts per individual tool
  • Best for: servers with 30+ tools
  • Trade-off: more files, maximum context efficiency

Both use the same three-tool interface, so the LLM adapts automatically.

Mixing Code Mode and classic MCP

Not every server needs Code Mode.

You can:

  • Put web, youtube, gdocs, gmail into code mode
  • Keep small utilities (datetime, math) as classic tools exposed directly

The LLM sees:

  • mcp_listFiles, mcp_readFile, mcp_executeCode
  • Plus a small curated set of direct tools

Adopt Code Mode incrementally instead of all-or-nothing.
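
As a purely hypothetical sketch (the field names are illustrative, not Bifrost's actual configuration schema), the per-server split might look something like this:

// Heavy servers opt into code mode; small utilities stay as classic,
// directly exposed tools. Field names here are hypothetical.
const mcpServers = [
  { name: "web",      codeMode: true,  binding: "server" },
  { name: "youtube",  codeMode: true,  binding: "tool" },   // 30+ tools
  { name: "gdocs",    codeMode: true,  binding: "server" },
  { name: "datetime", codeMode: false },                    // classic MCP tool
  { name: "math",     codeMode: false },
];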

Example: One Workflow, Two Traces

Task: "Check which company launched Bifrost LLM Gateway and make a summary of their last 5 YouTube videos, create a Google Doc."

Assume:

  • 10 MCP servers connected
  • ~15 tools each = ~150 total tools

Normal MCP flow

Turn 1: Prompt + 150 tools → LLM calls web.search

Turn 2: Prompt + search result + 150 tools → LLM calls youtube.listChannels

Turn 3: Prompt + results + 150 tools → LLM calls youtube.listVideos

Turn 4: Prompt + results + 150 tools → LLM calls youtube.getVideoSummary 5x

Turn 5: Prompt + summaries + 150 tools → LLM calls gdocs.createDoc

Turn 6: Prompt + doc result + 150 tools → Final answer

Result:

  • 6 LLM turns
  • 150 tools in context every time
  • All intermediate results flow through the model

Bifrost Code Mode flow

Turn 1: Prompt + 3 tools → LLM calls mcp_listFiles

Turn 2: Prompt + listFiles result + 3 tools → LLM calls mcp_readFile for web, youtube, gdocs

Turn 3: Prompt + readFile results + 3 tools → LLM returns code block

We execute that code - it calls web.search, youtube.listChannels, youtube.listVideos, youtube.getVideoSummary, gdocs.createDoc inside the sandbox.

Turn 4: Prompt + code execution result + 3 tools → Final answer

Result:

  • 3-4 LLM turns
  • Only 3 tools in base context
  • Tool definitions loaded on demand
  • Intermediate results stay in execution environment

The model spends context on the task, not re-reading tool catalogs.
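
A back-of-envelope comparison makes the gap obvious. The per-tool size below is an assumed figure, not a measurement, and it ignores message content, so treat it as illustrative only:

// Assumed average size of one tool definition (name + description + schema).
const tokensPerToolDef = 300;

// Classic flow: 150 tool definitions re-sent on each of 6 turns.
const classicOverhead = 150 * tokensPerToolDef * 6;                // 270,000 tokens of tool schema

// Code Mode flow: 3 tools on each of 4 turns, plus on-demand .d.ts reads.
const onDemandDefs = 3_000;                                        // rough allowance for readFile output
const codeModeOverhead = 3 * tokensPerToolDef * 4 + onDemandDefs;  // ~6,600 tokens

console.log({ classicOverhead, codeModeOverhead });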

How It Benefits You

1. Dramatically less token overhead

You're sending only three tools up front. TypeScript definitions load on demand. Intermediate data is processed inside the sandbox.

This means:

  • Lower cost
  • More headroom for actual user context
  • Less chance of context overflow

2. Lower latency

Less prompt bloat means faster model eval.

More importantly: complex multi-step workflows collapse into a single executeCode call instead of 5-10 tool-call hops.

3. Better tool orchestration

By letting the model write code, you get normal programming features:

  • Loops over collections
  • If/else logic
  • Retries and error handling
  • Helper functions for data shaping

Cloudflare and others argue this makes agents more capable and reliable than hacking logic into prompt instructions.
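
For example, here's the kind of sandbox code the model can emit, reusing the youtube and gdocs bindings from the earlier examples (return shapes are illustrative): a retry helper plus a loop, all inside one executeCode call instead of separate tool-call hops.

import * as youtube from "servers/youtube";
import * as gdocs from "servers/gdocs";

// Small retry helper - ordinary TypeScript, no extra LLM turns needed.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

export default async function main() {
  const videos = await withRetry(() =>
    youtube.listVideos({ channelId: "UC-example", limit: 5 })
  );

  // Loop + data shaping happen inside the sandbox.
  const rows = videos.map(v => ({ id: v.id, title: v.title }));

  const doc = await gdocs.createDoc({ title: "Video report", data: rows });
  return { docUrl: doc.url };
}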

4. Gateway-level simplicity

You don't have to:

  • Build your own code-mode proxy
  • Hand-roll sandboxes
  • Maintain a separate MCP wrapper deployment

Instead:

  • Point your MCP servers at Bifrost
  • Flip Code Mode on for servers you want
  • Let the gateway handle VFS, .d.ts generation, sandbox wiring

5. Incremental adoption

Mix code-mode and classic MCP servers:

  • Start by putting "heavy" servers (web, docs, file APIs) into Code Mode
  • Keep small, trusted tools as direct calls
  • Gradually migrate more as you get comfortable

You can also:

  • Observe generated code
  • Put guardrails around sandbox permissions
  • Iterate on schemas without changing your client app

Related Work & Acknowledgments

Bifrost MCP Code Mode stands on the shoulders of:

Cloudflare – "Code Mode: the better way to use MCP"

Introduced the idea of translating MCP tools into a typed code interface, letting LLMs write TypeScript against bindings in a sandboxed environment.

Anthropic – "Code execution with MCP"

Showed how presenting MCP servers as code APIs in a filesystem with code execution can drastically reduce token usage.

What Bifrost MCP Code Mode adds:

  • Gateway-level implementation across all MCP servers connected to Bifrost
  • Three-tool interface (listFiles, readFile, executeCode) tailored to that gateway role
  • Ability to mix code-mode and classic MCP per server for gradual adoption

Try it

If you're dealing with MCP at scale, Bifrost MCP Code Mode might be worth a look.
