The problem with tool calling
Cloudflare recently published a detailed blog post about Code Mode, and their framing of the problem is worth reading.
Their core argument: LLMs have been trained on millions of real-world code examples. But tool-calling schemas? Barely any training data exists. As they put it:
"Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it."
The other issues they describe:
- Tool overload: "If you present an LLM with too many tools, or overly complex tools, it may struggle to choose the right one or to use it correctly."
- Token waste: "The output of each tool call must feed into the LLM's neural network, just to be copied over to the inputs of the next call, wasting time, energy, and tokens."
- No composition: with tool chains, there are no variables, no loops, and no error handling between steps.
The solution: let AI write code
Cloudflare's answer: convert MCP tools into TypeScript APIs. Let the agent write code that calls them.
They're not alone:
- Anthropic documented Programmatic Tool Calling
- HuggingFace built SmolAgents — code-first agents
- Pydantic built Monty — a Python subset interpreter in Rust for AI agents
Compare tool chaining vs code:
// Code: one block, full control
const tokyo = await getWeather("Tokyo");
const paris = await getWeather("Paris");
const colder = tokyo.temp < paris.temp ? "Tokyo" : "Paris";
const warmer = colder === "Tokyo" ? "Paris" : "Tokyo";
const flights = await searchFlights(colder, warmer);
const cheapFlights = flights.filter(f => f.price < 500);
With tool chaining, this needs multiple round-trips. Each result goes back through the LLM. No variables between steps. No error handling.
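For contrast, here's roughly what the same task looks like as a tool-calling transcript. The shape is illustrative, not any specific provider's wire format:

// Illustrative transcript of the same task via classic tool calling.
// Each assistant turn is a full LLM round-trip, and every tool result
// is serialized back into the model's context before the next call.
const transcript = [
  { assistant: { tool: "getWeather", args: { city: "Tokyo" } } },
  { toolResult: { temp: 8 } },   // re-tokenized into the context
  { assistant: { tool: "getWeather", args: { city: "Paris" } } },
  { toolResult: { temp: 12 } },  // re-tokenized again
  // The model must compare the two temps "in its head", then emit:
  { assistant: { tool: "searchFlights", args: { from: "Tokyo", to: "Paris" } } },
  { toolResult: { flights: [] } }, // entire flight list paid for in tokens
  // Price filtering happens by the model re-reading the JSON above.
];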
The missing piece for TypeScript
Pydantic solved the execution problem for Python with Monty. But most agent SDKs (Vercel's AI SDK, LangChain.js) live in TypeScript, and there was no equivalent there.
I adapted the same idea for TypeScript and open sourced it — Zapcode, a TypeScript interpreter written in Rust.
- Docker cold start: 200–500 ms
- V8 isolate cold start: 5–50 ms
- Zapcode cold start: 2 µs
- Snapshot size: < 2 KB
- Memory: ~10 KB per execution
Install:
npm install @unchartedfr/zapcode-ai # TypeScript
pip install zapcode-ai # Python
Use with any LLM:
import { zapcode } from "@unchartedfr/zapcode-ai";
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const { system, tools } = zapcode({
  system: "You are a travel assistant.",
  tools: {
    getWeather: {
      description: "Get current weather for a city",
      parameters: { city: { type: "string" } },
      execute: async ({ city }) => fetch(`/api/weather/${city}`).then(r => r.json()),
    },
  },
});

const { text } = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  system,
  tools,
  maxSteps: 5,
  messages: [{ role: "user", content: "Weather in Tokyo?" }],
});
LLM writes TypeScript → Zapcode sandboxes it → tool calls suspend the VM → you resolve them → execution resumes.
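If the suspend/resume flow is hard to picture, a generator captures the same control pattern. This is a conceptual sketch only, not Zapcode's internals or API; guestProgram and runWithHost are hypothetical names:

// Conceptual sketch: the "VM" yields whenever guest code reaches an
// external call; the host resolves it and resumes with the result.
type HostCall = { fn: string; args: unknown[] };

function* guestProgram(): Generator<HostCall, string, any> {
  const tokyo = yield { fn: "getWeather", args: ["Tokyo"] }; // VM suspends here
  const paris = yield { fn: "getWeather", args: ["Paris"] }; // ...and here
  return tokyo.temp < paris.temp ? "Tokyo is colder" : "Paris is colder";
}

async function runWithHost(handlers: Record<string, (...a: any[]) => Promise<any>>) {
  const vm = guestProgram();
  let step = vm.next();
  while (!step.done) {
    const { fn, args } = step.value;             // pending external call
    step = vm.next(await handlers[fn](...args)); // resolve, then resume
  }
  return step.value; // the guest program's final result
}

Driving it with runWithHost({ getWeather: async (city) => ({ temp: city === "Tokyo" ? 8 : 12 }) }) runs the guest to completion without the guest code ever touching the network itself.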
Security
Same philosophy as Monty — deny by default. No filesystem, network, env vars, eval, or imports. The only escape is external functions you explicitly register.
Zero unsafe code in Rust. 65 adversarial tests across 19 attack categories.
Trade-offs
- TypeScript subset, not full ECMAScript
- No regex execution
- No npm packages
- Experimental — APIs may change
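To make the subset concrete, here's the flavor of code the sandbox should and shouldn't accept, going by the constraints above (illustrative, not an exhaustive spec):

// Fine: plain TypeScript with async/await, objects, arrays, arithmetic.
const doubled = [1, 2, 3].map((x) => x * 2);

// Rejected, per the constraints above:
// const re = /tokyo/i;       // no regex execution
// import fs from "node:fs";  // no imports / npm packages
// eval("1 + 1");             // no eval (blocked by the sandbox)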
If this is useful
If you're building AI agents and running into the same problem, I hope this saves you some time. If your agents generate syntax Zapcode doesn't support yet, please open an issue; that's the most helpful feedback I can get.