The problem with tool calling
Cloudflare recently published a detailed blog post about Code Mode, and their framing of the problem is worth reading.
Their core argument: LLMs have been trained on millions of real-world code examples. But tool-calling schemas? Barely any training data exists. As they put it:
"Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it."
The other issues they describe:
- Tool overload: "If you present an LLM with too many tools, or overly complex tools, it may struggle to choose the right one or to use it correctly."
- Token waste: "The output of each tool call must feed into the LLM's neural network, just to be copied over to the inputs of the next call, wasting time, energy, and tokens."
- No composition: with tool chains, there are no variables, no loops, and no error handling between steps.
The solution: let AI write code
Cloudflare's answer: convert MCP tools into TypeScript APIs. Let the agent write code that calls them.
They're not alone:
- Anthropic documented Programmatic Tool Calling
- HuggingFace built SmolAgents — code-first agents
- Pydantic built Monty — a Python subset interpreter in Rust for AI agents
Compare tool chaining vs code:
// Code: one block, full control
const tokyo = await getWeather("Tokyo");
const paris = await getWeather("Paris");
const colder = tokyo.temp < paris.temp ? "Tokyo" : "Paris";
const warmer = colder === "Tokyo" ? "Paris" : "Tokyo";
const flights = await searchFlights(colder, warmer);
const cheapFlights = flights.filter(f => f.price < 500);
With tool chaining, this needs multiple round-trips. Each result goes back through the LLM. No variables between steps. No error handling.
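For contrast, here's roughly what the same task looks like as a tool-calling transcript. The shape is illustrative, not any specific provider's wire format:

// Illustrative transcript of the same task via classic tool calling.
// Each assistant turn is a full LLM round-trip, and every tool result
// is serialized back into the model's context before the next call.
const transcript = [
  { assistant: { tool: "getWeather", args: { city: "Tokyo" } } },
  { toolResult: { temp: 8 } },   // re-tokenized into the context
  { assistant: { tool: "getWeather", args: { city: "Paris" } } },
  { toolResult: { temp: 12 } },  // re-tokenized again
  // The model must compare the two temps "in its head", then emit:
  { assistant: { tool: "searchFlights", args: { from: "Tokyo", to: "Paris" } } },
  { toolResult: { flights: [] } }, // entire flight list paid for in tokens
  // Price filtering happens by the model re-reading the JSON above.
];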
The missing piece for TypeScript
Pydantic solved the execution problem for Python with Monty. But most agent SDKs (Vercel's AI SDK, LangChain.js) live in TypeScript, and there was no equivalent there.
I adapted the same idea for TypeScript and open sourced it — Zapcode, a TypeScript interpreter written in Rust.
- Docker cold start: 200–500 ms
- V8 isolate cold start: 5–50 ms
- Zapcode cold start: 2 µs
- Snapshot size: < 2 KB
- Memory: ~10 KB per execution
Install:
npm install @unchartedfr/zapcode-ai # TypeScript
pip install zapcode-ai # Python
Use with any LLM:
import { zapcode } from "@unchartedfr/zapcode-ai";
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const { system, tools } = zapcode({
  system: "You are a travel assistant.",
  tools: {
    getWeather: {
      description: "Get current weather for a city",
      parameters: { city: { type: "string" } },
      execute: async ({ city }) => fetch(`/api/weather/${city}`).then(r => r.json()),
    },
  },
});

const { text } = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  system,
  tools,
  maxSteps: 5,
  messages: [{ role: "user", content: "Weather in Tokyo?" }],
});
LLM writes TypeScript → Zapcode sandboxes it → tool calls suspend the VM → you resolve them → execution resumes.
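If the suspend/resume flow is hard to picture, a generator captures the same control pattern. This is a conceptual sketch only, not Zapcode's internals or API; guestProgram and runWithHost are hypothetical names:

// Conceptual sketch: the "VM" yields whenever guest code reaches an
// external call; the host resolves it and resumes with the result.
type HostCall = { fn: string; args: unknown[] };

function* guestProgram(): Generator<HostCall, string, any> {
  const tokyo = yield { fn: "getWeather", args: ["Tokyo"] }; // VM suspends here
  const paris = yield { fn: "getWeather", args: ["Paris"] }; // ...and here
  return tokyo.temp < paris.temp ? "Tokyo is colder" : "Paris is colder";
}

async function runWithHost(handlers: Record<string, (...a: any[]) => Promise<any>>) {
  const vm = guestProgram();
  let step = vm.next();
  while (!step.done) {
    const { fn, args } = step.value;             // pending external call
    step = vm.next(await handlers[fn](...args)); // resolve, then resume
  }
  return step.value; // the guest program's final result
}

Driving it with runWithHost({ getWeather: async (city) => ({ temp: city === "Tokyo" ? 8 : 12 }) }) runs the guest to completion without the guest code ever touching the network itself.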
Security
Same philosophy as Monty — deny by default. No filesystem, network, env vars, eval, or imports. The only escape is external functions you explicitly register.
Zero unsafe code in Rust. 65 adversarial tests across 19 attack categories.
Trade-offs
- TypeScript subset, not full ECMAScript
- No regex execution
- No npm packages
- Experimental — APIs may change
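To make the subset concrete, here's the flavor of code the sandbox should and shouldn't accept, going by the constraints above (illustrative, not an exhaustive spec):

// Fine: plain TypeScript with async/await, objects, arrays, arithmetic.
const doubled = [1, 2, 3].map((x) => x * 2);

// Rejected, per the constraints above:
// const re = /tokyo/i;       // no regex execution
// import fs from "node:fs";  // no imports / npm packages
// eval("1 + 1");             // no eval (blocked by the sandbox)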
If this is useful
If you're building AI agents and running into the same problem, I hope this saves you some time. If your agents generate syntax Zapcode doesn't support yet, please open an issue; that's the most helpful feedback I can get.