Pydantic Monty is a minimal Python bytecode VM, written in Rust, that starts executing AI-generated code in about 0.004ms with zero filesystem or network access by default. It is the missing infrastructure piece for production-grade AI agent workflows.
If you've tried to build a real AI agent that writes and runs its own Python code, you've hit the same wall: you can't hand arbitrary code to CPython. Docker solves the isolation problem but adds roughly 195ms of startup latency per execution; for an agent making dozens of code-execution decisions per task, that compounds into seconds of overhead per workflow. Cloud sandbox services (Modal, E2B) remove local container management but are slower still, at roughly 1000ms or more per network round-trip. Monty cuts startup to 0.004ms because it runs inside your existing process: no container spawn, no network call.
Samuel Colvin, the creator of Pydantic, built it and released v0.0.1 on January 27, 2026. It hit 2,600 GitHub stars within 48 hours. As of v0.0.8 (March 10, 2026), it is experimental but already usable for constrained code-execution use cases.
Why AI Agents Need a Code Execution Layer at All
The standard way to give an AI agent "tool use" is sequential function calling: the model selects a tool, gets a result, selects the next tool, gets the next result, and so on. This works, but it is expensive — each round-trip is a separate LLM inference call.
There is a faster pattern: instead of sequential tool calls, ask the model to write a short Python script that calls your tools as functions, then execute that script in one shot. Colvin calls this "CodeMode." The measured result: tasks that previously required 4 LLM round-trips complete in 2 calls when using CodeMode with asyncio.gather() for parallel tool invocations.
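The parallel-orchestration half of this pattern can be sketched in plain `asyncio`. The tool bodies below are stand-ins invented for illustration, not real APIs; the point is that one generated script can fan out several tool calls concurrently instead of burning an LLM round-trip per call:

```python
# CodeMode sketch: one model-written script invokes two tools in parallel
# via asyncio.gather, replacing two sequential tool-call round-trips.
import asyncio

async def fetch_weather(city: str) -> str:   # stand-in tool, not a real API
    await asyncio.sleep(0.01)
    return f"{city}: sunny"

async def fetch_news(city: str) -> str:      # stand-in tool, not a real API
    await asyncio.sleep(0.01)
    return f"{city}: 3 headlines"

async def agent_script() -> list[str]:
    # Both awaitables run concurrently; gather preserves argument order.
    return list(await asyncio.gather(
        fetch_weather("Paris"),
        fetch_news("Paris"),
    ))

results = asyncio.run(agent_script())
print(results)
```

Whether this wins in practice depends on how parallelizable your tool calls actually are, which is why the article's 4-calls-to-2 figure is workflow-specific.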
The problem is obvious. Giving an LLM the ability to write and execute arbitrary code means giving it access to your filesystem, your network, your environment variables, and your process. A sandboxed CPython is still CPython — the entire standard library is one import os away.
Monty solves this by not running CPython at all.
How Pydantic Monty's Security Model Actually Works
Monty's security philosophy is "start from nothing, move right." The default execution environment has zero capabilities:
Every capability the AI code can use must be explicitly granted through external functions — host-defined callables you register before execution. The VM can only call what you've explicitly allowed.
```python
import pydantic_monty

code = """
result = search_web(query=query)
summary = summarize(text=result)
return summary
"""

m = pydantic_monty.Monty(
    code,
    inputs=["query"],
    external_functions=["search_web", "summarize"],
)
result = m.run(
    inputs={"query": "pydantic monty benchmarks"},
    external_functions={
        "search_web": my_search_function,
        "summarize": my_summarize_function,
    },
)
```
The AI can call search_web and summarize because you registered them. It cannot call requests.get() or subprocess.run() because those are not registered — and the underlying modules don't exist in the VM at all.
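The allowlist semantics can be illustrated with a minimal registry in plain Python. This is a conceptual sketch of "deny by default", not Monty's internals: any name outside the registry is unresolvable, so there is no dangerous function to block in the first place:

```python
# Conceptual allowlist dispatch: sandboxed code can only reach callables
# the host explicitly registered; everything else is a NameError.
registry = {
    "search_web": lambda query: f"results for {query!r}",
}

def call_external(name: str, **kwargs):
    if name not in registry:
        raise NameError(f"external function {name!r} is not registered")
    return registry[name](**kwargs)

print(call_external("search_web", query="pydantic monty"))

try:
    call_external("subprocess_run", cmd="ls")   # never registered
except NameError as exc:
    print(exc)
```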
Configurable execution limits — memory allocation, stack depth, CPU time — are enforced at the VM level. Hit the threshold, execution cancels. No runaway loops.
Startup Latency: The Real Performance Comparison
The performance story is straightforward once you understand the architecture. Monty doesn't spawn a process — it runs as a library inside your existing Python, Rust, or JavaScript process.
| Execution method | Startup latency | Notes |
|---|---|---|
| Pydantic Monty | 0.004–0.06ms | Embedded in host process |
| Docker container | ~195ms | Process + container init |
| Pyodide (WebAssembly) | ~2800ms | WASM initialization |
| Modal / E2B (cloud sandbox) | ~1000ms+ | Network round-trip |
The 4.5MB package size with no external dependencies also means Monty ships inside your existing binary. No sidecar process, no daemon to manage.
For agent workflows where code execution is on the hot path — not an occasional capability but a step in every task — this latency profile changes what's architecturally feasible.
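You can get a feel for the in-process vs. fresh-process gap on your own machine with the standard library alone. The snippet uses `eval` as a stand-in for an embedded VM call and a fresh Python subprocess as a stand-in for process-level isolation; absolute numbers vary by machine, but the orders of magnitude are the point:

```python
# Rough latency comparison: in-process evaluation vs. spawning a fresh
# Python process for the same expression.
import subprocess
import sys
import time

expr = "5 * 3"

t0 = time.perf_counter()
in_process = eval(expr)              # stand-in for an embedded VM call
in_process_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
out = subprocess.run(
    [sys.executable, "-c", f"print({expr})"],
    capture_output=True, text=True,
)
subprocess_ms = (time.perf_counter() - t0) * 1000

print(f"in-process: {in_process_ms:.4f}ms, subprocess: {subprocess_ms:.1f}ms")
```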
Installation and First Execution
Install via pip or uv:
```bash
uv add pydantic-monty
# or
pip install pydantic-monty
```
JavaScript/TypeScript:
```bash
npm install @pydantic/monty
```
A minimal execution:
```python
import pydantic_monty as monty

m = monty.Monty('x * y', inputs=['x', 'y'])
result = m.run(inputs={'x': 5, 'y': 3})
print(result)  # 15
```
Type checking against stubs is optional but recommended for AI-generated code — it catches type errors before execution rather than at runtime:
type_stubs = """
x: int
y: int
"""
m = monty.Monty(
'x * y',
inputs=['x', 'y'],
type_check=True,
type_check_stubs=type_stubs,
)
What Python Subset Monty Supports Right Now
Monty is experimental. v0.0.8 supports:
- Functions (sync and async), closures, comprehensions, f-strings
- `asyncio`, `typing`, partial `os` (stub) and `sys` (stub)
- Full `math` module (50+ functions, added in v0.0.8)
- Dataclasses injected from the host
- Bigint literals, PEP 448 generalized unpacking
- Controlled in-memory filesystem abstraction
Not yet supported (on the roadmap): class definitions, match statements, context managers, generators, and the `re`/`datetime`/`json` modules.
The missing class support sounds limiting until you consider the primary use case. LLMs generating code for tool orchestration rarely need to define classes — they need to call functions, transform data, and return results. The subset Monty supports covers most CodeMode patterns.
Serializable Execution State: Why This Matters for Durable Agents
One underrated feature: Monty can serialize both parsed bytecode and live execution state.
```python
code_bytes = m.dump()              # serialize parsed bytecode
m2 = monty.Monty.load(code_bytes)  # rehydrate, possibly in another process
result = m2.run(inputs={"x": 10})
```
Execution state snapshots are single-digit kilobytes. This enables agent workflows that survive process restarts — you can store the execution state in a database, resume it in a different process, and the agent picks up exactly where it left off. For long-running background agents, this is the difference between "restartable" and "durable."
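The durability pattern this enables can be sketched with `sqlite3` from the standard library. The `snapshot` bytes and the `agent_state` table are placeholders invented for this sketch; in practice the blob would come from Monty's `dump()` and be fed back into `Monty.load()`:

```python
# Durable-agent sketch: persist a serialized snapshot in a database,
# then restore it later, possibly from a different process.
import sqlite3

snapshot = b"\x00monty-state\x01"   # placeholder for the bytes m.dump() returns

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE agent_state (run_id TEXT PRIMARY KEY, state BLOB)")
db.execute("INSERT INTO agent_state VALUES (?, ?)", ("run-42", snapshot))
db.commit()

# Later: fetch the snapshot and resume exactly where the agent left off.
(restored,) = db.execute(
    "SELECT state FROM agent_state WHERE run_id = ?", ("run-42",)
).fetchone()
assert restored == snapshot   # then: monty.Monty.load(restored).run(...)
print("snapshot restored:", len(restored), "bytes")
```

Because snapshots are single-digit kilobytes, storing one per agent step in an ordinary relational table is cheap.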
The PydanticAI Integration Coming Up
The production use case Colvin is building toward is PydanticAI's CodeMode — an official integration that will let PydanticAI agents generate and execute Python code through Monty. Colvin confirmed this directly on Hacker News: "That's exactly what we built this for: we're implementing code mode."
The pattern, once it ships: define your tools as Python functions. Give the agent a task. The agent writes a Python script that orchestrates those tools in whatever sequence or parallel structure the task requires. Monty executes it. The agent gets results. All tool access is controlled by what you registered — the AI cannot reach anything you didn't explicitly expose.
This is architecturally different from both "give the AI a list of tools and let it call them sequentially" and "give the AI shell access." It's a controlled middle ground that makes complex tool orchestration fast without opening your system.
Key Takeaways
- Docker's 195ms startup latency is a structural problem for agent code execution — not a solvable config issue, but a fundamental constraint of process-level isolation that Monty sidesteps by running inside your process.
- The "deny by default" security model is the right architecture for AI-generated code — allowlists of registered external functions are auditable; blocklists of dangerous stdlib functions are not.
- CodeMode reduces LLM round-trips — the same task that requires 4 sequential tool calls typically requires 2 LLM calls when the model writes code that chains tool calls in parallel.
- Serializable execution state enables durable AI agents — Monty's kilobyte-scale state snapshots are storable in a database, making long-running agent workflows restartable across process boundaries.
- Monty is v0.0.8 and experimental: it lacks class support, the `re`/`datetime`/`json` modules, and production hardening. Use it for constrained CodeMode patterns today; wait for the PydanticAI integration for production workflows.
What This Means for Builders
- If you're building AI agents with tool use, benchmark CodeMode against sequential function calling for your specific workflow. The 2x LLM call reduction is real, but only valuable if your tool calls are parallelizable.
- If you're evaluating sandboxing options, Monty is worth testing against Docker for use cases where you control the tool surface. The latency win is significant; the stdlib limitations are real constraints you'll need to design around.
- If you're building on PydanticAI, watch the `code-mode` branch: Monty is its intended execution backend and the integration is under active development.
- If you need class definitions or broad standard library access, Monty v0.0.8 is too limited for your use case today. Check back at v0.1.x.