DEV Community: Vagner Bessa

Now I can replay it: offline regression testing for multi-turn AI agents

Vagner Bessa — Tue, 30 Jun 2026 15:56:44 +0000

ReplayGate is conversation-level regression testing for multi-turn AI agents. The regressions it hunts live between turns: the agent books before the user confirmed, re-asks for something it was already told, forgets a constraint set three turns back.

Per-turn assertions look at one reply at a time and sail right past those. So ReplayGate makes a flight recorder's bet: capture a real conversation once, exactly, then replay it offline and assert the cross-turn properties that matter.

In update #1 I built the record half: it captures an agent's LLM and tool calls into a deterministic fixture, and I deliberately deferred the part that pays it off. This update is that payoff. I can now replay those recorded conversations offline.

Replaying the recording

Replay re-runs the agent over the fixture's user turns, but the LLM client answers from the recording instead of the network. The match is the same sha256 over (model, system, messages, tools) from #1, so the agent runs its real logic and only the model and tool results are served from the log:

def replay_conversation(fixture, agent_factory, tools):
    rec_llm = RecordingLLMClient(inner=None, mode="replay", recording=fixture.llm_recording)
    rec_tools = ToolRecorder(tools, mode="replay", recording=fixture.tool_recording)
    agent = agent_factory(rec_llm, rec_tools)
    # ...walk the recorded user turns, calling agent.respond, with no network

A diff_conversations then compares the recorded conversation against its replay, turn by turn, on assistant text and tool calls. The CLI wraps both:

$ replaygate replay ./fx
replay OK — 2 turns reproduced offline, zero network
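
For a sense of what a turn-by-turn comparison on assistant text and tool calls can look like, here is a minimal sketch. It is not ReplayGate's diff_conversations: TurnRecord and diff_turns are hypothetical names, and the tuple-based result is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class TurnRecord:
    """Hypothetical stand-in for one turn: assistant text plus tool calls."""
    assistant_text: str
    tool_calls: list = field(default_factory=list)  # e.g. [("search_slots", {...})]


def diff_turns(recorded, replayed):
    """Return (turn_index, field_name, recorded_value, replayed_value) tuples."""
    diffs = []
    if len(recorded) != len(replayed):
        diffs.append((None, "turn_count", len(recorded), len(replayed)))
    # Compare the turns both conversations have, position by position.
    for i, (rec, rep) in enumerate(zip(recorded, replayed)):
        if rec.assistant_text != rep.assistant_text:
            diffs.append((i, "assistant_text", rec.assistant_text, rep.assistant_text))
        if rec.tool_calls != rep.tool_calls:
            diffs.append((i, "tool_calls", rec.tool_calls, rep.tool_calls))
    return diffs
```

An empty list is the "replay OK" case; anything else names the turn and the field that moved.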

"Zero network" is literal: the replay path imports no provider SDK and reads no API key. To prove that rather than assert it, I replayed real recordings with every key unset (ANTHROPIC_API_KEY, OPENAI_API_KEY, and the rest), and they reproduced with an empty diff. A recorder you can't replay blind is just a logger.

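One way to turn "zero network" into something a test can enforce is to make socket creation fail for the duration of a replay. The sketch below uses only the standard library; no_network and NetworkAccessError are hypothetical names, not part of ReplayGate, and the guard only covers code that goes through Python's socket module.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code under the guard tries to open a socket."""


@contextmanager
def no_network():
    """Make any attempt to create a socket fail while the block runs."""
    real_socket = socket.socket

    def blocked(*args, **kwargs):
        raise NetworkAccessError("network access attempted during replay")

    socket.socket = blocked
    try:
        yield
    finally:
        socket.socket = real_socket  # always restore the real class
```

Wrapping a call such as replay_conversation(fixture, agent_factory, tools) in "with no_network():" would make an accidental live call fail the test instead of quietly succeeding.
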
Five providers, one seam

The recorder generalized for almost nothing because of the decision in #1: the agent depends on a one-method LLMClient protocol, create(model, system, messages, tools), not a vendor SDK. A recorder slots in front of the real client, and "the real client" can be anything that satisfies the protocol. So I wrote five. Anthropic goes through its official SDK; OpenAI, OpenRouter, Ollama, and Google Gemini go through one OpenAI-compatible client (the last three are just different base URLs and keys). The SDK imports are lazy, so the core package stays offline and dependency-free, and the test suite still never opens a socket.

A record-live command points the booking agent at any of them:

$ replaygate record-live booking_happy ./fx --provider ollama --model qwen2.5:7b
recorded booking_happy live via ollama → ./fx

Here's the part I didn't expect. I recorded the same two-turn scenario, where the user asks for slots and then says "yes, book 3pm", six times each against Anthropic's claude-opus-4-6, OpenAI's gpt-5.4, and a local qwen2.5:7b. Every run calls search_slots on turn one. Turn two is where they split:

provider / model	turn 2, over 6 runs
Anthropic `claude-opus-4-6`	re-asked for confirmation all 6, never booked
OpenAI `gpt-5.4`	booked all 6
Ollama `qwen2.5:7b`	booked 5, searched again 1

Two things fall out of that. Two frontier models read the identical "yes, book 3pm" and reach opposite decisions on the same two user turns: gpt-5.4 books on every run, and claude-opus-4-6 re-asks for confirmation on every run. And the small local model, qwen2.5:7b, disagrees with itself, booking five times and searching again on the sixth. That table is a single batch, not a fixed property: swap one model for another, or just run the same one twice, and the trajectory moves. That is the whole reason to record instead of re-run. As I put it in update #1, you can't diff two runs that never produce the same bytes, so you capture one and replay that. The recording is the fixed point; the live model isn't.

To be clear about what this is and isn't: these are recordings, not the diff catching a regression yet. But it's the exact signal the cross-turn checks exist for. Pin the trajectory you want, change the model or the prompt, replay, and a divergence is a regression you'd otherwise meet in production.

What I still can't do

The deferral to name plainly: replay today proves faithful reproduction. diff_conversations compares a recording against its own replay and confirms they match. What it does not yet do is assert cross-turn invariants across a changed agent: run user_confirmed_before over the replayed conversation and fail when the agent books before the user confirmed. That assertion, a replaygate regress command, and the OpenTelemetry span wiring I deferred in update #1 are the line between a faithful recorder and an actual regression test. It's the next update, and it's the entire point of the project.

What I learned

Two things worth keeping. The provider-agnostic seam from update #1 paid a dividend I didn't have to work for: deterministic replay and a five-provider recorder both fell out of the same one-method protocol, at nearly zero cost. The seam was the whole design, again. And an independent review earns its keep precisely when you treat its findings as claims to test rather than edits to merge: one finding I dug into and rejected would have regressed the very thing it flagged.

What's next

The detector: wire the cross-turn invariants into a replaygate regress command, run them over the replayed conversation, and fail CI when the agent trips one. That's where the deliberately broken agent I planted in update #1, its inject_regression seed, finally gets caught, and where the deferred OpenTelemetry timing spans get their first consumer.

ReplayGate is open source at github.com/bessavagner/replaygate.

Starting ReplayGate: recording an agent before I can replay it

Vagner Bessa — Tue, 30 Jun 2026 15:56:04 +0000

I'm starting a second project in public, so this is update one of a fresh log. It's ReplayGate: conversation-level regression testing for multi-turn, channel-native AI agents. The problem it chews on is one anyone who has shipped an LLM agent knows: the regressions that bite hardest aren't in any single reply. The agent books before the user confirmed, three turns back. It re-asks for something it was already told. It forgets a constraint the user set earlier in the session. A prompt tweak that looks harmless trips one of these, and per-turn assertions sail right past it, because they only ever look at one reply at a time. You can't diff two runs that never produce the same bytes, either.

ReplayGate's bet is the same one a flight recorder makes: capture the whole conversation once, exactly, on the channel it actually ran on, then replay it as many times as you want and assert the cross-turn properties that matter. This first slice is the record half plus the foundation everything else hangs on. Replay, the cross-turn divergence detection, and the CI gate are the next plan. I'll get to why I split it that way.

Here's the whole machine I'm building toward, and, honestly, how little of it exists yet. The key thing to read off it: ReplayGate is the harness in the box. It doesn't contain your agent, it brackets it (the amber box up top is your LLM agent under test; I ship a booking-assistant example only so there's something to record against). The green pieces are this update's record half; everything dashed is the next plan.

The one contract everything depends on

Before any capture code, I wrote the trace contract: a small tree of Pydantic v2 models (Message, ToolCall, Turn, Conversation) that every other module imports and nothing gets to bypass. A Conversation is just turns, and it carries the query helpers I'll need when I start asserting things about agent behavior:

class Conversation(BaseModel):
    id: str
    scenario: str
    channel: Literal["direct", "whatsapp"]
    session_meta: SessionMeta
    turns: list[Turn] = Field(default_factory=list)
    agent_version: str = ""
    model: str = ""

    def all_tool_calls(self) -> list[ToolCall]:
        return [tc for turn in self.turns for tc in turn.tool_calls]

    def user_confirmed_before(self, turn_index: int) -> bool:
        ...

user_confirmed_before is the tell for where this is going. The agent I'm testing is a booking assistant, and the invariant I care about is "never book before the user confirms." That's a cross-turn property, exactly the kind of thing that's invisible to a single assertion and obvious to a recording you can walk start to finish.
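
To make "never book before the user confirms" concrete, here is a rough sketch of the invariant over plain dictionaries. It does not fill in the user_confirmed_before method above: the turn shape, the book_slot tool name, and the keyword-based notion of "confirmed" are all assumptions for illustration, and a real check would need something sturdier than substring matching.

```python
def booked_before_confirmation(turns, booking_tool="book_slot",
                               confirm_words=("yes", "confirm")):
    """Return the index of the first turn that books without a confirmation so far.

    Each turn is a dict with "user" (the user's text) and "tool_calls"
    (a list of tool names the agent called in that turn). Returns None
    when every booking was preceded by a confirming user message.
    """
    confirmed = False
    for index, turn in enumerate(turns):
        text = turn["user"].lower()
        # Naive heuristic: any confirming keyword in the user's message counts.
        if any(word in text for word in confirm_words):
            confirmed = True
        if booking_tool in turn["tool_calls"] and not confirmed:
            return index
    return None
```

The point is the shape of the check: it walks the conversation in order and carries state from one turn to the next, which a per-turn assertion cannot do.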

The channel field on the model is there for the same reason. Agents don't run as tidy Python calls; they run on WhatsApp, on voice, on a webhook, where message ordering, chunking, and a session window quietly expiring mid-flow can break a conversation that passed every unit test. Modeling the channel as part of the trace means the same recording can be replayed as it actually happened, not as an idealized function call. The direct adapter ships in this slice; WhatsApp is on the roadmap, next to the agents I'd actually want to regression-test.

Record the model, not the wire

The interesting design decision is where to capture. The obvious move is to record HTTP (vcrpy cassettes, fake the server). I deliberately didn't. ReplayGate records at the application seam: it wraps the agent's LLM client and its tool registry, and logs calls there.

The whole reason that works is a one-method protocol the agent depends on instead of a concrete SDK, the same provider-agnostic LLMClient abstraction I keep reaching for:

class LLMClient(Protocol):
    def create(self, model, system, messages, tools) -> LLMResponse: ...

Because the agent only knows that protocol, I can slot a recorder in front of the real client. In record mode it calls through and logs the exchange under a stable key; in replay mode it answers from the log and never touches the network:

def create(self, model, system, messages, tools) -> LLMResponse:
    key = request_key(model, system, messages, tools)
    if self._mode == "replay":
        for entry in self._recording:
            if entry["request_key"] == key:
                return LLMResponse.model_validate(entry["response"])
        raise KeyError(f"no recorded LLM response for request_key {key[:12]}…")
    response = self._inner.create(model, system, messages, tools)
    self._recording.append({"request_key": key, "request": {...}, "response": response.model_dump()})
    return response

The key is a sha256 over the model, system prompt, messages, and tools, serialized with sort_keys=True so it's stable across runs. Tools get the same treatment in a ToolRecorder keyed on (name, args). Recording at this layer means the fixture is readable JSON about what the agent actually did, not opaque HTTP bodies, and the replay matching keys on meaning, not byte order. That's the payoff for not faking the wire.
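
A minimal sketch of such a key, assuming the request parts are JSON-serializable. The exact payload layout ReplayGate hashes may differ; what matters here is that sort_keys=True makes the digest independent of dictionary key order.

```python
import hashlib
import json


def request_key(model, system, messages, tools):
    """Hash the request's content; sort_keys makes dict key order irrelevant."""
    payload = json.dumps(
        {"model": model, "system": system, "messages": messages, "tools": tools},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two requests that differ only in key order hash to the same digest, while any change to a message, the system prompt, or the tool list produces a different one, which is what lets a replay miss surface as a KeyError.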

This is the same lesson RegWatch's ingestor taught me from the other direction: fake the client, not the server. It's why the entire ReplayGate test suite runs with zero network calls. The Anthropic SDK is a dependency; it is never imported in a test.

A black box needs somewhere to write

Capture without storage is a stunt, so the other half of this slice is persistence. Each recorded conversation becomes a fixture directory, not a blob:

conversation.json     # the trace contract, serialized
llm_recording.json    # every LLM exchange, keyed
tool_recording.json   # every tool call + result
spans.jsonl           # OpenTelemetry-aligned timing spans
meta.json             # scenario, agent version, model, recorded_at
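
Writing that layout takes little code. The sketch below assumes every part is already a plain JSON-serializable object; write_fixture is a hypothetical helper, not the project's actual writer, and the file names are the ones listed above.

```python
import json
from pathlib import Path


def write_fixture(out_dir, conversation, llm_recording, tool_recording, spans, meta):
    """Write one recorded conversation as a directory of small, readable files."""
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    documents = {
        "conversation.json": conversation,
        "llm_recording.json": llm_recording,
        "tool_recording.json": tool_recording,
        "meta.json": meta,
    }
    for name, payload in documents.items():
        text = json.dumps(payload, indent=2, sort_keys=True)
        (root / name).write_text(text, encoding="utf-8")
    # JSON Lines: one span object per line; an empty list yields an empty file.
    lines = [json.dumps(span, sort_keys=True) for span in spans]
    (root / "spans.jsonl").write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return root
```

Separate files keep each concern diffable on its own: a changed tool result shows up in tool_recording.json without touching the conversation trace.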

The spans go to a DuckDB store, with attributes aligned to OpenTelemetry's GenAI semantic conventions (gen_ai.request.model, gen_ai.agent.name, and friends) so the timing data speaks a standard vocabulary instead of one I invented. write then read round-trips through real DuckDB in a test, no mock, because a store you can't read back is just a delete with extra steps.

The honest part: I record spans I don't write yet

Here's the deferral I want to name out loud rather than bury. The record orchestrator builds a full fixture (turns, LLM log, tool log, metadata) and sets spans=[]. The span store is built and tested; the wiring that emits spans during a record run isn't.

That's on purpose. Nothing consumes spans until the replay-and-compare work in the next plan, and threading OTel instrumentation through the capture loop before there's a consumer is how you ship a half-feature that drifts out of sync with its only user. So the store lands now with its own tests, and the instrumentation lands next to the thing that reads it. A [TODO] in code is a smell; a tracked deferral with a reason is a decision. This is the second kind.

Where the boring code bit me

I built this strictly test-first: one machine-readable plan, twelve tasks, each a failing test before a line of implementation, in the plan format I've written about before. Eleven tasks went green without drama. The twelfth, the CLI, did not.

I'd wired the recorder behind a single Typer command and the test invoked it as a subcommand: record booking_happy ./out. It blew up:

Usage: record [OPTIONS] SCENARIO_NAME OUT_DIR
╭─ Error ─────────────────────────────────────────────╮
│ Got unexpected extra argument(s) (./out)            │
╰─────────────────────────────────────────────────────╯
SystemExit: 2

The cause is a Typer sharp edge: when an app has exactly one command, Typer collapses it into a single-command app, so record is parsed as the first positional argument, not a subcommand name. record landed in scenario_name, booking_happy in out_dir, and the real path fell off the end. The fix is one no-op callback that forces Typer back into multi-command mode:

@app.callback()
def main() -> None:
    """ReplayGate CLI."""

Mildly maddening, completely undramatic, and exactly the kind of thing a test-first loop surfaces in seconds instead of in a demo. With that in place:

$ replaygate record booking_happy ./fx
recorded booking_happy → ./fx
$ python -m pytest -q
....................                          [100%]
20 passed

Twenty tests, ruff clean, fully offline.

The bug I planted on purpose

The reference booking agent ships with an inject_regression flag. Flip it on and the agent books an appointment even when the model never signaled a confirmation step, the precise cross-turn failure user_confirmed_before exists to catch. Right now it's just a seed: the recorder will happily capture both the good run and the broken one. Catching the difference is what the replay half is for, and wiring that payoff is the whole reason I built the detector's vocabulary into the contract first.

What I learned

Two things worth keeping. First, choosing the capture seam is the entire design: recording at the application layer instead of HTTP is what makes the fixture mean something and the replay deterministic, and it cost nothing because the agent already talked to a protocol, not a vendor. Second, a deferral you can defend in a sentence ("the store has no consumer until the next plan") is a feature of build-in-public; a deferral you can't is just a gap you're hiding.

What's next

The replay half: feed llm_recording/tool_recording back in replay mode, run the recorded conversation against the current agent, and diff the two into a ConversationDiff. That's where inject_regression finally gets caught, where the OTel spans get wired in, and where a replaygate regress command and a CI gate turn this from a recorder into an actual regression test.

ReplayGate is open source from day one: the contract, the record/replay wrappers, and the CLI above live at github.com/bessavagner/replaygate. The open question I keep circling: for multi-turn agents, is recording-and-replaying a real conversation the right regression primitive, or have you had better luck with per-turn LLM-judge evals, or with generated user-simulations? If you've ever tried to catch a cross-turn bug, the "it acted before the user confirmed" kind, three turns deep, I'd genuinely like to compare notes before I commit to the replay design.

Scraping a fragile legacy site into a clean time series

Vagner Bessa — Sat, 27 Jun 2026 16:11:24 +0000

Some data is only available through a website that was clearly never meant to be read by a program. No API, no bulk download, no CSV export — just a form, a dropdown, a search button, and a table that appears after a spinner. If you want the data as a dataset, you have to drive the page like a human and scrape what comes back.

That's fine for a handful of rows. It gets interesting when you need years of them. I wanted the full monthly history of the public Brazilian vehicle price table (FIPE — Fundação Instituto de Pesquisas Econômicas), broken down by brand, model, and year, across every reference month the site still served. That's tens of thousands of lookups against a legacy ASP site that times out, throws modal dialogs, and occasionally just stops responding. A naive loop would die somewhere in hour two and lose everything.

This post is about how to make that kind of scrape survivable: how to checkpoint so a crash costs you minutes instead of hours, how to deal with the modal dialogs a legacy site throws at you, how to retry hard without giving up, and how to land the result as clean columnar data you can analyze instead of a pile of half-broken HTML.

When the site fights back

The site I was scraping is a classic of the genre. You pick a reference month from a dropdown, type a FIPE code into a text field, the page fires an AJAX request, a second dropdown populates with the matching model-years, you pick one, hit search, and a result table renders. Repeat for the next code. Repeat for the next month.

Three things make this hostile to automation:

It's stateful and slow. Every step depends on the previous one having finished. The model-year dropdown is empty until the AJAX call for the code you typed comes back, and that call can take a few hundred milliseconds or a few seconds depending on the server's mood.
It throws modal dialogs. Ask for a code that doesn't exist in the selected month and you don't get an empty result — you get a modal alert that sits on top of the page and blocks every other interaction until you dismiss it. Miss it, and your next click lands on the overlay instead of the control you wanted.
It falls over. Run thousands of requests in a row and you'll hit timeouts, refused connections, and the occasional stale element. Not often enough to be useless; often enough that "just let the loop run" is not a plan.

So the design goal is narrow: finish the job despite the site, and don't lose progress when it breaks. Everything below is in service of that.

Checkpointing: make the scrape idempotent

The single most important decision is that the scrape must be resumable. If it dies on row 9,000 of 27,000, restarting it should pick up at row 9,001 — not row 1, and not by re-downloading everything to find out where it stopped.

The trick is to make the output file itself the checkpoint. Each reference month writes to its own parquet file, and rows are appended in the same order as a stable list of FIPE codes. On startup, for each month, I read whatever is already on disk, look at the last code I successfully wrote, find that code's position in the master list, and resume from the next one:

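import pandas as pd
from pandas.errors import EmptyDataError

overwrite = True
# resume from where the last run left off
try:
    data = pd.read_parquet(data_file, engine="pyarrow")
    # the last row on disk is the high-water mark
    last_code = data["codigo_fipe"].values[-1]
    for row, idx in zip(fipe_codes.values, fipe_codes.index):
        if row[0] == last_code:
            start = idx + 1  # resume right after the last code written
    overwrite = False  # append to the existing file instead of replacing it
except (FileNotFoundError, EmptyDataError):
    start = 0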

for code, modelo, marca in fipe_codes.values[start:]:
    ...

There's no separate state file, no database, no "progress.json" to keep in sync with reality. The data is the progress. If the parquet exists, the last row in it is the high-water mark; if it doesn't, we start from zero. That property — that re-running produces the same result without redoing finished work — is idempotency, and it's the thing that turns a fragile multi-hour job into one you can kill and restart without a second thought.

A couple of details matter for this to actually be safe:

Order has to be stable. The resume logic only works because the list of codes is read from a fixed file (fipe_codes_carros.csv) in the same order every run. If the iteration order changed between runs, "the last code I wrote" would tell you nothing about what's left.
Catch the empty case. A freshly created but empty file raises EmptyDataError, not FileNotFoundError. Treating both as "start from zero" avoids a crash-on-resume that would otherwise look like the data being corrupt.
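
Stripped of pandas, the resume rule fits in a few lines. This is a simplified sketch rather than the code above: resume_index is a hypothetical helper, the codes in the usage note are placeholders, and raising when the checkpoint is missing from the master list is a choice made for this example.

```python
def resume_index(codes, last_written):
    """Return the position to resume from, given the ordered master list."""
    if last_written is None:
        return 0  # nothing on disk yet: start from the first code
    try:
        return codes.index(last_written) + 1
    except ValueError:
        # The checkpoint names a code the master list doesn't contain,
        # so the list changed between runs and resuming would be unsafe.
        raise ValueError(f"checkpoint code {last_written!r} not in master list") from None
```

With codes = ["code-a", "code-b", "code-c"], a checkpoint of "code-b" resumes at index 2, and a checkpoint of "code-c" yields index 3, so codes[3:] is empty and the month is already complete.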

Modals, waits, and retries

With resumption in place, the loop only has to survive long enough to make progress between crashes. That comes down to handling the page's three failure modes.

Dismiss the modal, then continue

When a requested code doesn't exist for the selected month, the site pops a modal alert instead of returning an empty table. The fix is to detect it, read its message, close it, and move on to the next code rather than letting the blocked overlay derail everything that follows:

import time

from selenium.common.exceptions import ElementClickInterceptedException

# inside the per-code loop
try:
    modelos = crawler.items_of(select_modelo)
except ElementClickInterceptedException:
    # the code may not exist for this reference month
    alert = crawler.element(".modal.alert", "css selector")
    message = crawler.child_of(alert, ".content.ps-container", "css selector")
    block = crawler.child_of(message, "p", "tag name")
    text = block.get_attribute("innerText")
    if "não localizado" in text:  # the site's "not found" message
        crawler.click("btnClose", "class name")
        time.sleep(2)
        continue
    raise  # any other dialog is a real failure: let it bubble up

The ElementClickInterceptedException is the tell: Selenium tried to interact with a control and something — the modal overlay — got in the way. Catching that specific exception and inspecting the dialog text lets the scraper distinguish "this code legitimately doesn't exist this month" (skip it) from a real failure (let it bubble up).

Wait for the page, don't sleep blindly

A legacy AJAX page is the textbook case for explicit waits. The wrong move is a fixed sleep long enough to cover the worst case — that's slow when the server is fast and still flaky when it's slow. The right move is to poll for the condition you actually care about and proceed the moment it's true. Selenium's WebDriverWait does exactly this:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait until the result table is actually present, up to 10s
wait = WebDriverWait(driver, 10)
table = wait.until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)

An explicit wait is "loops added to the code that poll the application for a specific condition," as the Selenium docs put it — it returns as soon as the table exists, and raises TimeoutException if it never does. That timeout is a feature: it's the signal that tells the outer loop the page is wedged and the crawler should be restarted.
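
The mechanism is small enough to write out. The helper below is a standard-library sketch of the polling idea, not Selenium's implementation; wait_until is a hypothetical name, and the half-second default poll interval mirrors WebDriverWait's default.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # proceed the moment the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll)
```

A fast page costs one check; a slow page costs only as many polls as it actually needs; a wedged page costs exactly the timeout and then raises.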

Retry hard at the top level

Individual waits handle a slow page. They don't handle the crawler process getting into a bad state — a dropped connection, a browser that stops responding. For that, the whole scrape runs inside a bounded retry loop that catches the transient exceptions, rebuilds the crawler, and tries again, with a finite budget so a permanently-broken site can't spin forever:

import time

import thatscraper
from selenium.common.exceptions import TimeoutException
from urllib3.exceptions import MaxRetryError

trials = 100  # total retry budget across the whole scrape
keep_on = True
while keep_on:
    # build a fresh crawler on every pass so a wedged browser is discarded
    crawler = thatscraper.Crawler(headless=True, quit_on_failure=False)
    crawler.timeout = 10
    try:
        keep_on = get_fipe_carros(crawler, data_file, fipe_codes_file)
    except TimeoutException:
        if trials > 0:
            print("Timeout. Restarting crawler...")
            trials -= 1
        else:
            keep_on = False
    except (ConnectionRefusedError, MaxRetryError):
        if trials > 0:
            print("Connection refused. Restarting crawler...")
            time.sleep(5)
            trials -= 1
        else:
            keep_on = False
    finally:
        crawler.quit()

Two things make this work together with the checkpoint logic. First, a fresh crawler is built on every iteration, so a wedged browser is thrown away rather than reused. Second — and this is the payoff for all the checkpoint work — when get_fipe_carros restarts, it reads the parquet files back and resumes from the last code it wrote. A crash on trial 12 doesn't cost the work from trials 1 through 11. The retry budget (trials) bounds the total damage a permanently-down site can do, so the job fails loudly instead of hanging forever.

If you'd rather not hand-roll the retry bookkeeping, tenacity wraps the same pattern — bounded attempts, exponential backoff, jitter — in a decorator. The principle is identical: retry the transient stuff, cap the attempts, and make sure each retry resumes from a checkpoint rather than starting over.
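
For reference, the pattern itself (bounded attempts, exponential backoff, jitter) fits in a short standard-library function. This sketch is not tenacity's API; retry and its parameters are hypothetical, and drawing the delay uniformly between zero and the cap is one common jitter strategy among several.

```python
import random
import time


def retry(func, attempts=5, base_delay=1.0, max_delay=30.0,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func() until it succeeds, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # budget exhausted: fail loudly
            # The cap doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # random wait anywhere up to the cap
```

The sleep parameter exists so a test can pass a stub and run instantly; in a scrape, func would be the checkpoint-aware job, so each retry resumes rather than restarts.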

Landing a clean time series

Scraping gets you HTML. Analysis needs a table. The bridge is short and worth getting right, because the shape you land the data in determines how painful every later step is.

Each rendered result table goes straight from its HTML into a DataFrame with pandas.read_html, which parses an HTML <table> into a list of frames:

table_element = crawler.element("table", "tag name")
to_table = pd.read_html(table_element.get_attribute("outerHTML"))
header = extract_clean_header(to_table)   # strip accents, lowercase, snake_case
values = extract_values(to_table)
search_result = make_dataframe(values, header)

Then comes the cleaning that turns scraped strings into typed columns. The prices arrive as Brazilian-formatted currency text — R$ 45.046,00, with a dot for thousands and a comma for decimals — which is a string, not a number, until you fix it. A single regex pass per column handles it:

# "R$ 45.046,00" -> 45046.00
main["preco_medio"] = (
    main["preco_medio"]
    .replace(regex={r"(R\$\s)": "", r"\.": "", r",": "."})
    .astype("float")  # assign back instead of mutating a column selection in place
)
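
The same conversion on a single string, without pandas, makes the three steps explicit. parse_brl is a hypothetical helper for illustration; it assumes the input always follows the "R$ 45.046,00" shape.

```python
import re


def parse_brl(text):
    """Convert Brazilian-formatted currency text such as 'R$ 45.046,00' to a float."""
    digits = re.sub(r"R\$\s*", "", text.strip())  # drop the currency prefix
    digits = digits.replace(".", "")              # drop thousands separators
    digits = digits.replace(",", ".")             # decimal comma -> decimal point
    return float(digits)
```

The order matters: the thousands dots have to go before the decimal comma becomes a dot, otherwise the decimal point would be stripped along with them.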

The goal of this step is tidy data in Hadley Wickham's sense: each variable a column, each observation a row, one type of observational unit per table. A scraped page is the opposite — values fused into display strings, headers carrying accents and punctuation, types implied rather than stated. Pulling those apart now means every downstream query is a one-liner instead of a parsing exercise.

Finally, the result lands as Apache Parquet rather than CSV. Parquet is a columnar format: it stores types, compresses well, and reads fast when you only need a couple of columns out of many — which is exactly the access pattern for "average price over time by brand." Writing and reading it from pandas is one call each:

table.to_parquet(data_file, engine="pyarrow", index=False)
df = pd.read_parquet(data_file, engine="pyarrow", columns=["marca", "preco_medio"])

That columns= argument is the columnar payoff: reading two columns out of eight touches only those two columns on disk. For a multi-file historical dataset you scan repeatedly while exploring, that adds up.

What the data shows

Once every month is a tidy parquet file, the analysis is the easy part. Reading all of them, parsing the prices, and taking the median price per brand per reference month gives a clean monthly time series spanning August 2020 to January 2023 — 30 reference months, assembled from the scraped tables.

The trend is unmistakable and consistent across segments: every brand I tracked rose between roughly 34% and 47% over the window, with the steepest climb through 2021 — the period of the global used-car price surge. I use the median rather than the mean here on purpose: the raw data has a long right tail (a handful of collector cars priced in the millions of reais) that would drag a mean around month to month. Median per brand per month is the robust summary, and it's a one-line groupby once the data is tidy.
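
A toy example shows why the median is the safer summary here. The rows are made up for illustration, not scraped data, and the grouping uses only the standard library rather than the pandas groupby mentioned above.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up rows (brand, reference month, price in reais), not scraped data.
rows = [
    ("brand_a", "2021-01", 40_000.0),
    ("brand_a", "2021-01", 45_000.0),
    ("brand_a", "2021-01", 50_000.0),
    ("brand_a", "2021-01", 3_000_000.0),  # one collector car in the right tail
]

# Group prices by (brand, month), then summarize each group.
groups = defaultdict(list)
for brand, month, price in rows:
    groups[(brand, month)].append(price)

summary = {key: (median(prices), mean(prices)) for key, prices in groups.items()}
# summary[("brand_a", "2021-01")] == (47500.0, 783750.0)
```

One outlier moves the mean to 783,750 while the median stays at 47,500, close to what a typical car in the group costs.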

It's also worth auditing completeness. A scrape that ran for hours never lands perfectly square: some brand/month cells are richer than others, and a thin month is a hint that the run was interrupted there. A quick heatmap of rows per brand per month makes gaps visible at a glance.
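
The counts behind such an audit are simple to compute. The sketch below uses made-up (brand, month) pairs and the standard library only; the point is to build the full brand-by-month grid so that cells with no rows show up explicitly as zeros.

```python
from collections import Counter

# Made-up (brand, reference month) pairs standing in for scraped rows.
observations = [
    ("brand_a", "2021-01"), ("brand_a", "2021-01"), ("brand_a", "2021-02"),
    ("brand_b", "2021-01"),
]

counts = Counter(observations)
brands = sorted({brand for brand, _ in observations})
months = sorted({month for _, month in observations})

# Every brand/month cell, including the ones with no rows at all.
grid = {(b, m): counts.get((b, m), 0) for b in brands for m in months}
gaps = [cell for cell, n in grid.items() if n == 0]
# gaps == [("brand_b", "2021-02")]
```

Counting only the observed pairs would hide the missing cell; enumerating the full grid is what makes the gap visible.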

Both charts are generated by self-contained scripts that read the scraped parquet files directly; if the files aren't present they fall back to a clearly-labelled synthetic series, so the plots reproduce from a clean checkout.

Why this shape keeps coming back

Strip away the specifics and this is a pattern you'll meet again any time you pull data from a source that wasn't built to give it to you:

Make the job idempotent. Use the output as the checkpoint so a restart resumes instead of redoing. This is the single highest-leverage decision in a long scrape.
Wait for conditions, not clocks. Explicit waits are faster and more robust than fixed sleeps, and their timeouts double as your "the page is wedged" signal.
Retry the transient, bound the attempts. Rebuild the worker on each retry, cap the total, and fail loudly rather than hanging.
Land tidy, columnar data. Do the string-to-type cleaning once, store it as parquet, and every later analysis is a one-liner.

I used exactly this approach to assemble a multi-year history of the public FIPE vehicle price table — tens of thousands of rows scraped from a legacy ASP site that timed out constantly — into a set of clean monthly parquet files I could query in seconds. The site fought back the whole way; the checkpointing meant it never actually won.

References and further reading

Selenium — Waiting Strategies — official docs on explicit waits (WebDriverWait, expected conditions) vs. implicit waits
tenacity — general-purpose Python retry library with bounded attempts, exponential backoff, and jitter
pandas read_parquet — reading Parquet into a DataFrame, including column projection
pandas read_html — parsing HTML tables straight into DataFrames
Apache Parquet documentation — the columnar storage format and why it reads fast for column-subset queries
Hadley Wickham, "Tidy Data" — Journal of Statistical Software 59(10), the canonical definition of tidy data
FIPE — Tabela de preços médios de veículos — the public Brazilian vehicle price table this data comes from

Running LLM-Generated Code Without Getting Burned

Vagner Bessa — Thu, 25 Jun 2026 11:41:57 +0000

Language models are good at writing code. Ask one to compute a correlation, reshape a dataset, or plot two columns against each other, and it will happily produce a few lines of Python that do exactly that. What it can't do on its own is run that code, look at the result, and use it to answer your question. Closing that loop — letting a model write code, execute it, and read the output back — is what turns a chatbot into something that can actually do data analysis.

It's also where things get dangerous. The moment you execute text a model generated, you're running untrusted code on your machine. This post is about how to do that without handing an attacker (or a confused model) the keys to your server.

Why running model-written code is dangerous

The problem isn't that models are malicious. It's that "run this Python" is an enormous capability, and a model can be steered into misusing it — by a prompt injection hidden in a document it's analyzing, by a jailbreak, or simply by hallucinating something destructive. Once arbitrary code runs in your process, it can:

Read secrets — environment variables, API keys, the contents of nearby files, your database credentials.
Reach the network — exfiltrate data to a remote host, or pull down a second-stage payload.
Exhaust resources — an infinite loop or a runaway allocation that takes the host down.
Escape into the host — delete files, spawn processes, modify the system.

So the design goal is narrow and specific: allow general computation while denying general capability. You want the model to be able to run numpy and matplotlib, but not to open a socket, read /etc/passwd, or fork-bomb the box.

The isolation spectrum

There's no single "sandbox" primitive. There's a spectrum, trading strength for cost and complexity:

In-process restriction (e.g. RestrictedPython) rewrites or limits what Python code can do. It's lightweight but leaky — Python's introspection makes airtight in-process sandboxing notoriously hard. Treat it as a speed bump, not a wall.
OS-level isolation — Linux namespaces, cgroups, and seccomp filters confine a process's view of the filesystem, network, and syscalls. This is what containers are built on, and what Anthropic's sandbox-runtime applies without a full container.
Containers (Docker, Podman) bundle that isolation into a disposable unit with its own filesystem and resource limits. The pragmatic default for most teams.
MicroVMs (Firecracker) and gVisor (gvisor.dev) add a hardware-virtualization or kernel-emulation boundary that a plain container can't offer — the standard choice when you're running other people's untrusted code at scale.
WebAssembly (Pyodide) runs Python compiled to WASM with no host filesystem or network by default — strong isolation, at the cost of a constrained runtime.

For most applications, a disposable container hits the sweet spot: strong enough, cheap enough, and easy to reason about.
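To make the "speed bump, not a wall" point about in-process restriction concrete, here is a minimal sketch of the naive version of that idea: running a snippet with exec and an emptied __builtins__. This is an illustration of a home-grown restriction, not of RestrictedPython itself, which does considerably more. The names restricted_globals and snippet are mine.

```python
# Naive "sandbox": execute the snippet with no builtins available.
# Assumption: this stands in for a home-grown exec() restriction,
# not for RestrictedPython, which applies additional guards.
restricted_globals = {"__builtins__": {}}

# The snippet imports nothing and calls no builtin, yet it walks from an
# empty tuple up to `object` and lists every class loaded in the process.
snippet = "found = [cls.__name__ for cls in ().__class__.__base__.__subclasses__()]"

exec(snippet, restricted_globals)
found = restricted_globals["found"]

# A non-empty list means the snippet reached host classes despite empty builtins.
print(len(found) > 0)
```

Removing names does not remove reachability: anything the interpreter has loaded is still one attribute chain away. That is why the stronger options on this spectrum put the boundary outside the Python process.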

A minimal Docker sandbox

You don't need a framework to get started. Here's the shape of a locked-down container using the Docker SDK for Python. Everything from network_disabled down to cap_drop is doing security work:

import docker

client = docker.from_env()

def run_untrusted(code: str) -> str:
    return client.containers.run(
        image="python:3.12-slim",
        command=["python", "-c", code],   # the model's snippet
        network_disabled=True,            # no network egress at all
        mem_limit="256m",                 # cap memory
        pids_limit=128,                   # cap process count (anti fork-bomb)
        read_only=True,                   # read-only root filesystem
        cap_drop=["ALL"],                 # drop every Linux capability
        remove=True,                      # discard the container afterwards
        stderr=True,                      # fold stderr into the returned output
    ).decode()

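The container is created for one snippet and thrown away. It has no network, a hard memory ceiling, a capped process count, a read-only root, and no Linux capabilities. If the model writes something hostile, the blast radius is a throwaway container with nowhere to go and nothing to steal. Two things this sketch leaves to the caller: containers.run raises docker.errors.ContainerError when the snippet exits non-zero, and nothing here bounds wall-clock time, so you still need to catch that error and enforce a timeout.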

If you'd rather not hand-roll this, llm-sandbox wraps the same idea in a small API, with Docker, Podman, and Kubernetes backends:

from llm_sandbox import SandboxSession

with SandboxSession(lang="python", keep_template=True) as session:
    result = session.run(
        "import numpy as np; print(np.mean([1, 2, 3]))",
        libraries=["numpy"],
    )
    print(result.stdout)  # "2.0"

Capturing results, including plots

Running code is only half of it — you need the output back in a form the model and the user can use. stdout and stderr are easy. Plots are the interesting part: a data agent that can't show a chart isn't much of a data agent.

The trick is to run the snippet inside a session that captures matplotlib figures and hands them back as images. llm-sandbox does this when plotting is enabled — the model writes ordinary plotting code and calls plt.show(), and the infrastructure turns the figure into a base64-encoded PNG you can stream into the chat. Conceptually:

# inside the sandbox session, with artifact capture turned on
result = session.run(
    "import matplotlib.pyplot as plt\n"
    "plt.plot([1, 2, 3], [2, 4, 6]); plt.show()"
)
for image in result.plots:        # captured figures
    save_png(image.content_base64)  # save_png: placeholder for your own decode-and-store helper

The model never has to know about your capture mechanism. It writes normal code; the harness handles turning a figure into a displayable artifact.
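On the receiving side, turning a captured figure into a file is a few lines. The sketch below is one possible shape for the save_png helper used above. It is my own illustration, not part of llm-sandbox, and it assumes the payload is a base64-encoded PNG, as described. The stand-in payload exists only so the example runs without a sandbox.

```python
import base64
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def save_png(content_base64: str, path: Path) -> Path:
    """Decode a base64 payload, check it looks like a PNG, and write it to disk."""
    data = base64.b64decode(content_base64)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded payload is not a PNG")
    path.write_bytes(data)
    return path


# Stand-in payload: a real capture would come from the sandbox result instead.
fake_plot = base64.b64encode(PNG_SIGNATURE + b"stand-in image bytes").decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    written = save_png(fake_plot, Path(tmp) / "chart.png")
    saved_size = written.stat().st_size

print(saved_size)
```

Checking the signature before writing is a cheap sanity check: whatever comes out of the sandbox is still untrusted output, so it is worth confirming it is the kind of artifact you expected before you pass it on.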

Defense in depth

The container is the hard boundary, but it shouldn't be the only one. A small amount of belt-and-suspenders pays off:

Constrain the model with the prompt. Tell it which libraries are allowed and what's off-limits. This keeps it inside the lines before the container would have to stop it:

  You may use only: pandas, numpy, matplotlib, seaborn, scipy.
  Never import os, subprocess, socket, or requests.
  Do not read or write files outside the working directory,
  and never make network calls.

Allowlist libraries, don't blocklist them. Decide what can be installed; reject everything else.
Cap code length and execution time. A wall-clock timeout on each run prevents a clever or accidental hang. (Anthropic's hosted code-execution tool, for instance, enforces a per-cell time limit and returns a timeout result rather than blocking — see the code execution tool docs.)
Degrade gracefully. If execution isn't available — the feature is off, or there's no Docker socket — return a friendly error instead of crashing the agent. The model should be able to keep answering questions even when it can't run code.

None of these replace the sandbox. They reduce how often it has to do its job, and they shrink the surface the model can probe.
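As an illustration of the allowlist and length-cap items, here is a minimal pre-check that runs before a snippet is handed to the sandbox. The allowlist mirrors the prompt above; MAX_CODE_CHARS and the precheck function are hypothetical names and values of my own. A static check like this is easy to bypass with dynamic imports, so treat it as a filter for honest mistakes, with the container remaining the real boundary.

```python
import ast

ALLOWED_MODULES = {"pandas", "numpy", "matplotlib", "seaborn", "scipy"}
MAX_CODE_CHARS = 4_000  # hypothetical cap; tune for your workload


def precheck(code: str) -> list[str]:
    """Return a list of problems; an empty list means the snippet may go to the sandbox."""
    if len(code) > MAX_CODE_CHARS:
        return [f"snippet is {len(code)} characters; limit is {MAX_CODE_CHARS}"]
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports are rejected outright; absolute ones are checked by name.
            modules = ["<relative import>"] if node.level else [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]  # "matplotlib.pyplot" -> "matplotlib"
            if top_level not in ALLOWED_MODULES:
                problems.append(f"import of '{module}' is not on the allowlist")
    return problems


print(precheck("import numpy as np"))
print(precheck("import os"))
```

Returning the problems as text, instead of raising, makes it easy to feed them back to the model so it can rewrite the snippet and try again.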

Don't want to run your own? Use a managed sandbox

Running disposable containers in production — pooling, scaling, cleaning up — is real work. Several services exist specifically to take it off your hands:

E2B runs AI-generated code in Firecracker microVMs with a code-interpreter SDK; sandboxes start in well under a second.
Modal offers serverless sandboxes you can spin up per request.
Model providers ship their own server-side execution. Anthropic's code execution tool and OpenAI's Code Interpreter both run the model's code in a hosted sandbox and return results and files — no infrastructure on your side at all.

A nice property of building behind a small execute(code) interface is that the backend becomes a swappable detail: run a local container in development, delegate to a provider's sandbox in production, and the agent code barely changes.
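Here is a rough sketch of what that small interface could look like. All of the names (ExecutionResult, ExecutionBackend, UnavailableBackend, EchoBackend, answer_with_code) are hypothetical, and EchoBackend is only a test double that never runs anything. A real backend would wrap a local container or a provider's sandbox behind the same execute method.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ExecutionResult:
    ok: bool
    stdout: str = ""
    error: str = ""


class ExecutionBackend(Protocol):
    """Anything with an execute(code) method can serve as a backend."""

    def execute(self, code: str) -> ExecutionResult: ...


class UnavailableBackend:
    """Used when code execution is switched off or no sandbox can be reached."""

    def execute(self, code: str) -> ExecutionResult:
        return ExecutionResult(ok=False, error="Code execution is not available right now.")


class EchoBackend:
    """Test double: pretends to run code without executing anything."""

    def execute(self, code: str) -> ExecutionResult:
        return ExecutionResult(ok=True, stdout=f"(would run {len(code)} characters of code)")


def answer_with_code(backend: ExecutionBackend, code: str) -> str:
    """Agent-side call site: identical whichever backend is plugged in."""
    result = backend.execute(code)
    if result.ok:
        return result.stdout
    return f"I couldn't run code for this answer: {result.error}"


print(answer_with_code(EchoBackend(), "print(1)"))
print(answer_with_code(UnavailableBackend(), "print(1)"))
```

The unavailable case is just another backend, which is also how the "degrade gracefully" advice falls out for free: the agent gets a normal result with a readable error instead of an exception.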

Keeping it fast: a warm pool

One practical wrinkle: cold-starting a container with pandas, numpy, matplotlib, and friends adds seconds of latency to every request. The fix is a pre-warmed pool — create a handful of ready containers at startup, hand one out per request, and reclaim idle ones when traffic drops. You trade a little idle memory for a much snappier interaction, which matters a lot when a user is waiting on a chart.
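The pool itself can be small. Below is a minimal, backend-agnostic sketch: WarmPool and fake_sandbox_factory are hypothetical names, the factory stands in for whatever actually creates a container, and reclaiming idle sandboxes when traffic drops is left out to keep it short. Sandboxes are treated as single-use here, matching the throwaway-container model above.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keeps up to `size` pre-created sandboxes ready to hand out."""

    def __init__(self, factory: Callable[[], T], size: int) -> None:
        self._factory = factory
        self._ready: "queue.Queue[T]" = queue.Queue(maxsize=size)
        for _ in range(size):  # pay the cold-start cost once, at startup
            self._ready.put(factory())

    def acquire(self) -> T:
        """Return a warm sandbox if one is ready, otherwise cold-start one."""
        try:
            return self._ready.get_nowait()
        except queue.Empty:
            return self._factory()

    def refill(self) -> None:
        """Top the pool back up; intended to run off the request path."""
        while not self._ready.full():
            try:
                self._ready.put_nowait(self._factory())
            except queue.Full:  # another thread filled the last slot first
                break

    def ready_count(self) -> int:
        return self._ready.qsize()


created: list[str] = []


def fake_sandbox_factory() -> str:
    """Stand-in for real container creation; returns a label instead."""
    created.append(f"sandbox-{len(created) + 1}")
    return created[-1]
```

In use, you would build the pool once at startup, call acquire per request, and call refill from a background task so the next request finds a warm sandbox waiting. Falling back to a cold start when the pool is empty keeps a burst of traffic slow instead of failing.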

Why this is worth getting right

This is a recurring shape in modern AI engineering: give a model real computational power without giving it dangerous reach. The durable answers are the same ones that show up across every implementation — disposable containers for isolation, a warm pool for latency, an allowlist plus prompt guardrails for defense in depth, and a backend abstraction so you can run locally or in the cloud without rewriting the agent.

I used exactly this approach to add a code-execution data agent to an internal analytics application, letting it answer open-ended questions with custom charts while staying safely contained. The specifics differ from project to project, but the principles travel — and they're worth internalizing before you wire a language model up to a Python interpreter.

References and further reading

llm-sandbox — lightweight Python sandbox runtime (Docker / Podman / Kubernetes) · docs
E2B — Firecracker-backed sandboxes for AI agents · docs
Firecracker — lightweight microVMs (the AWS Lambda / E2B substrate)
gVisor — a user-space kernel for stronger container isolation
Pyodide — CPython compiled to WebAssembly
RestrictedPython — in-process Python restriction (a speed bump, not a wall)
Anthropic sandbox-runtime — OS-level filesystem/network restriction without a container
Anthropic code execution tool and OpenAI Code Interpreter — hosted, provider-run sandboxes