Divy Yadav

Posted on May 27

I Built a Stateful Research Agent Inside a Sandbox. Here's What the Numbers Actually Looked Like.

#python #devops #ai #automation

Three steps into a multi-page research task, the agent lost everything.

Not a crash. Not a thrown exception.

The function returned, context reset, and the pricing data it had just collected vanished.

This failure is predictable: stateless execution environments were never built to hold state across browser sessions that run for twenty minutes.

You hit it eventually, usually at the worst moment.

The two standard workarounds are both annoying. Stuffing state into the prompt works until token costs become an issue. An external state store solves the problem, but now you are maintaining another service.

I had been using E2B for short-lived code execution. It handles that well, and they have added persistence features over time, including early-stage snapshot support. But for agents that need to pause mid-task and resume from a different process, state management is still mostly on you.

Someone in my Discord mentioned TensorLake. I opened the docs and decided to build against this specific problem.

In this article, I will walk you through building a stateful research agent that drives a desktop inside a sandbox.

Let's start with setting up.

Setup

What caught my attention first: named sandboxes with suspend() and resume() that preserve the full VM state. Not just files, but running processes and open browser sessions. Sub-second resume, according to their docs.

Ten minutes from zero to running:

pip install tensorlake
tl login   # or TENSORLAKE_API_KEY env var

Free tier, no credit card.

from tensorlake.sandbox import Sandbox

sandbox = Sandbox.create(
    name="research-agent",
    cpus=2.0,
    memory_mb=4096,
    secret_names=["OPENAI_API_KEY"],
    image="tensorlake/ubuntu-vnc",
)

The tensorlake/ubuntu-vnc image is what gives you a real desktop and Firefox inside the VM. You need an actual browser because modern pricing pages heavily use client-side rendering and bot detection that stops headless scrapers cold. Firefox inside a sandbox just looks like a person browsing.

Important: Playwright is not pre-installed in ubuntu-vnc. Install it before the agent runs:

sandbox.run("pip", ["install", "playwright"])
sandbox.run("playwright", ["install", "chromium"])

Two to three minutes on first setup. After that, packages persist across suspend/resume so you pay the cost once.

Latency: What I Actually Measured

The first sandbox reached status running roughly 800-900ms after the Sandbox.create() call.

Here is where time actually goes:

Sandbox creation:        ~800ms          (named sandbox, first time)
Sandbox resume:          ~400ms          (from suspended state)
LLM call (GPT-4o):       2,000-4,000ms   (per step, dominates everything)
Browser screenshot:      ~300ms          (capture + transfer)
Page load in sandbox:    1,000-2,000ms   (varies by site)
File read/write:         <50ms           (block-based storage)
Sandbox suspend:         ~200ms

The LLM calls dominate by a large margin. Sandbox overhead is not the bottleneck. The main optimization is batching browser operations before each model call rather than interleaving individual round trips.
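To make the batching argument concrete, here is a rough latency model built from the midpoints of the measurements above. The function names and the batch size are my own illustration, and it assumes the agent can load several pages before it needs a model decision, which is not true of every task.

```python
# Rough per-step latency model using midpoints of the measurements above.
LLM_CALL_MS = 3000      # midpoint of 2,000-4,000ms
PAGE_LOAD_MS = 1500     # midpoint of 1,000-2,000ms
SCREENSHOT_MS = 300


def interleaved_ms(pages: int) -> int:
    """One model call after every page load and screenshot."""
    return pages * (PAGE_LOAD_MS + SCREENSHOT_MS + LLM_CALL_MS)


def batched_ms(pages: int, batch_size: int) -> int:
    """Load and capture a batch of pages, then make one model call per batch."""
    batches = -(-pages // batch_size)  # ceiling division
    return pages * (PAGE_LOAD_MS + SCREENSHOT_MS) + batches * LLM_CALL_MS


if __name__ == "__main__":
    print(interleaved_ms(6))  # 28800
    print(batched_ms(6, 3))   # 16800
```

For six pages, batching three pages per model call cuts the modeled total from about 29 seconds to about 17. Sandbox create and resume times do not appear in the model at all, because they are paid once per session rather than per step.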

Tensorlake publishes a SQLite filesystem benchmark claiming 1.6-1.9x faster I/O than E2B and Modal. Self-reported numbers. I could not independently verify them. What I can say is that the block-based storage felt responsive for frequent small writes, which is exactly the pattern a research agent uses when checkpointing after every step.

Computer Use: What Worked and What Didn't

The desktop API itself is clean:

with sandbox.connect_desktop(password="tensorlake") as desktop:
    png_bytes = desktop.screenshot()
    desktop.move_mouse(640, 400)
    desktop.click()
    desktop.type_text("pinecone.io")
    desktop.press("Return")

Screenshot as PNG bytes, decode it, figure out where to click, send coordinates. Each browser interaction takes 1-3 seconds depending on page load. Slow compared to an API call. But it works on pages that block scrapers, because from the server's side it is just a person using Firefox.

The problem: coordinates assume a fixed layout, and layouts do not stay fixed.

Weaviate's pricing page ran an A/B test between two of my agent's steps. The toggle moved 30px down. The agent clicked empty space. No error, no exception. Just a screenshot showing nothing happened, and twenty minutes of debugging before I identified the offset.
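A silent miss like that can be caught much earlier by checking whether the screen changed after the click. The sketch below is my own helper, not part of the SDK. It only relies on the screenshot(), move_mouse() and click() calls shown above, and it assumes the screenshot does not include the mouse cursor, a clock, or an animation, any of which would make every frame look different.

```python
import hashlib
import time


def click_and_verify(desktop, x: int, y: int, settle_seconds: float = 0.5) -> bool:
    """Click at (x, y) and report whether the screen changed afterwards.

    `desktop` is any object exposing screenshot(), move_mouse() and click().
    Returns False when the before and after screenshots are byte-identical,
    which usually means the click landed on empty space.
    """
    before = hashlib.sha256(desktop.screenshot()).hexdigest()
    desktop.move_mouse(x, y)
    desktop.click()
    time.sleep(settle_seconds)  # give the page a moment to react
    after = hashlib.sha256(desktop.screenshot()).hexdigest()
    return before != after
```

A whole-frame hash is crude. Comparing only a cropped region around the target would be more robust, but even this version turns "nothing happened" into a value the agent can branch on.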

The fix: pass screenshots to GPT-4o Vision to identify element positions dynamically rather than hardcoding coordinates. Adds about 2 seconds per interaction, handles layout drift reliably. Worth it for reliability; too slow for high-frequency operations.

When the DOM is accessible, Playwright inside the sandbox is the better path:

result = sandbox.run(
    "python",
    ["-c", """
import asyncio
from playwright.async_api import async_playwright

async def get_pricing():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://pinecone.io/pricing")
        pricing_text = await page.inner_text(".pricing-section")
        print(pricing_text)
        await browser.close()

asyncio.run(get_pricing())
"""]
)

The hybrid strategy I landed on:

Situation	Approach	Why
Site with bot detection	Vision + coordinates	Playwright gets blocked
Accessible DOM	Playwright directly	Faster, no coordinate drift
Unknown or variable layout	Screenshot + GPT-4o Vision	Resolves position dynamically
High-frequency operations	Playwright only	Vision adds ~2s per call

Use vision as a fallback, not a first tool. Vision handles layout variation. Playwright handles speed. Neither does both well.
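The table collapses into a small decision function. This is a sketch of how I would encode it, with hypothetical names. The precedence is my assumption: bot detection wins, because Playwright being fast does not help if it gets blocked.

```python
from enum import Enum


class Approach(Enum):
    PLAYWRIGHT = "playwright"
    VISION = "screenshot + vision"


def choose_approach(bot_detection: bool, dom_accessible: bool,
                    high_frequency: bool = False) -> Approach:
    """Pick a browser strategy following the hybrid table above."""
    if bot_detection:
        # Playwright gets blocked, so drive the real desktop browser.
        return Approach.VISION
    if dom_accessible or high_frequency:
        # Faster, no coordinate drift, and no ~2s vision call per action.
        return Approach.PLAYWRIGHT
    # Unknown or variable layout: resolve positions from a screenshot.
    return Approach.VISION


if __name__ == "__main__":
    print(choose_approach(bot_detection=False, dom_accessible=True).value)
    print(choose_approach(bot_detection=True, dom_accessible=True).value)
```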

Statefulness: The Part That Actually Mattered

After three steps (Pinecone free tier limits noted, $70/mo Starter plan recorded, Weaviate docs started), I called sandbox.suspend().

The sandbox froze. Filesystem, memory, running browser: all paused. Twelve minutes later, from a different terminal:

sandbox = Sandbox.connect("research-agent")
sandbox.resume()

About 400ms. The Weaviate pricing tab was still open. Tensorlake's suspend/resume preserves the full VM state, including memory and running processes.

Everything written to /workspace/research_notes.json was intact.

The workflow I settled on: write state explicitly after each meaningful step, then suspend.

import json

from tensorlake.sandbox import Sandbox

# After each step, before suspending:
sandbox.write_file(
    "/workspace/state.json",
    json.dumps({
        "pinecone_pricing": pinecone_data,
        "weaviate_started": True,
        "next_url": "https://weaviate.io/pricing"
    }).encode()
)
sandbox.suspend()

# On next invocation, from any process:
sandbox = Sandbox.connect("research-agent")
sandbox.resume()
state = json.loads(bytes(sandbox.read_file("/workspace/state.json")))
# picks up from state["next_url"]

The state file is the continuity mechanism. Not elegant, but it removes the need for an external database and the filesystem is fast, durable across suspend, and readable from any reconnecting process.
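One way to make that state file drive resumption is a small step runner that records each finished step and skips it next time. This is a sketch under my own assumptions: the "completed" key, the run_steps name, and the load/save callables are illustrative, not something the SDK provides.

```python
import json


def run_steps(load, save, steps):
    """Run named steps in order, skipping any already recorded as completed.

    load()  -> bytes of the saved state (empty bytes if nothing saved yet)
    save(b) -> persists the state bytes
    steps   -> ordered list of (name, fn), where fn(state) mutates the dict
    """
    raw = load()
    state = json.loads(raw) if raw else {}
    completed = state.setdefault("completed", [])
    for name, fn in steps:
        if name in completed:
            continue  # finished in an earlier run
        fn(state)
        completed.append(name)
        # Persist only after the step succeeds, so a failure leaves
        # the last good state on disk.
        save(json.dumps(state).encode())
    return state
```

Against the sandbox, load and save would wrap the read_file and write_file calls shown above. How read_file behaves when the file does not exist yet is something to check before relying on the empty-bytes convention used here.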

Scaling and Failure Handling

Sandbox.create() is a blocking synchronous call. For parallel workloads, wrap in concurrent.futures:

from tensorlake.sandbox import Sandbox
from concurrent.futures import ThreadPoolExecutor

def research_competitor(name, url):
    sandbox = Sandbox.create(
        name=f"research-{name}",
        cpus=1.0,
        memory_mb=2048,
        secret_names=["OPENAI_API_KEY"],
        image="tensorlake/ubuntu-vnc",
    )
    # ... agent logic ...
    result = sandbox.read_file("/workspace/report.json")
    sandbox.terminate()
    return result

competitors = [
    ("pinecone", "https://pinecone.io/pricing"),
    ("weaviate", "https://weaviate.io/pricing"),
    ("qdrant", "https://qdrant.tech/pricing"),
]

with ThreadPoolExecutor(max_workers=5) as executor:
    reports = list(executor.map(lambda c: research_competitor(*c), competitors))

Three concurrent sandboxes ran without delay. I have not tested at twenty or fifty. Their docs mention hundreds per second. Treat that as an unverified claim until you have your own load data.
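One caveat with list(executor.map(...)): if any single task raises, the exception propagates when its result is reached and the other reports never make it back to the caller. A minimal sketch of collecting failures per task instead, using only the standard library. The run_all name and the convention of keying results by the first element of each job are my own.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(task, jobs, max_workers=5):
    """Run task(*job) for every job and keep successes and failures apart.

    Returns (results, errors), both keyed by the first element of each job.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(task, *job): job[0] for job in jobs}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failed sandbox should not discard the others.
                errors[name] = exc
    return results, errors
```

With the example above this would be called as run_all(research_competitor, competitors), and anything in errors can be retried on its own.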

Note: Tensorlake's Python SDK v0.5.8 introduced native async APIs that offer a cleaner alternative to threading for I/O-bound orchestration. If you are on v0.5.8 or later, those are worth reaching for before wrapping synchronous calls in a thread pool.

Patterns worth building from day one:

Idempotent state writes. Write state after each meaningful step. If the agent fails mid-run, the next invocation reads the file and skips completed work. This does not happen automatically.

Checkpoint before risky operations. sandbox.checkpoint() creates a restorable snapshot. By default, snapshots preserve the filesystem state. Preserving full memory state is supported as an explicit option. Either way, you can restore into a fresh sandbox if an operation goes wrong:

# Filesystem snapshot (default)
snapshot = sandbox.checkpoint()

try:
    agent.navigate_to_pricing_page()
except Exception:
    # Restore filesystem state into a new sandbox
    sandbox = Sandbox.create(snapshot_id=snapshot.snapshot_id)

Named sandboxes. If the orchestration process dies, any other process reconnects with Sandbox.connect("sandbox-name") and resumes from the last written state.

Architectural boundary: Tensorlake provides the execution environment and runtime for agents: the VM, the filesystem, the process lifecycle, the networking. It is not an agent framework. Retry logic, circuit breakers, and LLM rate-limit backoff belong in the orchestration layer above it: LangChain, LlamaIndex, a custom harness, or whatever you are using to drive the agent. That separation is deliberate, not a gap.
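As an illustration of what lives on the orchestration side of that boundary, here is a minimal retry helper with exponential backoff and jitter. It is a generic sketch, not anything from Tensorlake or an agent framework. The parameter names are mine, and in real use you would catch only your LLM client's rate-limit error rather than every exception.

```python
import random
import time


def call_with_backoff(fn, retries=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from parallel agents.
            sleep(delay + random.uniform(0, delay / 2))
```

The sleep argument is injectable so the helper can be tested without actually waiting.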

The Mental Model

The part that shifted how I thought about the design:

┌─────────────────────────────────────────────┐
│                 Your Agent                   │
│    (LLM + tool calling logic)                │
└──────────────────┬──────────────────────────┘
                   │ tool calls
┌──────────────────▼──────────────────────────┐
│           Tensorlake Sandbox                 │
│  ┌──────────────────────────────────────┐   │
│  │ State Layer: /workspace filesystem   │   │
│  │  state.json, research_notes.json     │   │
│  └──────────────────────────────────────┘   │
│  ┌──────────────────────────────────────┐   │
│  │ Execution Layer: processes, scripts  │   │
│  └──────────────────────────────────────┘   │
│  ┌──────────────────────────────────────┐   │
│  │ Computer Use: VNC, screenshots, mouse│   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

The sandbox is not the agent. It is the stable environment the agent operates in. When it resumes, the environment is exactly where the agent left it. The agent's logic lives outside and reconnects to a world that did not reset.

That changes what you can build. An agent that runs for an hour, navigates fifteen pages, and writes a structured report is feasible when the execution environment outlasts the orchestration session. With purely ephemeral execution, it is not.

How It Compares

vs E2B:

Both use Firecracker microVMs. E2B markets sub-200ms cold starts; community reports put real-world p50 closer to 400-600ms. Tensorlake named sandbox creation was ~800ms in my testing.
E2B has added snapshot and pause-resume in recent releases. The statefulness gap is narrower than a year ago. Tensorlake's suspend/resume preserves the full running VM state, including open processes and browser sessions, and resumes in under a second. E2B's memory snapshot support is still described as early-stage.
Tensorlake claims 1.6-1.9x faster filesystem I/O on their own benchmarks. Self-reported. For an independent reference: Tensorlake recently ranked top 2 across all three categories in the ComputeSDK sandbox benchmarks.
Neither provides DOM-level element selection at the SDK layer.

vs Modal:

Modal uses gVisor rather than Firecracker and is designed around stateless function execution. Stateful long-running agents work but need more setup. Cold starts are around 1-1.5 seconds per their docs.

vs Stagehand (BrowserBase):

Stagehand has DOM-level selectors (CSS, XPath, natural language) via locator(). For pure browser automation, this is a real ergonomic advantage.
Tensorlake gives you a full VM. Code execution, file management, package installs, and browser use in the same environment. If that combination is what you need, the full VM model is worth the coordinate complexity.
Browser automation only? Stagehand is the more focused tool.

To see which sandboxes you have and their current status:

from tensorlake.sandbox import SandboxClient

client = SandboxClient()

for sb in client.list():
    print(sb.sandbox_id, sb.status)

What the Build Produced

By the end of the session, the agent had produced the comparison: Pinecone versus Weaviate pricing, extracted across seven pages, with notes preserved across two suspensions and a full restart of the orchestrating machine.

report_bytes = sandbox.read_file("/workspace/comparison_report.md")
print(bytes(report_bytes).decode("utf-8"))

Accurate. Correct tier names and numbers.

Tensorlake did not solve the hard parts: the retrieval logic, state schema, hybrid browser strategy. It stayed out of the way while those got built. Most of the infrastructure friction came down to state management, and most of that went away once the sandbox filesystem became the state store.

Three Things to Know Before You Start

Speed is a systems problem, not a sandbox problem. LLM calls account for the bulk of per-step latency. Optimize by batching browser operations before each model call, not by chasing sandbox startup time.

Design for interruption from day one. Write state after every meaningful step. Not because the sandbox will crash, but because resuming from a different process after an unexpected interruption is a real scenario, not an edge case.

Computer use is a primitive. The coordinate-based API works, but layout drift will break hardcoded positions. Use Playwright when the DOM is accessible. Fall back to vision when you need a real browser session. Do not automate full workflows with raw coordinates.

Is the sandbox infrastructure production-ready? Yes. Suspend/resume held up, filesystem persistence was consistent, and Firecracker isolation did what it was supposed to.

Is the computer use layer production-ready? Not without additional engineering. The raw coordinate API is a reasonable primitive, but element resolution needs to be built on top of it. A vision-backed click_element() in the SDK would change the story significantly. Until then, budget the time to build that layer yourself.

Worth using? Yes, if you go in with clear expectations about what the platform handles and what it leaves to you. That boundary is sharper than most, which makes it easier to work with once you have internalized it.

The complete project is on my GitHub:

click_here