DEV Community: Divy Yadav

OpenCode Is Powerful. That's Exactly the Problem.

Divy Yadav — Tue, 21 Jul 2026 12:31:40 +0000

OpenCode is the free, open source alternative to Claude Code, with full shell access to match. Here's how I ran it safely using a sandbox instead of my laptop.

OpenCode is an open-source coding agent with a workflow similar to Claude Code.

You type what you want, and it reads your codebase, edits files, and runs shell commands on its own. Same basic idea as Claude Code. Free though, open source, and you're not locked into one model.

Which is great, until you actually sit with what "runs shell commands on its own" means.

The first time you hand a terminal agent that kind of access, there's one question you can't quite shake: what happens the moment it runs something you didn't expect?

Most people pick a folder they don't care about and hope for the best. Or they don't think about it at all until something breaks.

I didn't love either option, so before I let OpenCode near anything that mattered, I went looking for a third one.

That's how I ended up moving the agent's shell into a disposable sandbox from Tensorlake, which runs these sandboxes as a cloud service, while leaving everything else right where it was on my laptop.

Your Laptop
      │
      ▼
   OpenCode
      │
      ▼
Tensorlake Plugin
      │
      ▼
 Disposable Linux Sandbox

Here's what that looked like, and what actually surprised me once I had it running.

What I Kept Picturing Before I Typed Anything

A coding agent's real risk has nothing to do with intelligence. Most of the time it gets things right. What it doesn't do is pause. It doesn't stop to double-check whether the command it's about to fire off is the one you actually meant to approve.

One scenario kept nagging at me before I typed anything. I ask it to clean up build artifacts, and it runs rm -rf ./build. Except ./build turns out to be a symlink into somewhere I still needed, the kind of thing an agent skims right past, and honestly, the kind of thing I don't always catch either when I'm moving fast.

Or I ask it to install a dependency. npm install fires a postinstall script. That script rewrites a config file I never agreed to touch. I've seen postinstall scripts do weirder things than that with a human sitting right there watching.

Neither one is a bug, to be honest.

That's just what shell access does when nothing stands between a command and your machine.

Then I thought bigger than my own laptop. Same agent, but now it's refactoring a production monorepo that ten other engineers are actively pushing commits to. Same misread symlink. Same rogue postinstall script. Except now it's not just my afternoon on the line.

bash, write, edit. These aren't agent-specific features. They're shell and filesystem access, handed to a process making its own calls about what to run next.

The instinct is to fix this with more caution. Review every diff. Approve every command. That works right up until it doesn't, because the entire point of an autonomous agent is that you eventually stop reviewing every single step.

The real fix turned out to be a different blast radius.

Put the agent's commands somewhere disposable instead of on my machine, and now a bad command won't harm my system.

Running locally isn't wrong, to be clear. For a throwaway project, or a workflow you already trust, it's exactly what you want. What changes the math is pointing autonomous shell access at something a mistake would actually cost you.

Brain Local, Hands in a Sandbox

One table covers the whole shift. The wiring behind it takes a little longer to explain.

	Local OpenCode	OpenCode + Tensorlake
Commands run	On your laptop	Remotely, in a sandbox
Filesystem	Your actual filesystem	Isolated, disposable
Dependencies	Whatever's on your machine	Disposable environment
A bad command affects	Your machine	The sandbox, not you

That's the outcome. The wiring is simpler than it sounds.

Tensorlake ships a plugin called tensorlake-opencode. The plugin doesn't replace OpenCode. It intercepts specific tool calls and reroutes them. Once I actually understood that distinction, the idea turned out simpler than I expected.

OpenCode on your laptop, tools in a sandbox, that's the actual shape of it. The OpenCode harness itself, the interface, your session, all of that keeps running on your machine exactly like before.

The model call still goes out to whichever provider you've configured, Anthropic, OpenAI, whoever, the same as it always did; that part was never local to begin with, and this plugin doesn't change it.

What changed was where each individual tool call landed:

                Your Laptop
                     │
                     ▼
          OpenCode Harness
                     │
      ┌──────────────┴──────────────┐
      ▼                             ▼
Model Provider              Tensorlake Sandbox
(Claude/OpenAI/etc.)      Shell + Filesystem

The model call still goes wherever you configured your provider. Only the tool calls, the hands, move into the sandbox.

webfetch and websearch are the two exceptions that stay local, since neither touches a filesystem. bash, write, edit, read, ls, glob, and grep are the ones that get rerouted.
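A minimal sketch of that split, written as a plain lookup. It only mirrors the routing described above; it is not the plugin's source, and the name route_tool_call and the choice to reject unlisted tools are my own assumptions.

```python
# Illustrative only: mirrors the routing described in the article,
# not the tensorlake-opencode implementation.
SANDBOXED_TOOLS = frozenset({"bash", "write", "edit", "read", "ls", "glob", "grep"})
LOCAL_TOOLS = frozenset({"webfetch", "websearch"})


def route_tool_call(tool_name: str) -> str:
    """Return where a tool call would run: 'sandbox' or 'local'."""
    if tool_name in SANDBOXED_TOOLS:
        return "sandbox"
    if tool_name in LOCAL_TOOLS:
        return "local"
    # Assumption: how the real plugin treats unlisted tools is not documented here.
    raise ValueError(f"unknown tool: {tool_name}")


if __name__ == "__main__":
    for name in ("bash", "grep", "webfetch"):
        print(f"{name} -> {route_tool_call(name)}")
```

Anything that touches a shell or a filesystem lands in the first set; the two web tools stay in the second.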

Every intercepted command now makes a network round trip instead of running instantly on my machine. Tensorlake's documentation says a sandbox starts up in a few seconds, with the underlying VM image itself booting in hundreds of milliseconds. Their GitHub page and product site separately claim that resuming from a suspended state also takes under a second. My own first sandbox, the one the plugin spun up automatically on my first uname -a call (covered below), took 2.3 seconds end to end, per the timestamp the plugin logged, which fits comfortably inside what the docs describe. That's likely the plugin's own provisioning and connection overhead stacked on top of the raw VM boot, not just the VM starting up. Either way, it's not something you sit around waiting on.

Setting It Up

The whole setup turned out to be one config entry and one environment variable, typed in this order.

First, the plugin, added to ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["tensorlake-opencode"]
}

OpenCode installs it automatically, no separate npm install needed. Then the API key, exported in the same shell I was about to launch OpenCode from:

export TENSORLAKE_API_KEY=your_api_key_here
opencode

Nothing happened. No sandbox spun up on launch, which threw me for a second. First instinct was that I'd messed up the config somehow, and I almost went back to check the JSON for a typo before actually rereading what I'd just set up: lazy creation, not eager. Tailing ~/.local/share/opencode/log/tensorlake.log just confirmed the plugin had loaded, nothing more, until I actually asked the agent to do something that needed a sandbox.

The First Real Test

Sandbox creation in this plugin is lazy. It waits for the first tool call that actually needs the filesystem or a shell, not the moment you launch the session.
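Lazy creation is easier to see in code than in prose. This is a rough sketch of the pattern, not the plugin's session manager: LazySandboxSession and the create_sandbox factory are hypothetical names, and the factory stands in for whatever actually provisions a sandbox.

```python
from typing import Callable, Optional


class LazySandboxSession:
    """Creates the sandbox on the first tool call, not when the session starts."""

    def __init__(self, create_sandbox: Callable[[], str]) -> None:
        # Hypothetical factory: returns an identifier for a newly created sandbox.
        self._create_sandbox = create_sandbox
        self._sandbox_id: Optional[str] = None

    @property
    def started(self) -> bool:
        return self._sandbox_id is not None

    def run_tool(self, command: str) -> str:
        # The first call pays the creation cost; later calls reuse the sandbox.
        if self._sandbox_id is None:
            self._sandbox_id = self._create_sandbox()
        return f"[{self._sandbox_id}] {command}"


if __name__ == "__main__":
    session = LazySandboxSession(lambda: "sandbox-1")
    print(session.started)  # nothing created at launch
    print(session.run_tool("uname -a"))
    print(session.started)
```

Launching the session does nothing visible, which is exactly what confused me at first. The sandbox only exists once a tool call needs it.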

I didn't want to point it at anything that mattered yet, so the first command I actually gave it was about as low-stakes as they come:

Run: uname -a

That single bash call was what triggered everything. A "Sandbox created" toast showed up in the terminal, and I had the log tailing in a second window the whole time. It read almost exactly what the docs describe: a line saying a new sandbox was being created for the session, then a second line confirming it was live, at 2.3 seconds.

The output said Linux. My laptop runs macOS. Not going to lie, that one word convinced me faster than any architecture diagram would have.

Next I asked it to prove the filesystem side too:

Write "Hello Tensorlake" to /tmp/workspace/test.txt, then read it back.

The write and read both happened entirely inside the sandbox, confirming that filesystem operations were isolated from my machine.

/tmp/workspace was the agent's working directory inside the sandbox, not a folder anywhere on my disk. Nothing in either test would have cost me anything if it had gone wrong.

With the sandbox working, I wanted to move beyond smoke tests and verify the full developer workflow on a real repository.

Clone https://github.com/benjaminp/six.git, find the ensure_str function, 
improve its TypeError message so it names the function it came from, 
then run the test suite and tell me if anything broke.

This was intentionally a tiny change—the goal wasn't to contribute a feature, but to verify that OpenCode could complete the same edit–test loop I'd expect during normal development.

six is a small, well-known Python compatibility library: one source file, a straightforward test suite, and just enough structure to exercise a realistic edit–test workflow.

OpenCode cloned the repository, located ensure_str, modified the function, and ran the test suite entirely inside the sandbox.

The original implementation raised TypeError("not expecting type '%s'" % type(s)).

The edit made it TypeError("ensure_str: not expecting type '%s'" % type(s))—a deliberately small edit that was easy to verify with the existing test suite. Because the tests only verify that a TypeError is raised rather than checking the exact message text, this made for a safe, minimal edit, and the run confirmed it: 184 passed, 16 skipped, 0 failed.

Same as before the change.

That's the part that mattered to me. Not the specific repo or the specific one-line fix, but that a clone, a real edit, and a full test run all happened inside the sandbox, on an actual project, without me once worrying about what would happen to my own machine if something in that chain went sideways.
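Why the one-line edit was safe can be shown with a small stand-in. This is a simplified Python 3 version of ensure_str written for illustration, not the library's full code, and raises_type_error is a hypothetical helper that imitates a test checking only the exception type.

```python
def ensure_str(s):
    """Simplified Python 3 stand-in for six.ensure_str (not the library's code)."""
    if isinstance(s, str):
        return s
    if isinstance(s, bytes):
        return s.decode("utf-8")
    # The edited message: the function name is now part of the error text.
    raise TypeError("ensure_str: not expecting type '%s'" % type(s))


def raises_type_error(value) -> bool:
    """Imitate a test that asserts on the exception type, not the message."""
    try:
        ensure_str(value)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    print(raises_type_error(42))
    print(ensure_str(b"abc"))
```

A check of this kind passes with either message, which is why the pass, skip, and fail counts did not move.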

Three Things That Clicked Once It Was Running

I expected the isolation. The other two took me by surprise.

Isolation. A bad command genuinely has nowhere real to land. A runaway install, an rm aimed at the wrong path, a dependency that half-installs and leaves things broken. All of it happens in a sandbox I can throw away, not my actual working tree.

Reproducibility mattered more once I pictured someone else on the team running this same setup. Whatever's actually installed on my laptop stops being relevant, because the agent isn't running on my laptop anymore. Register one image with the right toolchain baked in, point every future session at it, and the "works on my machine" conversation just stops happening.

Then there's persistence, the one I hadn't planned for. A named sandbox doesn't vanish when OpenCode restarts. It just sits there, parked. Come back later and the same working directory, the same installed packages, the same warm caches are all still exactly where I left them. Nothing rebuilt, nothing reinstalled.

A disposable CI container vanishes the second a run finishes. This one was still there the next day, same as I'd left it.

Took me an hour of lost installed state before that actually registered.

Configuring the Sandbox for Real Work

Once I trusted it with something more than a test file, I went back and actually sized it properly. The plugin reads a small set of environment variables once, at the moment the first sandbox gets created, so they need to be set before you launch OpenCode, not after:

export TENSORLAKE_CPUS=4
export TENSORLAKE_MEMORY_MB=8192
export TENSORLAKE_DISK_MB=20480
export TENSORLAKE_IMAGE=my-custom-image

Variable	Default	Controls
`TENSORLAKE_CPUS`	2	vCPUs
`TENSORLAKE_MEMORY_MB`	4096	RAM in MB
`TENSORLAKE_DISK_MB`	10240	Disk in MB
`TENSORLAKE_IMAGE`	platform default	The image the sandbox boots from
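To make the defaults concrete, here is a sketch of how a process might read these four variables once and fall back to the values in the table. SandboxConfig and load_sandbox_config are hypothetical names of my own, and this is not how the plugin is actually implemented.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SandboxConfig:
    cpus: int
    memory_mb: int
    disk_mb: int
    image: Optional[str]  # None stands for "use the platform default"


def load_sandbox_config(env: Mapping[str, str] = os.environ) -> SandboxConfig:
    """Read the sizing variables once, using the defaults from the table above."""
    return SandboxConfig(
        cpus=int(env.get("TENSORLAKE_CPUS", "2")),
        memory_mb=int(env.get("TENSORLAKE_MEMORY_MB", "4096")),
        disk_mb=int(env.get("TENSORLAKE_DISK_MB", "10240")),
        image=env.get("TENSORLAKE_IMAGE"),
    )


if __name__ == "__main__":
    print(load_sandbox_config({"TENSORLAKE_CPUS": "4"}))
```

Because the values are read a single time, changing a variable after the first sandbox exists has no effect on that sandbox.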

If you're using a Personal Access Token instead of a project-scoped key, you'll also need TENSORLAKE_ORGANIZATION_ID and TENSORLAKE_PROJECT_ID.

TENSORLAKE_IMAGE is the one actually worth using. Bake your language runtime, system packages, and project dependencies into a registered image once:

tl sbx image create Dockerfile --registered-name my-custom-image
export TENSORLAKE_IMAGE=my-custom-image

Every session after that starts already warm. I stopped watching the agent reinstall my stack from scratch every time I opened a new session.

Note: export only lasts for the current shell, so I added these to my shell profile once I knew I'd be using this setup regularly.

The Operational Questions I Actually Had

Before trusting this with anything real, I wanted answers to a few things that don't come up in a quick demo. Worth being upfront about one thing here: most of what follows is documented at the Tensorlake SDK and platform level. I haven't confirmed that the OpenCode plugin specifically exposes or manages each of these the same way, only that the underlying sandboxes it creates support them. Where that distinction matters, I've called it out below.

What if I just walk away mid-session? Named sandboxes auto-suspend after their idle timeout rather than terminating, at the platform level. Tensorlake's product site says the meter stops the moment it suspends, and their GitHub page puts resume at under a second, with filesystem, memory, and running processes exactly where you left them, not rebuilt. I haven't independently timed a resume myself, and I haven't confirmed whether the OpenCode plugin surfaces any control over this behavior or just inherits the platform default, so take the specific number as Tensorlake's claim about the platform, not a tested claim about this plugin.

What if the agent kicks off something long-running, like a build or a test suite? The underlying Tensorlake SDK supports starting background processes inside a sandbox that outlive the single command that launched them. I haven't tested whether the OpenCode plugin's bash interception uses this pattern specifically, or whether a long-running command inside OpenCode just holds the tool call open for the duration. Either way, the sandbox itself isn't the bottleneck.

What if my connection drops mid-command? Tensorlake has written publicly about this exact failure mode at the platform level: when the transport hiccups mid-run, they retry and reap orphaned sandboxes so a flaky connection doesn't cost you the whole task. Again, this is a platform-level guarantee. I haven't tested how the plugin itself behaves if your local connection drops mid-tool-call.

What if the sandbox itself crashes outright, not just an idle timeout? Here's where I'll be straight with you: I didn't find documentation covering a hard crash mid-command specifically, at either the platform or plugin level, and I didn't manage to force one during testing either. If you're planning to run this against something you really can't afford to lose, that's worth confirming directly with Tensorlake rather than taking my word for it.

Where I'd Actually Use This

By the end I had a rough rule for myself. Picture a coding agent refactoring a production monorepo while another engineer is pushing commits at the same time. The question was never whether the model was smart enough. It was whether I wanted its shell commands landing on my own workstation.

This isn't the right setup for every OpenCode session, and I don't pretend otherwise:

My situation	What I'd do
Personal throwaway project	Run locally
Client repository	Use Tensorlake
Production codebase	Use Tensorlake
Letting the agent experiment freely with shell commands	Use Tensorlake
Shared engineering environment	Use Tensorlake

The sandbox earns its place anywhere the blast radius of a wrong command actually matters. If I'm the only person who'll ever touch the repo and I'd shrug off losing it, I'm solving a problem I don't have yet.

Worth a quick word on why this isn't just Docker or a Codespace with extra steps. A Docker container shares your host's kernel, which is fine for packaging an app but a thinner isolation boundary than a full VM if you're worried about what an autonomous agent might run.

A local VM gives you the stronger boundary but is heavy and slow to spin up and tear down for something you might want to throw away every few minutes. GitHub Codespaces and Dev Containers solve a different problem: a consistent, persistent dev environment tied to a specific repo, not a fast, disposable environment built to be created and discarded on every session.

The MicroVM approach here is closer to a VM's isolation with something closer to a container's boot speed, and it's built specifically to be thrown away.

If You Want to Try This Yourself

Here's the order I'd do it in, knowing what I know now.

Step 1: Install OpenCode, add tensorlake-opencode to opencode.json. This just gets the plugin loaded, nothing else happens yet.

Step 2: Export TENSORLAKE_API_KEY, launch OpenCode, confirm the plugin loaded via the log file. This is the "is it actually on" check.

Step 3: Ask it to run something trivial, like uname -a. This is what triggers sandbox creation and gives you visible proof the command ran remotely.

Step 4: Set TENSORLAKE_CPUS, TENSORLAKE_MEMORY_MB, and TENSORLAKE_DISK_MB to match your real workload, before your next session starts, not mid-session.

Later: Build a custom image with your actual toolchain baked in, once the default image starts feeling like it's missing things you keep reinstalling.

Closing

I went looking for a third option: an agent with full autonomy, running somewhere that wasn't my laptop. Got that part working fast. What actually surprised me was how little I had to think about afterward.

I stopped reading every command before approving it once this was running. Wasn't carelessness. Just nothing left on my machine for a bad command to reach.

Turns out the part that mattered was never whether the agent could run commands. It was where they landed. Not stopping mistakes. Just making sure they happen somewhere disposable instead of somewhere that costs me.

If you've started letting an autonomous coding agent anywhere near your shell, moving that shell execution into a disposable sandbox is one of the highest-leverage changes you can make, and Tensorlake's own numbers put the setup at a few minutes, not an afternoon. The config entry and the environment variable up above are the whole thing.

References

Tensorlake OpenCode Integration: Full setup, configuration, and lazy sandbox creation model
OpenCode: The open source, terminal-first coding agent, MIT-licensed and model-agnostic
tensorlake-opencode on npm: The plugin package referenced in this article
Plugin source on GitHub: Tool interceptors, session manager, lifecycle handling
Tensorlake Sandbox Lifecycle: The suspend, resume, and snapshot model underneath this integration
Tensorlake Sandbox Images: Building and registering a custom image

Master These 6 AI Concepts to Become an AI Engineer (A Visual Explanation)

Divy Yadav — Wed, 15 Jul 2026 09:27:36 +0000

Job postings asking for AI skills are up 143% in a single year. The people filling those roles didn't spend years on advanced math. They learned six ideas, in the right order, and started building.

Most people think you need a math degree, or years of computer science training, to become a serious AI developer.

You don't.

The gap between someone who can barely get a chatbot to behave and someone building real, production AI systems isn't intelligence. It isn't years of study. It's six specific concepts, learned in an order that actually makes sense, instead of scattered across a hundred confusing tutorials.

Here they are, explained assuming you know nothing yet.

A quick honest note before we start

This roadmap covers how to build AI-powered applications, chatbots, assistants, and tools that use existing AI models well. That's different from becoming a machine learning researcher who builds new models from scratch, which genuinely does need heavy math and years of study.

Most people who say "I want to become an AI developer" mean the first path. That's also where almost all the current job demand is. AI-related job postings are up 143% year over year, and the engineers filling those roles are, overwhelmingly, application builders, not research scientists.

This is that path.

Level 1: Beginner

1. The Context Window

Here's the concept that trips up almost every beginner, because it isn't obvious until something breaks.

An AI model can only "see" a limited amount of text at once. That limit is called the context window. Picture a whiteboard in a meeting room. You can write a lot on it, but once it's full, anything written past the edge simply isn't there anymore. It's not that the AI forgot. It never saw it.

This matters the moment you try to have a long conversation, or feed the AI a huge document. Past a certain point, older parts of the conversation quietly fall off the edge of the whiteboard, and the AI starts responding as if they never happened.

Understanding this one limit explains most of the "why did the AI suddenly get confused" moments beginners run into.
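Here is a rough sketch of the whiteboard filling up. It assumes one token per word, which real tokenizers do not do, and fit_to_window is a hypothetical helper rather than any model provider's API.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: real models count tokens, not whitespace-separated words.
    return len(text.split())


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the newest messages that fit; older ones fall off the edge."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["my name is Ada", "what is RAG", "and what is MCP"]
    print(fit_to_window(history, max_tokens=8))
```

With a budget of 8, the oldest message no longer fits, so the model would never see the name it was told at the start.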

Level 2: Intermediate

2. RAG (Retrieval-Augmented Generation)

An AI model only knows what it learned during training. Ask it about your company's internal policies, and it has nothing, because it never read them. RAG fixes this.

Think of it as the difference between a closed-book exam and an open-book one. Without RAG, the AI answers purely from memory. With RAG, it's allowed to flip open a specific book—your documents, your database, your knowledge base—and check before answering.

In practice, this happens in three steps:

Chunking: Your documents get broken into smaller pieces and stored in a searchable format.
Retrieval: The AI searches that storage for the pieces most relevant to the question.
Augmentation and Generation: It answers using both what it already knew and what it just found.
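The three steps above can be sketched in a few lines. This toy version ranks chunks by shared words instead of embeddings, and it stops at building the prompt, since the generation step is a call to whichever model you use. All function names here are my own.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunk(text: str, size: int) -> list[str]:
    """Chunking: split a document into pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank chunks by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda c: len(question_words & tokenize(c)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Augmentation: put the retrieved text in front of the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    doc = "Refunds are issued within 14 days. Support is open on weekdays from nine to five."
    question = "When are refunds issued?"
    print(build_prompt(question, retrieve(question, chunk(doc, size=6))))
```

Production systems usually swap the word overlap for vector similarity, but the shape stays the same: split, search, then answer with the found text in hand.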

Virtually every AI chatbot that answers questions about a specific company's internal information is running RAG under the hood. It is arguably the single most valuable skill on this entire list, because it is the difference between a generic AI and one that actually knows your business.

3. Fine-Tuning

Prompting hands the AI a fresh set of instructions every single time. Fine-tuning is different: it actually reshapes how the model behaves, permanently, by training it further on a narrow, specific set of examples.

Think of the difference between handing someone a manual before every task versus sending them through months of specialized training. The manual works for most jobs. Specialized training changes how someone thinks about the job itself.

Pro Tip: Most developers should reach for RAG first. Fine-tuning costs more, takes longer, and is usually only worth it when you need a very specific style, format, or behavior that no amount of careful instruction alone can reliably produce.

4. AI Agents

Everything so far has been about the AI answering questions. An agent is about the AI actually doing things.

Picture the difference between a consultant who gives you advice and an assistant who actually goes and books the flight. A consultant-style AI just answers. An agent actually takes real action instead of just describing what action you should take: searching the web, running code, sending an email, or updating a database.

The AI does this by deciding, on its own, which step to take next based on what it just learned. Ask it to fix a bug, and it can read the error, try a fix, check if that fix worked, and try something else if it didn't—all without you typing a new instruction after every single step.
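That loop can be sketched as follows. The decide function stands in for the model and is scripted here so the example runs without one; run_agent, the tool names, and the "done" signal are all hypothetical.

```python
from typing import Callable


def run_agent(
    decide: Callable[[list[str]], str],
    tools: dict[str, Callable[[], str]],
    max_steps: int = 5,
) -> list[str]:
    """Pick an action, run it, record what came back, and repeat."""
    observations: list[str] = []
    for _ in range(max_steps):
        action = decide(observations)
        if action == "done":
            break
        observations.append(tools[action]())
    return observations


def scripted_decide(observations: list[str]) -> str:
    # Stand-in for the model: choose the next step from the last observation.
    if not observations:
        return "read_error"
    if observations[-1] == "NameError on line 3":
        return "apply_fix"
    if observations[-1] == "fix applied":
        return "run_tests"
    return "done"


if __name__ == "__main__":
    tools = {
        "read_error": lambda: "NameError on line 3",
        "apply_fix": lambda: "fix applied",
        "run_tests": lambda: "tests passed",
    }
    print(run_agent(scripted_decide, tools))
```

The max_steps cap is there so a confused agent stops instead of looping forever.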

This is where AI development starts feeling genuinely powerful, and also where it starts requiring real care, which is exactly why the next two concepts exist.

Level 3: Advanced

5. MCP (Model Context Protocol)

Imagine every country having a different shape of electrical socket, and every single appliance needing its own custom adapter just to work when you travel. That was the old way AI agents connected to outside tools: every connection was custom-built, one at a time, for every tool and every AI model.

MCP works like a universal socket standard.

Once a tool speaks MCP, any AI agent that understands MCP can plug into it directly. No custom adapter needed. Anthropic introduced this standard, and it has quickly become the common way agents connect to databases, files, and other services, regardless of which AI model is actually doing the work. If you're building an agent that needs to reach outside tools, MCP is very likely how that connection gets made.
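The socket idea, reduced to code: if every tool exposes the same small interface, the agent needs one way to call all of them. This is only an illustration of that idea. It is not the MCP SDK or its wire format, and Tool, FileReader, Calculator, and invoke are hypothetical names.

```python
from typing import Protocol


class Tool(Protocol):
    """The shared 'socket shape': a name plus one call method."""

    name: str

    def call(self, arguments: dict) -> str: ...


class FileReader:
    name = "read_file"

    def __init__(self, files: dict[str, str]) -> None:
        self._files = files  # in-memory stand-in for a real filesystem

    def call(self, arguments: dict) -> str:
        return self._files[arguments["path"]]


class Calculator:
    name = "add"

    def call(self, arguments: dict) -> str:
        return str(arguments["a"] + arguments["b"])


def invoke(tools: list[Tool], name: str, arguments: dict) -> str:
    """One code path for every tool, with no per-tool adapter."""
    registry = {tool.name: tool for tool in tools}
    return registry[name].call(arguments)


if __name__ == "__main__":
    tools = [FileReader({"notes.txt": "hello"}), Calculator()]
    print(invoke(tools, "add", {"a": 2, "b": 3}))
    print(invoke(tools, "read_file", {"path": "notes.txt"}))
```

Adding a third tool means writing one more class, not changing the agent.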

6. Harness Engineering

This is one of the most critical operational concepts for production engineering. Picture a stunt performer on a film set. They're genuinely skilled. But nobody lets them attempt a dangerous stunt without a safety harness, a spotter, a hard limit on how many takes they get, and a director watching every single shot.

The performer's skill isn't what keeps them safe on set. The equipment and process wrapped around them is.

An AI agent works the same way. The model is the skilled performer. The harness is everything wrapped around it:

A hard limit on how much it's allowed to spend before it has to stop.
Checkpoints that save progress so a crash doesn't waste hours of work.
Guardrails on which actions it's actually allowed to execute.
Logs so a person can audit exactly what happened if something goes wrong.
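Those four pieces fit in one small class. This is a minimal sketch under my own assumptions: Harness and its exceptions are hypothetical names, cost is an abstract number, and a real checkpoint would be written to disk rather than kept in a list.

```python
class BudgetExceeded(Exception):
    pass


class ActionBlocked(Exception):
    pass


class Harness:
    """Wraps an agent's actions with a budget, an allowlist, a log, and saved progress."""

    def __init__(self, allowed_actions: set[str], max_cost: float) -> None:
        self.allowed_actions = allowed_actions
        self.max_cost = max_cost
        self.spent = 0.0
        self.log: list[str] = []
        self.checkpoint: list[str] = []  # completed actions, kept as progress

    def execute(self, action: str, cost: float) -> None:
        # Guardrail: refuse anything not explicitly allowed.
        if action not in self.allowed_actions:
            self.log.append(f"blocked: {action}")
            raise ActionBlocked(action)
        # Spending limit: stop before the budget is crossed.
        if self.spent + cost > self.max_cost:
            self.log.append(f"over budget: {action}")
            raise BudgetExceeded(action)
        self.spent += cost
        self.checkpoint.append(action)
        self.log.append(f"ok: {action}")


if __name__ == "__main__":
    harness = Harness({"read", "test"}, max_cost=1.0)
    harness.execute("read", 0.4)
    harness.execute("test", 0.5)
    print(harness.log)
```

The model never appears in this class, which is the point: the harness constrains whatever sits inside it.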

This concept exists because of one striking number: 88% of AI agent projects never make it into real production use. Not because the models were bad, but because nobody built the harness around them.

How these six AI concepts fit together

None of these exist alone in a real system. A production AI assistant, the kind companies actually pay for, typically uses several of these at once: RAG to pull in company-specific knowledge, an agent to actually take action, MCP so that agent can reach real tools, and a harness watching the whole thing to make sure nothing goes wrong.

Learning them one at a time makes sense. Using only one at a time, in a real product, almost never does.

Key Takeaways

The context window is the AI's limited working memory. Understanding its limit explains most confusing AI mistakes.
RAG lets an AI answer from your own documents instead of just what it learned in training. It is the most valuable skill on this list.
Fine-tuning permanently reshapes a model's behavior. Reach for it only when instructions genuinely can't get you there.
An AI Agent takes real action instead of just answering, deciding its own next step based on what it just learned.
MCP is the universal standard that lets agents plug into outside tools without a custom connection built for every single one.
Harness engineering is what turns a smart model into a reliable system. The missing harness is usually why agent projects fail to reach production.

The part worth remembering

Nobody starts as an advanced AI developer. Every person building serious production AI systems today started exactly where you might be starting now—watching a conversation quietly lose track of what was said three messages ago and wondering why.

What separates a beginner from someone advanced isn't talent. It's this list, learned in order, applied to something real. The math and the deep model architecture that everyone assumes is required? Most working AI developers never touch it. They learned six ideas, built things with them, and kept going.

You can start with the context window today. Nothing on this list requires anything you don't already have.


[Boost]

Divy Yadav — Fri, 10 Jul 2026 12:39:27 +0000


Why Your AI Experiments Keep Starting From Scratch (And How Tensorlake Fixes It)

Divy Yadav — Fri, 10 Jul 2026 08:58:57 +0000

How memory checkpoints and sandbox forking let you build once, checkpoint the warm state, and run as many parallel workers as you need. No reinstalls. No reloads.

Running eight parallel ML experiments sounds efficient. Watching each one reinstall numpy from scratch does not.

The logs told the story: eight sandbox workers, each spending its first 40 seconds on pip install numpy and loading the dataset before a single line of training code ran. The training scripts were different.

The setup wasn't. I had paid for the same 40 seconds of work eight times.

The training scripts were the candidates. The setup was not.

That is not a performance bug. It is a mental model problem.

This article is the fix.

"Starting Fresh" Is Not the Same as "Starting Clean"

Most experiment setups quietly make an assumption: real worker isolation means booting from a clean image every time. No shared state. No residue from previous runs.

That assumption is not wrong. It is just more expensive than it needs to be.

Workers running parallel experiments need independent filesystems and independent process trees. They need a known starting state. What they do not need is to reinstall numpy, reload the same dataset from disk, or re-seed a Python environment that was identical across all of them at the start of the iteration.

"Starting fresh" became a proxy for isolation.

But the actual requirement is narrower: get every worker to the same state, then let them diverge. That is a different primitive than booting from a base image.

Filesystem Snapshots vs Memory Snapshots

Before I discovered how Tensorlake handles this, I assumed all snapshots worked the same way Docker images do.

A filesystem snapshot captures disk state only. When you restore from one, the sandbox cold boots: the VM initializes from scratch, the OS comes up, Python starts, your process begins again. The installed packages are there on disk, so you skip the apt install and pip install steps. But you still pay for boot time and for every piece of process initialization that was not baked into the image.

The interpreter loads. Modules get imported. Data gets pulled into RAM.

I learned this the hard way.

My first implementation used filesystem snapshots because they were the default. I launched eight workers from the same filesystem snapshot and watched every one of them cold boot, import NumPy, load the dataset, and rebuild the execution environment.

I had skipped the package installation step, but I still paid for everything that came afterward.

Then I read the docs again and saw CheckpointType.MEMORY.

A memory snapshot captures the filesystem plus the entire VM memory state, including all running processes at the exact moment of capture. Restore from one and the sandbox resumes warm. The Python interpreter is already running. Modules imported before the checkpoint are already loaded. Variables that were in memory are still there.

It is closer to Unix fork() than to a Docker image. You are not restoring a filesystem. You are branching from a specific execution point.

Tensorlake exposes filesystem and memory checkpoints as distinct primitives, making it possible to choose between cold restores and warm restores depending on the workload.

from tensorlake.sandbox import Sandbox, CheckpointType

# Cold-boot restore: captures disk state only
snapshot = sandbox.checkpoint(checkpoint_type=CheckpointType.FILESYSTEM)

# Warm restore: captures disk + VM memory + running processes
snapshot = sandbox.checkpoint(checkpoint_type=CheckpointType.MEMORY)

The default when you call sandbox.checkpoint() with no arguments is filesystem. For the fork pattern described in this article, you want memory.

One tradeoff to understand upfront: memory snapshots lock the resource configuration at capture time. Image, CPU count, memory limit, entrypoint, and secrets all come from the snapshot when you restore.

You cannot change them at restore time.

If you need different resource allocations per worker, use filesystem snapshots and accept the cold boot. For most parallel experiment setups, the locked resources are fine since you sized the base environment for the workers when you created it.

Secrets are the exception.

They survive the checkpoint but are not locked to it. Secrets are passed as environment variables to run(), so you can pass a new value at restore time and overwrite whatever was captured in the snapshot. Rotating an API key across forked workers does not require a new snapshot.

The Fork Pattern

Once you have a memory snapshot, the fork pattern is straightforward: create one base sandbox, checkpoint it, and restore every worker from that checkpoint.

Every worker starts from the same warm execution point. They share nothing at runtime. The base sandbox can be terminated right after the checkpoint. The snapshot_id persists independently.

Step 1: Create one base sandbox and do all shared setup inside it.

from tensorlake.sandbox import Sandbox, CheckpointType

base = Sandbox.create(
    image="tensorlake/ubuntu-minimal",
    cpus=2,
    memory_mb=4096,
)

# Install shared dependencies once
base.run("pip", ["install", "numpy", "pandas", "--break-system-packages"])

# Verify the environment is ready
base.run("python3", ["-c", "import numpy; import pandas; print('Environment ready.')"])

Step 2: Capture the warm state.

snapshot = base.checkpoint(checkpoint_type=CheckpointType.MEMORY)
print(f"Snapshot captured: {snapshot.snapshot_id}")

# The base sandbox can be terminated. The snapshot persists independently.
base.terminate()

Step 3: Restore N workers in parallel from the same checkpoint.

from concurrent.futures import ThreadPoolExecutor

def run_experiment(script: str) -> str:
    # Each worker restores a fresh copy of the warm environment
    worker = Sandbox.create(snapshot_id=snapshot.snapshot_id)
    try:
        result = worker.run("python3", ["-c", script])
        return result.stdout
    finally:
        # Release the sandbox even if the script or the call fails
        worker.terminate()

candidates = [script_v1, script_v2, script_v3, script_v4, script_v5]

with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
    results = list(pool.map(run_experiment, candidates))

Each Sandbox.create(snapshot_id=...) call restores an independent copy of the warm environment. Workers share no filesystem and no runtime state. They just started from the same point.
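One practical detail when fanning out: `pool.map` re-raises a worker's exception as soon as you reach its result, so a single crashed candidate can cost you the rest of the batch. Below is a minimal sketch of collecting one outcome per candidate instead. `fake_worker` is a hypothetical stand-in so the example runs without a sandbox; in practice you would pass something like `run_experiment`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_worker(script: str) -> str:
    # Hypothetical stand-in for a function that runs `script` in a forked sandbox.
    if "boom" in script:
        raise RuntimeError("candidate crashed")
    return f"ran: {script}"


def run_all(worker, candidates):
    """Run every candidate and keep each outcome, even when some fail."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max(1, len(candidates))) as pool:
        # Submit everything first so the workers actually run in parallel.
        futures = {name: pool.submit(worker, script) for name, script in candidates.items()}
        for name, future in futures.items():
            try:
                outcomes[name] = ("ok", future.result())
            except Exception as exc:  # one failed worker should not sink the batch
                outcomes[name] = ("error", str(exc))
    return outcomes


outcomes = run_all(fake_worker, {"v1": "print(1)", "v2": "boom", "v3": "print(3)"})
print(outcomes)
```

Keying outcomes by candidate name also makes it easy to see which script produced which result once the batch finishes.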

The snapshot.snapshot_id is a persistent string. Save it to a file. Use it in the next session, in the next iteration of your loop, in a different process entirely. The warm state survives until you explicitly delete the snapshot.
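Since the ID is just a string, saving it can be as simple as a small JSON file. The sketch below is one way to do that with the standard library only. The file name, the `label` keys, and the ID value `snap-example-123` are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot_id(path: Path, label: str, snapshot_id: str) -> None:
    # Keep several labelled snapshots in one file, e.g. one per environment.
    records = json.loads(path.read_text()) if path.exists() else {}
    records[label] = snapshot_id
    path.write_text(json.dumps(records, indent=2))


def load_snapshot_id(path: Path, label: str):
    # Returns None when nothing was saved, so the caller can rebuild the base.
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(label)


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "snapshots.json"
    save_snapshot_id(store, "numpy-pandas-base", "snap-example-123")  # placeholder ID
    loaded = load_snapshot_id(store, "numpy-pandas-base")
    missing = load_snapshot_id(store, "unknown-label")

print(loaded, missing)
```

A later session can call `load_snapshot_id` first and only build and checkpoint a new base sandbox when it gets `None` back.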

Where This Actually Matters

ML Experiment Racing

Andrej Karpathy published the autoresearch repo in early 2026.

The core idea: an LLM reads your current best training script, proposes N code modifications, you race all N in parallel sandboxes, keep the winner, and loop. The loop runs overnight. Each accepted modification becomes the new baseline for the next iteration.

The naive implementation pays the setup cost N times per iteration. If the training script needs numpy, scipy, and a small dataset loaded into memory, that is real time per candidate, per loop. With 8 iterations and 3 candidates each, you have paid identical setup cost 24 times.

With Tensorlake's Snapshot Fork Pattern, you do the setup once.

Take a memory checkpoint. Fork all candidates from that checkpoint.

They start warm and only diverge where the proposed modification changes behavior.

The structure of one iteration looks like this:

# Build the warm baseline once per iteration
base = Sandbox.create(image="tensorlake/ubuntu-minimal", cpus=2, memory_mb=4096)
base.run("pip", ["install", "numpy", "scipy", "--break-system-packages"])
base.run("python3", ["-c", "import dataset; dataset.load_into_memory()"])

# Checkpoint once: all candidates share this starting point
snapshot = base.checkpoint(checkpoint_type=CheckpointType.MEMORY)
base.terminate()

# Race candidates from the warm snapshot in parallel
# Each worker starts where the base left off, not from a fresh image
with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
    results = list(pool.map(lambda s: run_in_sandbox(snapshot.snapshot_id, s), candidates))

# Keep the winner, discard the rest, repeat
winner = min(results, key=lambda r: r["val_loss"])

The snapshot.snapshot_id is reused across all candidates in the same iteration. Setup cost is paid once per loop, not once per candidate.
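To put numbers on that, here is the arithmetic for the 8-iteration, 3-candidate loop described earlier. The 40-second setup time is an assumed figure for illustration, and the sketch ignores the time spent taking the checkpoint and restoring workers.

```python
def setup_runs(iterations: int, candidates_per_iteration: int) -> dict:
    # Naive: every candidate in every iteration repeats the setup.
    # Forked: the setup runs once per iteration, before the checkpoint.
    return {
        "naive": iterations * candidates_per_iteration,
        "forked": iterations,
    }


ASSUMED_SETUP_SECONDS = 40  # illustrative assumption, not a measurement

counts = setup_runs(iterations=8, candidates_per_iteration=3)
naive_seconds = counts["naive"] * ASSUMED_SETUP_SECONDS
forked_seconds = counts["forked"] * ASSUMED_SETUP_SECONDS
saved_seconds = naive_seconds - forked_seconds

print(counts, saved_seconds)  # {'naive': 24, 'forked': 8} 640
```

The saving grows with the number of candidates per iteration, since the forked count does not depend on it.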

RL Rollouts with Reproducibility

RL rollouts have a hard requirement: same seed, same action sequence, same trajectory. Every time, without exception. That guarantee breaks the moment workers share any state. A shared pip cache can introduce version skew between runs. A shared /tmp carries residual files from previous episodes. Even calling env.reset() correctly does not help when state outside the environment object persists between episodes.

With per-rollout sandboxes forked from a common snapshot, the isolation is structural rather than enforced by convention. There is no shared filesystem between workers.

The seed goes directly into the Python script that runs inside the sandbox, not into the host process. Keeping it there means the host's random state stays completely out of the episode:

import json
from tensorlake.sandbox import Sandbox

def gym_harness(seed: int) -> str:
    # Script runs inside the sandbox, not the host
    return f"""
import gymnasium as gym, json

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed={seed})
env.action_space.seed({seed})

trajectory, total_reward = [], 0.0
for _ in range(200):
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, _ = env.step(action)
    trajectory.append((obs.tolist(), int(action), float(reward), bool(terminated)))
    total_reward += reward
    obs = next_obs
    if terminated or truncated:
        break

print(json.dumps({{"seed": {seed}, "total_reward": total_reward, "steps": len(trajectory)}}))
"""

def run_rollout(snapshot_id: str, seed: int) -> dict:
    worker = Sandbox.create(snapshot_id=snapshot_id)
    try:
        result = worker.run("python3", ["-c", gym_harness(seed)])
        return json.loads(result.stdout)
    finally:
        # Always release the sandbox, even if the rollout fails
        worker.terminate()

One specific gotcha here: env.reset(seed=seed) only seeds the observation and transition RNG. The action space has its own separate RNG that requires env.action_space.seed(seed) independently. Miss the second call and trajectories vary across runs with no obvious error. The mismatch is silent and will look like flaky results until you trace it to the gymnasium source.
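The two-RNG gotcha is easy to reproduce without gymnasium. The sketch below uses a hypothetical `ToyEnv` with one RNG for transitions and a separate one for action sampling, then fingerprints whole trajectories with a hash. It is a toy model of the split described above, not gymnasium's implementation.

```python
import hashlib
import json
import random


class ToyEnv:
    """Hypothetical stand-in with two independent RNGs."""

    def __init__(self):
        self.transition_rng = random.Random()  # seeded by reset(seed)
        self.action_rng = random.Random()      # seeded separately, or not at all

    def reset(self, seed: int) -> float:
        self.transition_rng.seed(seed)
        return self.transition_rng.random()

    def seed_actions(self, seed: int) -> None:
        self.action_rng.seed(seed)

    def sample_action(self) -> float:
        return self.action_rng.random()

    def step(self, action: float) -> float:
        return action + self.transition_rng.random()


def rollout_fingerprint(seed: int, seed_actions: bool, steps: int = 20) -> str:
    env = ToyEnv()
    trajectory = [env.reset(seed)]
    if seed_actions:
        env.seed_actions(seed)
    for _ in range(steps):
        trajectory.append(env.step(env.sample_action()))
    # Hash the whole trajectory so two runs can be compared with one string.
    return hashlib.sha256(json.dumps(trajectory).encode()).hexdigest()


both_seeded = rollout_fingerprint(42, True) == rollout_fingerprint(42, True)
actions_unseeded = rollout_fingerprint(42, False) == rollout_fingerprint(42, False)
print(both_seeded, actions_unseeded)
```

Comparing fingerprints from two rollouts with the same seed is a cheap regression check: if they ever differ, some source of randomness is still unseeded.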

Parallel Browser Agents

Browser warmup is slow: login flows, OAuth, session cookies, waiting for JavaScript-heavy pages to stabilize. If you are running 20 browser workers in parallel, you do not want each one repeating that from scratch.

Tensorlake's ubuntu-vnc sandbox image makes this practical because the authenticated browser itself becomes part of the checkpoint.

Complete the auth flow once, checkpoint the authenticated browser state, then fork your workers from there:

from tensorlake.sandbox import Sandbox, CheckpointType

# Authenticate once in a single browser sandbox
browser_base = Sandbox.create(image="tensorlake/ubuntu-vnc", cpus=4, memory_mb=4096)
# ... drive Chrome via CDP, complete login, reach stable page state ...

# Checkpoint the authenticated browser in memory
auth_snapshot = browser_base.checkpoint(checkpoint_type=CheckpointType.MEMORY)
browser_base.terminate()

# All 10 workers start with the logged-in browser state already in memory
workers = [Sandbox.create(snapshot_id=auth_snapshot.snapshot_id) for _ in range(10)]

The same pattern that eliminates repeated pip install across ML workers eliminates repeated OAuth flows across browser workers. The primitive is the same. Only the warmup content changes.

Suspend vs. Snapshot: Two Different Things

Suspend and snapshot both preserve sandbox state. They solve different problems and it is worth being clear about which one you reach for.

Suspend pauses this specific sandbox and holds its state for later resumption under the same sandbox ID. It uses no compute while suspended. When you resume, the sandbox comes back in under a second with the same process IDs, filesystem, and in-memory state intact. Named sandboxes auto-suspend on timeout instead of terminating, which means you do not lose a long-running agent session because a task ran slightly over its time budget.

Snapshot captures a reusable artifact you can restore into new sandboxes. The artifact persists after the source sandbox is terminated. Restore from it once or many times.

Suspend is for pausing and resuming a single workstream. Snapshot is for branching from a known state into N parallel workers.

The sandbox keeps running while the snapshot is being captured. The snapshot artifact persists regardless of what happens to the source sandbox afterward.

Decision Matrix

Why This Works: The Underlying Principle

Unix fork() exists for the same reason. Process creation is expensive, so instead of spawning a child from scratch, you copy the parent's entire memory state and let the child diverge from there.

Git branches for the same reason: you checkpoint a known state and branch from it because re-deriving full history for every branch would be wasteful.

AI experiment infrastructure got filesystem snapshots first. But filesystem snapshots only moved the cost boundary to disk. The running process, loaded modules, in-memory data, and interpreter state all had to rebuild on every new worker anyway.

Memory checkpointing moves that boundary further. The running process is part of what gets captured. Load a 500MB dataset into a pandas DataFrame before the checkpoint, and every forked worker starts with that DataFrame already in RAM. Pre-compile JIT functions and the cache is there too. The interpreter does not restart. Nothing reloads.

The expensive part of an experiment is rarely the experiment. It is everything that had to be true before the experiment could start.

That cost is not fixed. It just looked fixed because the tooling treated it that way.

What the Platform Handles

Three numbers that matter for the fork pattern specifically.

Speed. The tensorlake/ubuntu-minimal image starts up in a few hundred milliseconds; tensorlake/ubuntu-systemd (full init system) takes around one second. Forked workers restore from a memory snapshot warm, skipping boot and process initialization entirely. Suspended named sandboxes resume in under a second without losing any memory or filesystem state. The published SQLite benchmark (100k inserts, 2 vCPU / 4GB) shows Tensorlake at 2.45s against Vercel at 3.00s, E2B at 3.92s, Modal at 4.66s, and Daytona at 5.51s.

Scale. Tensorlake supports fanning out to thousands of concurrent sandbox environments, the number the Harbor integration targets for RL rollouts and eval pipelines running from a single snapshot. The overall project limit is 5 million sandboxes.

Isolation. Sandboxes run on MicroVMs backed by Firecracker and CloudHypervisor. LLM-generated code never shares a kernel with other tenants or with your host process. Tensorlake is SOC 2 Type II and HIPAA compliant, with EU data residency and zero data retention options available.

Getting Started with Tensorlake

Install the SDK and grab an API key from cloud.tensorlake.ai (free tier available):

pip install tensorlake
export TENSORLAKE_API_KEY=your_api_key

Minimal working fork pattern you can run right now:

from tensorlake.sandbox import Sandbox, CheckpointType
from concurrent.futures import ThreadPoolExecutor

# Step 1: Warm the base environment
base = Sandbox.create(cpus=2, memory_mb=4096)
base.run("pip", ["install", "numpy", "--break-system-packages"])

# Step 2: Checkpoint the warm state
snapshot = base.checkpoint(checkpoint_type=CheckpointType.MEMORY)
base.terminate()

# Step 3: Fork workers from the snapshot
def run(script: str) -> str:
    w = Sandbox.create(snapshot_id=snapshot.snapshot_id)
    out = w.run("python3", ["-c", script]).stdout
    w.terminate()
    return out

scripts = ["print('v1')", "print('v2')", "print('v3')"]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run, scripts))

print(results)

All three workers start warm. snapshot.snapshot_id is a persistent string you can store and reuse across sessions.

Conclusion

Those eight pip install numpy calls were not a performance bug. They were a modeling mistake. I had built my experiment loop around "start fresh" when what I actually needed was "start identical."

The difference is real. Starting fresh means paying full setup cost per worker. Starting identical means paying it once and forking from there.

Tensorlake's memory checkpoints provide one implementation of this pattern: build the environment once, capture the execution state, and fork as many isolated workers as you need from the same warm baseline.

You do the expensive shared work once, freeze the execution state, and branch as many workers as you need from that exact point. Each worker gets its own isolated environment. The setup cost is paid once.

This applies to most parallel AI workflows: ML candidate racing, RL rollouts, browser agent pools, CI evaluation pipelines. The pattern is the same across all of them.

If I could go back to those eight workers all stuck on pip install numpy, the fix would have been four lines. Create the base. Warm it. Call checkpoint(MEMORY). Restore from there. The 320 seconds of identical setup would have happened once. The other seven copies would never have existed.

That is what the pattern is for.

References

Tensorlake Sandboxes: Introduction: Overview of how MicroVM sandboxes work, boot times, and supported images
Tensorlake Snapshots Documentation: Full reference for checkpoint(), snapshot types, restore, clone, and delete operations
Tensorlake Sandbox Lifecycle: State machine for ephemeral vs named sandboxes, suspend/resume, and timeout behavior
Tensorlake Agentic Autoresearch Loop: Full working example of the Karpathy-inspired autoresearch pattern using Tensorlake sandboxes
Tensorlake RL Reproducible Environments: Deterministic rollout guarantees with per-rollout sandbox isolation and seed embedding
Tensorlake Agentic Swarm Intelligence: Map-reduce pattern for parallel agent swarms in isolated sandboxes
Tensorlake Drive Chrome over CDP: Running Chrome inside ubuntu-vnc sandboxes and driving them via the Chrome DevTools Protocol
Tensorlake Harbor Integration: Running Harbor evaluations and RL rollouts with 10k+ concurrent sandbox environments
Tensorlake SQLite Benchmark: Public benchmark comparing sandbox filesystem performance across Tensorlake, Vercel, E2B, Modal, and Daytona
karpathy/autoresearch: The original autoresearch repo by Andrej Karpathy that inspired the ML experiment racing pattern
Tensorlake SDK Reference: Complete Sandbox handle method reference for Python and TypeScript

AI Document Processing: What Production Systems Actually Need

Divy Yadav — Wed, 08 Jul 2026 11:44:47 +0000

Getting extraction working on test documents is the easy part. Here is what breaks before you hit month two.

Three weeks, two engineers, one prompt. They tested it on twenty invoices from their biggest vendor: 95% field accuracy, clean JSON, ready to ship.

So they shipped it.

Six weeks later, the pipeline was silently dropping line items from any vendor whose invoice didn't match the layout it had been tested on. No crashes. No errors. Just missing data, quietly piling up into a four-month reconciliation mess before anyone noticed.

The team hadn't gotten extraction wrong. They'd gotten the problem wrong. The real issues were layout variation across vendors, no validation layer to catch silent failures, and nowhere for exceptions to go.

That's where most AI document processing projects actually die.

Not in the demo. In month two, once documents start arriving from vendors nobody tested against.

This piece covers what production-ready AI document processing actually requires, and the four failure modes I keep watching teams hit once they're past the pilot.


What AI Document Processing Actually Is

AI document processing means using machine learning models and LLMs to pull structured data from unstructured documents, without manually configuring templates for each document format.

Inputs vary widely: native PDFs, scanned images, photographs of physical forms, multi-page mixed files. The output is structured JSON or database-ready records that downstream systems can consume directly.

The processing chain has four stages:

Ingestion. Document enters the pipeline, type gets identified, file is prepared for parsing.
Layout understanding. The system maps spatial structure: columns, rows, headers, key-value pairs.
Extraction. The LLM reads the structured layout and identifies target fields based on context and semantics.
Output. Structured JSON flows into a database, ERP, data warehouse, or API endpoint.

OCR and LLM-based extraction are not competing approaches. They're two separate layers in the same pipeline. OCR converts pixels to text. The LLM then operates on that text to understand meaning and pull specific fields. The critical dependency is that the OCR layer has to preserve layout for the LLM layer to work correctly.

This is the first thing most teams miss, and it's where a lot of production failures originate.

The Layout Problem Nobody Explains Well

Here is what happens when you run a standard OCR tool on a bank statement with multi-column transactions:

The tool reads the page left to right, top to bottom, and produces a flat text stream. A transaction table with date, description, debit, credit, and running balance columns gets compressed into something like: 07/01/2019 Deposit = 131 $209.54 $654.82. Which value is the debit? Which is the balance? The spatial relationship that made the table readable is gone.

An LLM trying to extract structured data from that flattened output isn't working with a document. It's working with noise. It may produce plausible-looking results, and it might even get most fields right on the documents you tested. But accuracy will be inconsistent across different banks, different statement periods, and different vendor layouts. You won't know how inconsistent until something downstream breaks.

Layout is where meaning lives in documents. Flatten the layout at the parsing stage and you have destroyed the document's semantics before extraction even begins.

The fix is layout-aware parsing: an OCR layer that preserves spatial structure rather than discarding it. When transaction rows stay aligned, column relationships stay intact, and headers remain separated from body content, the LLM gets input it can reason over correctly. The extraction layer becomes reliable because the parsing layer did its job.
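A small sketch makes the difference concrete. The rows, column widths, and helper names below are invented for illustration. The point is only that a blank cell disappears when a row is flattened, while a fixed-width rendering keeps every value in its column.

```python
# (date, description, debit, credit, balance); an empty string is a blank cell
ROWS = [
    ("07/01/2019", "Deposit", "", "209.54", "654.82"),
    ("07/03/2019", "Card payment", "45.00", "", "609.82"),
]
HEADERS = ("date", "description", "debit", "credit", "balance")
WIDTHS = (12, 16, 10, 10, 10)  # assumed column widths for this sketch


def flatten(row) -> str:
    # What a layout-blind reader produces: blank cells simply vanish.
    return " ".join(cell for cell in row if cell)


def render_fixed_width(row) -> str:
    # Layout-preserving text: every column keeps its horizontal position.
    return "".join(cell.ljust(width) for cell, width in zip(row, WIDTHS))


def parse_fixed_width(line: str) -> dict:
    fields, start = {}, 0
    for name, width in zip(HEADERS, WIDTHS):
        fields[name] = line[start:start + width].strip()
        start += width
    return fields


for row in ROWS:
    print(flatten(row))  # both rows look like: date, text, amount, amount

parsed = [parse_fixed_width(render_fixed_width(row)) for row in ROWS]
print(parsed[0]["credit"], parsed[1]["debit"])
```

In the flattened form, the deposit and the card payment have the same shape, so nothing tells you whether the first amount is a debit or a credit. In the fixed-width form the column position answers that.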

This distinction matters operationally. A pipeline that gets layout right will generalize to new document variants from the same class. One that doesn't will fail silently, and the failures will look random because the actual root cause is buried upstream.

Why Traditional Pipelines Break at Scale

Most document processing systems start with templates. You define where each field lives on the page, and the system reads that coordinate.

It works exactly as long as nothing changes.

The moment a vendor updates their invoice layout, moves the PO number two columns right, or switches accounting software, the template breaks. Extraction fails silently or returns garbage, and someone has to rebuild the template before processing can resume. Then again next quarter.
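Here is a toy version of that failure. The headers, rows, and helper names are made up, and the label lookup is only a simplified stand-in for context-based extraction. What it shows is that a position-based rule keeps returning a value after the layout changes. It is just the wrong value, and nothing raises an error.

```python
def extract_by_position(cells, index):
    # Template-style: trust that the field always sits at the same position.
    return cells[index]


def extract_by_label(headers, cells, label):
    # Label-style: find the column by its header text instead of its position.
    return cells[headers.index(label)] if label in headers else None


PO_INDEX = 1  # where the template expects the PO number

old_headers = ["Invoice No", "PO Number", "Date", "Total"]
old_row = ["INV-001", "PO-7781", "2026-01-15", "1200.00"]

# The vendor moves the PO number two columns to the right.
new_headers = ["Invoice No", "Date", "Total", "PO Number"]
new_row = ["INV-002", "2026-02-15", "980.00", "PO-7802"]

before = extract_by_position(old_row, PO_INDEX)
after = extract_by_position(new_row, PO_INDEX)      # a date, silently
by_label = extract_by_label(new_headers, new_row, "PO Number")

print(before, after, by_label)  # PO-7781 2026-02-15 PO-7802
```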

At enterprise scale, this becomes a maintenance treadmill. Organizations processing documents from hundreds of vendors end up maintaining hundreds of templates. Every new vendor requires a new one. Every layout change requires a manual update. The engineering overhead compounds continuously, and the system gets more fragile, not more capable, as document variety grows.

The two approaches diverge early, and the divergence is the whole story:

TEMPLATE-BASED                       AI-BASED

Document                             Document
   │                                    │
   ▼                                    ▼
Fixed rules                          Layout-aware parsing
   │                                    │
   ▼                                    ▼
Works until layout changes           LLM extraction
   │                                    │
   ▼                                    ▼
Breaks silently                      Accuracy validation
                                         │
                                         ▼
                                      Deploy

One path has a dead end built in. The other has a checkpoint built in.

A few specific patterns show up over and over with the template path:

Variability at the margins. Templates work on documents you designed them for. Edge cases fall outside the template and fail quietly. Handwritten annotations on printed forms, scans from older equipment, PDFs generated by different accounting platforms. All of these are edge cases in production, not theoretical ones.

The rule explosion. Teams try to handle variability by adding more rules. The rule set grows. Rules start conflicting. Testing becomes a manual slog, and regressions appear in unexpected places.

Accuracy rot. Template accuracy is static. It doesn't improve as you process more documents. It only degrades as document variety increases. There is no feedback loop.

Why AI-Native Pipelines Also Fail in Production

Replace the template with an LLM and you've solved the variability problem. LLMs can read documents they've never seen and infer field locations from context. But a new set of problems shows up in its place.

Silent accuracy drift. An LLM might extract a date in one format from one vendor and a different format from another. Both look like dates. Both are technically correct. But your downstream database expects ISO 8601, and the inconsistency breaks a batch job on a night when nobody is watching. You find out when someone files a support ticket about bad data.

Confidence without calibration. LLMs don't know what they don't know. A model will extract a field with the same apparent confidence whether it found it clearly labeled or inferred something plausible from surrounding context. Without a validation layer, you can't distinguish reliable extractions from reasonable guesses.

No prompt version control. A prompt change that fixes extraction for one document type breaks it for another. Without a baseline to compare against and a way to measure impact across your full document set, you're making changes blind and discovering regressions through downstream complaints.
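The date-format drift described above is the easiest of these to guard against in code. The sketch below normalizes a few unambiguous formats to ISO 8601 and returns `None` for anything else, so the caller can send that record to review instead of passing it downstream. The list of accepted formats is an assumption and would need tuning for a real document set.

```python
from datetime import datetime

# Formats this sketch accepts; ambiguous ones like 03/04/2026 are left out on purpose.
KNOWN_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y", "%d.%m.%Y")


def to_iso_date(raw: str):
    """Return the date as YYYY-MM-DD, or None if the format is not recognized."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


samples = ["2026-03-15", "15 March 2026", "March 15, 2026", "15.03.2026", "03/04/2026"]
normalized = {raw: to_iso_date(raw) for raw in samples}
print(normalized)
```

Returning `None` rather than guessing matters here: a day-first and a month-first reading of `03/04/2026` are both valid dates, and picking one silently is exactly the kind of drift that surfaces weeks later.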

An extraction system with no accuracy measurement is not a production system. It's a demo that happens to be running in your infrastructure.

These aren't edge cases. They are the failure modes that show up in almost every AI document processing project that gets far enough along to encounter real production traffic.

What Production-Ready Actually Requires

Most demos skip the same four things.

Layout-preserving parsing. The OCR layer needs to output text the LLM can reason over correctly. For complex documents (multi-column tables, nested forms, scanned images), that means preserving spatial structure, not just converting pixels to characters.

Validation with measurable accuracy. Every extraction should produce a confidence score. Low-confidence extractions should route to review rather than propagate downstream. Accuracy should be measured against verified ground truth so you have a number you can track over time, not a subjective sense that things look fine.

Human-in-the-loop routing. Not every document should process automatically. Documents that fail validation or produce low-confidence extractions need a clear path to human review, and corrections from that review need to feed back into the system. Exception handling is not an afterthought; it's a core part of the architecture.

Prompt management with version control. Prompts evolve. Changes need to be tracked, their impact measured across the full document set, and rollback available when a change causes regression. A prompt is not a one-time configuration. It's a versioned artifact.

None of these are model problems. They're systems problems. And they're the ones that determine whether a document processing pipeline survives contact with production.
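The validation and routing requirements can be sketched in a few lines. This is a minimal illustration, not any platform's implementation: the `FieldResult` type, the 0.85 threshold, and the assumption that each field arrives with a confidence score between 0 and 1 are all mine.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    name: str
    value: str
    confidence: float  # assumed to be in [0, 1]; how it is produced is out of scope


REVIEW_THRESHOLD = 0.85  # illustrative value, to be tuned against ground truth


def route(fields, required):
    """Return ("auto", []) or ("review", reasons) for one extracted document."""
    reasons = []
    present = {f.name for f in fields if f.value}
    for name in required:
        if name not in present:
            reasons.append(f"missing required field: {name}")
    for f in fields:
        if f.confidence < REVIEW_THRESHOLD:
            reasons.append(f"low confidence on {f.name}: {f.confidence:.2f}")
    return ("review", reasons) if reasons else ("auto", [])


REQUIRED = ["invoice_number", "total"]

clean = route(
    [FieldResult("invoice_number", "INV-001", 0.98), FieldResult("total", "1200.00", 0.93)],
    REQUIRED,
)
shaky = route(
    [FieldResult("invoice_number", "INV-002", 0.97), FieldResult("total", "980.00", 0.61)],
    REQUIRED,
)
incomplete = route([FieldResult("invoice_number", "INV-003", 0.99)], REQUIRED)

print(clean, shaky, incomplete)
```

Recording the reasons alongside the decision gives reviewers a starting point and gives you data on which fields fail most often.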

How Unstract Handles the Production Problem

Most platforms optimize for the demo. They make extraction easy to set up on a clean document. What happens next, with real documents, real volume, real variability, gets left to your engineering team to figure out.

Unstract is an open-source intelligent document processing platform built around the opposite assumption: that extraction is the easy part, and the hard parts are everything required to keep it working.

The platform has two distinct layers. Here's how a document actually moves through the system, from PDF to structured output:

Two things matter about this shape. The OCR layer and the extraction layer are separate concerns, handled by separate components, so a failure in one doesn't get misdiagnosed as a failure in the other. And low-confidence output never reaches a downstream system unreviewed. It loops back through a human first.

Try for Free - Unstract Playground (no signup required)

The Parsing Foundation: LLMWhisperer

LLMWhisperer is a document parsing engine built specifically for LLM-based extraction pipelines. Its job is the step before extraction, converting PDFs, scanned images, and complex documents into text output that preserves enough of the original structure for a language model to reason over correctly.

The practical difference from standard OCR shows up immediately. A conventional tool flattens a multi-column bank statement into a single text stream, destroying the spatial relationships that give the data its meaning. LLMWhisperer solves this through layout-aware parsing. Rather than flattening the page, it analyses the spatial structure and reconstructs the text in a way that preserves the relationships between elements. Table rows stay intact. Multi-column sections are handled correctly. Headers, footers, and summary blocks are separated from transaction data.

LLMWhisperer also includes auto-compaction, which removes low-value tokens like repetitive headers and footers before text reaches the LLM. These techniques can reduce token usage by up to 7x, which matters once you're processing thousands of documents instead of twenty.
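To give a feel for the idea, here is a deliberately naive sketch of header and footer removal: any line that appears on every page is kept on the first page and dropped from the rest. This is not LLMWhisperer's algorithm, which the article does not describe. The function name and sample pages are invented.

```python
from collections import Counter


def drop_repeated_lines(pages):
    """Remove lines that appear on every page, keeping them on the first page only."""
    if len(pages) < 2:
        return [list(page) for page in pages]
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page if line.strip()})
    repeated = {line for line, seen in counts.items() if seen == len(pages)}
    first, rest = pages[0], pages[1:]
    return [list(first)] + [
        [line for line in page if line.strip() not in repeated] for page in rest
    ]


pages = [
    ["ACME Bank Statement", "07/01/2019 Deposit 209.54", "Page 1 of 2"],
    ["ACME Bank Statement", "07/03/2019 Card payment 45.00", "Page 2 of 2"],
]
compacted = drop_repeated_lines(pages)
print(compacted)
```

A real implementation would need to be more careful than this, since a genuine data line that happens to repeat on every page would be dropped too.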

Try for Free - LLMWhisperer Playground (no signup required)

Where Pipelines Get Built: Agentic Prompt Studio

This is the part worth understanding in detail.

The traditional way to build an extraction project: define a schema manually, write prompts per field, run extraction, eyeball results, adjust prompts, repeat. There was no automated accuracy scoring, no mismatch matrix, no field-level comparison against ground truth. You eyeballed the output, spot-checked values, and made judgment calls about whether the extraction was good enough to ship.

The bottleneck wasn't capability. LLMs could extract accurately when given the right prompts. The bottleneck was everything around the LLM: the human time required to define what to extract, write how to extract it, and verify that it was actually working across the full range of documents the pipeline would encounter in production.

The Agentic Prompt Studio replaces that manual loop with an AI agent-driven pipeline. You bring the documents. The agents handle the rest: analyzing structure, inferring schema, generating extraction prompts, running extractions, and scoring accuracy against verified outputs, automatically, in sequence, without manual input at each step.

Six agents, two stages, run one after the other:

Documents
    │
    ▼
Summarizer Agent      reads each document, finds the fields
    │
    ▼
Uniformer Agent        merges duplicate fields across documents
    │
    ▼
Finalizer Agent         outputs a clean JSON Schema
    │
    ▼
  Schema
    │
    ▼
Pattern Miner Agent     finds the labels and patterns per field
    │
    ▼
Prompt Architect Agent  writes the extraction prompt
    │
    ▼
Critic Dry-Runner Agent stress-tests it before you ever run it
    │
    ▼
Validated Prompt

That's the shape. Now the detail behind each stage.

Schema generation runs through the first three agents. The Summarizer Agent analyzes each document on its own. It identifies field names, data types, descriptions, and example values. Because it processes each variant on its own, no field quietly gets dropped. The Uniformer Agent takes those summaries and finds commonalities, recognizing that similarly-named fields are the same, merging duplicates, picking consistent names. The Finalizer Agent converts everything into a standard-compliant JSON schema with proper data types, required fields, nested structures, and validation rules.

Prompt generation runs through the next three. The Pattern Miner Agent digs through your samples to find extraction clues: the labels that precede fields, the formatting patterns, where fields tend to sit in each layout. The Prompt Architect Agent constructs a detailed extraction prompt with structured instructions, field-level guidance, disambiguation rules, edge case handling, and output format. The Critic Dry-Runner Agent stress-tests the prompt before you ever run it. It simulates an extraction, validates the output against the schema, and identifies potential failure points.

The result is a validated extraction prompt you didn't have to write, tested against the document variants you actually need to handle in production.

Accuracy tracking closes the loop that most platforms leave open. The Verification Set compares new results with baselines and shows what improved, regressed, and by how much. Every prompt version provides an instant document accuracy score. Track trends over time and identify which edit caused a regression.

The Mismatch Identification feature lets you see which fields matched and which didn't for each document, find the source location of any value, and get an overview of extraction quality with a Project Accuracy score. So when you fix a prompt, you're not guessing whether it helped. You can see exactly which document types improved and which ones quietly got worse.
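The underlying comparison is simple enough to sketch. The code below scores two prompt versions against verified ground truth, document by document, and reports what improved and what regressed. It illustrates the idea only and is not Unstract's Verification Set implementation. The data and function names are made up.

```python
def field_accuracy(extracted: dict, truth: dict) -> float:
    # Share of ground-truth fields that the extraction matched exactly.
    if not truth:
        return 1.0
    matches = sum(1 for key, value in truth.items() if extracted.get(key) == value)
    return matches / len(truth)


def compare_versions(baseline: dict, candidate: dict, truth: dict) -> dict:
    """Each argument maps a document id to a dict of field values."""
    report = {"improved": [], "regressed": [], "unchanged": []}
    for doc_id, expected in truth.items():
        before = field_accuracy(baseline.get(doc_id, {}), expected)
        after = field_accuracy(candidate.get(doc_id, {}), expected)
        key = "improved" if after > before else "regressed" if after < before else "unchanged"
        report[key].append((doc_id, before, after))
    return report


truth = {
    "vendor_a": {"invoice_number": "INV-001", "total": "1200.00"},
    "vendor_b": {"invoice_number": "B-77", "total": "980.00"},
}
baseline = {
    "vendor_a": {"invoice_number": "INV-001", "total": "1,200"},
    "vendor_b": {"invoice_number": "B-77", "total": "980.00"},
}
candidate = {
    "vendor_a": {"invoice_number": "INV-001", "total": "1200.00"},
    "vendor_b": {"invoice_number": "B-77", "total": None},
}

report = compare_versions(baseline, candidate, truth)
print(report)
```

In this example the new prompt fixes vendor A and breaks vendor B, which an average accuracy score across both documents would hide completely.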

Once a project reaches your accuracy threshold, it exports as a Tool. This packages the schema, prompts, and configuration into a deployable unit that can be connected to a workflow or exposed directly as an API endpoint. You're not rebuilding anything for deployment. You're shipping exactly what you tested.

An Invoice Pipeline, End to End

To make this concrete, here is what processing invoices from 50 different vendors looks like with this pipeline.

Ingestion: Documents arrive via S3, Google Drive, or a direct API call. LLMWhisperer parses each one and preserves layout regardless of whether it's a native PDF or a scanned image from older equipment.

Schema + Prompts: The Agentic Prompt Studio runs across your sample invoices. Six agents, in two sequential pipelines, build a schema covering vendor name, invoice number, line items, totals, payment terms, and due date, then generate and validate extraction prompts across all 50 layout variants.

Validation and Routing: Every extraction runs against the schema. Low-confidence fields route to Human-in-the-Loop review rather than passing through automatically, and corrections from human reviewers feed back into the accuracy baseline.

Output: Validated JSON flows into destinations like Snowflake, PostgreSQL, or your ERP via native connectors. The pipeline runs 24/7 without manual intervention.

When a new vendor appears with a novel layout, you add sample documents to the project, re-run the agents, and redeploy. The entire iteration cycle takes minutes, not days.

That asymmetry matters. Template-based systems front-load simplicity and back-load cost: every new vendor is a new maintenance burden. AI-based systems with proper tooling invert that: the setup cost is fixed and the marginal cost of each new document type drops over time.

When NOT to Build This

Worth saying directly because most articles about AI tooling skip it: not every document processing problem needs this level of infrastructure.

If you are processing documents from a single vendor with a fixed, predictable layout and low volume, a well-built template or simple rule-based extractor is cheaper and faster to maintain. AI-based extraction earns its cost when document variety is high, your vendor footprint is large, or layouts change frequently enough that template maintenance becomes expensive.

Here is how to make the call:

Your situation	What to do
Single vendor, fixed layout, low volume	Build a template or simple extractor. Don't over-engineer.
Multiple vendors, recognizable document class	Start with Agentic Prompt Studio on one class first.
High volume, high variety across document types	Full AI pipeline with validation and HITL routing.
Scanned documents, inconsistent quality	Fix the OCR/parsing layer before touching anything else.
Regulated environment (HIPAA, GDPR, SOC2)	Require on-premises or private cloud deployment explicitly.
Document layouts change frequently	Template maintenance won't scale. AI approach required.

Start with one document type, one vendor class. Get the accuracy baseline established. Then expand. The teams that start with the most complex use case and expect immediate results are the same ones who end up with six months of bad data and an emergency cleanup project.

What's Coming

Multi-modal models are changing the extraction layer in a way that's worth tracking. Current pipelines convert documents to text before the LLM sees them. Multi-modal models can reason directly over the visual representation: tables are processed as tables, signature blocks as signature blocks, stamps as stamps. This removes a translation step and with it a class of errors that stem from OCR misrepresenting visual structure.

Unstract is also building self-improving prompts: multi-agent pipelines that analyze extraction accuracy as feedback and fine-tune prompts automatically, removing the one step that still requires manual input.

Both of these raise the accuracy ceiling without requiring more engineering effort per document type. Organizations with operational capability built now will absorb those improvements without rebuilding their pipelines from scratch.

The Real Diagnostic

Before you conclude your extraction model is failing, check these four things in order:

1. Is your OCR output preserving layout? Inspect the raw text your LLM is actually working with. If table columns are flattened into a single stream, the problem starts at the parsing layer, not the prompt. Fix that first.
2. Do you have field-level accuracy scores against verified ground truth? If you're eyeballing outputs and calling it good, you don't know what your actual accuracy is. Build a verification set before you make any other changes.
3. What happens to documents that fail validation? If you don't have a defined exception path, errors are propagating downstream. Find them before your finance team does.
4. Have you tested across the full range of layouts you'll encounter in production? Twenty clean test documents from your primary vendor is not a production test. Add the edge cases deliberately, because your vendors will not warn you when they update their invoice template.

Most teams find the problem in step one.
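For the second check, field-level accuracy does not need special tooling to get started. The sketch below is one simple way to compute it, assuming predictions and ground truth are lists of flat dictionaries in the same document order. The `field_accuracy` helper and the sample values are hypothetical, and exact match after whitespace and case normalization is only one possible comparison rule.

```python
from collections import defaultdict

def field_accuracy(predictions, ground_truth):
    """Per-field exact-match accuracy over a verified document set."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total[field] += 1
            # Normalize whitespace and case so trivial formatting
            # differences are not counted as extraction errors.
            got = str(pred.get(field, "")).strip().lower()
            if got == str(expected).strip().lower():
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

truth = [
    {"invoice_number": "INV-1042", "total": "1280.00"},
    {"invoice_number": "INV-1043", "total": "94.50"},
]
preds = [
    {"invoice_number": "INV-1042", "total": "1280.00"},
    {"invoice_number": "inv-1043", "total": "945.0"},  # total misread
]
scores = field_accuracy(preds, truth)
print(scores)  # {'invoice_number': 1.0, 'total': 0.5}
```

A per-field breakdown like this is what tells you whether a prompt change helped totals while hurting invoice numbers, which a single document-level score would hide.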

Extraction accuracy is harder to measure than most engineers think. The hard part is not getting the model to extract. It's knowing whether it extracted correctly, across every document variant your pipeline will encounter at 2 AM when nobody is watching.

That's the gap between a demo and a production system. Production document processing isn't won by picking a smarter model. It's won by building a system that knows when the model is right, and catches it when it isn't.

Unstract is available as open-source (AGPL-3.0), managed cloud, and on-premises. The Agentic Prompt Studio is available in beta on all Unstract Cloud and on-premises plans.

Multi-Aspect E-Commerce Semantic Engine Using Qdrant Multivectors

Divy Yadav — Wed, 01 Jul 2026 05:33:24 +0000

How I built a multi-vector semantic search engine that splits user intent before touching the database, using ColBERT, SigLIP, and BGE within a single Qdrant point.

Six months ago I was building a semantic search engine for a small e-commerce catalog. Everything looked fine. I was using a good embedding model, cosine similarity, and the search results seemed reasonable.

Then I searched for:

"Waterproof black hiking boots with good arch support."

The first result was perfect.

The second was a waterproof jacket.

The third was a pair of black Chelsea boots.

Technically, the search wasn't wrong. Every result was semantically similar to the query.

But if I were a customer, I wouldn't care. I wanted hiking boots with arch support, not products that happened to share a few similar words.

I tried a better embedding model. I changed the chunking strategy. I added more metadata. The results improved a little, but the real problem never went away.

The problem wasn't the model.

It was the assumption that one embedding could represent an entire product.

A product has technical specifications, images, and customer reviews. Those are different kinds of information, yet I was compressing all of them into a single vector.

So I rebuilt the system.

Instead of one embedding per product, I stored separate vector spaces for specifications, images, and review findings. I split each query into different types of intent before embedding it and added a lightweight personalization layer on top.

This article walks through how I built it, why I chose this architecture, and when this extra complexity is actually worth it.

The mental model: before you read a single line of code

Here's the full pipeline in one view. Read this once, keep it in mind, and the rest of the article will click much faster.

How to read this:

Think of the system as three specialists evaluating the same product.

When a user searches for "Waterproof black hiking boots with good arch support", the query is split into three questions:

Is it waterproof? → Specs expert (ColBERT)
Does it look like black hiking boots? → Visual expert (SigLIP)
Do customers praise the arch support? → Review expert (BGE)

Each expert searches a different vector field stored inside the same Qdrant point. Text and review retrieval nominate candidates first, the visual channel performs final scoring, and personalization reranks the results.

Instead of forcing one embedding to represent everything, the system evaluates independent pieces of evidence and combines them at ranking time.

The core problem: one vector can't represent everything

When someone searches "waterproof black hiking boots with good arch support," they're not asking one question. They're asking three at the same time:

The first is a spec question: "Is this waterproof?" That's answered by product descriptions and technical data sheets.

The second is a visual question: "Does it look like black hiking boots?" That's answered by images. Not text descriptions of images. Actual images.

The third is a social question: "Do people say the arch support is good?" That's answered by customer reviews.

No single embedding model can capture all three signals with equal fidelity. When you pool them together, you get a vector that's roughly in the neighborhood of all three but isn't precise about any of them.

It's why you get Chelsea boots in your hiking boot results. Both are black footwear. The visual signal is doing fine. The arch-support signal is completely lost.

The fix is to stop treating a product as one thing and start treating it as a collection of distinct signals, each searchable independently.

Why I chose Qdrant for this

I evaluated Pinecone, Weaviate, and Qdrant before committing. The deciding factor was named multivector fields.

I needed to store multiple different types of vectors per document, each with its own dimensionality and comparison function. Pinecone doesn't support this natively. Weaviate's multivector story was still evolving when I started.

Qdrant's API is clean: you define named vector configs at collection creation time, each with different dimensions, distance metrics, and HNSW settings.

The other thing that sold me was update_vectors. In a live catalog:

Products get new images. The visual matrix needs to grow without a collection rebuild.
Reviews arrive daily. Each new review adds rows to the review matrix.

Qdrant handles both by letting you retrieve an existing vector field, concatenate new vectors, and push the updated matrix back. The collection stays up. Running queries keep working.

Versions matter here. I'm on Qdrant v1.15.3 with qdrant-client>=1.15.0. Multivectors have been available since v1.10, but the query_points prefetch API that makes multi-stage search work cleanly became stable in v1.14.

What the system stores: one point, three matrices

Every product becomes a single Qdrant point. That point has three named vector fields:

# src/commerce_engine/qdrant_store.py

VISUAL_VECTOR = "visual_vectors"   # 768-d SigLIP image embedding
TEXT_VECTOR   = "text_vectors"     # 96-d ColBERT token matrix
REVIEW_VECTOR = "review_vectors"   # 384-d BGE per-finding embeddings

All three use MAX_SIM as the comparator. Here is the full collection setup:

from qdrant_client import models

def vector_params(size: int, *, hnsw_m: int | None = None) -> models.VectorParams:
    hnsw_config = None
    if hnsw_m is not None:
        hnsw_config = models.HnswConfigDiff(m=hnsw_m)
    return models.VectorParams(
        size=size,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
        hnsw_config=hnsw_config,
    )

def recreate_collection(client, collection, *, profile="baseline", disable_text_hnsw=True):
    client.create_collection(
        collection_name=collection,
        vectors_config={
            VISUAL_VECTOR: vector_params(VISION_DIM),                                         # 768
            TEXT_VECTOR:   vector_params(TEXT_DIM, hnsw_m=0 if disable_text_hnsw else None),  # 96
            REVIEW_VECTOR: vector_params(REVIEW_DIM),                                         # 384
        },
        quantization_config=quantization_config(profile),
    )
    create_payload_indexes(client, collection)

The hnsw_m=0 on TEXT_VECTOR is intentional. Setting m=0 disables HNSW graph indexing entirely for that field.

Normally that would be a performance disaster. Here it isn't, for two reasons:

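Text vectors are intended for ColBERT-style late-interaction scoring over a small candidate set, not for approximate nearest-neighbor search across the whole catalog. At 20–50 candidates, brute-force matrix comparison beats HNSW because the graph traversal overhead costs more than it saves at that N. One caveat: the search code later in this article issues the text query as a prefetch, and with m=0 that stage is an exact scan over every point that passes the filter, so re-measure latency as the catalog grows.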
It saves index memory. Token-level matrices are larger than pooled vectors. Skipping the HNSW graph for text_vectors reduces memory footprint, which compounds when every one of thousands of products stores a 96-d matrix with one row per token.
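To put rough numbers on the memory point, here is a back-of-the-envelope calculation. It only counts raw float32 storage for the matrix shapes mentioned in this article (about 28 ColBERT tokens, one SigLIP vector, three review findings) and deliberately ignores HNSW graph links, payload, and quantization, so treat it as a lower bound rather than a measurement.

```python
BYTES_PER_FLOAT32 = 4

def matrix_bytes(rows: int, dim: int) -> int:
    """Raw float32 storage for one multivector, ignoring index and payload overhead."""
    return rows * dim * BYTES_PER_FLOAT32

text_bytes   = matrix_bytes(rows=28, dim=96)   # ~28 ColBERT tokens per product
visual_bytes = matrix_bytes(rows=1,  dim=768)  # one pooled SigLIP vector
review_bytes = matrix_bytes(rows=3,  dim=384)  # three review findings

print(text_bytes, visual_bytes, review_bytes)  # 10752 3072 4608
```

Even at 96 dimensions, the text matrix is the largest of the three fields for a product like this, because it has one row per token rather than one row per product.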

The payload also gets indexed separately so Qdrant can filter before scoring:

def create_payload_indexes(client, collection):
    # Keyword fields
    for field in ["brand", "category", "region", "color", "size", "product_id"]:
        client.create_payload_index(
            collection_name=collection,
            field_name=field,
            field_schema=models.PayloadSchemaType.KEYWORD,
        )
    # Bool field
    client.create_payload_index(
        collection_name=collection,
        field_name="availability",
        field_schema=models.PayloadSchemaType.BOOL,
    )
    # Float fields
    for field in ["price", "eco_score"]:
        client.create_payload_index(
            collection_name=collection,
            field_name=field,
            field_schema=models.PayloadSchemaType.FLOAT,
        )

This matters more than it looks. With payload indexes in place, Qdrant can resolve filters such as availability=True or price <= 150.0 from the index and discard non-matching candidates before any embedding comparison runs. Without indexes the filter still applies, but Qdrant has to check each candidate's payload during the search and cannot estimate how selective the filter is, which is slower. Index your filterable fields before you query.

The three embedding pipelines

Each vector field in the point comes from a different embedding pipeline. The same product goes through all three before ingestion.

Visual pipeline: SigLIP

Product images go through google/siglip-base-patch16-224.

SigLIP is a vision-language model with aligned image and text embeddings. Aligned means you can encode a text query like "black hiking boots" and compare it directly against an image embedding. No bridge model. No separate alignment step.

# src/commerce_engine/embeddings.py

from pathlib import Path

from PIL import Image

def image_patches(self, image_path: Path) -> list[list[float]]:
    """Returns shape [1, 768]: normalized SigLIP pooled image embedding."""
    import torch

    self._load_siglip()
    image = Image.open(image_path).convert("RGB")
    siglip_inputs = self._siglip_processor(images=image, return_tensors="pt")
    siglip_inputs = {k: v.to(self.device) for k, v in siglip_inputs.items()}

    with torch.no_grad():
        vision_outputs = self._siglip_model.vision_model(**siglip_inputs)
        pooled = vision_outputs.pooler_output        # shape: [1, 768]
        normalized = torch.nn.functional.normalize(pooled, dim=-1)

    return normalized.cpu().numpy().astype(float).tolist()

And the corresponding text-side query encoding:

import torch

def visual_query(self, query: str) -> list[list[float]]:
    """Returns shape [1, 768]: normalized SigLIP text tower output."""
    self._load_siglip()

    text_inputs = self._siglip_processor(
        text=[query], return_tensors="pt", padding=True, truncation=True
    )

    text_inputs = {k: v.to(self.device) for k, v in text_inputs.items()}

    with torch.no_grad():
        text_outputs = self._siglip_model.text_model(**text_inputs)
        pooled = text_outputs.pooler_output          # shape: [1, 768]
        normalized = torch.nn.functional.normalize(pooled, dim=-1)

    return normalized.cpu().numpy().astype(float).tolist()

Both outputs are L2-normalized before storage and querying. This makes cosine similarity equivalent to dot product, which is faster for Qdrant to compute.

The stored shape for each product is [1, 768], a matrix with one row. It's technically a 2D multivector with one element, but storing it as a matrix rather than a flat vector keeps the interface consistent and makes appending additional image patches straightforward later.
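The claim that normalization makes cosine similarity and dot product interchangeable is easy to check by hand. The snippet below uses toy 3-d vectors as stand-ins for the 768-d SigLIP outputs; the helper names are mine, not from the project.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

image_vec = [3.0, 4.0, 0.0]  # toy stand-in for an image embedding
text_vec  = [1.0, 2.0, 2.0]  # toy stand-in for a text-query embedding

unit_image = l2_normalize(image_vec)
unit_text  = l2_normalize(text_vec)

# Cosine on the raw vectors equals a plain dot product on the unit vectors.
assert math.isclose(cosine(image_vec, text_vec), dot(unit_image, unit_text))
print(round(dot(unit_image, unit_text), 4))  # 0.7333
```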

Text pipeline: ColBERT

Product titles and spec sheets go through FastEmbed's answerdotai/answerai-colbert-small-v1. This model produces token-level embeddings, 96 dimensions per token, without pooling.

def text_late(self, texts: list[str]) -> list[list[list[float]]]:
    return [
        embedding.astype(float).tolist()
        for embedding in self.late_model.embed(texts)
    ]

The return shape is [num_texts, num_tokens, 96]. For one product title, you get a matrix of shape [num_tokens, 96]. A spec-heavy product like "TrailForge StormShield Black Hiking Boots. support: nylon shank and molded arch support. upper: black ripstop textile. waterproof_rating: IPX6 waterproof membrane for heavy rain" tokenizes to around 25-30 tokens, so the stored matrix is [~28, 96].

The text document is assembled in the Product model:

# src/commerce_engine/models.py

def text_document(self) -> str:
    specs = " ".join(f"{key}: {value}" for key, value in sorted(self.specs.items()))
    return f"{self.title}. {specs}"

Title first, then all specs as key-value pairs. The spec keys (support, waterproof_rating, upper) are included as text so ColBERT can match a query token like "waterproof" to the spec key, not just the spec value.
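To see what ColBERT actually receives, here is the same method on a pared-down stand-in for the Product model. The dataclass below is an assumption that keeps only `title` and `specs`; the real model has more fields.

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    """Pared-down stand-in for the article's Product model."""
    title: str
    specs: dict = field(default_factory=dict)

    def text_document(self) -> str:
        # Sorting the keys makes the document deterministic across runs.
        specs = " ".join(f"{key}: {value}" for key, value in sorted(self.specs.items()))
        return f"{self.title}. {specs}"

boot = Product(
    title="TrailForge StormShield Black Hiking Boots",
    specs={
        "waterproof_rating": "IPX6 waterproof membrane for heavy rain",
        "upper": "black ripstop textile",
        "support": "nylon shank and molded arch support",
    },
)
document = boot.text_document()
print(document)
```

The keys come out in alphabetical order (support, upper, waterproof_rating) regardless of insertion order. Note that the pairs are joined with a single space, so they run together unless the spec values carry their own punctuation.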

Review pipeline: BGE with semantic finding extraction

Raw reviews don't go straight to the embedder. They go through a finding extraction step first. This is where the architecture differs most from a naive approach.

A review like "Waterproof in heavy rain and mud. Good arch support on long hikes. Excellent grip on wet rock." holds three separate factual claims. Embedding the full review as one vector averages all three. A product with ten reviews embedded whole gives you ten averaged blobs. You lose the specific claim that matters to a specific query.

Instead, extract_semantic_findings() pulls out specific claims using regex patterns:

# src/commerce_engine/reviews.py

import re

FINDING_PATTERNS = [
    r"waterproof(?: in [a-z ]+)?",
    r"kept feet dry",
    r"good arch support",
    r"arch support is [a-z]+",
    r"supportive footbed",
    r"excellent grip",
    r"not enough grip",
    r"highly durable",
    r"runs a little small",
    r"runs small",
    r"strong ankle support",
]

def extract_semantic_findings(reviews: list[str]) -> list[str]:
    findings: list[str] = []
    for review in reviews:
        normalized = review.lower().strip()
        matched = False
        for pattern in FINDING_PATTERNS:
            match = re.search(pattern, normalized)
            if match:
                findings.append(match.group(0))
                matched = True
        if not matched:
            findings.append(normalized.rstrip("."))
    return list(dict.fromkeys(findings))  # deduplication

The findings for the TrailForge boot come out as: ["waterproof in heavy rain and mud", "good arch support", "excellent grip"]. Each finding then gets embedded separately with BAAI/bge-small-en-v1.5 into a 384-dimensional vector:

def review_findings(self, findings: list[str]) -> list[list[float]]:
    if not findings:
        return [_unit_vector("empty-review", REVIEW_DIM)]
    return [
        embedding.astype(float).tolist()
        for embedding in self.review_model.embed(findings)
    ]

The stored review matrix is [num_findings, 384]. Three findings become a [3, 384] matrix. A product with ten reviews and 15 extracted findings becomes [15, 384].

When a query includes "good arch support," it gets embedded as a single 384-d vector and compared via MAX_SIM against this matrix. The most similar finding wins. That matching finding is "good arch support" from a review, not an approximation averaged across all review content.
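A toy example makes the difference between per-finding matching and whole-review averaging visible. The 3-d vectors below are invented stand-ins for 384-d BGE embeddings, with roughly one axis per claim, and mean pooling is only a crude approximation of what embedding the full review would do.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Invented embeddings: each finding points mostly along its own axis.
findings = {
    "waterproof in heavy rain and mud": normalize([1.0, 0.1, 0.0]),
    "good arch support":                normalize([0.0, 1.0, 0.1]),
    "excellent grip":                   normalize([0.1, 0.0, 1.0]),
}
query = normalize([0.0, 1.0, 0.0])  # pretend embedding of an arch-support query

# MAX_SIM with a single query vector: keep the best-matching finding.
best_finding, max_sim = max(
    ((text, dot(query, vec)) for text, vec in findings.items()),
    key=lambda pair: pair[1],
)

# Whole-review embedding approximated by mean-pooling the three findings.
mean_vec = normalize([sum(col) / len(findings) for col in zip(*findings.values())])
pooled_sim = dot(query, mean_vec)

print(best_finding, round(max_sim, 3), round(pooled_sim, 3))
# good arch support 0.995 0.577
```

The pooled vector sits between all three claims, so its similarity to any one of them is diluted. The per-finding matrix keeps the arch-support claim intact.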

The full ingestion step assembles all three matrices into a single point:

# src/commerce_engine/ingest.py

def product_point(product: Product, embedder: Embedder) -> models.PointStruct:
    text_matrix   = embedder.text_late([product.text_document()])[0]
    review_matrix = embedder.review_findings(
        extract_semantic_findings(product.reviews)
    )
    visual_matrix = embedder.image_patches(Path(product.image_path))

    return models.PointStruct(
        id=point_id(product.id),
        payload=product.payload(),
        vector={
            VISUAL_VECTOR: visual_matrix,   # shape [1, 768]
            TEXT_VECTOR:   text_matrix,     # shape [~28, 96]
            REVIEW_VECTOR: review_matrix,   # shape [num_findings, 384]
        },
    )

One product. One point. Three matrices, each solving a different retrieval problem.

Query decomposition: routing intent before embedding

When a user submits a query, the first step is decomposition. The query does not go directly to all three vector fields.

# src/commerce_engine/query.py

TEXT_KEYWORDS   = {"waterproof", "water-resistant", "rain", "membrane", "insulated", "leather"}
VISUAL_KEYWORDS = {"black", "brown", "charcoal", "red", "blue", "hiking", "boots", "sneakers"}
REVIEW_KEYWORDS = {"support", "arch", "grip", "durable", "comfortable", "runs", "excellent"}

def decompose_query(query: str) -> QueryPlan:
    tokens = tokenize(query)
    text_terms   = [t for t in tokens if t in TEXT_KEYWORDS]
    visual_terms = [t for t in tokens if t in VISUAL_KEYWORDS]
    review_terms = [t for t in tokens if t in REVIEW_KEYWORDS]
    phrase = " ".join(tokens)
    # phrase-level overrides for known compound terms
    if "good arch support" in phrase:
        review_terms.extend(["good", "arch", "support"])
    if "black hiking boots" in phrase:
        visual_terms.extend(["black", "hiking", "boots"])
    if "waterproof" in phrase:
        text_terms.append("waterproof")
    return QueryPlan(
        original_query=query,
        text_terms=list(dict.fromkeys(text_terms)),
        visual_terms=list(dict.fromkeys(visual_terms)),
        review_terms=list(dict.fromkeys(review_terms)),
    )

For "Waterproof black hiking boots with good arch support", the decomposition produces:

text_terms = ["waterproof"] → routed to ColBERT spec matching
visual_terms = ["black", "hiking", "boots"] → routed to SigLIP visual matching
review_terms = ["arch", "support", "good"] → routed to BGE review matching

The QueryPlan model exposes three properties that join these terms back into sub-query strings:

# src/commerce_engine/models.py

@property
def text_query(self) -> str:
    return " ".join(self.text_terms or [self.original_query])
@property
def visual_query(self) -> str:
    return " ".join(self.visual_terms or [self.original_query])
@property
def review_query(self) -> str:
    return " ".join(self.review_terms or [self.original_query])

If no terms match a category, the original query falls back as the sub-query for that field. This prevents empty queries.
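Here is the fallback in action on a simplified stand-in for QueryPlan. The dataclass is an assumption, since the article only shows the three properties, and the term lists are filled in by hand rather than by decompose_query.

```python
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    """Simplified stand-in for the article's QueryPlan model."""
    original_query: str
    text_terms: list = field(default_factory=list)
    visual_terms: list = field(default_factory=list)
    review_terms: list = field(default_factory=list)

    @property
    def text_query(self) -> str:
        return " ".join(self.text_terms or [self.original_query])

    @property
    def visual_query(self) -> str:
        return " ".join(self.visual_terms or [self.original_query])

    @property
    def review_query(self) -> str:
        return " ".join(self.review_terms or [self.original_query])

# Only visual keywords matched, so the other two fields fall back to the full query.
plan = QueryPlan(
    original_query="red sneakers for summer",
    visual_terms=["red", "sneakers"],
)
print(plan.visual_query)  # red sneakers
print(plan.text_query)    # red sneakers for summer
print(plan.review_query)  # red sneakers for summer
```

The fallback keeps every channel populated, at the cost of sending words from other aspects into the spec and review channels.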

I'll be direct about the limitation: this decomposition uses keyword matching, which is coarse. A better production implementation would call an LLM to classify query intent. The architecture is the same either way. What matters is that you route before you embed, not that you route using a fancy classifier. Get the structure right first.

The search: prefetch then final scoring

Qdrant's query_points supports a prefetch parameter. It lets you retrieve candidate sets from multiple vector fields in parallel, then apply a final scoring stage against only those candidates. This is the mechanism that ties the three vector fields together.

# src/commerce_engine/search.py

def search_products(client, collection, request, embedder):
    plan = decompose_query(request.query)
    query_filter = build_filter(request.filters)
    # Embed all three sub-queries
    text_query   = embedder.text_late([plan.text_query])[0]
    review_query = embedder.review_findings([plan.review_query])
    visual_query = embedder.visual_query(plan.visual_query)
    candidate_limit = max(request.limit * 5, 20)
    # Text and review run as prefetch candidates
    prefetch = [
        models.Prefetch(
            query=text_query,
            using=TEXT_VECTOR,
            limit=candidate_limit,
            filter=query_filter,
        ),
        models.Prefetch(
            query=review_query,
            using=REVIEW_VECTOR,
            limit=candidate_limit,
            filter=query_filter,
        ),
    ]

    # Visual query scores and ranks the prefetch candidates
    response = client.query_points(
        collection_name=collection,
        prefetch=prefetch,
        query=visual_query,
        using=VISUAL_VECTOR,
        query_filter=query_filter,
        limit=max(request.limit * 3, 10),
        with_payload=True,
        with_vectors=True,
    )

Text and review prefetch stages each return up to candidate_limit results. Qdrant unions them. The final visual query then scores only those candidates using the VISUAL_VECTOR field. Payload filters apply at every stage.
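The control flow is easier to reason about without a database in the way. The simulation below is plain Python, not Qdrant: the catalog, the product ids, and the precomputed per-aspect scores are all hypothetical, and `top_k` stands in for a single Prefetch.

```python
# Hypothetical precomputed similarity scores per aspect for a toy catalog.
catalog = {
    "hiking-boot-a": {"text": 0.82, "review": 0.74, "visual": 0.81},
    "hiking-boot-b": {"text": 0.79, "review": 0.20, "visual": 0.77},
    "chelsea-boot":  {"text": 0.10, "review": 0.15, "visual": 0.85},
    "rain-jacket":   {"text": 0.80, "review": 0.05, "visual": 0.12},
    "trail-runner":  {"text": 0.30, "review": 0.70, "visual": 0.40},
}

def top_k(aspect, k):
    """Stand-in for one Prefetch: the k best ids for a single vector field."""
    return sorted(catalog, key=lambda pid: catalog[pid][aspect], reverse=True)[:k]

def two_stage_search(candidate_limit, limit):
    # Stage 1: each prefetch nominates candidates; the nominations are merged.
    candidates = set(top_k("text", candidate_limit)) | set(top_k("review", candidate_limit))
    # Stage 2: only nominated candidates are ranked by the final (visual) score.
    ranked = sorted(candidates, key=lambda pid: catalog[pid]["visual"], reverse=True)
    return ranked[:limit]

results = two_stage_search(candidate_limit=2, limit=3)
print(results)  # ['hiking-boot-a', 'trail-runner', 'rain-jacket']
```

Two things stand out. The Chelsea boot has the highest visual score in the catalog but never appears, because neither prefetch nominated it. And the order of the survivors is decided by the visual score alone, which is why the per-aspect breakdown computed after retrieval is worth having.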

After Qdrant responds, the code computes per-aspect MAX_SIM scores to get a named breakdown:

from commerce_engine.scoring import maxsim_score

for point in response.points:
    vectors = point.vector or {}
    doc_visual = vectors.get(VISUAL_VECTOR, [])
    doc_text   = vectors.get(TEXT_VECTOR, [])
    doc_review = vectors.get(REVIEW_VECTOR, [])
    v_score = maxsim_score(visual_query, doc_visual) if doc_visual else 0.0
    t_score = maxsim_score(text_query,   doc_text)   if doc_text   else 0.0
    r_score = maxsim_score(review_query, doc_review) if doc_review else 0.0
    combined_score = v_score + t_score + r_score

The MAX_SIM function itself is a few lines of NumPy:

# src/commerce_engine/scoring.py

import numpy as np

def maxsim_score(query_matrix, doc_matrix) -> float:
    query = np.asarray(query_matrix, dtype=np.float32)
    doc   = np.asarray(doc_matrix,   dtype=np.float32)
    similarities = query @ doc.T                  # [query_tokens, doc_tokens]
    return float(similarities.max(axis=1).sum())  # max per query token, then sum

For each query token: find the highest similarity against all document tokens. Sum those per-token maximums. That sum is the MAX_SIM score. This is the full ColBERT late-interaction formula.

Computing it manually after retrieval gives you per-aspect scores (visual=0.81, text=0.44, review=0.69) in the explanation field. That breakdown is the best debugging tool in the system. When a result feels wrong, the aspect scores tell you immediately whether the visual, spec, or review signal failed to match.
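If the MAX_SIM arithmetic still feels abstract, this worked example does it by hand on a toy 2-d space. It is a dependency-free rewrite of the same formula, not the project's function, and the numbers are chosen to be easy to follow.

```python
def maxsim(query_matrix, doc_matrix):
    """Sum over query rows of the best dot product against any document row."""
    total = 0.0
    for q in query_matrix:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_matrix)
    return total

# Two query tokens and three document tokens.
query = [[1.0, 0.0],
         [0.0, 1.0]]
doc   = [[0.9, 0.1],
         [0.2, 0.8],
         [0.5, 0.5]]

# Query row 1 vs each doc row: 0.9, 0.2, 0.5 -> max 0.9
# Query row 2 vs each doc row: 0.1, 0.8, 0.5 -> max 0.8
score = maxsim(query, doc)
print(round(score, 2))  # 1.7
```

One consequence worth keeping in mind: because the score is a sum over query rows, a multi-token text query can reach a larger maximum than the single-row visual and review queries, so the three aspect scores are not automatically on the same scale when they are added together.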

Personalization: reranking the same candidates differently

After Qdrant returns results, a personalization layer reranks them based on the requesting user's profile. The Qdrant retrieval step is identical for every user. Only the reranking changes.

# src/commerce_engine/scoring.py

def personalization_boost(payload: dict, profile: UserProfile) -> tuple[float, list[str]]:
    boost = 0.0
    reasons = []
    if payload.get("brand") in profile.preferred_brands:
        boost += 0.25
        reasons.append(f"preferred brand: {payload['brand']}")
    price = float(payload.get("price", 0.0))
    low, high = profile.price_range
    if low <= price <= high:
        boost += 0.30
        reasons.append(f"price in user range: {low:g}-{high:g}")
    if profile.eco_preference and payload.get("is_sustainable"):
        boost += 0.30
        reasons.append("eco preference matched")
    if payload.get("category") in profile.favorite_categories:
        boost += 0.20
        reasons.append(f"favorite category: {payload['category']}")
    return boost, reasons

def rerank(results: list[SearchResult], profile: UserProfile) -> list[SearchResult]:
    reranked = []
    for result in results:
        boost, reasons = personalization_boost(result.payload, profile)
        final_score = result.qdrant_score * (1.0 + boost)
        reranked.append(
            result.model_copy(update={
                "final_score": final_score,
                "personalization_boost": boost,
                "explanation": [*result.explanation, *reasons],
            })
        )
    return sorted(reranked, key=lambda r: r.final_score, reverse=True)

The boost is multiplicative, not additive. A product with a semantic score of 0.90 and a total boost of 0.55 finishes at 0.90 × 1.55 = 1.395. A product with a score of 0.95 and no matching preferences stays at 0.95. Personalization can flip the ranking.
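The arithmetic is worth running once, including the case that motivates multiplying instead of adding. The `final_score` helper mirrors the formula in rerank(). The 0.55 boost is shown as brand plus price range, which is one combination of the weights above that produces it, and the weak-match score of 0.20 is invented for the comparison.

```python
def final_score(semantic_score: float, boost: float) -> float:
    """Multiplicative personalization, as in rerank(): score * (1 + boost)."""
    return semantic_score * (1.0 + boost)

boost = 0.25 + 0.30  # preferred brand + price in range

boosted   = final_score(0.90, boost)
unboosted = final_score(0.95, 0.0)
print(round(boosted, 3), round(unboosted, 3))  # 1.395 0.95
print(boosted > unboosted)                     # True: the ranking flips

# A weak semantic match with the same preferences stays low when multiplied,
# whereas adding the boost would lift it close to the strong matches.
weak_multiplied = final_score(0.20, boost)
weak_added      = 0.20 + boost
print(round(weak_multiplied, 2), round(weak_added, 2))  # 0.31 0.75
```

With a multiplicative boost, preferences can reorder products that are already relevant, but they cannot carry an irrelevant product to the top on their own.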

Every matched boost reason gets appended to the result's explanation list, so a top result, joined for display, might show: "preferred brand: TrailForge | price in user range: 80-165 | review evidence: arch support". This matters for two reasons: users trust recommendations they can understand, and you can debug a bad result in seconds by looking at why its score was what it was.

Updating products without rebuilding anything

Production catalogs are not static. A product gets a new image. Reviews come in daily. The system needs to handle both without downtime.

Both update operations follow the same pattern: retrieve the existing vector field, concatenate new vectors, push the updated matrix back with update_vectors.

# src/commerce_engine/updates.py

def append_review(client, collection, product_id, review, embedder):
    point    = _get_point(client, collection, product_id)
    findings = extract_semantic_findings([review])
    new_vecs = embedder.review_findings(findings)
    existing        = (point.vector or {}).get(REVIEW_VECTOR, [])
    updated_matrix  = [*existing, *new_vecs]
    update_named_vectors(client, collection, product_id, {REVIEW_VECTOR: updated_matrix})
    client.set_payload(
        collection_name=collection,
        payload={"reviews": [*point.payload.get("reviews", []), review]},
        points=[point_id(product_id)],
        wait=True,
    )
    return {"product_id": product_id, "findings": findings}

def append_image(client, collection, product_id, image_path, embedder):
    point   = _get_point(client, collection, product_id)
    new_vecs = embedder.image_patches(image_path)
    existing       = (point.vector or {}).get(VISUAL_VECTOR, [])
    updated_matrix = [*existing, *new_vecs]
    update_named_vectors(client, collection, product_id, {VISUAL_VECTOR: updated_matrix})
    return {
        "product_id": product_id,
        "added_patches": len(new_vecs),
        "total_patches": len(updated_matrix),
    }

After append_review, the product's review matrix grows by one row per extracted finding. After append_image, the visual matrix gains one row. No collection rebuild. Running queries keep working.

This matters more than it sounds. If adding a single review required re-ingesting the whole product, you'd batch updates and your review data would always lag. With update_vectors, a webhook can process incoming reviews in real time, and the next search query immediately benefits from the new signal.
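Stripped of the Qdrant calls, the update pattern is a read, a concatenate, and a write of the whole matrix. The sketch below uses a plain dictionary as a stand-in for the collection; the `store` layout and the `append_rows` helper are hypothetical.

```python
# In-memory stand-in for a collection: product id -> named vector fields.
store = {
    "boot-a": {"review_vectors": [[0.1, 0.9], [0.8, 0.2]]},
}

def append_rows(product_id, field, new_rows):
    """Read the existing matrix, concatenate, and write the whole matrix back."""
    existing = store[product_id].get(field, [])
    updated = [*existing, *new_rows]
    store[product_id][field] = updated  # the real code calls update_vectors here
    return len(updated)

total_rows = append_rows("boot-a", "review_vectors", [[0.5, 0.5]])
print(total_rows)  # 3
```

Because this is a read-modify-write, two appends for the same product that overlap in time could each read the old matrix and one would overwrite the other. If reviews arrive through concurrent webhooks, it is probably worth serializing updates per product.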

What the benchmarks actually showed

The benchmark runs 104 queries against the fixture product set (4 hiking boot variants), generated from product attribute combinations: color + category, brand + category, spec-based, review-fragment, and compound queries. Latencies measured with time.perf_counter against a live Qdrant instance.

Results for the baseline profile (no quantization, HNSW disabled for text):

95.2% Recall@3 means 99 of 104 queries returned the target product in the top 3. The 5 failures are edge cases where the query decomposer routes everything to one aspect but the product's signal lives in a different field.
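For reference, this is how a Recall@k figure like that is computed. The helper and the toy result lists are illustrative and not taken from the project's benchmark runner; the last line just reproduces the 99 of 104 arithmetic.

```python
def recall_at_k(ranked_ids, targets, k=3):
    """Fraction of queries whose target id appears in the top-k results."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_ids, targets))
    return hits / len(targets)

# Three toy queries; the last one ranks its target fourth, so it misses at k=3.
results = [
    ["boot-a", "boot-b", "boot-c", "boot-d"],
    ["boot-b", "boot-a", "boot-d", "boot-c"],
    ["boot-a", "boot-b", "boot-c", "boot-d"],
]
targets = ["boot-a", "boot-d", "boot-d"]

print(round(recall_at_k(results, targets), 3))  # 0.667
print(round(99 / 104, 3))                       # 0.952
```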

Honest caveats worth stating:

This is 4 products. The 2.2 ms latency will not hold at 100k products.
Benchmark queries were generated from the same fixture data they retrieve. Some overfitting is baked in.
Binary quantization knocked Recall@3 down noticeably. Compressing 32-bit floats to 1 bit is extreme for ColBERT's token-level matching. Start with scalar INT8 and measure recall before deploying.
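The quantization caveat is easier to believe with a toy case. The two functions below are deliberately simplified illustrations of 1-bit and 8-bit quantization, not Qdrant's actual quantizers, and the two 4-d "token" vectors are invented.

```python
def binary_quantize(v):
    """Keep only the sign of each component (1 bit per dimension)."""
    return [1 if x > 0 else 0 for x in v]

def int8_quantize(v, lo=-1.0, hi=1.0):
    """Map each component linearly from [lo, hi] onto 0..255 (8 bits per dimension)."""
    scale = 255 / (hi - lo)
    return [round((min(max(x, lo), hi) - lo) * scale) for x in v]

token_a = [0.90, 0.05, -0.10, 0.40]
token_b = [0.05, 0.90, -0.40, 0.10]  # different direction, same signs

print(binary_quantize(token_a) == binary_quantize(token_b))  # True: indistinguishable
print(int8_quantize(token_a) == int8_quantize(token_b))      # False: still distinct
```

Two tokens that point in clearly different directions collapse to the same bit pattern once only the signs survive. In a low-dimensional space like the 96-d ColBERT vectors, that kind of collision is a plausible reason for the recall drop, which is why measuring scalar quantization first is the safer order.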

When not to build this

Three embedding pipelines, a query decomposer, and prefetch orchestration is not always the right answer. Skip it if any of these apply:

Short, uniform queries like "red shoes" or "wool sweater." Single dense vector handles this fine.
No clean product images. SigLIP needs consistent product photography. Blurry or inconsistent shots produce weak visual signal. Drop that field.
No customer reviews. The review field adds complexity for zero gain. Skip it.
Catalog under 10k products with low query load. The simpler architecture will likely be fast enough and much cheaper to maintain.
Broken baseline retrieval. Bad base embeddings, wrong chunking, or misconfigured filters won't get fixed by adding more vectors. Multivectors amplify signal. If there's no signal to amplify, they amplify noise.

When this is worth building

The complexity pays off when all of these are true:

Queries contain mixed intent. Spec questions, visual questions, and review-based questions in the same search string. Common in outdoor gear, fashion, electronics, and beauty.
Clean product images at scale. 100k+ SKUs with consistent photography. The SigLIP channel earns its storage cost here.
Reviews arrive continuously. If reviews update daily, the update_vectors pattern becomes genuinely valuable versus batch re-indexing.
You have user profile data. The personalization layer needs signal to work with. Without it, the reranking step is adding noise.

Decision matrix

![Decision matrix](https://dev-to-uploads.s3.us-east-2.amazonaws.com/uploads/articles/52ipdt5rv5se2mch6f2h.png)

Implementation path if you're starting from scratch

Do not skip step 1. The single-vector baseline is not just scaffolding. It is the control you need to measure whether the additional complexity actually improves results for your specific catalog.

I have seen teams add three embedding pipelines on day one because the architecture sounds impressive, then spend three weeks trying to figure out why recall got worse. It was worse because their base embeddings were bad. Fix the foundation before you add floors.

The thing that actually changed after building this

I went back and ran the original "Waterproof black hiking boots with good arch support" query through the finished system. Rank 1 was the TrailForge StormShield. Rank 2 was the EcoTrek TerraDry.

Both were actual hiking boots with documented arch support in customer reviews.

The Chelsea boots were gone.

The insight isn't that multivectors are magic. The insight is that the original setup was asking one number to represent three different kinds of evidence, and one number is not enough. The fix was to stop compressing and start separating.

Most search quality problems in production are not model problems. They are signal representation problems. The right model pointed at mixed-up data will keep returning wrong answers. Separating the signals is the work. The models are just the tools.

"Your search returns technically relevant results that nobody clicks? Ask yourself how many different types of user intent you are compressing into one similarity score."

The full source code, tests, benchmark runner, FastAPI endpoints, Streamlit UI, and real product dataset are all at:

GitHub: https://github.com/dvy246/qdrant-multivector

Everything described in this article maps to a real file in src/commerce_engine. No placeholder functions, no pseudo-code, no hand-waved implementation details.

Project at a glance

A quick summary of everything this system does, for anyone who wants the short version before going through the code.

What it is: A multi-aspect semantic search engine for e-commerce that splits every query into visual, spec, and review signals before retrieval.

Stack:

Vector database: Qdrant v1.15.3 with qdrant-client>=1.15.0
Visual embeddings: google/siglip-base-patch16-224 (768-d, via Hugging Face Transformers)
Text embeddings: answerdotai/answerai-colbert-small-v1 (96-d, via FastEmbed, ColBERT late interaction)
Review embeddings: BAAI/bge-small-en-v1.5 (384-d, via FastEmbed, per-finding)
API: FastAPI + Typer CLI
UI: Streamlit
Python: 3.11+

Benchmark results (baseline profile, 4 fixture products):

Mean latency: 2.2 ms / P95: 2.5 ms
Recall@3: 95.2% / Recall@5: 100%
MRR: 0.7204 / NDCG@5: 0.7920
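If these metric names are new to you, here is a minimal sketch of how each one is commonly computed for a single query, assuming binary relevance (a result is either relevant or not). The benchmark runner in the repository may define them differently, and the product IDs below are made up for illustration.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result; MRR is the mean of this over all queries."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the actual ranking divided by the gain of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranking where an irrelevant Chelsea boot takes rank 1.
ranked = ["chelsea-boot", "trailforge-stormshield", "ecotrek-terradry", "canvas-sneaker"]
relevant = {"trailforge-stormshield", "ecotrek-terradry"}

print(recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```

Recall only asks whether the right products made the cut. MRR and NDCG also punish the ranking for putting the wrong boot first, which is why they can sit well below recall on the same run.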

Run it yourself:

git clone https://github.com/dvy246/qdrant-multivector.git
cd qdrant-multivector
uv sync --extra dev
docker compose up -d qdrant
EMBEDDING_BACKEND=deterministic uv run engine init-qdrant
EMBEDDING_BACKEND=deterministic uv run engine ingest --fixtures
EMBEDDING_BACKEND=deterministic uv run engine search \
  "Waterproof black hiking boots with good arch support" --user user_a

References

Qdrant vectors and multivectors documentation - the definitive reference for multivector configuration, MAX_SIM comparator behavior, and named vector field setup.
Qdrant hybrid queries and prefetch API - documents the query_points prefetch parameter used in the multi-stage search flow.
Qdrant payload indexing documentation - covers keyword, bool, and float index schema types used in create_payload_indexes.
Qdrant quantization documentation - covers scalar INT8 and binary quantization config, always_ram, and quantile settings.
FastEmbed ColBERT documentation - covers LateInteractionTextEmbedding, model selection, and token-matrix output format.
Hugging Face SigLIP model card - covers google/siglip-base-patch16-224, the vision and text tower architecture, and pooler output shape.
answerdotai/answerai-colbert-small-v1 - the specific FastEmbed ColBERT model used for 96-dimensional token-level text embeddings.
BAAI/bge-small-en-v1.5 - the BGE model used for 384-dimensional review finding embeddings.
Women's E-Commerce Clothing Reviews dataset - the real product dataset derived from Kaggle, used in the Streamlit demo.

Building Long-Running Claude Managed Agents: Why State Matters More Than Compute

Divy Yadav — Thu, 25 Jun 2026 10:50:30 +0000

A build story with real code, real failures, and the specific reasons one sandbox provider fixed problems I didn't know I had.

At 9:03am on a Tuesday, my research agent said hello and stared at an empty /workspace/.

Six hours of analysis from the night before. Gone.

The cloned repository. The installed packages. The notes it had spent hours writing. Gone.

I had assumed that if an agent stopped working for the night, it could simply continue the next morning. That was wrong.

Over the next three weeks, I rebuilt the same workflow on Tensorlake, Cloudflare, and Daytona to figure out what had happened. The hardest part of running Claude Managed Agents isn't the model. It's everything underneath it.

This is the exact code I ran, the things that broke, and the mistake that cost me two weeks to understand.

What Claude Managed Agents is, before anything else

If you've never built with Claude Managed Agents, the architecture needs a minute. Skip this if you already know it.

Anthropic runs the reasoning. You run the execution.

The agent loop, session state, work queue, and retry logic all live on Anthropic's infrastructure. You configure a Self-hosted Environment in the Claude Console. When your application starts a session, Anthropic queues the work, your orchestrator picks it up, spins up a sandbox, and the model starts issuing tool calls into that sandbox.

Every bash, read, write, grep, and edit call executes inside an environment you own. Anthropic never touches it. You decide what that environment looks like, what it can access, and what happens between sessions.

Anthropic's intelligence is fixed. Your engineering determines whether that intelligence has a stable, stateful environment to work in, or a clean slate that forgets everything the moment it goes idle.

What I was building and why it mattered

I needed an agent that could do real deep-work research on a codebase: clone a repository, read through the module structure, build an understanding of how the pieces fit together, write notes, and propose refactoring strategies. The kind of work that takes a senior engineer a full day and an AI agent about six hours.

The key constraint: the agent couldn't do this all at once. Sometimes I'd kick off a session at 8pm, let it run until midnight, and pick it back up the next morning. The filesystem it had built during that first session — the analysis notes, the installed tools, the half-read source files — had to be there when the next session started. Rebuilding from scratch each time wasn't viable.

That constraint is what drove every provider decision I made.

The requirements I didn't know I had

At the start I thought I needed a Linux environment that could run Claude Managed Agents. By the end, I realized I actually needed three things. I found them all in one place, but not until I had looked in two others first.

A filesystem that survived between work sessions.
Near-zero cost while the agent was idle.
The ability to branch from an already-completed analysis state.

I did not discover all three requirements on day one. I discovered them one mistake at a time.

How a session actually starts: the code before the sandbox

You drive a session through the reference orchestrator using a simple command:

make session PROMPT="Clone the repository at github.com/tensorlakeai/tensorlake. \
  Read through the module structure. Write a summary to /workspace/analysis.md. \
  Note any components that look like they could be simplified."

The orchestrator sends this prompt to Anthropic as a new session. Anthropic picks it up, starts the agent loop, and immediately begins issuing tool calls. Those tool calls arrive at your sandbox. The agent reads files, runs bash commands, writes notes. The session runs until the task is complete or you stop it.

The agent stream looks roughly like this as it runs:

[thinking] The repository appears to be a Python SDK for...
[bash] git clone https://github.com/tensorlakeai/tensorlake
[bash] ls -la /workspace/tensorlake/
[read] /workspace/tensorlake/tensorlake/sandbox.py
[write] /workspace/analysis.md
[thinking] The Sandbox class handles...

Each bracketed event is a tool call going into your sandbox. The session accumulates state inside /workspace/ across all those calls. By the end of a six-hour session, that directory contains the cloned repo, installed packages, analysis files, and intermediate notes. That's the state that needs to survive overnight.

Build 1: Cloudflare

My first assumption was that I needed a platform that could efficiently run Claude Managed Agents. Cloudflare is optimized for high-concurrency execution. My problem turned out to be different.

The agent I was building accumulated hours of filesystem state between bursts of work. Notes, cloned repositories, installed dependencies, and intermediate analysis all needed to survive overnight. Cloudflare's execution model wasn't designed around that requirement. That was the first time I realized I wasn't looking for compute.

I was looking for persistent state.

Build 2: Daytona

The second build solved part of the problem. The agent could accumulate state throughout a session, which initially felt like progress.

Then I wanted to test three different refactoring strategies starting from the same six-hour analysis. Instead of branching from that state, I found myself repeating the setup work each time: rebuilding context, reinstalling dependencies, and re-running analysis before I could begin the actual experiment.

That was when I discovered my second requirement. Preserving state wasn't enough. I also needed a way to branch from an existing state without repeating hours of work.

Build 3: Tensorlake

The first thing that caught my attention was not a feature.

It was an architectural decision.

Most platforms preserve state by keeping compute alive.

This one treated compute and state as separate problems.

The docs described a suspended sandbox that could preserve its state and resume in approximately 0.6 seconds. That was the first time I saw a design that directly addressed the problem I'd been running into.

I wanted to know whether it actually worked.

I started with the problem that had sent me down this path in the first place. Could an agent suspend overnight and resume with its state intact?

It could.

And once I tested checkpointing and branching, I finally had both things I'd been looking for. That's when the architecture started to make sense.

How the webhook architecture works

The deployment model here is different from a traditional always-on server, and understanding it made everything else click.

The orchestrator itself runs inside a Tensorlake sandbox with a public HTTPS endpoint. Anthropic pushes incoming work to that endpoint via webhook. When there's no traffic, the orchestrator sandbox suspends. When a new work item arrives, it wakes in under a second, processes the request, and creates a worker sandbox for that session.

Two independent lifecycles:

The orchestrator sandbox suspends when idle, preserving its memory state including the running uvicorn process. It doesn't accumulate per-session filesystem state, so suspending it between work items costs only storage rates.

The worker sandboxes — one per session — accumulate filesystem state throughout a session and suspend when the session ends. Their state is preserved in storage, not held by running compute.

Neither one has to stay alive to preserve the other's state. On every other platform I'd tried, "preserve state" meant "keep something running." Here it means "checkpoint and stop billing."

How session routing works

When you call make session PROMPT="...", the orchestrator sends a new session to Anthropic. Anthropic validates it, adds it to the work queue, and pushes a work item payload to your orchestrator's webhook endpoint.

That payload contains three things the orchestrator needs: the session ID, the work ID, and the environment ID. The orchestrator wakes, reads the payload, and creates a worker sandbox from your registered image:

from tensorlake.sandbox import Sandbox

sandbox = Sandbox.create(
    name=session_id,
    image="agent-cli",
    cpus=2.0,
    memory_mb=4096,
    timeout_secs=3600,
)

sandbox.start_process(
    "bash",
    ["-lc", "exec python3 /opt/sandbox_entrypoint.py > /tmp/runner.log 2>&1"],
    env={
        "ANTHROPIC_ENVIRONMENT_KEY": environment_key,
        "ANTHROPIC_SESSION_ID": session_id,
        "ANTHROPIC_WORK_ID": work_id,
        "ANTHROPIC_ENVIRONMENT_ID": environment_id,
    },
)

Two things tripped me up here. Credentials belong in start_process(env={...}), not Sandbox.create(), because they're session-specific rather than image-level configuration. And sandbox names must be valid slugs. Neither was hard to fix once I knew what was happening, but both cost me time.
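For the slug problem, a small normalizer run before Sandbox.create() avoids the failure entirely. This is a sketch, not part of the Tensorlake SDK: `to_sandbox_name` is a hypothetical helper, and the rules it enforces (lowercase letters, digits, hyphens, at most 63 characters) are my assumption about what counts as a valid slug, so check them against the docs.

```python
import re

def to_sandbox_name(session_id: str, max_length: int = 63) -> str:
    """Turn an arbitrary session ID into a lowercase, hyphen-separated slug."""
    slug = session_id.lower()
    # Collapse every run of characters outside a-z / 0-9 into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Trim hyphens at the edges, cut to length, and trim again in case the cut left one.
    slug = slug.strip("-")[:max_length].rstrip("-")
    if not slug:
        raise ValueError(f"cannot derive a sandbox name from {session_id!r}")
    return slug

print(to_sandbox_name("Session_01ABC xyz"))
```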

The worker sandbox runs sandbox_entrypoint.py, which attaches to the Anthropic session and begins consuming tool calls. From that point, the agent has a full Linux environment: bash, git, Python, and whatever else you built into the image.

Building the agent image

Every tool the agent needs has to live in the image before the session starts:

from tensorlake import Image

image = (
    Image(name="agent-cli", base_image="tensorlake/ubuntu-minimal")
    .run("apt-get update && apt-get install -y ca-certificates curl git gh python3 python3-pip")
    .run("pip install --break-system-packages 'anthropic>=0.103' 'httpx>=0.27'")
    .copy("sandbox_entrypoint.py", "/opt/sandbox_entrypoint.py")
    .workdir("/workspace")
)
image.build(registered_name="agent-cli")

I added my analysis tools here: Python packages, jq, ripgrep. If it needed to run inside the sandbox, it lived in the image.

How suspend and resume actually work

When a session ends, the worker sandbox suspends. Not terminates. The process state and filesystem are checkpointed to storage. Compute billing stops. The sandbox sits there as a stored snapshot until the next session begins.

Setting RESUME_SUSPENDED_SESSIONS=true tells the orchestrator what to do when the next work item arrives for an existing session: restore the suspended sandbox instead of creating a fresh one from the base image.

I set the flag and expected the sandbox to wake up immediately.

Nothing happened.

For a few minutes I thought the integration was broken. It wasn't. The flag doesn't trigger a resume directly. A new incoming webhook does. The flag just tells the orchestrator which action to take when that webhook arrives. Once I understood that, the behavior made complete sense. Why would a sandbox wake up before there's work to do?

The practical difference between fresh and resumed is simple: a fresh sandbox starts with a clean /workspace/. A resumed sandbox starts with whatever the last session left there. For a research agent that had already spent hours building an understanding of a codebase, those are completely different starting points.

The session ran overnight, suspended, and resumed the next morning with its state intact. The filesystem was exactly where I had left it.
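The resume behavior is easier to see as a decision the orchestrator makes per work item. The sketch below uses an in-memory stand-in instead of the real provider: `FakeSandboxStore` and `handle_work_item` are hypothetical names, and the real reference orchestrator may be structured differently. The point it illustrates is that the flag only changes the branch taken when a work item arrives.

```python
from dataclasses import dataclass, field

@dataclass
class FakeSandboxStore:
    """In-memory stand-in for a sandbox provider. Not the Tensorlake SDK."""
    # Maps session ID to the files left in /workspace when the sandbox suspended.
    suspended: dict[str, list[str]] = field(default_factory=dict)

def handle_work_item(store: FakeSandboxStore, session_id: str, resume_enabled: bool):
    """Decide what to do when a webhook delivers work for a session."""
    if resume_enabled and session_id in store.suspended:
        # Restore the stored state: /workspace is exactly as the last session left it.
        return "resumed", store.suspended.pop(session_id)
    # No stored state (or resume disabled): start from the base image.
    return "created", []

store = FakeSandboxStore(suspended={"session-a": ["analysis.md", "tensorlake/"]})

# Nothing happens until a work item shows up; then the flag picks the branch.
print(handle_work_item(store, "session-a", resume_enabled=True))
print(handle_work_item(store, "session-b", resume_enabled=True))
```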

Checkpointing and parallel exploration

Resume solved the overnight persistence problem. Checkpointing solved the one I'd discovered the hard way during the Daytona build.

After the full analysis was complete, I created a checkpoint and launched three sandboxes from it:

snap = sandbox.checkpoint()

children = [
    Sandbox.create(
        snapshot_id=snap.snapshot_id,
        name=f"{session_id}-strategy-{i}",
        cpus=2.0,
        memory_mb=4096,
    )
    for i in range(3)
]

Each child started from exactly the same state as the parent. The cloned repository was already there. The installed dependencies were already there. The analysis notes were already there. Nothing needed to be rebuilt.

Instead of repeating six hours of analysis three separate times, I ran three parallel experiments from the same verified baseline. By the next morning I had three implementations to compare.

The fork I'd dismissed as an edge case turned out to be the feature that mattered most.

Without it, every experiment required repeating hours of work. With it, experimentation became nearly free. This generalizes beyond refactoring. Any time you want to test N variations (different prompts, different dependency versions, different approaches), you run the expensive setup once, checkpoint, and branch into as many parallel experiments as you need.

The cost of exploration drops to the cost of the experiments themselves.
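The arithmetic behind that claim is short enough to write down. In this sketch the six-hour setup comes from my own run, while the two-hour experiment length is a made-up figure used only to make the numbers concrete. `total_compute_hours` is a hypothetical helper.

```python
def total_compute_hours(setup_hours, experiment_hours, variations, *, fork_from_checkpoint):
    """Compute hours spent to run `variations` experiments on top of one expensive setup."""
    if fork_from_checkpoint:
        # Setup is paid once; each fork only pays for its own experiment.
        return setup_hours + variations * experiment_hours
    # Without forking, every variation rebuilds the setup before it can start.
    return variations * (setup_hours + experiment_hours)

SETUP_HOURS = 6.0       # the overnight analysis
EXPERIMENT_HOURS = 2.0  # hypothetical, not a measured figure

rebuild = total_compute_hours(SETUP_HOURS, EXPERIMENT_HOURS, 3, fork_from_checkpoint=False)
forked = total_compute_hours(SETUP_HOURS, EXPERIMENT_HOURS, 3, fork_from_checkpoint=True)
print(f"rebuild each time: {rebuild} h, fork from checkpoint: {forked} h")
```

The gap grows linearly with the number of variations, because the setup term is multiplied in one case and constant in the other.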

Why I chose Tensorlake, stated plainly

My agent needed two things: zero idle compute cost while preserving filesystem state overnight, and the ability to fork mid-session to explore refactoring strategies in parallel. Those two requirements narrowed the field to one option.

I didn't choose Tensorlake because it was the fastest or the cheapest. I chose it because it directly addressed the two problems I was trying to solve.

The suspend/resume workflow preserved state without keeping compute running. Checkpointing made it possible to branch from an existing analysis instead of repeating hours of setup work. I tested both and they worked exactly the way I needed them to.

The hard part wasn't choosing a provider. It was understanding my requirements. Once I understood those, the decision became obvious.

How to Pick the Right Sandbox

	Tensorlake	Daytona	Cloudflare Containers	Cloudflare Isolates
Idle cost	Suspended at storage rate	VM stays alive	Scales to zero after sleep timeout	No filesystem to preserve
Filesystem on resume	State preserved	Full sandbox creation from snapshot	Disk is wiped on restart	No persistent filesystem
Resume speed	~0.6s memory restore	Full sandbox creation	Full VM restart	Milliseconds
Mid-session fork	Yes: checkpoint() + N	Yes, experimental fork method in core SDK (not documented in Claude integration guide)	Not documented	No
Full Linux	Yes	Yes	Yes	No
Self-hosted option	No	Yes	No	No
Concurrent scale	Up to 100k	Not published	Not published	Tens of thousands
Setup effort	Moderate	High	One-click	One-click

Does your agent accumulate state that's expensive to rebuild between bursts?

Tensorlake. This is the exact problem suspend/resume was designed to solve. If your agent starts fresh every session, continue to the next question.

Do you need to explore multiple approaches from the same mid-session state?

Tensorlake. Checkpointing and branching from an existing state eliminates hours of repeated setup work. If not, continue.

Do you need massive concurrency?

Cloudflare isolates are designed for very large-scale concurrent execution.

Do you need self-hosted infrastructure?

Daytona is the strongest fit if infrastructure ownership or data residency is a hard requirement.

How important is setup speed?

Cloudflare offers the fastest path to a prototype. Tensorlake requires more initial setup but provides capabilities that become valuable once agents accumulate long-lived state.
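The questions above read naturally as a chain of checks. Here is a sketch that encodes them in the same order. `pick_sandbox` and its parameter names are hypothetical, and the ordering reflects my priorities, not a universal ranking.

```python
def pick_sandbox(
    *,
    expensive_state_between_bursts: bool,
    needs_mid_session_fork: bool,
    needs_massive_concurrency: bool,
    needs_self_hosting: bool,
) -> str:
    """Walk the questions in the order the article asks them."""
    if expensive_state_between_bursts or needs_mid_session_fork:
        return "Tensorlake"
    if needs_massive_concurrency:
        return "Cloudflare Isolates"
    if needs_self_hosting:
        return "Daytona"
    # Nothing special required: take the fastest path to a prototype.
    return "Cloudflare"

print(pick_sandbox(
    expensive_state_between_bursts=True,
    needs_mid_session_fork=True,
    needs_massive_concurrency=False,
    needs_self_hosting=False,
))
```

If self-hosting or data residency is a hard requirement for you, move that check to the top, since the comparison table lists Daytona as the only self-hosted option.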

Conclusion

When I started, I thought I was choosing a sandbox provider. What I was actually discovering were my requirements.

First I learned I needed persistent state. Then I learned I needed a way to branch from that state without repeating hours of work.

Tensorlake was the first platform I tried that treated compute and state as separate problems.

Once I understood that distinction, the decision became obvious. The hardest part of running long-lived agents isn't getting them to work. It's making sure their work survives.

References

Tensorlake: Run Claude Managed Agents on Tensorlake Sandboxes — https://docs.tensorlake.ai/sandboxes/claude-managed-agents
Tensorlake: Sandbox Images (Image builder SDK) — https://docs.tensorlake.ai/sandboxes/images
Daytona: Run Claude Managed Agents on Daytona — https://www.daytona.io/docs/en/guides/claude/claude-managed-agents/
Cloudflare: Announcing Claude Managed Agents on Cloudflare — https://blog.cloudflare.com/claude-managed-agents/
Anthropic: Claude Managed Agents Self-Hosted Sandboxes — https://platform.claude.com/docs/en/managed-agents/self-hosted-sandboxes
Concurrency sandbox benchmarks results: https://platform.computesdk.com/scale-invitational

[Boost]

Divy Yadav — Tue, 16 Jun 2026 10:57:41 +0000

Why Most Multi-Agent AI Systems Waste 90% of Their Time (And How to Fix It)

Divy Yadav — Mon, 15 Jun 2026 12:32:52 +0000

Most engineers treat multi-agent speed as a concurrency problem. It is not. The bottleneck is setup time, and memory snapshots change the math entirely.

Most engineers think multi-agent performance is a concurrency problem.

I did too.

So when five AI agents running in parallel barely outperformed a sequential run, I assumed something was wrong with my orchestration.

I was looking in the wrong place.

Each agent was spending more time preparing to work than actually working.

The fix wasn’t more threads, better async code, or a faster model.

It was a memory snapshot.

And once I saw where the time was really going, an entire class of multi-agent bottlenecks suddenly made sense.

Here is what that looks like, what took me three iterations to get right, and where it still has rough edges.

Let’s get the mental model first.

What This Does (30 Seconds)

The idea is straightforward: instead of five agents each spending 90 seconds installing the same tools, install them once, freeze that environment, and stamp out five identical copies.

Each copy runs a different analysis in parallel. A lead LLM reads all five results and tells you what to fix first.

In code:

Creates one Linux VM, installs code analysis tools (bandit, radon) and writes a sample Python project
Freezes the entire VM state into a memory snapshot (filesystem, memory, running processes included)
Forks 5 independent copies, each agent assigned a different analysis task (Security, Complexity, Docstrings, Tests, Structure)
Runs all 5 in parallel via asyncio.gather, finishing in seconds instead of minutes
Feeds all results to a lead LLM that produces a single prioritized fix list

Setup time is paid once, upfront, before any agent runs. The rest of this article explains how.
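Before touching any real sandbox, the shape of the pattern can be simulated with nothing but asyncio. Everything here is a stand-in: `build_snapshot` and `run_fork` are hypothetical functions, the sleeps replace real install and analysis time, and a plain dict plays the role of the snapshot.

```python
import asyncio

PERSPECTIVES = ["Security", "Complexity", "Docstrings", "Tests", "Structure"]
stats = {"installs": 0}

async def build_snapshot() -> dict:
    """Pay the slow setup once and return the prepared environment."""
    stats["installs"] += 1
    await asyncio.sleep(0.05)  # stands in for the tool install
    return {"tools": ("bandit", "radon")}

async def run_fork(snapshot: dict, perspective: str) -> str:
    """Each fork works on its own copy of the prepared environment."""
    env = dict(snapshot)
    await asyncio.sleep(0.01)  # stands in for the analysis itself
    return f"{perspective}: analyzed with {', '.join(env['tools'])}"

async def main() -> list:
    snapshot = await build_snapshot()
    # asyncio.gather returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(run_fork(snapshot, p) for p in PERSPECTIVES))

reports = asyncio.run(main())
for line in reports:
    print(line)
print(f"installs performed: {stats['installs']}")
```

Five reports come back, but the install counter stays at one. That is the whole trick; the real implementation swaps the dict for a VM snapshot.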

Why Sandboxes Matter for Agent Workloads

If you have not worked with sandboxes before: think of one as a disposable computer that lives in the cloud.

You spin it up, run whatever code you need inside it, and throw it away when you're done. It has its own filesystem, its own processes, its own network. Nothing it does can touch your machine or any other sandbox running at the same time.

In short: sandboxes give the agent a secure, isolated environment.

That isolation is the whole point. Your agent can install packages, write files, crash badly, or spin up a browser, and none of it bleeds out. When the task is done, you terminate the VM and it is gone.

The next agent starts clean.

Most agent frameworks treat the execution environment as an afterthought. The LLM call is the interesting part. The environment is just "wherever the code runs."

That works fine for single-turn tasks. It breaks down fast for anything multi-step.

When an agent needs to install packages, write intermediate files, maintain a browser session across multiple pages, or resume a task from a different machine, you need the execution environment to behave like a persistent object, not a function call that resets on every invocation.

Tensorlake gives each agent a MicroVM backed by Firecracker and Cloud Hypervisor, optimized for fast boot times and strong isolation. Each sandbox is a full Linux VM. It boots in hundreds of milliseconds, persists filesystem and memory state across sessions, and can be snapshotted at any point in its lifecycle.

Tensorlake also lets you spin up multiple sandboxes in parallel for concurrent agent execution, and honestly it is one of my favourite things about it.

It also ranks in the top 5 of SandboxBenchmarks.

What changes the math is a single question: what does the snapshot actually capture?

Two Kinds of Snapshots. Very Different Behavior.

Quick vocabulary before the details. Tensorlake sandboxes have four lifecycle modes.

An ephemeral sandbox runs a task and disappears when done, with no name and no persistence between runs.
A named sandbox outlives the process that created it and can be suspended then reconnected to from any machine. Suspend freezes the VM exactly as it is and resume brings it back to that same state.
A snapshot is that frozen moment saved as a reusable artifact.
A fork is a snapshot restored into a fresh, independent VM.

This project uses the last two.

Suspend and snapshot both preserve state, but they serve different purposes: suspend pauses this sandbox so you can resume it later, while a snapshot is a reusable artifact for retrying from a checkpoint or cloning an environment.

Tensorlake supports two checkpoint types. Most tutorials only mention one.

CheckpointType.FILESYSTEM captures disk state only. Restore from it and the new sandbox does a full cold boot: processes restart from scratch, packages get re-imported. Your pip installs survive. Nothing that was in memory does.
CheckpointType.MEMORY is different. It captures disk state, VM memory, and all running processes. The restored VM resumes mid-stride, exactly as the source was at checkpoint time. No boot sequence. No re-initialization. If Python had already imported bandit, the fork starts with it loaded. The environment is not rebuilt. It is copied.

The checkpoint type is not a performance detail. It determines whether your fork is a clone or a restart.

The default when you call sandbox.checkpoint() with no arguments is filesystem. That is the wrong choice for a parallel swarm where agents share a prepared environment. You want memory.

One more constraint worth knowing upfront: for memory snapshots, resources (CPUs, RAM) are baked into the snapshot at checkpoint time. You cannot override them when creating forks. Set the right cpus and memory_mb on the base sandbox before you checkpoint. Every fork inherits them automatically.
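A toy model makes the difference between the two checkpoint types concrete. None of this is the Tensorlake SDK: `ToyVM`, `ToyCheckpointType`, `checkpoint`, and `restore` are hypothetical stand-ins that only mimic the behavior described above, with a set of module names playing the role of memory.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from enum import Enum

class ToyCheckpointType(Enum):
    FILESYSTEM = "filesystem"
    MEMORY = "memory"

@dataclass
class ToyVM:
    cpus: float
    memory_mb: int
    disk: dict = field(default_factory=dict)           # path -> contents
    loaded_modules: set = field(default_factory=set)   # what currently lives in RAM

def checkpoint(vm: ToyVM, kind: ToyCheckpointType = ToyCheckpointType.FILESYSTEM) -> dict:
    """Capture disk always; capture memory only for MEMORY checkpoints."""
    snap = {"kind": kind, "cpus": vm.cpus, "memory_mb": vm.memory_mb, "disk": deepcopy(vm.disk)}
    if kind is ToyCheckpointType.MEMORY:
        snap["loaded_modules"] = set(vm.loaded_modules)
    return snap

def restore(snap: dict) -> ToyVM:
    """Create an independent VM; resources are inherited from the snapshot."""
    return ToyVM(
        cpus=snap["cpus"],
        memory_mb=snap["memory_mb"],
        disk=deepcopy(snap["disk"]),
        loaded_modules=set(snap.get("loaded_modules", ())),
    )

base = ToyVM(cpus=2.0, memory_mb=2048)
base.disk["site-packages/bandit"] = "installed"
base.loaded_modules.add("bandit")

cold_fork = restore(checkpoint(base))                          # default: filesystem
warm_fork = restore(checkpoint(base, ToyCheckpointType.MEMORY))

print("cold fork memory:", cold_fork.loaded_modules)
print("warm fork memory:", warm_fork.loaded_modules)
```

Both forks find bandit on disk. Only the warm fork starts with it already loaded, and both carry the base sandbox's CPU and memory settings.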

The Architecture

The pattern has five distinct phases. Each one has a single responsibility.

Phase 1 — Base Snapshot: Spins up a single baseline sandbox, installs analysis tools (bandit, radon), writes the target code, and checkpoints the entire running VM state using CheckpointType.MEMORY. The base sandbox is then terminated, leaving behind the reusable snapshot ID.

Phase 2 — Agent Forking: Restores 5 independent sandboxes concurrently from the base snapshot by calling AsyncSandbox.create(snapshot_id=...). Each fork is a warm start that inherits all installed tools, environment settings, and target files.

Phase 3 — Sequential Baseline (Timing): Runs each agent's analysis script (analyze.py) one-by-one inside its respective sandbox to measure sequential time as a benchmark denominator.

Phase 4 — Parallel Swarm: Executes all 5 agents concurrently using asyncio.gather(...). Each agent runs the same analysis script inside its isolated sandbox but with a different focus configuration passed via the PERSPECTIVE environment variable.

Phase 5 — LLM Aggregation: Collects the individual reports (Security, Complexity, Docstrings, Tests, Structure) alongside the timing data, and passes them to the lead LLM (GPT) to synthesize a single prioritized fix list.

Phase 1 runs once. Phases 2 through 4 run every time you want results. The fork is cheap. The base environment build is not, but you only pay that cost once per snapshot.
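The sequential run in Phase 3 exists to feed one small calculation. Here is a sketch of it, with timings that are placeholders and not measurements from this project. `speedup` and `parallel_efficiency` are hypothetical helper names.

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """How many times faster the parallel run finished than the sequential one."""
    if parallel_s <= 0:
        raise ValueError("parallel time must be positive")
    return sequential_s / parallel_s

def parallel_efficiency(sequential_s: float, parallel_s: float, agents: int) -> float:
    """Speedup divided by agent count; 1.0 would mean perfect scaling."""
    return speedup(sequential_s, parallel_s) / agents

# Placeholder timings for five agents, in seconds.
sequential_s, parallel_s, agents = 10.0, 2.5, 5
print(f"speedup: {speedup(sequential_s, parallel_s):.1f}x")
print(f"efficiency: {parallel_efficiency(sequential_s, parallel_s, agents):.0%}")
```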

Phase 1: Build and Snapshot

The base sandbox installs the analysis tools, writes the target codebase into the VM, then snapshots the entire state. Every fork inherits both the tools and the target project automatically.

from tensorlake.sandbox import AsyncSandbox, CheckpointType

async def build_base_snapshot() -> str:
    async with await AsyncSandbox.create(
        name="base-swarm-env",
        cpus=2.0,
        memory_mb=2048,
        timeout_secs=600,
    ) as sandbox:

        # Install analysis tools. These are baked into the snapshot
        # and available to every forked agent at no extra install cost.
        result = await sandbox.run(
            "pip",
            ["install", "bandit", "radon", "--user", "--break-system-packages", "-q"],
            timeout=180,
        )
        if result.exit_code != 0:
            raise RuntimeError(f"pip install failed:\n{result.stderr}")

        # Write a sample Python project with intentional issues for agents to find.
        # All forks inherit this from the snapshot; no need to write per-agent.
        target_files = {
            "/workspace/target/auth.py": b'''
import subprocess
DB_PASSWORD = "hardcoded_secret_123"

def authenticate(user_input):
    return eval(user_input)

def run_command(cmd):
    return subprocess.call(cmd, shell=True)
''',
            "/workspace/target/logic.py": b'''
def classify(a, b, c, d, e, f, g, h):
    if a and b:
        if c or d:
            if not e and f:
                return "path_a"
            elif e and not f:
                return "path_b"
            elif g and h:
                return "path_c"
            else:
                return "path_d"
        elif g:
            return "path_e"
    return "path_f"
''',
        }
        for path, content in target_files.items():
            parent = "/".join(path.split("/")[:-1])
            await sandbox.run("mkdir", ["-p", parent])
            await sandbox.write_file(path, content)

        # Verify tools work before snapshotting.
        # A broken tool in the snapshot means broken forks.
        verify = await sandbox.run(
            "python3", ["-m", "bandit", "--version"]
        )
        if verify.exit_code != 0:
            raise RuntimeError(f"Tool verification failed:\n{verify.stderr}")

        snapshot = await sandbox.checkpoint(
            checkpoint_type=CheckpointType.MEMORY
        )

    # Context manager terminates the base sandbox here.
    if snapshot.status.value != "completed":
        raise RuntimeError(f"Snapshot failed: {snapshot.status.value}")

    return snapshot.snapshot_id

The async with pattern guarantees terminate() is called on exit, including on exceptions. Without it, any exception before a manual terminate() call leaves an orphaned VM running in the background. Tensorlake's async documentation shows this pattern explicitly.

result.exit_code comes from CommandResult, the SDK's return type for run(). It has stdout: str, stderr: str, and exit_code: int. Note that stdout is already a string, not bytes, so no .decode() is needed anywhere.

The status check after checkpoint(): SnapshotStatus is an enum, so .value gives you "completed", "in_progress", or "failed". The documentation shows checkpoint() returns a SnapshotInfo with a status field. Checking that status before proceeding is a useful defensive practice. I learned this after a failed snapshot left me debugging downstream agent failures.

Phase 2: Fork and Run an Agent

This is the actual fork. The call is AsyncSandbox.create(snapshot_id=snapshot_id). No special fork() method. No copy-on-write API. Just create() with a snapshot ID. Every call produces a fully independent VM starting from that snapshot's frozen state.

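import json
import time

from tensorlake.sandbox import AsyncSandbox

# AgentReport (the result type) and ANALYSIS_SCRIPT (the dispatch script source)
# are not shown in this excerpt.
PERSPECTIVES = ["Security", "Complexity", "Docstrings", "Tests", "Structure"]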

async def run_agent(agent_id: int, snapshot_id: str) -> AgentReport:
    perspective = PERSPECTIVES[agent_id % len(PERSPECTIVES)]
    t_start = time.time()

    # cpus and memory_mb intentionally omitted.
    # For MEMORY snapshots, resources are inherited from the snapshot
    # and cannot be overridden at restore time.
    async with await AsyncSandbox.create(
        snapshot_id=snapshot_id,
        allow_internet_access=False,  # code analysis is offline; no outbound needed
        timeout_secs=120,
    ) as sandbox:

        await sandbox.write_file(
            "/workspace/analyze.py",
            ANALYSIS_SCRIPT.encode("utf-8")
        )

        result = await sandbox.run(
            "python3",
            ["/workspace/analyze.py"],
            env={"PERSPECTIVE": perspective},
            timeout=60,
        )

    elapsed = time.time() - t_start

    if result.exit_code != 0:
        raise RuntimeError(f"Agent {agent_id} failed:\n{result.stderr}")

    output = json.loads(result.stdout.strip())
    return AgentReport(
        agent_id=agent_id,
        perspective=perspective,
        score=output["score"],
        finding=output["finding"],
        execution_time_s=elapsed,
    )

allow_internet_access=False is safe here because bandit and radon analyze source code and do not make network calls. This parameter is not locked by MEMORY snapshots. TensorLake's networking documentation recommends disabling outbound internet access for untrusted code.

The dispatch script gets written fresh into each forked VM via sandbox.write_file(). Each agent's VM is fully isolated: writing to /workspace/analyze.py in fork 0 has no effect on fork 1. The target project files are already there, inherited from the snapshot.

Since result.stdout is already a Python string, json.loads(result.stdout.strip()) works directly. The .strip() handles the trailing newline from print() inside the sandbox.

Phase 3: Sequential First, Then Parallel

The sequential baseline exists for one reason: to give the speedup calculation a real denominator. Without it, you have a time with no context.

async def run_sequential(snapshot_id: str, count: int) -> SwarmResult:
    reports = []
    for i in range(count):
        reports.append(await run_agent(i, snapshot_id))
    return SwarmResult(mode="sequential", ...)

async def run_parallel(snapshot_id: str, count: int) -> SwarmResult:
    # asyncio.gather returns a list of results when awaited.
    reports = await asyncio.gather(
        *(run_agent(i, snapshot_id) for i in range(count))
    )
    reports.sort(key=lambda r: r.agent_id)
    return SwarmResult(mode="parallel", ...)

asyncio.gather is what TensorLake's async documentation recommends for concurrent sandbox fan-out. The ThreadPoolExecutor approach works too (the sync Sandbox API supports it), but if you are already in an async context, gather is cleaner.
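To see the difference between the two dispatch modes without provisioning any sandboxes, here is a minimal sketch that swaps run_agent() for a stand-in coroutine. fake_agent and its sleep durations are assumptions for illustration only; nothing here touches the TensorLake SDK.

```python
import asyncio
import time


async def fake_agent(agent_id: int, work_s: float) -> int:
    """Hypothetical stand-in for run_agent(): sleeps instead of doing real work."""
    await asyncio.sleep(work_s)
    return agent_id


async def run_sequential(count: int, work_s: float) -> tuple[list[int], float]:
    start = time.perf_counter()
    # Each agent starts only after the previous one has returned.
    results = [await fake_agent(i, work_s) for i in range(count)]
    return results, time.perf_counter() - start


async def run_parallel(count: int, work_s: float) -> tuple[list[int], float]:
    start = time.perf_counter()
    # gather schedules every coroutine at once and returns results
    # in the order the awaitables were passed, not in finish order.
    results = await asyncio.gather(*(fake_agent(i, work_s) for i in range(count)))
    return list(results), time.perf_counter() - start


async def main() -> dict:
    seq_results, seq_time = await run_sequential(5, 0.05)
    par_results, par_time = await run_parallel(5, 0.05)
    return {
        "sequential_results": seq_results,
        "parallel_results": par_results,
        "sequential_s": seq_time,
        "parallel_s": par_time,
    }


summary = asyncio.run(main())
print(f"sequential: {summary['sequential_s']:.2f}s, parallel: {summary['parallel_s']:.2f}s")
```

Because gather returns results in input order, the sort in run_parallel acts as a safety net rather than a requirement.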

Phase 4: What the Analysis Script Does

The dispatch script runs inside each forked sandbox. It reads the PERSPECTIVE environment variable, routes to the right analysis function, and prints one JSON line to stdout. All five analyses are fully offline, with no network calls needed.

# ANALYSIS_SCRIPT — runs INSIDE each forked sandbox
import json, os, subprocess, ast, pathlib, sys

PERSPECTIVE = os.environ["PERSPECTIVE"]
TARGET = "/workspace/target"

def run_security():
    """bandit: find hardcoded secrets, unsafe eval, shell injection."""
    r = subprocess.run(
        ["python3", "-m", "bandit", "-r", TARGET, "-f", "json", "-q"],
        capture_output=True, text=True
    )
    try:
        data = json.loads(r.stdout)
    except json.JSONDecodeError:
        return {"score": 0, "finding": "bandit parse error"}
    issues = data.get("results", [])
    high = [i for i in issues if i.get("issue_severity") == "HIGH"]
    return {
        "issues": len(issues), "high": len(high),
        "score": max(0, 100 - len(issues) * 10),
        "finding": high[0]["issue_text"] if high else ("Minor issues" if issues else "Clean"),
    }

def run_complexity():
    """radon: cyclomatic complexity per function."""
    r = subprocess.run(
        ["python3", "-m", "radon", "cc", TARGET, "-j"],
        capture_output=True, text=True
    )
    try:
        data = json.loads(r.stdout)
    except json.JSONDecodeError:
        return {"score": 0, "finding": "radon parse error"}
    blocks = [b for file_blocks in data.values() for b in file_blocks]
    complex_blocks = [b for b in blocks if b.get("complexity", 0) > 5]
    avg = sum(b["complexity"] for b in blocks) / len(blocks) if blocks else 0
    top = f"{complex_blocks[0]['name']} (cc={complex_blocks[0]['complexity']})" if complex_blocks else "All within threshold"
    return {
        "functions": len(blocks), "complex_count": len(complex_blocks),
        "avg_cc": round(avg, 2),
        "score": max(0, 100 - len(complex_blocks) * 15),
        "finding": top,
    }

def run_docstrings():
    """ast: count functions and classes that lack docstrings."""
    total, documented = 0, 0
    for path in pathlib.Path(TARGET).rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                total += 1
                if ast.get_docstring(node):
                    documented += 1
    pct = int(documented / total * 100) if total else 100
    return {"total": total, "documented": documented, "score": pct,
            "finding": f"{documented}/{total} documented ({pct}%)"}

def run_tests():
    """Count test files relative to source files."""
    all_py = list(pathlib.Path(TARGET).rglob("*.py"))
    test_files = [f for f in all_py if f.stem.startswith("test_") or f.stem.endswith("_test")]
    ratio = len(test_files) / len(all_py) * 100 if all_py else 0
    return {
        "source_files": len(all_py), "test_files": len(test_files),
        "score": min(100, int(ratio * 2)),
        "finding": f"{len(test_files)}/{len(all_py)} files are tests ({ratio:.0f}%)",
    }

def run_structure():
    """ast: count functions, classes, imports across the codebase."""
    stats = {"functions": 0, "classes": 0, "imports": 0, "files": 0}
    for path in pathlib.Path(TARGET).rglob("*.py"):
        stats["files"] += 1
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)): stats["functions"] += 1
            elif isinstance(node, ast.ClassDef):           stats["classes"] += 1
            elif isinstance(node, (ast.Import, ast.ImportFrom)): stats["imports"] += 1
    fpr = stats["functions"] / stats["files"] if stats["files"] else 0
    return {**stats, "functions_per_file": round(fpr, 1),
            "score": min(100, int(fpr * 20)),
            "finding": f"{stats['functions']} functions across {stats['files']} files"}

dispatch = {
    "Security":   run_security,
    "Complexity": run_complexity,
    "Docstrings": run_docstrings,
    "Tests":      run_tests,
    "Structure":  run_structure,
}

fn = dispatch.get(PERSPECTIVE)

if fn is None:
    print(json.dumps({"error": f"Unknown perspective: {PERSPECTIVE}"}))
    sys.exit(1)

result = fn()
result["perspective"] = PERSPECTIVE
print(json.dumps(result))

Two things worth keeping when you adapt this.

Parameters via environment variables: sandbox.run(env={"KEY": "val"}) passes per-command variables and avoids shell escaping issues when values contain spaces or special characters. It also keeps the dispatch script stateless, with no hardcoded perspective names inside the script itself.

JSON to stdout: the orchestrator reads result.stdout.strip() and passes it directly to json.loads(). The script has one job: print exactly one valid JSON line. Any other stdout output (debug prints, progress bars) breaks the parse. Keep it strict.
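If you cannot guarantee that the script stays silent apart from its result, the orchestrator side can be made a little more defensive. The sketch below is an assumption about how you might do that, not part of the demo: parse_agent_stdout is a hypothetical helper that reads only the last non-empty line and fails loudly when that line is not a JSON object.

```python
import json


def parse_agent_stdout(stdout: str) -> dict:
    """Parse the final non-empty stdout line as a JSON object.

    Hypothetical helper: tolerates stray lines printed before the result,
    but raises ValueError when the last line is not a JSON object.
    """
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("agent produced no output")
    try:
        payload = json.loads(lines[-1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"last stdout line is not JSON: {lines[-1]!r}") from exc
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    return payload


clean = parse_agent_stdout('{"score": 90, "finding": "Clean"}\n')
noisy = parse_agent_stdout('installing...\n{"score": 70, "finding": "Minor issues"}\n')
print(clean["score"], noisy["finding"])
```

This only helps when the stray output comes before the result line. Keeping the script strict is still the simpler contract.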

Phase 5: Lead Agent Synthesis

After all five agents return, a single GPT-4o call synthesizes their findings into a prioritized action list.

def aggregate_with_llm(parallel: SwarmResult, sequential: SwarmResult) -> str:
    client = OpenAI()
    speedup = sequential.total_time_s / parallel.total_time_s

    reports_block = "\n".join(
        f"[{r.perspective}] Score: {r.score}/100 | {r.finding}"
        for r in parallel.reports
    )

    prompt = (
        "You are a senior engineering lead reviewing a parallel code analysis report.\n\n"
        f"Agent Findings:\n{reports_block}\n\n"
        "Benchmark:\n"
        f"  Sequential : {sequential.total_time_s:.2f}s\n"
        f"  Parallel   : {parallel.total_time_s:.2f}s\n"
        f"  Speedup    : {speedup:.2f}x\n\n"
        "Provide: overall codebase health score, top three issues to fix immediately "
        "(with file and severity), recommended next actions, and one sentence on what "
        "the parallel speedup means for running this at scale."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

The lead agent sees both the analysis findings and the timing benchmark in the same context. That is the reduce step in a map-reduce agent pattern: give the aggregator everything the workers produced, not just the domain data. The call is synchronous because there is nothing left to concurrently await at this point.
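The synthesis call is also the natural place for retry logic. Here is a minimal sketch of exponential backoff around any zero-argument callable. call_with_backoff, its parameters, and the simulated failure are all assumptions for illustration; with a real client you would narrow retry_on to that client's rate-limit exception.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on retry_on with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2x, 4x, ...) plus up to 10% jitter.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


# Simulated call that fails twice before succeeding (hypothetical).
attempts = {"count": 0}
delays = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


outcome = call_with_backoff(
    flaky_call, base_delay_s=0.5, retry_on=(TimeoutError,), sleep=delays.append
)
print(outcome, attempts["count"], len(delays))
```

Passing sleep as a parameter keeps the helper testable without real waiting.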

Where the Time Actually Goes

Both timelines contain the same agents doing the same work. What changes is when setup happens. These numbers are structural projections based on typical pip install times and sandbox warm-restore behavior, not measured results. Your numbers will vary by workload and network conditions. Run the demo to measure your case.

Without memory snapshots:

Agent 0: [setup ~90s][work ~8s]
Agent 1: [setup ~90s][work ~9s]
Agent 2: [setup ~90s][work ~8s]
Agent 3: [setup ~90s][work ~9s]
Agent 4: [setup ~90s][work ~8s]

Sequential total: ~490s
Parallel total:   ~100s  (setup still paid by each fork separately)

With memory snapshots (MEMORY type):

Base build:  [setup ~90s][checkpoint ~3s]  ← paid once, outside the loop
Agent 0: [warm fork ~1s][work ~8s]
Agent 1: [warm fork ~1s][work ~9s]
Agent 2: [warm fork ~1s][work ~8s]
Agent 3: [warm fork ~1s][work ~9s]
Agent 4: [warm fork ~1s][work ~8s]

Sequential total: ~48s
Parallel total:   ~10s

The speedup ratio looks similar on paper. The absolute time is not. At five agents the gap is 450 seconds versus 5 seconds of overhead. At fifty agents it is 4,500 seconds versus 50 seconds.

Setup time does not scale down with parallelism. Every agent still pays it in full, so the total multiplies with agent count. The snapshot moves it outside the loop entirely.
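The projected totals in the two timelines follow from simple arithmetic. The sketch below reproduces them under the same assumptions as the timelines (90s setup, 1s warm fork, 8 to 9s of work per agent); projected_totals is a hypothetical helper, and the one-time base build is left out of the warm numbers, as it is in the timeline.

```python
def projected_totals(setup_s: float, work_s: list) -> tuple:
    """Return (sequential, parallel) wall-clock totals when every agent pays setup_s.

    Sequential sums each agent's setup and work. Parallel is bounded by the
    slowest agent, because all agents start at the same time.
    """
    sequential = sum(setup_s + w for w in work_s)
    parallel = max(setup_s + w for w in work_s)
    return sequential, parallel


work = [8, 9, 8, 9, 8]  # per-agent work times taken from the timelines

cold = projected_totals(setup_s=90, work_s=work)  # every agent runs its own setup
warm = projected_totals(setup_s=1, work_s=work)   # every agent restores the snapshot

print(cold)  # (492, 99)
print(warm)  # (47, 10)
print(len(work) * 90, len(work) * 1)  # setup overhead: 450 vs 5 seconds
```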

The benchmark captures four numbers: sequential total time (the denominator), parallel total time (wall-clock from first fork to last return), speedup (sequential divided by parallel), and efficiency (speedup divided by agent count, multiplied by 100).

Efficiency is the one most benchmarks skip. A 4.2x speedup across five agents is 84% parallel efficiency: 16% is lost to fork startup, scheduling, and I/O contention. That number matters when you scale from five agents to fifty.
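The two derived numbers are easy to compute from the two measured ones. A minimal sketch, where benchmark_metrics is a hypothetical helper and the inputs are illustrative values chosen to reproduce the 4.2x example, not measured results:

```python
def benchmark_metrics(sequential_s: float, parallel_s: float, agent_count: int) -> dict:
    """Derive speedup and parallel efficiency from two wall-clock totals."""
    if parallel_s <= 0 or agent_count <= 0:
        raise ValueError("parallel_s and agent_count must be positive")
    speedup = sequential_s / parallel_s
    # Perfect scaling would give speedup == agent_count, i.e. 100% efficiency.
    efficiency_pct = speedup / agent_count * 100
    return {
        "speedup": round(speedup, 2),
        "efficiency_pct": round(efficiency_pct, 1),
    }


metrics = benchmark_metrics(sequential_s=42.0, parallel_s=10.0, agent_count=5)
print(metrics)  # {'speedup': 4.2, 'efficiency_pct': 84.0}
```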

What the Code Does Not Handle

The demo covers the happy path. Three things to add before production:

LLM rate limits. Twenty or thirty concurrent agents all hitting the OpenAI API will trigger rate limit errors. The demo has no retry logic. Add exponential backoff before you scale.
Snapshot storage. Snapshots may incur charges depending on your plan. Use Sandbox.delete_snapshot(snapshot_id) when done. The demo has a CLEANUP_SNAPSHOT_ON_EXIT flag at the top of the file.
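Agent error isolation. If one run_agent() coroutine raises inside asyncio.gather, the exception propagates to the caller and the whole batch fails. In production, pass return_exceptions=True to gather, or wrap each coroutine with asyncio.create_task() and await each task in its own try/except, so errors are handled per-agent.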
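A minimal sketch of per-agent error handling, here using gather's return_exceptions=True flag so that one failure does not discard the other agents' results. flaky_agent is a hypothetical stand-in for run_agent(), and which agent fails is an assumption for the demo.

```python
import asyncio


async def flaky_agent(agent_id: int) -> str:
    """Hypothetical stand-in for run_agent(): agent 2 always fails."""
    await asyncio.sleep(0.01)
    if agent_id == 2:
        raise RuntimeError(f"Agent {agent_id} failed")
    return f"report-{agent_id}"


async def run_isolated(count: int) -> tuple:
    # With return_exceptions=True, gather returns exceptions as values
    # in the result list instead of raising the first one to the caller.
    outcomes = await asyncio.gather(
        *(flaky_agent(i) for i in range(count)),
        return_exceptions=True,
    )
    reports, failures = {}, {}
    for agent_id, outcome in enumerate(outcomes):
        if isinstance(outcome, BaseException):
            failures[agent_id] = outcome
        else:
            reports[agent_id] = outcome
    return reports, failures


reports, failures = asyncio.run(run_isolated(5))
print(sorted(reports), list(failures))
```

The failures mapping can then feed a retry pass or be reported alongside the successful results.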

When to Use This Pattern (And When Not To)

Use it when:

Multiple agents need the same environment
Their tasks are independent (no inter-agent communication mid-run)
Setup time is a meaningful fraction of total runtime
Reproducibility matters: every fork starts from an identical state

Skip it when:

Agents need to share state during execution. Forks are fully isolated. If agent 2 needs to react to what agent 1 found, use shared storage or message queues instead.
The task is fast enough for a single agent. Forking five sandboxes for a 3-second job adds overhead, not speed.
Environment setup takes under 5 seconds. The snapshot overhead only pays off when setup is the actual bottleneck.

Your situation	Right choice
Multiple agents, shared dependencies, independent outputs	Memory snapshot, fork N copies
Single agent, long task, needs to pause and resume	Named sandbox with suspend/resume
Pure browser automation, no code execution	Stagehand or BrowserBase
Stateless task, resets every run	Ephemeral sandbox, no snapshot needed
Environment setup under 5 seconds	Filesystem snapshot or skip snapshots

On filesystem performance: Tensorlake publishes performance benchmarks on their GitHub comparing sandbox execution times across providers. Refer to their repository for current numbers.

Running This

pip install tensorlake openai
export TENSORLAKE_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
python3 agent.py

Free tier at cloud.tensorlake.ai, no credit card required. The demo takes 3-5 minutes end to end. After it runs, benchmark_results.json has the full per-agent timing data.

Phase 1 (base build and snapshot) runs once. If you want to run the benchmark multiple times, pass your existing snapshot ID directly and skip Phase 1. The snapshot persists between runs until you delete it.

What Actually Took Three Iterations

The first version had plain await sandbox.terminate() at the end of each function. Two exceptions during testing left sandboxes running and billing for idle compute. Switched to async with await AsyncSandbox.create(...) as sandbox: and that stopped.

The second version called sandbox.checkpoint(sandbox.sandbox_id). I had copied the pattern from a CLI reference (tl sbx checkpoint <sandbox-id>) and assumed the Python SDK matched. It does not. The Python instance method takes no positional arguments: sandbox.checkpoint(checkpoint_type=CheckpointType.MEMORY). That is it.

The third version was the first one that ran end to end, but with CheckpointType.FILESYSTEM by default because I had not read the snapshots documentation carefully. The benchmark looked reasonable. The forks were doing full cold boots and I was measuring them alongside the actual work. Switching to CheckpointType.MEMORY was the change that made setup time disappear from per-fork timing.

Small mistakes individually. What they share: Tensorlake's API is well documented, but the snapshot docs, the SDK reference, and the async docs are three separate pages. Read only the quickstart and you miss two of the three things that matter most for this pattern.

The complete project is also available on my GitHub.

The Thing That Changes

Running the same five agents sequentially and then in parallel is one of those moments where the architecture becomes legible in a way that documentation does not fully convey.

The snapshot moves setup cost from inside the loop to outside it. The agents still do the same work on the same hardware. The savings come from not rebuilding an environment five times when it only needed to be built once.

Most multi-agent optimization advice focuses on LLM calls: batching, caching, cheaper models. That advice is right. But if you have five agents each spending 90 seconds on pip installs before making a single inference call, no amount of LLM optimization helps until you address setup time first.

The bottleneck was never the agents. It was rebuilding the same environment on every run. Snapshot it once, fork cheaply, and parallel execution finally delivers what you expected when you first wrote asyncio.gather.


MCP Is Dead. The Downloads Just Don't Know It Yet.

Divy Yadav — Fri, 05 Jun 2026 12:54:22 +0000

30 CVEs in 60 days, a maintenance tax nobody warned you about, and what engineers are quietly switching to.

Your AI agent ran a query on a fake database last month.

It got real results. The tool worked perfectly. Your SSH keys left in the background.

The agent didn't flag it. The registry didn't catch it. Nobody warned you.

That's not a hypothetical. That's MCP in 2026, with 97 million monthly downloads and a Linux Foundation home.

The hype was real. So are the cracks.

First: what is MCP, and why should you care

If you've never built AI agents before, this matters. Skip it if you have.

Say you're building an AI assistant that needs to do real work:

Look up customer records in a database
Create tickets in Jira
Send a Slack message
Pull a file from Google Drive

Each of those lives in a different system. Different API, different auth, different data format.

To connect your AI to all of them, you'd write a custom integration for each one. Fine for two tools. Painful for ten. Then you switch models and rewrite everything.

This is the N×M problem: N tools multiplied by M AI models equals a mountain of glue code nobody wants to maintain.

MCP — the Model Context Protocol — solves that. Released by Anthropic in November 2024, it's an open standard that gives AI models one universal way to talk to external tools. You build an MCP server once around a tool, and any MCP-compatible AI can use it.

Your agent  →  MCP Client  →  MCP Server  →  Real Tool (Slack, Postgres, GitHub)

Three pieces:

MCP Host: your app (Claude Desktop, VS Code, a custom agent)
MCP Client: the component inside your app that speaks MCP, discovers tools, calls them
MCP Server: a small process wrapping a real tool, exposing it in a format any MCP client can use

That's it. The N×M problem disappears. One integration per tool, works with every AI.

The pitch was real. The adoption proved it.

How MCP went from zero to everywhere in 16 months

The adoption happened fast. Unusually fast.

Nov 2024 — Anthropic launches MCP. ~2M monthly SDK downloads.
Apr 2025 — OpenAI adopts it. Downloads jump to 22M.
Jul 2025 — Microsoft integrates it into Copilot Studio. 45M.
Nov 2025 — AWS adds support. 68M.
Mar 2026 — Every major AI vendor on board. 97M downloads. 10,000+ public MCP servers.

In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded with OpenAI and Block. It stopped being Anthropic's protocol and became the industry's.

The "USB-C for AI" comparison spread everywhere. Everyone plugged in.

And then engineers started running it in production.

Why developers hate MCP

The complaints have been building in forums, GitHub issues, and private Slack channels for months. Not from people who misunderstood MCP. From people who ran it in production and got surprised by the same things.

"It's unauthenticated by default."

Out of the box, an MCP server trusts whatever connects to it.

No built-in check that the server is who it claims. No built-in check that the client is who it claims. You're responsible for adding that layer.

Most tutorials don't mention this.

"The STDIO transport executes arbitrary OS commands."

The official MCP STDIO transport runs any OS command you point at it to launch a server. Even when the server startup fails. No sanitization warnings. Nothing in the developer toolchain flags it.

OX Security documented this in April 2026.

Anthropic's response: expected behavior, sanitization is the developer's responsibility. LangChain said the same. Microsoft said the same.

Three major vendors. Same answer: your problem.

"The spec moves, community servers don't."

MCP servers published in community registries frequently fall behind spec updates. A server that worked last month may behave differently after the protocol updates. The registry has no enforcement mechanism. You find out in production.

"Every REST API I already have needs a new wrapper process."

Adding MCP to a tool that already has a clean REST API means building an entire MCP server around it.

That server needs to be:

Deployed and monitored
Updated when the underlying API changes
Secured separately from both the agent and the tool

For ten existing APIs, that's ten new processes to own. Month three of production, you feel every one of them.

"The registries are basically npm circa 2015."

In early 2026, OX Security cloned mcp-server-postgres and named it mcp-server-postgress (extra 's'). Functionally identical. Same queries, same responses.

Hidden inside: a payload that silently pulled SSH keys and environment files to an outside server.

They submitted it to eleven major MCP registries.

Nine published it. No automated security review. No source code analysis. Nothing.
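A name check would not replace source review, but it is cheap to run before installing anything from a registry. Here is a rough sketch using the standard library's difflib: it flags names that are close to, but not exactly, a name you already trust. The allowlist contents other than mcp-server-postgres, the lookalike_of helper, and the 0.9 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical allowlist of server names you have already vetted.
TRUSTED_SERVERS = {"mcp-server-postgres", "mcp-server-github", "mcp-server-slack"}


def lookalike_of(name: str, trusted: set, threshold: float = 0.9) -> Optional[str]:
    """Return the trusted name that `name` closely resembles, or None.

    Exact matches are not flagged; only near-misses are treated as suspicious.
    """
    if name in trusted:
        return None
    for candidate in sorted(trusted):
        if SequenceMatcher(None, name, candidate).ratio() >= threshold:
            return candidate
    return None


print(lookalike_of("mcp-server-postgress", TRUSTED_SERVERS))  # mcp-server-postgres
print(lookalike_of("mcp-server-postgres", TRUSTED_SERVERS))   # None
```

This catches only the typosquatting class of attack. A malicious server with an unrelated name passes straight through, so it complements code review rather than replacing it.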

"It eats my context window before the user says anything."

When your MCP client connects to a server, it loads the full tool schema into your context window — names, descriptions, parameters for every tool.

One server, five tools: ~500 tokens gone before the first message
Ten servers: 2,000–3,000 tokens gone before the first message

The model is already reasoning over a smaller budget. Before your user typed a word.

These aren't edge cases. They're the standard experience for anyone who's moved past a local demo.

The 3 problems that actually break production systems

The trust problem

Your agent has no way to verify an MCP server is who it claims to be.

The OX Security incident made this real: nine out of eleven registries accepted a typosquatted credential-stealing package. The malicious server functioned correctly. Ran database queries. Returned results. And silently pulled your SSH keys in the background. Nothing in the protocol flagged it.

Since January 2026, researchers have filed 30 CVEs against the MCP ecosystem in 60 days. Prompt injection through tool descriptions. Credential theft via config file reads. "Tool poisoning," where a server description manipulates the agent's next decision. These aren't exotic attack vectors.

Your agent can't tell your Postgres server from an attacker's. That's not a code bug. It's a design gap.

MCP was built for a trusted local environment. Production isn't that.

The wrapper tax

Every tool you connect to MCP needs its own MCP server. Ten tools means ten additional processes to own.

Each one needs to:

Stay in sync when the underlying tool's API changes
Be monitored for failures in production
Be secured separately from the agent and the tool itself
Be deployed as part of your infrastructure

For the first two tools, manageable. Month three with fifteen integrations, it's a job.

The N×M problem is solved. The "N new processes" problem quietly replaced it.

The context window bill

Tool schemas are not free. They're tokens. And they arrive before your user's message does.

A team building a customer service agent connected to ten MCP servers found their available reasoning budget had shrunk by 30% before the first user question arrived. Same model. Same prompts. Just more tools.

In a long multi-step agent session, schema tokens compound. Quality drifts. Costs climb. Most teams don't trace this back to tool schema overhead until they look at what's actually in the context window.

What engineers are using instead

A few patterns have emerged for teams that ran into the problems above.

Direct REST API calls

For tools with a clean existing API, skip MCP entirely. Call the API directly from your agent. No new server to maintain, no schema overhead, existing auth covers it.

Works well when you control the tool and the API is stable. Doesn't scale when you need multiple AI systems to share the same integrations.

Native provider tool use

Anthropic and OpenAI both have built-in tool calling that needs no MCP infrastructure. You define the tool schema inline, pass it with the request, the model calls it.

No server process
No registry
Your auth sits directly on the call

Most teams running focused single-purpose agents in 2026 are doing this. Simpler to reason about, harder to share across systems.
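For a sense of how little is involved, here is a sketch of an inline tool definition and the local dispatch that goes with it. The schema follows the shape of OpenAI's Chat Completions tools parameter; lookup_customer, its return value, and the dispatch helper are hypothetical, and no API call is made here.

```python
import json


def lookup_customer(customer_id: str) -> dict:
    """Hypothetical tool: a real one would query your own database."""
    return {"customer_id": customer_id, "plan": "starter"}


# Inline schema, passed with the request instead of served by a separate process.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_customer",
            "description": "Fetch a customer record by ID.",
            "parameters": {
                "type": "object",
                "properties": {"customer_id": {"type": "string"}},
                "required": ["customer_id"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
HANDLERS = {"lookup_customer": lookup_customer}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the requested tool and return its result as a JSON string."""
    handler = HANDLERS.get(name)
    if handler is None:
        raise KeyError(f"model requested unknown tool: {name}")
    return json.dumps(handler(**json.loads(arguments_json)))


# Simulates the name and JSON-string arguments a model tool call carries.
tool_result = dispatch_tool_call("lookup_customer", '{"customer_id": "c-42"}')
print(tool_result)
```

The trade-off named above shows up directly: the schema and handler live in one codebase, which is simple to reason about but not reusable by another AI system without copying them.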

UTCP (Universal Tool Calling Protocol)

UTCP skips the wrapper entirely. Instead of wrapping a tool in an MCP server, it calls the tool's existing HTTP endpoints directly, with a discovery layer on top.

As of early 2026:

1,000+ GitHub stars
Implementations in Python, Go, and TypeScript
Growing community from teams that wanted lower latency and less infrastructure overhead

Best for teams with well-designed existing APIs who don't want a separate server layer. Not a full MCP replacement if you need the ecosystem breadth — but for many production use cases, materially simpler.

MCP with a gateway layer

For teams committed to MCP, the answer to most of the problems above is an MCP gateway — a controlled layer between your agent and your servers.

Your agent  →  MCP Gateway  →  MCP Server 1  →  Tool
                            →  MCP Server 2  →  Tool
                            →  MCP Server N  →  Tool

A gateway handles:

Authentication — verifies server identity before your agent calls anything
Tool filtering — loads only schemas relevant to the current task, not all of them
Audit logging — records every tool call for compliance and debugging
Rate limiting — stops runaway tool calls from blowing your budget

As of April 2026, 86–89% of AI agent pilots fail before reaching production. Governance gaps and audit visibility are the two most common reasons. A gateway is what closes both.
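Of the gateway responsibilities listed above, tool filtering is the easiest to sketch. The example below assumes tool definitions shaped like MCP tool listings (name, description, inputSchema); the task names, tool names, and filter_tools helper are hypothetical, and a real gateway would do this between the client and the servers.

```python
# Hypothetical tool definitions collected from several MCP servers.
ALL_TOOLS = [
    {"name": "lookup_customer", "description": "Fetch a customer record.", "inputSchema": {"type": "object"}},
    {"name": "create_ticket", "description": "Open a Jira ticket.", "inputSchema": {"type": "object"}},
    {"name": "send_slack_message", "description": "Post to a channel.", "inputSchema": {"type": "object"}},
    {"name": "read_drive_file", "description": "Download a file.", "inputSchema": {"type": "object"}},
]

# Hypothetical per-task allowlists maintained by whoever owns the gateway.
TASK_ALLOWLIST = {
    "support": {"lookup_customer", "create_ticket"},
    "notifications": {"send_slack_message"},
}


def filter_tools(all_tools: list, task: str) -> list:
    """Return only the tool schemas the given task is allowed to see."""
    allowed = TASK_ALLOWLIST.get(task, set())  # unknown task: expose nothing
    return [tool for tool in all_tools if tool["name"] in allowed]


support_tools = filter_tools(ALL_TOOLS, "support")
print([tool["name"] for tool in support_tools])  # ['lookup_customer', 'create_ticket']
```

Defaulting an unknown task to an empty tool list keeps the failure mode closed: the agent gets fewer tools, not all of them.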

So do we actually need MCP?

Yes. With caveats that matter.

Use MCP when:

Multiple AI systems need to share the same tools
You're a SaaS company giving AI agents access to your product
You need dynamic tool discovery across a large integration ecosystem

Skip MCP when:

You're building a focused agent with two or three tools you already own
Your tools have clean REST APIs you control
You need low latency and minimal infrastructure overhead

The "MCP for everything" era is over. It's the right call when standardization pays off at scale. When you just need your agent to hit an API you already control, MCP is overhead pretending to be infrastructure.

Cheat sheet: what to actually do

Your situation	What makes sense
Local dev, one or two tools, just exploring	Bare MCP or native tool calls. Don't over-engineer.
Agent using tools you own, clean REST APIs	Direct API calls or native tool use. Skip MCP overhead.
Production agent, 5+ tools, or external users	MCP with a gateway. Authentication is not optional.
Enterprise, compliance, or regulated industry	MCP gateway with audit logs and SSO. Non-negotiable.
Pulling from community MCP registries	Treat every server as untrusted. Verify before deploying.
Need to share tools across multiple AI systems	MCP is the right call. This is exactly what it's for.

The actual state of things

MCP isn't going away. The downloads are real. The Linux Foundation governance is serious. Multi-vendor adoption means the protocol has institutional staying power.

But the MCP of early tutorials — install a community server, plug it in, done — that version is dead.

It was never safe for production. It was never meant to be.

The engineers moving to UTCP or direct API calls aren't abandoning MCP because it failed. They're routing around the parts that weren't built for what they're building.

I keep coming back to the OX Security test. Nine out of eleven registries. No automated review. The agent called the fake server, ran its queries, and handed over credentials it didn't know it was handing over.

Your agent does what it's told by the tools it trusts.

MCP hasn't fully answered how it decides what to trust. Until it does, treat every community MCP server the way you'd treat a random npm package in 2015.

You know how that era ended.

I Built a Stateful Research Agent Inside a Sandbox. Here's What the Numbers Actually Looked Like.

Divy Yadav — Wed, 27 May 2026 04:37:05 +0000

Three steps into a multi-page research task, the agent lost everything.

Not a crash. Not a thrown exception.

The function returned, context reset, and the pricing data it had just collected vanished.

This failure is predictable: stateless execution environments were never built to hold state across browser sessions that run for twenty minutes.

You hit it eventually, usually at the worst moment.

The two standard workarounds are both annoying. Stuffing state into the prompt works until token costs become an issue. An external state store solves the problem, but now you are maintaining another service.

I had been using E2B for short-lived code execution. It handles that well, and they have added persistence features over time, including early-stage snapshot support. But for agents that need to pause mid-task and resume from a different process, state management is still mostly on you.

Someone in my Discord mentioned TensorLake. I opened the docs and decided to build against this specific problem.

In this article, I will walk through the steps to build a desktop-using agent inside a sandbox.

Let's start with setting up.

Setup

What caught my attention first: named sandboxes with suspend() and resume() that preserve the full VM state. That means not just files, but running processes and open browser sessions. Sub-second resume, according to their docs.

Ten minutes from zero to running:

pip install tensorlake
tl login   # or TENSORLAKE_API_KEY env var

Free tier, no credit card.

from tensorlake.sandbox import Sandbox

sandbox = Sandbox.create(
    name="research-agent",
    cpus=2.0,
    memory_mb=4096,
    secret_names=["OPENAI_API_KEY"],
    image="tensorlake/ubuntu-vnc",
)

The tensorlake/ubuntu-vnc image is what gives you a real desktop and Firefox inside the VM. You need an actual browser because modern pricing pages heavily use client-side rendering and bot detection that stops headless scrapers cold. Firefox inside a sandbox just looks like a person browsing.

Important: Playwright is not pre-installed in ubuntu-vnc. Install it before the agent runs:

sandbox.run("pip", ["install", "playwright"])
sandbox.run("playwright", ["install", "chromium"])

Two to three minutes on first setup. After that, packages persist across suspend/resume so you pay the cost once.

Latency: What I Actually Measured

First sandbox was running in roughly 800-900ms from the Sandbox.create() call to status running.

Here is where time actually goes:

Sandbox creation:        ~800ms          (named sandbox, first time)
Sandbox resume:          ~400ms          (from suspended state)
LLM call (GPT-4o):       2,000-4,000ms   (per step, dominates everything)
Browser screenshot:      ~300ms          (capture + transfer)
Page load in sandbox:    1,000-2,000ms   (varies by site)
File read/write:         <50ms           (block-based storage)
Sandbox suspend:         ~200ms

The LLM calls dominate by a large margin. Sandbox overhead is not the bottleneck. The main optimization is batching browser operations before each model call rather than interleaving individual round trips.

Tensorlake publishes a SQLite filesystem benchmark claiming 1.6-1.9x faster I/O than E2B and Modal. Self-reported numbers. I could not independently verify them. What I can say is that the block-based storage felt responsive for frequent small writes, which is exactly the pattern a research agent uses when checkpointing after every step.

Computer Use: What Worked and What Didn't

The desktop API itself is clean:

with sandbox.connect_desktop(password="tensorlake") as desktop:
    png_bytes = desktop.screenshot()
    desktop.move_mouse(640, 400)
    desktop.click()
    desktop.type_text("pinecone.io")
    desktop.press("Return")

Screenshot as PNG bytes, decode it, figure out where to click, send coordinates. Each browser interaction takes 1-3 seconds depending on page load. Slow compared to an API call. But it works on pages that block scrapers, because from the server's side it is just a person using Firefox.

The problem: coordinates assume a fixed layout, and layouts do not stay fixed.

Weaviate's pricing page ran an A/B test between two of my agent's steps. The toggle moved 30px down. The agent clicked empty space. No error, no exception. Just a screenshot showing nothing happened, and twenty minutes of debugging before I identified the offset.

The fix: pass screenshots to GPT-4o Vision to identify element positions dynamically rather than hardcoding coordinates. Adds about 2 seconds per interaction, handles layout drift reliably. Worth it for reliability; too slow for high-frequency operations.

When the DOM is accessible, Playwright inside the sandbox is the better path:

result = sandbox.run(
    "python",
    ["-c", """
import asyncio
from playwright.async_api import async_playwright

async def get_pricing():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://pinecone.io/pricing")
        pricing_text = await page.inner_text(".pricing-section")
        print(pricing_text)
        await browser.close()

asyncio.run(get_pricing())
"""]
)

The hybrid strategy I landed on:

Situation	Approach	Why
Site with bot detection	Vision + coordinates	Playwright gets blocked
Accessible DOM	Playwright directly	Faster, no coordinate drift
Unknown or variable layout	Screenshot + GPT-4o Vision	Resolves position dynamically
High-frequency operations	Playwright only	Vision adds ~2s per call

Use vision as a fallback, not a first tool. Vision handles layout variation. Playwright handles speed. Neither does both well.
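The table above can be written as a small decision function. The function name, the flag names, and the order in which the checks run are my own assumptions; the table does not say which rule wins when two apply.

```python
def choose_approach(bot_detection: bool, dom_accessible: bool,
                    high_frequency: bool = False) -> str:
    """Pick a browser strategy following the table above."""
    if bot_detection:
        # Playwright gets blocked, so drive the real browser session.
        return "vision"
    if dom_accessible or high_frequency:
        # Faster, and no coordinate drift.
        return "playwright"
    # Unknown or variable layout: resolve positions from a screenshot.
    return "vision"
```

Checking bot detection first reflects the one hard constraint in the table: if Playwright is blocked, speed does not matter.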

Statefulness: The Part That Actually Mattered

After three steps (Pinecone free tier limits noted, $70/mo Starter plan recorded, Weaviate docs started), I called sandbox.suspend().

The sandbox froze. Filesystem, memory, running browser: all paused. Twelve minutes later, from a different terminal:

sandbox = Sandbox.connect("research-agent")
sandbox.resume()

About 400ms. The Weaviate pricing tab was still open. Tensorlake's suspend/resume preserves the full VM state, including memory and running processes.

Everything written to /workspace/research_notes.json was intact.

The workflow I settled on: write state explicitly after each meaningful step, then suspend.

import json

# After each step, before suspending:
sandbox.write_file(
    "/workspace/state.json",
    json.dumps({
        "pinecone_pricing": pinecone_data,
        "weaviate_started": True,
        "next_url": "https://weaviate.io/pricing"
    }).encode()
)
sandbox.suspend()

# On next invocation, from any process:
sandbox = Sandbox.connect("research-agent")
sandbox.resume()
state = json.loads(bytes(sandbox.read_file("/workspace/state.json")))
# picks up from state["next_url"]

The state file is the continuity mechanism. Not elegant, but it removes the need for an external database, and the filesystem is fast, durable across suspend, and readable from any reconnecting process.
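One way to make that resumable is to key the state file by step name and skip anything already recorded. This is a sketch, not the code from my project: `run_steps`, `read_state`, and `write_state` are hypothetical names, and the two callables stand in for thin wrappers around the sandbox's file read and write.

```python
import json
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], object]]


def run_steps(steps: List[Step],
              read_state: Callable[[], bytes],
              write_state: Callable[[bytes], None]) -> Dict[str, object]:
    """Run named steps in order, skipping any already recorded in state."""
    raw = read_state()
    state = json.loads(raw) if raw else {}
    for name, action in steps:
        if name in state:
            continue  # finished in an earlier run
        state[name] = action()
        # Persist after every step so an interruption loses at most one.
        write_state(json.dumps(state).encode())
    return state
```

Each step's return value has to be JSON-serializable for this to work, which is a reasonable constraint for research notes and pricing data.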

Scaling and Failure Handling

Sandbox.create() is a blocking synchronous call. For parallel workloads, wrap in concurrent.futures:

from tensorlake.sandbox import Sandbox
from concurrent.futures import ThreadPoolExecutor

def research_competitor(name, url):
    sandbox = Sandbox.create(
        name=f"research-{name}",
        cpus=1.0,
        memory_mb=2048,
        secret_names=["OPENAI_API_KEY"],
        image="ubuntu-vnc",
    )
    try:
        # ... agent logic ...
        return sandbox.read_file("/workspace/report.json")
    finally:
        # Always release the sandbox, even if the agent logic raises.
        sandbox.terminate()

competitors = [
    ("pinecone", "pinecone.io/pricing"),
    ("weaviate", "weaviate.io/pricing"),
    ("qdrant", "qdrant.tech/pricing"),
]

with ThreadPoolExecutor(max_workers=5) as executor:
    reports = list(executor.map(lambda c: research_competitor(*c), competitors))

Three concurrent sandboxes ran without delay. I have not tested at twenty or fifty. Tensorlake's docs mention hundreds per second. Treat that as an unverified claim until you have your own load data.

Note: Tensorlake's Python SDK v0.5.8 introduced native async APIs that offer a cleaner alternative to threading for I/O-bound orchestration. If you are on v0.5.8 or later, those are worth reaching for before wrapping synchronous calls in a thread pool.

Patterns worth building from day one:

Idempotent state writes. Write state after each meaningful step. If the agent fails mid-run, the next invocation reads the file and skips completed work. This does not happen automatically.

Checkpoint before risky operations. sandbox.checkpoint() creates a restorable snapshot. By default, snapshots preserve the filesystem state. Preserving full memory state is supported as an explicit option. Either way, you can restore into a fresh sandbox if an operation goes wrong:

# Filesystem snapshot (default)
snapshot = sandbox.checkpoint()

try:
    agent.navigate_to_pricing_page()
except Exception:
    # Restore filesystem state into a new sandbox
    sandbox = Sandbox.create(snapshot_id=snapshot.snapshot_id)

Named sandboxes. If the orchestration process dies, any other process reconnects with Sandbox.connect("sandbox-name") and resumes from the last written state.

Architectural boundary: Tensorlake provides the execution environment and runtime for agents: the VM, the filesystem, the process lifecycle, the networking. It is not an agent framework. Retry logic, circuit breakers, and LLM rate-limit backoff belong in the orchestration layer above it: LangChain, LlamaIndex, a custom harness, or whatever you are using to drive the agent. That separation is deliberate, not a gap.
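As an example of what belongs on the orchestration side of that boundary, here is a minimal retry helper with exponential backoff and jitter. It is a generic sketch with names of my own choosing, not something Tensorlake or any framework provides; the `sleep` parameter exists only so the delay can be swapped out in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], attempts: int = 4,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```

In real use you would catch only the exceptions that are worth retrying, such as rate-limit errors, rather than every `Exception`.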

The Mental Model

The part that shifted how I thought about the design:

┌─────────────────────────────────────────────┐
│                 Your Agent                   │
│    (LLM + tool calling logic)                │
└──────────────────┬──────────────────────────┘
                   │ tool calls
┌──────────────────▼──────────────────────────┐
│           Tensorlake Sandbox                 │
│  ┌──────────────────────────────────────┐   │
│  │ State Layer: /workspace filesystem   │   │
│  │  state.json, research_notes.json     │   │
│  └──────────────────────────────────────┘   │
│  ┌──────────────────────────────────────┐   │
│  │ Execution Layer: processes, scripts  │   │
│  └──────────────────────────────────────┘   │
│  ┌──────────────────────────────────────┐   │
│  │ Computer Use: VNC, screenshots, mouse│   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

The sandbox is not the agent. It is the stable environment the agent operates in. When it resumes, the environment is exactly where the agent left it. The agent's logic lives outside and reconnects to a world that did not reset.

That changes what you can build. An agent that runs for an hour, navigates fifteen pages, and writes a structured report is feasible when the execution environment outlasts the orchestration session. With purely ephemeral execution, it is not.

How It Compares

vs E2B:

Both use Firecracker microVMs. E2B markets sub-200ms cold starts; community reports put real-world p50 closer to 400-600ms. Tensorlake named sandbox creation was ~800ms in my testing.
E2B has added snapshot and pause-resume in recent releases. The statefulness gap is narrower than a year ago. Tensorlake's suspend/resume preserves the full running VM state, including open processes and browser sessions, in under a second. E2B's memory snapshot support is still described as early-stage.
Tensorlake claims 1.6-1.9x faster filesystem I/O on their own benchmarks. Self-reported. For an independent reference: Tensorlake recently ranked top 2 across all three categories in the ComputeSDK sandbox benchmarks.
Neither provides DOM-level element selection at the SDK layer.

vs Modal:

Modal uses gVisor rather than Firecracker, designed around stateless function execution. Stateful long-running agents work but need more setup. Cold starts are around 1-1.5 seconds per their docs.

vs Stagehand (BrowserBase):

Stagehand has DOM-level selectors (CSS, XPath, natural language) via locator(). For pure browser automation, this is a real ergonomic advantage.
Tensorlake gives you a full VM. Code execution, file management, package installs, and browser use in the same environment. If that combination is what you need, the full VM model is worth the coordinate complexity.
Browser automation only? Stagehand is the more focused tool.

from tensorlake.sandbox import SandboxClient

client = SandboxClient()

# List every sandbox in the account with its current status.
for sb in client.list():
    print(sb.sandbox_id, sb.status)

What the Build Produced

By the end of the session, the agent had produced the comparison: Pinecone versus Weaviate pricing, extracted across seven pages, with notes preserved across two suspensions and a full restart of the orchestrating machine.

report_bytes = sandbox.read_file("/workspace/comparison_report.md")
print(bytes(report_bytes).decode("utf-8"))

Accurate. Correct tier names and numbers.

Tensorlake did not solve the hard parts: the retrieval logic, state schema, hybrid browser strategy. It stayed out of the way while those got built. Most of the infrastructure friction came down to state management, and most of that went away once the sandbox filesystem became the state store.

Three Things to Know Before You Start

Speed is a systems problem, not a sandbox problem. LLM calls account for the bulk of per-step latency. Optimize by batching browser operations before each model call, not by chasing sandbox startup time.

Design for interruption from day one. Write state after every meaningful step. Not because the sandbox will crash, but because resuming from a different process after an unexpected interruption is a real scenario, not an edge case.

Computer use is a primitive. The coordinate-based API works, but layout drift will break hardcoded positions. Use Playwright when the DOM is accessible. Fall back to vision when you need a real browser session. Do not automate full workflows with raw coordinates.

Is the sandbox infrastructure production-ready? Yes. Suspend/resume held up, filesystem persistence was consistent, and Firecracker isolation did what it was supposed to.

Is the computer use layer production-ready? Not without additional engineering. The raw coordinate API is a reasonable primitive, but element resolution needs to be built on top of it. A vision-backed click_element() in the SDK would change the story significantly. Until then, budget the time to build that layer yourself.

Worth using? Yes, if you go in with clear expectations about what the platform handles and what it leaves to you. That boundary is sharper than most, which makes it easier to work with once you have internalized it.

The complete project is on my GitHub.

References

Tensorlake. Tensorlake Documentation & Sandbox SDK. https://tensorlake.ai

E2B. E2B Sandbox Infrastructure. https://e2b.dev

Modal. Modal Serverless Infrastructure. https://modal.com

Stagehand (BrowserBase). Stagehand Browser Automation. https://browserbase.com/stagehand

Amazon Web Services. Firecracker MicroVMs. https://firecracker-microvm.github.io/

Microsoft. Playwright Browser Automation. https://playwright.dev

ComputeSDK. Sandbox Benchmarks. https://www.computesdk.com/benchmarks/sandboxes/

LLMs, RAG, Agents, MCP: The AI Evolution You Actually Need to Understand

Divy Yadav — Wed, 20 May 2026 11:03:00 +0000

Most people still think AI is just a chatbot.

That idea is already outdated.

Modern AI systems browse the web, remember your preferences, execute code, query databases, call APIs, and coordinate workflows. They operate more like software employees than like a search bar.

This did not happen because models got smarter. It happened because the architecture changed.

Every layer of the modern AI stack exists because the previous layer had a real failure. Understanding what failed and why something new was built is the fastest way to understand how any serious AI product works today.

That is what this article covers. Every stage: LLMs, RAG, Agents, and MCP.

This one took a lot of research, and it should give you a solid picture of the full AI evolution.

If you want more like it, consider subscribing to my newsletter, where you will get noise-free AI information every week.

Stage 1: The LLM Era

What LLMs Actually Are

An LLM is a prediction engine.

Not a reasoning engine. Not a database. Not a search system.

Given text, it predicts what comes next. That prediction runs over and over, token by token, until a full response is generated. The model learns these predictions from enormous amounts of human text: books, articles, code, research papers, websites.

Input:  "The capital of France is"
Model:  [predicts next token]
Output: "Paris"

Simple idea. The scale is what makes it work.
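The loop itself fits in a few lines. The sketch below swaps the trained model for a hand-written lookup table, so it only illustrates the token-by-token mechanics; a real LLM scores every token in its vocabulary against the full context rather than looking at the last word.

```python
# Toy stand-in for a trained model: maps the last word to the next token.
NEXT_TOKEN = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
}


def generate(prompt: str, max_tokens: int = 10) -> str:
    tokens = prompt.split()
    if not tokens:
        return ""
    for _ in range(max_tokens):
        nxt = NEXT_TOKEN.get(tokens[-1])
        if nxt is None:
            break  # nothing left to predict, stop generating
        tokens.append(nxt)  # the new token becomes part of the context
    return " ".join(tokens)
```

Calling `generate("The capital of France is")` returns `"The capital of France is Paris"`. Each predicted token is appended and fed back in, which is the same loop a real model runs.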

A token is roughly 3 to 4 characters of English text. "Hello, world!" is about 4 tokens. Everything the model processes and generates is counted in tokens. This affects cost, speed, and the limits we will cover shortly.
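For quick budgeting, the characters-per-token rule of thumb can be turned into a one-line estimate. This is a rough heuristic under the 4-characters assumption, not a real tokenizer; actual counts depend on the model's vocabulary.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)
```

"Hello, world!" is 13 characters, so `estimate_tokens("Hello, world!")` gives 4, which matches the figure above.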

Why It Felt Like a Big Deal

For the first time, a machine could:

Hold a fluent conversation in any language
Write code that actually ran
Summarize a 50-page document in seconds
Explain complex topics to a non-expert
Answer questions across almost any domain

Where It Broke Down

Then people started building real products. The limitations became obvious fast.

Hallucination: The model predicts what is plausible, not what is true. It will state wrong facts with total confidence.

Knowledge cutoff: Training data has a date. Ask about last week and it guesses.

No memory: Every conversation starts blank. The model has no idea what you talked about yesterday.

No access to your data: Your company documents, your database, your internal systems. The model knows none of it.

No ability to act: It produces text. It cannot send an email, run a query, or update a record.

Ask a pure LLM: "What was Apple's stock price yesterday?" It will either refuse or make up a number.

It has no connection to live systems. It is a very smart autocomplete engine. Autocomplete alone does not run a business.

This limitation is what created the next stage.

Stage 2: RAG Changes the Game

The Core Idea

RAG stands for Retrieval-Augmented Generation. One sentence covers it:

Before generating a response, retrieve the relevant information and give it to the model.

Instead of relying only on training data, the system fetches fresh, relevant context at the moment of each query.

A Simple Way to Think About It

Pure LLM: A student answering an exam entirely from memory. Sometimes brilliant. Sometimes confidently wrong.

RAG: The same student, but allowed to open their notes before answering. Answers are grounded in actual sources.

The model did not get smarter. It got better information to work with.

How RAG Works

USER QUERY
    ↓
RETRIEVE relevant documents
(from a vector database, using semantic search)
    ↓
INJECT those documents into the prompt as context
    ↓
LLM generates an answer grounded in the retrieved content
    ↓
RESPONSE (accurate, with sources)
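The retrieve and inject steps can be sketched without any infrastructure. The retriever below ranks documents by shared words, which is a deliberately naive stand-in for semantic search; the function names and the prompt wording are my own, not a standard.

```python
from typing import List


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Stand-in retriever: rank documents by words shared with the query.
    A real system would use embeddings and a vector database instead."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Inject retrieved passages ahead of the question."""
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```

The string returned by `build_prompt` is what gets sent to the model. Numbering the sources is what lets the response point back to where each claim came from.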

The Technology Behind Retrieval

Embeddings are what make semantic search work.

Documents are converted into vectors, which are lists of numbers that represent meaning mathematically. Similar meanings end up close together in vector space. "Car" and "automobile" are close. "Car" and "photosynthesis" are not.

When a user query arrives, it is also converted to a vector. The system finds the stored vectors nearest to that query vector. Those are the semantically relevant documents, retrieved and injected into context.
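Here is what "nearest" means in practice, using cosine similarity over hand-made three-dimensional vectors. The vectors are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: List[float], index: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the k keys whose vectors are most similar to the query."""
    return sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)[:k]


# Made-up vectors for illustration only.
INDEX = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "photosynthesis": [0.0, 0.2, 0.95],
}
```

With these numbers, "car" and "automobile" score above 0.99 while "car" and "photosynthesis" score close to zero. A vector database does the same comparison, with indexing tricks so it does not have to scan every stored vector.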

Common vector databases:

Database	Best For
Pinecone	Managed, production-ready
Weaviate	Open-source, rich query support
Chroma	Development and small-scale use
FAISS	Fast, local, no managed infrastructure

What RAG Unlocked

RAG became the foundation of a lot of serious AI products:

Enterprise knowledge assistants
Customer support bots grounded in actual policy
PDF and document Q&A
Internal search that surfaces the right document
Any system needing up-to-date or private data

What RAG Still Could Not Do

Retrieval solves the knowledge problem. It does not solve the action problem.

RAG can find the answer to "what is our refund policy?" It cannot process the refund. It can tell you flight options. It cannot book the ticket.

For that, a different capability was needed.

Stage 3: The Rise of AI Agents

The Core Shift

Traditional AI:

User asks → Model answers → Done

Agent:

User sets a goal → Agent plans → Agent uses tools →
Agent observes results → Agent decides next step →
Agent continues until goal is complete

Agents reason, plan, use tools, and execute multi-step workflows. They operate rather than just respond.

Tool Calling: How Agents Reach the Real World

An LLM by itself cannot search Google, call an API, write to a database, or run code. Tool calling extends the model's reach.

User: "Find the cheapest flights from Delhi to Singapore next month."

Agent Step 1: Call flight search API with parameters
Agent Step 2: Receive results
Agent Step 3: Sort and compare options
Agent Step 4: Summarize the three cheapest options

The model decides which tool to call, with what arguments, and what to do with the result. It manages the whole workflow.
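Stripped of any framework, that workflow is a loop. In the sketch below `decide` stands in for the LLM call, and the action format (`{"tool": ..., "args": ...}` or `{"final": ...}`) is an assumption of mine rather than any vendor's schema.

```python
from typing import Callable, Dict, List

Tool = Callable[..., object]


def run_agent(decide: Callable[[List[dict]], dict], tools: Dict[str, Tool],
              goal: str, max_steps: int = 5) -> object:
    """Ask the model for the next action, run the tool, feed back the result."""
    history: List[dict] = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = decide(history)
        if "final" in action:
            return action["final"]
        name = action["tool"]
        if name not in tools:
            result = f"unknown tool: {name}"  # let the model recover
        else:
            result = tools[name](**action["args"])
        history.append({"role": "tool", "name": name, "content": result})
    raise RuntimeError("step limit reached without a final answer")
```

The `max_steps` cap is the simplest possible stop condition. Without it, a model that keeps requesting tools would loop indefinitely.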

What Agents Can Do

A capable AI agent can:

Browse websites and extract information
Write, execute, and debug code
Send emails and messages
Query and update databases
Call any API with proper credentials
Coordinate with other agents
Schedule and manage workflows

Frameworks That Made This Practical

Building agents from scratch is tedious. Frameworks handle the boilerplate:

LangChain / LangGraph - most widely used, graph-based orchestration
AutoGen - multi-agent conversations, good for collaborative tasks
CrewAI - role-based agent crews for structured workflows
OpenAI Agents SDK - native tool calling with built-in orchestration

Where Agents Break

More power introduced more failure modes:

Context overflow: Long agent runs fill the context window. Earlier instructions get lost. Accuracy drops.

Memory fragmentation: Without a coherent memory system, agents lose track of what they were doing.

Tool confusion: Too many tools and the model picks the wrong one or misuses it.

Hallucinated actions: The model invents results from tool calls it never actually made.

Runaway loops: No stop condition means the agent keeps going when it should have asked for clarification.

There was also a deeper infrastructure problem. Every agent integration was custom-built. Connecting to Slack required one connector. Google Drive required another. Salesforce required another. There was no standard. Scaling meant a growing stack of hand-built code.

That is what MCP was built to fix.

Stage 4: MCP, The Protocol That Standardizes Everything

The Problem Before MCP

Before November 2024, connecting an AI system to external tools meant:

Custom integration for every tool
Different formats for every API
No standard for how models discover what tools are available
No consistent way to pass context or results between systems

Every new data source required its own implementation. This was not an AI limitation. It was an infrastructure limitation.

What MCP Is

MCP stands for Model Context Protocol. It is a standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments.

Anthropic announced MCP in November 2024 and open-sourced it on day one.

MCP defines a universal interface for:

Reading files and data sources
Executing functions and tools
Handling context and prompts
Coordinating between AI systems and external environments

The USB-C analogy is actually a good one here. Just as USB-C made it easy to connect any device to any peripheral, MCP makes it easier to connect any AI model to any data source or tool. One protocol, many integrations.

How It Works Architecturally

MCP servers expose three things:

Tools: actions the model can call
Resources: data the model can read
Prompts: templates for interaction

The model queries the server to discover what is available, then invokes tools in a structured, validated format.
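On the wire, MCP messages are JSON-RPC 2.0. The sketch below builds the two requests involved in discovery and invocation, `tools/list` and `tools/call`. The helper names and the `search_flights` tool are hypothetical, and a real client would use an MCP SDK and complete the initialization handshake before sending either message.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)


def rpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request with a fresh id."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request() -> dict:
    """Ask the server which tools it exposes."""
    return rpc_request("tools/list")


def call_tool_request(name: str, arguments: dict) -> dict:
    """Invoke one tool by name with structured arguments."""
    return rpc_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical tool name and arguments, serialized as they would be sent.
wire = json.dumps(call_tool_request("search_flights", {"origin": "DEL"}))
```

The server's reply to `tools/list` describes each tool's name and input schema, which is what lets the model decide what to call without a hand-written connector.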

Adoption

MCP did not take years to catch on. Since launch, it has been adopted by OpenAI, Microsoft, Google, and Cloudflare. The Python and JavaScript SDKs together see over 20 million weekly downloads. Over 13,000 MCP servers launched on GitHub in 2025 alone.

In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI. It now sits alongside Kubernetes and PyTorch in that portfolio.

The Honest Limitations

MCP is not perfect. Security is a real concern.

Security researchers identified multiple issues with the protocol, including prompt injection, tool permissions that allow data exfiltration, and lookalike tools that can silently replace trusted ones.

The spec does not enforce audit trails, sandboxing, or verification. MCP solves the connectivity problem. Organizations deploying it at scale are responsible for building the security layer on top.

Context Engineering: The Layer Tying It All Together

Context engineering is what makes everything above work reliably in production.

Prompt engineering is writing a good instruction. Context engineering is designing the entire information environment the model operates in:

Memory: what the model remembers from previous interactions
Retrieval: what documents or data are fetched for each query
Tools: what actions are available and how they are described
History: how much of the conversation is included
System state: what the model knows about its current task
Workflow position: where in a multi-step process the model is
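A small example of one such decision: how much history to include. The sketch below always keeps the system prompt and retrieved documents, then fills what is left of a token budget with the newest messages. The function names, the priority order, and the 4-characters-per-token estimate are all assumptions for illustration.

```python
import math
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token), not a real tokenizer.
    return math.ceil(len(text) / 4)


def assemble_context(system: str, retrieved: List[str], history: List[str],
                     budget: int) -> List[str]:
    """Keep the system prompt and retrieved documents, then fill the
    remaining budget with the most recent history messages."""
    parts = [system] + retrieved
    remaining = budget - sum(estimate_tokens(p) for p in parts)
    kept: List[str] = []
    for message in reversed(history):  # newest first
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return parts + list(reversed(kept))  # restore chronological order
```

Dropping the oldest messages first is the bluntest policy available. Summarizing old turns or pinning key instructions are common refinements, but the budget arithmetic stays the same.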

The most capable AI systems today are not just better models. They are better systems designed around the model.

Context engineering is what separates an agent that works in production from one that works in a demo.

What the Modern AI Stack Looks Like

A serious AI product in 2026 is a system, not just an API call:

User Interface
    ↓
Orchestration Layer (LangGraph, AutoGen, custom)
    ↓
Context Manager
├── Memory Layer (conversation history, user preferences)
├── Retrieval Layer (vector DB, semantic search)
└── State Manager (task progress, tool outputs)
    ↓
Tool Layer (via MCP or custom integrations)
├── Web search
├── Database queries
├── API calls
├── Code execution
└── File operations
    ↓
LLM (GPT-4o, Claude, Gemini, open-source)
    ↓
Response + Actions

Each layer solves a specific failure from the layer before it. Remove any layer and you reintroduce the problem it was built to solve.

Which Layer Do You Actually Need?

Do not over-engineer this.

A simple RAG pipeline is the right call for most document Q&A use cases. A complex agent adds coordination overhead you do not need if the task is just retrieval.

Add a layer only when the simpler system actually cannot meet the requirement.

What Comes Next

A few trends worth watching:

Long-term memory: Agents that remember your preferences across months, not just sessions.

Multi-agent collaboration: Networks of specialized agents coordinating on shared goals, where each handles one domain.

Deeper real-world execution: Tighter integration with operating systems and software, not just APIs.

Autonomous workflows: Agents that manage their own task queues without step-by-step human orchestration.

The bottleneck has moved. In 2020, it was model intelligence. In 2026, it is system design: how well you manage memory, retrieval, tool coordination, and context across a complex workflow.

The Real Takeaway

The biggest mistake people make is thinking the model is the entire product.

It is not.

Modern AI systems are architectures: memory, retrieval, orchestration, tool ecosystems, context managers, and execution environments wrapped around a model.

The future will not be decided only by which model is best.

It will be decided by which system is built best around it.

If you found this useful, I write about AI engineering weekly in my newsletter AI Engineering Simplified. No hype, just practical breakdowns.