Hector Flores

Posted on Mar 19 • Edited on May 29 • Originally published at htek.dev

NVIDIA OpenShell and the Rise of Agent Sandboxes in Agentic DevOps

#security #devops #aiagents #agenticdevelopment

Your Agents Are Running on Bare Metal. That Should Terrify You.

I've spent months building layered enforcement architecture for AI agents — instructions, hooks, gates. Three layers of defense that make agents structurally incapable of shipping untested code. 247 commits, 100% test coverage, zero rollbacks.

But there's a question I kept dodging: where are these agents actually running?

GitHub Agentic Workflows gives you a sandboxed runner — a disposable VM that spins up, does work, and disappears. It's excellent. It's also specific to GitHub. The moment your agent needs to hit your staging database, call an internal API, or access credentials to provision infrastructure, that sandbox boundary dissolves. Your agent is operating on real systems with real consequences.

Then NVIDIA dropped OpenShell at GTC 2026 — an open-source, policy-driven sandbox runtime for autonomous AI agents. And suddenly the conversation changed from "should we sandbox agents?" to "how fast can we get this deployed?"

That's the gap this article addresses. We've been obsessing over what agents can do (hooks, gates, policies) without addressing where they do it. Sandboxes are the missing piece — Layer 0 of agentic DevOps.

Layer 0: The Enforcement Boundary

In my agent-proof architecture, I described three enforcement layers:

Layer 1: Instructions — Tell the agent what you expect
Layer 2: Hooks — Remind the agent at the moment of action
Layer 3: Gates — Verify server-side before merge

These layers assume something critical: the agent is operating in an environment where enforcement can happen. But what if it isn't?

An agent running on your local machine can spawn subprocesses that bypass hooks. It can write to disk outside your project directory. It can make network calls to services you didn't authorize. Instructions tell it not to. Hooks try to catch it. But without an isolation boundary, these are speed bumps, not walls.

Sandboxes are Layer 0 — the execution environment that makes every other layer enforceable. They don't replace hooks and gates. They make hooks and gates trustworthy.

Think of it this way:

Hooks run inside the sandbox — they control what the agent does
Gates validate from outside the sandbox — they verify what the agent produced
Policies declare what the sandbox allows — they define the boundary itself
The sandbox is the bridge between "tell the agent" and "enforce on the agent"

The 4-layer enforcement architecture — each layer compensates for the weaknesses of the layers above it. Sandboxes at Layer 0 are the final backstop that cannot be bypassed.

The Sandbox Landscape Exploded in 2025–2026

A year ago, "AI sandbox" meant E2B and maybe Docker. Today there are 30+ platforms competing across every dimension — isolation strength, cold start time, GPU access, persistence, and pricing.

The market segments by isolation technology:

The sandbox isolation landscape — 30+ platforms racing to own agent execution security, categorized by isolation technology and strength.

Isolation Tech	Strength	Trade-off	Key Platforms
Firecracker microVM	Strongest — dedicated kernel per workload	Slower cold starts, more resource overhead	E2B, Northflank, Vercel Sandbox, Blaxel, Fly.io Sprites
Kernel-level LSM	Strong — syscall-level enforcement	Requires Linux, complex policy authoring	NVIDIA OpenShell
gVisor	Good — userspace kernel interception	Some syscall compatibility gaps	Modal
Container	Moderate — shared kernel, namespace isolation	Escape vulnerabilities are well-documented	Daytona, Alibaba OpenSandbox
V8 Isolate / Wasm	Lightweight — process-level isolation	Limited to specific runtimes	Cloudflare Workers, Rivet Secure Exec

The cold start race tells you where the market is heading: Blaxel claims 25ms resume from standby, Daytona hits sub-90ms, E2B does ~150ms with full microVM isolation. For agentic workloads where an agent might spin up dozens of sandboxes during a single task, milliseconds matter.

The Comparison That Matters

For agentic DevOps specifically, here's what I'd look at:

Platform	Cold Start	Open Source	GPU	Self-Hosted	Pricing
E2B	~150ms	✅ (core)	❌	Via Terraform	~$0.08/hr
Daytona	Under 90ms	✅ (AGPL)	✅	❌	~$0.08/hr
Modal	Sub-second	❌	✅ Best	❌	Pay-per-second
OpenShell	Seconds	✅ Apache 2.0	✅ (DGX/RTX)	✅	Free
Northflank	Fast	❌	❌	✅ BYOC	Per-second
Fly.io Sprites	1-12s	❌	❌	❌	CPU+mem+storage
OpenSandbox	Variable	✅ Apache 2.0	❌	✅	Free
Microsandbox	Variable	✅ Apache 2.0	❌	✅ Local-first	Free

If you need ephemeral execution for agent backends, E2B is the proven choice with 200M+ sandboxes served. If you need persistent state with fast starts, Daytona (67K GitHub stars) or Fly.io Sprites are compelling. For GPU workloads, Modal is unmatched.

But for agentic DevOps — where policy-governed isolation is the whole point — one platform stands out.

NVIDIA OpenShell: Policy-Driven Agent Sandboxing

OpenShell, announced at GTC 2026, takes a fundamentally different approach. Instead of "here's a sandbox, run your code," it's "here's a policy engine, declare what the agent can do."

OpenShell enforces four protection domains:

OpenShell's 4 protection domains — declarative YAML policies with kernel-level enforcement that agents physically cannot circumvent.

Filesystem — Landlock LSM locks allowed paths at sandbox creation. Not a namespace trick. Kernel-enforced.
Network — Deny-by-default. Every outbound connection goes through an HTTP CONNECT proxy evaluated by OPA/Rego policies in real-time.
Process — Seccomp BPF filters block dangerous syscalls. No privilege escalation, no socket creation outside the proxy.
Inference — A privacy router intercepts LLM API calls, strips caller credentials, and injects backend credentials. Your agent's context never leaks to unauthorized model providers.

The killer feature is declarative YAML policies that hot-reload on running sandboxes:

# Allow the agent to reach GitHub API and npm registry — nothing else
network:
  outbound:
    - host: "api.github.com"
      ports: [443]
      methods: [GET, POST]
    - host: "registry.npmjs.org"
      ports: [443]
      methods: [GET]

Change the policy file, and the running sandbox immediately enforces the new rules. No restart. No downtime. This is what makes it fit the agentic DevOps model — policies are code, code is versioned, versioned policies are auditable.

OpenShell is Apache 2.0, fully self-hosted, and runs as a lightweight K3s cluster inside a single Docker container. Two commands to get started:

openshell sandbox create -- claude
openshell policy set my-sandbox --policy network-policy.yaml

It's alpha software — single-player mode, rough edges. But the architecture is right: sandboxes aren't just isolation, they're governance infrastructure.

Sandboxes Complete the Agentic DevOps Stack

Here's how sandboxes connect to everything I've written about agentic DevOps:

With hookflows, you enforce rules at the moment of action. But hookflows run in the agent's process — they trust the environment. A sandbox makes the environment itself trustworthy.

With agent hooks, you intercept tool calls and block dangerous operations. But hooks can be disabled by a sufficiently creative agent (or developer). A sandbox enforces at the kernel level — there's no --skip-sandbox flag.

With gates in CI/CD, you verify everything server-side. But gates only catch problems after the agent has already made changes. A sandbox prevents the problems from happening during execution.

With GitHub Agentic Workflows, you get a purpose-built sandbox for GitHub's ecosystem. General-purpose sandboxes extend that model to any infrastructure — your staging environments, your databases, your internal APIs.

The progression is clear:

Layer	Mechanism	When	Strength	Weakness
Layer 0: Sandbox	Kernel/VM isolation	During execution	Can't be bypassed	Requires infrastructure
Layer 1: Instructions	Context engineering	Before action	Easy to author	Easy to ignore
Layer 2: Hooks	Tool-call interception	At moment of action	Real-time enforcement	Can be disabled
Layer 3: Gates	CI/CD pipeline	After action	Server-side, tamper-proof	Catches problems late

Each layer compensates for the weaknesses of the others. Sandboxes at Layer 0 mean that even if an agent bypasses hooks, it physically cannot access unauthorized filesystems, networks, or processes.

The Bottom Line

We've been building agentic DevOps from the top down — instructions, hooks, gates. All essential. All insufficient without the foundation.

Sandboxes are that foundation. They're the difference between "we told the agent not to" and "the agent literally cannot." Between policy-as-suggestion and policy-as-physics.

NVIDIA's OpenShell is the most significant new entrant because it treats sandboxes as governance infrastructure, not just containers. Declarative YAML policies, hot-reloadable at runtime, with kernel-level enforcement that agents physically cannot circumvent. It's Apache 2.0, it's free, and it works with Claude Code, Codex, and Copilot out of the box.

The sandbox market is mature enough to use today. E2B for ephemeral execution, Daytona for fast iteration, Modal for GPU workloads, OpenShell for policy-governed isolation. The tooling exists. The question is whether your agentic DevOps stack includes it.

If you're running agents without sandbox isolation, you're running agents on trust. And trust doesn't scale.