Recap: Why Permission Dialogues Are the New Flash
In Part 1, we mapped the threat landscape: 37 vulnerabilities across 15+ AI IDEs, distilled into 25 repeatable attack patterns and systematized into 9 security gates. The conclusion was blunt — permission dialogues are the new Flash. Human-in-the-loop confirmations fail at 2 AM during batch operations, and they fail when developers are fatigued. Sandboxing is the only structural answer.
This article covers the OS/kernel layer of that defense. We started this work after watching an agent hallucinate a destructive command that wiped local configuration files. The immediate reaction was to add a confirmation prompt. We rejected that almost as fast — confirmation prompts are permission fatigue waiting to happen. The decision: build a zero-trust sandbox architecture that breaks attack chains at the kernel level, without relying on the developer to make good judgment calls under pressure.
This is Part 2A of a four-part series. Part 2B covers the application-layer defenses that operate inside the sandbox — input sanitization, SSRF protection, phantom credential proxying, and WASM capability isolation.
Why Not Docker?
Docker was off the table immediately. Agent tool calls are sub-millisecond operations — ls, cat, grep, git status — fired hundreds of times per session. Docker container cold-start adds hundreds of milliseconds of overhead per invocation (IBM Research benchmarks and community measurements report 200–600ms depending on platform and configuration). For the cold-start-per-command pattern agents use, simple commands take orders of magnitude longer than native execution. For an interactive CLI, that's a non-starter.
The alternative — a persistent Docker container with a hot shell — introduces state management complexity: orphaned containers, stale mounts, port conflicts. It also doesn't solve the cold-start problem for the first invocation. We chose instead a multi-tiered defense-in-depth approach using lightweight OS-level sandboxing primitives that add microseconds, not hundreds of milliseconds.
The tradeoff we accepted: we gave up Docker's well-understood isolation model in exchange for tighter integration with the host and significantly more engineering surface area to maintain.
The Architecture

Figure 1: C4 Container-level diagram — User prompts flow through the HTTP server, agent loop, and permission layer into the sandbox dispatch.

Figure 2: C4 Component-level diagram — Global and agent configs merge via the restrictiveness lattice. Agents can only escalate isolation, never downgrade it.
The data flow is straightforward: user prompt → HTTP server → agent loop → permission layer → sandbox dispatch. The dispatch probes for available backends in order of restrictiveness: firecracker → gvisor → bwrap → namespace → none. This is a waterfall, not a menu — the runtime picks the most restrictive backend available on the host. If no backend is available, the system runs unsandboxed, but only when hardened mode is disabled; with hardened mode on, the design is fail-fast, with no silent degradation.
The Restrictiveness Lattice: Agents Cannot Downgrade Themselves
The core design insight in sandbox/index.ts is that isolation isn't binary — it's a partial order. We needed a way to merge global operator policy with per-agent configuration without letting a compromised agent config weaken the system. The answer: a restrictiveness lattice where the merge operation always picks the higher value.
const BACKEND_RESTRICTIVENESS: Record<Backend, number> = {
  none: 0,
  "sandbox-exec": 1,
  namespace: 1,
  bwrap: 2,
  gvisor: 3,
  firecracker: 4,
  auto: 5, // "most restrictive available" — wins every comparison
}
This table drives a critical security property: agents can only escalate their own sandbox level, never downgrade it. When a global config sets bwrap (level 2) and a rogue agent config tries namespace (level 1), the runtime picks bwrap:
// Lattice merge: always picks the more restrictive backend
const effectiveBash: Backend =
  BACKEND_RESTRICTIVENESS[agentBash] > BACKEND_RESTRICTIVENESS[globalBash]
    ? agentBash // agent is MORE restrictive — honor it
    : globalBash // agent is less restrictive — keep global
// A rogue agent.json requesting "namespace" against global "bwrap" → bwrap wins
The reasoning: an agent's config file lives in the workspace, and workspaces are untrusted by default (they come from git clone). The global config lives on the operator's machine. Untrusted input must never override trusted policy.
The auto-detection waterfall on Linux is firecracker → gvisor → bwrap → namespace → none. Explicit mode requests are fail-fast — if you ask for bwrap and the binary is absent, you get a thrown error, not silent degradation to none. Silent fallback is exactly how sandbox bypasses happen in production. A system that silently downgrades to none is worse than a system without a sandbox, because the operator believes they're protected.
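In code, the dispatch described above can be sketched as a short resolver. This is a simplified illustration: `resolveBackend`, `WATERFALL`, and the `available` probe set are hypothetical names, not OpenCode's actual API.

```typescript
type Backend = "none" | "namespace" | "bwrap" | "gvisor" | "firecracker"

// Probe order, most restrictive first — mirrors the waterfall described above.
const WATERFALL: Backend[] = ["firecracker", "gvisor", "bwrap", "namespace", "none"]

function resolveBackend(available: Set<Backend>, requested?: Backend): Backend {
  if (requested) {
    // Explicit mode request: fail fast rather than silently degrade.
    if (!available.has(requested)) {
      throw new Error(`sandbox backend "${requested}" requested but unavailable`)
    }
    return requested
  }
  // Auto mode: take the first (most restrictive) backend the host supports.
  for (const backend of WATERFALL) {
    if (available.has(backend)) return backend
  }
  return "none"
}
```

Note the asymmetry: auto mode falls through the waterfall, but an explicit request never does — that is the fail-fast property.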
Network isolation follows the same lattice principle — false is more restrictive than true. If the global config says no network, no agent can override it:
// If global says no network, agent cannot re-enable it
const effectiveNetwork = (globalSandbox.network ?? false) && (agentSandbox.network ?? false)
Resource limits use Math.min — agents can request less memory/CPU, never more. We guard against Infinity bypass attempts using Number.isFinite() validation, because untrusted config input could contain Infinity values that would defeat the Math.min guard.
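As a sketch, that guard looks something like the following. The `mergeLimit` name and signature are assumptions for illustration, not the actual source.

```typescript
// Merge a trusted global limit with an untrusted per-agent request.
// Agents can only lower a limit, never raise it.
function mergeLimit(globalLimit: number, agentLimit?: number): number {
  // Reject non-finite values (Infinity, NaN) and non-positive values from
  // untrusted config before they ever reach the Math.min comparison.
  if (typeof agentLimit !== "number" || !Number.isFinite(agentLimit) || agentLimit <= 0) {
    return globalLimit
  }
  return Math.min(globalLimit, agentLimit)
}
```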
Important: The sandbox system activates in hardened mode (OPENCODE_HARDENED_MODE=true). Without hardened mode, the sandbox dispatch and several application-layer gates (like SSRF protection) are bypassed — this is by design for trusted development environments. Hardened mode is the toggle that moves from "developer convenience" to "zero-trust enforcement."
Why this matters for Part 1 threats: The lattice directly prevents the Codex zero-click config downgrade pattern. Even if an attacker plants a config requesting sandbox: "none", the global floor holds.

Figure 3: The restrictiveness lattice. Merge always moves outward — agents can escalate isolation, never downgrade it.
Linux: Bubblewrap (bwrap) — Unprivileged Namespace Isolation
We chose Bubblewrap as the primary Linux backend because it's an unprivileged user-namespace sandbox — no root required, no daemon, no setuid binary. Originally written for Flatpak, it has years of production hardening. The argument construction is deliberately verbose (belt-and-suspenders redundancy on namespace unsharing) because we'd rather have a redundant flag than discover a kernel version where --unshare-all doesn't cover a namespace we assumed it would.
// From packages/opencode/src/sandbox/bwrap.ts
const args = [
  "--unshare-all", // unshare every namespace (user, ipc, pid, net, uts, cgroup)
  "--die-with-parent", // child dies when parent dies — no zombie sandbox processes
  "--new-session", // new session ID — detaches from terminal control
  // ... explicit redundant unshares (--unshare-user, --unshare-pid, etc.) omitted
  // See: packages/opencode/src/sandbox/bwrap.ts for full argument list
]

// Network: blocked by default, explicitly opt-in
if (opts.network) {
  args.push("--share-net")
  args.push("--ro-bind", "/etc/resolv.conf", "/etc/resolv.conf")
} else {
  args.push("--unshare-net") // completely removes NIC — no loopback
}

// Minimal read-only filesystem view
args.push("--ro-bind", "/usr", "/usr", "--ro-bind", "/lib", "/lib")
args.push("--ro-bind-try", "/lib64", "/lib64")
args.push("--ro-bind", "/bin", "/bin", "--ro-bind", "/sbin", "/sbin")

// Writable workspace
args.push("--bind", opts.workdir, opts.workdir, "--chdir", opts.workdir)

// Additional writable paths (e.g. tool caches, tmp dirs)
if (opts.writable) {
  for (const path of opts.writable) {
    args.push("--bind", path, path)
  }
}
The filesystem mount list is intentionally minimal. The agent sees /usr, /lib, /bin, /sbin (read-only) and opts.workdir plus any explicitly declared writable paths (read-write). Nothing else. ~/.ssh doesn't exist in the mount tree. ~/.aws doesn't exist. /etc/passwd doesn't exist. We accepted the cost that some tools might fail if they probe paths outside this set, because the alternative — mounting the full filesystem read-only — would expose credentials, SSH keys, and shell history to any prompt-injected agent.
--unshare-net removes the network namespace entirely — including loopback. If the Codex zero-click exploit had fired inside bwrap, the reverse shell payload would have died at DNS resolution:
# Within bwrap:
cat ~/.ssh/id_rsa # → No such file or directory
curl attacker.com # → Could not resolve host (network unshared)
What's intentionally excluded from the mount tree: ~/.ssh, ~/.aws, ~/.config, ~/.docker, /etc/passwd, ~/.bash_history. Any of these would be gold for a prompt-injected agent. We'd rather break a tool that expects to find them than silently expose secrets.
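The effect of this mount policy can be modeled as a simple visibility predicate. This is a hypothetical sketch of the allowlist semantics, not the actual implementation:

```typescript
// Paths are visible inside the sandbox only if they fall under a bound prefix.
// Everything else — ~/.ssh, ~/.aws, /etc/passwd — simply does not exist.
const READ_ONLY_BINDS = ["/usr", "/lib", "/lib64", "/bin", "/sbin"]

function isVisible(path: string, workdir: string, writable: string[] = []): boolean {
  const prefixes = [...READ_ONLY_BINDS, workdir, ...writable]
  return prefixes.some((prefix) => path === prefix || path.startsWith(prefix + "/"))
}
```

The key property is that the check is an allowlist: a new secret location added to the host tomorrow is invisible by default, with no denylist to keep current.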
gVisor (runsc) — User-Space Kernel
The fundamental problem with all namespace-based sandboxes — bwrap, Docker, raw namespaces — is that they share the host kernel. If a CVE like Dirty Cow (CVE-2016-5195) or io_uring use-after-free (CVE-2023-32233) drops, a sandboxed process can exploit the kernel and escape. Namespaces restrict what a process can see, not what syscalls the kernel executes on its behalf.
gVisor eliminates this by interposing a user-space kernel called the Sentry — a Go process that intercepts every syscall. The host kernel never sees raw syscalls from the sandboxed process.
// From packages/opencode/src/sandbox/gvisor.ts
const runsc = runscPath()!
const args: string[] = ["--rootless"]
args.push(opts.network ? "--network=host" : "--network=none")
args.push("do", "--cwd", opts.workdir)
const writable = new Set([opts.workdir, ...(opts.writable ?? [])])
for (const dir of writable) {
  args.push("--volume", `${dir}:${dir}`)
}
args.push("--", ...opts.command)
The tradeoff: gVisor adds ~10–50% overhead to syscall-heavy workloads. For an ls or cat, that's noise. For a find across a large monorepo, it's noticeable. We decided the kernel isolation boundary was worth the cost for operators who want it, while keeping bwrap as the default for the common case where kernel exploits aren't in the threat model.
When to use gVisor vs bwrap: If your threat model includes kernel exploits — for example, multi-tenant environments where an attacker controls one agent and tries to escape to affect another — gVisor is the correct choice. For single-developer CLI use where the attacker is a prompt injection trying to exfiltrate secrets, bwrap's namespace isolation is sufficient and faster.

Figure 4: bwrap shares the host kernel — a kernel CVE can lead to escape. gVisor's Sentry intercepts every syscall in user space, preventing raw syscall access to the host kernel.
macOS: Apple Seatbelt (sandbox-exec)
macOS doesn't have user namespaces. The closest equivalent is Apple's Seatbelt MAC framework, accessed through sandbox-exec with a dynamically generated Sandbox Profile Language policy. The profile is generated at runtime because the writable paths and network policy depend on the agent's configuration — a static profile can't express "write only to /Users/dev/myproject."
// From packages/opencode/src/sandbox/darwin.ts
function profile(opts: Sandbox.Options) {
  const writable = [opts.workdir, ...(opts.writable ?? [])]
  const allowWrite = writable.map((item) => `(allow file-write* (subpath \"${esc(item)}\"))`).join("\n")
  const allowNet = opts.network !== false ? "(allow network*)" : "(deny network*)"
  return [
    "(version 1)",
    "(deny default)", // deny-by-default
    "(allow process-exec)",
    "(allow process-fork)",
    "(allow file-read*)", // reads allowed everywhere — see note below
    allowWrite, // writes ONLY in project dir + extras
    allowNet, // network: block or allow all
  ].join("\n")
}
The deliberate (allow file-read*) deserves explanation. The alternative — enumerating every path that npm, cargo, go, python, ruby, and their transitive dependencies might need to read — is a maintenance nightmare that would break on every toolchain update. The security model accepts read visibility in exchange for write isolation and network isolation. If you need read isolation, you need bwrap or gVisor, which means you need Linux.
# Within sandbox-exec, writes outside workdir are blocked at the kernel:
echo 'evil' >> ~/.bashrc # → Operation not permitted
echo 'evil' >> ~/.gitconfig # → Operation not permitted
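For completeness, here is how a generated profile would be wired into an invocation. `sandbox-exec` accepts an inline profile string via `-p`, followed by the command and its arguments; the helper name below is illustrative:

```typescript
// Build the argv for an inline-profile invocation:
//   sandbox-exec -p <profile> <command> [args...]
function seatbeltArgv(profileText: string, command: string[]): string[] {
  return ["sandbox-exec", "-p", profileText, ...command]
}
```

Generating the profile per invocation keeps the policy in lockstep with the agent's effective config: there is no profile file on disk for an attacker to tamper with between generation and use.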
The deprecation risk we're living with: sandbox-exec is deprecated by Apple and may be removed in a future macOS release. There is no official replacement with equivalent functionality. When that day comes, our options are: ship a custom kext (painful with Apple's notarization process), move to an Endpoint Security framework approach (requires a daemon), or accept that macOS agents run with weaker isolation than Linux. None of these are good answers. We're being transparent about this gap rather than pretending it doesn't exist.
The MCP Server Gap: The Industry's Open Problem
We need to be blunt about this: MCP servers are a distinct attack surface from agent tool calls, and we don't sandbox them.
Our sandbox dispatch wraps ephemeral tool execution (like bash or webfetch), but MCP server processes spawn in the host context, natively on your machine. This is the industry standard across Claude Desktop, Cursor, and every other tool we've examined. The reason is straightforward: capability heterogeneity. A PostgreSQL MCP connector needs network access. An AWS manager MCP needs to read ~/.aws/credentials. A filesystem MCP needs arbitrary read paths. If we drop their network interfaces and restrict their filesystems, they crash.
We considered applying a blanket bwrap policy to all MCP servers and immediately hit the configuration explosion problem. Every MCP server would need a capability manifest declaring its filesystem and network requirements, and there's no standard for that. The alternative — interactive prompts ("This MCP requests network access. Allow?") — is permission fatigue, which is exactly what we're trying to eliminate.
The current mitigation (Gate 1): We closed the zero-click vector that made this gap most dangerous. The Trust Module uses SHA-256 content-hashing: a malicious repository containing an .mcp.json file cannot automatically spawn a rogue server. OpenCode intercepts the untrusted config on boot, flags it as unapproved, and blocks execution before the MCP server is ever launched. If a developer explicitly runs opencode trust on a malicious repo, they grant that MCP server host access. But the zero-click supply-chain vector is dead.
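The content-hash check itself fits in a few lines. This is a sketch with illustrative names; the real Trust Module lives in packages/opencode/src/trust/index.ts:

```typescript
import { createHash } from "node:crypto"

// Trust is bound to exact config bytes: any edit to .mcp.json yields a new
// hash, so a tampered or freshly cloned config is unapproved by default.
function contentHash(configText: string): string {
  return createHash("sha256").update(configText).digest("hex")
}

function isApproved(configText: string, approvedHashes: Set<string>): boolean {
  return approvedHashes.has(contentHash(configText))
}
```

Hashing content rather than paths is what kills the zero-click vector: a repo cannot ship a pre-approved config, because approval lives on the operator's machine keyed to bytes the operator actually reviewed.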
The three paths forward:
1. WASM-Only Mandate. Force all MCP servers to compile to WebAssembly with strict capability constraints. Projects like mcp.run are using Extism for this already. The tension: compiling Python/JS to WASM creates massive binaries, breaks C-extensions, and lacks threading. It would break 99% of existing servers.
2. Docker Sidecar. Run long-lived Docker containers for untrusted MCPs, passing stdio over the container boundary. Docker's MCP Toolkit advocates this approach. The tension: the sidecar doesn't share the host filesystem, so MCP servers that read local Git state require complex volume mount orchestration.
3. Lattice Extension. MCP servers declare required capabilities in their manifest; the runtime routes execution through the sandbox dispatcher. Claude Code uses a variation of this with explicit user approval for network/file modifications. The tension: if a workspace MCP requests unsafe capabilities, it requires an interactive prompt — which is permission fatigue.
Our current position: we rely on the G1 trust hash to prevent drive-by MCP executions, while giving power users flexibility to bring their own Docker isolation via configuration. It's not a complete answer. We're watching the WASM ecosystem mature and capability-manifest standards coalesce before committing to a single path.
Next: What Happens Inside the Sandbox
OS sandboxes draw hard boundaries around processes. But what happens when an agent has legitimate network access and gets prompt-injected? What stops it from exfiltrating secrets through an allowed HTTP channel? What prevents it from using a legitimate database credential to DROP TABLE?
Part 2B covers the application-layer defenses that operate at a higher semantic level than the kernel: input sanitization that strips invisible Unicode before the LLM sees it, SSRF protection with DNS-pinned IP denylists, phantom credential proxying that keeps real API keys outside the sandbox entirely, content-addressed trust that defeats TOCTOU attacks, and WASM capability isolation that eliminates the shell as an attack surface.
Continue to Part 2B: Application-Layer Defense →
This article is Part 2A of a four-part series on AI agent security. Part 1 covers the threat landscape — 37 vulnerabilities across 15 AI IDEs, 25 vulnerability patterns, and the 9 security gates. Part 2B covers application-layer defenses. Part 3 covers testing the sandbox with property-based fuzzing and red-team evaluations.
Based on the sandbox architecture built into OpenCode. Code refs: packages/opencode/src/sandbox/{index,bwrap,darwin,gvisor,firecracker,linux}.ts, packages/opencode/src/trust/index.ts.
The threat model in this article is informed by independent research from Mindgard's AI Red Team, who disclosed 37 vulnerabilities across 15+ AI IDE vendors. Their vulnerability pattern catalog and security checklist are available on GitHub.