Dipankar Sarkar

Posted on Jul 5

Sandboxing untrusted agent code with gVisor costs ~200ms per cold start. Blocking syscalls instead of emulating them costs ~8ms

#ai #agents #devops #opensource

You are running code you did not write. It might be an AI agent executing an
LLM's output, a CI job running npm install across dependencies nobody audited,
or a plugin that insists it needs shell access. A normal container does almost
nothing for you here. It is namespaces and cgroups, and the full kernel attack
surface is still sitting right there. Every runc escape CVE is a reminder.

The usual heavy answer is gVisor. It puts a userspace kernel in front of the
workload and emulates syscalls. It works. It also costs you 5-250x syscall
overhead, roughly 200ms cold starts, and about 50MB per container. For a
high-throughput API or a serverless function, that overhead is the whole budget.

zviz takes a different route. It is an OCI-compatible container runtime
written in Zig that uses selective denial instead of emulation. Most syscalls
reach the host kernel at native speed. A small set of dangerous ones are blocked
by a seccomp filter at syscall entry, before the kernel's handler for them runs.
One, socket, is argument-filtered inline. No userspace kernel, no emulation, no
daemon.

The core idea: allow, deny, broker

gVisor's model is that every syscall goes through its Sentry process, which
emulates it. zviz's model is a filter that makes one of three decisions per
syscall:

gVisor:  App -> Sentry (emulates ~300 syscalls) -> Host kernel (~70 syscalls)
zviz:    App -> BPF filter -> ALLOW (native speed) / DENY (EPERM) / BROKER (mediated)

Allowed syscalls hit the kernel directly, so they run at native speed. Dangerous
ones are denied immediately with EPERM, so exploit code fails on the spot rather
than being safely emulated. A tiny set gets routed to a userspace broker for
inspection. The socket syscall is filtered on its arguments inline.

The philosophical difference matters for compatibility. gVisor emulates a
dangerous syscall safely. zviz refuses it. Both isolate, but the failure modes
are opposite: emulation stays compatible, denial stays strict and fast.
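
To make the three-way decision concrete, here is a minimal Python sketch of the policy logic. It is a model, not zviz's filter: the real thing is a BPF program, and the syscall sets here are assumptions for illustration. The post only names ptrace, mount, and unshare as denied and socket as argument-filtered. The brokered entry is a hypothetical placeholder, and the rule that AF_UNIX is allowed while AF_INET depends on the profile is my guess at what "argument-filtered" could look like.

```python
from enum import Enum

# Linux address-family constants.
AF_UNIX, AF_INET = 1, 2


class Verdict(Enum):
    ALLOW = "allow"    # reaches the host kernel at native speed
    DENY = "deny"      # fails with EPERM
    BROKER = "broker"  # handed to a userspace broker for inspection


# Assumption: illustrative sets only, not the real zviz lists.
DENIED = {"ptrace", "mount", "unshare"}
BROKERED = {"example_brokered_call"}  # hypothetical placeholder name


def decide(syscall, args=(), allow_network=False):
    """Return the verdict a selective-denial policy gives one syscall."""
    if syscall in DENIED:
        return Verdict.DENY
    if syscall in BROKERED:
        return Verdict.BROKER
    if syscall == "socket":
        # Inline argument filter: look at the address family (first argument).
        family = args[0] if args else None
        if family == AF_UNIX or (family == AF_INET and allow_network):
            return Verdict.ALLOW
        return Verdict.DENY
    # Everything else goes straight to the kernel.
    return Verdict.ALLOW
```

With this model, decide("clock_gettime") is an ALLOW, decide("ptrace") is a DENY, and decide("socket", (AF_INET,)) flips from DENY to ALLOW only when the caller opts into networking.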

How it works

The enforcement is five layers, applied in a specific order, and the order is
the interesting part:

Layer	Mechanism	Purpose
1	Namespaces (user, pid, mount, ipc, uts)	Resource isolation
2	Capabilities (all 41 dropped)	Privilege elimination
3	Landlock LSM	Filesystem access control
4	Seccomp-BPF (124 instructions)	Syscall filtering
5	cgroups v2	Resource limits

Capabilities drop before seccomp loads, and Landlock applies before seccomp,
so the security setup syscalls do not get caught by the very filter they are
installing. Get that ordering wrong and the runtime blocks itself while arming.
The default profile drops all 41 Linux capabilities, applies a Landlock ruleset,
mounts /proc, /sys, and /dev privately, and runs the workload as PID 1 of a
fresh user, PID, mount, IPC, and UTS namespace.
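
The ordering problem is easy to see in a toy model. The Python sketch below walks a list of setup steps and reports the first one that the runtime's own filter would block. The syscall names are real Linux calls, but the step names and the assumption that the filter denies capset and the landlock_* calls are mine, chosen only to illustrate the failure mode.

```python
# Syscalls each setup step needs. Step names are hypothetical.
SETUP_STEPS = {
    "drop_capabilities": {"capset", "prctl"},
    "apply_landlock": {
        "landlock_create_ruleset",
        "landlock_add_rule",
        "landlock_restrict_self",
    },
    "load_seccomp": {"seccomp"},
}

# Assumption: once loaded, the filter denies these setup syscalls.
FILTER_DENIES = {
    "capset",
    "landlock_create_ruleset",
    "landlock_add_rule",
    "landlock_restrict_self",
}


def first_self_blocked_step(order):
    """Return the first step blocked by the already-loaded filter, or None."""
    filter_active = False
    for step in order:
        if filter_active and SETUP_STEPS[step] & FILTER_DENIES:
            return step
        if step == "load_seccomp":
            filter_active = True
    return None


good = ["drop_capabilities", "apply_landlock", "load_seccomp"]
bad = ["load_seccomp", "drop_capabilities", "apply_landlock"]
print(first_self_blocked_step(good))  # None
print(first_self_blocked_step(bad))   # drop_capabilities
```

Loading seccomp last means every earlier step runs unfiltered; loading it first means the runtime trips over its own denial list on the very next step.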

The whole seccomp policy is 124 BPF instructions. That is small enough to audit
by hand, which is a real security property. You can read the exact filter that
stands between untrusted code and your kernel.

Running it

You build with Zig 0.15.0+ on Linux 5.13+ (5.13 is where Landlock landed), then
run any OCI bundle. The README's example extracts a Redis image and runs it:

git clone https://github.com/Skelf-Research/zviz.git
cd zviz && zig build -Doptimize=ReleaseSafe

# Build an OCI bundle from any Docker image
mkdir -p ~/zviz-bundle/rootfs
docker create --name extract redis:alpine
docker export extract | tar -C ~/zviz-bundle/rootfs -xf -
docker rm extract

# Run it; --verbose logs every blocked syscall
./zig-out/bin/zviz run my-container ~/zviz-bundle --verbose

The --verbose flag logs every blocked syscall, which is exactly what you need
when an agent workload hits a restriction you did not expect. There are built-in
profiles for the common cases: ci-runner (the balanced default), web-server
(network allowed), batch-job (no network, 8G memory), hostile-tenant
(maximum restrictions), and development (allows ptrace, explicitly not for
production).

zviz auto-mounts the pseudo-filesystems so you do not have to declare them:
/proc as procfs with nosuid, nodev, and noexec; /sys as read-only sysfs; and
/dev as a private tmpfs with the standard device nodes bind-mounted in.

The numbers

The README reports these, tested against real escape techniques:

Metric	zviz	gVisor
Escape tests blocked	19/19 (100%)	11/19 (58%)*
Cold start	~8ms	~200ms
Memory per container	~2MB	~50MB
Policy compatibility	98.2%	baseline

Syscall latency is where selective denial pays off. clock_gettime is 20ns on
zviz versus 4,982ns on gVisor, a 249x gap, because zviz lets it hit the kernel
directly while gVisor routes it through Sentry. read is 20.7x faster, getpid
4.1x. The asterisk on the escape numbers is important, and the README is honest
about it: gVisor "allows" some syscalls, such as ptrace and mount, but emulates
them safely. For those operations that is a different philosophy with an
equivalent security outcome, not a straight loss.
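
To see what those figures mean for a budget, here is the arithmetic as a small Python sketch. The inputs are the approximate numbers quoted above. The 50-sandbox run and the one-GiB memory budget are hypothetical scenarios of mine, and the density estimate ignores everything except the per-container figure.

```python
# Approximate figures quoted from the README in this post.
GVISOR = {"cold_start_ms": 200, "memory_mb": 50, "clock_gettime_ns": 4982}
ZVIZ = {"cold_start_ms": 8, "memory_mb": 2, "clock_gettime_ns": 20}


def latency_ratio():
    """How many times slower clock_gettime is on gVisor than on zviz."""
    return GVISOR["clock_gettime_ns"] / ZVIZ["clock_gettime_ns"]


def cold_start_total_ms(runtime, sandboxes):
    """Total cold-start time if every step gets a fresh sandbox."""
    return runtime["cold_start_ms"] * sandboxes


def containers_per_gib(runtime):
    """Rough container count that fits in 1 GiB of memory."""
    return 1024 // runtime["memory_mb"]


print(round(latency_ratio(), 1))        # 249.1
print(cold_start_total_ms(GVISOR, 50))  # 10000
print(cold_start_total_ms(ZVIZ, 50))    # 400
print(containers_per_gib(GVISOR))       # 20
print(containers_per_gib(ZVIZ))         # 512
```

For a run that spins up 50 fresh sandboxes, that is about ten seconds of cold-start time on gVisor versus under half a second on zviz, before any workload code runs.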

Where it does not fit

The README has a whole "when to use gVisor instead" section, which is the right
instinct.

If your workload needs ptrace, zviz blocks it. strace, debuggers, and anything
that traces another process will not run. gVisor emulates it safely, so for
debugging-heavy workloads gVisor wins. If you need mount or unshare for
Docker-in-Docker, or you run Bazel or Nix builds that create their own internal
namespaces, zviz denies the syscalls those need. Nested containerization is a
gVisor job.

The 1.8% policy gap is a deliberate choice: zviz defaults network egress to
deny, gVisor allows it. That is stricter, but it means a workload that expects
outbound network fails closed until you pick a profile that opens it. On Ubuntu
24.04+ there is an extra setup step, because the kernel's
apparmor_restrict_unprivileged_userns sysctl blocks the bind mount
pivot_root needs. Without installing the provided AppArmor profile, zviz falls
back to chdir-only filesystem isolation, which is weaker. And it is Linux-only,
kernel 5.13+, cgroups v2 required. There is no macOS or Windows story.
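
Those requirements are cheap to check before you try a run. The Python sketch below is a hypothetical preflight script of my own, not something zviz ships. It assumes the sysctl is exposed at the usual /proc/sys/kernel path and uses the presence of cgroup.controllers at the cgroup root as the marker for a unified cgroups v2 hierarchy.

```python
import os
import re
from pathlib import Path

# Assumption: standard locations on a typical Linux host.
USERNS_SYSCTL = Path("/proc/sys/kernel/apparmor_restrict_unprivileged_userns")
CGROUP_V2_MARKER = Path("/sys/fs/cgroup/cgroup.controllers")


def kernel_at_least(release, minimum=(5, 13)):
    """Compare a release string such as '6.8.0-45-generic' to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def userns_restricted(sysctl_path=USERNS_SYSCTL):
    """True when the AppArmor unprivileged-userns restriction is enabled."""
    try:
        return sysctl_path.read_text().strip() == "1"
    except OSError:
        return False  # sysctl not present on this host


def preflight(release=None):
    """Collect the three checks into one report."""
    release = release or os.uname().release
    return {
        "kernel_5_13_plus": kernel_at_least(release),
        "cgroups_v2": CGROUP_V2_MARKER.exists(),
        "needs_apparmor_profile": userns_restricted(),
    }


if __name__ == "__main__":
    for check, value in preflight().items():
        print(f"{check}: {value}")
```

If needs_apparmor_profile comes back True, that is the case where skipping the provided AppArmor profile leaves you on the weaker chdir-only fallback.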

The honest framing from the README: if you need nested containers or process
tracing, use gVisor. Otherwise zviz is faster and stricter.

Takeaways

Selective denial beats emulation on speed because allowed syscalls hit the kernel at native speed. That is the 249x clock_gettime gap.
Denial also fails safer for exploits. Malicious code gets EPERM immediately rather than a safely-emulated success.
Layer ordering is load-bearing. Capabilities and Landlock go before seccomp so the runtime does not block its own setup.
The cost is compatibility. No ptrace, no nested containers, no Bazel/Nix internal sandboxing. That is the trade for ~8ms cold starts and a 124-instruction filter you can actually read.

If you run untrusted or agent-generated code and your workloads do not need
ptrace or nested containers, zviz is worth a benchmark against your current
sandbox. Code, the threat model, and the comparison docs are here:
https://github.com/Skelf-Research/zviz

I am curious which blocked syscall trips up the first real agent workload you
throw at it. Run it with --verbose and open an issue with the trace.

Top comments (1)

Max Quimby • Jul 7

The allow/deny/broker split is the right frame, and I think the compatibility-vs-strictness tradeoff is even sharper for agent workloads than for CI. A CI job that hits EPERM just fails and a human reads the log. An agent that hits EPERM often reasons around it — it rewrites the code, tries a different syscall path, or decides the sandbox is "broken" and burns tokens looping. So --verbose logging every blocked syscall isn't just ops hygiene, it's the signal you feed back into the agent's context so it stops fighting the wall. The 200ms cold start matters more than it looks, too: if you sandbox per tool-call instead of per session, that's 200ms multiplied by every step in a long run, and it dominates wall-clock fast. Question on the broker path: for the argument-filtered socket case, can a workload request an allowlisted egress (one API host), or is it all-or-nothing per profile? Egress is usually where "hostile-tenant" actually leaks.