Mahmoud Berkoti

Posted on Jun 7

Building a Secure Code Execution Sandbox in Rust

#rust #security #systems #linux

I got annoyed.

I was looking at how most code execution platforms handle sandboxing and kept seeing the same pattern: throw it in a Docker container, set a timeout, call it secure. That's not sandboxing, that's hoping nothing goes wrong. So I built Custody to figure out what real isolation actually looks like when you have to think about every layer.

This isn't a tutorial. It's more of a breakdown of the decisions I made, why I made them, and what I'd change if I started over.

What I was defending against

Before touching any code I wrote out the threat model. Not because I was being formal about it, but because "secure code execution" means nothing unless you're specific about what you're securing against. My list:

Fork bombs killing the host by exhausting PIDs
Memory exhaustion crashing everything outside the sandbox
Code trying to phone home over the network
Filesystem escapes reaching host data
Syscall abuse through dangerous kernel interfaces
Infinite loops sitting there consuming CPU forever

The important thing I realized early is that no single mechanism stops all of these. Docker alone doesn't. gVisor alone doesn't. You have to layer them, and each layer has to be doing something the others aren't.

The outer wall: gVisor

gVisor sits between the untrusted code and the Linux kernel. Instead of syscalls going straight to the host kernel, they hit gVisor's Sentry first, a user-space reimplementation of the kernel's syscall interface written in Go. It's slower than native execution, but for a sandbox that's a reasonable tradeoff: the untrusted code only ever talks to the Sentry, and the Sentry itself uses a much smaller slice of the host kernel.

Every job in Custody spawns through runsc with two flags I care about:

"--network=none".to_string(),
"--no-new-privs".to_string(),

--network=none is self-explanatory. No network, no exfiltration, no DNS tricks, nothing. --no-new-privs is the less obvious one. It stops any process inside the sandbox from gaining elevated privileges through setuid binaries. Without it, a clever attacker finds a setuid binary in your rootfs and suddenly your sandbox assumptions fall apart.

Resource limits: cgroups v2

gVisor handles isolation. It does nothing about resource consumption. A sandboxed process can still fork bomb itself into consuming every PID on the system if you don't enforce limits separately.

I write directly to the cgroup filesystem instead of going through a container runtime:

fs::write(
    self.cgroup_path.join("cpu.max"),
    format!("{} {}", quota_us, period_us),
)?;
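The same pattern extends to the memory and PID limits. Here is a minimal sketch, assuming the standard cgroup v2 control files memory.max and pids.max next to cpu.max. The Limits struct and apply_limits function are hypothetical names for illustration, not Custody's actual code, and the sketch assumes the cpu, memory, and pids controllers are already enabled in the parent's cgroup.subtree_control.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical per-job limits; field names are illustrative only.
pub struct Limits {
    pub cpu_quota_us: u64,
    pub cpu_period_us: u64,
    pub memory_max_bytes: u64,
    pub pids_max: u64,
}

/// Write each limit to its cgroup v2 control file under `cgroup`.
pub fn apply_limits(cgroup: &Path, limits: &Limits) -> io::Result<()> {
    // cpu.max takes "<quota> <period>", both in microseconds.
    fs::write(
        cgroup.join("cpu.max"),
        format!("{} {}", limits.cpu_quota_us, limits.cpu_period_us),
    )?;
    // memory.max is a hard ceiling in bytes.
    fs::write(cgroup.join("memory.max"), limits.memory_max_bytes.to_string())?;
    // pids.max caps the number of tasks, which is what stops a fork bomb.
    fs::write(cgroup.join("pids.max"), limits.pids_max.to_string())?;
    Ok(())
}
```

With a quota of 50000 and a period of 100000, the job gets at most half of one CPU.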

Four limits per job: CPU quota, memory ceiling, max PIDs, and wall clock timeout. The wall timeout is separate from CPU quota on purpose. A process that just sleeps uses zero CPU but sits there forever. cgroups won't catch that. So the wall timeout is enforced at the application layer with a polling loop that kills the process if it runs past the deadline.
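A wall-clock polling loop of this kind can be sketched with nothing but the standard library. This is an illustration under assumptions, not Custody's code: wait_with_wall_timeout is a hypothetical name, and it kills only the direct child, whereas a real sandbox would also need to take down everything else in the job's cgroup.

```rust
use std::io;
use std::process::Child;
use std::thread;
use std::time::{Duration, Instant};

/// Wait for `child`, killing it once `limit` of wall-clock time has passed.
/// Returns Ok(true) if the deadline was hit and the child was killed.
pub fn wait_with_wall_timeout(
    child: &mut Child,
    limit: Duration,
    poll_interval: Duration,
) -> io::Result<bool> {
    let deadline = Instant::now() + limit;
    loop {
        // try_wait returns Some(status) once the child has exited.
        if child.try_wait()?.is_some() {
            return Ok(false);
        }
        if Instant::now() >= deadline {
            child.kill()?;
            // Reap the child so it does not linger as a zombie.
            child.wait()?;
            return Ok(true);
        }
        thread::sleep(poll_interval);
    }
}
```

A sleeping process never trips the CPU quota, but it does trip this deadline, which is why the two limits are kept separate.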

Cleanup happens in the Drop impl so the cgroup gets torn down even if something panics mid-execution. That one took me longer than I'd like to admit to think of.
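The Drop-based cleanup can be sketched as a small guard type. CgroupGuard is a hypothetical name, and this is a simplified illustration rather than Custody's implementation. Two caveats worth stating: Drop runs when a panic unwinds, but not under panic = "abort" or when the process is killed outright, and the kernel refuses to remove a cgroup that still contains processes.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Owns a cgroup directory and removes it when dropped.
pub struct CgroupGuard {
    path: PathBuf,
}

impl CgroupGuard {
    /// Create the cgroup directory; on cgroupfs the kernel populates
    /// the control files automatically.
    pub fn create(path: PathBuf) -> io::Result<Self> {
        fs::create_dir(&path)?;
        Ok(Self { path })
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for CgroupGuard {
    fn drop(&mut self) {
        // A cgroup is removed with a plain rmdir. Errors are ignored here
        // because panicking inside drop during an unwind aborts the process.
        let _ = fs::remove_dir(&self.path);
    }
}
```

Because the guard is tied to a scope, every early return and every unwinding panic takes the same cleanup path.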

Syscall filtering: seccomp

This is the innermost layer. Even if someone escapes gVisor somehow, seccomp is a kernel-level allowlist of syscalls the process is permitted to make. A Python script needs maybe 30 syscalls to run normally. It does not need ptrace, mount, or socket. Block everything else at the kernel level and a lot of attack surface just disappears.

I detect violations by watching for exit code 159 (128 + 31, the conventional way of reporting death by SIGSYS) or the string "seccomp" in stderr:

if code == 159 || stderr_str.contains("seccomp") {
    kill_reason = Some("seccomp_violation".to_string());
}
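One detail worth spelling out: 159 is 128 + 31, the shell-style encoding of "killed by signal 31", and 31 is SIGSYS on x86 and ARM Linux (an assumption here; the number differs on some architectures). If you wait on the process directly, std reports the signal through ExitStatusExt::signal() and code() is None. A hedged sketch that covers both cases, with hypothetical function names:

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// SIGSYS on x86 and ARM Linux; assumed here, not portable to every arch.
pub const SIGSYS: i32 = 31;

/// True if any of the three indicators points at a seccomp kill.
pub fn looks_like_seccomp_kill(
    exit_code: Option<i32>,
    signal: Option<i32>,
    stderr: &str,
) -> bool {
    signal == Some(SIGSYS)                 // waited on the process directly
        || exit_code == Some(128 + SIGSYS) // a wrapper reported 128 + signal
        || stderr.contains("seccomp")      // the runtime logged the violation
}

/// Convenience wrapper for a status obtained from std::process.
pub fn status_looks_like_seccomp_kill(status: ExitStatus, stderr: &str) -> bool {
    looks_like_seccomp_kill(status.code(), status.signal(), stderr)
}
```

The stderr match is the weakest of the three signals, since untrusted code can print the word "seccomp" itself.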

Two things most sandboxes miss

Output floods and OOM detection. If a job writes gigabytes to stdout it can exhaust host memory even if execution itself is isolated. Custody caps output per job and truncates explicitly so the caller knows why output ended early rather than just getting a mystery empty response.
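Capping output with an explicit truncation flag can be sketched like this. read_capped is a hypothetical helper, not Custody's API; it stops reading one byte past the cap, so the memory it uses is bounded no matter how much the job writes.

```rust
use std::io::{self, Read};

/// Read at most `cap` bytes from `reader`.
/// Returns the bytes kept and whether the stream had more than `cap` bytes.
pub fn read_capped<R: Read>(reader: R, cap: usize) -> io::Result<(Vec<u8>, bool)> {
    let mut kept = Vec::new();
    // Read one extra byte so "exactly cap bytes" and "more than cap bytes"
    // can be told apart.
    reader.take(cap as u64 + 1).read_to_end(&mut kept)?;
    let truncated = kept.len() > cap;
    kept.truncate(cap);
    Ok((kept, truncated))
}
```

When the flag comes back true, the caller can kill the job and record the reason instead of returning a silently shortened result.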

OOM detection reads memory.current from the cgroup after execution finishes. Not perfect, but good enough to catch the obvious cases and label them correctly in the audit log.
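A possible complement, which is not what the post describes Custody doing, is the cgroup v2 memory.events file: it contains an oom_kill counter that the kernel increments each time it kills a process in the cgroup for exceeding its memory limit. A minimal parsing sketch, with oom_kill_count as a hypothetical name:

```rust
/// Extract the `oom_kill` counter from the contents of a cgroup v2
/// `memory.events` file. Returns None if the key is missing or malformed.
pub fn oom_kill_count(memory_events: &str) -> Option<u64> {
    memory_events.lines().find_map(|line| {
        // Each line is "<key> <count>".
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("oom_kill"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}
```

A count above zero after the job finishes is a direct statement from the kernel that an OOM kill happened, so it does not depend on catching memory usage at the right moment.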

The kill reason taxonomy

Every job exits with a labeled reason: timeout_wall, output_limit, oom, seccomp_violation, or clean. This sounds like a small thing but it matters a lot in practice. When you're looking at audit logs and trying to understand whether you're seeing a bug, a resource tuning problem, or an actual attack attempt, vague "job failed" entries are useless. Named failure modes make the logs actually tell you something.
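One way to keep those labels from drifting is to make them an enum instead of free-form strings. This is a sketch: the variant names are my own, and only the five label strings come from the list above.

```rust
use std::fmt;

/// The five outcomes a job can be labeled with.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KillReason {
    TimeoutWall,
    OutputLimit,
    Oom,
    SeccompViolation,
    Clean,
}

impl KillReason {
    /// The stable label written to the audit log.
    pub fn as_str(self) -> &'static str {
        match self {
            KillReason::TimeoutWall => "timeout_wall",
            KillReason::OutputLimit => "output_limit",
            KillReason::Oom => "oom",
            KillReason::SeccompViolation => "seccomp_violation",
            KillReason::Clean => "clean",
        }
    }
}

impl fmt::Display for KillReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}
```

With an exhaustive match, adding a new failure mode without giving it a label becomes a compile error.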

What I'd do differently

Three things I'd change in a v2.

The rootfs preparation copies binaries from the host at runtime. That's fragile. A proper implementation uses pre-built OCI images per language that are version-pinned and auditable.

The wall timeout uses a polling loop with 100ms sleeps. For short-lived jobs this adds real latency. The right approach is pidfd_open plus epoll so you get notified the instant the process exits instead of waiting for the next poll interval.

The seccomp profiles are handwritten per language. That works but it's maintenance-heavy and easy to get wrong. Better to generate them from actual runtime behavior using eBPF or strace analysis, then tighten from there.
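As a rough illustration of the strace route, the first step is collecting the set of syscall names a runtime actually uses. The sketch below assumes plain strace text output, where each call looks like name(args) = result and strace -f may prefix lines with [pid N]. syscalls_from_strace is a hypothetical helper, and it deliberately skips lines it does not recognize, such as signal notices, exit markers, and resumed calls.

```rust
use std::collections::BTreeSet;

/// Collect the distinct syscall names that appear in strace text output.
pub fn syscalls_from_strace(log: &str) -> BTreeSet<String> {
    let mut names = BTreeSet::new();
    for line in log.lines() {
        let mut rest = line.trim_start();
        // Strip the "[pid 1234] " prefix that strace -f adds.
        if let Some(after) = rest.strip_prefix("[pid") {
            match after.split_once(']') {
                Some((_, tail)) => rest = tail.trim_start(),
                None => continue,
            }
        }
        // A syscall line starts with an identifier followed by '('.
        let Some((name, _)) = rest.split_once('(') else {
            continue;
        };
        if !name.is_empty()
            && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
        {
            names.insert(name.to_string());
        }
    }
    names
}
```

The resulting set is only a starting allowlist: it reflects the code paths the traced run happened to hit, so it still needs review and tightening by hand.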

How it held up

Custody contained all 8 attack scenarios I threw at it, including fork bombs, network exfiltration attempts, and OOM conditions. The architecture is deliberately simple. For security-critical code I'd rather have something I can fully audit in an afternoon than something clever I have to trust blindly.
