I'm Building a Forking Container in C++: Here's What I Learned

#cpp #containers #linux #showdev

I'm building a container runtime from scratch in C++ to learn how containers actually work under the hood — no Docker, just raw Linux syscalls.
The end goal is forking a running sandbox in milliseconds using copy-on-write memory: keep one warm, fork a copy on demand, and anything that needs a clean isolated environment — like an AI agent running code — skips most of the startup cost. That's the idea behind the name.
I'm not there yet. Before you can fork a sandbox you need a sandbox, and getting the basics working threw three bugs at me that taught me most of what I now understand about Linux isolation. This post walks through each one.

To be clear about scope: this is a basic project built on raw Linux syscalls, and I'm making it mainly to learn how containers work under the hood.

Here's what works so far. It can take any command, run it as a child process, capture its output and exit code, and hand it a fake root filesystem it can't see outside of. So a command runs in its own separate environment with no view of my real files. The code is here:

⭐ Repo: github.com/Devansh-jpg/ForkCage

The rest of this post covers those three bugs in order, then what I'm building next.

Bug 1: the two-pipe deadlock

This one is basic, so I'll keep it quick. To run a command I fork() a child, execvp() the command in it, and capture its output through a pipe(). My first version read all of stdout, then read stderr. It worked on small output and hung forever on anything large.

The reason: a pipe has a limited kernel buffer (64 KiB by default on Linux), and a writer blocks once it's full. The child filled the stderr buffer and blocked, waiting for me to read stderr, but I was still reading stdout, waiting for an EOF that only arrives once the child closes its end or exits. Both sides waited on each other forever.
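To see the buffer limit on your own machine, here is a minimal Linux-only sketch. The function names are mine for illustration and are not part of ForkCage. One function asks the kernel for the pipe's capacity with fcntl(F_GETPIPE_SZ), the other fills a non-blocking pipe until a write would have blocked.

```cpp
#include <fcntl.h>    // fcntl, F_GETPIPE_SZ, O_NONBLOCK (Linux-specific; g++ defines _GNU_SOURCE)
#include <unistd.h>   // pipe, pipe2, write, close
#include <cerrno>

// Capacity in bytes the kernel reports for a fresh pipe, or -1 on error.
int pipe_capacity() {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    int capacity = fcntl(fds[1], F_GETPIPE_SZ);
    close(fds[0]);
    close(fds[1]);
    return capacity;
}

// Writes into a pipe nobody reads from and counts the bytes accepted
// before a blocking writer would have stalled. Returns -1 on error.
long bytes_until_pipe_full() {
    int fds[2];
    if (pipe2(fds, O_NONBLOCK) != 0) return -1;
    char chunk[4096] = {};
    long total = 0;
    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof chunk);
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;
        break;  // EAGAIN: the buffer is full
    }
    close(fds[0]);
    close(fds[1]);
    return total;
}
```

On a default Linux setup I'd expect both functions to report 65536, but treat that as an expectation to check rather than a guarantee, since the capacity can be changed per pipe.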

The fix is to drain both pipes at the same time instead of one after the other:

std::thread stdout_reader([&]{ stdout_output = read_all(stdout_pipe[0]); });
std::thread stderr_reader([&]{ stderr_output = read_all(stderr_pipe[0]); });
stdout_reader.join();
stderr_reader.join();

That's the whole bug. If you want to go deeper on pipes, file descriptors, and why Python's subprocess.communicate() exists for the same reason, I've linked some resources at the end.
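Threads aren't the only way to drain both pipes. As an alternative sketch, and not the approach used above, a single thread can wait on both read ends with poll() and read whichever one has data. The struct and function names here are hypothetical, and error handling is kept minimal.

```cpp
#include <poll.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>
#include <string>
#include <vector>

struct CommandResult {
    std::string out;
    std::string err;
    int exit_code = -1;  // stays -1 if the child could not be started or did not exit normally
};

// Runs args[0] with the given arguments and captures stdout and stderr
// concurrently, so neither pipe can fill up while we wait on the other.
CommandResult run_and_capture(const std::vector<std::string>& args) {
    CommandResult result;
    if (args.empty()) return result;
    int out_pipe[2], err_pipe[2];
    if (pipe(out_pipe) != 0) return result;
    if (pipe(err_pipe) != 0) { close(out_pipe[0]); close(out_pipe[1]); return result; }

    // Build argv before fork() so the child only makes syscalls.
    std::vector<char*> argv;
    for (const auto& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) {
        close(out_pipe[0]); close(out_pipe[1]);
        close(err_pipe[0]); close(err_pipe[1]);
        return result;
    }
    if (pid == 0) {
        dup2(out_pipe[1], STDOUT_FILENO);
        dup2(err_pipe[1], STDERR_FILENO);
        close(out_pipe[0]); close(out_pipe[1]);
        close(err_pipe[0]); close(err_pipe[1]);
        execvp(argv[0], argv.data());
        _exit(127);  // exec failed
    }

    // Parent: close the write ends, otherwise EOF never arrives.
    close(out_pipe[1]);
    close(err_pipe[1]);

    pollfd fds[2] = {{out_pipe[0], POLLIN, 0}, {err_pipe[0], POLLIN, 0}};
    std::string* sinks[2] = {&result.out, &result.err};
    int open_count = 2;
    char buf[4096];
    while (open_count > 0) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR) continue;
            break;
        }
        for (int i = 0; i < 2; ++i) {
            if (fds[i].fd < 0 || fds[i].revents == 0) continue;
            ssize_t n = read(fds[i].fd, buf, sizeof buf);
            if (n > 0) {
                sinks[i]->append(buf, static_cast<size_t>(n));
            } else if (n == 0 || errno != EINTR) {
                close(fds[i].fd);
                fds[i].fd = -1;  // poll() ignores negative descriptors
                --open_count;
            }
        }
    }
    for (auto& p : fds) if (p.fd >= 0) close(p.fd);

    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result.exit_code = WEXITSTATUS(status);
    return result;
}
```

Calling run_and_capture({"sh", "-c", "echo hi; echo oops >&2"}) should fill out and err separately. The trade-off against the two-thread version is a bit more bookkeeping in exchange for not needing threads at all.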

At this point I had a working launcher, but the command could still see my entire machine. /home, /etc/passwd, my SSH keys, all of it. So the next job was to lock that down.

Bug 2: the jail where nothing would run

To hide the host filesystem, I set up a fake root directory and call chroot() on it. After chroot("/tmp/ForkCage-root"), that directory becomes / for the process, and it can't name anything above it.

I made the fake root, copied /bin/sh into it, chrooted in, and ran sh:

chroot: failed to run command '/bin/sh': No such file or directory

But /bin/sh was right there. I could list it. The error made no sense until I understood what it was actually complaining about.

/bin/sh is dynamically linked. When it starts, the kernel first loads its interpreter, the dynamic linker ld-linux-x86-64.so.2, and the dynamic linker then loads every shared library the binary needs, like libc.so.6. Those live in /lib and /lib64, which didn't exist inside my fake root. The No such file or directory wasn't about sh. It was the kernel failing to find the interpreter that sh depends on.
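The interpreter path is stored inside the binary itself, in a program header of type PT_INTERP. Here is a small sketch that reads it back out. It assumes a 64-bit ELF file with the same byte order as the host, uses the Linux <elf.h> definitions, and the function name is hypothetical.

```cpp
#include <elf.h>      // Elf64_Ehdr, Elf64_Phdr, PT_INTERP, ELFMAG
#include <cstring>
#include <fstream>
#include <string>

// Returns the interpreter path recorded in the file's PT_INTERP segment,
// or an empty string if there is none or the file can't be parsed.
std::string elf_interpreter(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    Elf64_Ehdr header{};
    if (!file.read(reinterpret_cast<char*>(&header), sizeof header)) return "";
    if (std::memcmp(header.e_ident, ELFMAG, SELFMAG) != 0) return "";
    if (header.e_ident[EI_CLASS] != ELFCLASS64) return "";

    for (unsigned i = 0; i < header.e_phnum; ++i) {
        Elf64_Phdr ph{};
        file.seekg(static_cast<std::streamoff>(
            header.e_phoff + static_cast<Elf64_Off>(i) * header.e_phentsize));
        if (!file.read(reinterpret_cast<char*>(&ph), sizeof ph)) return "";
        if (ph.p_type != PT_INTERP) continue;
        if (ph.p_filesz == 0 || ph.p_filesz > 4096) return "";  // sanity limit

        std::string interp(ph.p_filesz, '\0');
        file.seekg(static_cast<std::streamoff>(ph.p_offset));
        if (!file.read(&interp[0], static_cast<std::streamsize>(interp.size()))) return "";
        return std::string(interp.c_str());  // stored NUL-terminated; trim at the NUL
    }
    return "";  // no PT_INTERP: statically linked, or not an executable
}
```

If the path this returns doesn't exist inside the jail, exec fails with the same misleading No such file or directory.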

ldd shows exactly what a binary needs:

$ ldd /bin/sh
    linux-vdso.so.1 (0x00007fff...)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f...)
    /lib64/ld-linux-x86-64.so.2 (0x00007f...)

So my setup code parses ldd for each binary and copies every .so it depends on into the jail at the same path:

for (const auto& bin : binaries) {
    copy_into_jail(bin);                       // the binary
    for (const auto& lib : shared_libs(bin))   // and every library it needs
        copy_into_jail(lib);
}
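The parsing step could look something like the sketch below. This is a hypothetical helper, not necessarily what ForkCage's shared_libs does: it takes the text that ldd printed and keeps only the absolute paths, assuming paths contain no spaces.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Extracts absolute library paths from the text printed by `ldd`.
std::vector<std::string> parse_ldd_output(const std::string& text) {
    std::vector<std::string> paths;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream words(line);
        std::string first, arrow, target;
        words >> first >> arrow >> target;
        if (arrow == "=>" && !target.empty() && target[0] == '/') {
            paths.push_back(target);  // "libc.so.6 => /lib/.../libc.so.6 (addr)"
        } else if (!first.empty() && first[0] == '/') {
            paths.push_back(first);   // "/lib64/ld-linux-x86-64.so.2 (addr)"
        }
        // Everything else is skipped: linux-vdso has no file on disk,
        // and "=> not found" lines have nothing to copy.
    }
    return paths;
}
```

For the ldd /bin/sh output shown earlier, this would return the libc path and the ld-linux path and skip linux-vdso.so.1, which the kernel provides and which has no file to copy.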

This is what a minimal container base image like Alpine actually is: a small set of binaries plus the libraries they depend on. Copying a dynamically linked binary alone is never enough on Linux, which is also why statically linked Go binaries are popular for containers: they carry no shared-library dependencies.

Bug 3: the sandbox was leaking mounts to the host

Before the chroot, I give the process its own mount namespace with unshare(CLONE_NEWNS). The point of a mount namespace is that anything I mount inside the sandbox stays inside the sandbox.

By default it often doesn't. The kernel's own default is private, but systemd remounts everything as "shared" at boot, so on most modern distros a new mount namespace inherits shared mounts, and mount and unmount events propagate between it and the host. A mount I set up inside the sandbox could show up on the host, and tearing it down inside could remove a mount the host was using. The isolation had a hole in it, and I didn't notice for a while because running a plain sh never mounts anything.
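One way to check for this without mounting anything is to read /proc/self/mountinfo, where a shared mount carries a shared:N tag in its optional fields. A minimal sketch, with hypothetical function names:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// True if one mountinfo line describes a shared mount. The optional fields
// sit between the six fixed leading fields and the "-" separator.
bool mountinfo_line_is_shared(const std::string& line) {
    std::istringstream words(line);
    std::string field;
    // Skip: mount id, parent id, major:minor, root, mount point, mount options.
    for (int i = 0; i < 6; ++i)
        if (!(words >> field)) return false;
    while (words >> field && field != "-")
        if (field.compare(0, 7, "shared:") == 0) return true;
    return false;
}

// True if any mount at "/" in the calling process's namespace is shared.
bool root_mount_is_shared() {
    std::ifstream mountinfo("/proc/self/mountinfo");
    std::string line;
    while (std::getline(mountinfo, line)) {
        std::istringstream words(line);
        std::string id, parent, dev, root, mount_point;
        words >> id >> parent >> dev >> root >> mount_point;
        if (mount_point == "/" && mountinfo_line_is_shared(line)) return true;
    }
    return false;
}
```

Calling root_mount_is_shared() in the child right after unshare(CLONE_NEWNS) would have flagged the problem: if it returns true, mount events can still reach the host.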

The fix is one line, run right after creating the namespace:

// Mark everything under "/" private so mount events don't propagate to the host.
mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr);

MS_PRIVATE stops the propagation. MS_REC applies it to every mount under /, not just the root. A lot of from-scratch container tutorials skip this and are quietly broken because of it. I only found it by reading how runc and nsjail do the same step.

The full sequence the child runs before exec, in order:

unshare(CLONE_NEWNS);                                         // 1. own mount namespace
mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr);   // 2. stop leaks to host
chroot(root_path_.c_str());                                   // 3. fake root
chdir("/");                                                   // 4. move into it
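Every one of those calls can fail, and a failure that goes unnoticed means the command runs with less isolation than intended. Here is a sketch of the same four steps with return values checked, wrapped in a fork so the caller's own process is untouched. The function name and the step-number convention are mine, not ForkCage's.

```cpp
#include <sched.h>       // unshare, CLONE_NEWNS (g++ defines _GNU_SOURCE)
#include <sys/mount.h>   // mount, MS_REC, MS_PRIVATE
#include <sys/wait.h>    // waitpid
#include <unistd.h>      // fork, chroot, chdir, _exit

// Runs the four isolation steps in a forked child and reports the result:
// 0 if all succeeded, 1-4 for the first step that failed, -1 if fork or
// waitpid itself failed.
int first_failing_jail_step(const char* root_path) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (unshare(CLONE_NEWNS) != 0) _exit(1);
        if (mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr) != 0) _exit(2);
        if (chroot(root_path) != 0) _exit(3);
        if (chdir("/") != 0) _exit(4);
        // A real launcher would execvp() the sandboxed command here.
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

Run as a normal user, I'd expect this to report step 1, since unshare(CLONE_NEWNS) is the first call that needs privilege.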

Two things I want to be honest about

chroot on its own is escapable. A process running as root inside a chroot can break out with a second chroot and some directory traversal. The proper fix is pivot_root, which swaps the root and lets you unmount the old one. chroot is a fine place to start learning; pivot_root is the real answer, and it's on my list.
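For reference, here is a rough sketch of what the pivot_root version might look like, following the pivot_root(".", ".") technique described in the pivot_root(2) man page. It isn't in ForkCage yet, the function name is hypothetical, and it needs the same privilege as the chroot version.

```cpp
#include <sched.h>        // unshare, CLONE_NEWNS
#include <sys/mount.h>    // mount, umount2, MS_BIND, MNT_DETACH
#include <sys/syscall.h>  // SYS_pivot_root
#include <sys/wait.h>
#include <unistd.h>       // fork, chdir, syscall, _exit

// Swaps the root to new_root inside a forked child. Returns 0 on success,
// 1-7 for the first failing step, -1 if fork or waitpid failed.
int first_failing_pivot_step(const char* new_root) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (unshare(CLONE_NEWNS) != 0) _exit(1);
        if (mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr) != 0) _exit(2);
        // pivot_root needs new_root to be a mount point; bind-mounting it
        // onto itself makes it one.
        if (mount(new_root, new_root, nullptr, MS_BIND | MS_REC, nullptr) != 0) _exit(3);
        if (chdir(new_root) != 0) _exit(4);
        // Called through syscall(); with "." for both arguments the old root
        // ends up stacked at the same path as the new one.
        if (syscall(SYS_pivot_root, ".", ".") != 0) _exit(5);
        // Detach the old root so nothing from the host stays reachable.
        if (umount2(".", MNT_DETACH) != 0) _exit(6);
        if (chdir("/") != 0) _exit(7);
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

The difference from chroot is step 6: once the old root is unmounted, there is no outer filesystem left in the namespace for a second chroot to climb back into.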

unshare(CLONE_NEWNS) also needs privilege. As a normal user you get Operation not permitted, because creating a mount namespace needs CAP_SYS_ADMIN. The clean fix is a user namespace, which makes you root inside the sandbox while staying unprivileged outside. That's next.
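A minimal sketch of that idea, assuming the kernel allows unprivileged user namespaces (some distros and container runtimes disable them). The helper names are hypothetical. It creates the user and mount namespaces together, then maps UID and GID 0 inside to the caller's real IDs outside.

```cpp
#include <fcntl.h>      // open, O_WRONLY
#include <sched.h>      // unshare, CLONE_NEWUSER, CLONE_NEWNS
#include <sys/wait.h>
#include <unistd.h>
#include <string>

// Writes text to an existing file such as /proc/self/uid_map.
static bool write_file(const char* path, const std::string& text) {
    int fd = open(path, O_WRONLY);
    if (fd < 0) return false;
    ssize_t n = write(fd, text.data(), text.size());
    close(fd);
    return n == static_cast<ssize_t>(text.size());
}

// Returns 0 if the child became root inside a new user namespace,
// 1-5 for the first failing step, -1 if fork or waitpid failed.
int first_failing_userns_step() {
    // Build the mappings before unshare(): afterwards the IDs read as the
    // overflow ID until a mapping is written.
    const std::string uid_map = "0 " + std::to_string(geteuid()) + " 1";
    const std::string gid_map = "0 " + std::to_string(getegid()) + " 1";

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0) _exit(1);
        // An unprivileged process must deny setgroups before writing gid_map.
        if (!write_file("/proc/self/setgroups", "deny")) _exit(2);
        if (!write_file("/proc/self/uid_map", uid_map)) _exit(3);
        if (!write_file("/proc/self/gid_map", gid_map)) _exit(4);
        if (getuid() != 0) _exit(5);
        // From here mount(), chroot() and friends work inside the namespace.
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

Because the mount namespace is created in the same unshare() call as the user namespace, the process holds CAP_SYS_ADMIN over it without ever being root on the host.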

What's next

The process can't see the host's files anymore, but it can still see the host's processes, and it can still use all the CPU and memory on the machine. Next I'm adding PID namespaces so it only sees its own process tree, user namespaces so it can run unprivileged, and cgroups to cap CPU and memory. After that, the feature I actually started this for: forking a running sandbox in milliseconds using copy-on-write memory.

If this is your kind of thing, the repo is here and I'm writing up each part as I go. Please read the code, and star the repo if you find it useful.

⭐ github.com/Devansh-jpg/ForkCage

Resources

If you want to dig into the basics behind Bug 1 and the rest:

man 2 pipe, man 2 fork, man 2 execve, man 2 dup2, man 7 pipe for the raw mechanics.
Python's subprocess.communicate() docs, which call out the exact deadlock from Bug 1.

Corrections are welcome.