Faizan Firdousi

Build a Container from Scratch in Go (Modern Namespaces + cgroup v2)

So I built a container from scratch in Go, in as few lines of code as possible, and along the way learned what actually happens inside a container: the things Docker abstracts away.

There are other articles on this topic, but changes in the Linux kernel (most notably the move to cgroup v2) have left many of them outdated in both their explanations and their code, which is why I wrote this one.

What are containers?

Let's understand what containers are. As you might already know, fundamentally, containers are "about bundling up dependencies so we can ship code around in a repeatable, secure way."

But let's understand the underlying concepts - Namespaces, Cgroups, and Filesystem isolation that make containers what they are.

Namespaces

Linux namespaces are a fundamental kernel feature that provides resource isolation so different sets of processes see different sets of resources.

Namespaces are central to everything that follows; they're the core technology containers are built on, and we'll lean on them heavily. Whenever you create containers with Docker or Podman, it automatically creates namespaces on your behalf.

Namespaces come in six primary types (we'll cover them in more depth while coding the container; I like learning by building things, so I won't bombard you with everything up front):

  • PID - Assigns independent process IDs. The first process in a new namespace gets PID 1.
  • Network - Provides an independent network stack via virtual ethernet pairs.
  • MNT - Maintains an independent list of mount points, allowing filesystem mounts/unmounts without affecting the host.
  • USER - Has its own user IDs and group IDs, allowing a process to have root privilege within its namespace without having it elsewhere.
  • IPC - Isolates interprocess communication resources like POSIX message queues.
  • UTS - Allows different host and domain names for different processes on the same system.

Cgroups

Cgroups (control groups) are easy to understand at a surface level: they're a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, and network) of collections of processes.

We could talk a lot about cgroups, but I'll stick to the basics required for implementation; don't worry, we'll come back to them in Phase 4.

I've split the code into phases, so we can go one step at a time and understand each deeply.

Here's the GitHub link for the full code (please give it a star):
https://github.com/faizanfirdousi/container-from-scratch

Phase 0

First, let me clarify exactly what we're doing here. Our main goal is to create a container that is isolated from the host machine. In other words, we're building a sandboxed environment that runs a shell with its own filesystem, process namespace, and resource limits, giving it strong isolation from the host.

The container could hold anything. Here, for simplicity, we'll build an Alpine Linux container: it's small, yet complete enough to feel like a whole system. By the end, we'll have a container, or really just a process, that thinks the Alpine directory is the entire computer.

For that, we need to have the root filesystem of Alpine, so set this up before starting to code:

# Download and extract Alpine Linux rootfs
docker export $(docker create alpine) -o alpine.tar
mkdir -p /home/faizan/alpine-rootfs
tar -xf alpine.tar -C /home/faizan/alpine-rootfs

If you don't have Docker, you can instead download the Alpine minirootfs tarball from the official Alpine website and extract it into the same directory. I used Docker purely for convenience.

Phase 1

First, we're going to create a simple program that runs commands, but it won't have any isolation yet. This might seem counterintuitive, but there's a good reason: by seeing what happens without isolation, you'll truly understand what each isolation primitive does when we add it later.

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintf(os.Stderr, "Usage: %s run <cmd> [args...]\n", os.Args[0])
        os.Exit(1)
    }

    switch os.Args[1] {
    case "run":
        run()
    default:
        panic("Unknown command")
    }
}

func run() {
    fmt.Printf("Running %v\n", os.Args[2:])

    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    must(cmd.Run())
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}

What this code does

The program starts by checking the arguments passed to it. The first argument must be run, which tells our program that we want to execute a command. Everything after run becomes the actual command and its arguments, for example /bin/bash or /bin/sh.

When the run function is called, it uses Go's exec.Command to create a new process. The way this works is straightforward: os.Args[2] contains the command itself (like /bin/bash), and os.Args[3:] contains any additional arguments you want to pass to that command. The ... syntax spreads those arguments out so they're passed individually to the command.

We then connect the standard input, output, and error streams. This is important because it allows the command to interact with your terminal naturally. You can type input and see output just like you would with any normal command. Finally, cmd.Run() actually executes the command and waits for it to complete.

Test

go run main.go run /bin/bash

You should see something like this:

Running [/bin/bash]
[root@archlinux cfs]#

Test if it's isolated by running commands like ps and hostname.

Now let me show you what happens when I test this. Run ps inside the shell and look closely at the output.

Notice something interesting? The PIDs don't start from 1 like they would in a real container. Instead, you're seeing PIDs like 46628, 46629, 46669; these are the exact same process IDs that the host system sees. If you open another terminal window on your host and run ps aux there, you'll see the exact same PID values. This proves we're sharing the host's process namespace.

Now let's check the hostname.

Run hostname and you'll see your machine's actual hostname. Try changing it with sudo hostname test-container, then run hostname again to confirm it changed. Exit the shell and check your host's hostname: it's been changed there too! We just modified the host system's hostname directly. There's no isolation barrier protecting the host from changes made inside our "container."

The same goes for the filesystem. When you run ls /, you're looking at your actual host's root directory. Any file you create in /tmp will appear on the host. We're operating directly on the host system with zero separation.

Why This Matters

What we've built right now is basically just a wrapper that executes commands; there's no containerization happening at all. But this baseline is important. In the next steps, when we add Linux namespaces, you'll see how dramatically things change. The PIDs will start from 1, hostname changes won't affect the host, and we'll have our own isolated filesystem.

Phase 2

Now update the run() function to look exactly like this:

func run() {
    fmt.Printf("Running %v\n", os.Args[2:])

    // Re-exec ourselves as "child" inside new namespaces
    cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
        Unshareflags: syscall.CLONE_NEWNS,
    }

    must(cmd.Run())
}

And add this new child() function. Two small but essential changes go with it: add case "child": child() to the switch in main(), and add "syscall" to your imports; otherwise the re-exec'd process will never reach child():

func child() {
    fmt.Printf("Running %v as child\n", os.Args[2:])

    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    must(syscall.Sethostname([]byte("container")))

    if err := cmd.Run(); err != nil {
        fmt.Println("Process exited:", err)
    }
}

Let's understand what we're doing here. First, we need to create namespaces. Before that, we write this in run():

cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)

This is the critical insight: We're not running /bin/bash directly. Instead, we're re-executing our own Go program (/proc/self/exe is the path to our currently running binary) but with a new argument list: ["child", "/bin/bash"].

Why this weird self-re-execution? Because these namespace flags are applied when spawning a NEW process via clone(). You can't retroactively rewire a running process this way (there is an unshare() syscall, but a new PID namespace in particular only takes effect for subsequently created children). So we need our code to run again, but this time inside fresh, isolated namespaces.

Then we actually create the namespaces using this code:

cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
    Unshareflags: syscall.CLONE_NEWNS,
}

As I told you at the start of the blog, different namespaces isolate different things. What we're doing here is creating namespaces using Linux system calls. Specifically, we're using the clone() system call with special flags.

The system calls we're using are the CLONE_* family. Each CLONE_NEW* flag tells the Linux kernel: "When you create this child process, put it in a completely separate namespace for this resource." Let me break down what each one does:

CLONE_NEWUTS creates a new UTS (Unix Time-Sharing) namespace. This isolates the hostname and domain name. With this flag, when the child process changes its hostname to "container", the host machine's hostname stays completely unchanged. Without it, changing the hostname inside would affect your actual host system, exactly what we saw in Phase 1.

CLONE_NEWPID creates a new PID (Process ID) namespace. This is huge. It means the child process gets its own, completely separate process ID numbering system. Inside the container, the shell will see itself as PID 1—the first process, just like the init system in a real Linux boot. But from the host's perspective, that same process might be PID 15234. This is why when you run ps aux in a real container, you only see the container's processes, not all the host processes.

CLONE_NEWNS creates a new Mount namespace. This isolates filesystem mount points. Any mounts we do inside the container (like mounting /proc or /tmp) won't appear on the host system, and vice versa. This is crucial for filesystem isolation.

The | (pipe) symbol between these flags is a bitwise OR operation—it's how we enable multiple namespaces at once. We're essentially saying "create ALL THREE of these namespaces for the child process."

Unshareflags with CLONE_NEWNS is an additional safety measure: it makes the new mount namespace's mounts private, preventing mount propagation back to the host, so filesystem changes inside the container cannot leak out.

Test

Now let's test whether our namespace isolation is actually working. Run the updated code the same way as before:

go run main.go run /bin/bash

You should see:

Running [/bin/bash]
Running [/bin/bash] as child
[root@container cfs]#

You will notice that the hostname in your prompt already shows container—that's the Sethostname call working.

Now check that hostname changes stay isolated: change the hostname inside the container, then confirm from another terminal that the host's hostname is unchanged.

Now check the PID namespace isolation - this is interesting and important.

Inside the container, run:

echo $$

On your host (different terminal):

ps aux | grep bash

Same process, two different PIDs! Inside the namespace it's a low number like PID 7 (yours may differ), while from the host it's something like PID 44999. This proves PID isolation is working.

Test The /proc Problem

Run ps inside the container:

ps aux

You'll see all the host's processes—systemd, kthreadd, Chrome, everything. Why? Because even though we have PID namespace isolation, we're still reading the host's /proc filesystem. The ps command gets its information from /proc, which we haven't isolated yet.

Phase 3

Update your child() function to add filesystem isolation. First, add environment variables right after the cmd setup:

func child() {
    fmt.Printf("Running %v \n", os.Args[2:])

    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    // Add these environment variables
    cmd.Env = []string{
        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "HOME=/root",
        "TERM=xterm",
    }

    must(syscall.Sethostname([]byte("container")))

    // Add filesystem isolation
    must(syscall.Chroot("/home/faizan/alpine-rootfs"))
    must(os.Chdir("/"))

    // Mount /proc filesystem
    must(syscall.Mount("proc", "proc", "proc", 0, ""))

    // Mount tmpfs at /tmp
    must(os.MkdirAll("tmp", 0755))
    must(syscall.Mount("tmpfs", "tmp", "tmpfs", 0, ""))

    if err := cmd.Run(); err != nil {
        fmt.Println("Process exited:", err)
    }

    // Cleanup mounts on exit
    syscall.Unmount("proc", 0)
    syscall.Unmount("tmp", 0)
}

Environment Variables

cmd.Env = []string{
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    "HOME=/root",
    "TERM=xterm",
}

This is a small but important detail. When a child process is spawned, by default it inherits the parent's environment variables, which means it would inherit your host's PATH, your host's HOME, all of it. That's a problem because after we do the filesystem isolation in the next step, those host paths won't even exist inside our container.

So we're explicitly setting a clean, minimal environment. PATH tells the shell where to look for binaries. HOME sets the home directory. TERM=xterm makes sure terminal behavior works correctly, things like clearing the screen, arrow keys, color output. Without TERM, a lot of terminal programs behave weirdly or refuse to run.

Chrooting (the important step)

must(syscall.Chroot("/home/faizan/alpine-rootfs"))
must(os.Chdir("/"))

chroot redefines what / means for this process and all its children. After this call, when the shell opens /etc/passwd, the kernel resolves it as /home/faizan/alpine-rootfs/etc/passwd. The host filesystem becomes completely invisible. Run ls / and you see Alpine's bin, etc, usr, exactly like a real Alpine machine.

chroot changes where / points, but your current working directory doesn't move: after the chroot, you're still sitting in the directory you launched from, which is outside the jail. From there, a process could simply cd ../../.. and walk right out. os.Chdir("/") moves us inside the jail immediately, so there's nowhere to escape to.

Mounting /proc - Fixing What ps Reads

must(syscall.Mount("proc", "proc", "proc", 0, ""))

After chroot, Alpine's /proc is an empty directory. The ps command reads everything from files like /proc/<pid>/status; without mounting something there, ps would simply fail.

So we mount a fresh procfs. The key detail: this is not a copy of the host's /proc. The kernel populates it based on what the current PID namespace can see. Since we're inside CLONE_NEWPID from Phase 2, only the container's processes appear. Run ps aux now and you'll only see the container's shell. The host process flood is gone. This is PID isolation and filesystem isolation finally working together.

Mounting tmpfs at /tmp

must(os.MkdirAll("tmp", 0755))
must(syscall.Mount("tmpfs", "tmp", "tmpfs", 0, ""))

tmpfs is memory-backed. Anything written to /tmp inside the container lives in RAM, never touches your host's /tmp, and disappears the moment the container exits. This is also why CLONE_NEWNS from Phase 2 mattered. Without the mount namespace, these mounts would propagate to the host's mount table. The mount namespace is the prerequisite; these mounts are the payload.

Test

Now verify everything: run ls / and ps aux inside the container and confirm you're seeing the container's filesystem and processes, not the host's.

Phase 4 - Cgroups

We now have a container with real isolation, its own filesystem, its own process tree, its own hostname. But there's still one problem: nothing stops a process inside this container from consuming all of your host's CPU, spawning thousands of processes, or eating all your RAM. A simple fork bomb inside the container would take down your entire machine.

This is what cgroups (control groups) solve. While namespaces control what a process can see, cgroups control what it can use.

So update your code with this function. Call cg() at the very start of child(), before the chroot (the cgroup filesystem lives on the host side of the jail), and add "path/filepath" and "strconv" to your imports:

func cg() {
    cgroupRoot := "/sys/fs/cgroup/"
    cgroupName := "container-cgroup"
    containerCgroup := filepath.Join(cgroupRoot, cgroupName)

    os.WriteFile(
        filepath.Join(cgroupRoot, "cgroup.subtree_control"),
        []byte("+pids +memory +cpu"),
        0644,
    )

    must(os.MkdirAll(containerCgroup, 0755))

    must(os.WriteFile(filepath.Join(containerCgroup, "pids.max"), []byte("20"), 0700))
    must(os.WriteFile(filepath.Join(containerCgroup, "memory.max"), []byte("52428800"), 0700))
    must(os.WriteFile(filepath.Join(containerCgroup, "cpu.max"), []byte("50000 100000"), 0700))

    must(os.WriteFile(
        filepath.Join(containerCgroup, "cgroup.procs"),
        []byte(strconv.Itoa(os.Getpid())),
        0700,
    ))
}

Cgroup v2

This is the main reason I wrote this blog. Most existing posts were based on cgroup v1, but modern kernels and distros moved to cgroup v2 quite a while ago, so eventually the code has to change as well.

Cgroup v2 uses a unified hierarchy: every controller (pids, memory, cpu) lives under a single tree at /sys/fs/cgroup/. This is different from v1, which had separate trees per controller, like /sys/fs/cgroup/pids/ and /sys/fs/cgroup/memory/. If you've seen old container blogs using v1 paths, that's why they're broken on modern kernels.

Working with cgroups is entirely done through the filesystem—you create directories, write to files. No special syscalls needed.

os.WriteFile(
    filepath.Join(cgroupRoot, "cgroup.subtree_control"),
    []byte("+pids +memory +cpu"),
    0644,
)

First, we enable the controllers on the root cgroup. Writing +pids +memory +cpu to subtree_control tells the kernel to make these controllers available to child cgroups we create underneath. It's like unlocking the feature before using it.

Creating the Cgroup

must(os.MkdirAll(containerCgroup, 0755))

This is the entire cgroup creation. You make a directory under /sys/fs/cgroup/, and the kernel recognizes it as a new cgroup and automatically populates it with control files: pids.max, memory.max, cpu.max, cgroup.procs, and more. Nothing else is needed.

The Limits

must(os.WriteFile(filepath.Join(containerCgroup, "pids.max"), []byte("20"), 0700))
must(os.WriteFile(filepath.Join(containerCgroup, "memory.max"), []byte("52428800"), 0700))
must(os.WriteFile(filepath.Join(containerCgroup, "cpu.max"), []byte("50000 100000"), 0700))

pids.max is the most immediately important one. Setting it to 20 means the kernel will start rejecting fork() and clone() calls the moment the process count hits 20. A fork bomb inside the container just chokes and dies; the host is untouched.

memory.max is 52428800 bytes (50 MiB). If the container crosses this, the kernel's OOM killer steps in and kills the process that pushed it over. The host's memory is never at risk.

cpu.max format is $MAX $PERIOD in microseconds. 50000 100000 means: out of every 100ms window, this cgroup gets at most 50ms of CPU time, 50% of one core. A runaway CPU loop inside the container gets throttled at the kernel scheduler level. Your machine stays responsive.

Moving the Process In

must(os.WriteFile(
    filepath.Join(containerCgroup, "cgroup.procs"),
    []byte(strconv.Itoa(os.Getpid())),
    0700,
))

Everything before this was just configuration. This line is what actually enforces it. Writing our PID to cgroup.procs moves the current process into the cgroup. From this point, this process and every child it spawns (including the shell and everything the user runs inside it) is subject to all three limits above.

Test

Try a fork bomb inside the container:

:(){ :|:& };:

With pids.max = 20, the kernel rejects every fork() past the limit. The container shell dies. Open another terminal on your host: everything is still running perfectly. Without cgroups, that fork bomb could have required a hard reboot.


That's it! You've built a real container from scratch with proper isolation using namespaces, filesystem virtualization with chroot, and resource limits with cgroups v2. This is fundamentally what Docker does under the hood, just with a lot more features, better UX, and production-grade tooling on top.

Top comments (2)

Guilherme Zaia:

Solid intro, but you stopped at the easy part. Real containers fail at runtime—cgroup limits hit, OOM kills, or pivot_root breaks because /proc is already mounted. Where's the error handling? Namespaces are the 'what', syscalls are the 'how'. Show the unshare() flags and failure modes. That's where juniors learn why Docker exists.

Clara Bennett:

This is a fantastic deep-dive! Building containers from scratch is one of the best ways to really understand what Docker/containerd are doing under the hood. The cgroup v2 unified hierarchy approach is especially relevant now that most distros have migrated away from v1. One thing I'd love to see expanded: how you'd handle networking namespaces for the container — that's usually where things get tricky with veth pairs and bridge setup. Great write-up though, bookmarking this as a reference.