Shubham Nainwal


Build Your Own Container Runtime in Go: From Zero to a Running Isolated Process

I built gocount as a way to actually understand what Docker does under the hood. By the end of this post you'll have a working container runtime that boots an Alpine Linux shell in its own filesystem, PID tree, hostname, and network, with enforced memory and CPU limits, using nothing but Go and Linux kernel features.

Why Build a Container Runtime?

Containers feel like magic until you look at what's actually happening. When docker run executes, the kernel doesn't spin up a virtual machine. It just creates a process with a restricted view of the system, shaped by a handful of kernel primitives, most of which have been in mainline Linux for well over a decade.

Every production container runtime (Docker, Podman, containerd) is ultimately a thin orchestration layer on top of those primitives. Building one yourself is the fastest way to understand:

  • What a container actually is (spoiler: it's a process)
  • How pivot_root replaces the filesystem you see
  • How cgroups enforce the memory limit that OOM-kills your app
  • How the veth pair wires the container to the outside network

Prerequisites

| Requirement | Why |
| --- | --- |
| Linux (kernel 5.10+ recommended) | All the kernel features used here are Linux-specific |
| Go 1.23+ | Module system + syscall / golang.org/x/sys |
| iproute2 (ip command) | Creating and managing network interfaces |
| iptables | NAT masquerading for container internet access |
| Root access | Namespaces, cgroup creation, and pivot_root require it |
| nsenter | Moving a network interface into a namespace |

Check your kernel supports cgroup v2 (required for memory + CPU limits):

cat /sys/fs/cgroup/cgroup.controllers
# Should include: cpuset cpu io memory hugetlb pids rdma

Project Layout

gocount/
├── main.go
├── go.mod
├── Makefile
├── cmd/
│   ├── root.go       # cobra root command
│   ├── run.go        # run & start subcommands + child setup
│   ├── ps.go         # list containers
│   ├── stop.go       # stop & rm commands
│   └── inspect.go    # inspect a container
└── internal/
    ├── container/
    │   ├── container.go   # Container struct, in-memory map, JSON persistence
    │   ├── mount.go       # pivot_root, /proc /sys /dev mounts, DNS
    │   └── utils.go       # EnsureContainerDir
    ├── cgroups/
    │   └── cgroups.go     # cgroup v2 create / limits / delete
    ├── rootfs/
    │   └── manager.go     # download & extract Alpine Linux rootfs
    └── network/
        └── network.go     # veth pair, IP forwarding, NAT, in-container setup

Initialise it:

mkdir gocount && cd gocount
go mod init gocount
go get github.com/spf13/cobra@v1.10.1
go get golang.org/x/sys@v0.29.0

Concept 1: Linux Namespaces

A namespace wraps a global resource and gives a process the illusion that it has its own isolated instance of that resource. gocount uses four:

| Flag | What it isolates |
| --- | --- |
| CLONE_NEWPID | PID space: the container's first process is PID 1 |
| CLONE_NEWUTS | Hostname and domain name |
| CLONE_NEWNS | Mount tree: filesystem changes are invisible to the host |
| CLONE_NEWNET | Network stack: the container gets its own lo, no host interfaces |

When you combine all four, the process is isolated enough to look and feel like a separate machine, even though it's sharing the same kernel as the host.

Concept 2: cgroup v2

cgroup v2 is the kernel's resource accounting and limiting subsystem. It's a virtual filesystem mounted at /sys/fs/cgroup. You control a process by:

  1. Creating a subdirectory: mkdir /sys/fs/cgroup/gocount/<id>
  2. Writing the limit: echo 104857600 > /sys/fs/cgroup/gocount/<id>/memory.max
  3. Writing the PID: echo <pid> > /sys/fs/cgroup/gocount/<id>/cgroup.procs

Once the PID is in the cgroup, the kernel enforces the limit. Exceed it and the OOM killer fires.

Concept 3: pivot_root

chroot changes where a process looks for /, but pivot_root is stronger. It actually swaps the root mount point and lets you unmount the old one entirely, so the process can't escape back to the host filesystem.

The steps:

  1. Bind-mount the new rootfs onto itself (required by the kernel)
  2. Create a /.pivot_root directory inside the new rootfs
  3. Call pivot_root(newroot, newroot/.pivot_root)
  4. chdir("/") to land inside the new root
  5. Unmount /.pivot_root with MNT_DETACH
  6. Remove /.pivot_root

After step 6, the host filesystem is completely gone from the process's perspective.

Concept 4: veth pairs

A veth pair is a virtual Ethernet cable with two ends. When you move one end into the container's network namespace, you get a private point-to-point link between host and container. gocount then:

  • Assigns 10.0.0.1/24 to the host end
  • Assigns 10.0.0.2/24 to the container end
  • Adds a default route through 10.0.0.1
  • Configures iptables NAT (masquerade) so the container can reach the internet

Step 1: Entry Point and CLI


main.go does nothing but hand off to the cmd package:

// main.go
package main

import "gocount/cmd"

func main() {
    cmd.Execute()
}

cmd/root.go registers the root cobra command:

// cmd/root.go
package cmd

import (
    "fmt"
    "os"
    "github.com/spf13/cobra"
)

var rootCmd = &cobra.Command{
    Use:   "gocount",
    Short: "gocount is a minimal container runtime",
    Long:  `Run Linux processes in isolated namespaces, like a tiny Docker.`,
}

func Execute() {
    if err := rootCmd.Execute(); err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
}

Step 2: The Container Struct and Persistence

Every running container is represented by this struct and stored in two places: an in-memory map for fast lookup, and a JSON file under /tmp/gocount/<id>.json so it survives process restarts.

// internal/container/container.go
package container

import (
    "encoding/json"
    "fmt"
    "math/rand"
    "os"
    "strings"
)

type Container struct {
    ID      string
    Pid     int
    Command []string
    Status  string
    RootFs  string
    Cgroup  string
}

var Containers = map[string]*Container{}

func GenerateID() string {
    letters := "abcdefghijklmnopqrstuvwxyz0123456789"
    id := ""
    for i := 0; i < 8; i++ {
        id += string(letters[rand.Intn(len(letters))])
    }
    return id
}

func SaveContainer(c *Container) error {
    data, _ := json.Marshal(c)
    path := fmt.Sprintf("/tmp/gocount/%s.json", c.ID)
    return os.WriteFile(path, data, 0644)
}

func LoadContainers() ([]*Container, error) {
    var containers []*Container
    files, _ := os.ReadDir("/tmp/gocount")
    for _, f := range files {
        if f.IsDir() || !strings.HasSuffix(f.Name(), ".json") {
            continue
        }
        data, err := os.ReadFile("/tmp/gocount/" + f.Name())
        if err != nil {
            fmt.Fprintf(os.Stderr, "Warning: could not read %s: %v\n", f.Name(), err)
            continue
        }
        var c Container
        if err := json.Unmarshal(data, &c); err != nil {
            fmt.Fprintf(os.Stderr, "Warning: could not parse %s: %v\n", f.Name(), err)
            continue
        }
        containers = append(containers, &c)
    }
    return containers, nil
}

Why two stores? The in-memory map is O(1) with no I/O overhead for commands that run in the same process lifetime as run. The JSON files persist state across invocations, so ps, stop, and inspect work even after the parent process exits.

Step 3: Downloading the Rootfs

A container needs a root filesystem. gocount downloads the Alpine Linux minirootfs on first run. It's only ~3 MB compressed and ships a BusyBox userland (/bin/sh, ip, ping) plus the apk package manager, which can install anything else, python3 included.

// internal/rootfs/manager.go
const DefaultRootfsURL = "https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz"

func EnsureRootfs(rootfsDir string) error {
    if isValidRootfs(rootfsDir) {
        return nil
    }
    fmt.Println("Rootfs not found. Downloading Alpine Linux rootfs...")
    return DownloadAndExtractRootfs(DefaultRootfsURL, rootfsDir)
}

isValidRootfs checks that bin/, lib/, etc/, usr/, and bin/sh all exist. If any are missing the tarball is re-downloaded.

Security note: zip-slip protection. When extracting the tarball every header path is validated to stay inside the destination directory:

target := filepath.Join(destPath, header.Name)
if !strings.HasPrefix(
    filepath.Clean(target)+string(os.PathSeparator),
    filepath.Clean(destPath)+string(os.PathSeparator),
) {
    return fmt.Errorf("illegal path in tar archive: %s", header.Name)
}

A crafted tar with ../../etc/passwd entries would otherwise silently overwrite host files.

The rootfs is stored at /tmp/gocount/<id>/rootfs and reused across runs sharing the same ID. On the very first gocount run you'll see:

Rootfs not found. Downloading Alpine Linux rootfs...
Downloading from https://dl-cdn.alpinelinux.org/alpine/...
Extracting rootfs...
Rootfs setup complete!

Runs that reuse an already-extracted rootfs are instant because isValidRootfs passes immediately.

Step 4: cgroup v2 Resource Limits

The cgroup code lives entirely in internal/cgroups/cgroups.go. The interface is simple:

// Create a cgroup for a container
cgPath, err := cgroups.Create(id)

// Apply limits (empty string = skip)
cgroups.SetMemoryLimit(cgPath, "100M")      // supports M and G suffixes
cgroups.SetCPUQuota(cgPath, "50000 100000") // quota/period in microseconds

// Add a process to the cgroup
cgroups.AddProc(cgPath, pid)

How memory limits are applied:

func SetMemoryLimit(cgPath, limit string) error {
    if limit == "" {
        return nil
    }
    val := limit
    if strings.HasSuffix(limit, "M") {
        mb, _ := strconv.ParseInt(strings.TrimSuffix(limit, "M"), 10, 64)
        val = strconv.FormatInt(mb*1024*1024, 10)
    } else if strings.HasSuffix(limit, "G") {
        gb, _ := strconv.ParseInt(strings.TrimSuffix(limit, "G"), 10, 64)
        val = strconv.FormatInt(gb*1024*1024*1024, 10)
    }
    writeFile(filepath.Join(cgPath, "memory.max"), val)
    writeFile(filepath.Join(cgPath, "memory.swap.max"), "0") // disable swap
    writeFile(filepath.Join(cgPath, "memory.oom.group"), "1") // kill whole cgroup on OOM
    return nil
}

CPU quota uses cgroup v2's cpu.max format: <quota_microseconds> <period_microseconds>. For example "50000 100000" means 50 ms out of every 100 ms period, which works out to 50% of one CPU core.

Important: The cgroup is created before the child process starts, so limits are in place the moment the container process writes its PID to cgroup.procs.

Step 5: The run Command (Parent Side)

This is the heart of gocount. The run command works by re-executing itself, a classic self-re-exec pattern used by runc and lxc.

gocount run /bin/sh
       │
       ├─ parent: sets up cgroup, starts child with namespaces
       │
       └─ child (GOCOUNT_CHILD=1): pivot_root, mount /proc /sys /dev,
                                   wait for veth, configure eth0, exec /bin/sh

Why self-re-exec? Go's runtime starts multiple threads before main(). clone(CLONE_NEWPID) in a multi-threaded process can leave threads in inconsistent namespaces. Re-executing /proc/self/exe gives us a fresh, single-threaded process that enters the namespace cleanly from the very start.

// cmd/run.go (parent side, simplified)
id := container.GenerateID()
rootdir := "/tmp/gocount/" + id + "/rootfs"

rootfs.EnsureRootfs(rootdir)

cgPath, _ := cgroups.Create(id)
cgroups.SetMemoryLimit(cgPath, flagMemory)
cgroups.SetCPUQuota(cgPath, flagCPU)

command := exec.Command("/proc/self/exe", append([]string{"run"}, args...)...)
command.Env = append(os.Environ(),
    "GOCOUNT_CHILD=1",
    "GOCOUNT_CONTAINER_ID="+id,
    "GOCOUNT_ROOTFS="+rootdir,
)
command.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS |
                syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS  |
                syscall.CLONE_NEWNET,
}

command.Start()
network.SetupVethPair(id, command.Process.Pid) // host-side networking

c := &container.Container{
    ID: id, Pid: command.Process.Pid,
    Command: args, Status: "running",
    RootFs: rootdir, Cgroup: cgPath,
}
container.Containers[id] = c
container.SaveContainer(c)

command.Wait()

Step 6: The run Command (Child Side)

When the child detects GOCOUNT_CHILD=1 it calls childSetup:

func childSetup(args []string) {
    rootfsPath  := os.Getenv("GOCOUNT_ROOTFS")
    containerID := os.Getenv("GOCOUNT_CONTAINER_ID")

    // Join the pre-created cgroup
    cgPath := filepath.Join("/sys/fs/cgroup", "gocount", containerID)
    cgroups.AddProc(cgPath, os.Getpid())

    // Switch to the isolated filesystem
    container.SetupMount(rootfsPath)

    // Set container hostname
    syscall.Sethostname([]byte("gocount"))

    // Wait up to 5s for the parent to wire the veth pair
    for i := 0; i < 50; i++ {
        if exec.Command("ip", "link", "show", "eth0").Run() == nil {
            break
        }
        time.Sleep(100 * time.Millisecond)
    }

    // Configure eth0 and default route
    network.SetupNetworkInsideContainer()

    // Replace this process with the user's command — it becomes PID 1
    syscall.Exec(args[0], args, os.Environ())
}

The syscall.Exec call is critical. It replaces the Go runtime with the container's process so that the user's command is PID 1 inside the container.

Step 7: Filesystem Isolation with pivot_root

SetupMount is called inside the child, after it enters the new mount namespace:

// internal/container/mount.go
func SetupMount(rootfs string) error {
    rootfs, _ = filepath.Abs(rootfs)

    // Make the current mount tree private so host cannot see our changes
    syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, "")

    // Bind mount rootfs onto itself (kernel requirement for pivot_root)
    syscall.Mount(rootfs, rootfs, "", syscall.MS_BIND|syscall.MS_REC, "")

    // Create a directory to receive the old root
    putold := filepath.Join(rootfs, ".pivot_root")
    os.MkdirAll(putold, 0700)

    // Swap the root mount
    syscall.PivotRoot(rootfs, putold)
    os.Chdir("/")

    // Detach and remove the old root
    syscall.Unmount("/.pivot_root", syscall.MNT_DETACH)
    os.RemoveAll("/.pivot_root")

    // Mount essential virtual filesystems
    syscall.Mount("proc",  "/proc", "proc",  0, "")
    syscall.Mount("sysfs", "/sys",  "sysfs", 0, "")
    syscall.Mount("tmpfs", "/dev",  "tmpfs", syscall.MS_NOSUID|syscall.MS_STRICTATIME, "mode=755")

    createDeviceNodes() // /dev/null, /dev/zero, /dev/urandom, etc.
    setupDNS()          // write 8.8.8.8 to /etc/resolv.conf
    return nil
}

After PivotRoot, / is the Alpine rootfs. The host's /home, /etc, /proc — none of it is visible.

Step 8: Networking with veth Pairs

// internal/network/network.go (host side)
func SetupVethPair(containerID string, pid int) error {
    hostIf      := fmt.Sprintf("veth-%s", containerID[:8])  // e.g. veth-a1b2c3d4
    containerIf := fmt.Sprintf("vethc-%s", containerID[:7]) // temporary name

    // Create the pair
    exec.Command("ip", "link", "add", hostIf, "type", "veth", "peer", "name", containerIf).Run()

    // Move one end into the container's network namespace
    exec.Command("ip", "link", "set", containerIf, "netns", fmt.Sprintf("%d", pid)).Run()

    // Rename to eth0 inside the container namespace
    exec.Command("nsenter", "-t", fmt.Sprintf("%d", pid), "-n",
        "ip", "link", "set", containerIf, "name", "eth0").Run()

    // Bring up and address the host end
    exec.Command("ip", "link", "set", hostIf, "up").Run()
    exec.Command("ip", "addr", "add", "10.0.0.1/24", "dev", hostIf).Run()

    EnableIPForwarding()
    SetupNAT() // iptables MASQUERADE for 10.0.0.0/24
    return nil
}

Inside the container, SetupNetworkInsideContainer mirrors this:

exec.Command("ip", "link", "set", "lo",   "up").Run()
exec.Command("ip", "link", "set", "eth0", "up").Run()
exec.Command("ip", "addr", "add", "10.0.0.2/24", "dev", "eth0").Run()
exec.Command("ip", "route", "add", "default", "via", "10.0.0.1").Run()

The container's DNS is 8.8.8.8 (written by setupDNS into /etc/resolv.conf inside the new rootfs).

Step 9: Container Lifecycle Commands

ps — list containers

Reads every *.json file from /tmp/gocount/ and prints a table:

CONTAINER ID    PID     STATUS    COMMAND
a1b2c3d4        12345   running   [/bin/sh]

stop — send SIGKILL

syscall.Kill(c.Pid, syscall.SIGKILL)
c.Status = "stopped"
container.SaveContainer(c)

rm — stop and remove metadata

Kills the process (if still running), removes the JSON file, and deletes it from the in-memory map.

inspect — detailed view

Shows PID, status, cgroup resource limits, memory usage, CPU time, and every namespace link:

Container Information:
  ID:        ar9vb2mm
  Status:    running
  PID:       57699
  Command:   [/bin/sh]
  RootFS:    /tmp/gocount/ar9vb2mm/rootfs
  Cgroup:    /sys/fs/cgroup/gocount/ar9vb2mm

Process Status:
  Running:   Yes
  State:     S (sleeping)

Resource Limits:
  Memory Limit:    unlimited
  Memory Usage:    319488 bytes (312.00 KB)
  Memory Peak:     2838528 bytes (2.71 MB)
  CPU Quota:       max/100000 (0.0%)
  CPU Time:        36.04ms

Memory Events:

Processes in Cgroup: 1
  PID: 57699

Namespaces:
  cgroup: cgroup:[4026531835]
  ipc: ipc:[4026531839]
  mnt: mnt:[4026533179]
  net: net:[4026533458]
  pid: pid:[4026533456]
  pid_for_children: pid:[4026533456]
  time: time:[4026531834]
  time_for_children: time:[4026531834]
  user: user:[4026531837]
  uts: uts:[4026533455]

isProcessRunning sends syscall.Signal(0), a no-op signal that returns an error only if the process doesn't exist:

func isProcessRunning(pid int) bool {
    process, err := os.FindProcess(pid)
    if err != nil {
        return false
    }
    return process.Signal(syscall.Signal(0)) == nil
}

Step 10: Build and Run

# Build
make build
# Binary lands at bin/gocount

# Run an interactive Alpine shell
sudo ./bin/gocount run /bin/sh

# Run with resource limits
sudo ./bin/gocount run --memory 100M --cpu "50000 100000" /bin/sh

# In another terminal — list containers
sudo ./bin/gocount ps

# Inspect
sudo ./bin/gocount inspect <id>

# Stop
sudo ./bin/gocount stop <id>

# Remove
sudo ./bin/gocount rm <id>

Testing the memory limit

test.py allocates 1 MB per iteration. With a 50 MB limit you can watch it get OOM-killed:

make test-memory
# Copies test.py into the rootfs, then runs:
# sudo gocount run --memory 50M /usr/bin/python3 /test.py

Output:

Starting memory consumption...
Allocated: 1 MB
Allocated: 2 MB
...
Allocated: 48 MB
Allocated: 49 MB
Killed

The kernel's OOM killer fires the moment the cgroup's memory.max threshold is crossed and terminates the entire cgroup.

How It All Fits Together: End-to-End Walk-through

$ sudo ./bin/gocount run --memory 50M /bin/sh

1. GenerateID()               → "a1b2c3d4"
2. EnsureRootfs(...)          → download Alpine if missing
3. cgroups.Create("a1b2c3d4") → mkdir /sys/fs/cgroup/gocount/a1b2c3d4
4. SetMemoryLimit(..., "50M") → write 52428800 to memory.max
5. exec.Command("/proc/self/exe", "run", "/bin/sh")
   + GOCOUNT_CHILD=1
   + CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET
6. command.Start()            → child PID = 12345
7. SetupVethPair("a1b2c3d4", 12345)
   → ip link add veth-a1b2c3d4 type veth peer name vethc-a1b2c3d
   → ip link set vethc-a1b2c3d netns 12345
   → nsenter -t 12345 -n ip link set vethc-a1b2c3d name eth0
   → ip addr add 10.0.0.1/24 dev veth-a1b2c3d4
   → iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
8. SaveContainer(...)         → /tmp/gocount/a1b2c3d4.json

--- inside child (GOCOUNT_CHILD=1) ---
9.  AddProc(cgPath, getpid()) → write PID to cgroup.procs
10. SetupMount(rootfsPath)    → pivot_root + mount /proc /sys /dev
11. Sethostname("gocount")
12. poll until eth0 appears   → up to 5 s
13. SetupNetworkInsideContainer()
    → ip link set lo up
    → ip link set eth0 up
    → ip addr add 10.0.0.2/24 dev eth0
    → ip route add default via 10.0.0.1
14. syscall.Exec("/bin/sh", ...) → you are now in the container

Key Bugs Fixed Along the Way

Building a container runtime means touching the kernel directly. A few subtle bugs appeared during development:

| Bug | Impact | Fix |
| --- | --- | --- |
| Duplicate init() in inspect.go | Compile error | Remove the second registration |
| process.Signal(os.Signal(nil)) | Panic: nil interface type assertion | Use syscall.Signal(0) |
| removeCmd missing nil check | Panic when container not found | Guard c == nil before access |
| AddContainer called after Containers[id] = c | Overwrites Cgroup field with zero value | Remove the redundant AddContainer call |
| Zip-slip in extractTar | Malicious tar writes outside rootfs | strings.HasPrefix boundary check |
| startCmd missing CLONE_NEWNET | Container shares host network stack | Add CLONE_NEWNET to clone flags |
| json.Unmarshal errors ignored | Corrupted JSON inserts zero-value container | Explicit error check + continue |
| rand.Seed(time.Now().UnixNano()) | Deprecated in Go 1.20+ | Removed: global source is auto-seeded |

What's Missing (and How to Add It)

gocount is deliberately minimal. Here's what a production runtime adds on top:

| Feature | What to do |
| --- | --- |
| User namespace (CLONE_NEWUSER) | Map container root to an unprivileged host UID, so sudo is no longer needed |
| IPC namespace (CLONE_NEWIPC) | Isolate System V semaphores and POSIX message queues |
| Rootfs layers / overlay | Use overlayfs so containers share a read-only base and get a private writable layer |
| Container images | Pull from an OCI registry (skopeo / go-containerregistry) |
| seccomp filter | Block dangerous syscalls (ptrace, mount, reboot) using libseccomp |
| Capabilities drop | Use AmbientCaps + CapDrop in SysProcAttr to drop CAP_NET_ADMIN after setup |
| Port forwarding | Add iptables DNAT rules: host port to container IP:port |
| Persistent storage | Bind-mount host directories into the container before pivot_root |
| Multi-container networking | Replace individual veth pairs with a Linux bridge (like Docker's docker0) |

Full Source

The complete source for gocount is structured exactly as shown above.

gocount is ~600 lines of Go, and it's a real container runtime. The kernel has been doing all of this every time you type docker run. Now you know exactly how.
