I built gocount as a way to actually understand what Docker does under the hood. By the end of this post you'll have a working container runtime that boots an Alpine Linux shell in its own filesystem, PID tree, hostname, and network, with enforced memory and CPU limits, using nothing but Go and Linux kernel features.
Why Build a Container Runtime?
Containers feel like magic until you look at what's actually happening. When docker run executes, the kernel doesn't spin up a virtual machine. It just creates a process with a restricted view of the system, shaped by five or six kernel primitives that have been in Linux since 2008.
Every production container runtime (Docker, Podman, containerd) is ultimately a thin orchestration layer on top of those primitives. Building one yourself is the fastest way to understand:
- What a container actually is (spoiler: it's a process)
- How `pivot_root` replaces the filesystem you see
- How cgroups enforce the memory limit that OOM-kills your app
- How the `veth` pair wires the container to the outside network
Prerequisites
| Requirement | Why |
|---|---|
| Linux (kernel 5.10+ recommended) | All the kernel features used here are Linux-specific |
| Go 1.23+ | Module system + `syscall` / `golang.org/x/sys` |
| `iproute2` (the `ip` command) | Creating and managing network interfaces |
| `iptables` | NAT masquerading for container internet access |
| Root access | Namespaces, cgroup creation, and `pivot_root` require it |
| `nsenter` | Moving a network interface into a namespace |
Check your kernel supports cgroup v2 (required for memory + CPU limits):
cat /sys/fs/cgroup/cgroup.controllers
# Should include: cpuset cpu io memory hugetlb pids rdma
Project Layout
gocount/
├── main.go
├── go.mod
├── Makefile
├── cmd/
│ ├── root.go # cobra root command
│ ├── run.go # run & start subcommands + child setup
│ ├── ps.go # list containers
│ ├── stop.go # stop & rm commands
│ └── inspect.go # inspect a container
└── internal/
├── container/
│ ├── container.go # Container struct, in-memory map, JSON persistence
│ ├── mount.go # pivot_root, /proc /sys /dev mounts, DNS
│ └── utils.go # EnsureContainerDir
├── cgroups/
│ └── cgroups.go # cgroup v2 create / limits / delete
├── rootfs/
│ └── manager.go # download & extract Alpine Linux rootfs
└── network/
└── network.go # veth pair, IP forwarding, NAT, in-container setup
Initialise it:
mkdir gocount && cd gocount
go mod init gocount
go get github.com/spf13/cobra@v1.10.1
go get golang.org/x/sys@v0.29.0
Concept 1: Linux Namespaces
A namespace wraps a global resource and gives a process the illusion that it has its own isolated instance of that resource. gocount uses four:
| Flag | What it isolates |
|---|---|
| `CLONE_NEWPID` | PID space — the container's first process is PID 1 |
| `CLONE_NEWUTS` | Hostname and domain name |
| `CLONE_NEWNS` | Mount tree — filesystem changes are invisible to the host |
| `CLONE_NEWNET` | Network stack — the container gets its own `lo`, no host interfaces |
When you combine all four, the process is isolated enough to look and feel like a separate machine, even though it's sharing the same kernel as the host.
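You can see this directly on any Linux box: each process exposes its namespace membership as symlinks under /proc/&lt;pid&gt;/ns, and two processes share a namespace exactly when those links point at the same inode. A quick standalone sketch (not part of gocount) that prints them for the current process:

```go
// nsinfo.go: print the namespace identities of the current process.
// Compare the output inside and outside a container: the inode numbers differ
// for every namespace the container does not share with the host.
package main

import (
	"fmt"
	"os"
)

func main() {
	for _, ns := range []string{"pid", "uts", "mnt", "net"} {
		link, err := os.Readlink("/proc/self/ns/" + ns)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%-4s -> %s\n", ns, link) // e.g. "net  -> net:[4026531840]"
	}
}
```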
Concept 2: cgroup v2
cgroup v2 is the kernel's resource accounting and limiting subsystem. It's a virtual filesystem mounted at /sys/fs/cgroup. You control a process by:
- Creating a subdirectory: `mkdir /sys/fs/cgroup/gocount/<id>`
- Writing the limit: `echo 104857600 > /sys/fs/cgroup/gocount/<id>/memory.max`
- Writing the PID: `echo <pid> > /sys/fs/cgroup/gocount/<id>/cgroup.procs`
Once the PID is in the cgroup, the kernel enforces the limit. Exceed it and the OOM killer fires.
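Those three steps translate almost one-to-one into Go. Here's a standalone sketch (assuming a cgroup v2 host and root privileges; gocount's real helpers live in internal/cgroups/cgroups.go and are shown in Step 4):

```go
// cgroup_sketch.go: the three steps above as a standalone program (run as root).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	id, pid := "demo", os.Getpid()
	cg := filepath.Join("/sys/fs/cgroup/gocount", id)

	// 1. create the subdirectory (and its gocount/ parent)
	if err := os.MkdirAll(cg, 0755); err != nil {
		panic(err)
	}
	// cgroup v2 detail: a child only gets memory.max once the parent delegates
	// the memory controller to it via cgroup.subtree_control
	os.WriteFile("/sys/fs/cgroup/gocount/cgroup.subtree_control", []byte("+memory"), 0644)

	// 2. write the limit (100 MB)
	if err := os.WriteFile(filepath.Join(cg, "memory.max"), []byte("104857600"), 0644); err != nil {
		panic(err)
	}
	// 3. write a PID; from this moment the kernel enforces the limit on that process
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0644); err != nil {
		panic(err)
	}
	fmt.Println("this process is now limited to 100 MB of memory")
}
```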
Concept 3: pivot_root
chroot changes where a process looks for /, but pivot_root is stronger. It actually swaps the root mount point and lets you unmount the old one entirely, so the process can't escape back to the host filesystem.
The steps:
1. Bind-mount the new rootfs onto itself (required by the kernel)
2. Create a `/.pivot_root` directory inside the new rootfs
3. Call `pivot_root(newroot, newroot/.pivot_root)`
4. `chdir("/")` to land inside the new root
5. Unmount `/.pivot_root` with `MNT_DETACH`
6. Remove `/.pivot_root`
After step 6, the host filesystem is completely gone from the process's perspective.
Concept 4: veth pairs
A veth pair is a virtual Ethernet cable with two ends. When you move one end into the container's network namespace, you get a private point-to-point link between host and container. gocount then:
- Assigns `10.0.0.1/24` to the host end
- Assigns `10.0.0.2/24` to the container end
- Adds a default route through `10.0.0.1`
- Configures iptables NAT (masquerade) so the container can reach the internet
Step 1: Entry Point and CLI
main.go does nothing but hand off to the cmd package:
// main.go
package main
import "gocount/cmd"
func main() {
cmd.Execute()
}
cmd/root.go registers the root cobra command:
// cmd/root.go
package cmd
import (
"fmt"
"os"
"github.com/spf13/cobra"
)
var rootCmd = &cobra.Command{
Use: "gocount",
Short: "gocount is a minimal container runtime",
Long: `Run Linux processes in isolated namespaces, like a tiny Docker.`,
}
func Execute() {
if err := rootCmd.Execute(); err != nil {
fmt.Println(err)
os.Exit(1)
}
}
Step 2: The Container Struct and Persistence
Every running container is represented by this struct and stored in two places: an in-memory map for fast lookup, and a JSON file under /tmp/gocount/<id>.json so it survives process restarts.
// internal/container/container.go
package container
import (
"encoding/json"
"fmt"
"math/rand"
"os"
"strings"
)
type Container struct {
ID string
Pid int
Command []string
Status string
RootFs string
Cgroup string
}
var Containers = map[string]*Container{}
func GenerateID() string {
letters := "abcdefghijklmnopqrstuvwxyz0123456789"
id := ""
for i := 0; i < 8; i++ {
id += string(letters[rand.Intn(len(letters))])
}
return id
}
func SaveContainer(c *Container) error {
data, _ := json.Marshal(c)
path := fmt.Sprintf("/tmp/gocount/%s.json", c.ID)
return os.WriteFile(path, data, 0644)
}
func LoadContainers() ([]*Container, error) {
var containers []*Container
files, _ := os.ReadDir("/tmp/gocount")
for _, f := range files {
if f.IsDir() || !strings.HasSuffix(f.Name(), ".json") {
continue
}
data, err := os.ReadFile("/tmp/gocount/" + f.Name())
if err != nil {
fmt.Fprintf(os.Stderr, "Warning: could not read %s: %v\n", f.Name(), err)
continue
}
var c Container
if err := json.Unmarshal(data, &c); err != nil {
fmt.Fprintf(os.Stderr, "Warning: could not parse %s: %v\n", f.Name(), err)
continue
}
containers = append(containers, &c)
}
return containers, nil
}
Why two stores? The in-memory map is O(1) with no I/O overhead for commands that run in the same process lifetime as run. The JSON files persist state across invocations, so ps, stop, and inspect work even after the parent process exits.
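The lifecycle commands all need to resolve an ID to a Container. A small helper along these lines keeps that in one place (LoadContainer is a hypothetical name, not a function from the post; it would sit next to LoadContainers and reuse the same imports):

```go
// Hypothetical helper: look up one container by ID.
// Checks the in-memory map first, then falls back to the JSON file on disk.
func LoadContainer(id string) (*Container, error) {
	if c, ok := Containers[id]; ok {
		return c, nil
	}
	data, err := os.ReadFile(fmt.Sprintf("/tmp/gocount/%s.json", id))
	if err != nil {
		return nil, fmt.Errorf("container %s not found: %w", id, err)
	}
	var c Container
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, fmt.Errorf("corrupt metadata for %s: %w", id, err)
	}
	Containers[id] = &c // cache it for the rest of this invocation
	return &c, nil
}
```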
Step 3: Downloading the Rootfs
A container needs a root filesystem. gocount downloads the Alpine Linux minirootfs on first run. It's only ~3 MB compressed and ships a full /bin/sh, busybox networking tools like ip and ping, and the apk package manager (which can install python3 for the memory test later).
// internal/rootfs/manager.go
const DefaultRootfsURL = "https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz"
func EnsureRootfs(rootfsDir string) error {
if isValidRootfs(rootfsDir) {
return nil
}
fmt.Println("Rootfs not found. Downloading Alpine Linux rootfs...")
return DownloadAndExtractRootfs(DefaultRootfsURL, rootfsDir)
}
isValidRootfs checks that bin/, lib/, etc/, usr/, and bin/sh all exist. If any are missing the tarball is re-downloaded.
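isValidRootfs itself isn't shown above, but it's a handful of os.Stat calls; a sketch that matches that description:

```go
// internal/rootfs/manager.go (sketch)
// A rootfs only counts as valid if the key directories and /bin/sh all exist.
func isValidRootfs(rootfsDir string) bool {
	for _, p := range []string{"bin", "lib", "etc", "usr", "bin/sh"} {
		if _, err := os.Stat(filepath.Join(rootfsDir, p)); err != nil {
			return false
		}
	}
	return true
}
```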
Security note: zip-slip protection. When extracting the tarball every header path is validated to stay inside the destination directory:
target := filepath.Join(destPath, header.Name)
if !strings.HasPrefix(
filepath.Clean(target)+string(os.PathSeparator),
filepath.Clean(destPath)+string(os.PathSeparator),
) {
return fmt.Errorf("illegal path in tar archive: %s", header.Name)
}
A crafted tar with ../../etc/passwd entries would otherwise silently overwrite host files.
The rootfs is stored at /tmp/gocount/<id>/rootfs and reused across runs sharing the same ID. On the very first gocount run you'll see:
Rootfs not found. Downloading Alpine Linux rootfs...
Downloading from https://dl-cdn.alpinelinux.org/alpine/...
Extracting rootfs...
Rootfs setup complete!
Subsequent runs are instant because isValidRootfs passes immediately.
Step 4: cgroup v2 Resource Limits
The cgroup code lives entirely in internal/cgroups/cgroups.go. The interface is simple:
// Create a cgroup for a container
cgPath, err := cgroups.Create(id)
// Apply limits (empty string = skip)
cgroups.SetMemoryLimit(cgPath, "100M") // supports M and G suffixes
cgroups.SetCPUQuota(cgPath, "50000 100000") // quota/period in microseconds
// Add a process to the cgroup
cgroups.AddProc(cgPath, pid)
How memory limits are applied:
func SetMemoryLimit(cgPath, limit string) error {
if limit == "" {
return nil
}
val := limit
if strings.HasSuffix(limit, "M") {
mb, _ := strconv.ParseInt(strings.TrimSuffix(limit, "M"), 10, 64)
val = strconv.FormatInt(mb*1024*1024, 10)
} else if strings.HasSuffix(limit, "G") {
gb, _ := strconv.ParseInt(strings.TrimSuffix(limit, "G"), 10, 64)
val = strconv.FormatInt(gb*1024*1024*1024, 10)
}
writeFile(filepath.Join(cgPath, "memory.max"), val)
writeFile(filepath.Join(cgPath, "memory.swap.max"), "0") // disable swap
writeFile(filepath.Join(cgPath, "memory.oom.group"), "1") // kill whole cgroup on OOM
return nil
}
CPU quota uses cgroup v2's cpu.max format: <quota_microseconds> <period_microseconds>. For example "50000 100000" means 50 ms out of every 100 ms period, which works out to 50% of one CPU core.
Important: The cgroup is created before the child process starts, so limits are in place the moment the container process writes its PID to cgroup.procs.
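SetMemoryLimit is the only helper shown in full; the other three from the snippet at the top of this step are just as small. A sketch of what they plausibly look like (uses os, path/filepath, and strconv):

```go
// internal/cgroups/cgroups.go (sketch of the remaining helpers)

// Create makes /sys/fs/cgroup/gocount/<id> and returns its path.
func Create(id string) (string, error) {
	cgPath := filepath.Join("/sys/fs/cgroup", "gocount", id)
	if err := os.MkdirAll(cgPath, 0755); err != nil {
		return "", err
	}
	// cgroup v2 detail: children of gocount/ only get memory.max and cpu.max
	// once those controllers are delegated via cgroup.subtree_control.
	os.WriteFile("/sys/fs/cgroup/gocount/cgroup.subtree_control", []byte("+memory +cpu"), 0644)
	return cgPath, nil
}

// SetCPUQuota writes cpu.max, e.g. "50000 100000" for 50% of one core.
func SetCPUQuota(cgPath, quota string) error {
	if quota == "" {
		return nil // empty string = skip, same convention as SetMemoryLimit
	}
	return os.WriteFile(filepath.Join(cgPath, "cpu.max"), []byte(quota), 0644)
}

// AddProc moves a PID into the cgroup; limits apply to it from that moment on.
func AddProc(cgPath string, pid int) error {
	return os.WriteFile(filepath.Join(cgPath, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0644)
}
```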
Step 5: The run Command (Parent Side)
This is the heart of gocount. The run command works by re-executing itself, a classic self-re-exec pattern used by runc and lxc.
gocount run /bin/sh
│
├─ parent: sets up cgroup, starts child with namespaces
│
└─ child (GOCOUNT_CHILD=1): pivot_root, mount /proc /sys /dev,
wait for veth, configure eth0, exec /bin/sh
Why self-re-exec? Go's runtime starts multiple threads before main(). clone(CLONE_NEWPID) in a multi-threaded process can leave threads in inconsistent namespaces. Re-executing /proc/self/exe gives us a fresh, single-threaded process that enters the namespace cleanly from the very start.
// cmd/run.go (parent side, simplified)
id := container.GenerateID()
rootdir := "/tmp/gocount/" + id + "/rootfs"
rootfs.EnsureRootfs(rootdir)
cgPath, _ := cgroups.Create(id)
cgroups.SetMemoryLimit(cgPath, flagMemory)
cgroups.SetCPUQuota(cgPath, flagCPU)
command := exec.Command("/proc/self/exe", append([]string{"run"}, args...)...)
command.Env = append(os.Environ(),
"GOCOUNT_CHILD=1",
"GOCOUNT_CONTAINER_ID="+id,
"GOCOUNT_ROOTFS="+rootdir,
)
command.SysProcAttr = &syscall.SysProcAttr{
Cloneflags: syscall.CLONE_NEWUTS |
syscall.CLONE_NEWPID |
syscall.CLONE_NEWNS |
syscall.CLONE_NEWNET,
}
command.Start()
network.SetupVethPair(id, command.Process.Pid) // host-side networking
c := &container.Container{
ID: id, Pid: command.Process.Pid,
Command: args, Status: "running",
RootFs: rootdir, Cgroup: cgPath,
}
container.Containers[id] = c
container.SaveContainer(c)
command.Wait()
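One detail the snippet leaves implicit: the same run command serves both roles, so the first thing its handler does is check whether it is the re-exec'd child. Roughly (runParent is an illustrative name for the parent-side code above):

```go
// cmd/run.go (start of the run handler, sketch)
Run: func(cmd *cobra.Command, args []string) {
	if os.Getenv("GOCOUNT_CHILD") == "1" {
		childSetup(args) // we are the re-exec'd copy, already inside the new namespaces
		return
	}
	runParent(args) // illustrative wrapper around the parent-side code shown above
},
```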
Step 6: The run Command (Child Side)
When the child detects GOCOUNT_CHILD=1 it calls childSetup:
func childSetup(args []string) {
rootfsPath := os.Getenv("GOCOUNT_ROOTFS")
containerID := os.Getenv("GOCOUNT_CONTAINER_ID")
// Join the pre-created cgroup
cgPath := filepath.Join("/sys/fs/cgroup", "gocount", containerID)
cgroups.AddProc(cgPath, os.Getpid())
// Switch to the isolated filesystem
container.SetupMount(rootfsPath)
// Set container hostname
syscall.Sethostname([]byte("gocount"))
// Wait up to 5s for the parent to wire the veth pair
for i := 0; i < 50; i++ {
if exec.Command("ip", "link", "show", "eth0").Run() == nil {
break
}
time.Sleep(100 * time.Millisecond)
}
// Configure eth0 and default route
network.SetupNetworkInsideContainer()
// Replace this process with the user's command — it becomes PID 1
syscall.Exec(args[0], args, os.Environ())
}
The syscall.Exec call is critical. It replaces the Go runtime with the container's process so that the user's command is PID 1 inside the container.
Step 7: Filesystem Isolation with pivot_root
SetupMount is called inside the child, after it enters the new mount namespace:
// internal/container/mount.go
func SetupMount(rootfs string) error {
rootfs, _ = filepath.Abs(rootfs)
// Make the current mount tree private so host cannot see our changes
syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, "")
// Bind mount rootfs onto itself (kernel requirement for pivot_root)
syscall.Mount(rootfs, rootfs, "", syscall.MS_BIND|syscall.MS_REC, "")
// Create a directory to receive the old root
putold := filepath.Join(rootfs, ".pivot_root")
os.MkdirAll(putold, 0700)
// Swap the root mount
syscall.PivotRoot(rootfs, putold)
os.Chdir("/")
// Detach and remove the old root
syscall.Unmount("/.pivot_root", syscall.MNT_DETACH)
os.RemoveAll("/.pivot_root")
// Mount essential virtual filesystems
syscall.Mount("proc", "/proc", "proc", 0, "")
syscall.Mount("sysfs", "/sys", "sysfs", 0, "")
syscall.Mount("tmpfs", "/dev", "tmpfs", syscall.MS_NOSUID|syscall.MS_STRICTATIME, "mode=755")
createDeviceNodes() // /dev/null, /dev/zero, /dev/urandom, etc.
setupDNS() // write 8.8.8.8 to /etc/resolv.conf
return nil
}
After PivotRoot, / is the Alpine rootfs. The host's /home, /etc, /proc — none of it is visible.
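createDeviceNodes and setupDNS are called in SetupMount but not shown. A plausible sketch of both (the device list and major/minor numbers are the standard Linux ones; Mkdev comes from golang.org/x/sys/unix):

```go
// internal/container/mount.go (sketch)

// createDeviceNodes creates the character devices a shell and most tools expect.
func createDeviceNodes() error {
	devices := []struct {
		path         string
		major, minor uint32
	}{
		{"/dev/null", 1, 3},
		{"/dev/zero", 1, 5},
		{"/dev/random", 1, 8},
		{"/dev/urandom", 1, 9},
		{"/dev/tty", 5, 0},
	}
	for _, d := range devices {
		dev := int(unix.Mkdev(d.major, d.minor))
		if err := syscall.Mknod(d.path, syscall.S_IFCHR|0666, dev); err != nil && !os.IsExist(err) {
			return err
		}
	}
	return nil
}

// setupDNS points the container at a public resolver.
func setupDNS() error {
	return os.WriteFile("/etc/resolv.conf", []byte("nameserver 8.8.8.8\n"), 0644)
}
```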
Step 8: Networking with veth Pairs
// internal/network/network.go (host side)
func SetupVethPair(containerID string, pid int) error {
hostIf := fmt.Sprintf("veth-%s", containerID[:8]) // e.g. veth-a1b2c3d4
containerIf := fmt.Sprintf("vethc-%s", containerID[:7]) // temporary name
// Create the pair
exec.Command("ip", "link", "add", hostIf, "type", "veth", "peer", "name", containerIf).Run()
// Move one end into the container's network namespace
exec.Command("ip", "link", "set", containerIf, "netns", fmt.Sprintf("%d", pid)).Run()
// Rename to eth0 inside the container namespace
exec.Command("nsenter", "-t", fmt.Sprintf("%d", pid), "-n",
"ip", "link", "set", containerIf, "name", "eth0").Run()
// Bring up and address the host end
exec.Command("ip", "link", "set", hostIf, "up").Run()
exec.Command("ip", "addr", "add", "10.0.0.1/24", "dev", hostIf).Run()
EnableIPForwarding()
SetupNAT() // iptables MASQUERADE for 10.0.0.0/24
return nil
}
Inside the container, SetupNetworkInsideContainer mirrors this:
exec.Command("ip", "link", "set", "lo", "up").Run()
exec.Command("ip", "link", "set", "eth0", "up").Run()
exec.Command("ip", "addr", "add", "10.0.0.2/24", "dev", "eth0").Run()
exec.Command("ip", "route", "add", "default", "via", "10.0.0.1").Run()
The container's DNS is 8.8.8.8 (written by setupDNS into /etc/resolv.conf inside the new rootfs).
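EnableIPForwarding and SetupNAT are thin wrappers around /proc and iptables. A sketch consistent with the MASQUERADE rule shown in the walk-through later (a fuller version would check for an existing rule with iptables -C before appending):

```go
// internal/network/network.go (sketch)

// EnableIPForwarding turns on IPv4 forwarding so the host will route container traffic.
func EnableIPForwarding() error {
	return os.WriteFile("/proc/sys/net/ipv4/ip_forward", []byte("1"), 0644)
}

// SetupNAT masquerades traffic from the container subnet so replies find their way back.
func SetupNAT() error {
	return exec.Command("iptables", "-t", "nat", "-A", "POSTROUTING",
		"-s", "10.0.0.0/24", "-j", "MASQUERADE").Run()
}
```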
Step 9: Container Lifecycle Commands
ps — list containers
Reads every *.json file from /tmp/gocount/ and prints a table:
CONTAINER ID PID STATUS COMMAND
a1b2c3d4 12345 running [/bin/sh]
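The handler behind that output is a few lines, assuming it reuses LoadContainers and text/tabwriter for the columns:

```go
// cmd/ps.go (sketch)
containers, err := container.LoadContainers()
if err != nil {
	fmt.Fprintln(os.Stderr, err)
	return
}
w := tabwriter.NewWriter(os.Stdout, 0, 8, 3, ' ', 0)
fmt.Fprintln(w, "CONTAINER ID\tPID\tSTATUS\tCOMMAND")
for _, c := range containers {
	fmt.Fprintf(w, "%s\t%d\t%s\t%v\n", c.ID, c.Pid, c.Status, c.Command)
}
w.Flush()
```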
stop — send SIGKILL
syscall.Kill(c.Pid, syscall.SIGKILL)
c.Status = "stopped"
container.SaveContainer(c)
rm — stop and remove metadata
Kills the process (if still running), removes the JSON file, and deletes it from the in-memory map.
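A sketch of the rm handler, including the nil check called out in the bug table below:

```go
// cmd/stop.go (rm handler, sketch)
c := container.Containers[id]
if c == nil {
	fmt.Printf("container %s not found\n", id)
	return
}
if isProcessRunning(c.Pid) {
	syscall.Kill(c.Pid, syscall.SIGKILL) // stop it first if it's still alive
}
os.Remove(fmt.Sprintf("/tmp/gocount/%s.json", id)) // drop the on-disk metadata
delete(container.Containers, id)                   // drop the in-memory entry
```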
inspect — detailed view
Shows PID, status, cgroup resource limits, memory usage, CPU time, and every namespace link:
Container Information:
ID: ar9vb2mm
Status: running
PID: 57699
Command: [/bin/sh]
RootFS: /tmp/gocount/ar9vb2mm/rootfs
Cgroup: /sys/fs/cgroup/gocount/ar9vb2mm
Process Status:
Running: Yes
State: S (sleeping)
Resource Limits:
Memory Limit: unlimited
Memory Usage: 319488 bytes (312.00 KB)
Memory Peak: 2838528 bytes (2.71 MB)
CPU Quota: max/100000 (0.0%)
CPU Time: 36.04ms
Memory Events:
Processes in Cgroup: 1
PID: 57699
Namespaces:
cgroup: cgroup:[4026531835]
ipc: ipc:[4026531839]
mnt: mnt:[4026533179]
net: net:[4026533458]
pid: pid:[4026533456]
pid_for_children: pid:[4026533456]
time: time:[4026531834]
time_for_children: time:[4026531834]
user: user:[4026531837]
uts: uts:[4026533455]
isProcessRunning sends syscall.Signal(0), a no-op signal that returns an error only if the process doesn't exist:
func isProcessRunning(pid int) bool {
process, err := os.FindProcess(pid)
if err != nil {
return false
}
return process.Signal(syscall.Signal(0)) == nil
}
Step 10: Build and Run
# Build
make build
# Binary lands at bin/gocount
# Run an interactive Alpine shell
sudo ./bin/gocount run /bin/sh
# Run with resource limits
sudo ./bin/gocount run --memory 100M --cpu "50000 100000" /bin/sh
# In another terminal — list containers
sudo ./bin/gocount ps
# Inspect
sudo ./bin/gocount inspect <id>
# Stop
sudo ./bin/gocount stop <id>
# Remove
sudo ./bin/gocount rm <id>
Testing the memory limit
test.py allocates 1 MB per iteration. With a 50 MB limit you can watch it get OOM-killed:
make test-memory
# Copies test.py into the rootfs, then runs:
# sudo gocount run --memory 50M /usr/bin/python3 /test.py
Output:
Starting memory consumption...
Allocated: 1 MB
Allocated: 2 MB
...
Allocated: 48 MB
Allocated: 49 MB
Killed
The OOM killer fires once the cgroup can no longer stay under memory.max (swap is disabled, so there is nowhere to reclaim to), and memory.oom.group=1 makes it terminate the entire cgroup.
How It All Fits Together: End-to-End Walk-through
$ sudo ./bin/gocount run --memory 50M /bin/sh
1. GenerateID() → "a1b2c3d4"
2. EnsureRootfs(...) → download Alpine if missing
3. cgroups.Create("a1b2c3d4") → mkdir /sys/fs/cgroup/gocount/a1b2c3d4
4. SetMemoryLimit(..., "50M") → write 52428800 to memory.max
5. exec.Command("/proc/self/exe", "run", "/bin/sh")
+ GOCOUNT_CHILD=1
+ CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET
6. command.Start() → child PID = 12345
7. SetupVethPair("a1b2c3d4", 12345)
→ ip link add veth-a1b2c3d4 type veth peer name vethc-a1b2c3
→ ip link set vethc-a1b2c3 netns 12345
→ nsenter -t 12345 -n ip link set vethc-a1b2c3 name eth0
→ ip addr add 10.0.0.1/24 dev veth-a1b2c3d4
→ iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
8. SaveContainer(...) → /tmp/gocount/a1b2c3d4.json
--- inside child (GOCOUNT_CHILD=1) ---
9. AddProc(cgPath, getpid()) → write PID to cgroup.procs
10. SetupMount(rootfsPath) → pivot_root + mount /proc /sys /dev
11. Sethostname("gocount")
12. poll until eth0 appears → up to 5 s
13. SetupNetworkInsideContainer()
→ ip link set lo up
→ ip link set eth0 up
→ ip addr add 10.0.0.2/24 dev eth0
→ ip route add default via 10.0.0.1
14. syscall.Exec("/bin/sh", ...) → you are now in the container
Key Bugs Fixed Along the Way
Building a container runtime means touching the kernel directly. A few subtle bugs appeared during development:
| Bug | Impact | Fix |
|---|---|---|
| Duplicate `init()` in `inspect.go` | Compile error | Remove the second registration |
| `process.Signal(os.Signal(nil))` | Panic: nil interface type assertion | Use `syscall.Signal(0)` |
| `removeCmd` missing nil check | Panic when container not found | Guard `c == nil` before access |
| `AddContainer` called after `Containers[id] = c` | Overwrites `Cgroup` field with zero value | Remove the redundant `AddContainer` call |
| Zip-slip in `extractTar` | Malicious tar writes outside rootfs | `strings.HasPrefix` boundary check |
| `startCmd` missing `CLONE_NEWNET` | Container shares host network stack | Add `CLONE_NEWNET` to clone flags |
| `json.Unmarshal` errors ignored | Corrupted JSON inserts zero-value container | Explicit error check + `continue` |
| `rand.Seed(time.Now().UnixNano())` | Deprecated in Go 1.20+ | Removed: global source is auto-seeded |
What's Missing (and How to Add It)
gocount is deliberately minimal. Here's what a production runtime adds on top:
| Feature | What to do |
|---|---|
| User namespace (`CLONE_NEWUSER`) | Map container root to an unprivileged host UID — no more `sudo` |
| IPC namespace (`CLONE_NEWIPC`) | Isolate System V IPC objects and POSIX message queues |
| Rootfs layers / overlay | Use overlayfs so containers share a read-only base and get a private writable layer |
| Container images | Pull from an OCI registry (`skopeo` / `go-containerregistry`) |
| seccomp filter | Block dangerous syscalls (`ptrace`, `mount`, `reboot`) using libseccomp |
| Capabilities drop | Drop unneeded capabilities (e.g. `CAP_NET_ADMIN`) after setup, via `SysProcAttr.AmbientCaps` and capset(2)/libcap |
| Port forwarding | Add iptables DNAT rules: host port to container IP:port |
| Persistent storage | Bind-mount host directories into the container before `pivot_root` |
| Multi-container networking | Replace individual veth pairs with a Linux bridge (like Docker's `docker0`) |
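The first row is the biggest quality-of-life win, because it removes the sudo requirement. A sketch of the extra SysProcAttr fields it would take (not part of gocount today):

```go
// Sketch: add a user namespace and map root inside the container to the invoking user.
command.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID |
		syscall.CLONE_NEWNS | syscall.CLONE_NEWNET | syscall.CLONE_NEWUSER,
	UidMappings: []syscall.SysProcIDMap{
		{ContainerID: 0, HostID: os.Getuid(), Size: 1}, // root inside == you outside
	},
	GidMappings: []syscall.SysProcIDMap{
		{ContainerID: 0, HostID: os.Getgid(), Size: 1},
	},
}
```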
Full Source
The complete source for gocount is structured exactly as shown above. The key files:
- `cmd/run.go` — the parent/child split, namespace flags, cgroup wiring
- `internal/container/mount.go` — `pivot_root` and essential mounts
- `internal/cgroups/cgroups.go` — cgroup v2 create, limit, add-proc
- `internal/network/network.go` — veth pair, IP forwarding, NAT
- `internal/rootfs/manager.go` — Alpine download and zip-slip-safe extraction
gocount is ~600 lines of Go, and it's a real container runtime. The kernel does all of this every time you type docker run. Now you know exactly how.