Ravi Kishan
Building a Container Runtime from Scratch with Go (MyDocker)

Introduction

Containers are lightweight, isolated environments for running applications. Unlike virtual machines, containers share the host kernel but isolate processes using kernel features like namespaces and cgroups. In this tutorial we’ll explore MyDocker, a simple container runtime written in Go, and walk through the key pieces that make it work. We’ll explain how MyDocker uses Linux namespaces (UTS, PID, network, mount, etc.) to isolate the container’s view of the system, and how it uses cgroups to limit resources. Along the way we’ll show key code snippets (e.g. startContainer, child, ExecContainer) and outline MyDocker’s architecture for CLI commands, networking, and image handling. By the end, you’ll see how MyDocker’s internals fit together, and how OCI image unpacking with tools like umoci plus simple mount/iptables tricks provide the functionality of docker pull, run, exec, and more. We encourage you to check out the MyDocker GitHub repository and its README for examples and to try extending the code yourself.

Namespaces

Linux namespaces provide isolation by giving a process its own “view” of system resources. For example, PID namespaces mean a process can see only processes in its own container; UTS namespaces let each container have its own hostname; mount namespaces give each container its own filesystem root; network namespaces provide separate network interfaces/iptables, and so on. In other words, namespaces prevent processes in one container from seeing or affecting those in another. This isolation is a cornerstone of container security: “namespaces are quite flexible…they can be applied individually or in groups to one or more processes”. MyDocker’s container “child” process is created with clone flags for multiple namespaces (UTS, PID, NET, mount, etc.), so it runs in isolation. The code roughly follows the pattern:

package main

import (
	"os"
	"os/exec"
	"syscall"
)

func must(err error) {
	if err != nil {
		panic(err)
	}
}

// run() handles "mydocker run": re-invoke this binary as "child" in new namespaces.
func run() {
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNET |
			syscall.CLONE_NEWNS, // etc.
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	must(cmd.Run())
}

// child() runs inside the new namespaces:
func child() {
	// e.g., set hostname inside the UTS namespace
	must(syscall.Sethostname([]byte("mydocker")))
	// Possibly mount proc, chroot, pivot_root, etc.
	// Finally exec the requested command:
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	must(cmd.Run())
}

As in other “container-from-scratch” examples, this pattern uses exec.Command("/proc/self/exe", "child", …) to re-invoke the same program in a new process (in new namespaces). From outside, the run command spawns the namespaced child; inside the container, the child function does setup (hostname, filesystem mounts, etc.) and then executes the user’s command. This effectively gives the container its own isolated environment separate from the host.

Figure: Container runtime architecture. MyDocker creates a namespaced “child” process for each container, mounts a fresh root filesystem (unpacked from an OCI image), applies cgroup limits, and sets up networking (bridge mydocker0 and veth pair). The CLI (main.go) drives commands like run, exec, pull, etc.

In summary, namespaces ensure that processes in MyDocker containers see only their own CPU, memory, filesystems, networks, and process list. This is one of the fundamental layers of container isolation. (MyDocker’s README also highlights this as a core feature: “Container process isolation using Linux namespaces (UTS, PID, NET, MNT)”.)
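One way to see this isolation concretely is to compare namespace identifiers. The small diagnostic below is not part of MyDocker; it simply prints the namespace links from /proc. Run it on the host and again inside a container, and the inode numbers will differ for every namespace created with a CLONE_NEW* flag:

package main

import (
	"fmt"
	"os"
)

// Print this process's namespace identifiers.
func main() {
	for _, ns := range []string{"uts", "pid", "net", "mnt"} {
		target, err := os.Readlink("/proc/self/ns/" + ns)
		if err != nil {
			continue
		}
		fmt.Printf("%s -> %s\n", ns, target) // e.g. "uts -> uts:[4026532281]"
	}
}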

Control Groups (cgroups)

While namespaces isolate what a process sees, cgroups control how much of a resource a process can use. A control group (cgroup) is a Linux kernel feature that limits, accounts for, and isolates resource usage (CPU, memory, I/O, etc.) of a group of processes. MyDocker uses cgroups so that each container process has its own CPU and memory limits (as defined by CLI flags). For example, it might create a new cgroup under /sys/fs/cgroup/cpu/mydocker/<id> and write the PID of the container process into it, along with any limits (like cpu.shares or memory.limit_in_bytes). This means the container will only get a fraction of CPU or a fixed amount of memory, protecting the host and other containers.

“Cgroups are a kernel feature that allows you to partition and limit the system resources… that a group of processes can use. Think of them as virtual cages where you can corral your processes and set rules for their behavior”. (This quote describes cgroups in general; MyDocker uses them specifically for CPU and memory limits.)

For instance, MyDocker might contain code like:

// Simplified: create a cpu cgroup for the container (cgroup v1 layout)
cgroupPath := filepath.Join("/sys/fs/cgroup/cpu/mydocker", containerID)
must(os.MkdirAll(cgroupPath, 0755))
// set a CPU limit, then place the container process into the cgroup
must(os.WriteFile(filepath.Join(cgroupPath, "cpu.shares"), []byte("512"), 0644))
must(os.WriteFile(filepath.Join(cgroupPath, "cgroup.procs"), []byte(pidStr), 0644))
// memory limits (memory.limit_in_bytes) live under the separate
// /sys/fs/cgroup/memory hierarchy and are set the same way

This ensures each container is “in its own cage”. The MyDocker README emphasizes resource control: “Resource control via cgroups (CPU & Memory)”. By combining cgroups with namespaces, the runtime isolates both the view and resources of each container process, similarly to how Docker does it behind the scenes.

Project Architecture

MyDocker’s source is structured simply (see the Architecture Overview section of the README). At the top level we have:

  • main.go – The CLI entry point: parses commands like run, exec, pull, etc. It calls functions like startContainer() or ExecContainer() accordingly.

  • cgroups/ – Go package to set up cgroup directories and limits.

  • network/ – Go package to create a bridge (mydocker0) and virtual ethernet (veth) pairs for container networking.

  • helper.c – A small C program that is used to join existing namespaces (see below).

  • Other files: e.g. configuration templates, documentation.

The architecture overview in the README summarizes this:

“Architecture Overview: main.go — The CLI and container runtime entry point; cgroups/ — resource limits; network/ — bridge networks/veth; /var/lib/mydocker/ — stores metadata/images”.

So when you run something like sudo ./mydocker run -v /data:/data -p 8080:80 ubuntu:22.04 sh, what happens? A brief walkthrough:

  1. CLI (main.go): The program parses the run command and container options (image name, mounts, ports, etc.). It likely calls a function startContainer(containerID, command, flags...).

  2. startContainer: This function creates a unique container ID, sets up storage (image unpack, rootfs), sets up networking, then spawns a child process in new namespaces (using exec.Command("/proc/self/exe", ...) as above). It also records container metadata (like PID, rootfs path) under /var/lib/mydocker/ (a sketch of such a record follows this list).

  3. Child process: In the new namespaces, it does further initialization: e.g., performs chroot/pivot_root into the unpacked image root filesystem; mounts the host directory if -v was given; configures networking (calls net.SetupVeth() to move one end of a veth into this namespace); then finally execs the user’s command (e.g. sh).

  4. Networking: The network/ code ensures there is a Linux bridge (e.g. mydocker0) on the host. It creates a veth pair, moves one end into the child’s net namespace (as eth0), and attaches the other end to the mydocker0 bridge, giving the container network access (with its own IP).

  5. Cgroups: Concurrently, before or after spawning the child, MyDocker creates new cgroup entries under CPU/memory controllers and adds the child’s PID, enforcing resource limits on that container.

  6. Image Unpacking: For pulling images, MyDocker uses umoci. The pull command downloads an OCI image (like Ubuntu) and uses umoci unpack to extract it into a directory under /var/lib/mydocker. The umoci documentation explains: “umoci unpack - Unpacks an OCI image tag into a runtime bundle”. In other words, MyDocker relies on umoci to convert a container image into a rootfs directory with config, just like Docker does with layered filesystems.
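To make step 2’s metadata concrete, a per-container record might look like the sketch below. The struct fields, file name, and helper are my own assumptions for illustration, not the actual MyDocker source:

package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// containerInfo is a hypothetical metadata record persisted per container.
type containerInfo struct {
	ID      string   `json:"id"`
	Pid     int      `json:"pid"`
	Command []string `json:"command"`
	Rootfs  string   `json:"rootfs"`
	IP      string   `json:"ip"`
}

// saveContainerInfo writes the record under /var/lib/mydocker/<id>/ so that
// later commands like ps, exec, and stop can look the container up.
func saveContainerInfo(info containerInfo) error {
	dir := filepath.Join("/var/lib/mydocker", info.ID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	data, err := json.Marshal(info)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "state.json"), data, 0o644)
}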

Figure: MyDocker architecture.

Command-Line Interface and main.go

The main.go file defines the CLI commands (run, exec, ps, etc.) using a library like urfave/cli. For example, when the user runs mydocker run ..., it ends up invoking something like:

func runCommandAction(ctx *cli.Context) error {
    imageName := ctx.Args().Get(0)
    cmdArgs := ctx.Args().Tail() // the command to run and its arguments
    containerID := generateID()
    // flags such as -v, -p, -m are read off ctx inside startContainer
    startContainer(containerID, imageName, cmdArgs, ctx)
    return nil
}

The core functions include startContainer (for run) and ExecContainer (for exec). In MyDocker, startContainer will unpack the image if needed, set up cgroups/networks, then clone a new process.

The ExecContainer function (called by mydocker exec) is also interesting. It finds the PID of an existing container process (by reading stored metadata) and then uses exec.Command("/proc/self/exe", "exec") to re-invoke the binary as a helper process that enters that container’s namespaces (via a small C helper). Part of the code might look like:

func ExecContainer(containerID string, comArray []string) {
    pid, err := getPidByContainerId(containerID)
    if err != nil {
        log.Fatalf("container %s not found: %v", containerID, err)
    }
    // Environment variables picked up by the helper C program before main() runs:
    os.Setenv("mydocker_pid", pid)
    os.Setenv("mydocker_cmd", strings.Join(comArray, " "))
    cmd := exec.Command("/proc/self/exe", "exec")
    cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
    must(cmd.Run())
}

This tells the re-invoked process to run the "exec" branch of main(). The helper C code (in helper.c) sees mydocker_pid/mydocker_cmd and performs setns() calls to join the PID/UTS/NET namespaces of that container, then uses system(mydocker_cmd) to run the desired command inside the container. (In effect, mydocker exec adds your shell to the existing container’s namespaces, letting you run commands inside it.) The key idea is: we spawn a new copy of our own binary that joins the container’s namespaces and execs the command as if we were inside the container.
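Why a C helper at all? setns(2) cannot move a multithreaded process into a new PID namespace, and the Go runtime spawns threads before main() ever runs. The standard trick is a constructor that executes before the Go runtime starts. The sketch below shows that trick as a cgo preamble in a Go file; MyDocker keeps the equivalent logic in helper.c, and the exact namespace list and error handling here are my assumptions:

package nsenter

/*
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Runs before the Go runtime starts any threads, so setns() can still
// join the target container's namespaces.
__attribute__((constructor)) void enter_namespace(void) {
	char *pid = getenv("mydocker_pid");
	char *cmd = getenv("mydocker_cmd");
	if (!pid || !cmd) {
		return; // normal invocation (run, pull, ...): let Go proceed
	}
	const char *namespaces[] = {"ipc", "uts", "net", "pid", "mnt"};
	char path[1024];
	for (int i = 0; i < 5; i++) {
		snprintf(path, sizeof(path), "/proc/%s/ns/%s", pid, namespaces[i]);
		int fd = open(path, O_RDONLY);
		if (fd < 0 || setns(fd, 0) == -1) {
			fprintf(stderr, "setns %s failed: %s\n", namespaces[i], strerror(errno));
			exit(1);
		}
		close(fd);
	}
	exit(system(cmd)); // run the command inside the joined namespaces
}
*/
import "C"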

Container Lifecycle (startContainer, child)

Putting it all together, MyDocker’s container lifecycle roughly is:

  1. Start: User runs mydocker run .... The Go main.go calls startContainer(...).

  2. Setup: In startContainer, MyDocker creates a directory for the container under /var/lib/mydocker/<id>. It unpacks the OCI image rootfs there (if not done already), creates cgroups, sets up network (bridge/veth), and prepares the command arguments.

  3. Clone (run): The code then does something like:

    cmd := exec.Command("/proc/self/exe", append([]string{"child"}, commandArgs...)...)
    cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID |
            syscall.CLONE_NEWNET | syscall.CLONE_NEWNS,
        // UidMappings/GidMappings would additionally require syscall.CLONE_NEWUSER:
        UidMappings: ...,
        GidMappings: ...,
    }
    cmd.Run() // blocks until the container process exits

    This effectively forks into a child process that will run mydocker child ....

  4. Child setup: In the child() function inside Go, the process is now in fresh namespaces. It performs actions similar to Docker’s container init:

- Calls `syscall.Chroot()` or `pivot_root()` to switch the root filesystem to the unpacked image (a minimal `pivot_root` sketch follows this list).

- Applies mounts and volume binds (`syscall.Mount()`) for any `-v host:container` flags.

- Calls `net.SetupVeth()` to move one end of the veth into this namespace and bring up the network interface.

- Executes the container’s init process (e.g., `exec.Command("/bin/sh")`).
  5. Run: The container process (PID 1 in its PID namespace) starts and runs user code. Meanwhile, the parent Go process (still the original CLI) waits (or returns immediately if detached).

  6. Stop: When the container’s main process exits (e.g. you exit the shell), the child function returns. startContainer (the parent) will then do cleanup: remove cgroups, delete metadata, etc.
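For the root-filesystem switch in step 4, chroot is the simplest option, but pivot_root is what production runtimes use because it detaches the old root entirely. Here is a minimal sketch; the helper name and error handling are mine, and MyDocker may well use plain chroot instead:

import (
	"os"
	"syscall"
	"path/filepath"
)

// pivotRoot makes newRoot the new "/" and unmounts the old root so the
// host filesystem is no longer reachable from inside the container.
func pivotRoot(newRoot string) error {
	// Systemd makes "/" a shared mount; make it private so pivot_root works.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return err
	}
	// pivot_root requires newRoot to be a mount point: bind-mount it onto itself.
	if err := syscall.Mount(newRoot, newRoot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return err
	}
	putOld := filepath.Join(newRoot, ".pivot_old")
	if err := os.MkdirAll(putOld, 0o700); err != nil {
		return err
	}
	if err := syscall.PivotRoot(newRoot, putOld); err != nil {
		return err
	}
	if err := os.Chdir("/"); err != nil {
		return err
	}
	// Detach and remove the old root.
	if err := syscall.Unmount("/.pivot_old", syscall.MNT_DETACH); err != nil {
		return err
	}
	return os.Remove("/.pivot_old")
}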

The code for startContainer and child in MyDocker is similar in spirit to many "from-scratch Docker" examples. The parent uses exec.Command("/proc/self/exe", "child", …) to create the child. Inside child, Go code uses syscalls (Mount, Chroot, Sethostname, etc.) to finish setting up the namespace, then exec.Command to run the actual command. Though we can’t cite the MyDocker source directly here, conceptually it follows the pattern above.

Networking

By default, containers should be able to talk to each other and the outside world (on certain ports). MyDocker’s network/ package likely does the following (in Go, via the vishvananda/netlink library or raw syscalls):

  • Ensure a Linux bridge exists (named mydocker0). If not, create it with ip link add name mydocker0 type bridge.

  • For each new container, create a veth pair (e.g. veth-host and veth-container). Move veth-container into the new namespace of the child (with netlink.LinkSetNsFd) and rename it to eth0.

  • Bring up eth0 inside the container and assign it an IP (e.g. via ip addr).

  • Attach veth-host to the mydocker0 bridge and bring it up on the host.

This way, containers get an IP on the mydocker0 network, and host-side NAT (iptables MASQUERADE) handles outbound traffic and port mapping. MyDocker’s README mentions a custom bridge network (mydocker0) and veth pairs for network isolation. (In our blog, we won’t go into all the iptables details, but that’s how host ports get forwarded: the CLI probably installs iptables -t nat rules so that -p host:container flags forward TCP from the host port to the container’s IP:port.)
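Here is a hedged sketch of the host side of that veth setup with vishvananda/netlink. The function name, interface-naming scheme, and error handling are assumptions for illustration, not MyDocker’s actual code:

package network

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// setupVeth creates a veth pair, attaches the host end to mydocker0,
// and moves the peer into the container's network namespace.
// Assumes containerID is at least 5 characters long.
func setupVeth(containerPid int, containerID string) error {
	hostName := "veth-" + containerID[:5]
	peerName := "ceth-" + containerID[:5]

	br, err := netlink.LinkByName("mydocker0")
	if err != nil {
		return fmt.Errorf("bridge mydocker0 not found: %w", err)
	}

	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: hostName},
		PeerName:  peerName,
	}
	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}
	// Attach the host end to the bridge and bring it up.
	if err := netlink.LinkSetMaster(veth, br); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(veth); err != nil {
		return err
	}
	// Move the peer into the container's namespace; inside, child()
	// renames it to eth0, assigns an IP, and brings it up.
	peer, err := netlink.LinkByName(peerName)
	if err != nil {
		return err
	}
	return netlink.LinkSetNsPid(peer, containerPid)
}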

Image Handling and OCI Unpacking

Container images (like ubuntu:22.04) come in the OCI/Docker image format. MyDocker supports pulling and unpacking these images using the umoci tool. Since umoci itself does not talk to registries, the mydocker pull command likely first fetches the image into a local OCI layout (e.g. with a tool like skopeo) and then shells out to umoci unpack. As the umoci manual states: “umoci unpack – Unpacks an OCI image tag into a runtime bundle”. In practice, MyDocker runs something like:

# "ubuntu:22.04" names a tag in a local OCI layout, not a registry reference
umoci unpack --image ubuntu:22.04 /var/lib/mydocker/images/ubuntu_22.04

This extracts the image layers and config into a directory that becomes the container’s root filesystem. MyDocker then stores this location (under /var/lib/mydocker/) so that future run commands can use the rootfs directly without re-pulling. The unpacked bundle includes /rootfs with all files and a JSON config. MyDocker’s Go code then chroots into that /rootfs when starting a container.

Using umoci (or a similar OCI utility) is important because it handles the layered filesystem format of images. Without overlay support, MyDocker unpacks into a single directory per image. The README cites “Simple OCI image unpacking using umoci” as a feature.

Volumes and Port Mappings

Volumes (-v): If the user specifies -v /host/dir:/container/dir, MyDocker will mount a bind mount inside the container namespace. The Go code in child() would do something like:

os.MkdirAll("/container/dir", 0o755) // the target must exist inside the new root
must(syscall.Mount("/host/dir", "/container/dir", "", syscall.MS_BIND, ""))

after switching to the new root. This makes the host’s directory appear at the given path in the container. It relies on the mount namespace being isolated, so only the container sees the mount (the host’s filesystem is shared, but the namespace separates the mount list).

Port mappings (-p): For -p hostPort:containerPort, the runtime must forward network traffic. A simple approach: on the host, use iptables to NAT traffic from hostPort into the container’s IP and containerPort. For example:

iptables -t nat -A PREROUTING -p tcp --dport <hostPort> -j DNAT --to-destination <containerIP>:<containerPort>
# masquerade outbound traffic from the container subnet
iptables -t nat -A POSTROUTING -s <containerSubnet> -j MASQUERADE

This way, if a service inside the container listens on containerPort, requests arriving at hostPort on the host will reach it. (One subtlety: PREROUTING only matches traffic entering from outside; connections generated on the host itself, e.g. curl localhost:8080, additionally need a matching rule in the nat OUTPUT chain.) MyDocker’s implementation likely sets up such iptables rules automatically when starting the container, and removes them on stop. This achieves “port mapping” similar to Docker.
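On the Go side, installing the DNAT rule is typically just a matter of shelling out to the iptables binary. A small hypothetical wrapper (the function name and error handling are mine, mirroring the rule above):

package network

import (
	"fmt"
	"os/exec"
	"strconv"
)

// mapPort forwards TCP traffic arriving on hostPort to the container.
// Removing the rule on stop would reuse the same arguments with -D.
func mapPort(hostPort, containerPort int, containerIP string) error {
	args := []string{
		"-t", "nat", "-A", "PREROUTING",
		"-p", "tcp", "--dport", strconv.Itoa(hostPort),
		"-j", "DNAT", "--to-destination",
		fmt.Sprintf("%s:%d", containerIP, containerPort),
	}
	if out, err := exec.Command("iptables", args...).CombinedOutput(); err != nil {
		return fmt.Errorf("iptables failed: %v: %s", err, out)
	}
	return nil
}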


Diagrams

Below are some visual aids (using PlantUML) to illustrate the architecture and isolation concepts. These diagrams are simplified schematics.

Diagram 1: The host CLI and services collaborating to create an isolated container process.

Diagram 2: The container process with its own namespace instances, separated from the host.

These diagrams outline how the host CLI and services collaborate to create an isolated container process. In particular, the second diagram emphasizes that once the container process (C) is created, it has its own namespace instances and cannot see the host’s unrelated processes or network stacks.

Getting Started and Resources

MyDocker is intended as a learning tool to explore container internals. It “demonstrates how containerization works under the hood” and helps you understand namespaces, cgroups, and image formats. You can try it out by following the README instructions: for example, pull an image (mydocker pull ubuntu:22.04) and run it (mydocker run -it -v /data:/data -p 8080:80 ubuntu:22.04 bash) as shown in the usage examples. You’ll see your process isolated (check ps and hostname inside the container, and network connectivity to the bridge).

We encourage you to explore the Project Files (which includes source code and examples) and even extend it. Possible exercises include adding features like Dockerfile builds, overlayfs support, or enhanced networking. By tinkering with MyDocker’s code, you’ll gain a deep understanding of what happens in every docker run command. Happy hacking and containerizing!

Check out the GitHub repo:

🐳 MyDocker

MyDocker is a lightweight container runtime built from scratch in Go that demonstrates how containerization works under the hood — similar to Docker but in a simplified form.

👉 GitHub Repository


🚀 Features

  • Container process isolation using Linux namespaces (UTS, PID, NET, MNT)
  • Resource control via cgroups (CPU & Memory)
  • Volume mounting (-v host:container)
  • Port mapping support (-p host:container)
  • Simple OCI image unpacking using umoci
  • Container image pulling, listing, and execution
  • Custom bridge network (mydocker0) and veth pairs for network isolation
  • Command-line interface similar to Docker (run, exec, ps, stop, pull, images, version)

🎯 Architecture

(Architecture diagram: see the repository README.)

📦 Installation

Requirements

  • Go 1.19+
  • Root access (for namespace and networking)
  • umoci tool installed and in $PATH
  • Linux OS (recommended: Ubuntu)

git clone https://github.com/ravikisha/mydocker.git
cd mydocker
go build -o mydocker main.go
sudo ./mydocker version