Rise of Docker project
Docker was introduced in 2013 by "dotCloud" (later renamed Docker Inc.) and has since become the de facto standard for containerization. It also brought container technology within reach of every engineer, not only those working on infrastructure, by making it easy to set up a common development environment for engineers working together.
Internally, Docker is built around the container: a lightweight, standalone, executable software package that bundles an application with all of its dependencies, including code, runtime, system tools, libraries, and settings.
Container?
You are probably familiar with the "Docker image", which is a kind of blueprint for creating instances. A container is a running instance of an image: when you run an image (for example with docker run), it becomes a container, which provides an isolated, consistent environment in which the application executes on top of a 'runtime'.
The container runtime is a low-level software component responsible for executing and managing containers on a host operating system. It acts as the interface between the container and the underlying system resources.
Initially, Docker didn't have its own low-level container runtime, so it relied on LXC (Linux Containers), an existing userspace toolset built on the Linux kernel's process-isolation features. However, this made Docker dependent on an external tool and on specific kernel interfaces, and it limited portability and functionality.
To avoid these limitations, Docker developed "libcontainer", a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It also lets you manage the lifecycle of the container and perform additional operations after the container is created.
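To make "namespaces" a little more concrete, here is a minimal sketch that lists the namespaces of the current process by reading the links under /proc/self/ns. It assumes a Linux host with /proc mounted; parse_ns_link is a helper name made up for this example, not part of libcontainer.

```rust
use std::fs;

/// Parse a namespace link target such as "pid:[4026531836]"
/// into its kind ("pid") and inode number (4026531836).
/// (Hypothetical helper for this sketch.)
fn parse_ns_link(target: &str) -> Option<(String, u64)> {
    let (kind, rest) = target.split_once(":[")?;
    let inode = rest.strip_suffix(']')?.parse::<u64>().ok()?;
    Some((kind.to_string(), inode))
}

fn main() {
    // A few of the namespace kinds exposed under /proc/<pid>/ns on Linux.
    for kind in ["pid", "mnt", "uts", "net", "ipc"] {
        let path = format!("/proc/self/ns/{kind}");
        match fs::read_link(&path) {
            Ok(target) => match parse_ns_link(&target.to_string_lossy()) {
                Some((k, inode)) => println!("{k}: namespace inode {inode}"),
                None => println!("{kind}: unexpected link format"),
            },
            // Not on Linux, or /proc is not mounted.
            Err(err) => println!("{kind}: unavailable ({err})"),
        }
    }
}
```

Two processes that print the same inode number for a given kind share that namespace; a process inside a container will normally print different numbers than a process on the host.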
What is runC?
runc is a CLI tool for spawning and running containers on Linux according to the OCI specification. In other words, it handles the low-level work of creating and managing containers based on a standard called the OCI Runtime Specification (we will get to this a bit later).
When you run a command like docker run, the Docker daemon (more precisely, a higher-level runtime such as containerd) ultimately calls runc to perform the actual container creation.
OCI Runtime?
The Open Container Initiative (OCI) Runtime Specification defines how to run a containerized application. It provides a standardized way to define the configuration and lifecycle of a container, so that any OCI-compliant runtime can correctly run a container prepared for another.
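The lifecycle part can be pictured as a small state machine. The sketch below uses the four statuses named in the specification (creating, created, running, stopped) and its rules that start needs a created container, kill needs a created or running one, and delete needs a stopped one. The Op enum and the apply function are names invented for this example, and treating kill as always stopping the process is a simplification.

```rust
/// Container statuses named in the OCI runtime specification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Creating,
    Created,
    Running,
    Stopped,
}

/// Operations a caller can request (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
enum Op {
    Start,
    Kill,
    Delete,
}

/// Return the status after `op`, or `None` once the container is deleted.
/// Simplification: `Kill` is assumed to terminate the process.
fn apply(status: Status, op: Op) -> Result<Option<Status>, String> {
    match (status, op) {
        (Status::Created, Op::Start) => Ok(Some(Status::Running)),
        (Status::Created | Status::Running, Op::Kill) => Ok(Some(Status::Stopped)),
        (Status::Stopped, Op::Delete) => Ok(None),
        (s, o) => Err(format!("cannot {o:?} a container in state {s:?}")),
    }
}

fn main() {
    // Happy path: created -> running -> stopped -> deleted.
    let mut status = Some(Status::Created);
    for op in [Op::Start, Op::Kill, Op::Delete] {
        status = apply(status.expect("container still exists"), op).expect("valid transition");
        println!("after {op:?}: {status:?}");
    }
    // Invalid requests are rejected without changing the container.
    println!("{:?}", apply(Status::Creating, Op::Start));
    println!("{:?}", apply(Status::Running, Op::Delete));
}
```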
The core of this specification is the filesystem bundle which contains:
- rootfs: A directory containing the root filesystem of the container.
- config.json: A JSON file that defines all the configuration for the container process.
The config.json is the configuration used to create the container. It specifies the essential details an OCI runtime needs to set up the isolated environment correctly.
Here is a sample:
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": true,
    "user": {
      "uid": 0,
      "gid": 0
    },
    "args": [
      "/bin/sh"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ]
  },
  "root": {
    "path": "rootfs",
    "readonly": true
  },
  "hostname": "my-container",
  "mounts": [],
  "linux": {
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "network"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      }
    ],
    "cgroupsPath": "/my-container-cgroup"
  }
}
You can find the full specification here -> https://specs.opencontainers.org/runtime-spec/config/
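Before a runtime does anything else, it can check that the directory it was given really looks like a bundle. Here is a minimal sketch of such a check; check_bundle is a made-up helper, and it assumes the root filesystem lives in a directory literally named rootfs, as in the sample above (a real runtime would read the location from root.path in config.json).

```rust
use std::path::Path;

/// Check that `bundle` contains a config.json file and a rootfs directory.
/// (Hypothetical helper; assumes the conventional "rootfs" directory name.)
fn check_bundle(bundle: &Path) -> Result<(), String> {
    let config = bundle.join("config.json");
    if !config.is_file() {
        return Err(format!("missing {}", config.display()));
    }
    let rootfs = bundle.join("rootfs");
    if !rootfs.is_dir() {
        return Err(format!("missing directory {}", rootfs.display()));
    }
    Ok(())
}

fn main() {
    // Use the first argument as the bundle path, or the current directory.
    let bundle = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    match check_bundle(Path::new(&bundle)) {
        Ok(()) => println!("{bundle} looks like an OCI bundle"),
        Err(err) => eprintln!("not a valid bundle: {err}"),
    }
}
```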
Implementing Basic OCI-Runtime
Implementing a basic OCI-compliant runtime means using core Linux kernel features to create an isolated environment based on the "config.json" file.
These are the key requirements:
- Parse config.json: The runtime first reads and parses the file to understand the container's configuration.
- Create namespaces: It must use system calls such as "unshare" or "clone" to create new Linux namespaces (PID, mount, UTS, network, etc.) as specified in the configuration file. These namespaces isolate the container's processes from the host system.
- Set up cgroups: The runtime needs to interact with the "cgroup" filesystem (/sys/fs/cgroup) to create and configure the control group for the new container. This makes it possible to apply resource limits for CPU, memory, I/O, and so on.
- Change root filesystem: Perform the pivot_root or chroot system call to change the root filesystem of the container process to the rootfs directory specified in the config. This ensures the container can only access its own filesystem.
- Execute: Finally, the runtime calls the execve system call to replace its own process with the container's main command (e.g., /bin/sh) inside the new, isolated environment. This new process becomes PID 1 inside the container's PID namespace.
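For the namespace step, the runtime has to translate the "type" strings from config.json into the flags that clone or unshare expect. Here is a minimal sketch of that mapping using only the standard library. The numeric values are the CLONE_NEW* constants from the Linux headers; clone_flag and combined_flags are names made up for this example.

```rust
/// Map an OCI namespace type to the matching Linux clone flag value.
/// (Hypothetical helper; values are the CLONE_NEW* constants.)
fn clone_flag(ns_type: &str) -> Option<u64> {
    match ns_type {
        "mount" => Some(0x0002_0000),   // CLONE_NEWNS
        "cgroup" => Some(0x0200_0000),  // CLONE_NEWCGROUP
        "uts" => Some(0x0400_0000),     // CLONE_NEWUTS
        "ipc" => Some(0x0800_0000),     // CLONE_NEWIPC
        "user" => Some(0x1000_0000),    // CLONE_NEWUSER
        "pid" => Some(0x2000_0000),     // CLONE_NEWPID
        "network" => Some(0x4000_0000), // CLONE_NEWNET
        _ => None,
    }
}

/// OR together the flags for every requested namespace type.
fn combined_flags(types: &[&str]) -> Result<u64, String> {
    let mut flags = 0u64;
    for t in types {
        flags |= clone_flag(t).ok_or_else(|| format!("unknown namespace type: {t}"))?;
    }
    Ok(flags)
}

fn main() {
    // The namespace list from the sample config.json shown earlier.
    let requested = ["pid", "network", "ipc", "uts", "mount"];
    match combined_flags(&requested) {
        Ok(flags) => println!("clone flags: {flags:#x}"),
        Err(err) => eprintln!("{err}"),
    }
}
```

Rejecting unknown types up front keeps a typo in config.json from silently producing a container with less isolation than requested.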
Below is a minimal implementation in Rust. First, Cargo.toml:
[package]
name = "my-oci-rt"
version = "0.0.0"
edition = "2021"

[dependencies]
nix = { version = "0.30.0", features = ["sched", "process", "user", "hostname", "fs"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
Then src/main.rs:
use nix::sched::{clone, CloneFlags};
use nix::sys::wait::waitpid;
use nix::unistd::{chroot, setgid, sethostname, setuid, Gid, Uid};
use serde::Deserialize;
use std::ffi::CString;
use std::fs;
use std::path::PathBuf;

// Only the config.json fields this sample uses; other fields are ignored.
#[derive(Deserialize, Debug)]
struct Spec {
    process: Process,
    root: Root,
    hostname: Option<String>,
}

#[derive(Deserialize, Debug)]
struct Process {
    args: Vec<String>,
    user: User,
}

#[derive(Deserialize, Debug)]
struct User {
    uid: u32,
    gid: u32,
}

#[derive(Deserialize, Debug)]
struct Root {
    path: PathBuf,
}

// Size of the child's stack (1MB)
const STACK_SIZE: usize = 1024 * 1024;

fn main() {
    let raw_config = fs::read_to_string("config.json").expect("Failed to read config.json");
    let spec: Spec = serde_json::from_str(&raw_config).expect("Failed to parse config.json");

    // Stack memory for the child process
    let mut stack = [0u8; STACK_SIZE];

    // Code that runs inside the new namespaces
    let child_closure = || -> isize {
        println!("== In Child Process (PID: {}) ==", nix::unistd::getpid());

        if let Some(hostname) = &spec.hostname {
            sethostname(hostname).expect("Failed setting hostname");
        }

        // Switch the root filesystem to the bundle's rootfs
        chroot(&spec.root.path).expect("chroot failed");
        std::env::set_current_dir("/").expect("Failed changing directory to /");

        // Set the GID first, then the UID
        setgid(Gid::from_raw(spec.process.user.gid)).expect("Failed setting GID");
        setuid(Uid::from_raw(spec.process.user.uid)).expect("Failed setting UID");

        // Build the command and its arguments
        let command = &spec.process.args[0];
        let args: Vec<CString> = spec.process.args.iter()
            .map(|s| CString::new(s.as_str()).unwrap())
            .collect();

        // Replace this process with the container's command
        nix::unistd::execvp(&CString::new(command.as_str()).unwrap(), &args).expect("execvp failed");
        0
    };

    // Create new UTS, PID, and mount namespaces for the child
    let clone_flags = CloneFlags::CLONE_NEWUTS | CloneFlags::CLONE_NEWPID | CloneFlags::CLONE_NEWNS;
    let child_pid = unsafe {
        clone(
            Box::new(child_closure),
            &mut stack,
            clone_flags,
            Some(nix::sys::signal::Signal::SIGCHLD as i32),
        )
    }
    .expect("clone failed");

    println!("== In Parent Process ==");
    println!("Spawned child with PID: {}", child_pid);

    // Wait for the child process to exit
    waitpid(child_pid, None).expect("waitpid failed");
    println!("Child process exited.");
}
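The sample runtime skips the cgroups step from the requirements list. As a rough idea of what that step involves, here is a sketch that only plans the files a runtime would write under a cgroup directory. It assumes the cgroup v2 unified hierarchy (interface files memory.max, cpu.max, pids.max, and cgroup.procs); the Limits struct and plan_cgroup_writes are invented for this example and are not fields of config.json.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical limits for this sketch (not the config.json schema).
struct Limits {
    memory_bytes: Option<u64>,
    cpu_quota_us: Option<u64>,
    cpu_period_us: u64,
    max_pids: Option<u64>,
}

/// Build the list of (file, contents) writes for a cgroup v2 directory.
fn plan_cgroup_writes(cgroup_dir: &Path, limits: &Limits, pid: u32) -> Vec<(PathBuf, String)> {
    let mut writes = Vec::new();
    if let Some(bytes) = limits.memory_bytes {
        writes.push((cgroup_dir.join("memory.max"), bytes.to_string()));
    }
    if let Some(quota) = limits.cpu_quota_us {
        // cpu.max takes "<quota> <period>", both in microseconds.
        writes.push((cgroup_dir.join("cpu.max"), format!("{quota} {}", limits.cpu_period_us)));
    }
    if let Some(pids) = limits.max_pids {
        writes.push((cgroup_dir.join("pids.max"), pids.to_string()));
    }
    // Writing a PID to cgroup.procs moves that process into the group.
    writes.push((cgroup_dir.join("cgroup.procs"), pid.to_string()));
    writes
}

fn main() {
    let dir = Path::new("/sys/fs/cgroup/my-container-cgroup");
    let limits = Limits {
        memory_bytes: Some(64 * 1024 * 1024),
        cpu_quota_us: Some(50_000),
        cpu_period_us: 100_000,
        max_pids: Some(32),
    };
    // A real runtime would create `dir` and perform each write as root;
    // this sketch only prints the plan.
    for (file, value) in plan_cgroup_writes(dir, &limits, std::process::id()) {
        println!("{} <- {}", file.display(), value);
    }
}
```

On a real system the relevant controllers also have to be enabled in the parent group's cgroup.subtree_control before these files appear, which this sketch does not handle.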
Things to prepare:
To run this project, prepare the following:
- A simple config.json:
{
  "ociVersion": "1.0.0",
  "process": {
    "args": [
      "/bin/sh",
      "-c",
      "echo 'Hello container!' && ls -l"
    ],
    "user": {
      "uid": 0,
      "gid": 0
    }
  },
  "root": {
    "path": "rootfs"
  },
  "hostname": "my-container"
}
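- A root filesystem for the container, the rootfs directory:
Let's use a simple environment with "busybox". The config above runs /bin/sh and ls, so those names need to exist inside rootfs; linking them to the busybox binary is one way to do that:
$ mkdir -p rootfs/bin
$ cd rootfs
$ wget https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
$ chmod +x busybox
$ ln -s ../busybox bin/sh
$ ln -s ../busybox bin/ls
$ cd ..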
Result
Now you will see the message from the container, as follows (run the binary as root, since creating namespaces and calling chroot require privileges):
$ ./target/debug/my-oci-rt
== In Parent Process ==
Spawned child with PID: 4500
== In Child Process (PID: 1) ==
Hello container!
total 1108
drwxr-xr-x 4 0 0 128 Aug 18 02:52 bin
-rwxr-xr-x 1 0 0 1131168 Jan 17 2022 busybox
Child process exited.
Note>
The nix library only exposes system calls that exist on the operating system you are compiling for. So, when compiling on macOS, the compiler reports the following error:
...
  |
1 | use nix::sched::{clone, CloneFlags};
  |                  ^^^^^  ^^^^^^^^^^ no `CloneFlags` in `sched`
  |                  |
  |                  no `clone` in `sched`
  |
  = help: consider importing this module instead:
          std::clone
...
The nix::sched::clone function and the CloneFlags type are Linux-specific, so they are not included in the nix library on macOS.
If you want to try this on a Mac, build and run it in a separate Linux environment, for example with "Docker Desktop".
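If you would rather get a readable message than a compile error on other platforms, one option is to gate the Linux-only code with cfg attributes. This is a minimal sketch of that pattern; run_container is a placeholder name, and the Linux branch only stands in for the real clone/chroot logic.

```rust
/// Linux build: the namespace-based logic would live here.
#[cfg(target_os = "linux")]
fn run_container() -> Result<(), String> {
    // Placeholder for Linux-only calls such as nix::sched::clone.
    Ok(())
}

/// Any other OS: compile successfully, but refuse to run.
#[cfg(not(target_os = "linux"))]
fn run_container() -> Result<(), String> {
    Err("this runtime needs Linux namespaces; build and run it inside a Linux environment".to_string())
}

fn main() {
    if let Err(msg) = run_container() {
        eprintln!("{msg}");
        std::process::exit(1);
    }
    println!("container finished");
}
```

For this to help in the real project, the Linux-only use statements would need the same cfg attribute, so that nothing outside the Linux branch refers to nix::sched.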