Intro
In Part 0 of the series, we saw how processes can be given a restricted view of system resources. This part explains how a container runtime prepares and creates an isolated environment for the container process.
A prerequisite for this part is a basic knowledge of how the Linux filesystem works and what inodes, symbolic links, and mount points are.
The full source code for this post can be found here.
First, let’s start with the OCI spec.
Operations
At the time of writing, the OCI spec defines a minimum of five standard operations: create, start, state, delete, and kill. With that in mind, we can use the clap library to generate a nice CLI interface in no time. It should look something like this:
use clap::{App, Arg, SubCommand};

let matches = App::new("Container Runtime")
    .subcommand(
        SubCommand::with_name("create")
            .arg(Arg::with_name("bundle").required(true))
            .arg(Arg::with_name("id").required(true)),
    )
    .subcommand(SubCommand::with_name("start").arg(Arg::with_name("id").required(true)))
    .subcommand(
        SubCommand::with_name("kill")
            .arg(Arg::with_name("id").required(true))
            .arg(Arg::with_name("signal")),
    )
    .subcommand(SubCommand::with_name("delete").arg(Arg::with_name("id").required(true)))
    .subcommand(SubCommand::with_name("state").arg(Arg::with_name("id").required(true)))
    .get_matches();
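Dispatching on the parsed subcommands is then a single match. As a rough sketch (the create and start functions here are hypothetical handlers, not code from this post):

```rust
// Hypothetical handlers -- the real runtime wires these up to the OCI operations.
fn create(id: &str, bundle: &str) { /* ... */ }
fn start(id: &str) { /* ... */ }

// Route each subcommand to its handler; required(true) makes the unwraps safe.
match matches.subcommand() {
    ("create", Some(args)) => create(
        args.value_of("id").unwrap(),
        args.value_of("bundle").unwrap(),
    ),
    ("start", Some(args)) => start(args.value_of("id").unwrap()),
    _ => {}
}
```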
We'll give the most attention to create and start, because those two do the bulk of the work behind docker run.
The bundle directory contains the config.json file that holds all the metadata for creating the container:
- ociVersion - version of the OCI spec
- process - the user-defined process that the container executes (shell, database, web app, gRPC service, etc.) with the necessary args and environment variables
- root - path to the subdirectory for the container root
- hostname - hostname of the container
- mounts - list of mount points inside the container
In addition, the OCI spec contains a platform-specific section enabling custom settings based on the platform where the container is being run. Because we are looking only at Linux containers, the linux section will be of use to us.
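One way to see how these fields hang together is to model them as serde structs. This is just an illustrative subset, not the post's actual types or a complete mapping of the spec:

```rust
use serde::Deserialize;

// Field names follow the OCI runtime config; the struct layout is a sketch.
#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct Spec {
    oci_version: String,
    process: Process,
    root: Root,
    hostname: Option<String>,
    mounts: Option<Vec<Mount>>,
    linux: Option<serde_json::Value>, // platform-specific settings, kept opaque here
}

#[derive(Deserialize)]
struct Process {
    args: Vec<String>,        // the user-defined command and its arguments
    env: Option<Vec<String>>, // environment variables, "KEY=value"
    cwd: String,              // working directory inside the container
}

#[derive(Deserialize)]
struct Root {
    path: String, // subdirectory used as the container root
    readonly: Option<bool>,
}

#[derive(Deserialize)]
struct Mount {
    destination: String,
    #[serde(rename = "type")]
    typ: Option<String>,
    source: Option<String>,
    options: Option<Vec<String>>,
}
```

Parsing the bundle then boils down to reading bundle/config.json and running it through serde_json.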
The create command is supplied with the container ID and the bundle path. Its purpose is to initialize the container process, mount all necessary subdirectories, "jail" the container inside the root.path folder, update all system variables inside the container (env, hostname, user, group), execute a couple of hooks (we'll look into that later), and assign the unique ID to the container itself. After the create command finishes, the container is in the Created state, and the container process MUST wait for the start command before executing the actual user-defined process.
Everything seems straightforward regarding the implementation, but the jailing part can be a bit confusing. How is it done?
chroot
Chroot is a syscall that changes the root directory of the calling process. It takes the new root path as an argument, which can be an absolute or relative path. The chroot command from the terminal does the same thing, except it takes an additional argument, namely the process that’s going to be executed inside the changed root.
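The same syscall is exposed in Rust through the nix crate. A minimal sketch (it needs root privileges, and note the chdir afterwards so the working directory doesn't keep pointing outside the new root):

```rust
use nix::unistd::{chdir, chroot};

// Change the root directory of the calling process and step into it.
fn jail(new_root: &str) -> nix::Result<()> {
    chroot(new_root)?;
    chdir("/")?;
    Ok(())
}
```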
Before we take a look at an example, we first need to prepare the new rootfs. Unfortunately, the binaries used in the jail must reside inside the chroot-ed directory (obviously), so we need a premade rootfs. Luckily, we can reuse our host OS binaries by bind mounting the already existing files into a new rootfs directory. Don't worry if your layout differs; just be sure to have bash and ls inside the bin directory.
Now let's run the chroot command on that directory (with sudo) and look around. Listing directories outside our root (ls ..) just shows the jailed root again, so it seems we can't see anything outside. Listing the bin and lib directories gives the same result as in the rootfs we prepared above.
One could say "that's how containers are jailed" and jump right into building a container from scratch. But things aren't that easy... chroot changes neither the filesystem nor the mount points that the process sees. It only changes the view of the process' root; everything else remains the same. And breaking this jail is fairly easy, as described here.
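To see why, here is a sketch of the classic escape along the lines of the write-up linked above, assuming the jailed process still runs as root:

```rust
use nix::sys::stat::Mode;
use nix::unistd::{chdir, chroot, mkdir};

// Classic chroot breakout: chroot into a subdirectory without changing the
// current working directory, which now sits *outside* the new root, then
// walk up to the real root and chroot into it.
fn escape() -> nix::Result<()> {
    mkdir("escape", Mode::S_IRWXU)?;
    chroot("escape")?;
    for _ in 0..64 {
        chdir("..")?;
    }
    chroot(".")?;
    Ok(())
}
```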
pivot_root
pivot_root on the other hand does exactly what we need. Given a new root and subdirectory of the current root, it moves the current root to the subdirectory and mounts the new root as the root mount point. In this way, it changes the physically mounted folder of the root directory. Later on, we can unmount the “old” root and just leave the newly created root mount point.
We’re going to take a look at an example.
NOTE: pivot_root changes the root mount point and can potentially make a mess out of your filesystem, so be sure to follow the steps below.
First, we need a real rootfs. We can't reuse the example above because it bind mounts the host binaries; we need an independent directory that can live on its own. For that, we are going to use Docker to export a fresh rootfs from an Alpine container. Then we'll use unshare (remember our friend from Part 0) to create a new mount namespace, and finally pivot the roots inside our container.
docker export simply copies the container's files into a tar archive on the host. After exporting the rootfs from the Alpine image, we bind mount the directory onto itself. Why? Because, per the pivot_root syscall's specification, new_root must be a path to a mount point different from "/".

After preparing the container root, we create a new mount namespace, separate from the host's, so that pivot_root doesn't change anything in the host's mount namespace. We then create a temporary folder to hold the old root, pivot the root, unmount the oldroot (lazily, with umount -l) and, to finalize the swap, remove the oldroot folder.
Voila! Now we have a bash process running inside the jailed container folder.
In Rust code, mounting the rootfs folder with the nix crate would look something like this:
use nix::mount::{mount, MsFlags};
use std::{error::Error, path::Path};

pub fn mount_rootfs(rootfs: &Path) -> Result<(), Box<dyn Error>> {
    // Make the whole mount tree private so the pivot doesn't propagate to the host.
    mount(
        None::<&str>,
        "/",
        None::<&str>,
        MsFlags::MS_PRIVATE | MsFlags::MS_REC,
        None::<&str>,
    )?;
    // Bind mount the rootfs onto itself so it becomes a mount point (pivot_root requires one).
    mount::<Path, Path, str, str>(
        Some(&rootfs),
        &rootfs,
        None::<&str>,
        MsFlags::MS_BIND | MsFlags::MS_REC,
        None::<&str>,
    )?;
    Ok(())
}
The first mount changes the mount propagation of the root mount point to private (shared mount propagation isn’t allowed with pivot_root, for obvious reasons).
The code for the whole pivoting procedure should look something like this:
use nix::mount::{umount2, MntFlags};
use nix::unistd::{chdir, pivot_root};
use std::{error::Error, path::Path};

pub fn pivot_rootfs(rootfs: &Path) -> Result<(), Box<dyn Error>> {
    chdir(rootfs)?;
    // Temporary directory that holds the old root during the swap.
    std::fs::create_dir_all(rootfs.join("oldroot"))?;
    pivot_root(rootfs.as_os_str(), rootfs.join("oldroot").as_os_str())?;
    // Lazily detach and delete the old root; only the new root mount remains.
    umount2("./oldroot", MntFlags::MNT_DETACH)?;
    std::fs::remove_dir_all("./oldroot")?;
    chdir("/")?;
    Ok(())
}
Note that both mount_rootfs and pivot_rootfs are called in the newly created mount namespace.
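Putting the two together, the call order inside the runtime could look roughly like this. The explicit unshare is an assumption for the sake of the sketch; in the actual runtime the new mount namespace can also come from cloning the child process, which we'll get to in a later part:

```rust
use nix::sched::{unshare, CloneFlags};
use std::{error::Error, path::Path};

fn setup_rootfs(rootfs: &Path) -> Result<(), Box<dyn Error>> {
    // Enter a fresh mount namespace so the pivot can't touch the host's mounts.
    unshare(CloneFlags::CLONE_NEWNS)?;
    // Make "/" private and turn the rootfs into a mount point...
    mount_rootfs(rootfs)?;
    // ...then swap the root mounts and drop the old root.
    pivot_rootfs(rootfs)?;
    Ok(())
}
```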
Special links & mounts
The OCI runtime spec defines a set of special symlinks. These symbolic links pass the stdin, stdout, and stderr streams between the container engine (Docker, containerd) and the runtime: they simply bind the container's standard streams to the file descriptors handed to the container process from the outside. The container runtime needs to establish these symlinks before pivot_root.
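For reference, the usual set of links from the runtime spec's default /dev symlinks can be created with plain std calls. A sketch, with paths relative to the container rootfs:

```rust
use std::os::unix::fs::symlink;
use std::path::Path;

// Wire the container's standard streams to the process' own file descriptors.
fn setup_dev_symlinks(rootfs: &Path) -> std::io::Result<()> {
    let links = [
        ("/proc/self/fd", "dev/fd"),
        ("/proc/self/fd/0", "dev/stdin"),
        ("/proc/self/fd/1", "dev/stdout"),
        ("/proc/self/fd/2", "dev/stderr"),
    ];
    for (target, link) in links {
        symlink(target, rootfs.join(link))?;
    }
    Ok(())
}
```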
The OCI runtime spec also defines a set of filesystems that need to be mounted inside the container. Extracting config.json files from images like Alpine, Ubuntu, and Debian shows that the /dev/pts and /dev/shm mount points are present in the mounts section of the runtime config.
Two important filesystems that need more attention are proc and sysfs.
The proc filesystem is mounted at /proc and acts as an interface to the kernel's internal structures. For each process it holds a /proc/[PID] subdirectory that exposes the process' file descriptors, CPU and memory usage, mount information, page tables, and much more. For example, you can't inspect the current mount points (with the mount command) until this filesystem is mounted.
The exact command for mounting the proc fs is:
mount -t proc proc /proc
The sysfs filesystem is a pseudo-filesystem, like proc, that provides an interface to internal kernel objects. In contrast to the proc filesystem, it holds system-wide information: metadata of block and char devices, bus info, drivers, control groups, kernel info, and other global state. Mounting sysfs works the same way as proc:
mount -t sysfs sysfs /sys
Both proc and sysfs need to be mounted after pivot_root, when the new root mount point is created.
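In the Rust code, the same two mounts can be done with nix, assuming the /proc and /sys directories already exist inside the new root:

```rust
use nix::mount::{mount, MsFlags};

fn mount_proc_and_sys() -> nix::Result<()> {
    // Equivalent of `mount -t proc proc /proc`.
    mount(Some("proc"), "/proc", Some("proc"), MsFlags::empty(), None::<&str>)?;
    // Equivalent of `mount -t sysfs sysfs /sys`.
    mount(Some("sysfs"), "/sys", Some("sysfs"), MsFlags::empty(), None::<&str>)?;
    Ok(())
}
```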
Devices
In Linux, everything is considered a file. Hard drives, peripherals, even processes are fully described through file descriptors, and devices are no exception. Diskettes, CD-ROMs, serial ports: any device you attach should appear inside the /dev subdirectory under the root directory. Devices have types, and the majority are either block devices (which store data of some kind) or character devices (which stream or transfer data). Terminals, pseudo-random number generators, and even the /dev/null file are devices too.
The OCI specification defines the devices required for each container, and the config.json contains a devices list under the linux section. The container runtime is responsible for creating these devices inside the container root directory. The syscall for creating devices is mknod. This syscall (also available as a terminal command) accepts four required parameters:
- pathname - full path to the file location
- type - block, character or other device types
- major & minor - unique identifiers for the device
For example, the char device with major and minor numbers 1 and 8 is the random device, representing the pseudo-random number generator. Whenever your app requests a random number, the request ends up at this device.
We can easily create these special devices with nix's mknod function or, when binding to a host device (which the OCI spec also covers), use the bind mount option.
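As an illustration, creating the random device from the example above with nix could look like this (the path handling is a simplification):

```rust
use nix::sys::stat::{makedev, mknod, Mode, SFlag};
use std::path::Path;

// Create /dev/random (char device, major 1, minor 8) inside the container rootfs.
// mknod needs the right privileges; otherwise, fall back to bind mounting the host device.
fn create_random_device(rootfs: &Path) -> nix::Result<()> {
    mknod(
        &rootfs.join("dev/random"),
        SFlag::S_IFCHR,
        Mode::from_bits_truncate(0o666),
        makedev(1, 8),
    )
}
```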
Conclusion
We’ve seen how chroot changes the view of the root directory for the current process and how pivot_root swaps the root mount points, creating a logical isolation of the filesystem. We’ve also seen how to create standard container devices and that different containers can request special devices inside the mounts section of the config.json file.
Knowing how unshare and pivot_root work gives us the ability to manually create Linux containers in our terminal. In the next parts, we’ll dive a bit deeper into the implementation. Specifically on cloning the child process and preparations for starting the container command.