Demystifying Containers

#devops #linux #container

Containers consist of the following parts, none of which are too hard to grasp.

Chroot

You probably know the chroot program. It’s basically a wrapper around the kernel’s chroot system call, which changes the root directory of the calling process, in effect prepending a given path to every absolute path used in subsequent path-related system calls. Or, to quote its man page: “This call changes an ingredient in the pathname resolution process and does nothing else.” If called on a root filesystem (that is, all the files and directories that make up an operating system), the calling process and all its future child processes will, in simple terms, think they are on a different Linux distribution. In practice, most containerization platforms use the slightly more complex pivot_root call, which follows the same basic concept.
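To make the “prepending” idea concrete, here is a minimal sketch in Python. The helper names and the /srv/rootfs path are made up for illustration. `path_seen_by_kernel` only models the idea (it ignores symlinks and `..`), and `run_in_chroot` shows the real `os.chroot` call but is not invoked, since it needs root privileges and an actual root filesystem to point at.

```python
import os
import posixpath


def path_seen_by_kernel(new_root: str, path_in_chroot: str) -> str:
    """Return the host path that an absolute path inside the chroot refers to.

    This is only a model of the idea; it ignores symlinks and "..".
    """
    return posixpath.join(new_root, path_in_chroot.lstrip("/"))


def run_in_chroot(new_root: str, argv: list[str]) -> None:
    """Replace the current process with argv, running inside new_root.

    Needs root privileges (CAP_SYS_CHROOT); not called in this sketch.
    """
    os.chroot(new_root)  # change the root directory of this process
    os.chdir("/")        # make sure the working directory is inside the new root
    os.execvp(argv[0], argv)


# /srv/rootfs is a hypothetical directory holding an unpacked distribution
print(path_seen_by_kernel("/srv/rootfs", "/etc/os-release"))
```

With a real root filesystem in place, something like `run_in_chroot("/srv/rootfs", ["/bin/sh"])` run as root would drop you into a shell that sees that directory as `/`.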

Mountpoints

While chroot changes the root filesystem of the calling process, it does nothing about “special” filesystems like /dev or /proc. These have to be set up as needed by the containerization platform with the mount system call. You have probably already used the corresponding wrapper binary, mount. “Volumes” are nothing more than bind mounts.
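As a rough sketch of what that setup could look like, the following Python builds the `mount` invocations for a guest root filesystem. The function names, the /srv/rootfs path and the volume mapping are assumptions for illustration. Real platforms usually call the mount system call directly and mount more than just these two filesystems. Nothing is executed here, because mounting needs root.

```python
import os
import subprocess


def mount_commands(rootfs: str, volumes: dict[str, str]) -> list[list[str]]:
    """Build the mount invocations for a rootfs; nothing is executed here.

    volumes maps a host directory to an absolute path inside the guest.
    """
    commands = [
        # a fresh procfs instance for the guest
        ["mount", "-t", "proc", "proc", os.path.join(rootfs, "proc")],
        # reuse the host's device nodes via a recursive bind mount
        ["mount", "--rbind", "/dev", os.path.join(rootfs, "dev")],
    ]
    for host_dir, guest_path in volumes.items():
        # a "volume" is just a bind mount of a host directory into the guest
        target = os.path.join(rootfs, guest_path.lstrip("/"))
        commands.append(["mount", "--bind", host_dir, target])
    return commands


def apply_mounts(commands: list[list[str]]) -> None:
    """Run the commands; needs root, so it is not called in this sketch."""
    for argv in commands:
        subprocess.run(argv, check=True)


for argv in mount_commands("/srv/rootfs", {"/home/me/project": "/work"}):
    print(" ".join(argv))
```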

Linux namespaces

With chroot and mountpoints we’d already have something usable, but it would still not feel like a separate operating system. Many resources are still shared between our primitive “container” in progress and the rest of the system. Linux has an interesting concept to solve this, called namespaces. Instead of decoupling an entire process at once to get something like a separate “virtual machine”, with Linux you can fine-tune what exactly should get its own namespace. For example, to make top not show processes from the main system, the PID namespace is unshared. If mount calls inside your container should not mess up the main system (= parent processes), you’d achieve that by unsharing the mount namespace. It is even possible to “fake” being the root user with user namespaces. The namespaces(7) man page lists the available namespaces that a process can decouple from with the unshare system call. Linux namespaces are a feature of the Linux kernel and are not available on other Unix-like operating systems.
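To see which namespaces a process currently belongs to, you can look at the symlinks under /proc/&lt;pid&gt;/ns. The sketch below does that for the current process and also builds (without running) an `unshare` command line from util-linux that would start a program in new PID and mount namespaces. The function names are made up for illustration, and running the `unshare` command as shown would normally require root.

```python
import os


def current_namespaces(pid: str = "self") -> dict[str, str]:
    """Map namespace type to its identifier, as exposed under /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):  # not on Linux, or /proc is not mounted
        return {}
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry is not readable; skip it
    return namespaces


def unshare_command(argv: list[str]) -> list[str]:
    """Build an unshare(1) call that gives argv new PID and mount namespaces."""
    return ["unshare", "--pid", "--fork", "--mount", "--mount-proc", *argv]


for name, ident in current_namespaces().items():
    print(name, ident)
print(" ".join(unshare_command(["top"])))
```

Two processes are in the same namespace exactly when the identifiers printed for that namespace type match.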

Isolation

I will not go too deep on this one because it’s not my area of expertise. But with these three ingredients (chroot, mountpoints and Linux namespaces) we already have something that feels like a separate Linux box running on our host operating system. Neat, but isolating it well enough from a security perspective is another story. An attacker could still “escape” our home-made virtualization quite easily. Chroot, for example, does not provide any guarantee that a process will stay inside its chroot compartment.

A central repository server

For practical reasons, it would be nice not to have to manually download a fresh copy of, e.g., the Ubuntu operating system from the internet every time we want to use it as our guest system. Good thing we have our central repository server, where we can download whatever we want and upload our own alterations. Oh, and of course it has to support layers or overlays, explained in the next section.

Overlay Filesystems

A guest system (we don’t have containers yet) like Ubuntu with a LAMP stack and some other stuff can easily take up a couple of gigabytes on the hard drive. While it would be possible to upload, download, copy and delete something that size as a whole, it’s impractical. That is why containerization platforms manage a root filesystem (all files and directories of an operating system) bit by bit, as layers. It’s quite simple actually: multiple sets of files and directories are overlaid on top of each other to make up the “complete” view that the user will see. “Special” files are used to signal a deleted file. Another interesting approach could involve reflinks, provided the filesystem in use supports them. Like all the other interfaces described in this post, overlay filesystems can also be used on their own for any other use case. There is an overlay filesystem built into the kernel named “OverlayFS”. There’s also a FUSE-based implementation called “fuse-overlayfs”.
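The layering logic itself can be modelled in a few lines. The following is a toy simulation, not how OverlayFS is implemented: layers are plain dictionaries from path to content, and a made-up `WHITEOUT` marker stands in for the “special” file that signals a deletion.

```python
WHITEOUT = object()  # stand-in for the "special" file that marks a deletion


def merged_view(layers: list[dict]) -> dict:
    """Merge layers from lowest to highest into the view a user would see."""
    view = {}
    for layer in layers:
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)  # a higher layer hides the file
            else:
                view[path] = content  # a higher layer adds or replaces it
    return view


# three hypothetical layers: a base system, an application, a small tweak
base = {"/etc/hostname": "ubuntu", "/etc/motd": "welcome"}
app = {"/var/www/index.php": "<?php phpinfo();", "/etc/motd": WHITEOUT}
tweak = {"/etc/hostname": "webserver"}

for path, content in sorted(merged_view([base, app, tweak]).items()):
    print(path, "=>", content)
```

The base layer is never modified: the deletion of /etc/motd and the changed hostname live entirely in the layers above it, which is what makes it possible to share the base between many guest systems.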

A build language

Last but not least, a big part of containerization platforms is letting users build their own root filesystems in order to distribute or deploy them. A simple shell script would most probably not honor the layers and other “special” features like build-time arguments or volumes.
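One way a build tool could tie build steps to layers is to derive each layer’s identifier from its parent layer plus the step that produced it. The sketch below is a hypothetical scheme for illustration only, not the format of any particular platform, and the function name and example commands are made up.

```python
import hashlib


def layer_ids(base_image: str, steps: list[str]) -> list[str]:
    """Derive one identifier per build step, each depending on all steps before it."""
    ids = []
    parent = hashlib.sha256(base_image.encode()).hexdigest()
    for step in steps:
        # the identifier covers the parent layer and the command that was run
        parent = hashlib.sha256(f"{parent}\n{step}".encode()).hexdigest()
        ids.append(parent)
    return ids


first = layer_ids("ubuntu", ["apt-get install -y apache2", "apt-get install -y php"])
second = layer_ids("ubuntu", ["apt-get install -y apache2", "apt-get install -y mysql-server"])

print("first step shared:", first[0] == second[0])
print("second step shared:", first[1] == second[1])
```

With identifiers like these, two builds that start with the same steps can reuse the same layers, and only the steps after the first difference have to be run again.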

Conclusion

Voilà, this is how to create a containerization platform. From my understanding, all the parts described here have been available for years, some of them for decades. But together they make something special. That being said, please don’t try to build your own containerization platform over the weekend. It’s highly addictive and of course will take longer than you think. Here is what I came up with. It’s called plash :-)

DEV Community