Containerization technologies, as for most readers of this article I suspect, have long been lodged in my head. It would seem simple enough: just write a Dockerfile and be done with it. But I always want to learn something new and dig deeper into topics I have already mastered. For this reason, I decided to figure out how containers are implemented on Linux-based systems and then build my own "container" from the command line.
What do containers in Linux rest on?

First, you need to understand what containerization technology is built on. There are two mechanisms in the Linux kernel: namespaces and cgroups (control groups). Together they provide the isolation and resource limits that we all love containers for. Let's take a look at both mechanisms in order.
Namespaces
Namespaces allow us to isolate system resources between processes. With their help, we can create a separate virtual system while formally remaining in the host system. Perhaps this brief explanation has not enlightened you much, so let's look at an example:
Consider a container based on the alpine image. Let's start it with an interactive shell:
docker run -it alpine /bin/sh
Now let's create a new process in the container and check the output of the ps command:
sleep 1000 &
ps -a
We get:
PID USER TIME COMMAND
1 root 0:00 /bin/sh
29 root 0:00 sleep 1000
30 root 0:00 ps -a
Note that the process's PID is 29. Now let's try to find the same process on the host machine. To do this, we determine the container ID and use the command that lists the processes running inside a Docker container:
docker top <container ID>
As a result, we get:
UID PID PPID C STIME TTY TIME CMD
root 172147 172124 0 Feb05 pts/0 00:00:00 /bin/sh
root 173602 172147 0 Feb05 pts/0 00:00:00 sleep 1000
Pay attention to two columns: PID and PPID (parent PID). They show the PID of the process itself and of its parent, but as seen from the host system. Let's check:
ps aux | grep -E '173602|172147'
We get:
root 172147 0.0 0.0 1736 908 pts/0 Ss+ Feb05 0:00 /bin/sh
root 173602 0.0 0.0 1624 980 pts/0 S Feb05 0:00 sleep 1000
Which is exactly what we needed to prove! To sum up, we can conclude that the container knows nothing about the host machine. It considers itself to be an independent system. However, in reality, all processes are run on the host, they are simply located in the namespace of the container. This creates the illusion of a separate, independent system.
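By the way, you can observe this membership directly: every process exposes its namespaces as symlinks under /proc/<PID>/ns, and two processes share a namespace exactly when the inode numbers match. A quick check using the PIDs from my run above (substitute your own; the inode numbers below are just an illustration):
# Compare the PID namespace of a host process (PID 1) with the container's shell
sudo readlink /proc/1/ns/pid /proc/172147/ns/pid
# pid:[4026531836]
# pid:[4026532569]
Different inodes, hence different PID namespaces.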
I hope this example has clarified the situation with namespaces a little. In it, we looked at one of the eight types of namespaces. Now I would like to briefly go over each one:
- Mount - isolation of file system mount points. Allows each container to have its own file system hierarchy;
- UTS - host name isolation. Allows each container to set its own host name;
- PID - process ID isolation. Allows a separate process tree to be created;
- Network - isolation of network interfaces and routing tables;
- IPC - isolation of interprocess communication (IPC) objects;
- User - system user isolation. Allows separate users, including root, to be created for each container;
- Cgroup - isolation of the cgroup view. Allows container resources to be limited and prevents interference from other containers;
- Time - system time isolation.
To create a new namespace in Linux, there is a command called unshare. We will take a closer look at it a little later.
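For a quick taste of it right away, here is a minimal sketch: start a shell in fresh PID and mount namespaces, and it will see itself as PID 1 with no other host processes visible:
# Requires root. --fork makes the shell (not unshare itself) PID 1 in the new
# namespace; --mount-proc remounts /proc so that ps reads the new namespace
sudo unshare --pid --fork --mount-proc /bin/sh
# Inside, only the shell and ps itself are listed:
ps aux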
Cgroups
Control groups are a Linux kernel mechanism that allows you to manage process resources. With its help, you can limit and isolate the use of CPU, memory, network, and disk resources.
There are two versions of cgroups: v1 and v2. On most modern systems you will encounter the second version, which systemd uses. The main difference between the versions is how the constraint tree is built. In v1, a separate hierarchy is created for each type of constraint, and groups are added under each of them. In v2, each group has its own node that contains all the necessary constraints. To better understand this, let's look at a visualization of the v1 and v2 trees:
#v1
/sys/fs/cgroup/
├── cpu
│ ├── group1/
│ │ ├── tasks
│ │ ├── cgroup.procs
│ │ ├── cpu.shares
│ │ └── ...
│ ├── group2/
│ │ ├── tasks
│ │ ├── cgroup.procs
│ │ ├── cpu.shares
│ │ └── ...
│ └── ...
├── memory
│ ├── group1/
│ │ ├── tasks
│ │ ├── cgroup.procs
│ │ ├── memory.limit_in_bytes
│ │ └── ...
│ ├── group2/
│ │ ├── tasks
│ │ ├── cgroup.procs
│ │ ├── memory.limit_in_bytes
│ │ └── ...
│ └── ...
└── ...
#v2
/sys/fs/cgroup/
├── group1/
│ ├── cgroup.procs
│ ├── cpu.max
│ ├── cpu.weight
│ ├── memory.current
│ ├── memory.max
│ └── ...
├── group2/
│ ├── cgroup.procs
│ ├── cpu.max
│ ├── cpu.weight
│ ├── memory.current
│ ├── memory.max
│ └── ...
└── ...
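Before moving on, you can check which version your system runs; on v2 the whole tree lives in a single cgroup2 mount:
# Prints cgroup2fs on a unified (v2) system, tmpfs on v1
stat -fc %T /sys/fs/cgroup/
# On v2, list the controllers enabled at the root of the hierarchy
cat /sys/fs/cgroup/cgroup.controllers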
Now let's take a look at how cgroups work using the example of a Docker container. First, let's start the container, limiting its resources (2 cores and 512 MB):
docker run -d --cpus="2" --memory="512m" nginx
Next, we find the group for this container using find:
find /sys/fs/cgroup -name '*<container ID>*'
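The exact location depends on the distribution and the cgroup driver; with the systemd driver, the match typically looks something like this (illustrative path, not necessarily yours):
/sys/fs/cgroup/system.slice/docker-<full container ID>.scope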
Next, let's check the contents of the cpu.max and memory.max files in the directory we found:
# cpu.max
200000 100000
# memory.max
536870912
Which is what needed to be proven! A note on the format: in cpu.max the two numbers are the quota and the period in microseconds, so 200000 100000 means the group may use 200 ms of CPU time every 100 ms, i.e. two full cores; memory.max holds the limit in bytes (536870912 bytes = 512 MB).
Creating a container without Docker
We have covered the basic theory we need. Now let's move on to practice and resort to the magic of the command line.
First, let's create the container's file system structure and install busybox in the /bin directory:
# Create the root directory of the container and navigate to it.
mkdir ~/container && cd ~/container
# Create the main system directories and navigate to /bin.
mkdir -p ./{proc,sys,dev,tmp,bin,root,etc} && cd bin
# Install busybox.
wget https://www.busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
# Grant execution rights
chmod +x busybox
# Create symlinks for all commands available in busybox
./busybox --list | xargs -I {} ln -s busybox {}
# Return to the root directory of the container
cd ~/container
# Add the PATH variable to the /etc/profile file
echo 'export PATH=/bin' > ~/container/etc/profile
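Before going further, a quick sanity check that the rootfs works. The musl build of busybox is statically linked, so the applets should run directly on the host as well:
# The symlinks resolve to busybox, which dispatches on the command name
~/container/bin/echo "busybox is alive"
~/container/bin/ls ~/container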
We will also populate the /etc/passwd and /etc/group files so that we are root within the isolated system:
echo "root:x:0:0:root:/root:/bin/sh" > ~/container/etc/passwd
echo "root:x:0:" > ~/container/etc/group
Next, we will mount the system directories:
# Mount devices using existing ones
sudo mount --bind /dev ~/container/dev
# Mount processes
sudo mount -t proc none ~/container/proc
# Mount the sysfs file system
sudo mount -t sysfs none ~/container/sys
# Mount the tmpfs file system
sudo mount -t tmpfs none ~/container/tmp
Note: to unmount everything later, you can use the command:
sudo umount ~/container/{proc,sys,dev,tmp}
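You can verify that all four mounts landed where expected (assuming the ~/container path from above):
findmnt | grep container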
We have prepared the file system for our container. Now let's move on to creating isolation. To do this, we will use the command:
unshare -f -p -m -n -i -u -U --map-root-user --mount-proc=./proc \
/bin/chroot ~/container /bin/sh -c "source /etc/profile && exec /bin/sh"
Let's take a closer look at it:
- -f - fork. Runs the command in a child process, so that the child, rather than unshare itself, ends up in the new namespaces;
- -p - PID namespace;
- -m - mount namespace;
- -n - network namespace;
- -i - IPC namespace;
- -u - UTS namespace;
- -U - user namespace;
- --map-root-user - map the current user's uid and gid to root inside the container;
- --mount-proc - mount proc inside the container;
- /bin/chroot ~/container - change the root directory;
- /bin/sh -c "source /etc/profile && exec /bin/sh" - start a shell that applies /etc/profile and then execs an interactive shell.
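Once inside, a few quick checks confirm that the isolation took effect (assuming your busybox build includes these applets; output will vary):
# The shell sees itself as PID 1 in its own PID namespace
ps
# Changing the host name only affects this UTS namespace, not the host
hostname my-container
# The fresh network namespace contains only a loopback interface, still down
ip link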
Great! We got our container. Now we need to limit resources. To do this, we will open a new session on the host and perform a series of actions:
# Create a new group. My system uses cgroups v2, so
# the directory will be automatically configured to work with resources.
sudo mkdir /sys/fs/cgroup/my_container
# Write a limit of 2 processor cores
echo "200000 100000" | sudo tee /sys/fs/cgroup/my_container/cpu.max
# Allocate a maximum of 512MB of memory
echo 536870912 | sudo tee /sys/fs/cgroup/my_container/memory.max
Next, we need to determine the PID of the container. To do this, we will use the command:
ps aux | grep -E '/bin/sh$'
Take the PID from the second column and add it to the cgroup.procs file:
echo <PID> | sudo tee /sys/fs/cgroup/my_container/cgroup.procs
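To make sure the process really moved, read its cgroup membership back from /proc:
cat /proc/<PID>/cgroup
# On cgroups v2 this prints a single line: 0::/my_container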
That completes the basic configuration. We have created an isolated system and added resource restrictions. But we would like to make it a little more functional, so let's set up a virtual network between the host and the container:
# Create a pair of virtual interfaces
sudo ip link add veth-host type veth peer name veth-container
# Bring up the interface on the host
sudo ip link set veth-host up
# Assign any free address in your network to the host interface
# I am using 192.168.1.123/24
sudo ip addr add 192.168.1.123/24 dev veth-host
# Move veth-container to the container namespace
# Here you need to specify the PID of the container you used before
sudo ip link set veth-container netns <PID>
# Bring up the interface inside the container
sudo nsenter --net=/proc/<PID>/ns/net ip link set veth-container up
# Assign any free address on your network to the container interface
# I am using 192.168.1.124/24
sudo nsenter --net=/proc/<PID>/ns/net ip addr add 192.168.1.124/24 dev veth-container
# Configure the default gateway for traffic routing
sudo nsenter --net=/proc/<PID>/ns/net ip route add default via 192.168.1.123
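At this point the host and the container should already see each other, which is easy to check with the addresses from my example:
# From the host to the container
ping -c 1 192.168.1.124
# From the container's network namespace to the host
sudo nsenter --net=/proc/<PID>/ns/net ping -c 1 192.168.1.123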
We have brought up all the necessary interfaces. Now we need to configure routing:
# Allow packet forwarding
echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
# Add a NAT rule for masquerading outgoing packets from the network
# 192.168.1.0/24 through the interface that faces the external network. For me, this is enp3s0.
# Masquerading masks packets leaving the container so that they look
# like packets sent from the host
sudo iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o enp3s0 -j MASQUERADE
# Add a rule to allow packet forwarding
sudo iptables -A FORWARD -s 192.168.1.0/24 -o enp3s0 -j ACCEPT
# Add a rule to allow incoming packets
sudo iptables -A FORWARD -d 192.168.1.0/24 -m state --state RELATED,ESTABLISHED -j ACCEPT
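Now the container should be able to reach the outside world by IP address. Since DNS is not configured yet, test against a bare address:
sudo nsenter --net=/proc/<PID>/ns/net ping -c 1 8.8.8.8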
Great! We've created our first container. Obviously, there's still a lot left to configure, such as DNS, which isn't working right now. But I'll leave that for each reader to deal with in their own way.
