Ivan Zykov

My first container without Docker

Containerization, probably like for most readers of this article, is something I use on autopilot: just write a Dockerfile and don't overthink it. But it's always tempting to learn something new and dig deeper into topics you've already mastered. For this reason, I decided to figure out how containers are implemented on Linux-based systems and then build my own "container" using nothing but the command line.

Who maintains containers in Linux?

First, you need to understand what containerization technology is built on. The Linux kernel provides two mechanisms: namespaces and cgroups (control groups). Together they provide the isolation and resource control that we all love about containers. Let's look at each mechanism in turn.

Namespace

Namespaces allow us to isolate system resources between processes. With their help, we can create a separate virtual system while formally remaining in the host system. Perhaps this brief explanation has not enlightened you much, so let's look at an example:
Let's consider a container created from the alpine image. We'll start it together with an interactive shell inside it:

docker run -it alpine /bin/sh

Now let's create a new process in the container and check the output of the ps command:

sleep 1000 &
ps -a

We get:

PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
   29 root      0:00 sleep 1000
   30 root      0:00 ps -a

Note that the sleep process has PID 29. Now let's try to find the same process on the host machine. To do this, we determine the container ID and use the command that lists the processes running inside a Docker container:

docker top <container ID>

As a result, we get:

UID     PID       PPID      C    STIME    TTY      TIME        CMD
root    172147    172124    0    Feb05    pts/0    00:00:00    /bin/sh
root    173602    172147    0    Feb05    pts/0    00:00:00    sleep 1000

Let's pay attention to two columns: PID and PPID (parent PID). They indicate the PID of the process itself and its parent, but in the host system. Let's check it out:

ps aux | grep -E '173602|172147'

We get:

root      172147  0.0  0.0   1736   908 pts/0    Ss+  Feb05   0:00 /bin/sh
root      173602  0.0  0.0   1624   980 pts/0    S    Feb05   0:00 sleep 1000

Which is exactly what we needed to prove! To sum up, we can conclude that the container knows nothing about the host machine. It considers itself to be an independent system. However, in reality, all processes are run on the host, they are simply located in the namespace of the container. This creates the illusion of a separate, independent system.
I hope this example has clarified the situation with namespaces a little. In it, we looked at one of the eight types of namespaces. Now I would like to briefly go over each one:

  1. Mount - isolates file system mount points, allowing each container its own mount hierarchy;
  2. UTS - isolates the host name, allowing each container to set its own;
  3. PID - isolates process IDs, allowing a separate process tree;
  4. Network - isolates network interfaces and routing tables;
  5. IPC - isolates interprocess communication primitives (message queues, semaphores, shared memory);
  6. User - isolates user and group IDs, allowing separate users for each container, including root;
  7. Cgroup - isolates the view of the cgroup hierarchy, so a container cannot see or touch other containers' groups;
  8. Time - isolates the system clocks (boot and monotonic time).
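You can see which namespaces a process belongs to by looking at /proc/&lt;PID&gt;/ns: each entry there is a symlink whose target encodes the namespace type and an inode number, and two processes are in the same namespace exactly when those inode numbers match. A quick way to inspect the current shell (works on any Linux, no root required):

```shell
# Each entry is a symlink like 'uts:[4026531838]'; the number identifies
# the namespace instance this process is a member of.
ls -l /proc/self/ns
readlink /proc/self/ns/uts
```

Running this inside and outside a container and comparing the inode numbers is another way to see the isolation from the first example.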

To create a new namespace in Linux, there is a command called unshare. We will take a closer look at it a little later.

Cgroups

Control groups are a Linux kernel mechanism that allows you to manage process resources. With its help, you can limit and isolate the use of CPU, memory, network, and disk resources.
There are two versions of cgroups: v1 and v2. On most modern systems you will encounter the second version, which is what systemd uses. The main difference between the versions is how the constraint tree is built. In v1, there is a node per resource controller, and groups are added under each of them. In v2, each group has its own node, which holds all of its constraints. To make this concrete, here is a visualization of the v1 and v2 trees:

#v1
/sys/fs/cgroup/
├── cpu
│   ├── group1/
│   │   ├── tasks
│   │   ├── cgroup.procs
│   │   ├── cpu.shares
│   │   └── ...
│   ├── group2/
│   │   ├── tasks
│   │   ├── cgroup.procs
│   │   ├── cpu.shares
│   │   └── ...
│   └── ...
├── memory
│   ├── group1/
│   │   ├── tasks
│   │   ├── cgroup.procs
│   │   ├── memory.limit_in_bytes
│   │   └── ...
│   ├── group2/
│   │   ├── tasks
│   │   ├── cgroup.procs
│   │   ├── memory.limit_in_bytes
│   │   └── ...
│   └── ...
└── ...

#v2
/sys/fs/cgroup/
├── group1/
│   ├── cgroup.procs
│   ├── cpu.max
│   ├── cpu.weight
│   ├── memory.current
│   ├── memory.max
│   └── ...
├── group2/
│   ├── cgroup.procs
│   ├── cpu.max
│   ├── cpu.weight
│   ├── memory.current
│   ├── memory.max
│   └── ...
└── ...
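A quick heuristic (not an official API) for telling which version your own system uses is to check the filesystem type mounted at /sys/fs/cgroup, and to look at your process's own membership file:

```shell
# cgroup2 => the unified v2 hierarchy; tmpfs => a v1 controller tree
stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "no cgroup fs mounted"
# Membership of the current process: a single '0::<path>' line on pure v2,
# one line per controller on v1
cat /proc/self/cgroup
```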

Now let's take a look at how cgroups work using the example of a Docker container. First, let's start the container, limiting its resources (2 cores and 512 MB):

docker run -d --cpus="2" --memory="512m" nginx 

Next, we will find the cgroup for this container using find:

find /sys/fs/cgroup -name '*<container ID>*'

Next, let's check the contents of the cpu.max and memory.max files in the directory we found:

# cpu.max
200000 100000

# memory.max
536870912

Which is what needed to be proven!
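The numbers check out: in cgroup v2, cpu.max holds a quota and a period in microseconds, so "200000 100000" means the container may consume 200 ms of CPU time per 100 ms of wall time, i.e. 2 full cores, and memory.max is simply 512 MB expressed in bytes:

```shell
# cpu.max is "<quota> <period>" in microseconds
quota=200000; period=100000
echo "cores: $((quota / period))"               # 2
# memory.max is a plain byte count
echo "bytes in 512 MB: $((512 * 1024 * 1024))"  # 536870912
```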

Creating a container without Docker


We have covered the basic theory we need. Now let's move on to practice and resort to the magic of the command line.
First, let's create the container's file system structure and install busybox in the /bin directory:

# Create the root directory of the container and navigate to it.
mkdir ~/container && cd ~/container
# Create the main system directories and navigate to /bin.
mkdir -p ./{proc,sys,dev,tmp,bin,root,etc} && cd bin
# Install busybox.
wget https://www.busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
# Grant execution rights
chmod +x busybox
# Create symlinks for all commands available in busybox 
./busybox --list | xargs -I {} ln -s busybox {}
# Return to the root directory of the container
cd ~/container
# Add the PATH variable to the /etc/profile file
echo 'export PATH=/bin' > ~/container/etc/profile
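If you want to dry-run the layout first, the same skeleton can be built in a throwaway directory (this is only a sketch for inspection; the real setup above targets ~/container and downloads busybox):

```shell
# Build the same directory skeleton in a temporary location and inspect it
root=$(mktemp -d)
mkdir -p "$root/proc" "$root/sys" "$root/dev" "$root/tmp" \
         "$root/bin" "$root/root" "$root/etc"
ls "$root"
rm -rf "$root"   # clean up the dry run
```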

We will also populate the /etc/passwd and /etc/group files so that we are root within the isolated system:

echo "root:x:0:0:root:/root:/bin/sh" > ~/container/etc/passwd
echo "root:x:0:" > ~/container/etc/group
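The line we wrote follows the standard seven-field passwd format: name, password placeholder, UID, GID, GECOS comment, home directory, and login shell. Splitting it on colons makes the fields explicit:

```shell
# Parse the passwd entry we just created into its named fields
line="root:x:0:0:root:/root:/bin/sh"
echo "$line" | awk -F: '{printf "user=%s uid=%s gid=%s home=%s shell=%s\n", $1, $3, $4, $6, $7}'
# user=root uid=0 gid=0 home=/root shell=/bin/sh
```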

Next, we will mount the system directories:

# Mount devices using existing ones
sudo mount --bind /dev ~/container/dev
# Mount processes
sudo mount -t proc none ~/container/proc
# Mount the sysfs file system
sudo mount -t sysfs none ~/container/sys
# Mount the tmpfs file system
sudo mount -t tmpfs none ~/container/tmp

Note: to unmount everything later, you can use:

sudo umount ~/container/{proc,sys,dev,tmp}  

We have prepared the file system for our container. Now let's move on to creating isolation. To do this, we will use the command:

unshare -f -p -m -n -i -u -U --map-root-user --mount-proc=./proc \
    /bin/chroot ~/container /bin/sh -c "source /etc/profile && exec /bin/sh"

Let's take a closer look at it:
  • -f - fork: run the command in a child process, so that the child, not unshare itself, ends up in the new PID namespace;
  • -p - PID namespace;
  • -m - mount namespace;
  • -n - network namespace;
  • -i - IPC namespace;
  • -u - UTS namespace;
  • -U - user namespace;
  • --map-root-user - map the current user's UID and GID to root inside the container;
  • --mount-proc=./proc - mount a fresh proc filesystem for the new PID namespace;
  • /bin/chroot ~/container - change the root directory;
  • /bin/sh -c "source /etc/profile && exec /bin/sh" - start a shell that applies /etc/profile and then replaces itself with an interactive shell.

Great, we have our container! Now we need to limit its resources. To do this, open a new session on the host and perform a few steps:

# Create a new group. My system uses cgroups v2, so
# the directory will be automatically configured to work with resources.
sudo mkdir /sys/fs/cgroup/my_container
# Write a limit of 2 processor cores
echo "200000 100000" | sudo tee /sys/fs/cgroup/my_container/cpu.max
# Allocate a maximum of 512MB of memory
echo 536870912 | sudo tee /sys/fs/cgroup/my_container/memory.max

Next, we need to determine the PID of the container. To do this, we will use the command:

ps aux | grep -E '/bin/sh$' 

Take the PID from the second column and add it to the cgroup.procs file:

echo <PID> | sudo tee /sys/fs/cgroup/my_container/cgroup.procs
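To confirm the process actually joined the group, you can read its /proc/&lt;PID&gt;/cgroup file; after the write above it should show the my_container path. The same check works for any process, for example the current shell (no root needed):

```shell
# Shows which cgroup a process belongs to; on cgroup v2 this is
# a single line of the form '0::/some/path'
cat /proc/$$/cgroup
```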

That completes the basic configuration. We have created an isolated system and added resource restrictions. But we would like to make it a little more functional, so let's set up a virtual network between the host and the container:

# Create a pair of virtual interfaces
sudo ip link add veth-host type veth peer name veth-container
# Bring up the interface on the host
sudo ip link set veth-host up
# Assign any free address in your network to the host interface
# I am using 192.168.1.123/24
sudo ip addr add 192.168.1.123/24 dev veth-host
# Move veth-container to the container namespace
# Here you need to specify the PID of the container you used before
sudo ip link set veth-container netns <PID>
# Bring up the interface inside the container
sudo nsenter --net=/proc/<PID>/ns/net ip link set veth-container up
# Assign any free address on your network to the container interface
# I am using 192.168.1.124/24
sudo nsenter --net=/proc/<PID>/ns/net ip addr add 192.168.1.124/24 dev veth-container
# Configure the default gateway for traffic routing
sudo nsenter --net=/proc/<PID>/ns/net ip route add default via 192.168.1.123

We have raised all the necessary interfaces. Now we need to configure routing:

# Allow packet forwarding
echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
# Add a NAT rule for masquerading outgoing packets from the network 
# 192.168.1.0/24 through the interface that faces the external network. For me, this is enp3s0.
# Masquerading masks packets leaving the container so that they look
# like packets sent from the host
sudo iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o enp3s0 -j MASQUERADE
# Add a rule to allow packet forwarding
sudo iptables -A FORWARD -s 192.168.1.0/24 -o enp3s0 -j ACCEPT
# Add a rule to allow incoming packets
sudo iptables -A FORWARD -d 192.168.1.0/24 -m state --state RELATED,ESTABLISHED -j ACCEPT

Great! We've created our first container. Obviously, there's still a lot left to configure, such as DNS, which doesn't work right now, but I'll leave that as an exercise for the reader.
