Introduction
Previously I wrote about the multiple variants of Docker and also the dependencies behind the Docker daemon. One of the dependencies was the container runtime called runc. That is what creates the usual containers we are all familiar with. When you use Docker, this is the default runtime, which is understandable since it was started by Docker, Inc.
We can change it and use another runtime to create containers differently. We can also choose a runtime that doesn't even create containers, but virtual machines. You could also create your own runtime that adds something to the definition of the container or increases the level of the isolation.
The Docker documentation mentions alternative container runtimes, but I will not write about all of them. Obviously I'm not trying to copy the documentation. I want to compare three different kinds of runtimes which you could use for a simple Linux container like Ubuntu.
I will not explain how you can install these runtimes, but if you think the documentation is not good enough, please let me know in the comments and I will see what I can do.
You can also watch a video of this topic on YouTube.
Table of contents
- The chosen 3 runtimes
- Introduce the runtimes
- Comparing the runtimes
- Checking differences in practice
- Resource handling and limitations of the Kata runtime
- Conclusion
The chosen 3 runtimes
Although the documentation also mentions "youki", it is described as a "drop-in replacement" of the default runtime that basically does the same thing, so let's stick with runc. The second runtime will be the Kata runtime from Kata Containers. It runs small virtual machines, which makes it good for showing how differently it uses the CPU and memory. It also adds a higher level of isolation, with some downsides as well. The third runtime will be runsc from gVisor, which is a good third choice to see how we can run containers and still have somewhat more secure isolation. I will show how we can recognize the differences by running commands from the isolated environments and from the host.
Introduce the runtimes
runc
It probably doesn't require much explanation, since this is the default, and we can compare everything else to it. runc does what we are already used to: it creates and runs containers using Linux kernel namespaces, based on a config file which we don't have to create manually if we use Docker. Every process runs on the host, and the kernel just doesn't let the processes see everything on the host. That doesn't change the fact that, from the host, we can see everything that runs in the containers.
Since everything runs on the host, nothing is required in the container, only the process which we want to isolate. It means there is no visible kernel in the container on the filesystem. Just because you don't see the wheels inside a car, it doesn't mean the car is just floating. But when you run uname -a in the container to get information about the kernel, the output will be the same as you could see on the host, except the hostname, which is different in a container unless you use the host network or share the UTS namespace of the host with the container.
There is no hardware virtualization, so the processes in a container can see all the resources on the host, including the CPU and memory. Of course, you can set a CPU or memory limit for the containers, but that only restricts the amount of memory and CPU that the kernel allows the processes in the container to use. The hardware remains the same regardless of where the process is.
This can also lead to problems, since some applications, like databases, choose their default resource usage based on the available hardware. So if you set a lower memory limit for the container, the application can still try to use more memory than the limit allows, and the kernel will kill it. So setting the resource limits at the application level can be important in addition to the container-level resource limits. But what if the application doesn't support it, or there is a bug and it ignores the parameter?
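To make the difference between "what the hardware reports" and "what the container is allowed to use" more concrete, here is a minimal sketch of how an application or an entrypoint script could take the smaller of the two. It assumes a Linux host with cgroup v2, where the limit is readable from /sys/fs/cgroup/memory.max inside the container. The function name effective_memory_kib is made up for this example.

```shell
# effective_memory_kib TOTAL_KIB LIMIT
# Hypothetical helper: prints the smaller of the total memory (in KiB)
# and the cgroup limit (in bytes, or the word "max" meaning "no limit").
effective_memory_kib() {
  local total_kib="$1"
  local limit="$2"
  case "$limit" in
    max|'') echo "$total_kib"; return 0 ;;
  esac
  local limit_kib=$((limit / 1024))
  if [ "$limit_kib" -lt "$total_kib" ]; then
    echo "$limit_kib"
  else
    echo "$total_kib"
  fi
}

# Only meaningful on Linux, so skip quietly elsewhere.
if [ -r /proc/meminfo ]; then
  # Total memory as the kernel reports it (the same number "free" shows).
  total_kib="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  # Assumption: cgroup v2 layout. Falls back to "max" if the file is missing.
  limit="$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo max)"
  echo "MemTotal: $total_kib KiB, cgroup limit: $limit"
  echo "Plan for: $(effective_memory_kib "$total_kib" "$limit") KiB"
fi
```

With runc and a memory limit set, I would expect the first number to stay the host's total while the second shows the limit, which is exactly the mismatch described above.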
kata-runtime
As I mentioned before, the Kata runtime runs containers in their own very small virtual machines. If you still remember how runc worked, let's see how the Kata runtime changes everything.
The key is the fact that it runs a virtual machine below the container. Since we have a virtual machine, we need a kernel in the VM. That means, when you run uname -a you will get information about a different kernel: the one in the virtual machine created by the Kata runtime. Even though there is a virtual machine with another kernel, we will still not see the kernel on the filesystem, since the container is not replaced, but extended with a virtual machine layer.
The CPU and memory seen from the container will not be the same either. In this case, we will have hardware virtualization, so we will see the hardware available for the virtual machine. It also means we will not be able to use all the CPUs and all the memory. By default, a VM created by the Kata runtime will get 2 gigabytes of memory and 1 vCPU. So even if the application in the container does not support resource limits, it can only detect the hardware in the VM. There are some important details regarding resource limits with the Kata runtime, but we will discuss them later.
It is still important to know that the Docker daemon will still be on the host, and not in the VM created by the Kata runtime. It should be obvious, since the virtual machine is created by the runtime, which is executed by a shim process instructed by containerd, which is instructed by dockerd, the Docker daemon. So it couldn't possibly be in the virtual machine that it will create. This fact will be important when we want to talk about mounting files from the "host machine". The host machine is where the Docker daemon is running.
runsc
As mentioned before, runsc from gVisor creates containers. The difference is that it tries to make the container more secure by intercepting system calls sent by processes in the container before the host kernel could handle them. This interception makes some requests a little slower. In fact, the documentation mentions that you should use runsc only for "user-facing containers" like reverse proxy containers that the users are directly interacting with. Of course, if you are also interacting with a web application with a security hole and someone can execute a command through your application, using runsc only for the reverse proxy will not help a lot. But the point is that you probably don't want to use this runtime for all your containers due to the impact on the performance.
Let's assume you use it. Then the process in the container will see almost everything that it would see from a container created by runc, the default runtime. The difference is that runsc will have an application kernel to handle the intercepted system calls, so when you run uname -a, you will see that kernel.
Comparing the runtimes
The table below shows a comparison of the three runtimes:
| | runc (default) | runsc | kata-runtime |
|---|---|---|---|
| developer | opencontainers | gVisor (Google) | Kata Containers |
| isolation type | container | container | virtual machine |
| available resources | all resources | all resources | limited resources |
| used kernel | host kernel | application kernel | kernel in the VM |
| installable on | VM / Physical | VM / Physical | VM (with nested virt.) / Physical |
As you can see, most of the differences are the consequences of the isolation type. If you know what the runtimes create and the differences between a VM and a container, you have a pretty good idea of what to expect. The one additional difference is the application kernel used by runsc. That's why you see a different kernel, but in every case you still have a container either on the host or in a VM, so you will never see the kernel files. Even if you mount /boot into the container, you will mount it from the host where the Docker daemon is running, so you can't mount that folder from the VM created by the Kata runtime.
The second difference which is worth mentioning is that the runtimes that create containers can be used in virtual machines, but when the runtime creates a virtual machine, you need nested virtualization enabled for the host VM. This means you could not test Kata containers in Docker Desktop.
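If you are not sure whether your machine (or your host VM) could run a VM-based runtime at all, a rough pre-check can help. The sketch below assumes an x86 Linux machine and a KVM-based hypervisor, which is an assumption on my side and not something every setup needs. It only looks for the vmx (Intel) or svm (AMD) CPU flags and for the /dev/kvm device. The function name has_virt_flags is made up for this example.

```shell
# has_virt_flags [CPUINFO_FILE]
# Hypothetical helper: succeeds if the file mentions the vmx (Intel)
# or svm (AMD) CPU flag. Defaults to /proc/cpuinfo.
has_virt_flags() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  grep -Eq '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' "$cpuinfo" 2>/dev/null
}

if has_virt_flags; then
  echo "CPU virtualization flags: present"
else
  echo "CPU virtualization flags: not found"
fi

# Assumption: the hypervisor uses KVM, which is exposed as /dev/kvm.
if [ -e /dev/kvm ]; then
  echo "/dev/kvm: present"
else
  echo "/dev/kvm: missing"
fi
```

Inside a VM without nested virtualization I would expect both checks to fail, but passing them is not a guarantee that the runtime will work.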
Checking differences in practice
I wrote a script that can execute the same test command in different environments, including the host machine. I uploaded it to GitHub as a gist, but I share it here too to make sure you don't depend on the availability of GitHub and the gist.
#!/usr/bin/env bash

set -eu -o pipefail

# Environments to test. "Host" means the host machine, not a runtime.
runtimes=(Host runc runsc kata)

YELLOW_START='\033[1;33m'
YELLOW_END='\033[0m'

# Test label => command to execute.
declare -A COMMANDS=(
  [cpus]='nproc'
  [memory]='free | grep "Mem" | awk "{print \$2}"'
  [kernel]='uname -nrv'
  [filesystem]='ls /boot | awk "/vmlinuz-/" | sort -r | head -n1'
)

# Print or execute a command, either on the host or in a container.
function runtime_run() {
  local mode="$1"
  local runtime="$2"
  local command="$3"

  if [[ "$runtime" != "Host" ]]; then
    command="docker run --rm --runtime $runtime ubuntu $command"
  fi

  case "$mode" in
    echo) echo "$command" ;;
    yellow) echo -e "${YELLOW_START}${command}${YELLOW_END}" ;;
    exec) eval "$command" ;;
    *) >&2 echo "Invalid mode: $mode. Valid modes: echo, yellow, exec"; return 1 ;;
  esac
}

# Show the generated command in yellow, then its labeled output.
function showresult() {
  local runtime="$1"
  local label="$2"
  local command="$3"

  runtime_run yellow "$runtime" "$command"
  echo -n "$label: "
  runtime_run exec "$runtime" "$command"
}

# Use the labels passed as arguments, or every label when there are none.
labels=("$@")
if (( "${#labels[@]}" == 0 )); then
  for label in "${!COMMANDS[@]}"; do
    labels+=("$label")
  done
fi

for label in "${labels[@]}"; do
  for runtime in "${runtimes[@]}"; do
    showresult "$runtime" "$label" "${COMMANDS[$label]}"
    echo
  done
done
Since using the gist is still easier, let's try to download it first:
curl -L --output runtime-test.sh https://gist.githubusercontent.com/rimelek/05241c26a3b10ff8c9cfe1035b787996/raw/5ef46b3cbdeda5e5bd1f094724929ed9514e2f85/docker-container-runtime-test.sh
chmod +x runtime-test.sh
Then you can either run
./runtime-test.sh
without parameters, or test the availability of the CPU, memory and kernel one by one. The beginning of the script shows the categories you can test and what commands will be executed:
declare -A COMMANDS=(
[cpus]='nproc'
[memory]='free | grep "Mem" | awk "{print \$2}"'
[kernel]='uname -nrv'
[filesystem]='ls /boot | awk "/vmlinuz-/" | sort -r | head -n1'
)
Even before the above part, you can find the list of runtimes
runtimes=(Host runc runsc kata)
If your runtime names are different, you can change the list of runtimes in the script. "Host" here is not a runtime, but the host machine. I intentionally wrote it with an uppercase "H" to make it different from the runtime names.
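Before changing the list, it can be useful to ask the Docker daemon which runtime names it actually knows. Here is a small sketch that compares the names from docker info with the names the test script expects. The helper missing_runtimes is made up for this example, and the whole check is skipped when the docker command is not available.

```shell
# missing_runtimes "AVAILABLE" WANTED...
# Hypothetical helper: AVAILABLE is a newline-separated list of names.
# Prints every WANTED name that is not in that list, one per line.
missing_runtimes() {
  local available="$1"
  shift
  local name
  for name in "$@"; do
    if ! printf '%s\n' "$available" | grep -Fxq -- "$name"; then
      echo "$name"
    fi
  done
}

# Ask the daemon for the registered runtime names (skipped without Docker).
if command -v docker >/dev/null 2>&1; then
  available="$(docker info --format '{{range $name, $cfg := .Runtimes}}{{println $name}}{{end}}' 2>/dev/null || true)"
  missing_runtimes "$available" runc runsc kata
fi
```

If it prints nothing, all three names from the script are registered. Any name it prints is one you would need to install or rename in the runtimes list.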
If you run
./runtime-test.sh cpus
the following commands will be generated:
nproc
docker run --rm --runtime runc ubuntu nproc
docker run --rm --runtime runsc ubuntu nproc
docker run --rm --runtime kata ubuntu nproc
You can also test the memory availability
./runtime-test.sh memory
The generated commands would be
free | grep "Mem" | awk "{print \$2}"
docker run --rm --runtime runc ubuntu free | grep "Mem" | awk "{print \$2}"
docker run --rm --runtime runsc ubuntu free | grep "Mem" | awk "{print \$2}"
docker run --rm --runtime kata ubuntu free | grep "Mem" | awk "{print \$2}"
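One detail is worth noticing in these generated commands: the host shell splits the line at the pipes, so only free runs in the container, while grep and awk run on the host and just filter the output. That is fine for this test, because free is the part that reads the memory information. If you ever wanted the whole pipeline to run inside the container, one option is to pass it to sh -c. The following is only a sketch with a made-up function name, and it assumes the image contains every tool used in the pipeline.

```shell
# run_pipeline_in_container RUNTIME PIPELINE
# Hypothetical helper: runs the whole PIPELINE inside the container by
# handing it to "sh -c", so the host shell does not split it at the pipes.
run_pipeline_in_container() {
  local runtime="$1"
  local pipeline="$2"
  docker run --rm --runtime "$runtime" ubuntu sh -c "$pipeline"
}

# Example, left commented out because it needs Docker and the "kata" runtime:
# run_pipeline_in_container kata 'free | grep "Mem" | awk "{print \$2}"'
```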
To test the kernel version, you can run
./runtime-test.sh kernel
which generates and executes these commands:
uname -nrv
docker run --rm --runtime runc ubuntu uname -nrv
docker run --rm --runtime runsc ubuntu uname -nrv
docker run --rm --runtime kata ubuntu uname -nrv
Yes, we are using uname -nrv here instead of uname -a to make the lines shorter by not showing redundant information like the CPU architecture multiple times in the output.
And finally, you can also try to list the files under /boot
./runtime-test.sh filesystem
which generates and executes the following commands:
ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
docker run --rm --runtime runc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
docker run --rm --runtime runsc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
docker run --rm --runtime kata ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
Using awk, I list only the files starting with vmlinuz-, and I keep only the latest one, because otherwise we could get too many files which would be irrelevant for the test. If there is one, that proves we can see the kernel files. In this test we assume the current kernel with which the operating system was booted is the latest.
You can see the output of the script running on my machine without arguments
ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem: vmlinuz-5.15.0-122-generic
docker run --rm --runtime runc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem:
docker run --rm --runtime runsc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem:
docker run --rm --runtime kata ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem:
nproc
cpus: 12
docker run --rm --runtime runc ubuntu nproc
cpus: 12
docker run --rm --runtime runsc ubuntu nproc
cpus: 12
docker run --rm --runtime kata ubuntu nproc
cpus: 1
free | grep "Mem" | awk "{print \$2}"
memory: 16241924
docker run --rm --runtime runc ubuntu free | grep "Mem" | awk "{print \$2}"
memory: 16241924
docker run --rm --runtime runsc ubuntu free | grep "Mem" | awk "{print \$2}"
memory: 16241924
docker run --rm --runtime kata ubuntu free | grep "Mem" | awk "{print \$2}"
memory: 2038464
uname -nrv
kernel: ta-lxlt 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024
docker run --rm --runtime runc ubuntu uname -nrv
kernel: aea867200366 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024
docker run --rm --runtime runsc ubuntu uname -nrv
kernel: 91d3bb0285b8 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016
docker run --rm --runtime kata ubuntu uname -nrv
kernel: 9ca9368d8781 6.1.62 #1 SMP Mon Sep 9 09:44:34 UTC 2024
If you want to see me executing these commands, consider watching the video linked at the beginning of this post.
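The kernel lines in the output above can also be compared mechanically. This small sketch takes the second field of a uname -nrv line (the kernel release) and compares it with the host's. The lines are copied from my output above and the function name is made up. Keep in mind that a matching release is only a hint that the host kernel is shared, not a proof.

```shell
# kernel_release_of "UNAME_LINE"
# Hypothetical helper: prints the kernel release, which is the second
# field of a "uname -nrv" line (hostname, release, version).
kernel_release_of() {
  printf '%s\n' "$1" | awk '{print $2}'
}

host_line='ta-lxlt 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024'
runc_line='aea867200366 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024'
runsc_line='91d3bb0285b8 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016'
kata_line='9ca9368d8781 6.1.62 #1 SMP Mon Sep 9 09:44:34 UTC 2024'

host_release="$(kernel_release_of "$host_line")"
for line in "$runc_line" "$runsc_line" "$kata_line"; do
  release="$(kernel_release_of "$line")"
  if [ "$release" = "$host_release" ]; then
    echo "$release: same release as the host"
  else
    echo "$release: different release than the host"
  fi
done
```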
Resource handling and limitations of the Kata runtime
Since the Kata runtime runs a virtual machine, you cannot assign half a CPU to a VM. The docker commands still support setting --cpus 0.5, but it only means this will be the amount that the qemu process can use. By default, as mentioned before, the VM has 1 vCPU, and if you set the limit to be half a CPU, you will get 1 and a half rounded up to an integer, so you will get 2. The limit will increase the default amount and not replace it.
This is what happens with the memory as well, except it doesn't have to be rounded up. So if you set --memory 500M you get about 2 and a half gigabytes of memory. If you want to test it, you can use the commands generated by the script and add the limits:
docker run --rm -it --runtime kata --cpus 0.5 ubuntu nproc
docker run --rm -it --runtime kata --memory 500M ubuntu free | grep "Mem" | awk "{print \$2}"
The output is:
2
2562752
The second line is the memory in kibibytes (KiB), the default unit of free.
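The two numbers can be checked with a little arithmetic. The sketch below follows the rule described above for the CPU (default plus limit, rounded up to a whole vCPU) and then calculates how much memory was added compared to the default VM, using the two free values from this post. The function name kata_vcpus is made up, and the rule is only my reading of the behavior shown here, not something taken from the Kata source code.

```shell
# kata_vcpus DEFAULT LIMIT
# Hypothetical helper following the rule described in this post:
# add the limit to the default, then round up to a whole vCPU.
kata_vcpus() {
  awk -v d="$1" -v l="$2" 'BEGIN { s = d + l; c = int(s); if (c < s) c++; print c }'
}

# 1 default vCPU plus "--cpus 0.5"
kata_vcpus 1 0.5

# Memory values reported by "free" in this post, both in KiB.
default_kib=2038464   # VM without a memory limit
limited_kib=2562752   # VM started with "--memory 500M"
extra_kib=$((limited_kib - default_kib))
echo "extra: ${extra_kib} KiB = $((extra_kib / 1024)) MiB"
```

The first line printed is 2, matching the nproc output. The extra memory works out to 524288 KiB, which is 512 MiB, so on my machine the VM got slightly more than the 500 MiB I asked for. These two numbers alone don't tell why.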
I bet your first thought is that this is a bug. There is an issue on GitHub where someone thought the same. The fact is that Kata containers are different and there are limitations. The first one I noticed is that there is no way to share process or network namespaces between Docker containers. The fact that you cannot use the process namespace or network namespace of the host is easily understandable, because we have a VM and not just a host kernel isolating our processes.
Conclusion
Originally Docker created only containers. In fact, it used LXC as an "exec driver" which is basically what we call runtime today or at least the closest thing to it. It was deprecated in Docker 1.8.0.
Now you can even choose a runtime which creates a virtual machine or a container with a more secure isolation. Once there was a runtime for using an NVIDIA GPU called nvidia-container-runtime. That project is now deprecated and Docker has the "--gpus" option instead. Talking about GPUs is not in the scope of this blog post, but it is a good example of a special runtime that gave additional capabilities to containers.
Each runtime has benefits and downsides. Which one is the best for you depends on what you need it for. I recommend testing the runtimes before making a decision. Running a small VM could seem to be a good idea, but you can discover downsides that change your mind.