Ákos Takács

Comparing 3 Docker container runtimes - Runc, gVisor and Kata Containers

Introduction

Previously I wrote about the multiple variants of Docker and also the dependencies behind the Docker daemon. One of the dependencies was the container runtime called runc. That is what creates the usual containers we are all familiar with. When you use Docker, this is the default runtime, which is understandable since it was started by Docker, Inc.

We can change it and use another runtime to create containers differently. We can also choose a runtime that doesn't even create containers, but virtual machines. You could also create your own runtime that adds something to the definition of the container or increases the level of the isolation.

The Docker documentation mentions alternative container runtimes, but I will not write about all of them. Obviously I'm not trying to copy the documentation. I want to compare three different kinds of runtimes that you could use for a simple Linux container like Ubuntu.

I will not explain how you can install these runtimes, but if you think the documentation is not good enough, please let me know in the comments and I will see what I can do.

You can also watch a video of this topic on YouTube.

The chosen 3 runtimes

Although the documentation also mentions "youki", it is described as a "drop-in replacement" for the default runtime that does basically the same thing, so let's stick with runc. The second runtime will be the Kata runtime from Kata Containers, since it runs small virtual machines, which is good for showing how differently it uses the CPU and memory. It also adds a higher level of isolation, with some downsides as well. The third runtime will be runsc from gVisor, which is a good third choice to see how we can run containers and still have somewhat more secure isolation. I will show how we can recognize the differences by running commands from the isolated environments and from the host.

Introduce the runtimes

runc

It probably doesn't even require much explanation, since this is the default, and we could just compare everything to it. What runc does is what we are already used to. It creates and runs containers using Linux kernel namespaces based on a config file which we don't have to create manually if we use Docker. We know that every process is running on the host and the kernel just doesn't let the processes see everything on the host. That doesn't change the fact that from the host we can see everything that runs in the containers.
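The claim that the host can see the processes of a runc container can be checked with a short sketch. It assumes a running Docker daemon on a Linux host; the container name `runc-visibility-demo` and the helper `pid_visible` are made up for this example and are not part of Docker.

```shell
# Hypothetical helper: succeeds if the given PID has an entry under /proc.
# The second argument (a different proc directory) is optional.
pid_visible() {
  [ -d "${2:-/proc}/$1" ]
}

# The demo only runs when a Docker CLI is available.
if command -v docker >/dev/null 2>&1; then
  name=runc-visibility-demo   # assumed name, chosen only for this example
  if docker run -d --rm --name "$name" --runtime runc ubuntu sleep 60 >/dev/null 2>&1; then
    # .State.Pid is the PID of the container's main process as the host sees it.
    pid=$(docker inspect --format '{{.State.Pid}}' "$name")
    if pid_visible "$pid"; then
      echo "the container's sleep process is visible on the host as PID $pid"
    fi
    docker kill "$name" >/dev/null
  fi
fi
```

With Docker Desktop the daemon itself runs in a virtual machine, so the PID would be visible in that VM and not on your desktop operating system.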

Since everything runs on the host, nothing is required in the container, only the process which we want to isolate. It means there is no visible kernel on the container's filesystem. Just because you don't see the wheels inside a car, it doesn't mean the car is just floating. But when you run uname -a in the container to get information about the kernel, the output will be the same as you could see on the host, except the hostname, which is different in a container unless you use the host network or share the UTS namespace of the host with the container.

There is no hardware virtualization, the processes in a container can see all the resources on the host, including the CPU and memory. Of course, you can set a CPU or memory limit for the containers, but that only restricts the amount of memory and CPU that the kernel allows the processes in the container to use. The hardware remains the same regardless of where the process is.

It can also lead to problems, since some applications like databases may choose default resource limits based on the hardware they detect. If you set a lower memory limit for the container, such an application can still try to use more memory than allowed, and the kernel will kill it. So setting the resource limits at the application level can be important in addition to the container-level resource limits. But what if the application doesn't support it or there is a bug, and it ignores the parameter?
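The gap between the memory a process sees and the memory it may actually use can be made concrete. The following is only a sketch that assumes cgroup v2, where the limit of the container is readable as /sys/fs/cgroup/memory.max; `effective_mem_kib` is a hypothetical helper name.

```shell
# Hypothetical helper: the memory (KiB) a process can really use, given the
# MemTotal it sees (KiB) and its cgroup limit (bytes, or "max" for no limit).
effective_mem_kib() {
  if [ -z "$2" ] || [ "$2" = "max" ]; then
    echo "$1"
  elif [ $(($2 / 1024)) -lt "$1" ]; then
    echo $(($2 / 1024))
  else
    echo "$1"
  fi
}

# Compare what /proc reports with the cgroup limit (assumes cgroup v2).
if [ -r /proc/meminfo ]; then
  total_kib=$(sed -n 's/^MemTotal:[[:space:]]*\([0-9]*\).*/\1/p' /proc/meminfo)
  limit=$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo max)
  echo "visible: $total_kib KiB, usable: $(effective_mem_kib "$total_kib" "$limit") KiB"
fi
```

If you run this inside a runc container started with a --memory limit, the two numbers should differ, which is exactly the situation where an application sizing itself from the visible hardware gets into trouble.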

kata-runtime

As I mentioned before, the Kata runtime runs containers in their own very small virtual machines. If you still remember how runc worked, let's see how the Kata runtime changes everything.

The key is the fact that it runs a virtual machine below the container. Since we have a virtual machine, we need a kernel in the VM. That means, when you run uname -a you will get information about a different kernel. The one in the virtual machine created by the Kata runtime. Even though there is a virtual machine with another kernel, we will still not see the kernel on the filesystem, since the container is not replaced, but extended with a virtual machine layer.

The CPU and memory seen from the container will not be the same either. In this case, we will have hardware virtualization, so we will see the hardware available for the virtual machine. It also means we will not be able to use all the CPUs and all the memory. By default, a VM created by the Kata runtime will get 2 gigabytes of memory and 1 vCPU. So even if the application in the container does not support resource limits, it can only detect the hardware in the VM. There are some important details regarding resource limits with the Kata runtime, but we will discuss them later.

It is still important to know that the Docker daemon will still be on the host, and not in the VM created by the Kata runtime. It should be obvious, since the virtual machine is created by the runtime, which is executed by a shim process instructed by containerd, which is in turn instructed by dockerd, the Docker daemon. So the daemon couldn't possibly be in the virtual machine that it will create. This fact will be important when we want to talk about mounting files from the "host machine". The host machine is where the Docker daemon is running.

runsc

As mentioned before, runsc from gVisor creates containers. The difference is that it tries to make the container more secure by intercepting system calls sent by processes in the container before the host kernel could handle them. This interception makes some requests a little slower. In fact, the documentation mentions that you should use runsc only for "user-facing containers" like reverse proxy containers that the users are directly interacting with. Of course, if you are also interacting with a web application with a security hole and someone can execute a command through your application, using runsc only for the reverse proxy will not help a lot. But the point is that you probably don't want to use this runtime for all your containers due to the impact on the performance.

Let's assume you use it. Then the process in the container will see almost everything that it would see from a container created by runc, the default runtime. The difference is that runsc will have an application kernel to handle the intercepted system calls so when you run uname -a, you will see that kernel.

Comparing the runtimes

The table below shows a comparison of the three runtimes:

|                     | runc (default)  | runsc              | kata-runtime                      |
|---------------------|-----------------|--------------------|-----------------------------------|
| developer           | opencontainers  | gVisor (Google)    | Kata Containers                   |
| isolation type      | container       | container          | virtual machine                   |
| available resources | all resources   | all resources      | limited resources                 |
| used kernel         | host kernel     | application kernel | kernel in the VM                  |
| installable on      | VM / Physical   | VM / Physical      | VM (with nested virt.) / Physical |

As you can see, most of the differences are the consequences of the isolation type. If you know what the runtimes create and the differences between a VM and a container, you have a pretty good idea what to expect. The one additional difference is the application kernel used by runsc. That's why you see a different kernel, but in every case you still have a container either on the host or in a VM, so you will never see the kernel files. Even if you mount /boot into the container, you will mount it from the host where the Docker daemon is running, so you can't mount that folder from the VM created by the Kata runtime.

The second difference which is worth mentioning is that the runtimes that create containers can be used in virtual machines, but when the runtime creates a virtual machine, you need nested virtualization enabled for the host VM. This means you could not test Kata containers in Docker Desktop.
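Whether a machine could host the virtual machines of a VM-based runtime can be estimated before installing anything. This is a rough sketch that assumes an x86 Linux machine with KVM as the hypervisor backend: it looks for the Intel (vmx) or AMD (svm) CPU flag and for the /dev/kvm device. The function name and its optional file argument are my own.

```shell
# Hypothetical helper: succeeds if the cpuinfo file lists a hardware
# virtualization flag (vmx = Intel VT-x, svm = AMD-V).
cpu_has_virt_flags() {
  grep -Eq '^flags[[:space:]]*:.*[[:space:]](vmx|svm)([[:space:]]|$)' "${1:-/proc/cpuinfo}"
}

if cpu_has_virt_flags 2>/dev/null && [ -e /dev/kvm ]; then
  echo "hardware virtualization is exposed to this machine"
else
  echo "no vmx/svm flag or no /dev/kvm: a VM-based runtime will likely not start here"
fi
```

Inside a virtual machine, the first message generally means that nested virtualization is enabled for that VM.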

Checking differences in practice

I wrote a script that can execute the same test command in different environments, including the host machine. I uploaded it to GitHub as a gist, but I share it here too, so you don't depend on the availability of GitHub and the gist.

#!/usr/bin/env bash

set -eu -o pipefail

runtimes=(Host runc runsc kata)

YELLOW_START='\033[1;33m'
YELLOW_END='\033[0m'

declare -A COMMANDS=(
  [cpus]='nproc'
  [memory]='free | grep "Mem" | awk "{print \$2}"'
  [kernel]='uname -nrv'
  [filesystem]='ls /boot | awk "/vmlinuz-/" | sort -r | head -n1'
)

function runtime_run() {
  local mode="$1"
  local runtime="$2"
  local command="$3"

  if [[ "$runtime" != "Host" ]]; then
    command="docker run --rm --runtime $runtime ubuntu $command"
  fi

  case "$mode" in
    echo)   echo "$command" ;;
    yellow) echo -e "${YELLOW_START}${command}${YELLOW_END}" ;;
    exec)   eval "$command" ;;
    *) >&2  echo "Invalid mode: $mode. Valid modes: echo, yellow, exec"; return 1 ;;
  esac
}

function showresult() {
  local runtime="$1"
  local label="$2"
  local command="$3"

  runtime_run yellow "$runtime" "$command"
  echo -n "$label: "
  runtime_run exec "$runtime" "$command"
}

labels=("$@")
if (( "${#labels[@]}" == 0 )); then
  for label in "${!COMMANDS[@]}"; do
    labels+=("$label")
  done
fi

for label in "${labels[@]}"; do
  for runtime in "${runtimes[@]}"; do
    showresult "$runtime" "$label" "${COMMANDS[$label]}"
    echo
  done
done

Since using the gist is still easier, let's try to download it first:

curl -L --output runtime-test.sh https://gist.githubusercontent.com/rimelek/05241c26a3b10ff8c9cfe1035b787996/raw/5ef46b3cbdeda5e5bd1f094724929ed9514e2f85/docker-container-runtime-test.sh
chmod +x runtime-test.sh

Then you can either run

./runtime-test.sh

without parameters, or test the availability of the CPU, memory and kernel one by one. The beginning of the script shows the categories you can test and what commands will be executed:

declare -A COMMANDS=(
  [cpus]='nproc'
  [memory]='free | grep "Mem" | awk "{print \$2}"'
  [kernel]='uname -nrv'
  [filesystem]='ls /boot | awk "/vmlinuz-/" | sort -r | head -n1'
)

Even before the above part, you can find the list of runtimes

runtimes=(Host runc runsc kata)

If your runtime names are different, you can change the list of runtimes in the script. "Host" here is not a runtime, but the host machine. I intentionally wrote it with an uppercase "H" to make it different from the runtime names.
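If you are not sure which names are registered, you can ask the daemon. The sketch below assumes that your Docker version exposes the registered runtimes as the `Runtimes` map in the output of `docker info`; `select_runtimes` is a hypothetical helper that keeps only the names the daemon actually knows.

```shell
# Hypothetical helper: print each wanted runtime that appears in the
# newline-separated list of registered runtime names (first argument).
select_runtimes() {
  registered=$1
  shift
  for name in "$@"; do
    if printf '%s\n' "$registered" | grep -qxF -- "$name"; then
      echo "$name"
    fi
  done
}

if command -v docker >/dev/null 2>&1; then
  # Print the keys of the Runtimes map, one per line.
  registered=$(docker info --format '{{range $name, $rt := .Runtimes}}{{println $name}}{{end}}' 2>/dev/null || true)
  select_runtimes "$registered" runc runsc kata
fi
```

The names it prints are the ones you can safely keep in the runtimes list of the test script.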

If you run

./runtime-test.sh cpus

The following commands will be generated:

nproc
docker run --rm --runtime runc ubuntu nproc
docker run --rm --runtime runsc ubuntu nproc
docker run --rm --runtime kata ubuntu nproc

You can also test the memory availability

./runtime-test.sh memory

The generated commands would be

free | grep "Mem" | awk "{print \$2}"
docker run --rm --runtime runc ubuntu free | grep "Mem" | awk "{print \$2}"
docker run --rm --runtime runsc ubuntu free | grep "Mem" | awk "{print \$2}"
docker run --rm --runtime kata ubuntu free | grep "Mem" | awk "{print \$2}"

To test the kernel version, you can run

./runtime-test.sh kernel

which generates and executes these commands:

uname -nrv
docker run --rm --runtime runc ubuntu uname -nrv
docker run --rm --runtime runsc ubuntu uname -nrv
docker run --rm --runtime kata ubuntu uname -nrv

Yes, we are using uname -nrv here instead of uname -a to make the lines shorter by not showing redundant information like the CPU architecture multiple times in the output.

And finally, you can also try to list the files under /boot

./runtime-test.sh filesystem

which generates and executes the following commands:

ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
docker run --rm --runtime runc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
docker run --rm --runtime runsc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
docker run --rm --runtime kata ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1

Using awk I try to list only files starting with vmlinuz-, but I want to get the latest only, because otherwise we could get too many files which would be irrelevant for the test. If there is one, that proves we can see the kernel files. In this test we assume the current kernel with which the operating system was booted is the latest.
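If you don't want to rely on that assumption, you could look for the image of the running kernel directly. This is only a sketch: it assumes the Debian/Ubuntu naming convention `vmlinuz-<release>`, and `has_kernel_image` is a hypothetical helper, not part of the test script.

```shell
# Hypothetical helper: succeeds if a kernel image for the given release
# exists in the given directory (defaults: the running kernel and /boot).
has_kernel_image() {
  [ -e "${2:-/boot}/vmlinuz-${1:-$(uname -r)}" ]
}

if has_kernel_image; then
  echo "kernel image for $(uname -r) is visible under /boot"
else
  echo "no kernel image for $(uname -r) under /boot"
fi
```

Note also that sort -r compares the names as text, so a file ending in -99-generic would be listed before one ending in -122-generic. GNU sort has a -V option for version sorting if that matters on your machine.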

You can see the output of the script running on my machine without arguments

ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem: vmlinuz-5.15.0-122-generic

docker run --rm --runtime runc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem:
docker run --rm --runtime runsc ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem:
docker run --rm --runtime kata ubuntu ls /boot | awk "/vmlinuz-/" | sort -r | head -n1
filesystem:
nproc
cpus: 12

docker run --rm --runtime runc ubuntu nproc
cpus: 12

docker run --rm --runtime runsc ubuntu nproc
cpus: 12

docker run --rm --runtime kata ubuntu nproc
cpus: 1

free | grep "Mem" | awk "{print \$2}"
memory: 16241924

docker run --rm --runtime runc ubuntu free | grep "Mem" | awk "{print \$2}"
memory: 16241924

docker run --rm --runtime runsc ubuntu free | grep "Mem" | awk "{print \$2}"
memory: 16241924

docker run --rm --runtime kata ubuntu free | grep "Mem" | awk "{print \$2}"
memory: 2038464

uname -nrv
kernel: ta-lxlt 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024

docker run --rm --runtime runc ubuntu uname -nrv
kernel: aea867200366 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024

docker run --rm --runtime runsc ubuntu uname -nrv
kernel: 91d3bb0285b8 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016

docker run --rm --runtime kata ubuntu uname -nrv
kernel: 9ca9368d8781 6.1.62 #1 SMP Mon Sep  9 09:44:34 UTC 2024

If you want to see me executing these commands, consider watching the video linked at the beginning of this post.

Resource handling and limitations of the Kata runtime

Since the Kata runtime runs a virtual machine, you cannot assign half a CPU to a VM. The docker commands still support setting --cpus 0.5, but it only means this will be the amount that the qemu process can use. By default, as mentioned before, the VM has 1 vCPU and if you set the limit to be half a CPU, you will get 1 and a half rounded up to an integer, so you will get 2. The limit will increase the default amount and not replace it.

This is what happens with the memory as well, except it doesn't have to be rounded up. So if you set --memory 500M you get 2 and a half gigabytes of memory. If you want to test it, you can use the commands generated by the scripts and add the limits:

docker run --rm -it --runtime kata --cpus 0.5 ubuntu nproc
docker run --rm -it --runtime kata --memory 500M ubuntu free | grep "Mem" | awk "{print \$2}"

The output is:

2
2562752

The second line is the memory in kibibytes, which is the default unit of the free command.
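The arithmetic behind these two numbers can be written down as a small sketch. It models only the behavior described in this post (the default amount plus the limit, with the CPU count rounded up), not the actual code of the Kata runtime, and both function names are made up.

```shell
# Hypothetical model of the behavior described above, not Kata's actual code.
# vCPUs = default vCPUs plus the --cpus value, rounded up to an integer.
kata_vcpus() {
  awk -v d="$1" -v c="$2" 'BEGIN { t = d + c; v = int(t); if (v < t) v++; print v }'
}

# Difference between two "free" totals given in KiB, converted to MiB.
mem_delta_mib() {
  echo $((($2 - $1) / 1024))
}

kata_vcpus 1 0.5               # default 1 vCPU and --cpus 0.5
mem_delta_mib 2038464 2562752  # totals measured without and with --memory 500M
```

It prints 2 and 512. So on my machine the VM grew by 512 MiB rather than exactly 500 MiB, which suggests the added amount is rounded up somewhere, although this test alone does not tell us the exact rule.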

I bet the first thing you think is that it is a bug. There is an issue on GitHub where someone thought the same. The fact is that Kata containers are different and there are limitations. The first one I noticed was that there is no way to share process or network namespaces between Docker containers. The fact that you cannot use the process namespace or network namespace of the host is easily understandable, because we have a VM and not just a host kernel isolating our processes.

Conclusion

Originally Docker created only containers. In fact, it used LXC as an "exec driver" which is basically what we call runtime today or at least the closest thing to it. It was deprecated in Docker 1.8.0.

Now you can even choose a runtime which creates a virtual machine or a container with a more secure isolation. Once there was a runtime for using an NVIDIA GPU called nvidia-container-runtime. That project is now deprecated and Docker has the "--gpus" option instead. Talking about GPUs is not the scope of this blogpost, but it is a good example of a special runtime that gave additional capabilities to containers.

Each runtime has benefits and downsides. Which one is the best for you depends on what you need it for. I recommend testing the runtimes before making a decision. Running a small VM could seem to be a good idea, but you can discover downsides that change your mind.

Top comments (2)

John Wong

How about sysbox runtime?

Ákos Takács

Sorry, I didn't notice your comment. Now I just checked if I had any. I never installed sysbox myself, but it looks like Docker Desktop's "enhanced container isolation feature" is based on it, so thank you for mentioning it. In this post my goal was to compare those that we can see in the "alternative container runtimes" documentation.