Kubernetes 101: Pod

#kubernetes #devops #linux #docker

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.

In Kubernetes, Pods are the smallest deployable units of computing that you can create and manage. When I started learning Kubernetes, I had two questions about Pods in particular:

What really is a Pod?
How is it different from a container?

In this article I will try to answer these two questions. When I asked them, the common answer I got was:

A Pod is just a wrapper around the container.

But nobody would explain why this so-called wrapper is needed and what purpose it serves. So let's dig in and find out.

What actually is a Pod?

Containers are a great way to deploy a single unit of software. However, if you want to run several pieces of software together, you have to deploy a group of containers that are isolated from each other but partially share their environment. What's more, Kubernetes needs additional information to manage containers, such as a restart policy, which defines what to do with a container when it terminates, or a liveness probe, which the kubelet uses to decide when a container needs to be restarted. So Kubernetes provides an abstraction for all of this and calls it a Pod. The Pod hides the underlying complexity from end users.
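To make the restart policy and liveness probe a bit more concrete, here is a minimal sketch of a Pod manifest built as a Python dict and printed as JSON (kubectl accepts JSON manifests as well as YAML). The Pod name nginx-pod, the probe path and the probe interval are assumptions chosen for illustration, not values from a real cluster.

```python
import json


def build_pod_manifest(name="nginx-pod", image="nginx"):
    """Build a Pod manifest as a plain dict (name and image are illustrative)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # What to do with the containers of this Pod when they terminate.
            "restartPolicy": "Always",
            "containers": [
                {
                    "name": "nginx",
                    "image": image,
                    # Periodic HTTP check; repeated failures trigger a restart.
                    "livenessProbe": {
                        "httpGet": {"path": "/", "port": 80},
                        "periodSeconds": 10,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pod_manifest(), indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f should create the Pod, assuming you have access to a cluster.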

You must be wondering what I mean by partially sharing the environment, and how a Pod manages all this internally. Let's find out.

If you run a Pod in your Kubernetes cluster and list the containers running on the node with docker ps (on a node that uses Docker as its container runtime), you will find something odd.

There are two containers running: one is the nginx container that we launched, and the other is an additional container running /pause. Let's call this the pause container.

Together, these two containers form the actual Pod. The question now is: what is the pause container for? It acts as a sort of parent container for all the other containers launched in the Pod, and it has two main responsibilities:

First, it serves as the basis of Linux namespace sharing in the Pod.
And second, with PID namespace sharing enabled, it serves as PID 1 for the Pod and reaps zombie processes.

Let's look at these two responsibilities in detail.

Namespace Sharing

In Linux, whenever you start a new process, the child process inherits its namespaces (not to be confused with namespaces in Kubernetes) from its parent process. To get container-style isolation, the process has to run in separate namespaces of its own, and this is what forms the basis of containers. In the context of a Pod, the pause container creates the separate namespaces for the Pod. Docker does the same whenever you create a container, so it is as simple as running docker run -d --name pause k8s.gcr.io/pause:3.6, which isolates the container from the rest of the host. With the pause container running, the actual container we launch next joins the following namespaces of the pause container:

Network namespace
IPC namespace
PID namespace

In Docker we can do this with docker run -d --name nginx --net=container:pause --ipc=container:pause --pid=container:pause nginx. Now nginx has joined the network, IPC and PID namespaces of the pause container, and together the two containers effectively form our Pod.
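The two docker run steps above can be sketched as a small Python helper that only builds the command lines, without running Docker. The function name pod_commands and its parameters are hypothetical; the flags are the ones used in the text.

```python
import shlex


def pod_commands(pause_name="pause", pause_image="k8s.gcr.io/pause:3.6",
                 app_name="nginx", app_image="nginx"):
    """Return the two docker argv lists that assemble a Pod-like pair."""
    # Step 1: the pause container creates the namespaces.
    pause = ["docker", "run", "-d", "--name", pause_name, pause_image]
    # Step 2: the application container joins the pause container's
    # network, IPC and PID namespaces instead of getting its own.
    target = f"container:{pause_name}"
    app = ["docker", "run", "-d", "--name", app_name,
           f"--net={target}", f"--ipc={target}", f"--pid={target}",
           app_image]
    return [pause, app]


if __name__ == "__main__":
    for argv in pod_commands():
        print(shlex.join(argv))
```

To actually start the containers, each argv list could be passed to subprocess.run on a machine where Docker is installed.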

Namespace sharing in Pods has several benefits. Suppose, for example, that we have two containers, application and monitoring, and both have joined the pause container's namespaces. Sharing the:

network namespace allows the containers to communicate with each other directly over localhost.
IPC namespace allows the containers to use inter-process communication (IPC) with each other, so they can communicate directly through shared memory.
PID namespace allows the containers to see each other's processes, which lets, for example, a monitoring application deployed as a container access information about the other applications running in the Pod.

To verify whether both containers really share these namespaces, we can get the PIDs of both containers and compare their namespaces. Here we go.

Getting the PIDs of the containers

PID 6631 --> Pause container
PID 6882 --> Nginx container

Checking the namespaces of the containers

If we compare the two, we can see that all the namespaces are the same for both containers except the PID namespace (we will discuss that later).
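The comparison above can be sketched in Python by reading the namespace links that Linux exposes under /proc/<pid>/ns. This is a minimal sketch, assuming a Linux host and enough permissions to read those links (usually root for container processes); the helper names are hypothetical, and the two PIDs are the ones from this article, so they will differ on your node.

```python
import os

NS_TYPES = ("net", "ipc", "pid", "mnt", "uts")


def read_namespaces(pid, proc_root="/proc"):
    """Return {namespace type: identifier} for a process.

    Each /proc/<pid>/ns/<type> entry is a symlink such as 'net:[4026531992]'.
    Entries that cannot be read are skipped.
    """
    result = {}
    for ns_type in NS_TYPES:
        link = os.path.join(proc_root, str(pid), "ns", ns_type)
        try:
            result[ns_type] = os.readlink(link)
        except OSError:
            continue
    return result


def shared_namespaces(ns_a, ns_b):
    """Return the sorted namespace types that have the same identifier in both."""
    return sorted(t for t in ns_a if ns_b.get(t) == ns_a[t])


if __name__ == "__main__":
    pause_ns = read_namespaces(6631)   # pause container PID from the article
    nginx_ns = read_namespaces(6882)   # nginx container PID from the article
    print("shared:", shared_namespaces(pause_ns, nginx_ns))
```

Two processes are in the same namespace exactly when the identifiers in the links match, which is what shared_namespaces checks.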

Assuming PID 1 and reaping zombie processes

The second responsibility of the pause container is to act as PID 1 for the Pod and reap zombie processes, provided that PID namespace sharing is enabled. In a Linux PID namespace, processes form a tree-like structure called the process tree, and PID 1 is the root of that tree. It is generally reserved for an init process such as systemd, and all other processes are started from this init process using the fork and exec syscalls.

Whenever a new process is created, a new entry is added to the OS process table. The entry records the state of the process and, once it has terminated, its exit code. When a child process has finished running, its process table entry remains until the parent process has retrieved its exit code using the wait syscall. This is called "reaping" the zombie.

But if the parent dies before the child, the OS reassigns the child to the init process, which "adopts" it and becomes its new parent. Zombie processes (shown as <defunct> in ps output) are processes that have stopped running but whose process table entry still exists because the parent has not yet retrieved the exit code via the wait syscall. Technically, every process that terminates is a zombie for a very short time, but a zombie can live much longer if its parent never calls wait.
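The life cycle described above can be observed with a short Python sketch: fork a child that exits immediately, look at its state while the parent has not yet called wait, then reap it. It assumes a Unix-like system; reading the state from /proc works on Linux only, and the helper names are hypothetical.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The state is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def spawn_and_reap(exit_code):
    """Fork a child that exits at once; return (state before reaping, exit code)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)           # child: terminate immediately
    time.sleep(0.2)                   # parent: give the child time to exit
    state_before = process_state(pid) # usually "Z" (zombie) on Linux at this point
    _, status = os.waitpid(pid, 0)    # reap: collect the exit code, free the entry
    return state_before, os.WEXITSTATUS(status)


if __name__ == "__main__":
    state, code = spawn_and_reap(7)
    print("state before wait:", state, "exit code:", code)
```

Once waitpid returns, the process table entry is gone and the PID no longer shows up in ps.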

In containers, one process must be the init process for each PID namespace. With Docker, each container usually has its own PID namespace and the ENTRYPOINT process is the init process. However, a container can be made to run in another container's namespace. In this case, one container must assume the role of the init process, while the others are added to the namespace as children of the init process. But most user applications, like nginx in our case, are not well suited to be init processes and do not reap the zombie processes they adopt. That means we could potentially accumulate lots of zombies, and they would last for the life of that container.

In Kubernetes Pods, containers are run in much the same way as above, but a special pause container is created for each Pod. The pause container assumes the role of PID 1 and reaps any zombies by calling wait on them when they are orphaned by their parent process. This way, zombies don't pile up in the PID namespaces of our Kubernetes Pods.
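The reaping job of an init process can be sketched in a few lines of Python. This is not the pause container's actual code (which is a small C program), only an illustration of the idea under simplifying assumptions: it polls instead of waiting for SIGCHLD, and it reaps its own direct children rather than adopted orphans. All function names are hypothetical.

```python
import os
import time


def reap_zombies():
    """Reap every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) tuples.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break                      # no children left at all
        if pid == 0:
            break                      # children exist, but none has exited yet
        reaped.append((pid, os.WEXITSTATUS(status)))
    return reaped


def spawn_children(exit_codes):
    """Fork one short-lived child per exit code and return their PIDs."""
    pids = []
    for code in exit_codes:
        pid = os.fork()
        if pid == 0:
            os._exit(code)
        pids.append(pid)
    return pids


def init_loop(expected, timeout=5.0, poll_interval=0.05):
    """Poll like a tiny init until `expected` children are reaped or time runs out."""
    collected = []
    deadline = time.monotonic() + timeout
    while len(collected) < expected and time.monotonic() < deadline:
        collected.extend(reap_zombies())
        time.sleep(poll_interval)
    return collected


if __name__ == "__main__":
    spawn_children([0, 1, 2])
    print(init_loop(expected=3))
```

The key call is waitpid with -1, which means "any child": an init process does not know in advance which processes it will end up adopting.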

PID Namespace

It's useful to note that there has been a lot of discussion about PID namespace sharing. The pause container only reaps zombies if PID namespace sharing is enabled. In Kubernetes 1.8 and above it is disabled by default; it was originally turned on with a kubelet flag, and in current versions it is enabled per Pod by setting shareProcessNamespace: true in the Pod spec. If PID namespace sharing is not enabled, each container in a Pod has its own PID 1 and each one has to reap zombie processes itself. This is why our containers did not show the same PID namespace when we checked earlier: they are not sharing a PID namespace. Often this isn't a problem because the application doesn't spawn other processes, but zombie processes using up resources is an often-overlooked issue.
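In current Kubernetes versions, PID namespace sharing can be requested per Pod through the shareProcessNamespace field of the Pod spec. Here is a minimal sketch of such a manifest, again as a Python dict printed as JSON. The Pod name, the busybox image and the sleep command are assumptions for illustration; the container names reuse the application and monitoring example from earlier.

```python
import json


def shared_pid_pod(name="shared-pid-demo"):
    """Build a two-container Pod manifest that shares one PID namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # With this set, the pause container runs as PID 1 for the
            # whole Pod and the containers can see each other's processes.
            "shareProcessNamespace": True,
            "containers": [
                {"name": "application", "image": "nginx"},
                {
                    "name": "monitoring",
                    "image": "busybox",
                    "command": ["sleep", "3600"],
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(shared_pid_pod(), indent=2))
```

With the field left out (or set to false), each container keeps its own PID namespace and its own PID 1.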

There is a whole lot more going on inside Pods, but it is not possible to cover it all here due to space-time complexity 😄 So I leave it to you to explore further. I know it's a long one, but I hope you now understand what exactly a Pod is and how it works internally. Thanks for reading, and see you next time!