Krishna Tej Chalamalasetty

Posted on • Originally published at chkrishnatej.dev

What is a Container? The OS-Level Truth Most Engineers Don't Know

"You Keep Using That Word"

Dispelling Container Misconceptions at the OS Level

Before we write a single line of code, we need to kill the buzzword fog.

What a Container Actually Is

The marketing definition you have heard a hundred times: "a container is an executable unit of software with its dependencies bundled together." That is not wrong, but it tells you nothing useful about what is actually happening on the machine.

Here is the OS-level truth: a container is a process (or a tree of processes) that the kernel runs inside its own set of namespaces, which restrict what it can see, and under a cgroup-enforced ceiling on the resources it can consume. That is the entire trick. No hypervisor, no guest kernel, no virtualized hardware. Just a process with a carefully constructed set of constraints.
Everything else in this article is evidence for that single claim.
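
Both ingredients are visible for any process under /proc, containerized or not. The following is a minimal sketch, assuming a Linux host with procfs mounted; describe_process and its optional second argument (an alternate procfs root, handy for testing) are hypothetical names introduced here, not part of any tool.

```shell
#!/bin/sh
# Print the namespaces and cgroup membership of one process.
# $1 = PID (default: this shell); $2 = procfs root (default: /proc).
describe_process() {
  pid=${1:-$$}
  root=${2:-/proc}
  echo "namespaces of PID $pid:"
  for link in "$root/$pid"/ns/*; do
    # Each entry is a symlink whose target looks like "pid:[4026531836]".
    if [ -L "$link" ]; then
      printf '  %s\n' "$(readlink "$link")"
    fi
  done
  echo "cgroup membership of PID $pid:"
  if [ -r "$root/$pid/cgroup" ]; then
    sed 's/^/  /' "$root/$pid/cgroup"
  fi
}

# Inspect the current shell when procfs is available.
if [ -d "/proc/$$/ns" ]; then
  describe_process "$$"
fi
```

Running describe_process against a container's host PID (as root) and against the shell itself should show different namespace identifiers and a different cgroup path for the container.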

Spin up a simple HTTPD container

I used the following podman command to spin up an HTTPD container with a limited amount of resources.

# Flags used below (comments cannot follow a trailing backslash):
#   --memory=512m       hard memory ceiling
#   --memory-swap=512m  memory + swap ceiling; equal to --memory, so no swap is allowed
#   --cpus=1.5          CPU quota: 1.5 cores' worth of CPU time
#   --cpu-shares=512    relative CPU weight during contention
#   --pids-limit=100    at most 100 processes/threads in this container
#   --blkio-weight=500  relative block I/O weight
podman run -d \
  --name my-limited-httpd \
  --replace \
  --memory=512m \
  --memory-swap=512m \
  --cpus=1.5 \
  --cpu-shares=512 \
  --pids-limit=100 \
  --blkio-weight=500 \
  -p 8081:80 \
  registry.access.redhat.com/ubi8/httpd-24

This starts a container with the given resource limits. You can see it in the host's process tree:

krish-local:~ # pstree -pT
systemd(1)─┬─NetworkManager(944)
           ├─agetty(1011)
           ├─agetty(1012)
           ├─auditd(835)
           ├─chronyd(1007)
           # conmon is the container monitor; httpd is the container process we started
           ├─conmon(13669)───httpd(13684)─┬─cat(13727) 
           │                              ├─cat(13728)
           │                              ├─cat(13729)
           │                              ├─cat(13730)
           │                              ├─httpd(13731)
           │                              ├─httpd(13734)
           │                              └─httpd(35659)
           ├─dbus-broker-lau(854)───dbus-broker(856)
           ├─dotd(959)───sleep(3951905)
           ├─irqbalance(858)
           ├─rsyslogd(1159)
           ├─sshd(1562)───sshd-session(3949831)───sshd-session(3950094)───bash(3950096)───pstree(3954963)
           ├─systemd(3950077)───(sd-pam)(3950079)
           ├─systemd-journal(621)
           ├─systemd-logind(890)
           ├─systemd-udevd(657)
           └─wpa_supplicant(891)

From podman run to a Process - What Actually Happens

When you ran podman run, Podman did not start httpd directly. It handed the work to an OCI runtime. On SLES 16, that runtime is crun (you can verify with podman info | grep -i runtime). crun is the thing that actually calls clone() with the right namespace flags, writes the cgroup limits into /sys/fs/cgroup, switches into the root filesystem that Podman assembled from the image layers, and then calls execve() to start your process.
runc does the same job; it is the original OCI reference runtime and is written in Go. crun is an implementation in C that is lighter and faster, and it is the default runtime for Podman on many modern distros.

conmon in the pstree output is the container monitor: a small supervisor process that holds the container's stdio open, watches the container process, and reports its exit code back to Podman. It is not part of your workload. It is bookkeeping infrastructure.

So the full chain is:

podman → crun → clone() + execve() → your process

By the time httpd appears in the process table, crun has already exited. Its job was setup, not supervision.
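
The execve() step at the end of that chain replaces the calling program in place instead of creating a new process, so the PID stays the same while the program changes. This is a small sketch of that behavior using the shell's exec builtin. It is not crun's code, and pid_across_exec is a hypothetical helper name.

```shell
#!/bin/sh
# Show that exec replaces the running program without changing the PID.
pid_across_exec() {
  # The outer sh prints its PID, then exec replaces it with a second sh,
  # which prints its own PID. No fork happens in between.
  sh -c 'echo "before exec: $$"; exec sh -c "echo \"after exec: \$\$\""'
}

pid_across_exec
```

Both lines report the same PID: the process identity survives, only the program image is swapped.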


Four Words the Industry Uses Interchangeably — and Shouldn't

These four terms get collapsed into each other constantly. The confusion is not accidental. Vendors benefit from the blurring. But it costs engineers clarity when something breaks.

| Term | What it actually means |
| --- | --- |
| Kernel | The core software that mediates between hardware and every process running on the machine. It owns CPU scheduling, memory management, syscall handling, and namespace/cgroup enforcement. |
| Operating System | The full stack: kernel plus user-space tooling (libc, shell, package manager, init system) that makes the machine usable by humans or services. RHEL is an OS. The kernel alone is not. |
| Process | A running instance of a program. The kernel has allocated it a PID, some memory, and file descriptors. It exists in kernel memory; /proc is a virtual filesystem that exposes it, not where it lives. |
| Image | Explained in full below. |

What an Image Actually Is

An image is a blueprint. A container is one running instance of it. You can start ten containers from the same image simultaneously and each one gets its own thin writable layer on top of the shared read-only filesystem layers underneath. The image itself never runs. It is inert until a runtime turns it into a process.

At the OCI spec level, an image is a stack of read-only filesystem layers plus JSON metadata (a manifest and a config). When you run a container, Podman unpacks those layers into a root filesystem and generates a runtime spec; the OCI runtime (crun on this system, runc historically) then applies the runtime constraints and hands a process off to the kernel.

  • Image = what to run.
  • Runtime spec = how to run it.
  • Container = the process that results.

The PID 1 Myth

The belief: there is a mini Linux inside the container, and PID 1 is its init system — the root of a private process universe.

The reality: PID 1 is namespace-relative. The first process spawned inside a new PID namespace gets assigned PID 1 within that namespace. It is simultaneously visible on the host with a completely different PID. One process, two identities, depending on which namespace you are observing from.

Let us verify this directly. The nsenter command lets us step into a process's namespaces and run a command from inside them.

Stepping into the host's namespaces (PID 1 = systemd):

# Enter PID 1's PID namespace (-p) and mount namespace (-m, needed so
# /proc reflects the target namespace) and run ps
krish-local:~ # nsenter -t 1 -p -m ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 Mar29 ?        00:00:05 /usr/lib/systemd/systemd --switched-root --system --deserialize=47
root           2       0  0 Mar29 ?        00:00:00 [kthreadd]
root           3       2  0 Mar29 ?        00:00:00 [pool_workqueue_release]
root           4       2  0 Mar29 ?        00:00:00 [kworker/R-kvfree_rcu_reclaim]
root           5       2  0 Mar29 ?        00:00:00 [kworker/R-rcu_gp]
...

Stepping into the container's namespaces (container process host PID = 13684):

# Same command, but targeting the container's PID namespace.
# Inside this namespace, httpd was the first process spawned — so it gets PID 1.
krish-local:~ # nsenter -t 13684 -p -m ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
default        1       0  0 01:47 ?        00:00:29 httpd -D FOREGROUND
default       38       1  0 01:47 ?        00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default       39       1  0 01:47 ?        00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default       40       1  0 01:47 ?        00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default       41       1  0 01:47 ?        00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default       42       1  0 01:47 ?        00:00:29 httpd -D FOREGROUND
default       45       1  0 01:47 ?        00:00:17 httpd -D FOREGROUND
root      615091       0  0 10:11 ?        00:00:00 ps -ef

That httpd process sitting at PID 1 inside the container namespace? On the host, it is PID 13684. Same process, different lens.
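
The kernel also records both identities in one place: the NSpid field of /proc/<pid>/status (Linux 4.1 and later) lists the PID in each nested PID namespace, outermost first. The sketch below parses that field. nspid_summary is a hypothetical helper name, and the sample input is illustrative text built from the PIDs above, not captured output.

```shell
#!/bin/sh
# Read /proc/<pid>/status text on stdin and print the PID as seen from the
# outermost and the innermost PID namespace.
nspid_summary() {
  awk '$1 == "NSpid:" { printf "outer=%s inner=%s\n", $2, $NF }'
}

# Illustrative input shaped like a containerized process:
# host PID 13684, PID 1 inside its own PID namespace.
printf 'Name:\thttpd\nNSpid:\t13684\t1\n' | nspid_summary
# prints: outer=13684 inner=1

# On a live host (as root): nspid_summary < /proc/13684/status
```

For a process that lives only in the host PID namespace, NSpid has a single value, so outer and inner are the same number.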

The Shared Kernel - uname Does Not Lie

If there were a mini Linux inside, it would have its own kernel, its own kernel version, its own architecture identity. Run uname on the host and inside the container and compare:

Host:

krish-local:~ # uname -s && uname -m && uname -o && uname -r
Linux
x86_64
GNU/Linux
6.12.0-160000.9-default

Inside the container:

krish-local:~ # podman exec my-limited-httpd sh -c 'uname -s && uname -m && uname -o && uname -r'
Linux
x86_64
GNU/Linux
6.12.0-160000.9-default

Identical output. The container is not running a different kernel; it is sharing the host kernel. The UTS namespace gives it an isolated hostname, but kernel identity is not part of that isolation. There is no guest kernel to find.
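
What can differ between host and container is the namespace itself. Each entry under /proc/<pid>/ns is a symlink whose target carries an inode number, and two processes share a namespace exactly when those numbers match. This is a rough sketch, assuming procfs and enough privilege to read the target processes; ns_inode and same_ns are hypothetical helper names.

```shell
#!/bin/sh
# Extract the inode number from a namespace link target like "uts:[4026531838]".
ns_inode() {
  printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# Succeed if two PIDs are in the same namespace of the given type.
# usage: same_ns <type> <pid-a> <pid-b>
same_ns() {
  a=$(ns_inode "$(readlink "/proc/$2/ns/$1")")
  b=$(ns_inode "$(readlink "/proc/$3/ns/$1")")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example with the PIDs used in this article (host systemd = 1, httpd = 13684):
#   same_ns uts 1 13684 && echo "shared UTS namespace" || echo "separate UTS namespaces"
```

If the container has its own UTS namespace, the comparison reports separate namespaces even though uname -r is identical on both sides.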


Resource Isolation: cgroups Are the Real Enforcement Mechanism

Namespaces control what a process can see. cgroups control what a process can consume. Together they are the two pillars of container isolation.

Note on cgroup versions: The demos below use cgroup v2 paths (/sys/fs/cgroup/<scope>/memory.max), which is the unified hierarchy used by SLES 16 with kernel 6.12. If you are on an older distro still running cgroup v1, your paths will look different (/sys/fs/cgroup/memory/<scope>/memory.limit_in_bytes).
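
If you are unsure which hierarchy your host uses, the filesystem type mounted at /sys/fs/cgroup is a common tell. The following is a minimal sketch, assuming GNU stat is available; cgroup_version_for_fstype is a hypothetical helper name.

```shell
#!/bin/sh
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
cgroup_version_for_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # tmpfs holding per-controller v1 mounts (or hybrid)
    *)         echo "unknown" ;;
  esac
}

# GNU stat: -f queries the filesystem, -c %T prints its type name.
if [ -d /sys/fs/cgroup ]; then
  cgroup_version_for_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
fi
```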

Now compare cgroup limits for systemd (PID 1 on the host) versus the container process.

systemd - no enforced ceiling:

krish-local:~ # CGROUP=$(cat /proc/1/cgroup | cut -d: -f3)
krish-local:~ # echo $CGROUP
/init.scope
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/memory.max
max
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/pids.max
max

max means unlimited. The init process is not constrained.

The container process - enforced ceiling:

krish-local:~ # PID=13684  # container's PID on the host
krish-local:~ # CGROUP=$(cat /proc/$PID/cgroup | cut -d: -f3)
krish-local:~ # echo $CGROUP
/machine.slice/libpod-a67892f3083285e34c738fd1e75cccd7eaadbda71f5a8c60a522e73546c0d5a2.scope
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/memory.max
536870912
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/pids.max
100

536870912 bytes = 512 MiB. Exactly the --memory=512m flag we passed at startup. The pids.max of 100 matches --pids-limit=100. The kernel is enforcing these budgets directly — not Podman, not any container runtime abstraction sitting above the kernel.
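
The arithmetic behind those numbers can be sketched as follows. size_to_bytes and cpus_to_cpu_max are hypothetical helpers, not Podman code. The CPU part assumes that --cpus is translated into a quota over the default 100000 microsecond period, which cgroup v2 exposes in cpu.max as "quota period"; the article's output does not show that file, so treat the second result as an expectation to verify on your own host.

```shell
#!/bin/sh
# Convert a size such as 512m to bytes (binary multiples; k, m, g suffixes).
size_to_bytes() {
  n=${1%[bkmgBKMG]}
  case "$1" in
    *[kK]) echo $(( n * 1024 )) ;;
    *[mM]) echo $(( n * 1024 * 1024 )) ;;
    *[gG]) echo $(( n * 1024 * 1024 * 1024 )) ;;
    *)     echo $(( n )) ;;
  esac
}

# Convert a --cpus value to the "quota period" pair used by cpu.max.
# $2 = period in microseconds (assumed default: 100000).
cpus_to_cpu_max() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%d %d\n", cpus * period, period }'
}

size_to_bytes 512m      # prints 536870912, matching memory.max above
cpus_to_cpu_max 1.5     # prints 150000 100000
```

To check the CPU value against the kernel, read cpu.max from the same cgroup directory used for memory.max and pids.max.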


Putting It Together

A container is a process. It shares the host kernel, and uname proves it. Its PID 1 is an artifact of namespace isolation, not evidence of a private OS, and nsenter proves it. Its resource limits are enforced by cgroups in the kernel, not by any runtime magic, and /sys/fs/cgroup proves it.

The image is the blueprint. crun/runc is the assembly line. The running container is just another entry in the host's process table, one that happens to have a restricted worldview and a constrained resource budget.

That is the mental model. Everything in Part 2 builds on top of it.
