"You Keep Using That Word"
Dispelling Container Misconceptions at the OS Level
Before we write a single line of code, we need to kill the buzzword fog.
What a Container Actually Is
The marketing definition you have heard a hundred times: "a container is an executable unit of software with its dependencies bundled together." That is not wrong, but it tells you nothing useful about what is actually happening on the machine.
Here is the OS-level truth: a container is a process (or a tree of processes) that the kernel runs with a restricted view of its own namespaces and a cgroup-enforced ceiling on the resources it can consume. That is the entire trick. No hypervisor, no guest kernel, no virtualized hardware. Just a process with a carefully constructed set of constraints.
Everything else in this article is evidence for that single claim.
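Namespaces and cgroups are not container-specific: the kernel tracks both for every process, including your login shell. The sketch below assumes a Linux host with /proc mounted. The helper name `show_isolation_context` and the `PROC_ROOT` override are illustrative only, not part of any tool.

```shell
#!/bin/sh
# Print the PID namespace and cgroup the kernel records for a process.
# PROC_ROOT is a hypothetical override (defaults to /proc) so the function
# can be pointed at a copy of the files.
show_isolation_context() {
    pid="$1"
    proc="${PROC_ROOT:-/proc}"
    # Each entry under ns/ is a symlink whose target names the namespace,
    # for example pid:[4026531836].
    printf 'pid namespace: %s\n' "$(readlink "$proc/$pid/ns/pid")"
    # On cgroup v2 this file holds a single line of the form 0::/<path>.
    printf 'cgroup: %s\n' "$(cut -d: -f3 "$proc/$pid/cgroup")"
}

# Usage: inspect the current shell.
# show_isolation_context $$
```

Running it against an ordinary shell and later against the container process should show the same kind of data with different values. That difference is the whole distinction between the two.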
Spin up a simple HTTPD container
I used the following podman command to start an HTTPD container with limited resources.
# Flags, in order:
#   --memory=512m        hard memory ceiling
#   --memory-swap=512m   memory + swap ceiling; equal to --memory means no swap
#   --cpus=1.5           CPU quota: 1.5 cores' worth of CPU time
#   --cpu-shares=512     relative CPU weight during contention
#   --pids-limit=100     at most 100 processes/threads in this container
#   --blkio-weight=500   relative block I/O weight
podman run -d \
  --name my-limited-httpd \
  --replace \
  --memory=512m \
  --memory-swap=512m \
  --cpus=1.5 \
  --cpu-shares=512 \
  --pids-limit=100 \
  --blkio-weight=500 \
  -p 8081:80 \
  registry.access.redhat.com/ubi8/httpd-24
This starts a container with the given resource limits. You can confirm that it shows up as an ordinary process on the host with pstree:
krish-local:~ # pstree -pT
systemd(1)─┬─NetworkManager(944)
├─agetty(1011)
├─agetty(1012)
├─auditd(835)
├─chronyd(1007)
# conmon is the container monitor; httpd is the container process we started
├─conmon(13669)───httpd(13684)─┬─cat(13727)
│ ├─cat(13728)
│ ├─cat(13729)
│ ├─cat(13730)
│ ├─httpd(13731)
│ ├─httpd(13734)
│ └─httpd(35659)
├─dbus-broker-lau(854)───dbus-broker(856)
├─dotd(959)───sleep(3951905)
├─irqbalance(858)
├─rsyslogd(1159)
├─sshd(1562)───sshd-session(3949831)───sshd-session(3950094)───bash(3950096)───pstree(3954963)
├─systemd(3950077)───(sd-pam)(3950079)
├─systemd-journal(621)
├─systemd-logind(890)
├─systemd-udevd(657)
└─wpa_supplicant(891)
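Instead of picking the PID out of the process tree by eye, you can ask Podman for it. This is a small sketch that assumes the container name used above. `container_host_pid` and `show_container_process` are made-up helper names, and `.State.Pid` is the inspect field that, to my knowledge, holds the host-side PID of the container's main process.

```shell
# Return the host PID of a container's main process.
container_host_pid() {
    podman inspect --format '{{.State.Pid}}' "$1"
}

# Show that PID in the host's process table together with its parent PID.
show_container_process() {
    pid="$(container_host_pid "$1")" || return 1
    ps -o pid,ppid,user,comm -p "$pid"
}

# Usage:
# show_container_process my-limited-httpd
```

The PPID column should point at conmon, matching the tree above.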
From podman run to a Process - What Actually Happens
When you ran podman run, Podman did not start httpd directly. It assembled the container's root filesystem from the image layers and handed the rest of the work to an OCI runtime. On SLES 16, that runtime is crun (you can verify with podman info | grep -i runtime). crun is the thing that actually calls clone() with the right namespace flags, writes the cgroup limits into /sys/fs/cgroup, switches into the prepared root filesystem, and then calls execve() to start your process.
runc does the same job. It is the original OCI reference runtime and is written in Go. crun is an implementation in C that is lighter and faster, and it is now the default runtime for Podman on several modern distros.
conmon, which you saw in the pstree output, is the container monitor: a small supervisor process that holds the container's stdio open, waits for the container process to exit, and reports the exit code back to Podman. It is not part of your workload. It is bookkeeping infrastructure.
So the full chain is:
podman → crun → clone() + execve() → your process
By the time httpd appears in the process table, crun has already exited. Its job was setup, not supervision.
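You can approximate the clone()-with-namespace-flags step by hand with unshare from util-linux, which creates new namespaces and then executes a command inside them. This is only a rough analogue of what crun does: there are no cgroup limits and no image root filesystem. It assumes root privileges, and `run_in_new_pid_namespace` is a hypothetical wrapper name.

```shell
# Run a command as the first process of a fresh PID namespace.
#   --pid         create a new PID namespace
#   --fork        fork before exec, so the child becomes PID 1 in that namespace
#   --mount-proc  mount a fresh /proc (in a new mount namespace) so that ps
#                 only sees processes in the new PID namespace
run_in_new_pid_namespace() {
    unshare --pid --fork --mount-proc "$@"
}

# Usage (as root):
# run_in_new_pid_namespace ps -ef
```

Run that way, ps should list itself as PID 1 and see none of the host's processes. That is the same effect the container gets, minus everything else the runtime sets up.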
Four Words the Industry Uses Interchangeably — and Shouldn't
These four terms get collapsed into each other constantly. The confusion is not accidental. Vendors benefit from the blurring. But it costs engineers clarity when something breaks.
| Term | What it actually means |
|---|---|
| Kernel | The core software that mediates between hardware and every process running on the machine. It owns CPU scheduling, memory management, syscall handling, and namespace/cgroup enforcement. |
| Operating System | The full stack: kernel plus user-space tooling (libc, shell, package manager, init system) that makes the machine usable by humans or services. RHEL is an OS. The kernel alone is not. |
| Process | A running instance of a program. The kernel has allocated it a PID, some memory, and file descriptors. It exists in kernel memory; /proc is a virtual filesystem that exposes it, not where it lives. |
| Image | Explained in full below. |
What an Image Actually Is
An image is a blueprint. A container is one running instance of it. You can start ten containers from the same image simultaneously and each one gets its own thin writable layer on top of the shared read-only filesystem layers underneath. The image itself never runs. It is inert until a runtime turns it into a process.
At the OCI spec level: an image is a stack of read-only filesystem layers plus a JSON config, tied together by a manifest. When you run a container, Podman unpacks those layers into a root filesystem, and the OCI runtime (crun on most modern Linux systems, runc historically) applies the runtime constraints and hands a process off to the kernel.
- Image = what to run.
- Runtime spec = how to run it.
- The container = the process that results.
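To see the "layers plus config" structure on a real image, you can query Podman's image metadata. This is a hedged sketch: the helper names are invented, and the field names `.RootFS.Layers` and `.Config.Cmd` are what I understand podman image inspect to expose, so check them against your Podman version.

```shell
# Count the read-only filesystem layers recorded for an image.
# Each layer is printed on its own line as a sha256 digest, then counted.
image_layer_count() {
    podman image inspect \
        --format '{{range .RootFS.Layers}}{{println .}}{{end}}' "$1" \
        | grep -c '^sha256:'
}

# Print the default command stored in the image's JSON config.
image_default_cmd() {
    podman image inspect --format '{{.Config.Cmd}}' "$1"
}

# Usage:
# image_layer_count registry.access.redhat.com/ubi8/httpd-24
# image_default_cmd registry.access.redhat.com/ubi8/httpd-24
```

Neither call starts a process from the image. They only read the blueprint.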
The PID 1 Myth
The belief: there is a mini Linux inside the container, and PID 1 is its init system — the root of a private process universe.
The reality: PID 1 is namespace-relative. The first process spawned inside a new PID namespace gets assigned PID 1 within that namespace. It is simultaneously visible on the host with a completely different PID. One process, two identities, depending on which namespace you are observing from.
Let us verify this directly. The nsenter command lets us step into a process's namespaces and run a command from inside them.
Stepping into the host's namespaces (PID 1 = systemd):
# Enter PID 1's PID namespace (-p) and mount namespace (-m, needed so
# /proc reflects the target namespace) and run ps
krish-local:~ # nsenter -t 1 -p -m ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Mar29 ? 00:00:05 /usr/lib/systemd/systemd --switched-root --system --deserialize=47
root 2 0 0 Mar29 ? 00:00:00 [kthreadd]
root 3 2 0 Mar29 ? 00:00:00 [pool_workqueue_release]
root 4 2 0 Mar29 ? 00:00:00 [kworker/R-kvfree_rcu_reclaim]
root 5 2 0 Mar29 ? 00:00:00 [kworker/R-rcu_gp]
...
Stepping into the container's namespaces (container process host PID = 13684):
# Same command, but targeting the container's PID namespace.
# Inside this namespace, httpd was the first process spawned — so it gets PID 1.
krish-local:~ # nsenter -t 13684 -p -m ps -ef
UID PID PPID C STIME TTY TIME CMD
default 1 0 0 01:47 ? 00:00:29 httpd -D FOREGROUND
default 38 1 0 01:47 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default 39 1 0 01:47 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default 40 1 0 01:47 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default 41 1 0 01:47 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
default 42 1 0 01:47 ? 00:00:29 httpd -D FOREGROUND
default 45 1 0 01:47 ? 00:00:17 httpd -D FOREGROUND
root 615091 0 0 10:11 ? 00:00:00 ps -ef
That httpd process sitting at PID 1 inside the container namespace? On the host, it is PID 13684. Same process, different lens.
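The kernel also records both identities in one place. On kernels that expose the NSpid field in /proc/<pid>/status (4.1 and later, as far as I know), that line lists the process's PID in each nested PID namespace, outermost first. The helper below is a sketch. `ns_pids` is a made-up name, and the optional second argument exists only so it can be pointed at a saved copy of a status file.

```shell
# Print the PID of a process as seen from each PID namespace it belongs to,
# outermost (host) first, one per line. Reads the NSpid line of a status file.
ns_pids() {
    status_file="${2:-/proc/$1/status}"
    awk '$1 == "NSpid:" { for (i = 2; i <= NF; i++) print $i }' "$status_file"
}

# Usage on the host, for the container process from this article:
# ns_pids 13684
```

For the httpd process you would expect two lines: the host PID, then 1. A process that is not in a nested PID namespace prints a single value.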
The Shared Kernel - uname Does Not Lie
If there were a mini Linux inside, it would have its own kernel, its own kernel version, its own architecture identity. Run uname on the host and inside the container and compare:
Host:
krish-local:~ # uname -s && uname -m && uname -o && uname -r
Linux
x86_64
GNU/Linux
6.12.0-160000.9-default
Inside the container:
krish-local:~ # podman exec my-limited-httpd sh -c 'uname -s && uname -m && uname -o && uname -r'
Linux
x86_64
GNU/Linux
6.12.0-160000.9-default
Identical output. The container is not running a different kernel; it is sharing the host kernel. The UTS namespace gives it an isolated hostname, but kernel identity is not part of that isolation. There is no guest kernel to find.
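The comparison can be scripted so that it does not rely on reading two outputs side by side. This is a minimal sketch that assumes the container image ships a uname binary, which the UBI httpd image used here does. `same_kernel` is a hypothetical helper name.

```shell
# Succeed if the kernel release seen inside a container matches the host's.
# Returns 2 if the command could not be run inside the container.
same_kernel() {
    host="$(uname -r)"
    inside="$(podman exec "$1" uname -r)" || return 2
    [ "$host" = "$inside" ]
}

# Usage:
# same_kernel my-limited-httpd && echo "shared kernel"
#
# The user space can still differ even though the kernel does not:
# podman exec my-limited-httpd cat /etc/os-release
```

The os-release check is the useful contrast: the container can report a different distribution than the host while reporting the same kernel, which is the Kernel versus Operating System distinction from the table above.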
Resource Isolation: cgroups Are the Real Enforcement Mechanism
Namespaces control what a process can see. cgroups control what a process can consume. Together they are the two pillars of container isolation.
Note on cgroup versions: The demos below use cgroup v2 paths (/sys/fs/cgroup/<scope>/memory.max), which is the unified hierarchy used by SLES 16 with kernel 6.12. If you are on an older distro still running cgroup v1, your paths will look different (/sys/fs/cgroup/memory/<scope>/memory.limit_in_bytes).
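If you are not sure which hierarchy your host uses, the filesystem type mounted at /sys/fs/cgroup is a common way to tell. This sketch assumes GNU stat for the usage line. `cgroup_version_for_fstype` is an invented helper name.

```shell
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
cgroup_version_for_fstype() {
    case "$1" in
        cgroup2fs) echo v2 ;;
        tmpfs)     echo v1 ;;   # v1 (or hybrid): a tmpfs holding per-controller mounts
        *)         echo unknown ;;
    esac
}

# Usage (GNU stat prints the filesystem type with -f -c %T):
# cgroup_version_for_fstype "$(stat -fc %T /sys/fs/cgroup)"
```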
Now compare cgroup limits for systemd (PID 1 on the host) versus the container process.
systemd - no enforced ceiling:
krish-local:~ # CGROUP=$(cat /proc/1/cgroup | cut -d: -f3)
krish-local:~ # echo $CGROUP
/init.scope
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/memory.max
max
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/pids.max
max
max means unlimited. The init process is not constrained.
The container process — enforced ceiling:
krish-local:~ # PID=13684 # container's PID on the host
krish-local:~ # CGROUP=$(cat /proc/$PID/cgroup | cut -d: -f3)
krish-local:~ # echo $CGROUP
/machine.slice/libpod-a67892f3083285e34c738fd1e75cccd7eaadbda71f5a8c60a522e73546c0d5a2.scope
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/memory.max
536870912
krish-local:~ # cat /sys/fs/cgroup${CGROUP}/pids.max
100
536870912 bytes = 512 MiB. Exactly the --memory=512m flag we passed at startup. The pids.max of 100 matches --pids-limit=100. The kernel is enforcing these budgets directly — not Podman, not any container runtime abstraction sitting above the kernel.
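The raw values are easier to compare against the podman run flags after a unit conversion. The helpers below are a sketch with invented names. They assume cgroup v2 file formats: memory.max is a byte count or the literal max, and cpu.max is a line of the form "<quota> <period>" in microseconds.

```shell
# Convert a memory.max value to MiB; the literal "max" means no limit.
mem_limit_mib() {
    case "$1" in
        max) echo unlimited ;;
        *)   echo "$(( $1 / 1048576 ))" ;;
    esac
}

# Convert a cpu.max line read from stdin to a number of cores.
cpu_limit_cores() {
    awk '{ if ($1 == "max") print "unlimited"; else printf "%.2f\n", $1 / $2 }'
}

# Usage, with CGROUP set as in the transcript above:
# mem_limit_mib "$(cat /sys/fs/cgroup${CGROUP}/memory.max)"
# cpu_limit_cores < /sys/fs/cgroup${CGROUP}/cpu.max
```

For this container, the memory helper turns 536870912 into 512. I would expect the cpu.max ratio to come out at 1.50 to match --cpus=1.5, but that file was not captured in the transcript, so treat it as something to check rather than a recorded result.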
Putting It Together
A container is a process. It shares the host kernel, and uname proves it. Its PID 1 is an artifact of namespace isolation, not evidence of a private OS, and nsenter proves it. Its resource limits are enforced by cgroups in the kernel, not by any runtime magic, and /sys/fs/cgroup proves it.
The image is the blueprint. crun/runc is the assembly line. The running container is just another entry in the host's process table, one that happens to have a restricted worldview and a constrained resource budget.
That is the mental model. Everything in Part 2 builds on top of it.