All we hear in development these days is Docker here, Docker there.
But what is Docker exactly? How does it work, and why do we use it so much?
Personally, I believe the best way to truly learn a tool or concept is to rebuild it from scratch using any language you're comfortable with.
That's why, in this series, we'll build our own container engine step by step, just like Docker.
Before diving into the code, we first need to understand a few core ideas. So, the first question is: why is everyone obsessed with containerization?
You probably already know the answer; it's one of the first things we prepare for in interviews.
In simple terms: imagine you've built an application.
This app depends on certain packages and configurations that exist in your local environment, and it only works there.
Now, you want to deliver it to someone else, but you're not sure if their system matches yours.
Even a tiny difference could break everything. So, what do you do?
You package your app inside a little box that contains exactly the environment it needs, or even build it directly inside that box.
Then, you ship the box as-is.
No matter where it runs, it behaves exactly the same.
That "thing" is your application, and that box is the container.
That's the core idea behind containerization (along with many other benefits).
But the real question remains:
How does it actually work?
How can we package an entire environment inside a container and run it anywhere with identical results?
Well, it's not magic; it's just a clever use of what Linux already provides.
In this part of the series, we'll break down the Linux concepts that make containers possible.
So, What Is a Container and What Does It Actually Do?
Let's break it down simply:
- Self-contained: it includes everything your app needs from code to runtime, libraries, and system tools.
- Lightweight: compared to virtual machines, containers share the host's kernel, making them extremely efficient (we'll see why soon).
- Portable: build once, run anywhere, whether it's your laptop, a server, or the cloud.
- Isolated: each container runs in its own separate environment.
- Consistent: no more "It works on my machine" nightmares.
But here's the twist: all of this is not "magic" at all.
It's simply made possible by two core Linux features:
Namespaces and Control Groups (cgroups)
A simple way to remember them is:
Namespaces define what you're allowed to see.
Cgroups define what you're allowed to use.
Together, they create the illusion of a "mini machine" inside your machine, which is really just an isolated process with restricted access and resources.
Let's break each of them down, then see them in action by running our first isolated process.
Linux Namespaces
Namespaces create the illusion of a separate system for each process.
When a process is placed inside a namespace, it gets its own isolated view of system resources - even though it's still running on the same kernel.
Here are the main types you'll encounter (and the ones Docker uses), in simple terms:
- UTS makes you believe you're on another machine (isolates the hostname and domain name).
- NET makes you believe you have your own network (isolates network interfaces, routes, and iptables).
- PID makes you believe you're the only process running (isolates process IDs).
- USER makes you believe you're root (isolates user and group IDs).
- MNT makes you believe you have your own filesystem (isolates mount points).
Together, these are what give containers their illusion of independence.
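You don't have to take this on faith: every process's namespaces are visible under /proc, and lsns (from util-linux) lists the namespaces on the whole system. A quick way to peek:

# Each entry is a symlink identifying one namespace this shell belongs to
ls -l /proc/$$/ns

# List the namespaces on the system (run with sudo to see other users' namespaces too)
lsns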
Linux Control Groups (Cgroups)
Namespaces are about what you see.
Cgroups are about what you can use.
Cgroups (Control Groups) limit, prioritize, and monitor resource usage like CPU, memory, disk I/O, or network bandwidth for a set of processes.
They are organized in a hierarchy under /sys/fs/cgroup/, and each resource type is managed by a controller (cpu, memory, io, pids, and so on).
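On a cgroup v2 system (the default on most modern distros), you can inspect this hierarchy directly:

# Controllers available on the unified hierarchy
cat /sys/fs/cgroup/cgroup.controllers

# The cgroup your current shell lives in
cat /proc/self/cgroup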
So, for now, we'll use namespaces and cgroups to create a Docker-like container.
We'll be using these commands:
- debootstrap: installs a basic Debian-based system (like Debian or Ubuntu) into a new directory on an existing system.
- unshare: runs a program in its own isolated namespaces.
Let's first isolate the hostname:
sudo unshare --uts /bin/bash
Now, inside that shell, run:
hostname
You'll get your machine's current hostname.
Let's change it:
hostname container-lab
Now check again:
hostname
You'll see container-lab, but open another terminal (or exit) and run hostname there: it still shows the original one.
A hostname alone, though, doesn't make a container; we can still see all the processes running on the host.
We need more than the UTS namespace, so now let's run:
sudo unshare --pid --fork --mount-proc --uts /bin/bash
Inside that new shell:
ps aux
Notice how only a few processes are visible, normally just your shell and its children.
Outside the namespace (ps aux in another terminal), you'll still see everything.
Now we'll give it a minimal root environment (like a container rootfs).
mkdir ~/mycontainer
sudo debootstrap stable ~/mycontainer http://deb.debian.org/debian
sudo unshare --mount-proc --uts --pid --fork --root ~/mycontainer /bin/bash
Now you're literally in a mini Linux system inside your system.
Try:
ls /
hostname
ps aux
It feels like a separate machine, but it's just a process!
Next, create a small cgroup and limit CPU. Run these from a regular host terminal (this assumes cgroup v2, the default on modern distros); $$ is that shell's PID, so the shell and anything you launch from it will be limited. You could just as well write the container shell's PID there instead.
sudo mkdir /sys/fs/cgroup/mycontainer
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
echo 20000 | sudo tee /sys/fs/cgroup/mycontainer/cpu.max
This limits the shell process to 20ms of CPU every 100ms (so it can't hog the CPU).
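To see the limit in action, burn some CPU from the limited shell and watch it cap out around 20% (a quick sanity check using standard tools):

# Start a CPU-burning loop in the background, then watch its usage
yes > /dev/null &
top -p $!       # %CPU should hover around 20% instead of ~100%
kill %1         # stop the loop when you're done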
So what did we just do?
You've just manually built the building blocks of Docker:
Namespaces → isolated views of the system
Cgroups → resource control
Rootfs → custom filesystem
That's literally all a container is: a regular process with some clever Linux configuration.
But wait, why are containers so lightweight? If each container has its own filesystem, shouldn't that take a lot of space? Here's where another Linux gem comes in: the Union Filesystem (UnionFS).
Union Filesystem
A UnionFS is a special type of filesystem that can merge multiple directories (layers) into one unified view
without actually modifying the original sources.
Think of it as a layer cake:
The bottom layer might be a base image (like Ubuntu).
When you install new packages or add files, those changes are stored in an upper writable layer.
What you see is the illusion of a single filesystem - but under the hood, it's just layers stacked together.
If you delete a file, it's not really removed from the lower layers -
it's just hidden by a "whiteout" marker in the upper layer.
But unlike cake layers, each layer here is independent and reusable.
That means multiple containers can share the same base layers, like Ubuntu or Python,
while keeping their own small writable layer for runtime changes.
So instead of duplicating a whole OS for every container (like virtual machines do),
Docker just reuses the same layers, making containers incredibly lightweight and fast to spin up.
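To make this concrete, here's a minimal OverlayFS experiment you can run on the host (OverlayFS is the union filesystem behind Docker's default overlay2 storage driver); the /tmp/union paths are just placeholders:

# One read-only lower layer + one writable upper layer, merged into a single view
mkdir -p /tmp/union/{lower,upper,work,merged}
echo "hello from the base layer" > /tmp/union/lower/hello.txt
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/union/lower,upperdir=/tmp/union/upper,workdir=/tmp/union/work \
  /tmp/union/merged

cat /tmp/union/merged/hello.txt    # visible through the merged view
echo "changed" > /tmp/union/merged/hello.txt
cat /tmp/union/lower/hello.txt     # the lower layer is untouched
ls /tmp/union/upper                # the modified copy lives in the upper layer
rm /tmp/union/merged/hello.txt
ls -la /tmp/union/upper            # a whiteout entry now hides the lower file
sudo umount /tmp/union/merged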
So far, we've seen how containers are isolated using namespaces and controlled with cgroups, and how UnionFS makes them lightweight.
But a container is still just a regular Linux process running on the host.
That means if it runs as root inside the container, it's technically still root on the host, just with some restrictions.
That's why containers add extra layers of security using Linux capabilities and seccomp.
Traditionally, Linux privileges were binary:
you were either a normal user or root, with full control.
Capabilities change that by splitting root's powers into smaller pieces.
Each piece (capability) gives a process permission to perform one specific privileged action, like binding to a port below 1024 or modifying network interfaces.
For example:
- CAP_NET_ADMIN: manage network interfaces, routes, and iptables.
- CAP_SYS_ADMIN: mount filesystems, manage devices (very powerful).
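You can experiment with capabilities outside Docker using capsh (part of libcap). A small sketch: drop CAP_CHOWN from a root shell and watch a normally trivial privileged action fail:

# A root shell normally holds the full capability set
sudo capsh --print | grep Current

# Drop CAP_CHOWN, then try to change a file's owner as root
sudo capsh --drop=cap_chown -- -c 'touch /tmp/capdemo && chown nobody /tmp/capdemo'
# chown: changing ownership of '/tmp/capdemo': Operation not permitted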
By default, Docker containers run with a reduced set of capabilities rather than full root power, so even if an attacker breaks out of the container onto the host, their privileges are limited.
We can see them with:
capsh --print
And Docker lets us tweak them:
docker run --cap-drop=ALL --cap-add=NET_ADMIN alpine
This removes all capabilities and adds back only the one we need.
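You can verify what the container actually received by reading the effective capability mask of its first process and decoding it on the host (a quick sketch; grep and /proc are available in the alpine image, and capsh comes with libcap):

# CapEff is a hex bitmask of the effective capabilities of the container's PID 1
docker run --rm --cap-drop=ALL --cap-add=NET_ADMIN alpine grep CapEff /proc/1/status

# Decode the mask you got back into capability names
capsh --decode=0000000000001000    # here: cap_net_admin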
But even with reduced capabilities, processes can still call syscalls (the low-level functions to interact with the kernel).
And some syscalls can be dangerous like loading kernel modules or changing namespaces.
That's where seccomp (secure computing mode) comes in.
It acts like a firewall for syscalls.
Each container runs under a seccomp profile that tells the kernel which syscalls are allowed, denied, or logged.
For example, Docker's default seccomp profile blocks around 44 risky syscalls, such as:
- keyctl (used to access kernel keyrings)
- unshare (could break isolation)
- mount (mounting new filesystems)
You can check which profile is being used:
docker info | grep Seccomp
And we can run our own:
docker run --security-opt seccomp=/path/to/profile.json ubuntu
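For reference, here's what a minimal custom profile could look like (an illustrative sketch in Docker's seccomp JSON format): allow every syscall by default and return "permission denied" for the three risky ones mentioned above.

# Write a small profile and run a container with it
cat > /tmp/profile.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["keyctl", "unshare", "mount"], "action": "SCMP_ACT_ERRNO" }
  ]
}
EOF

docker run --rm --security-opt seccomp=/tmp/profile.json ubuntu unshare --uts true
# unshare: unshare failed: Operation not permitted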
And that's it: from namespaces to seccomp, you now understand the real building blocks that make Docker possible. In the next part, we'll start coding our own mini container runtime to see it all come to life.