This post is part of the Ultimate Container Security Series, a structured, multi-part guide covering container security from foundational concepts to runtime protection. For an overview of the series structure, scope, and update schedule, see the series introduction post here.
Understanding Linux capabilities is a fundamental step in mastering container security, as it allows us to move beyond the "all-or-nothing" approach of the traditional root user. By breaking down the monolithic power of root into granular privileges, we can grant a container exactly what it needs to function while significantly reducing the potential blast radius of an exploit.
Introduction: Understanding capabilities
To understand how to secure a container, we first need to understand how the Linux kernel handles privileges. The security model of containers is built directly on top of a kernel feature called Capabilities.
The "All or Nothing" Problem
Traditionally, UNIX-like systems operated on a binary permission model. For the purpose of permission checks, the kernel distinguished between only two categories of processes:
- Privileged processes (Root): Processes with an effective User ID (UID) of 0.
- Unprivileged processes (Standard User): Processes with a non-zero UID.
This created a significant security gap known as the "All or Nothing" problem. A privileged process (UID 0) bypasses almost all kernel permission checks, allowing it to modify system files, install software, and reconfigure the network stack. A standard user, conversely, is strictly bound by permission checks.
The problem arises when a standard user needs to perform a specific action that requires elevated privileges, such as opening a raw network socket (as ping traditionally did to send ICMP packets) or binding to a restricted port (like a web server on port 80). In the old model, the only solution was to give the process full root privileges, usually via the SUID (Set User ID) bit.
As discussed in previous chapters, the SUID bit is a security risk. On a root-owned binary it effectively grants the program full superuser powers just to perform one minor task. If an attacker exploits a bug in a SUID binary, they don't just compromise that specific application; they gain full control over the entire system.
What are Capabilities?
To solve this problem, kernel developers introduced a more nuanced solution called Capabilities. Starting with Linux Kernel 2.2 (in 1999), the privileges traditionally associated with the superuser were broken down into distinct, independent units. These units are called capabilities.
The concept is straightforward: instead of checking "Is this user root?" the kernel checks "Does this thread have the specific capability to perform this action?"
For example:
- Instead of being "Root," a process might only have CAP_NET_BIND_SERVICE (to bind to ports < 1024).
- Instead of being "Root," a process might only have CAP_CHOWN (to change file ownership).
While this feature was originally scoped only to processes, support for assigning capabilities directly to files was added in 2008. This evolution allows us to assign fine-grained permissions to executables so that processes that previously required UID 0/root permissions no longer need them to function.
Capabilities are the technical implementation of the Principle of Least Privilege. This security principle dictates that a process should possess only the bare minimum privileges necessary to perform its function and nothing more.
By using capabilities, we can drastically reduce the attack surface. If a web server runs as a non-root user with only the capabilities it actually needs (e.g., CAP_NET_BIND_SERVICE), an attacker who compromises it inherits only that narrow privilege, not the full power of root.
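To make the shift from "Is this user root?" to "Does this thread have this capability?" concrete, here is a minimal sketch of the two styles of check. It is a toy model, not kernel code: the function names and the example web server are made up for illustration, and only the capability numbers (0 for CAP_CHOWN, 10 for CAP_NET_BIND_SERVICE) come from the kernel headers.

```python
# Capability numbers as defined in <linux/capability.h>.
CAP_CHOWN = 0
CAP_NET_BIND_SERVICE = 10


def legacy_check(euid: int) -> bool:
    """Old model: a single question, and UID 0 passes every check."""
    return euid == 0


def capability_check(effective_mask: int, cap: int) -> bool:
    """Capability model: is this one bit set in the thread's effective set?"""
    return bool(effective_mask & (1 << cap))


# Hypothetical web server: runs as UID 1000 and holds exactly one capability.
web_server_effective = 1 << CAP_NET_BIND_SERVICE

print(legacy_check(1000))                                            # False
print(capability_check(web_server_effective, CAP_NET_BIND_SERVICE))  # True
print(capability_check(web_server_effective, CAP_CHOWN))             # False
```

The point of the sketch is the shape of the question: each privileged operation maps to its own bit, so passing one check says nothing about any other.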
The Capability Sets
Up to this point, capabilities sound simple: break root privileges into smaller pieces and assign only what is necessary. The real complexity begins when we look at how capabilities are stored, inherited, and transformed between processes and files. If you read the man capabilities page, you might find it terse and difficult to map to real-world scenarios.
The confusion often stems from two sources:
- Naming Collisions: The kernel uses the same names (like "Effective" or "Inheritable") for both processes and files, but they function quite differently depending on where they are applied.
- Counter-Intuitive Behavior: Capabilities don't behave like the simple "SUID Root" model we are used to. Just because a parent process has a capability doesn't automatically mean the child process gets it.
To demystify this, we first need to distinguish between the Process (the active entity) and the File (the passive storage).
Process vs. File Capabilities
execve() is a Linux system call that replaces the current running process with a new program. It loads the new executable into memory and starts it, keeping the same process ID but with new code and data.
- Process Capabilities: When we talk about a "process" having capabilities, we are technically talking about a thread. In Linux, capability sets are maintained per thread.
  - Role: These determine what the running task is actually allowed to do right now.
  - Lifecycle: Thread capability sets are copied during a fork() (creating a new thread/process) and are specially transformed during an execve() (running a new program). Capabilities are especially important during execve(), because that's when capability transformation rules apply.
  - Note: Most normal processes (like your text editor or shell) have and need zero capabilities. They rely on standard file permissions. Capabilities are generally only needed for system-level administration tasks.
- File Capabilities: Binaries on the disk can also have capabilities associated with them.
  - Role: These are not "active" permissions. Instead, they are a set of instructions that tell the kernel: "When this file is executed, grant the process these specific privileges."
  - Storage: These are stored in the file's Extended Attributes (xattrs), specifically within security.capability. File capabilities depend on filesystem support for extended attributes (most modern filesystems support this). For example, in ext3/ext4, extended attributes are stored in the inode or in additional disk blocks. Many backup tools do not preserve extended attributes by default. Without preserving xattrs, file capabilities will be silently lost.
  - Copying: When copied from one place to another, a binary will lose its capabilities. To keep them, copy the file with the --preserve=all option. Example: cp --preserve=all /origin/path /dest/path
  - Constraint: Writing to this extended attribute requires the CAP_SETFCAP capability. This ensures that standard users cannot simply grant themselves superpowers by editing a binary's attributes.
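To see what is actually stored in security.capability, the attribute can be read and unpacked by hand. The sketch below assumes the revision 2 or 3 on-disk layout from the kernel's `struct vfs_cap_data` (a little-endian 32-bit magic word followed by two permitted/inheritable pairs). The helper names `parse_file_caps` and `read_file_caps` are invented for this example, and the sample bytes are constructed by hand rather than read from a real file.

```python
import os
import struct

# Constants from <linux/capability.h>.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001


def parse_file_caps(raw: bytes) -> dict:
    """Decode a revision 2 or 3 security.capability value."""
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK
    if revision not in (VFS_CAP_REVISION_2, VFS_CAP_REVISION_3):
        raise ValueError(f"unsupported revision: {revision:#010x}")
    # Two (permitted, inheritable) pairs: low 32 bits first, then high 32 bits.
    p_lo, i_lo, p_hi, i_hi = struct.unpack_from("<IIII", raw, 4)
    return {
        "effective_flag": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": p_lo | (p_hi << 32),
        "inheritable": i_lo | (i_hi << 32),
    }


def read_file_caps(path: str):
    """Return the parsed capabilities of a file, or None if it has none."""
    try:
        raw = os.getxattr(path, "security.capability")
    except OSError:
        return None  # attribute missing, or the filesystem lacks xattr support
    return parse_file_caps(raw)


# Hand-built bytes equivalent to cap_net_raw=ep (CAP_NET_RAW is bit 13).
sample = struct.pack(
    "<IIIII", VFS_CAP_REVISION_2 | VFS_CAP_FLAGS_EFFECTIVE, 1 << 13, 0, 0, 0
)
print(parse_file_caps(sample))
# {'effective_flag': True, 'permitted': 8192, 'inheritable': 0}
```

Note how the file side really does consist of two bitmasks plus a single effective bit, which is exactly the asymmetry between file and thread capability sets described in this section.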
The 5 Capability Sets
To manage how privileges are granted, inherited, and limited, Linux uses five distinct "sets" of capabilities (which are represented as bit masks). Think of these as five different buckets that a process carries.
| Set | Purpose | Process Capabilities | File Capabilities |
|---|---|---|---|
| Permitted (P) | The superset of what a process can do. A process can move capabilities from here to the Effective set, but it cannot add new ones that aren't already here. | ✅ | ✅ |
| Effective (E) | The Active set. This is the only set the kernel actually checks when a process tries to do something (like open a port). If a capability is in Permitted but not Effective, the action fails. | ✅ | ❌ (files carry a single Effective flag bit, not a set) |
| Inheritable (I) | Capabilities that can be passed down to a child process. However, simply having a capability here isn't enough; the child executable must also be "willing" to receive it (via File Inheritable sets). | ✅ | ✅ |
| Bounding (B) | The hard limit. No capability can ever be added to the Permitted or Inheritable sets if it doesn't exist in the Bounding set. | ✅ | ❌ |
| Ambient (A) | Added in newer kernels to fix the "Inheritance Problem." It allows non-SUID binaries (which aren't capability-aware) to blindly inherit capabilities from their parent. | ✅ | ❌ |
Linux defines five capability sets for each thread:
- Thread Permitted Set (P): The permitted set is the thread's upper bound of capabilities. It defines the maximum privilege scope the thread can ever exercise. A thread may call capset() to raise capabilities from Permitted into the Effective set (the capabilities that are actually checked by the kernel), and it may also use capset() to place capabilities into the Inheritable set (capabilities it is allowed to pass across an execve() when combined with the executed file's inheritable capabilities). A thread can never use capset() to add a capability to its own permitted set: once a capability has been dropped from Permitted, the only way to regain it is to execve() a program whose file capabilities (or set-user-ID-root bit) grant it. CAP_SETPCAP does not change this; it lets a thread add capabilities from its bounding set to its inheritable set and drop capabilities from the bounding set.
- Thread Effective Set (E): This is the set that the kernel actually checks during permission evaluation. If a capability is not in the effective set, the kernel behaves as if the process does not have it. The effective set is what truly matters during system calls.
- Thread Inheritable Set (I): The inheritable set controls what capabilities may be passed across execve() to a different binary. A capability in the thread inheritable set is not automatically granted to child processes. It only influences what may become permitted in the new program. Both the thread inheritable set and the file inheritable set must agree. The thread inheritable set and file inheritable set are different things (this is where many people get confused; more on that later).
- Bounding Set (B): The bounding set acts as a hard ceiling on what capabilities a process can ever gain through execve(). Even if a file has a capability marked as permitted, if that capability isn't in the bounding set, the process can never acquire it. It also limits which capabilities can be added to the inheritable set.
- Ambient Set (A): The ambient set was introduced in Linux 4.3 to solve the problem of passing capabilities to ordinary binaries that have no file capabilities set. Any capability in the ambient set is automatically added to both the permitted and effective sets of the new process after execve(), provided the binary is a plain one with no file capabilities and no set-user-ID or set-group-ID bit. To add a capability to the ambient set it must already be in both your permitted and inheritable sets, and dropping it from either one automatically removes it from the ambient set as well.
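The rules a thread must obey when changing its own sets can be summarised as a handful of subset checks. The following is a simplified model of what capset() enforces, written from the rules in capabilities(7). `ThreadCaps`, `capset_allowed`, and `can_raise_ambient` are illustrative names, sets are plain integer bitmasks, and securebits and user namespaces are deliberately ignored.

```python
from dataclasses import dataclass

CAP_SETPCAP = 8
CAP_NET_BIND_SERVICE = 10
CAP_NET_RAW = 13


@dataclass(frozen=True)
class ThreadCaps:
    permitted: int = 0
    effective: int = 0
    inheritable: int = 0
    bounding: int = 0
    ambient: int = 0


def capset_allowed(old: ThreadCaps, permitted: int, effective: int, inheritable: int) -> bool:
    """Would a capset() request with these three new sets be accepted?"""
    # Permitted can only shrink: nothing new may appear in it.
    if permitted & ~old.permitted:
        return False
    # Effective must be a subset of the *new* permitted set.
    if effective & ~permitted:
        return False
    added = inheritable & ~old.inheritable
    # Newly inheritable capabilities must be inside the bounding set.
    if added & ~old.bounding:
        return False
    # Without CAP_SETPCAP in effect they must also come from permitted.
    has_setpcap = bool(old.effective & (1 << CAP_SETPCAP))
    if not has_setpcap and added & ~old.permitted:
        return False
    return True


def can_raise_ambient(caps: ThreadCaps, cap: int) -> bool:
    """A capability may become ambient only if it is permitted AND inheritable."""
    return bool(caps.permitted & caps.inheritable & (1 << cap))


bind = 1 << CAP_NET_BIND_SERVICE
raw = 1 << CAP_NET_RAW
old = ThreadCaps(permitted=bind, effective=bind, bounding=(1 << 41) - 1)

print(capset_allowed(old, bind, 0, 0))          # True: lowering effective is fine
print(capset_allowed(old, bind | raw, 0, 0))    # False: permitted cannot grow
print(capset_allowed(old, bind, bind, bind))    # True: inheritable drawn from permitted
print(capset_allowed(old, bind, bind, raw))     # False: not permitted, no CAP_SETPCAP
print(can_raise_ambient(old, CAP_NET_BIND_SERVICE))  # False: not yet inheritable
```

The asymmetry is the important part: capabilities can always be given up, but they can only be gained back through execve().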
Files only have:
- File Permitted Set: The file permitted set defines the capabilities a binary is allowed to gain when executed, regardless of what the thread already has. These capabilities are added to the new process's permitted set after execve(), but only if they are also allowed by the bounding set.
- File Inheritable Set: The file inheritable set specifies which capabilities the binary is willing to accept from the thread's inheritable set during execve(). Only capabilities present in both the thread's inheritable set and the file's inheritable set will be carried over into the new process's permitted set.
- File Effective Flag: Unlike the other file sets, the effective field is just a single bit, not a set. When set, it tells the kernel to automatically move all of the new process's permitted capabilities into its effective set after execve(), which is needed for older binaries that don't explicitly call capset() to raise their own capabilities.
How Capabilities are Calculated
When a process executes a binary (via execve()), the kernel calculates the new capabilities for the process based on a specific formula. This formula combines what the parent thread had and what the file allows.
When a thread executes a new binary the logic can be simplified as follows:
- Permitted Set Calculation: The new permitted set is the union of two sources: capabilities that exist in both the thread's inheritable set and the file's inheritable set, plus capabilities that exist in the file's permitted set filtered through the bounding set.
  - Formula: New Permitted = (Old Inheritable AND File Inheritable) OR (File Permitted AND Bounding Set)
- Effective Set Calculation: If the file's effective bit is set, the new effective set equals the full new permitted set, meaning all capabilities are immediately active. Otherwise the effective set starts empty and the process must raise capabilities from its permitted set itself.
  - Formula: New Effective = New Permitted if File Effective Flag is set, else 0
- Inheritable Set Calculation: The inheritable set is simply carried over unchanged from the old thread; execve() does not modify it.
  - Formula: New Inheritable = Old Inheritable

These simplified formulas leave out the ambient set, which contributes one more term and is covered below.
The following diagram shows the relationship between the different capability sets and how they interact during process creation and execution.
You might notice a gap in the logic described above. If you wanted to run an ordinary binary or script with capabilities, say a plain Python script, you were stuck. Putting a capability in the inheritable set had no effect unless the target binary also had that capability in its file inheritable set, which meant you couldn't pass privileges down to unmodified binaries without touching the files themselves.
The ambient set solves this. Any capability in the ambient set is automatically added to the new process's permitted and effective sets after execve(), even if the binary has no file capabilities set at all. This is how modern container runtimes can run standard, unmodified applications with specific privileges without needing to alter the binaries inside the container image.
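Putting the pieces together, the execve() transformation, including the ambient term, can be written as a few bitwise operations. This is a sketch of the rules as documented in capabilities(7), not kernel code: `ThreadCaps`, `FileCaps`, and `caps_after_execve` are names chosen for this example, and the special handling of UID 0 and the securebits flags is left out.

```python
from dataclasses import dataclass

ALL_CAPS = (1 << 41) - 1          # bits 0..40, matching CapBnd 000001ffffffffff
CAP_NET_BIND_SERVICE = 1 << 10
CAP_NET_RAW = 1 << 13


@dataclass(frozen=True)
class ThreadCaps:
    permitted: int = 0
    effective: int = 0
    inheritable: int = 0
    bounding: int = 0
    ambient: int = 0


@dataclass(frozen=True)
class FileCaps:
    permitted: int = 0
    inheritable: int = 0
    effective_flag: bool = False


def caps_after_execve(old: ThreadCaps, file: FileCaps, setid: bool = False) -> ThreadCaps:
    # A "privileged" file has file capabilities or a set-user-ID/set-group-ID bit;
    # executing one clears the ambient set.
    privileged = setid or bool(file.permitted or file.inheritable or file.effective_flag)
    ambient = 0 if privileged else old.ambient
    permitted = (
        (old.inheritable & file.inheritable)
        | (file.permitted & old.bounding)
        | ambient
    )
    effective = permitted if file.effective_flag else ambient
    return ThreadCaps(permitted, effective, old.inheritable, old.bounding, ambient)


# 1. Unprivileged shell runs a binary marked cap_net_raw=ep.
shell = ThreadCaps(bounding=ALL_CAPS)
ping = FileCaps(permitted=CAP_NET_RAW, effective_flag=True)
print(hex(caps_after_execve(shell, ping).effective))      # 0x2000

# 2. Inheritable alone does nothing for a plain binary.
parent = ThreadCaps(inheritable=CAP_NET_BIND_SERVICE, bounding=ALL_CAPS)
plain = FileCaps()
print(hex(caps_after_execve(parent, plain).permitted))    # 0x0

# 3. With the capability also in the ambient set, the plain binary keeps it.
parent = ThreadCaps(
    permitted=CAP_NET_BIND_SERVICE,
    effective=CAP_NET_BIND_SERVICE,
    inheritable=CAP_NET_BIND_SERVICE,
    bounding=ALL_CAPS,
    ambient=CAP_NET_BIND_SERVICE,
)
print(hex(caps_after_execve(parent, plain).effective))    # 0x400
```

Cases 2 and 3 are the "inheritance gap" and its fix side by side: the only difference between them is the ambient bit.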
Inspecting & Manipulating Capabilities
Common Linux Capabilities
Before we can meaningfully inspect anything, it helps to have a mental map of the most important capabilities and what they actually allow. Linux defines over 40 capabilities, but a handful appear constantly in security-relevant contexts. The full list can be found in the capabilities(7) man page.
| Capability | Short Description |
|---|---|
| CAP_CHOWN | Allows a process to make arbitrary changes to file UIDs and GIDs. |
| CAP_SETUID / CAP_SETGID | Allow a process to change its own UID and GID, which is how su and sudo work. A process with CAP_SETUID can effectively become any user on the system, including root, making it nearly as dangerous as having full root. |
| CAP_DAC_OVERRIDE | Stands for "Discretionary Access Control Override." A process with this capability can bypass standard file read, write, and execute permission checks. In practical terms, this means it can read or write any file on the system regardless of its ownership or permissions. It does not bypass MAC (Mandatory Access Control) systems like SELinux or AppArmor, but it completely defeats the traditional UNIX permission model. |
| CAP_NET_BIND_SERVICE | Allows a process to bind to privileged ports, those below 1024, without needing full root access. This is the correct and minimal capability to assign to a web server that needs to listen on port 80 or 443. Without it, only root processes can bind to these ports. |
| CAP_NET_RAW | Grants the ability to use raw and packet sockets, and to bind to any address for transparent proxying. This is what the ping command historically needed to craft ICMP packets. It is also what an attacker needs to perform packet sniffing or craft arbitrary network packets, making it a capability worth watching closely. |
| CAP_NET_ADMIN | This is one of the most powerful networking capabilities. It grants permission to perform a broad range of network configuration tasks: configuring network interfaces, managing routing tables, setting firewall rules with iptables, and enabling promiscuous mode on a network interface. Because it covers so much ground, it's a frequent target during container escapes. A container that has CAP_NET_ADMIN can potentially reconfigure the host's network stack if it escapes its namespace. |
| CAP_SYS_ADMIN | This is often described as the "new root." It is by far the broadest capability in Linux, covering an enormous range of system administration operations: mounting and unmounting filesystems, managing namespaces via clone() and unshare(), calling pivot_root(), setting the hostname, and dozens of other privileged operations. (Loading kernel modules and calling chroot() are governed by the separate CAP_SYS_MODULE and CAP_SYS_CHROOT capabilities.) The Linux man page lists so many permissions under CAP_SYS_ADMIN that security practitioners generally treat its presence in a container as equivalent to running the container as root. If you see it, treat it as a red flag. |
| CAP_SYS_PTRACE | Allows a process to trace arbitrary processes using ptrace(), the system call that debuggers like gdb rely on. In a container context, this is particularly dangerous because ptrace() can be used to inspect and modify the memory of other processes, potentially leading to container escape if the target process runs in a different namespace or with higher privileges. |
| CAP_SYS_MODULE | Allows loading and unloading kernel modules. This is an extremely high-risk capability because a kernel module runs in kernel space with no restrictions whatsoever. A process with this capability can load a malicious module that does anything the kernel itself can do. |
Inspecting Capabilities
Linux provides several tools for examining what capabilities are assigned, whether to a running process or to a file on disk. Using them together gives you a complete picture of the privilege landscape on any system.
The examples in the following sections are run on a standard Ubuntu Server 24.04 VM. Always run these exercises on a disposable test environment, as you may encounter binaries with capabilities that can be dangerous if misused.
Inspecting File Capabilities with getcap
The getcap command reads the security.capability extended attribute from a file and displays it in a human-readable format. This is the primary tool for checking what privileges an executable binary has been granted.
$ getcap /usr/bin/ping
You would typically see output like:
/usr/bin/ping cap_net_raw=ep
This tells you that the ping binary has the cap_net_raw capability, and the =ep suffix tells you which sets it's in. The letter e means the Effective flag is set, and p means the capability is in the Permitted file set. Referring back to our capability calculation formula, this means that when ping is executed, cap_net_raw will be added to the new process's permitted set, and because the effective flag is set, it will also be immediately active in the effective set.
Historical Note: Due to an update where "ping sockets" were added directly to the kernel, the ping command technically no longer requires any additional Linux capabilities to work (though this is gated by a config setting disabled by some distros). The CAP_NET_RAW capability is still commonly assigned to the binary for backward compatibility with older kernels and configurations where raw sockets are still required.
You'll commonly see these suffixes in capability strings:
- p: The capability is in the file's Permitted set.
- e: The file's Effective bit is set (applies all permitted caps to the effective set immediately).
- i: The capability is in the file's Inheritable set.
- ep: Both Effective and Permitted; the most common combination for binaries that need to self-elevate.
One particularly useful flag for getcap is -r, which enables recursive searching. To scan an entire filesystem for any binary that has capabilities assigned, run:
$ getcap -r / 2>/dev/null
The 2>/dev/null part discards permission errors from directories you can't read. This one-liner is a standard step in security audits and CTF (Capture the Flag) challenges alike, since a misconfigured binary with an overly broad capability is a common privilege escalation vector.
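For cases where getcap is not installed, or when the scan needs to feed into other tooling, the same audit can be approximated by checking each file for the security.capability attribute. This is a rough sketch rather than a replacement for getcap: `find_files_with_caps` is a made-up helper name, and it only reports which files carry the attribute without decoding it.

```python
import os


def find_files_with_caps(root: str) -> list[str]:
    """Return paths under root that carry a security.capability attribute."""
    found = []
    # os.walk silently skips directories it cannot read, much like 2>/dev/null.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.getxattr(path, "security.capability", follow_symlinks=False)
            except OSError:
                continue  # no capabilities, unreadable, or xattrs unsupported
            found.append(path)
    return sorted(found)
```

Calling `find_files_with_caps("/usr/bin")` on a typical system would be expected to list the same binaries that `getcap -r /usr/bin` reports, though the exact set depends on the distribution.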
Inspecting Process Capabilities with getpcaps
While getcap deals with files, getpcaps shows you the capabilities of a running process, identified by its PID. Let's look at the difference between a normal user process and a root process.
First, find the PID of your current shell and inspect it:
user@container-security:~$ ps
PID TTY TIME CMD
2736 pts/0 00:00:00 bash
2755 pts/0 00:00:00 ps
user@container-security:~$ getpcaps 2736
2736: =
The = output means the process has nothing in its Effective, Permitted, or Inheritable sets, which are the sets this text format describes (the bounding set is not shown here). This is exactly what you'd expect for an ordinary shell running as a non-root user. It doesn't need any capabilities because it relies entirely on standard file permission checks for everything it does.
Now compare that to a shell running as root:
user@container-security:~$ sudo bash
[sudo] password for user:
root@container-security:/home/user#
root@container-security:/home/user# ps
PID TTY TIME CMD
2761 pts/1 00:00:00 sudo
2762 pts/1 00:00:00 bash
2769 pts/1 00:00:00 ps
root@container-security:/home/user# getpcaps 2762
2762: =ep
A root shell carries the full complement of capabilities (=ep) in both its Permitted and Effective sets, giving it unconstrained access to virtually every privileged operation on the system. This is exactly the scenario that the Principle of Least Privilege is designed to avoid.
One subtle and dangerous pitfall to be aware of is the empty capability list. When you inspect such a process with getpcaps, you'll see something like this (it is what we got in the root shell example above):
<PROCESS_PID>: =ep
This looks like the process has no specific capabilities, and one might assume it's harmless. It is the exact opposite. An empty capability list followed by the ep flags means all capabilities are enabled. The empty list before =ep is shorthand for "all capabilities", making this the equivalent of <PROCESS_PID>: all=ep. The same is true for files.
Inspecting the Current Shell with capsh
capsh (Capability Shell) is a versatile tool for both inspecting and launching processes with specific capability sets. Its --print flag dumps a comprehensive view of the current shell's capability state:
user@container-security:~$ capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
Ambient set =
Current IAB:
Securebits: 00/0x0/1'b0 (no-new-privs=0)
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
secure-no-ambient-raise: no (unlocked)
uid=1000(user) euid=1000(user)
gid=1000(user)
groups=27(sudo),1000(user)
Guessed mode: HYBRID (4)
The output tells you several things at once. Current shows the thread's Effective, Permitted, and Inheritable sets in the same text form getpcaps uses (here it is empty). Bounding set shows the hard ceiling. Notice that even for a non-root user the bounding set may contain many capabilities, but they won't appear in the current set unless explicitly granted. Ambient set is empty here, meaning no capabilities will be carried across execve() into ordinary child programs automatically.
This is much richer than getpcaps for understanding the full capability context of your current process.
Reading Raw Bitmasks from /proc
For low-level inspection or scripting, you can read capability information directly from the kernel's process filesystem. Every running process has a status file under /proc/<pid>/status that contains raw hexadecimal bitmask values for each capability set:
user@container-security:~$ ps
PID TTY TIME CMD
2736 pts/0 00:00:00 bash
3184 pts/0 00:00:00 ps
user@container-security:~$ cat /proc/2736/status | grep -i cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
Each line corresponds to a capability set: CapInh (Inheritable), CapPrm (Permitted), CapEff (Effective), CapBnd (Bounding), and CapAmb (Ambient). The values are 64-bit hexadecimal bitmasks where each bit position corresponds to a specific capability number.
Reading these raw masks directly isn't very human-friendly, but capsh can decode them for you with the --decode flag:
user@container-security:~$ capsh --decode=000001ffffffffff
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,...
This is especially useful in automated scripts or when you need to understand the capabilities of a process that getpcaps can't reach, such as inside a container's namespace.
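If capsh is not available, for example inside a minimal container image, the same decoding can be done with a small lookup table. The sketch below assumes the capability numbering shown in the Bounding set output earlier (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore). `decode_mask` and `read_cap_masks` are illustrative helper names.

```python
# Capability names indexed by bit number, in kernel order.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_mask(mask_hex: str) -> list[str]:
    """Turn a hex bitmask from /proc/<pid>/status into capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def read_cap_masks(pid="self") -> dict[str, str]:
    """Collect the CapInh/CapPrm/CapEff/CapBnd/CapAmb lines for a process."""
    masks = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Cap"):
                field, value = line.split(":", 1)
                masks[field] = value.strip()
    return masks


print(decode_mask("0000000000000400"))       # ['cap_net_bind_service']
print(len(decode_mask("000001ffffffffff")))  # 41
```

On a kernel that defines capabilities beyond bit 40, any higher bits would simply be ignored by this table, so treat the output as a lower bound on such systems.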
Assigning Capabilities with setcap
setcap writes a capability set directly into the security.capability extended attribute of a file. The general syntax is:
$ sudo setcap <capability>+<sets> /path/to/binary
For example, to grant a binary the CAP_SETUID capability in both the Permitted and Effective sets:
$ sudo setcap cap_setuid+ep /path/to/file
Note that running setcap itself requires the CAP_SETFCAP capability. This privilege is automatically granted to root, which is why the sudo prefix is needed when running as a normal user.
An important subtlety: setcap is not additive. Each invocation of setcap completely replaces the capability set of the file. If you want to assign multiple capabilities, you must specify all of them in a single command:
$ sudo setcap cap_net_bind_service,cap_net_raw+ep /path/to/binary
Running setcap twice with different capabilities will result in only the second set being stored.
Removing Capabilities with setcap -r
To strip all capabilities from a file, use the -r (remove) flag:
$ sudo setcap -r /path/to/program
After this, getcap on that file will return no output, and the binary will run with whatever privileges the executing user's process has, just like any other ordinary binary.
A Practical Example: Assigning CAP_NET_BIND_SERVICE to a Custom Binary
On Linux, ports below 1024 are called privileged ports. Binding to them is restricted by the kernel to prevent unprivileged users from impersonating well-known services like HTTP (port 80) or HTTPS (port 443). Traditionally, the only way to bind to these ports was to run your process as root. With CAP_NET_BIND_SERVICE we can grant exactly that one permission to a specific binary, and nothing else.
Try to start a Python HTTP server on port 80 as a non-root user:
user@container-security:~$ python3 -m http.server 80
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/lib/python3.12/http/server.py", line 1314, in <module>
test(
File "/usr/lib/python3.12/http/server.py", line 1261, in test
with ServerClass(addr, HandlerClass) as httpd:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/socketserver.py", line 457, in __init__
self.server_bind()
File "/usr/lib/python3.12/http/server.py", line 1308, in server_bind
return super().server_bind()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/http/server.py", line 136, in server_bind
socketserver.TCPServer.server_bind(self)
File "/usr/lib/python3.12/socketserver.py", line 473, in server_bind
self.socket.bind(self.server_address)
PermissionError: [Errno 13] Permission denied
The kernel blocks it immediately. Checking the capabilities of the Python binary confirms that it has no special permissions:
user@container-security:~$ getcap /usr/bin/python3.12 # empty output, no capabilities assigned, so binding to a privileged port is forbidden
Ports 1024 and above work fine without any capabilities:
user@container-security:~$ python3 -m http.server 8080
Serving HTTP on 0.0.0.0 port 8080 (http://0.0.0.0:8080/) ...
This confirms the problem is specifically about privileged ports, not Python itself.
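The same experiment can be reduced to a few lines that attempt a bind and report what the kernel said. This is a minimal sketch for experimenting, not part of the original walkthrough: `try_bind` is a hypothetical helper, and it binds to the loopback address only.

```python
import socket


def try_bind(port: int, host: str = "127.0.0.1") -> str:
    """Try to bind a TCP socket to host:port and describe the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: a privileged port without CAP_NET_BIND_SERVICE.
            return "permission denied"
        except OSError as err:
            # Anything else, e.g. the port is already in use.
            return f"failed: {err.strerror}"
    return "ok"


for port in (80, 8080):
    print(port, try_bind(port))
```

Run as an ordinary user with an unmodified interpreter, port 80 would be expected to report a permission error while 8080 succeeds (unless something else already occupies it); the result for port 80 changes once the interpreter or the process holds CAP_NET_BIND_SERVICE.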
Assign CAP_NET_BIND_SERVICE to the Python Binary:
user@container-security:~$ which python3
/usr/bin/python3
user@container-security:~$ readlink -f /usr/bin/python3
/usr/bin/python3.12
user@container-security:~$ sudo setcap cap_net_bind_service+ep /usr/bin/python3.12
user@container-security:~$ getcap /usr/bin/python3.12
/usr/bin/python3.12 cap_net_bind_service=ep
And confirm the file permissions are completely unchanged:
user@container-security:~$ ls -l /usr/bin/python3.12
-rwxr-xr-x 1 root root 8020928 Jan 22 20:57 /usr/bin/python3.12
No SUID bit. No ownership change. Nothing visible to an ls -l check.
Confirm It Works:
user@container-security:~$ python3 -m http.server 80
Serving HTTP on 0.0.0.0 port 80 (http://0.0.0.0:80/) ...
It binds successfully. From another terminal, verify the running process has exactly one capability and nothing more:
user@container-security:~$ pgrep python3
3257
user@container-security:~$ getpcaps 3257
3257: cap_net_bind_service=ep
user@container-security:~$ cat /proc/3257/status | grep -i cap
CapInh: 0000000000000000
CapPrm: 0000000000000400
CapEff: 0000000000000400
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
user@container-security:~$
user@container-security:~$ capsh --decode=0000000000000400 # Decode the bitmask to confirm
0x0000000000000400=cap_net_bind_service
Bit 10 (0x400) is CAP_NET_BIND_SERVICE and nothing else. The process cannot read arbitrary files, cannot change file ownership, cannot kill other processes. It can only bind to privileged ports.
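The arithmetic behind that statement is easy to check by hand. The snippet below only uses the CapEff value shown above and the capability number 10 from the kernel headers:

```python
CAP_NET_BIND_SERVICE = 10  # capability number from <linux/capability.h>

cap_eff = int("0000000000000400", 16)  # CapEff value from /proc/<pid>/status

# List every bit position that is set in the mask.
set_bits = [bit for bit in range(64) if (cap_eff >> bit) & 1]

print(hex(1 << CAP_NET_BIND_SERVICE))            # 0x400
print(set_bits)                                  # [10]
print(cap_eff == 1 << CAP_NET_BIND_SERVICE)      # True
```

A mask with exactly one set bit is the strongest evidence that the process holds one capability and no others.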
To remove the capability and confirm it no longer works:
user@container-security:~$ sudo setcap -r /usr/bin/python3.12
user@container-security:~$ python3 -m http.server 80
Traceback (most recent call last):
...
File "/usr/lib/python3.12/socketserver.py", line 473, in server_bind
self.socket.bind(self.server_address)
PermissionError: [Errno 13] Permission denied
Capabilities Security Implications
While capabilities were designed to implement the Principle of Least Privilege and secure your system, they can become a massive liability if misconfigured. Assigning the wrong capability to the wrong binary effectively hands an attacker a clean, built-in mechanism for privilege escalation.
As we saw earlier, an empty capability list assigned with the Effective and Permitted flags (=ep) is actually shorthand for granting all available capabilities. If a system administrator mistakenly applies this, or even just a specific capability like CAP_SETUID, to a script interpreter or a common binary, the entire security model collapses.
Let's look at how easily an attacker can exploit this using Python. Assume an administrator accidentally ran sudo setcap =ep /usr/bin/python3.12 (or sudo setcap cap_setuid+ep /usr/bin/python3.12) while trying to fix a permissions issue. In that scenario, escalating privileges to root becomes trivial. All an attacker needs to do is write a one-liner to change their User ID (UID) to root and spawn a shell.
user@container-security:~$ sudo setcap cap_setuid+ep /usr/bin/python3.12
user@container-security:~$ getcap /usr/bin/python3.12
/usr/bin/python3.12 cap_setuid=ep
user@container-security:~$
user@container-security:~$ python3 -c 'import os; os.setuid(0); os.execl("/bin/bash", "bash")'
root@container-security:~# # notice the prompt changed to root, we are now running a root shell
root@container-security:~# exit
exit
user@container-security:~$
# remove the capability to prevent this from happening again
user@container-security:~$ sudo setcap -r /usr/bin/python3.12
Let's break down exactly what is happening here:
- python3 -c: Tells the Python interpreter to execute the following inline code string.
- import os: Imports the standard OS module required to make system calls.
- os.setuid(0): Leverages the CAP_SETUID capability to change the process's UID to 0, which is the root user. Because the caller holds CAP_SETUID, this sets the real, effective, and saved UIDs all at once.
- os.execl("/bin/bash", "bash"): Replaces the current Python process with a brand-new bash shell. Because the UID was just changed to 0, this new shell runs entirely as root.
Just like that, a standard user account is transformed into a superuser, bypassing all traditional access controls.
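On the defensive side, the output of getcap -r can be screened automatically for this kind of misconfiguration. The sketch below is one possible approach, under stated assumptions: it expects the `path capabilities=flags` output format shown in this article, the `RISKY` list is an illustrative selection rather than an authoritative one, `flag_risky` is a made-up helper, and `/opt/tool` is a hypothetical path used only in the sample.

```python
import re

# Illustrative selection of capabilities that commonly enable escalation.
RISKY = {
    "cap_setuid", "cap_setgid", "cap_sys_admin",
    "cap_sys_module", "cap_sys_ptrace", "cap_dac_override",
}
# A clause is an optional comma-separated name list followed by =, + or -.
CLAUSE = re.compile(r"^([a-z0-9_,]*)[=+-]")


def flag_risky(getcap_output: str) -> list[tuple[str, str]]:
    """Return (path, reason) pairs for entries that deserve a closer look."""
    findings = []
    for line in getcap_output.splitlines():
        fields = line.split()  # assumes paths contain no whitespace
        if len(fields) < 2:
            continue
        path, clauses = fields[0], fields[1:]
        for clause in clauses:
            if clause == "=":
                continue  # bare "=" grants nothing (separator in older output)
            match = CLAUSE.match(clause)
            if not match:
                continue
            names = match.group(1)
            if names in ("", "all"):
                findings.append((path, "all capabilities"))
                continue
            for name in names.split(","):
                if name in RISKY:
                    findings.append((path, name))
    return findings


sample = """/usr/bin/ping cap_net_raw=ep
/usr/bin/python3.12 cap_setuid=ep
/opt/tool =ep"""

for path, reason in flag_risky(sample):
    print(f"{path}: {reason}")
# /usr/bin/python3.12: cap_setuid
# /opt/tool: all capabilities
```

Interpreters and other general-purpose binaries deserve the most scrutiny here, since any capability they carry is available to every script a user chooses to run through them.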
The privilege escalation scenarios above become significantly more consequential in a containerized environment, and this is a topic we will explore in depth in a dedicated chapter.
This article is one piece of the Ultimate Container Security Series, an ongoing effort to organize and explain container security concepts in a practical way. If you want to explore related topics or see what’s coming next, the series introduction post provides the complete roadmap.
