Ivo Brett

Posted on Jun 3

Docker on Proxmox LXC: What Actually Works (and Why Unprivileged Doesn't)

#devops #docker #linux #tutorial

The Setup

I run a Proxmox 9 homelab (pve-manager/9.0.5, kernel 6.14.8-2-pve) and I needed to run Docker inside an LXC container — not a VM — to test a customer-style "bring-your-own-VPS" deployment path for a PaaS I'm building. The container had to act like a standard Ubuntu cloud VM: Docker, systemd, the works.

Choosing LXC over a full VM gets me near-bare-metal performance, a fraction of the RAM overhead, and instant boots. The catch: the "Docker on LXC" recipes in most blog posts and Proxmox forum threads are out of date. They assume kernel 5.x and runc 1.1.x. On a modern Proxmox (kernel 6.14, plus the runc 1.2+ shipped with Docker 29) those recipes fail in two new and confusing ways before you even reach the workarounds we used to know about.

This article walks through exactly what fails, why it fails, and the config that actually works in 2026 — plus an honest look at the security tradeoffs, because spoiler: the working config is privileged, and that matters.

The Goal

A Proxmox LXC container that can:

Run docker run hello-world cleanly
Pull and build complex images (multi-stage builds, overlay2 storage driver)
Run nested containers with their own systemd
Use systemd --user for per-service lingering processes

Attempt 1: Unprivileged LXC (the path you "should" take)

Conventional wisdom says: use unprivileged LXC. The container's root is mapped to an unprivileged UID on the host (typically 100000), so even a full container compromise lands on an unprivileged host account rather than on host root. Modern Proxmox makes this the default and recommended mode.
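A quick way to see which mode a container is in is to read its UID map from inside. This is a minimal sketch: `userns_mode` is a hypothetical helper name, and it assumes the kernel's usual `/proc/self/uid_map` layout (inside start, outside start, count).

```shell
#!/bin/sh
# Classify one uid_map line (format: <inside-start> <outside-start> <count>).
# A full identity map means there is no remapping: a privileged container
# or the host itself.
userns_mode() {
    # Intentionally unquoted so the shell splits the line into three fields.
    set -- $1
    if [ "$1" = "0" ] && [ "$2" = "0" ] && [ "$3" = "4294967295" ]; then
        echo "identity-mapped"
    else
        echo "remapped (container root is host UID $2)"
    fi
}

# Usage inside the container:
#   userns_mode "$(head -n 1 /proc/self/uid_map)"
```

In an unprivileged container with the typical mapping, this should report host UID 100000.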

I started with an unprivileged Ubuntu 25.04 container and added the now-standard features:

pct set <vmid> -features nesting=1,keyctl=1,fuse=1

Then inside the container, installed Docker from the official download.docker.com repo and ran:

docker run --rm hello-world

Here's what happened:

docker: Error response from daemon: failed to create task for container:
failed to create shim task: OCI runtime create failed: runc create failed:
unable to start container process: error during container init:
open sysctl net.ipv4.ip_unprivileged_port_start file:
reopen fd 8: permission denied

The cause: with runc 1.2+, container init unconditionally sets net.ipv4.ip_unprivileged_port_start=0 in the new container's network namespace so that non-root processes inside the container can bind to ports below 1024. Setting that sysctl requires CAP_NET_ADMIN over the namespace owning the sysctl. In an unprivileged LXC, the network namespace is owned by the host's root, not the container's mapped root. The mapped root has no capability over it. runc can't open the file for write, and the container fails to start.

I tried every workaround in the Proxmox forum playbook:

lxc.apparmor.profile: unconfined — different error (next section)
lxc.cap.drop: (clear the drop list) — no change
lxc.cgroup2.devices.allow: a — no change
Bind-mounting /sys/kernel/security from the host — no change

Then with lxc.apparmor.profile: unconfined, a new error replaces the sysctl one:

docker: Error response from daemon:
Could not check if docker-default AppArmor profile was loaded:
open /sys/kernel/security/apparmor/profiles: permission denied

The cause: in the unprivileged container, /sys/kernel/security/apparmor/profiles is owned by host-root, which maps to nobody:nogroup in the user namespace. No chmod or bind-mount changes that; it's a fundamental property of the user namespace. As a result, Docker's AppArmor probe at startup fails.

This is a hard ceiling, not a configuration problem. Two independent kernel/runtime decisions both require capabilities the user namespace can't grant. Stop trying to make unprivileged Docker work on recent Proxmox kernels unless you want to maintain a forked runc.
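If you're triaging a container of your own, it helps to know which of the two failures you're looking at. The sketch below matches only the two error strings quoted above; `classify_docker_error` is a hypothetical helper, and other runc or Docker versions may word these errors differently.

```shell
#!/bin/sh
# Map a Docker error message to one of the two failure modes described above.
classify_docker_error() {
    case "$1" in
        *ip_unprivileged_port_start*)
            echo "sysctl-write" ;;
        *"/sys/kernel/security/apparmor/profiles"*)
            echo "apparmor-probe" ;;
        *)
            echo "unknown" ;;
    esac
}

# Usage inside the container (captures the output of a failing run):
#   classify_docker_error "$(docker run --rm hello-world 2>&1)"
```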

Attempt 2: Privileged LXC (what actually works)

Convert the existing container via backup and restore; there's no need to lose its contents:

# Snapshot first (optional but cheap insurance)
pct snapshot <vmid> pre-conversion

# Stop, back up, destroy, restore as privileged
pct stop <vmid>
vzdump <vmid> --storage local --compress zstd --mode stop
LATEST=$(ls -t /var/lib/vz/dump/vzdump-lxc-<vmid>-*.tar.zst | head -1)
# Confirm the archive exists before destroying anything
ls -lh "$LATEST"
pct destroy <vmid>
pct restore <vmid> "$LATEST" --unprivileged 0 --storage local-lvm
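Because the destroy step is irreversible, a scripted version of this should refuse to continue when no backup archive is found. A minimal sketch, assuming the default dump directory used above; `latest_backup` is a hypothetical helper and VMID 105 is a made-up example:

```shell
#!/bin/sh
# Print the newest vzdump archive for a container, or fail if none exists.
# $1 = dump directory, $2 = container ID
latest_backup() {
    # ls -t sorts newest first; the "no match" error is silenced.
    newest=$(ls -t "$1"/vzdump-lxc-"$2"-*.tar.zst 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    echo "$newest"
}

# Usage on the Proxmox host: destroy and restore only if a backup was found.
#   LATEST=$(latest_backup /var/lib/vz/dump 105) \
#       && pct destroy 105 \
#       && pct restore 105 "$LATEST" --unprivileged 0 --storage local-lvm
```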

A 40 GB container backed up and restored in ~10 minutes on a single consumer SSD.

Now the privileged-LXC-Docker config. Add these to /etc/pve/lxc/<vmid>.conf:

features: nesting=1,keyctl=1,fuse=1
lxc.apparmor.profile: unconfined
lxc.cap.drop:

Why each one matters:

nesting=1 is required for the container's own systemd and cgroup namespace. Docker won't start without it.
keyctl=1 is needed because Docker's containerd-shim uses kernel keyrings during overlay2 setup. Without it you get cryptic "operation not permitted" errors on first container start, even in a privileged container.
fuse=1 allows FUSE mounts inside the container, which fallback storage drivers such as fuse-overlayfs rely on when overlay2 isn't usable.
lxc.apparmor.profile: unconfined leaves the container's processes outside the host's AppArmor confinement. Without this, Docker's apparmor_parser invocation gets blocked.
lxc.cap.drop: (empty value) clears Proxmox's default capability drops. Proxmox drops mac_admin and mac_override even on privileged LXCs by default. Without those caps, apparmor_parser can't replace the docker-default AppArmor profile and Docker bails with "Access denied. You need policy admin privileges to manage profiles."
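To confirm that a container's config actually carries all of these settings, something like the following could be run on the host. It is a rough sketch: `check_docker_lxc_conf` is a hypothetical helper, and it greps the whole file, so any snapshot sections in the same config file are not distinguished from the live config.

```shell
#!/bin/sh
# Report which required settings are missing from an LXC config file.
# Prints one line per problem; prints nothing and returns 0 if all are present.
check_docker_lxc_conf() {
    conf=$1
    problems=0
    for flag in nesting=1 keyctl=1 fuse=1; do
        if ! grep -E '^features:' "$conf" | grep -q "$flag"; then
            echo "missing feature: $flag"
            problems=$((problems + 1))
        fi
    done
    if ! grep -Eq '^lxc\.apparmor\.profile:[[:space:]]*unconfined[[:space:]]*$' "$conf"; then
        echo "missing: lxc.apparmor.profile: unconfined"
        problems=$((problems + 1))
    fi
    if ! grep -Eq '^lxc\.cap\.drop:[[:space:]]*$' "$conf"; then
        echo "missing: empty lxc.cap.drop:"
        problems=$((problems + 1))
    fi
    if grep -Eq '^unprivileged:[[:space:]]*1' "$conf"; then
        echo "container is unprivileged"
        problems=$((problems + 1))
    fi
    [ "$problems" -eq 0 ]
}

# Usage on the Proxmox host (VMID 105 is a made-up example):
#   check_docker_lxc_conf /etc/pve/lxc/105.conf && echo "config looks right"
```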

Start the container, exec in, and run docker run --rm hello-world. It works. Cleanly.

The Security Tradeoff (Be Honest About This)

Privileged LXC is not unprivileged LXC. The container's root is the host's root. There's no user-namespace remapping between them. This means:

Anything that can break out of the container's mount namespace (via a kernel vulnerability, a hostile Docker image with broken seccomp, a misconfigured volume mount, etc.) is running as host-root.
AppArmor confinement is off (unconfined). The host's normal MAC defenses don't apply to processes inside the container.
mac_admin and mac_override capabilities mean the container can modify the host's AppArmor policy. Compromise the container → modify the host's security policy.
Bind-mounting host paths into containers (especially /, /var/run/docker.sock, or anything under /sys) gives those containers and any process they spawn full host access.
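You can check which capabilities the container really holds by decoding the CapBnd mask in /proc/self/status. A small sketch, with `has_cap` as a hypothetical helper; the capability numbers come from linux/capability.h (CAP_NET_ADMIN is 12, CAP_MAC_OVERRIDE is 32, CAP_MAC_ADMIN is 33):

```shell
#!/bin/sh
# Return success if capability number $2 is set in the hex mask $1.
# Capability numbers from <linux/capability.h>:
#   CAP_NET_ADMIN=12, CAP_MAC_OVERRIDE=32, CAP_MAC_ADMIN=33
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Usage inside the container:
#   bnd=$(awk '/^CapBnd:/ {print $2}' /proc/self/status)
#   has_cap "$bnd" 33 && echo "mac_admin is available"
#   has_cap "$bnd" 32 && echo "mac_override is available"
```

If both checks succeed, the container is in the state described above and can touch AppArmor policy.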

If you're running this configuration:

Treat the LXC as a privileged peer of the host, not a sandbox. A compromised Docker container inside is a single kernel CVE away from owning your hypervisor.
Don't expose this LXC to untrusted workloads. Build pipelines, CI runners, containers from registries you don't fully trust — those belong in a VM, not in a privileged LXC.
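Keep the Proxmox host and the container both patched. Kernel CVEs are how this configuration goes bad. Subscribe to the pve-user mailing list and update both regularly: apt update && apt full-upgrade on the host (Proxmox recommends a full/dist-upgrade rather than a plain apt upgrade), and apt update && apt upgrade in the container.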
Network-segment the LXC if it's running anything internet-facing. Proxmox SDN or a separate VLAN.
Don't put anything in this LXC you wouldn't put on the host directly. That's the right mental model — it's basically a chrooted view of the host, with no extra confinement.

The honest summary: privileged LXC + Docker is a performance and simplicity choice, not a security choice. If your threat model includes hostile workloads or you're running workloads for a customer, run Docker in a full VM (KVM via Proxmox) instead. The 200 MB of extra RAM and 2-second boot delay are the price of real isolation.

Why Not Just Use a VM?

Fair question. A KVM VM gives you actual hardware-level isolation, identical Docker behavior to a bare-metal install, and zero of the kernel-namespace gymnastics. Use a VM if:

You're running customer workloads.
You need to test things that interact with the kernel (drivers, low-level networking).
You don't trust the images you're pulling.
Performance overhead doesn't matter to you.

Use a privileged LXC if:

You control everything running inside.
You need bare-metal-ish performance for many short-lived containers.
RAM density matters (homelab on a NUC, edge box, etc.).
The workloads are functionally equivalent to running on the host.

Verification

After conversion, sanity-check:

# Inside the LXC
uname -r              # Should show host kernel (6.14.8-2-pve)
docker info | grep -E "Storage Driver|Cgroup Driver|Kernel Version"
docker run --rm hello-world
docker run --rm alpine sh -c "uname -r; cat /proc/1/cgroup | head -3"

You should see the overlay2 storage driver and the systemd cgroup driver, and a fresh container should start and exit cleanly in under a second.
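For a scripted version of the driver check, docker info can print just the two fields through a Go template. The sketch below compares them against the values reported in this article. `check_docker_drivers` is a hypothetical helper, and the expected pair is passed as an optional second argument in case your install reports a different storage driver name.

```shell
#!/bin/sh
# Compare "<storage driver> <cgroup driver>" against an expected pair.
# $1 = actual value, $2 = expected value (defaults to the pair seen in this article)
check_docker_drivers() {
    expected=${2:-overlay2 systemd}
    if [ "$1" = "$expected" ]; then
        echo "ok: $1"
    else
        echo "unexpected: got '$1', wanted '$expected'"
        return 1
    fi
}

# Usage inside the LXC:
#   check_docker_drivers "$(docker info --format '{{.Driver}} {{.CgroupDriver}}')"
```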

TL;DR for the Forum-Hopping Crowd

If you're here from a Google search and just need the working config for Proxmox 9 / kernel 6.14 / Docker 29:

# /etc/pve/lxc/<vmid>.conf
features: nesting=1,keyctl=1,fuse=1
lxc.apparmor.profile: unconfined
lxc.cap.drop:

…and the container must be privileged (unprivileged: 0 or that line absent). Unprivileged Docker is fundamentally broken on recent kernels because of runc's per-namespace sysctl write and Docker's AppArmor probe. Don't fight it.

And read the security section above — privileged LXC is closer to "Docker on the host with extra steps" than to "Docker in a sandbox." Architect accordingly.

If this saved you the four hours it took me to discover all of this empirically, drop a comment. If you found another path that gets unprivileged working without forking runc, definitely drop a comment.