TL;DR: There’s a namespace bug affecting Ubuntu 20.04, 22.04, and 24.04 servers that causes random service failures. It’s been reported since 2021 across systemd, Ubuntu, Fedora, and Red Hat trackers. Most reports are either expired or labeled “not-our-bug.” Only a reboot fixes it.
If you’re running Ubuntu servers and have ever seen this in your logs:
```
Failed to set up mount namespacing: /run/systemd/unit-root/dev: Invalid argument
Failed at step NAMESPACE spawning: Invalid argument
Main process exited, code=exited, status=226/NAMESPACE
```
Congratulations. You’ve encountered one of the most frustrating bugs in the Linux ecosystem — one that’s been bouncing between the kernel and systemd teams for years with no resolution.
## What Happens
Random systemd services — including critical ones like systemd-resolved, systemd-timesyncd, systemd-journald, and your own custom services — suddenly refuse to start. The error mentions “mount namespacing” and “Invalid argument.”
Restarting the service doesn’t help. systemctl daemon-reload doesn’t help. The only reliable fix is a full system reboot.
If you’re running containerized workloads (LXC, LXD, Proxmox), it gets worse: the bug can affect the entire host node, and container reboots won’t fix it — you need to reboot the hypervisor itself.
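To check whether a box has hit this, grep for the `226/NAMESPACE` exit status that systemd logs for these failures. A minimal sketch (the journal line below is an illustrative stand-in; on a live system you'd pipe in real `journalctl` output):

```shell
# On a live system you would run:
#   systemctl list-units --failed
#   journalctl -b -p err | grep 'status=226/NAMESPACE'
# Here, a sample (illustrative) journal line stands in for real output:
sample='Oct 01 12:00:00 host systemd[1]: foo.service: Main process exited, code=exited, status=226/NAMESPACE'

if printf '%s\n' "$sample" | grep -q 'status=226/NAMESPACE'; then
    echo "namespace failure detected"
fi
```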
## The Blame Game
I’ve tracked this bug across multiple issue trackers:
- systemd/systemd #24798 — Ubuntu 20.04, September 2022
- systemd/systemd #19926 — labeled `not-our-bug`, June 2021
- Ubuntu Launchpad #1990659 — expired due to inactivity
- Fedora CoreOS #1296 — affects PXE/diskless boot
- Red Hat Bugzilla #2111863 — migrated to Jira, status unknown
- dbus-broker #297 — CentOS Stream 9
The pattern is always the same: user reports the bug, maintainers ask for debug logs, user either provides them or doesn’t respond fast enough, bug expires or gets closed with “not-our-bug.”
The systemd team says it’s a kernel issue. The kernel team… well, I haven’t found anyone from the kernel team actively investigating this.
## Root Causes (As Best We Can Tell)
The bug appears to involve:
- Race conditions in mount namespace setup — systemd tries to remount `/sys` and `/dev` while other unmount operations are in flight
- Mount propagation issues — systemd changes the default from `MS_PRIVATE` to `MS_SHARED`, causing unexpected interactions
- Resource exhaustion — sometimes related to inotify limits (`fs.inotify.max_user_instances`)
- Container/virtualization edge cases — more prevalent in LXC/LXD environments
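Two of those hypotheses are easy to inspect on a running host. A sketch, assuming util-linux's `findmnt` is available:

```shell
# Mount propagation: systemd has remounted / as shared since v188,
# so expect "shared" here rather than "private".
findmnt -o TARGET,PROPAGATION /

# inotify headroom: compare the per-user instance limit against
# how many inotify instances are currently open system-wide.
cat /proc/sys/fs/inotify/max_user_instances
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
```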
But nobody has done a definitive root cause analysis. The bug is intermittent, hard to reproduce on demand, and affects systems that have been running fine for weeks or months.
## The Irony
Remember when /etc/init.d/ scripts “just worked”? When starting a service meant running a shell script that executed a binary?
Systemd brought us dependency management, socket activation, cgroups integration, and dozens of security features like PrivateDevices=, ProtectSystem=, and PrivateTmp=. These are genuinely useful features.
But they also introduced complexity. The namespace isolation that causes this bug exists because systemd creates a private mount namespace for services with security hardening enabled. It’s a feature. Until it breaks.
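For context, systemd only builds that private mount namespace for units that opt into sandboxing. A hypothetical unit (the `example-daemon` binary is made up) using the directives mentioned above would look like:

```ini
[Service]
ExecStart=/usr/bin/example-daemon
# Each of these directives makes systemd set up a private mount
# namespace for the service before exec():
PrivateTmp=yes        # private /tmp and /var/tmp
PrivateDevices=yes    # minimal /dev with only pseudo-devices
ProtectSystem=strict  # file system hierarchy mounted read-only
```

Any one of these is enough to trigger the namespace setup path that fails with `status=226/NAMESPACE`.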
The old init system didn’t have this bug because it didn’t have namespaces. Services ran in the global namespace. Less secure? Yes. But also fewer moving parts to fail.
## Workarounds
If you’re affected, here are your options:
1. Disable namespace isolation for affected services:

```shell
sudo systemctl edit your-service.service
```

Then add the override:

```ini
[Service]
PrivateDevices=no
ProtectHome=no
ProtectSystem=no
```
2. Clear corrupted systemd state:

```shell
sudo rm -rf /run/systemd/unit-root/
sudo systemctl daemon-reload
```
3. Increase inotify limits (note the `sudo tee` — a plain `echo >>` fails on a root-owned file):

```shell
echo "fs.inotify.max_user_instances=512" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
4. Monitor and auto-reboot (a root cron entry; note the leading `0` — `* */3 * * *` would fire every minute of every third hour):

```shell
0 */3 * * * systemctl list-units --failed | grep -q NAMESPACE && /sbin/reboot
```
Yes, that last one is a scheduled reboot. That’s where we are.
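If you do go that route, a marginally safer sketch is a small script (hypothetical name and path, e.g. `/usr/local/sbin/namespace-watchdog.sh`) that leaves a trace in the journal and only reboots on the specific NAMESPACE failure:

```shell
#!/bin/sh
# Hypothetical watchdog: reboot only when a failed unit shows the
# 226/NAMESPACE status, and log to the journal before doing so.

has_namespace_failure() {
    # Expects `systemctl list-units --failed --no-legend` on stdin.
    grep -q 'NAMESPACE'
}

if systemctl list-units --failed --no-legend | has_namespace_failure; then
    logger -t namespace-watchdog "226/NAMESPACE failure detected; rebooting"
    /sbin/reboot
fi
```

Run it from the same cron schedule as above; the `logger` line means you can later count how often the bug actually bites.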
## What Should Happen
Someone — Canonical, Red Hat, or the systemd team — needs to:
- Create a reliable reproduction case
- Add instrumentation to capture the exact kernel/systemd state when the failure occurs
- Do a proper root cause analysis
- Fix it in the kernel, in systemd, or in both
Until then, we’re all just rebooting servers and hoping.
Have you encountered this bug? What’s your workaround?
I’d love to hear from anyone who has done deeper investigation or found a permanent fix.