## The Origin Story
It started with a spare 500GB hard drive and a problem that had been quietly annoying me for months.
My digital life was fragmented across three different services I didn't control, two hard drives I kept forgetting to back up, and a Google Photos library that was one policy change away from becoming someone else's problem. July 2024. I had a virtualization host, a free weekend, and the dangerous combination of too much free time and just enough Linux knowledge to be overconfident.
The original goal was modest — a basic NAS. Centralize the drives, access files from anywhere, stop worrying about losing photos. That was the plan.
Scope creep had other ideas.
Within a few weeks, the "basic NAS" had grown into something I started calling Ghar Labs v1 — a monolithic microservices host running:
- Media Automation — Jellyfin with the full *Arr stack for automated library management
- Sovereign Cloud — Nextcloud replacing Google Drive and Photos entirely
- Observability — Prometheus, Grafana, and Netdata for real-time monitoring
The inspiration was equal parts practical and philosophical. I'd been reading about self-hosting communities — r/homelab, r/selfhosted — and the idea of owning your own data stack resonated. Not paranoia. Just the quiet satisfaction of knowing exactly where your files live and who has access to them.
I also wanted to learn. Running production services on your own hardware teaches you things no tutorial covers — because tutorials don't page you at 3 a.m. when something breaks.
## The Technical Architecture
The entire system ran as a single Ubuntu Server 24.04 LTS virtual machine, orchestrating 10+ containers via Docker Compose.
| Component | Choice | Why |
|---|---|---|
| Hypervisor | Oracle VirtualBox (Windows host) | Already available, zero cost |
| Resources | 4 vCPUs, 4GB RAM | Proof of concept allocation |
| Networking | Bridged Adapter | Direct LAN access for SMB |
| Public routing | Cloudflare Tunnels | CGNAT bypass, no exposed ports |
| Private routing | Tailscale | Secure SSH and SMB access |
All services were containerized. To prevent permission errors and duplicate media files, the downloader and the media server were pointed at the exact same physical path, with strict PUID/PGID mappings keeping file ownership consistent:
```text
/mnt/data/
├── media/           # The unified directory
│   ├── movies/
│   └── shows/
└── nextcloud/       # Sovereign cloud data
```
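A minimal Compose sketch of that layout (the service names, image tags, and UID/GID values are illustrative, not the exact v1 files):

```yaml
services:
  downloader:                         # hypothetical client; v1 used the *Arr ecosystem
    image: lscr.io/linuxserver/qbittorrent:latest
    environment:
      - PUID=1000                     # must match the owner of /mnt/data on the host
      - PGID=1000
    volumes:
      - /mnt/data/media:/data/media

  jellyfin:
    image: lscr.io/linuxserver/jellyfin:latest
    environment:
      - PUID=1000
      - PGID=1000
    volumes:
      - /mnt/data/media:/data/media   # identical mount, so imports become hardlinks, not copies
```

Because both containers see the same physical path, imports can be hardlinks or atomic moves instead of slow cross-mount copies, and a single owner UID keeps permissions sane.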
Clean architecture on paper. Production, as usual, had other plans.
## The CGNAT Problem
The first real engineering challenge wasn't Docker. It was my ISP.
CGNAT — Carrier-Grade NAT — means your ISP hands you a private IP address and shares a single public IP across potentially hundreds of subscribers. Port forwarding is impossible. Your server is invisible to the public internet. No A record is going to help you there.
The naive solution is to rent a VPS and reverse proxy everything through it. That works, but it routes all your traffic — including 4K media — through a server you're paying for by the gigabyte. Expensive, slow, and architecturally wasteful.
After research and a lot of trial and error, I landed on a split-tunneling approach:
- Public routing: Cloudflare Tunnels handled inbound HTTP traffic (Nextcloud web interface, Jellyfin, dashboards) without ever exposing my origin IP. No open ports required. The tunnel runs as a lightweight daemon on the host and registers itself with Cloudflare's edge network.
- Private routing: Tailscale handled everything that didn't need to be public — SMB shares, direct SSH access, internal dashboards. Zero-config mesh VPN with WireGuard under the hood (a quick join sketch follows the tunnel config below).
```yaml
# Cloudflare tunnel config
tunnel: <YOUR_TUNNEL_ID>
credentials-file: /root/.cloudflared/<YOUR_TUNNEL_ID>.json

ingress:
  - hostname: nextcloud.yourdomain.com
    service: http://localhost:80
  - hostname: jellyfin.yourdomain.com
    service: http://localhost:8096
  # Catch-all rule; cloudflared refuses to start without one
  - service: http_status:404
```
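The private side needed almost no configuration by comparison. A sketch of joining the tailnet (the address shown is a made-up example from Tailscale's 100.64.0.0/10 range):

```bash
# One-time join; prints an auth URL to approve the machine in the browser
sudo tailscale up

# From any other device on the tailnet, SSH and SMB ride the WireGuard mesh:
ssh user@100.101.102.103   # hypothetical tailnet IP of the server
```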
The result: a server fully accessible from anywhere in the world, with zero open firewall ports and zero VPS costs. Public traffic through Cloudflare's edge. Private traffic through Tailscale's encrypted mesh. The ISP's CGNAT became irrelevant.
It was the part of this project I was most proud of. It was also completely unrelated to why the server eventually died.
## Failure Point 1: The Zombie Process Leak
Over weeks of uptime, RAM usage would creep upward with no corresponding spike in CPU. The server wasn't doing more work — it just wasn't cleaning up after itself.
Logging in eventually surfaced a warning in the server's login banner: 2 zombie processes.
A subsequent htop audit confirmed the diagnosis.
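You don't need htop to spot them, either; process state Z is visible straight from ps. A quick sketch:

```bash
# Count zombie (state Z) processes; anything above zero deserves a look
ps -eo stat= | grep -c '^Z'

# List each zombie with its parent PID to trace which container isn't reaping
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
```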
Docker containers were not reaping child processes correctly. When you run an application inside a container without an init system like dumb-init or tini, PID 1 inside the container doesn't know how to reap the orphaned children it adopts. Their entries linger in the process table, unreaped, until the container restarts.
The fix is straightforward: add --init to your docker run call, or in Compose:
```yaml
services:
  your-app:
    init: true   # docker's bundled init (tini) runs as PID 1 and reaps orphans
```
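The docker run equivalent, for containers started outside Compose (the image name is a placeholder):

```bash
docker run --init -d your-image   # same effect: tini as PID 1
```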
I learned this after the fact. The server did not.
## Failure Point 2: Silent OOM Kills
Core services like Nextcloud held up reasonably well. The heavier Go-based monitoring tools did not — they were fighting over the same 4GB ceiling.
During my final audit before decommissioning, a routine docker ps -a revealed what I had missed for months:
```text
CONTAINER ID   IMAGE     COMMAND   STATUS
a3f1b2c9d4e5   grafana   ...       Exited (255) 87 days ago
```
Grafana had silently crashed under memory pressure and never come back. Docker's restart policy tried, the kernel said no, and the container just quietly stopped existing. No alert. No notification. The dashboard I thought was watching my stack had itself gone dark.
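In hindsight, explicit per-container memory limits would at least have bounded the damage. A sketch, with an illustrative, untuned cap:

```yaml
services:
  grafana:
    image: grafana/grafana:latest
    mem_limit: 512m          # illustrative; the kernel still OOM-kills, but only this container
    restart: unless-stopped
```

A limit doesn't prevent the kill, it scopes it: one hungry service can no longer drag the whole 4GB host down with it.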
The lesson: docker ps -a is not optional. Automate the check, or instrument a watchdog. A monitoring tool that nobody monitors is just a pretty corpse.
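Even a cron-driven script beats nothing here. A minimal watchdog sketch (the notification hook is deliberately left as a placeholder):

```bash
#!/usr/bin/env bash
# Flag any container that has exited; run from cron or a systemd timer.
set -euo pipefail

dead=$(docker ps -a --filter "status=exited" --format '{{.Names}}: {{.Status}}')

if [ -n "$dead" ]; then
  echo "Dead containers on $(hostname):"
  echo "$dead"
  # Wire up your notifier of choice here: mail, ntfy, a webhook...
fi
```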
## Failure Point 3: The Single Point of Failure
The zombie leak and the OOM kills were annoying. This one was existential.
The entire lab lived on one virtual disk (.vdi). One volume, no redundancy:
- ❌ No ZFS bit-rot protection
- ❌ No RAID parity
- ❌ No snapshots
- ❌ No off-host backups of the database
A single bad sector — or a host crash mid-write — could corrupt the Nextcloud database and take years of personal data with it. I had built a fairly sophisticated networking layer on top of a foundation that was, architecturally, one fsck error away from disaster.
This is the part where I stopped calling it a "proof of concept" and started calling it "a liability."
## The Resolution
Ghar Labs v1 was a successful learning environment. In twelve months, it taught me things no tutorial would:
- How Docker networking actually behaves under sustained load
- How to engineer around CGNAT without spending money on a VPS
- What happens to PID 1 when you don't think about it
- Why storage architecture is the foundation, not an afterthought
But a single-node VM with no storage redundancy, no init system, and a 4GB RAM ceiling is not where you keep data you care about.
I decommissioned the Ubuntu host, wiped the drives, and migrated the entire stack to a dedicated bare-metal machine running TrueNAS Scale with proper ZFS redundancy. The services are the same. The foundation is not.
The spare 500GB drive that started this whole thing is now sitting in a ZFS mirror pool, bit-rot protected, snapshotted nightly.
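The v2 storage layer boils down to a few ZFS primitives. For reference, a sketch with illustrative pool, dataset, and device names:

```bash
# Two-disk mirror: every block exists twice and is checksummed on both sides
zpool create tank mirror /dev/sda /dev/sdb

# Nightly snapshot (e.g. from cron): instant, cheap, rollback-able
zfs snapshot tank/media@nightly-$(date +%Y-%m-%d)

# Periodic scrub walks every block and repairs bit rot from the mirror copy
zpool scrub tank
```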
Sometimes the best thing you can do with a legacy server is document what it taught you, shut it down gracefully, and build the next one right.
The full configuration archive — Compose files, Cloudflare tunnel configs, Tailscale ACLs — is preserved here for reference:
→ devpratyushh/homelab-v1-archive
It's retired. But it earned its README.
Next post: Ghar Labs v2 — migrating the entire stack to TrueNAS Scale with ZFS, proper jail isolation, and a networking layer that survives a power cut.
