<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pratyush Mishra</title>
    <description>The latest articles on DEV Community by Pratyush Mishra (@devpratyush).</description>
    <link>https://dev.to/devpratyush</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F578087%2F18077700-9d04-4041-84b1-2153d6df9866.png</url>
      <title>DEV Community: Pratyush Mishra</title>
      <link>https://dev.to/devpratyush</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devpratyush"/>
    <language>en</language>
    <item>
      <title>Post-Mortem: Why My Ubuntu Docker Homelab Failed (And Why I Killed It)</title>
      <dc:creator>Pratyush Mishra</dc:creator>
      <pubDate>Sun, 12 Apr 2026 11:23:13 +0000</pubDate>
      <link>https://dev.to/devpratyush/post-mortem-why-my-ubuntu-docker-homelab-failed-and-why-i-killed-it-iem</link>
      <guid>https://dev.to/devpratyush/post-mortem-why-my-ubuntu-docker-homelab-failed-and-why-i-killed-it-iem</guid>
      <description>&lt;p&gt;For a year, I ran a monolithic microservices host on a single Ubuntu 24.04 LTS virtual machine. The goal was simple: centralize my data and route around my ISP's Carrier-Grade NAT (CGNAT).&lt;/p&gt;

&lt;p&gt;It started as a proof of concept — 4 vCPUs, 4 GB of RAM, and the quiet confidence of someone who has never been paged at 3 a.m.&lt;/p&gt;

&lt;p&gt;It ended up running 10+ containers via Docker Compose: Nextcloud, a full media stack, Prometheus, Netdata, and Grafana. (More on Grafana later. Spoiler: it did not survive.)&lt;/p&gt;

&lt;p&gt;It worked. And then, slowly, it started to break.&lt;/p&gt;

&lt;p&gt;Here is the post-mortem of &lt;strong&gt;Ghar Labs v1&lt;/strong&gt; — the bottlenecks I hit, the failures I missed, and why I ultimately put the server down.&lt;/p&gt;




&lt;h2&gt;The Architecture &amp;amp; The CGNAT Problem&lt;/h2&gt;

&lt;p&gt;My ISP uses CGNAT, which means port forwarding to the public internet is not possible — my server shares a public IP with potentially hundreds of other subscribers. No &lt;code&gt;A&lt;/code&gt; record is going to help you there.&lt;/p&gt;

&lt;p&gt;To route around this, I engineered a split-tunneling setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public routing:&lt;/strong&gt; Cloudflare Tunnels handled inbound HTTP traffic (Nextcloud web interface, dashboards) without ever exposing my origin IP. No open ports required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private routing:&lt;/strong&gt; Tailscale handled everything that didn't need to be public — SMB shares, SSH, internal dashboards.&lt;/li&gt;
&lt;/ul&gt;
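&lt;p&gt;As a sketch (the hostnames and tunnel name below are placeholders, not my real config), the Cloudflare side reduces to an ingress map in &lt;code&gt;cloudflared&lt;/code&gt;'s config file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;tunnel: ghar-labs
credentials-file: /etc/cloudflared/ghar-labs.json

ingress:
  # Public HTTP services, routed without opening a single port
  - hostname: cloud.example.com
    service: http://localhost:8080   # Nextcloud
  - hostname: dash.example.com
    service: http://localhost:3000   # Grafana
  # Everything else falls through to a 404
  - service: http_status:404
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;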

&lt;p&gt;All services were containerized. To avoid permission errors and duplicate copies of media files, the downloader and media player were pointed at the &lt;strong&gt;exact same physical path&lt;/strong&gt;, with matching &lt;code&gt;PUID&lt;/code&gt;/&lt;code&gt;PGID&lt;/code&gt; ownership in every container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/mnt/data/
├── media/              &lt;span class="c"&gt;# The unified directory&lt;/span&gt;
│   ├── movies/
│   └── shows/
└── nextcloud/          &lt;span class="c"&gt;# Sovereign cloud data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
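&lt;p&gt;In Compose terms, the shared-path rule looked roughly like this (the image names are placeholders; the point is the identical mount and matching IDs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  downloader:
    image: your-downloader-image       # placeholder
    environment:
      - PUID=1000                      # same UID/GID in every container
      - PGID=1000
    volumes:
      - /mnt/data/media:/data/media    # one physical path, no copies
  media-server:
    image: your-media-server-image     # placeholder
    environment:
      - PUID=1000
      - PGID=1000
    volumes:
      - /mnt/data/media:/data/media    # identical mount
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;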



&lt;p&gt;Clean architecture on paper. Production, as usual, had other plans.&lt;/p&gt;




&lt;h2&gt;Failure Point 1: The Zombie Process Leak&lt;/h2&gt;

&lt;p&gt;Over weeks of uptime, RAM usage would creep upward with no corresponding spike in CPU. The server wasn't doing more work — it just wasn't cleaning up after itself.&lt;/p&gt;

&lt;p&gt;Logging into the terminal eventually surfaced a warning: &lt;strong&gt;2 zombie processes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A subsequent &lt;code&gt;htop&lt;/code&gt; audit confirmed the diagnosis.&lt;/p&gt;
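&lt;p&gt;If you want the same check without &lt;code&gt;htop&lt;/code&gt;, zombies are just processes in state &lt;code&gt;Z&lt;/code&gt; — a quick sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Count zombie (state Z) processes — the same figure htop reports
ps -eo stat= | awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'

# List them with their parent PIDs — the parent is the one failing to reap
ps -eo pid=,ppid=,stat=,comm= | awk '$3 ~ /^Z/'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;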

&lt;p&gt;Docker containers were not reaping child processes. When your application runs as PID 1 inside a container without an init system — &lt;code&gt;dumb-init&lt;/code&gt; or &lt;code&gt;tini&lt;/code&gt; — &lt;strong&gt;the kernel reparents orphaned children to it, but it never calls &lt;code&gt;wait()&lt;/code&gt; on them.&lt;/strong&gt; They linger in the process table as zombies until the container (or the host) restarts.&lt;/p&gt;

&lt;p&gt;The fix is straightforward: add &lt;code&gt;--init&lt;/code&gt; to your &lt;code&gt;docker run&lt;/code&gt; call, or in Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;your-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;init&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I learned this after the fact. The server did not.&lt;/p&gt;




&lt;h2&gt;Failure Point 2: Silent OOM Kills&lt;/h2&gt;

&lt;p&gt;Core services like Nextcloud held up reasonably well. Heavier JVM and Go-based monitoring tools did not — they were fighting over the same 4 GB ceiling.&lt;/p&gt;
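&lt;p&gt;In hindsight, one cheap mitigation would have been a hard memory cap per container, so a single hungry service gets killed alone instead of starving the box (the limits below are illustrative, not tuned):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  grafana:
    image: grafana/grafana
    mem_limit: 512m        # hard ceiling; the kernel OOM-kills past this
    memswap_limit: 512m    # no swap escape hatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;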

&lt;p&gt;During my final audit before decommissioning, a routine &lt;code&gt;docker ps -a&lt;/code&gt; revealed what I had missed for months:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CONTAINER ID   IMAGE     COMMAND   STATUS
a3f1b2c9d4e5   grafana   ...       Exited (255) 87 days ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grafana had silently crashed under memory pressure — &lt;strong&gt;exit code 255&lt;/strong&gt; (a kernel OOM kill usually reports 137; a 255 means the process died on its own fatal error, but dead is dead) — and never came back. Docker's restart policy tried, the kernel said no, and the container just quietly stopped existing. No alert. No notification. The dashboard I thought was watching my stack had itself gone dark.&lt;/p&gt;

&lt;p&gt;The lesson: &lt;code&gt;docker ps -a&lt;/code&gt; is not optional. Automate the check, or instrument a watchdog. A monitoring tool that nobody monitors is just a pretty corpse.&lt;/p&gt;
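&lt;p&gt;A minimal watchdog sketch — cron it and pipe the output to whatever notifier you trust (the container name is a placeholder; &lt;code&gt;OOMKilled&lt;/code&gt; and &lt;code&gt;ExitCode&lt;/code&gt; are real fields in &lt;code&gt;docker inspect&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
# Flag anything that is no longer running
docker ps -a --filter status=exited --format '{{.Names}}: {{.Status}}'

# For a specific container, ask whether the kernel OOM-killed it
docker inspect --format '{{.State.OOMKilled}} (exit {{.State.ExitCode}})' grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;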




&lt;h2&gt;Failure Point 3: The Single Point of Failure&lt;/h2&gt;

&lt;p&gt;The zombie leak and the OOM kills were annoying. This one was existential.&lt;/p&gt;

&lt;p&gt;The entire lab lived on one virtual disk (&lt;code&gt;.vdi&lt;/code&gt;). One volume, no redundancy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ No ZFS bit-rot protection&lt;/li&gt;
&lt;li&gt;❌ No RAID parity&lt;/li&gt;
&lt;li&gt;❌ No snapshots&lt;/li&gt;
&lt;li&gt;❌ No off-host backups of the database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single bad sector — or a host crash mid-write — could corrupt the Nextcloud database and take years of personal data with it. I had built a fairly sophisticated networking layer on top of a foundation that was, architecturally, one &lt;code&gt;fsck&lt;/code&gt; error away from disaster.&lt;/p&gt;
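&lt;p&gt;Even without ZFS, a nightly dump shipped off-host would have capped the blast radius. A sketch, assuming a MariaDB container — every name and path here is a made-up placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
# Hypothetical nightly backup: dump the Nextcloud DB, then copy it off-host
set -eu
STAMP=$(date +%F)
docker exec nextcloud-db mysqldump --single-transaction \
  -u nextcloud -p"$DB_PASS" nextcloud &gt; "/mnt/data/backups/nextcloud-$STAMP.sql"
rsync -a /mnt/data/backups/ backup-host:/srv/backups/ghar-labs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;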

&lt;p&gt;This is the part where I stopped calling it a "proof of concept" and started calling it "a liability."&lt;/p&gt;




&lt;h2&gt;The Resolution&lt;/h2&gt;

&lt;p&gt;Ghar Labs v1 was a successful learning environment. In twelve months, it taught me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker networking and Compose service dependencies&lt;/li&gt;
&lt;li&gt;Reverse proxying through CGNAT without opening a single port&lt;/li&gt;
&lt;li&gt;Linux process management (the hard way)&lt;/li&gt;
&lt;li&gt;Why storage architecture is not an afterthought&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But a single-node VM with no storage redundancy, no init system, and a 4 GB RAM ceiling is not where you keep data you care about.&lt;/p&gt;

&lt;p&gt;I decommissioned the Ubuntu host, wiped the drives, and migrated the entire stack to a dedicated bare-metal machine running &lt;strong&gt;TrueNAS Scale&lt;/strong&gt; with proper ZFS redundancy. The services are the same. The foundation is not.&lt;/p&gt;

&lt;p&gt;Sometimes the best thing you can do with a legacy server is document what it taught you, shut it down gracefully, and build the next one right.&lt;/p&gt;

&lt;p&gt;The full configuration archive — Compose files, Cloudflare tunnel configs, Tailscale ACLs — is preserved here for reference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://github.com/devpratyushh/homelab-v1-archive/" rel="noopener noreferrer"&gt;devpratyushh/homelab-v1-archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's retired. But it earned its README.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>ubuntu</category>
      <category>devops</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
