I work helpdesk. I wanted to break into DevOps. So I did what any reasonable person would do: I spent three months building an increasingly complicated home server that nobody asked for.
What started as "I'll just run Jellyfin on an old PC" turned into a full infrastructure project with containerized services, a monitoring stack, automated backups, infrastructure-as-code, and a CI/CD pipeline. This is a quick story of what I built, the decisions I made, and the things that broke along the way.
The Hardware
I picked up a Lenovo ThinkCentre M920q for around $200 on eBay. These small form factor machines are practically designed for homelabs: low power draw, quiet, and 8th/9th gen Intel CPUs with QuickSync hardware transcoding, which matters a lot for Jellyfin.
For storage I added a Terramaster D4-320 DAS (Direct Attached Storage) enclosure with two 4TB HDDs I had lying around, with ~12,700 hours on them (which I'll come back to). The two drives are combined into a single 7.3TB LVM volume.
Why Proxmox Instead of Just Running Docker
The first decision was the foundation. I could have just installed Ubuntu and run Docker Compose directly. Instead I went with Proxmox VE - a bare metal hypervisor that lets you run multiple isolated VMs and LXC containers.
The reasoning:
Isolation - each service gets its own container. If something breaks or gets compromised, it can't affect everything else. Sonarr crashing doesn't take down Prometheus. A misconfigured container doesn't expose your monitoring stack.
LXC over VMs - Linux Containers are lighter than full VMs. An LXC for Prometheus uses 512MB RAM and starts in two seconds. A full VM for the same workload would use 2GB+ and take 30 seconds to boot.
Unprivileged containers - I run everything in unprivileged LXCs. This means the container's root user maps to an unprivileged user on the host (UID 100000 instead of 0). Even if someone escapes the container, they land as an unprivileged user on the host with no meaningful access. The tradeoff is bind mount complexity - I'll get to this a bit later.
The result is 15 LXCs, each running a specific role, isolated from the others, with the Proxmox host managing them all.
The Media Stack
This was the original goal and it's where I spent the most time.
The full pipeline looks like this:
Jellyseerr (request portal)
↓
Sonarr/Radarr (TV and movie automation)
↓
Prowlarr (indexer aggregation)
↓
qBittorrent (download client)
↓
Jellyfin (media server with hardware transcoding)
Everything runs as Docker containers inside a single Docker LXC. I use Docker Compose to manage the stack - a single docker-compose.yml defines all 20+ services, their networking and their volumes.
The VPN Kill Switch
Every torrent goes through a VPN.
Most people configure qBittorrent to bind to the VPN interface. The problem is this relies on the application to enforce the kill switch — if the VPN drops, there's a window where traffic might leak before qBittorrent detects it.
Instead I use Gluetun - a VPN container that creates a network namespace, and run qBittorrent inside that namespace:
qbittorrent:
network_mode: "service:gluetun"
This is enforced at the kernel level. qBittorrent physically cannot route traffic outside the Gluetun network namespace. If Gluetun stops, qBittorrent loses all network access, because its only network interface disappears with it.
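Expanded slightly, the compose wiring looks roughly like this (a simplified sketch: the provider, images, and environment variables follow Gluetun's documented WireGuard setup, and the values are placeholders, not my exact config):

```yaml
services:
  gluetun:
    image: qmcgaw/gluetun
    cap_add:
      - NET_ADMIN                      # Gluetun needs to manage routes/firewall
    devices:
      - /dev/net/tun
    environment:
      VPN_SERVICE_PROVIDER: mullvad    # placeholder: any supported provider
      VPN_TYPE: wireguard
      WIREGUARD_PRIVATE_KEY: ${WG_KEY}
    ports:
      - "8080:8080"                    # qBittorrent WebUI is published here
  qbittorrent:
    image: lscr.io/linuxserver/qbittorrent
    network_mode: "service:gluetun"    # share Gluetun's network namespace
    depends_on:
      - gluetun
```

Note that the WebUI port is published on the gluetun service, not on qbittorrent: with a shared network namespace, qBittorrent has no network identity of its own.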
Hardware Transcoding
The M920q's Intel QuickSync is passed through to the Docker LXC via the Proxmox container config:
lxc.cgroup2.devices.allow: c 226:1 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/card1 dev/dri/card1 none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
Jellyfin uses QSV to hardware decode/encode H.264, HEVC, VP9 and AV1. Transcoding a 4K stream uses about 15% CPU with hardware acceleration vs 100%+ without it on this hardware.
The Hardlink Trick
Downloaded media and the Jellyfin library live on the same filesystem (/mnt/data). When Sonarr/Radarr import a file, they create a hardlink rather than copying it - the file appears in both the downloads folder and the media library simultaneously while only consuming disk space once.
A 30GB season pack is "moved" to the media library instantaneously with zero additional disk usage. This only works when both locations are on the same filesystem, which is why the folder structure matters.
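The mechanics are easy to verify yourself on a scratch directory (throwaway paths, not the real library layout):

```shell
# Demonstrate hardlink behavior on a scratch directory.
dir=$(mktemp -d)
mkdir -p "$dir/downloads" "$dir/media"

# Simulate a downloaded file.
echo "fake episode" > "$dir/downloads/episode.mkv"

# "Import" it the way Sonarr does: a hardlink, not a copy.
ln "$dir/downloads/episode.mkv" "$dir/media/episode.mkv"

# Both paths now point at the same inode; the link count is 2.
stat -c '%h %i' "$dir/downloads/episode.mkv"
stat -c '%h %i' "$dir/media/episode.mkv"

# Deleting the download later does not touch the library copy.
rm "$dir/downloads/episode.mkv"
cat "$dir/media/episode.mkv"    # still prints: fake episode
```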
Unprivileged LXC Bind Mounts
This tripped me up for an entire evening.
When you bind mount a directory from the Proxmox host into an unprivileged LXC, the UID mapping means UID N inside the LXC corresponds to UID 100000 + N on the host: root (0 inside) becomes 100000, and a service user with UID 1000 inside becomes 101000.
So when you chown a directory on the host to give the LXC access, you don't do:
chown -R 1000:1000 /mnt/das # WRONG
You do:
chown -R 101000:101000 /mnt/das # CORRECT
Using 1000 means the LXC can't write to the directory even though it looks like it should work: host UID 1000 is outside the container's mapped range, so inside the LXC the files show up as owned by nobody:nogroup. The error messages had me confused for a while, but the fix is a single number change.
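The arithmetic is simple once you know the offset. Proxmox's default /etc/subuid range starts at 100000, so the host-side UID is just the container UID plus 100000 (a throwaway helper to illustrate; check /etc/subuid if you've customized the range):

```shell
# Host-side UID for a given container UID in an unprivileged LXC.
# 100000 is Proxmox's default subuid offset; yours may differ.
map_uid() {
  offset=${2:-100000}
  echo $(( offset + $1 ))
}

map_uid 0       # container root -> 100000 on the host
map_uid 1000    # service user   -> 101000 (the number the chown needs)
```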
Monitoring Stack
I wanted to know when things were going wrong before users (me) noticed.
The stack is Prometheus + Grafana + Alertmanager, each in their own LXC:
- node_exporter on the Proxmox host collects CPU, RAM, disk, and network metrics
- PVE Exporter collects per-LXC and per-VM resource usage
- Scraparr exports metrics from Sonarr, Radarr, Prowlarr, and Jellyseerr
- A custom bash script exports SMART drive health metrics via node_exporter's textfile collector
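The custom SMART script boils down to two things: writing Prometheus text-format lines, and renaming the output file atomically so a scrape never catches a half-written file. A trimmed-down sketch (device list, paths, and the -d sat flag are illustrative; adjust for your drives):

```shell
#!/bin/sh
# Textfile-collector sketch: writes custom_smart_* gauges for node_exporter.
TEXTFILE_DIR=${TEXTFILE_DIR:-/var/lib/prometheus/node-exporter}
OUT="$TEXTFILE_DIR/custom_smart.prom"
TMP="$OUT.$$"

# One Prometheus text-format gauge line: name{disk="..."} value
emit() { printf '%s{disk="%s"} %s\n' "$1" "$2" "$3"; }

collect() {
  for dev in /dev/sda /dev/sdb; do
    # USB-attached drives behind a SATA bridge often need -d sat to respond.
    if smartctl -d sat -H "$dev" 2>/dev/null | grep -q PASSED; then
      emit custom_smart_healthy "$dev" 1
    else
      emit custom_smart_healthy "$dev" 0
    fi
  done > "$TMP" && mv "$TMP" "$OUT"  # atomic rename: scrapes never see a partial file
}

# Only run the collection when smartctl is actually available.
command -v smartctl >/dev/null 2>&1 && collect
```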
Alert rules cover:
- Any LXC or VM going offline
- Disk filling up (warning at 85%, critical at 95%)
- Drive SMART health failure
- Drive temperature exceeding 55°C
- DAS storage filling up
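For reference, the disk-usage rule is the standard node_exporter filesystem expression. The threshold mirrors the 85% warning above; the mountpoint, `for:` duration, and label names are illustrative, not my exact rule file:

```yaml
groups:
  - name: storage
    rules:
      - alert: DasFillingUp
        expr: |
          1 - (node_filesystem_avail_bytes{mountpoint="/mnt/data"}
               / node_filesystem_size_bytes{mountpoint="/mnt/data"}) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DAS usage above 85%"
```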
The Phantom Alert Bug
The SMART monitoring took two weeks to get right because of a subtle conflict I didn't know existed.
When you install prometheus-node-exporter on Debian via apt, it ships with a built-in SMART monitoring service: prometheus-node-exporter-smartmon.service and its accompanying systemd timer. This service runs every 15 minutes and writes SMART metrics to the same textfile directory I was using.
The problem: the built-in script couldn't read my USB-attached DAS drives. When it ran, it wrote 0 for all health metrics — overwriting my correct values.
My cron job ran every minute and would fix it within 60 seconds. But Prometheus scrapes every 30 seconds. So there was a brief window every 15 minutes where Prometheus might scrape during the "bad" state. With my alert rule set to fire after 20 minutes of sustained failure, it took several consecutive bad scrapes to trigger — which happened a few times per night.
The fix was renaming my metrics from smartmon_* to custom_smart_* and writing to a differently named file. The conflict disappeared immediately.
Backup Strategy
I run three backup tiers:
Daily local backups - Docker configs and LXC configuration files are tarballed to the DAS. If the OS SSD dies, the configs are still on the DAS.
Weekly Proxmox snapshots - VZDump snapshots of all LXCs to the DAS. Full restore from these takes about 15 minutes.
Weekly offsite backups - rclone syncs config files to Cloudflare R2. The 10GB free tier is more than enough for text configs. This survives the house burning down scenario.
The offsite backup is config files only. At 7.3TB, media isn't worth backing up remotely. The ARR stack can rebuild the library automatically from scratch.
I monitor R2 bucket size in Grafana with custom Prometheus metrics and have alerts set at 8GB (warning) and 9GB (critical) so I never accidentally hit the free tier limit.
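The offsite leg is a small script around rclone sync (remote name, bucket, and source path below are placeholders). One detail worth copying: refuse to sync if the source looks empty, because `rclone sync` deletes remote files that are missing locally, and a DAS that failed to mount would otherwise wipe the bucket:

```shell
#!/bin/sh
# Offsite sketch: push the config directory to R2; rclone only transfers deltas.
SRC=${SRC:-/mnt/data/backups/configs}
DEST=${DEST:-r2:homelab-configs}   # "r2" = rclone remote configured for R2

# Count regular files under a directory (0 if it doesn't exist).
count_files() { find "$1" -type f 2>/dev/null | wc -l; }

# Guard against an empty/missing source before letting sync delete anything.
if [ "$(count_files "$SRC")" -gt 0 ] && command -v rclone >/dev/null 2>&1; then
  rclone sync "$SRC" "$DEST" --transfers 4
fi
```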
Infrastructure as Code with Ansible
Once the homelab was running I realized I had no reproducible way to rebuild it. Everything was manual commands I half-remembered.
I wrote an Ansible project to fix this:
Dynamic inventory - instead of maintaining a static list of hosts, Ansible queries the Proxmox API directly. Every LXC automatically appears in inventory. New LXC created? It's immediately manageable. LXC destroyed? Gone from inventory.
Base role - applied to every LXC on creation: SSH hardening, standard packages, timezone, MOTD, unattended upgrades. One command hardens 15 machines simultaneously.
Provisioning playbook - spins up a new LXC from scratch. Change six variables and run one command. The new LXC is created, started, SSH key injected, packages installed, and ready to use.
Update playbook - runs apt upgrade across all 15 machines in parallel. What used to take 20 minutes of typing in commands now takes one command and 90 seconds.
GitHub Actions CI - every push to the Ansible repo triggers yamllint and ansible-lint automatically. The pipeline fails if any playbook has syntax errors or style violations.
The Ansible repo: github.com/NicolaiMatthew/homelab-ansible
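The dynamic inventory is Ansible's community.general.proxmox inventory plugin pointed at the API. A sketch of the inventory file (hostname and token names are placeholders):

```yaml
# inventory/proxmox.yml
plugin: community.general.proxmox
url: https://pve.example.lan:8006   # placeholder hostname
user: ansible@pve
token_id: ansible-token             # API token, not a password
token_secret: "{{ lookup('env', 'PROXMOX_TOKEN_SECRET') }}"
want_facts: true
```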
Public Access via Cloudflare Tunnel
I wanted to access Jellyfin remotely without opening ports on my router or exposing my home IP.
Cloudflare Tunnel creates an outbound-only encrypted connection from my server to Cloudflare's edge. Traffic flows:
Internet → Cloudflare (SSL terminated) → Tunnel → Nginx Proxy Manager → Jellyfin
No ports open on my router (yay). Cloudflare handles DDoS protection, SSL certificates, and bot filtering for free.
One gotcha that cost me an hour of debugging: Nginx Proxy Manager was set to Force SSL which caused an infinite redirect loop. Cloudflare already terminates SSL before the request reaches my network — NPM doesn't need to force it again. Disabling Force SSL in NPM fixed it immediately.
Another gotcha: cloudflared runs inside the NPM LXC and needs to connect to NPM via localhost:80, not via the LXC's external IP. LXCs can't connect to their own external IP from within themselves, which is by design.
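For reference, the tunnel's ingress config looks roughly like this (tunnel ID and hostname are placeholders), with the service pointing at localhost for exactly that reason:

```yaml
# /etc/cloudflared/config.yml
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/<tunnel-id>.json
ingress:
  - hostname: jellyfin.example.com   # placeholder domain
    service: http://localhost:80     # NPM via localhost, not the LXC's external IP
  - service: http_status:404         # catch-all for unmatched hostnames
```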
What I Learned
Isolation is worth the complexity. Running 15 LXCs instead of one big Docker host adds overhead but has paid off repeatedly. Services can be restarted, updated, and debugged independently.
Document as you go. My Obsidian vault syncs to GitHub automatically. Every service has a markdown doc covering the setup, decisions, and gotchas. When something breaks at 2am, future me is grateful.
The bug is rarely where you're looking. The SMART monitoring conflict, the Docker iptables blocking NPM connections, the missing gateway on a newly created LXC - all of these were caused by something outside the thing I was debugging. Check your assumptions first.
Automate the maintenance, not just the deployment. The update playbook gets used weekly. The provisioning playbook has been used a handful of times. Automation that saves you time on recurring tasks has more value than automation that saves time on one-time setups.
The Stack at a Glance
| Service | Purpose |
|---|---|
| Proxmox VE | Hypervisor |
| Tailscale | VPN + remote access |
| Gluetun + qBittorrent | Torrenting behind VPN kill switch |
| Prowlarr + FlareSolverr | Indexer aggregation |
| Sonarr + Radarr + Bazarr | Media automation |
| Jellyfin | Media server (QuickSync HW transcode) |
| Jellyseerr | Request portal |
| Prometheus + Grafana + Alertmanager | Monitoring and alerting |
| Homepage | Unified dashboard |
| n8n | Workflow automation |
| Ansible + GitHub Actions | Infrastructure as code + CI |
| Cloudflare Tunnel + NPM | Public access without open ports |
| Cloudflare R2 + rclone | Offsite config backups |
What's Next
- RHCSA certification — everything in this build is RHCSA material
- Kubernetes — migrate some services to k3s for the learning experience
- Paperless-ngx — document management to go with the homelab theme of replacing cloud services
The full documentation lives at github.com/NicolaiMatthew/homelab.
If you're thinking about building something similar, the Lenovo ThinkCentre M series (M720q, M920q, M920x) is the go-to recommendation for budget homelabs: low power, quiet, capable, and cheap on the used market. The community-scripts project for Proxmox makes spinning up new services trivial.