I run a 3-node Proxmox cluster at home with 11 LXC containers. Last week one of them turned into an incident.
Not a dramatic one. No data loss. No outage that affected anyone else. But it hit the same failure modes I see documented in enterprise postmortems — and handling it the same way taught me more than any homelab YouTube video has.
Here's what happened and what I changed.
The incident
00:47 — My homelab control panel stops responding. The web UI that ties together monitoring, service status, and agent health is down.
00:47–01:09 — PM2 restarts the service. Then restarts it again. 32 times total, with exponential backoff, over about 22 minutes.
01:09 — Prometheus alert fires. Wazuh catches the anomaly in PM2 process metrics. I get paged.
01:11 — I SSH in. pm2 logs sjvik-control-panel shows the immediate cause: Cannot find module tsx. The package is gone from node_modules.
01:13 — npm install && pm2 restart sjvik-control-panel. Service is back.
Total downtime from first failure to recovery: 26 minutes. Time from alert to recovery: 4 minutes.
The postmortem
If you've done incident response at work, this format will look familiar. I use it at home too — not because it's bureaucratic overhead, but because it forces honest thinking.
What failed: tsx missing from node_modules on LXC 101 (nx-web-01). Root cause unknown — most likely a stale node_modules state after a system update or partial npm ci run that didn't complete cleanly.
What detected it: External monitoring (Prometheus + pm2-prometheus-exporter tracking restart counts per process). NOT the service itself.
What slowed detection: The 22-minute gap between first failure and alert. PM2's default backoff delays mean a fast-dying service doesn't trip alerts immediately. I had no alert threshold on restart rate — only on sustained high restart counts.
What enabled fast recovery: The service is stateless. No data to recover. npm install is idempotent. Recovery procedure existed in muscle memory.
What didn't exist: A startup health check that validates dependencies before PM2 marks the service alive. The service was failing fast, not failing safe.
The fixes
1. Prestart dependency validation
Added to package.json:
```json
"scripts": {
  "prestart": "npm ls --depth=0 --silent || (echo 'Dependency validation failed — run npm install' && exit 1)"
}
```
Now on restart, PM2 gets a clean exit with an actionable message. Restart 1 tells me what's wrong. Not restart 32.
2. Alert on restart rate, not just count
Updated the Prometheus alert rule:
```yaml
- alert: PM2ServiceRestartRateSpiking
  expr: rate(pm2_restarts_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.name }} restarting frequently"
```
This fires within 2 minutes of a sustained restart loop instead of waiting for a cumulative count threshold. Note that rate() is per second, so 0.1 means roughly one restart every ten seconds sustained over the 5-minute window.
3. Runbook entry
I added a one-page runbook to my Obsidian vault:
```text
Service:     sjvik-control-panel
Recovery:    cd /root/projects/sjvik-control-panel && npm install && pm2 restart sjvik-control-panel
Verify:      curl -s http://localhost:3456/health | jq .status
Escalate if: health endpoint returns non-200 after npm install
```
Runbooks feel like overkill for a homelab until 2am when you're tired and need the answer fast.
Why homelab IR matters
The enterprise version of this incident would involve a runbook, a Slack war room, a timeline in PagerDuty, and a written postmortem shared with the team. The homelab version is: you fixing something alone at 1am with nobody watching.
But the thinking is the same.
- Detection time matters. Alert on behavior, not just state.
- Recovery time improves with runbooks. Write them while you still remember what you did.
- Postmortems find gaps that didn't feel like gaps until something broke.
I have 11 containers. At least a few of them will have incidents. Treating each one as a real IR exercise is how I actually get better at this — not just at homelab ops, but at the SOC analyst work I do professionally.
The next time I'm in an enterprise war room following an incident response process, I'll have done it 15 times already on my own infrastructure.
Running a homelab security stack? My Proxmox Homelab guide covers the full setup — Proxmox cluster, PBS backups, Wazuh SIEM, and the PM2 service patterns that kept this recoverable in under 5 minutes.