Steven J. Vik

I ran incident response on my own homelab. Here's the postmortem.

I run a 3-node Proxmox cluster at home with 11 LXC containers. Last week one of them turned into an incident.

Not a dramatic one. No data loss. No outage that affected anyone else. But it hit the same failure modes I see documented in enterprise postmortems — and handling it the same way taught me more than any homelab YouTube video has.

Here's what happened and what I changed.


The incident

00:47 — My homelab control panel stops responding. The web UI that ties together monitoring, service status, and agent health is down.

00:47–01:09 — PM2 restarts the service. Then restarts it again. 32 times total, with exponential backoff, over about 22 minutes.

01:09 — Prometheus alert fires. Wazuh catches the anomaly in PM2 process metrics. I get paged.

01:11 — I SSH in. pm2 logs sjvik-control-panel shows the immediate cause: Cannot find module tsx. The package is gone from node_modules.

01:13 — npm install && pm2 restart sjvik-control-panel. Service is back.

Total downtime from first failure to recovery: 26 minutes. Time from alert to recovery: 4 minutes.


The postmortem

If you've done incident response at work, this format will look familiar. I use it at home too, not as bureaucratic overhead but because it forces honest thinking.

What failed: tsx missing from node_modules on LXC 101 (nx-web-01). Root cause unknown — most likely a stale node_modules state after a system update or partial npm ci run that didn't complete cleanly.

What detected it: External monitoring (Prometheus + pm2-prometheus-exporter tracking restart counts per process). NOT the service itself.

What slowed detection: The 22-minute gap between first failure and alert. PM2's default backoff delays mean a fast-dying service doesn't trip alerts immediately. I had no alert threshold on restart rate — only on sustained high restart counts.

What enabled fast recovery: The service is stateless. No data to recover. npm install is idempotent. Recovery procedure existed in muscle memory.

What didn't exist: A startup health check that validates dependencies before PM2 marks the service alive. The service was failing fast, not failing safe.


The fixes

1. Prestart dependency validation

Added to the scripts block of package.json:

"prestart": "npm ls --depth=0 --silent || (echo 'Dependency validation failed — run npm install' && exit 1)"

Now on restart, PM2 gets a clean exit with an actionable message. Restart 1 tells me what's wrong. Not restart 32.
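
One wrinkle worth knowing: npm lifecycle hooks like prestart only run when the process is launched through npm. If PM2 points straight at a script file, the hook never fires. A minimal ecosystem sketch that accounts for this (the name and cwd match the runbook below; max_restarts and the backoff delay are illustrative values, not necessarily what I run):

// ecosystem.config.js
module.exports = {
  apps: [{
    name: "sjvik-control-panel",
    cwd: "/root/projects/sjvik-control-panel",
    // launch via npm so the prestart hook actually runs
    script: "npm",
    args: "start",
    // give up after a handful of failed starts instead of looping 32 times
    max_restarts: 5,
    // back off exponentially between restarts, starting at 100ms
    exp_backoff_restart_delay: 100,
  }],
};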

2. Alert on restart rate, not just count

Updated the Prometheus alert rule:

- alert: PM2ServiceRestartRateSpiking
  expr: rate(pm2_restarts_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.name }} restarting frequently"

This fires after about two minutes of a sustained restart loop instead of waiting for a cumulative count threshold. In plain terms, the expression means more than roughly one restart every ten seconds, averaged over the five-minute window.

3. Runbook entry

I added a one-page runbook to my Obsidian vault:

Service: sjvik-control-panel
Recovery: cd /root/projects/sjvik-control-panel && npm install && pm2 restart sjvik-control-panel
Verify: curl -s http://localhost:3456/health | jq .status
Escalate if: health endpoint returns non-200 after npm install

Runbooks feel like overkill for a homelab until 2am when you're tired and need the answer fast.
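
For completeness: the verify step assumes the service exposes a /health endpoint with a status field for jq to read. A minimal sketch of that endpoint, assuming an Express app on port 3456 as in the runbook (the real control panel's routes may differ):

// health endpoint sketch (TypeScript, assuming Express)
import express from "express";

const app = express();

// Return 200 with a status field so `curl ... | jq .status` prints "ok"
app.get("/health", (_req, res) => {
  res.status(200).json({ status: "ok", uptime: process.uptime() });
});

app.listen(3456);

Anything other than a 200 from this endpoint after npm install is the runbook's escalation condition.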


Why homelab IR matters

The enterprise version of this incident would involve a runbook, a Slack war room, a timeline in PagerDuty, and a written postmortem shared with the team. The homelab version is you, alone at 1am, fixing something with nobody watching.

But the thinking is the same.

  • Detection time matters. Alert on behavior, not just state.
  • Recovery time improves with runbooks. Write them while you still remember what you did.
  • Postmortems find gaps that didn't feel like gaps until something broke.

I have 11 containers. At least a few of them will have incidents. Treating each one as a real IR exercise is how I actually get better at this — not just at homelab ops, but at the SOC analyst work I do professionally.

The next time I'm in an enterprise war room following an incident response process, I'll have done it 15 times already on my own infrastructure.


Running a homelab security stack? My Proxmox Homelab guide covers the full setup — Proxmox cluster, PBS backups, Wazuh SIEM, and the PM2 service patterns that kept this recoverable in under 5 minutes.
