Steven J. Vik

I ran incident response on my own homelab. Here's the postmortem.

I run a 3-node Proxmox cluster at home with 11 LXC containers. Last week one of them turned into an incident.

Not a dramatic one. No data loss. No outage that affected anyone else. But it hit the same failure modes I see documented in enterprise postmortems — and handling it the same way taught me more than any homelab YouTube video has.

Here's what happened and what I changed.


The incident

00:47 — My homelab control panel stops responding. The web UI that ties together monitoring, service status, and agent health is down.

00:47–01:09 — PM2 restarts the service. Then restarts it again. 32 times total, with exponential backoff, over about 22 minutes.

01:09 — Prometheus alert fires. Wazuh catches the anomaly in PM2 process metrics. I get paged.

01:11 — I SSH in. pm2 logs sjvik-control-panel shows the immediate cause: Cannot find module tsx. The package is gone from node_modules.

01:13 — npm install && pm2 restart sjvik-control-panel. Service is back.

Total downtime from first failure to recovery: 26 minutes. Time from alert to recovery: 4 minutes.


The postmortem

If you've done incident response at work, this format will look familiar. I use it at home too, not as bureaucratic overhead but because it forces honest thinking.

What failed: tsx missing from node_modules on LXC 101 (nx-web-01). Root cause unknown — most likely a stale node_modules state after a system update or partial npm ci run that didn't complete cleanly.

What detected it: External monitoring (Prometheus + pm2-prometheus-exporter tracking restart counts per process). NOT the service itself.

What slowed detection: The 22-minute gap between first failure and alert. PM2's default backoff delays mean a fast-dying service doesn't trip alerts immediately. I had no alert threshold on restart rate — only on sustained high restart counts.

What enabled fast recovery: The service is stateless. No data to recover. npm install is idempotent. Recovery procedure existed in muscle memory.

What didn't exist: A startup health check that validates dependencies before PM2 marks the service alive. The service was failing fast, not failing safe.


The fixes

1. Prestart dependency validation

Added to the scripts block of package.json:

"prestart": "npm ls --depth=0 --silent || (echo 'Dependency validation failed — run npm install' && exit 1)"

Now on restart, PM2 gets a clean exit with an actionable message. Restart 1 tells me what's wrong. Not restart 32.
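
One wrinkle worth knowing: npm lifecycle hooks like prestart only run when the process is launched through npm. If PM2 points straight at a script file, the hook never fires. A minimal ecosystem sketch that accounts for this (the name and cwd match the runbook below; max_restarts and the backoff delay are illustrative values, not necessarily what I run):

// ecosystem.config.js
module.exports = {
  apps: [{
    name: "sjvik-control-panel",
    cwd: "/root/projects/sjvik-control-panel",
    // launch via npm so the prestart hook actually runs
    script: "npm",
    args: "start",
    // give up after a handful of failed starts instead of looping 32 times
    max_restarts: 5,
    // back off exponentially between restarts, starting at 100ms
    exp_backoff_restart_delay: 100,
  }],
};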

2. Alert on restart rate, not just count

Updated the Prometheus alert rule:

- alert: PM2ServiceRestartRateSpiking
  expr: rate(pm2_restarts_total[5m]) > 0.1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.name }} restarting frequently"

This fires after about two minutes of a sustained restart loop instead of waiting for a cumulative count threshold. In plain terms, the expression means more than roughly one restart every ten seconds, averaged over the five-minute window.

3. Runbook entry

I added a one-page runbook to my Obsidian vault:

Service: sjvik-control-panel
Recovery: cd /root/projects/sjvik-control-panel && npm install && pm2 restart sjvik-control-panel
Verify: curl -s http://localhost:3456/health | jq .status
Escalate if: health endpoint returns non-200 after npm install

Runbooks feel like overkill for a homelab until 2am when you're tired and need the answer fast.
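
For completeness: the verify step assumes the service exposes a /health endpoint with a status field for jq to read. A minimal sketch of that endpoint, assuming an Express app on port 3456 as in the runbook (the real control panel's routes may differ):

// health endpoint sketch (TypeScript, assuming Express)
import express from "express";

const app = express();

// Return 200 with a status field so `curl ... | jq .status` prints "ok"
app.get("/health", (_req, res) => {
  res.status(200).json({ status: "ok", uptime: process.uptime() });
});

app.listen(3456);

Anything other than a 200 from this endpoint after npm install is the runbook's escalation condition.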


Why homelab IR matters

The enterprise version of this incident would involve a runbook, a Slack war room, a timeline in PagerDuty, and a written postmortem shared with the team. The homelab version is you, alone at 1am, fixing something with nobody watching.

But the thinking is the same.

  • Detection time matters. Alert on behavior, not just state.
  • Recovery time improves with runbooks. Write them while you still remember what you did.
  • Postmortems find gaps that didn't feel like gaps until something broke.

I have 11 containers. At least a few of them will have incidents. Treating each one as a real IR exercise is how I actually get better at this — not just at homelab ops, but at the SOC analyst work I do professionally.

The next time I'm in an enterprise war room following an incident response process, I'll have done it 15 times already on my own infrastructure.


Running a homelab security stack? My Proxmox Homelab guide covers the full setup — Proxmox cluster, PBS backups, Wazuh SIEM, and the PM2 service patterns that kept this recoverable in under 5 minutes.
