<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Steven J. Vik</title>
    <description>The latest articles on DEV Community by Steven J. Vik (@stevenjvik).</description>
    <link>https://dev.to/stevenjvik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3831709%2Fb438618e-1010-4e61-af60-2c8b6ccad019.png</url>
      <title>DEV Community: Steven J. Vik</title>
      <link>https://dev.to/stevenjvik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stevenjvik"/>
    <language>en</language>
    <item>
      <title>How I Added SIEM to My Homelab With Wazuh — and What It Found on Day One</title>
      <dc:creator>Steven J. Vik</dc:creator>
      <pubDate>Wed, 01 Apr 2026 15:58:11 +0000</pubDate>
      <link>https://dev.to/stevenjvik/how-i-added-siem-to-my-homelab-with-wazuh-and-what-it-found-on-day-one-cll</link>
      <guid>https://dev.to/stevenjvik/how-i-added-siem-to-my-homelab-with-wazuh-and-what-it-found-on-day-one-cll</guid>
      <description>&lt;p&gt;I've been running Grafana and Prometheus on my homelab for about a year. CPU usage, RAM, disk, container uptime — the usual infrastructure metrics. I thought that was monitoring.&lt;/p&gt;

&lt;p&gt;Then I deployed Wazuh and found out I had no idea what was happening on my network.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Wazuh Is
&lt;/h2&gt;

&lt;p&gt;Wazuh is an open-source SIEM (Security Information and Event Management) and XDR platform. It collects logs and events from agents you deploy on your systems, runs them through detection rules, and alerts you when something looks wrong. It's the same class of tool that security teams use in production environments — and it's free.&lt;/p&gt;

&lt;p&gt;The key mental model: Grafana asks "is this working?" Wazuh asks "is this being abused?" You need both questions answered.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I'm running a 3-node Proxmox VE 8.x cluster. Wazuh 4.9.2 all-in-one lives in LXC 107 on my main node (nx-core-01). Container specs: 4 vCPU, 8GB RAM, 50GB disk. Wazuh is memory-hungry — don't go below 6GB or the indexer will struggle.&lt;/p&gt;

&lt;p&gt;The all-in-one install script handles the indexer, server, and dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sO&lt;/span&gt; https://packages.wazuh.com/4.9/wazuh-install.sh
bash wazuh-install.sh &lt;span class="nt"&gt;-a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 15 minutes. Dashboard on HTTPS port 443. Default credentials are in &lt;code&gt;/home/admin/.wazuh-install-files/wazuh-passwords.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One gotcha for LXC: Wazuh needs to read system audit logs. Some LXC security profiles block this. If ossec-logcollector can't read &lt;code&gt;/var/log/audit/audit.log&lt;/code&gt;, check your container's capability settings.&lt;/p&gt;
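A quick way to check whether that's your problem, assuming the container numbering above (adjust the ID for your cluster):

```shell
# Inside the LXC container: can the audit log be read at all?
head -n 1 /var/log/audit/audit.log || echo "audit log not readable"

# On the Proxmox host: look for dropped capabilities or a restrictive
# AppArmor profile in the container's config (ID 107, as in the setup above)
grep -E 'cap\.drop|apparmor' /etc/pve/lxc/107.conf
```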

&lt;h2&gt;
  
  
  Enrolling Agents
&lt;/h2&gt;

&lt;p&gt;I enrolled 7 agents: the main Proxmox node plus the LXC containers I care most about — the web server, Traefik, the monitoring stack, uptime-kuma, and Garrison (my AI orchestrator).&lt;/p&gt;

&lt;p&gt;On each agent host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://packages.wazuh.com/key/GPG-KEY-WAZUH | apt-key add -
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb https://packages.wazuh.com/4.x/apt/ stable main"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/apt/sources.list.d/wazuh.list
apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install &lt;/span&gt;wazuh-agent

&lt;span class="nv"&gt;WAZUH_MANAGER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;wazuh-ip&amp;gt;"&lt;/span&gt; &lt;span class="nv"&gt;WAZUH_AGENT_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; wazuh-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
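Before moving on, confirm each agent actually registered. A quick sketch using the standard Wazuh install paths:

```shell
# On the agent: service up, and a successful connection logged
systemctl status wazuh-agent --no-pager
grep -i 'connected to the server' /var/ossec/logs/ossec.log | tail -n 1

# On the manager: list enrolled agents and their status
/var/ossec/bin/agent_control -l
```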



&lt;h2&gt;
  
  
  Custom Detection Rules
&lt;/h2&gt;

&lt;p&gt;Wazuh ships with thousands of built-in rules. I added custom rules for things specific to my setup in &lt;code&gt;/var/ossec/etc/rules/local_rules.xml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- SSH brute force — T1110 --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;rule&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"100001"&lt;/span&gt; &lt;span class="na"&gt;level=&lt;/span&gt;&lt;span class="s"&gt;"10"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;if_group&amp;gt;&lt;/span&gt;syslog&lt;span class="nt"&gt;&amp;lt;/if_group&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;match&amp;gt;&lt;/span&gt;pam_unix.*authentication failure&lt;span class="nt"&gt;&amp;lt;/match&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;same_source_ip&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;frequency&amp;gt;&lt;/span&gt;5&lt;span class="nt"&gt;&amp;lt;/frequency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;timeframe&amp;gt;&lt;/span&gt;120&lt;span class="nt"&gt;&amp;lt;/timeframe&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;Multiple SSH auth failures from same IP&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;mitre&amp;gt;&amp;lt;id&amp;gt;&lt;/span&gt;T1110&lt;span class="nt"&gt;&amp;lt;/id&amp;gt;&amp;lt;/mitre&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- Root SSH login — should never happen --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;rule&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"100002"&lt;/span&gt; &lt;span class="na"&gt;level=&lt;/span&gt;&lt;span class="s"&gt;"15"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;if_sid&amp;gt;&lt;/span&gt;5715&lt;span class="nt"&gt;&amp;lt;/if_sid&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;match&amp;gt;&lt;/span&gt;^Accepted.*root@&lt;span class="nt"&gt;&amp;lt;/match&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;Root login via SSH detected&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;mitre&amp;gt;&amp;lt;id&amp;gt;&lt;/span&gt;T1078&lt;span class="nt"&gt;&amp;lt;/id&amp;gt;&amp;lt;/mitre&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
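Custom rules are easy to get subtly wrong, so replay a sample log line before trusting them. Wazuh ships an interactive tester on the manager:

```shell
# Restart the manager so local_rules.xml is loaded
systemctl restart wazuh-manager

# Paste a sample auth-failure line into the interactive tester and check
# which rule ID fires and at what level
/var/ossec/bin/wazuh-logtest
```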



&lt;p&gt;Mapping rules to MITRE ATT&amp;amp;CK IDs is worth the extra minute. It surfaces in the Wazuh dashboard and makes it easier to reason about what class of attack you're detecting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Email Alerts
&lt;/h2&gt;

&lt;p&gt;Wazuh can send email alerts directly. I use msmtp as a relay with a Gmail app password. In &lt;code&gt;ossec.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;email_notification&amp;gt;&lt;/span&gt;yes&lt;span class="nt"&gt;&amp;lt;/email_notification&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;smtp_server&amp;gt;&lt;/span&gt;localhost&lt;span class="nt"&gt;&amp;lt;/smtp_server&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;email_from&amp;gt;&lt;/span&gt;wazuh@homelab.local&lt;span class="nt"&gt;&amp;lt;/email_from&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;email_to&amp;gt;&lt;/span&gt;you@yourdomain.com&lt;span class="nt"&gt;&amp;lt;/email_to&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;email_maxperhour&amp;gt;&lt;/span&gt;12&lt;span class="nt"&gt;&amp;lt;/email_maxperhour&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;email_alert_level&amp;gt;&lt;/span&gt;10&lt;span class="nt"&gt;&amp;lt;/email_alert_level&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
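After the config change, restart the manager and watch for SMTP errors. This assumes msmtp is already relaying on localhost:

```shell
systemctl restart wazuh-manager

# Any mail delivery problems show up here quickly
tail -f /var/ossec/logs/ossec.log | grep -i mail
```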



&lt;p&gt;Level 10+ fires an email. Root login would be level 15 — that's a middle-of-the-night wake-up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Found on Day One
&lt;/h2&gt;

&lt;p&gt;This is the part that made me feel like I'd been flying blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH brute-force from 4 external IPs.&lt;/strong&gt; This was ongoing. My SSH is on the default port (yes, I know — now I've moved it) and there were continuous auth attempts from IPs in Ukraine, China, and Romania. None of it showed up in Grafana: it wasn't breaking anything, so nothing I had was ever going to tell me about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A container service in a restart loop.&lt;/strong&gt; One of my LXC containers had a service that was failing and restarting every few minutes. The container showed as "running" in Uptime Kuma. The service itself was not. Wazuh caught it in the syslog stream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two containers with outdated packages.&lt;/strong&gt; Wazuh's vulnerability detection module (fed by the syscollector package inventory) flags installed packages with known CVEs. Both flagged containers had vulnerable versions. Neither was critical, but I wouldn't have known without a scan.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana + Wazuh: Better Together
&lt;/h2&gt;

&lt;p&gt;This isn't an either/or. Grafana tells me if my services are healthy. Wazuh tells me if my systems are being probed, if accounts are being misused, if files are changing unexpectedly.&lt;/p&gt;

&lt;p&gt;The combined setup is a proper monitoring stack for a homelab that's exposed to the internet — even partially.&lt;/p&gt;

&lt;p&gt;If you're running self-hosted infrastructure and haven't added a SIEM layer, Wazuh is the most accessible path to getting there. The all-in-one install makes it genuinely feasible without a dedicated security team.&lt;/p&gt;

&lt;p&gt;I've documented the full Proxmox cluster setup, monitoring stack, and security hardening approach at &lt;a href="https://sjvik-labs.stevenjvik.tech/guides" rel="noopener noreferrer"&gt;sjvik-labs.stevenjvik.tech/guides&lt;/a&gt; if you want to go deeper.&lt;/p&gt;

</description>
      <category>homelab</category>
      <category>cybersecurity</category>
      <category>selfhosted</category>
      <category>linux</category>
    </item>
    <item>
      <title>I ran incident response on my own homelab. Here's the postmortem.</title>
      <dc:creator>Steven J. Vik</dc:creator>
      <pubDate>Wed, 01 Apr 2026 15:57:07 +0000</pubDate>
      <link>https://dev.to/stevenjvik/i-ran-incident-response-on-my-own-homelab-heres-the-postmortem-366h</link>
      <guid>https://dev.to/stevenjvik/i-ran-incident-response-on-my-own-homelab-heres-the-postmortem-366h</guid>
      <description>&lt;p&gt;I run a 3-node Proxmox cluster at home with 11 LXC containers. Last week one of them turned into an incident.&lt;/p&gt;

&lt;p&gt;Not a dramatic one. No data loss. No outage that affected anyone else. But it hit the same failure modes I see documented in enterprise postmortems — and handling it the same way taught me more than any homelab YouTube video has.&lt;/p&gt;

&lt;p&gt;Here's what happened and what I changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The incident
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;00:47&lt;/strong&gt; — My homelab control panel stops responding. The web UI that ties together monitoring, service status, and agent health is down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;00:47–01:09&lt;/strong&gt; — PM2 restarts the service. Then restarts it again. 32 times total, with exponential backoff, over about 22 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01:09&lt;/strong&gt; — Prometheus alert fires. Wazuh catches the anomaly in PM2 process metrics. I get paged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01:11&lt;/strong&gt; — I SSH in. &lt;code&gt;pm2 logs sjvik-control-panel&lt;/code&gt; shows the immediate cause: &lt;code&gt;Cannot find module tsx&lt;/code&gt;. The package is gone from node_modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01:13&lt;/strong&gt; — &lt;code&gt;npm install &amp;amp;&amp;amp; pm2 restart sjvik-control-panel&lt;/code&gt;. Service is back.&lt;/p&gt;

&lt;p&gt;Total downtime from first failure to recovery: 26 minutes. Time from alert to recovery: 4 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The postmortem
&lt;/h2&gt;

&lt;p&gt;If you've done incident response at work, this format will look familiar. I use it at home too — not because it's bureaucratic overhead, but because it forces honest thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What failed:&lt;/strong&gt; &lt;code&gt;tsx&lt;/code&gt; missing from node_modules on LXC 101 (nx-web-01). Root cause unknown — most likely a stale node_modules state after a system update or partial &lt;code&gt;npm ci&lt;/code&gt; run that didn't complete cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What detected it:&lt;/strong&gt; External monitoring (Prometheus + pm2-prometheus-exporter tracking restart counts per process). NOT the service itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What slowed detection:&lt;/strong&gt; The 22-minute gap between first failure and alert. PM2's default backoff delays mean a fast-dying service doesn't trip alerts immediately. I had no alert threshold on restart &lt;em&gt;rate&lt;/em&gt; — only on sustained high restart counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What enabled fast recovery:&lt;/strong&gt; The service is stateless. No data to recover. &lt;code&gt;npm install&lt;/code&gt; is idempotent. Recovery procedure existed in muscle memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What didn't exist:&lt;/strong&gt; A startup health check that validates dependencies &lt;em&gt;before&lt;/em&gt; PM2 marks the service alive. The service was failing fast, not failing safe.&lt;/p&gt;




&lt;h2&gt;
  
  
  The fixes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Prestart dependency validation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Added to &lt;code&gt;package.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"prestart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npm ls --depth=0 --silent || (echo 'Dependency validation failed — run npm install' &amp;amp;&amp;amp; exit 1)"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now on restart, PM2 gets a clean exit with an actionable message. Restart 1 tells me what's wrong. Not restart 32.&lt;/p&gt;
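PM2 can also cap the loop itself. A sketch using PM2's documented restart-strategy flags; the entry file name and values here are illustrative, not my exact config:

```shell
# Stop after 5 rapid restarts instead of hammering away all night,
# and back off between attempts (delays grow from 100ms per PM2's
# exponential backoff strategy)
pm2 start server.js --name sjvik-control-panel \
  --max-restarts 5 \
  --exp-backoff-restart-delay=100
```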

&lt;p&gt;&lt;strong&gt;2. Alert on restart &lt;em&gt;rate&lt;/em&gt;, not just count&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Updated the Prometheus alert rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PM2ServiceRestartRateSpiking&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(pm2_restarts_total[5m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarting&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;frequently"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fires within 2 minutes of a sustained restart loop instead of waiting for a count threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Runbook entry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I added a one-page runbook to my Obsidian vault:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Service: sjvik-control-panel
Recovery: cd /root/projects/sjvik-control-panel &amp;amp;&amp;amp; npm install &amp;amp;&amp;amp; pm2 restart sjvik-control-panel
Verify: curl -s http://localhost:3456/health | jq .status
Escalate if: health endpoint returns non-200 after npm install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Runbooks feel like overkill for a homelab until 2am when you're tired and need the answer fast.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why homelab IR matters
&lt;/h2&gt;

&lt;p&gt;The enterprise version of this incident would involve a runbook, a Slack war room, a timeline in PagerDuty, and a written postmortem shared with the team. The homelab version is: you fixing something alone at 1am with nobody watching.&lt;/p&gt;

&lt;p&gt;But the &lt;em&gt;thinking&lt;/em&gt; is the same.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detection time matters. Alert on behavior, not just state.&lt;/li&gt;
&lt;li&gt;Recovery time improves with runbooks. Write them while you still remember what you did.&lt;/li&gt;
&lt;li&gt;Postmortems find gaps that didn't feel like gaps until something broke.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have 11 containers. At least a few of them will have incidents. Treating each one as a real IR exercise is how I actually get better at this — not just at homelab ops, but at the SOC analyst work I do professionally.&lt;/p&gt;

&lt;p&gt;The next time I'm in an enterprise war room following an incident response process, I'll have done it 15 times already on my own infrastructure.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running a homelab security stack? My &lt;a href="https://stevenjvik.tech/guides" rel="noopener noreferrer"&gt;Proxmox Homelab guide&lt;/a&gt; covers the full setup — Proxmox cluster, PBS backups, Wazuh SIEM, and the PM2 service patterns that kept this recoverable in under 5 minutes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>homelab</category>
      <category>devops</category>
      <category>security</category>
      <category>linux</category>
    </item>
    <item>
      <title>I Ran 7 Autonomous AI Agents on My Homelab Proxmox Cluster — Here's What Actually Happened</title>
      <dc:creator>Steven J. Vik</dc:creator>
      <pubDate>Wed, 18 Mar 2026 16:08:10 +0000</pubDate>
      <link>https://dev.to/stevenjvik/i-ran-7-autonomous-ai-agents-on-my-homelab-proxmox-cluster-heres-what-actually-happened-43ka</link>
      <guid>https://dev.to/stevenjvik/i-ran-7-autonomous-ai-agents-on-my-homelab-proxmox-cluster-heres-what-actually-happened-43ka</guid>
      <description>&lt;p&gt;I've been building homelab infrastructure seriously for a few years — 3-node Proxmox cluster, 13 LXC containers, self-hosted everything. But last month I did something different: I deployed a self-hosted AI agent orchestration platform and gave it a task list. Seven autonomous agents, each waking on a schedule, each writing back to my Obsidian vault, each filing issues when something breaks.&lt;/p&gt;

&lt;p&gt;Here's what I actually learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The cluster: 3 nodes named "nexus." nx-core-01 handles most services (64GB RAM), nx-ai-01 runs Ollama for local inference (32GB, NVMe), nx-store-01 handles Samba shares and Proxmox Backup Server. All on Proxmox VE 8.x with corosync clustering.&lt;/p&gt;

&lt;p&gt;The agents run in LXC 109 on nx-core-01, using a platform called Paperclip. Each agent has a &lt;code&gt;HEARTBEAT.md&lt;/code&gt; — a plain-text checklist it follows every time it wakes up. That's it. No complex prompt engineering. Just: "Here's who you are. Here's what to check. Write the results here."&lt;/p&gt;

&lt;p&gt;The seven agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SRE&lt;/strong&gt; (every 3h): SSHes into all nodes, checks service health, files an issue if something is down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content&lt;/strong&gt; (daily): Reads my Obsidian vault for recent lab work, drafts LinkedIn/Reddit posts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Career&lt;/strong&gt; (daily): Queries my job-tracker SQLite DB, ranks opportunities by fit, summarizes to daily notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps&lt;/strong&gt; (daily): Audits GitHub repos, checks CI status, verifies nightly git pushes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics&lt;/strong&gt; (daily): Queries Prometheus + Grafana, writes a daily metrics digest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product&lt;/strong&gt; (daily): Audits Gumroad products, checks for new reviews or feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comms&lt;/strong&gt; (every 12h): Triages Gmail via Google Workspace API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Worked Immediately
&lt;/h2&gt;

&lt;p&gt;The SRE agent was the most reliable from day one. On the first run, it SSHed into all three nodes, pulled service status, noticed that LXC 107 (twingate-connector) was stopped, and filed a Paperclip issue with the container ID and status output. I didn't tell it about twingate. It just found it in the container list, noticed it was stopped while everything else was running, and flagged it.&lt;/p&gt;

&lt;p&gt;The career tracker integration was the pleasant surprise. I have a SQLite database of job postings I've scraped over time. The agent queries it with something like &lt;code&gt;SELECT * FROM jobs WHERE applied = 0 ORDER BY match_score DESC LIMIT 10&lt;/code&gt;, formats the top results into a readable summary, and writes it to my daily note. It turned a manual process I was doing twice a week into something that happens automatically — and surfaces opportunities I might have scrolled past.&lt;/p&gt;
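Here's the shape of that pipeline, runnable with the sqlite3 CLI against a throwaway database. The schema and rows are invented for illustration; only the query mirrors the one above:

```shell
db=$(mktemp)

# Invented sample data in the same shape as the job-tracker table
sqlite3 "$db" "CREATE TABLE jobs (title TEXT, applied INTEGER, match_score REAL);"
sqlite3 "$db" "INSERT INTO jobs VALUES ('SOC Analyst II', 0, 0.91), ('Platform Engineer', 1, 0.88), ('Detection Engineer', 0, 0.86);"

# Top unapplied postings, best match first
top=$(sqlite3 "$db" "SELECT title FROM jobs WHERE applied = 0 ORDER BY match_score DESC LIMIT 10;")
echo "$top"
rm "$db"
```

The agent then formats those rows into the daily note; the query itself is the whole trick.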

&lt;h2&gt;
  
  
  What Needed Fixing
&lt;/h2&gt;

&lt;p&gt;Session management was the first real problem.&lt;/p&gt;

&lt;p&gt;LLM-based agents maintain conversational context across turns. That's useful for a single session, but in a heartbeat model — where the agent wakes up, does work, and goes to sleep — stale context is actively harmful. An agent that "remembered" it already checked disk health yesterday would skip the check today. An agent that "remembered" a task was in-progress would try to continue work that had already been resolved.&lt;/p&gt;

&lt;p&gt;The fix: clear session IDs whenever agent instructions change. A fresh context every heartbeat. Stateless by design.&lt;/p&gt;

&lt;p&gt;The second problem was race conditions. Two agents would both see an unhandled task and try to work it simultaneously. The checkout system in Paperclip handles this (first agent to POST &lt;code&gt;/checkout&lt;/code&gt; wins, second gets a 409), but I had to make sure agents honored the 409 and moved on instead of retrying.&lt;/p&gt;
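The fix reduces to a small decision function. This is a hypothetical sketch, not Paperclip's actual client code; only the 200-wins / 409-skip contract comes from the platform:

```shell
# Map the checkout response code to an action: first POST wins (200);
# a 409 means another agent already claimed the task, so skip, don't retry
handle_checkout() {
  case "$1" in
    200) echo "claimed" ;;
    409) echo "skip" ;;
    *)   echo "error" ;;
  esac
}

handle_checkout 409   # prints "skip": a losing agent moves on
```

In practice the status code comes from the POST against the orchestrator's checkout endpoint; the important part is that 409 is a normal outcome, not a retryable error.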

&lt;p&gt;The third problem was scope creep. Agents without tight constraints would start doing "helpful" things outside their mandate — refactoring files they weren't asked to touch, commenting on issues that weren't assigned to them. The fix was explicit constraints in the HEARTBEAT.md: &lt;code&gt;MAY&lt;/code&gt; and &lt;code&gt;NEVER&lt;/code&gt; blocks that spell out what the agent can and can't do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Assessment
&lt;/h2&gt;

&lt;p&gt;This is not magic. It's infrastructure work with an LLM attached.&lt;/p&gt;

&lt;p&gt;The agents don't reason in any interesting way — they follow instructions and fill in the gaps with pattern matching. They're good at: structured tasks with clear outputs, pulling data from known sources, writing formatted summaries, and catching obvious anomalies.&lt;/p&gt;

&lt;p&gt;They're not good at: novel situations, tasks that require deep context they don't have, or anything where "good enough" isn't good enough.&lt;/p&gt;

&lt;p&gt;What's running reliably now: daily SRE health checks, content drafts, job alert summaries, GitHub audits. These are all tasks that were already defined processes — the agents just execute them faster and without me having to remember to do them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure Mindset Shift
&lt;/h2&gt;

&lt;p&gt;The thing that helped most was to stop thinking of these as "AI assistants" and start treating them as distributed processes with unreliable executors.&lt;/p&gt;

&lt;p&gt;You don't debug an agent by asking it what went wrong. You look at its output, trace the failure back to the instruction that produced it, and fix the instruction. Same as debugging any script.&lt;/p&gt;

&lt;p&gt;You don't trust an agent's memory. You design the system so memory doesn't matter — idempotent operations, fresh context per run, explicit state in files rather than agent recall.&lt;/p&gt;

&lt;p&gt;You don't scale by adding intelligence. You scale by tightening the feedback loop — better outputs → better prompts → tighter heartbeats.&lt;/p&gt;

&lt;p&gt;If you're building self-hosted infrastructure and want to go deeper on the cluster setup, monitoring stack, and LXC templates that underpin all of this — I've documented it in full at &lt;a href="https://sjvik-labs.stevenjvik.tech/guides" rel="noopener noreferrer"&gt;sjvik-labs.stevenjvik.tech/guides&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy to answer questions about specific components in the comments — Traefik config, PBS setup, Prometheus scrape targets, or the agent architecture.&lt;/p&gt;

</description>
      <category>homelab</category>
      <category>proxmox</category>
      <category>devops</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
