DEV Community

Steven J. Vik

I Ran 7 Autonomous AI Agents on My Homelab Proxmox Cluster — Here's What Actually Happened

I've been building homelab infrastructure seriously for a few years — 3-node Proxmox cluster, 13 LXC containers, self-hosted everything. But last month I did something different: I deployed a self-hosted AI agent orchestration platform and gave it a task list. Seven autonomous agents, each waking on a schedule, each writing back to my Obsidian vault, each filing issues when something breaks.

Here's what I actually learned.

The Setup

The cluster: three nodes on a "nexus" naming scheme. nx-core-01 handles most services (64GB RAM), nx-ai-01 runs Ollama for local inference (32GB RAM, NVMe storage), and nx-store-01 handles Samba shares and Proxmox Backup Server. All three run Proxmox VE 8.x with corosync clustering.

The agents run in LXC 109 on nx-core-01, using a platform called Paperclip. Each agent has a HEARTBEAT.md — a plain-text checklist it follows every time it wakes up. That's it. No complex prompt engineering. Just: "Here's who you are. Here's what to check. Write the results here."
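To make that concrete, here's the rough shape of a HEARTBEAT.md. This is an illustrative sketch based on the description above, not Paperclip's actual file format; the paths and checks are hypothetical:

```markdown
# HEARTBEAT.md — SRE agent

## Who you are
You are the SRE agent for the nexus cluster. You wake every 3 hours.

## What to check
1. SSH into nx-core-01, nx-ai-01, nx-store-01.
2. List LXC containers (`pct list`) and flag any that are stopped.
3. Check disk usage on each node; flag anything over 85%.

## Where to write
- Append results to today's note in the Obsidian vault.
- If something is down, file a Paperclip issue with the container ID
  and the status output.
```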

The seven agents:

  • SRE (every 3h): SSHes into all nodes, checks service health, files an issue if something is down
  • Content (daily): Reads my Obsidian vault for recent lab work, drafts LinkedIn/Reddit posts
  • Career (daily): Queries my job-tracker SQLite DB, ranks opportunities by fit, summarizes to daily notes
  • DevOps (daily): Audits GitHub repos, checks CI status, verifies nightly git pushes
  • Analytics (daily): Queries Prometheus + Grafana, writes a daily metrics digest
  • Product (daily): Audits Gumroad products, checks for new reviews or feedback
  • Comms (every 12h): Triages Gmail via Google Workspace API

What Worked Immediately

The SRE agent was the most reliable from day one. On the first run, it SSHed into all three nodes, pulled service status, noticed that LXC 107 (twingate-connector) was stopped, and filed a Paperclip issue with the container ID and status output. I didn't tell it about twingate. It just found it in the container list, noticed it was stopped while everything else was running, and flagged it.
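The anomaly detection here is mostly mechanical. A minimal sketch of the "notice a stopped container" step — assuming the agent captures `pct list` output over SSH (the SSH call itself is omitted; the helper names are mine, not Paperclip's):

```python
def parse_pct_list(output: str) -> list[dict]:
    """Parse `pct list` output (VMID / Status / Lock / Name columns)
    into a list of {vmid, status, name} dicts, skipping the header."""
    rows = []
    for line in output.strip().splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 3:
            rows.append({"vmid": parts[0], "status": parts[1], "name": parts[-1]})
    return rows

def find_stopped(output: str) -> list[dict]:
    """Return containers whose status is 'stopped' — candidates for an issue."""
    return [c for c in parse_pct_list(output) if c["status"] == "stopped"]
```

The agent doesn't need to know what twingate is; "stopped while siblings are running" is enough signal to file an issue.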

The career tracker integration was the pleasant surprise. I have a SQLite database of job postings I've scraped over time. The agent queries it with something like `SELECT * FROM jobs WHERE applied = 0 ORDER BY match_score DESC LIMIT 10`, formats the top results into a readable summary, and writes it to my daily note. It turned a manual process I was doing twice a week into something that happens automatically — and surfaces opportunities I might have scrolled past.
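The query-and-format step is a few lines of stdlib Python. A sketch, assuming a schema with `title`, `company`, `match_score`, and `applied` columns (the column names beyond the query quoted above are my guess):

```python
import sqlite3

def top_unapplied(conn: sqlite3.Connection, limit: int = 10) -> str:
    """Pull the highest-scoring unapplied jobs and format them as a
    markdown list suitable for appending to a daily note."""
    rows = conn.execute(
        "SELECT title, company, match_score FROM jobs "
        "WHERE applied = 0 ORDER BY match_score DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return "\n".join(f"- {t} @ {c} (score {s})" for t, c, s in rows)
```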

What Needed Fixing

Session management was the first real problem.

LLM-based agents maintain conversational context across turns. That's useful for a single session, but in a heartbeat model — where the agent wakes up, does work, and goes to sleep — stale context is actively harmful. An agent that "remembered" it already checked disk health yesterday would skip the check today. An agent that "remembered" a task was in-progress would try to continue work that had already been resolved.

The fix: clear session IDs whenever agent instructions change. A fresh context every heartbeat. Stateless by design.
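One way to make that automatic — sketched here as an assumption, not the platform's mechanism — is to derive the session key from the instructions themselves, so editing HEARTBEAT.md invalidates any cached session:

```python
import hashlib

def session_id(agent: str, instructions: str) -> str:
    """Derive a session key from the agent's current instructions.
    Any edit to HEARTBEAT.md changes the hash, which orphans the old
    session and forces a fresh context on the next heartbeat."""
    digest = hashlib.sha256(instructions.encode()).hexdigest()[:12]
    return f"{agent}-{digest}"
```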

The second problem was race conditions. Two agents would both see an unhandled task and try to work it simultaneously. The checkout system in Paperclip handles this (first agent to POST /checkout wins, second gets a 409), but I had to make sure agents honored the 409 and moved on instead of retrying.
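The agent-side logic for honoring the conflict is the important part. A minimal in-memory stand-in for the checkout semantics described above (the `POST /checkout` / 409 behavior is as the post describes it; the class and function names are mine):

```python
class CheckoutRegistry:
    """Stand-in for Paperclip's checkout endpoint: first claimant
    wins, every later claimant gets a 409 conflict."""

    def __init__(self):
        self._owners: dict[str, str] = {}

    def checkout(self, task_id: str, agent: str) -> int:
        if task_id in self._owners:
            return 409  # already claimed
        self._owners[task_id] = agent
        return 200

def try_work(registry: CheckoutRegistry, task_id: str, agent: str) -> str:
    """On 409, move on to the next task — do NOT retry the claim."""
    if registry.checkout(task_id, agent) == 409:
        return f"{agent}: task {task_id} taken, skipping"
    return f"{agent}: working {task_id}"
```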

The third problem was scope creep. Agents without tight constraints would start doing "helpful" things outside their mandate — refactoring files they weren't asked to touch, commenting on issues that weren't assigned to them. The fix was explicit constraints in the HEARTBEAT.md: MAY and NEVER blocks that spell out what the agent can and can't do.
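For illustration, a constraints block might look like this (hypothetical paths and rules, in the spirit of the fix rather than a copy of my actual files):

```markdown
## Constraints

MAY:
- Read any file under the vault's lab-notes folder
- File issues tagged `sre`
- Append to the daily note

NEVER:
- Modify files outside your designated output paths
- Comment on issues not assigned to you
- Restart a service without filing an issue first
```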

The Honest Assessment

This is not magic. It's infrastructure work with an LLM attached.

The agents don't reason in any interesting way — they follow instructions and fill in the gaps with pattern matching. They're good at: structured tasks with clear outputs, pulling data from known sources, writing formatted summaries, and catching obvious anomalies.

They're not good at: novel situations, tasks that require deep context they don't have, or anything where "good enough" isn't good enough.

What's running reliably now: daily SRE health checks, content drafts, job alert summaries, GitHub audits. These are all tasks that were already defined processes — the agents just execute them faster and without me having to remember to do them.

The Infrastructure Mindset Shift

The thing that helped most was to stop thinking of these as "AI assistants" and start thinking of them as distributed processes with unreliable executors.

You don't debug an agent by asking it what went wrong. You look at its output, trace the failure back to the instruction that produced it, and fix the instruction. Same as debugging any script.

You don't trust an agent's memory. You design the system so memory doesn't matter — idempotent operations, fresh context per run, explicit state in files rather than agent recall.
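"Explicit state in files" can be as simple as checking for the artifact before producing it. A sketch of an idempotent daily write (the filename convention is hypothetical):

```python
from pathlib import Path

def write_digest(vault: Path, day: str, body: str) -> bool:
    """Idempotent daily write: the state lives in the filesystem, not
    in the agent's memory. If today's digest already exists, do
    nothing — a re-run or an agent with stale context can't duplicate
    the work. Returns True only when a new file was written."""
    note = vault / f"{day}-digest.md"
    if note.exists():
        return False
    note.write_text(body)
    return True
```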

You don't scale by adding intelligence. You scale by tightening the feedback loop — better outputs → better prompts → tighter heartbeats.

If you're building self-hosted infrastructure and want to go deeper on the cluster setup, monitoring stack, and LXC templates that underpin all of this — I've documented it in full at sjvik-labs.stevenjvik.tech/guides.

Happy to answer questions about specific components in the comments — Traefik config, PBS setup, Prometheus scrape targets, or the agent architecture.
