Ted Murray

I Run AI Agents With Full System Access. Here's What Makes It Safe Enough.

The guardrails question is real. If you give an LLM access to your infrastructure — actual file writes, actual service restarts, actual database queries — and it hallucinates a tool call or gets confused mid-task, you have a problem. A real one, not a theoretical risk.

I know this because I've been running agents with real infrastructure access since February. Five project contexts, each with its own scope and capabilities. Dev handles code repos. Infra manages Docker stacks, reverse proxy configs, and deploy scripts. Research does web search and architecture planning. The system runs on a dedicated mini PC called claudebox that exists for this purpose.

Nothing has gone catastrophically wrong. That's not luck — or at least, I didn't want it to be luck.

Here's the architecture.


The problem with the VM approach

Giving a local LLM access inside a VM is a real answer. You limit blast radius. If the agent does something harmful, the damage is contained. Wipe it and start over.

But isolation by containment has a ceiling.

A VM doesn't answer what the agent can reach before something goes wrong. It doesn't scope which tools each agent can call, or which credentials it holds. In Claude Code, if you've configured an MCP server with a database token, every agent session has that token — even if that session has nothing to do with databases. You're not reducing the attack surface. You're just containing the explosion.

At scale it gets worse. Run multiple agent contexts and every one of them carries your full credential set. A confused agent in one context can potentially reach resources that belong to a completely different job.

The VM gives you recovery. It doesn't give you architecture.


Layer 1: Dedicated hardware

The simplest thing I did: agents don't run on my main machine.

Claudebox is a GMKtec NucBox K11, running Debian, sitting on my desk. It has no personal files. No work credentials. No access to the things I'd actually regret losing. Its entire job is to run Claude agents and the infrastructure that supports them.

This is the physical version of least privilege. Not a VM on a machine that has your personal data — a separate host that doesn't. If something goes wrong on claudebox, the blast radius is claudebox. My NAS, my Unraid server, my family photos — they're on different hardware. Agents reach them only through the MCP tools I've explicitly configured, and scoped-mcp controls which agent gets access to which tools. No agent reaches a host it wasn't given the tools to reach.

It also makes the whole setup easier to reason about. Claudebox is an agent host. That's its job. Everything on it exists to serve that purpose.


Layer 2: Scoped credentials and tools

The second layer is scoped-mcp.

The problem it solves: in Claude Code, every MCP server you configure loads its tools and credentials into every agent session. A research agent that does web search and nothing else still holds your infrastructure write keys if you've configured them somewhere in settings.json. That's a lot of capability for an agent that should only be reading web pages.

scoped-mcp sits between each agent and the full MCP server pool. At session start, the agent connects to a scoped-mcp process that only exposes the tools and credentials configured for that specific agent type. The research agent sees search tools. The infra agent sees infrastructure tools. Neither sees the other's credentials.

A hallucinated tool call can only reach what that agent was given. A prompt injection in a search result can't pivot to infrastructure write operations because those tools aren't in scope.

The config is a YAML profile per agent type that declares which MCP servers are visible and which environment variables are passed through. One process per agent, started at session time over stdio.
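
To make that concrete, here is a sketch of what one of those profiles could look like. The field names and values are my illustration of the idea, not scoped-mcp's actual schema:

```yaml
# research.yaml: hypothetical scoped-mcp profile for the research agent.
# Field names and values are illustrative assumptions, not the real schema.
agent: research
servers:
  - web-search        # the search MCP server is visible to this agent
  - fetch             # page retrieval
env:
  - SEARCH_API_KEY    # the only credential passed through
# No docker, ssh, or database entries: infrastructure tools and their
# credentials simply do not exist from this agent's point of view.
```

The exact syntax matters less than the model: scope is declared per agent, and anything not declared is invisible by default.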

There's a detail about how scoped-mcp was built that I still find slightly recursive: I built it using the same multi-agent pattern it protects. A research agent evaluated the problem space. A dev agent wrote the code. Each worked with scoped access to only what its role required. The tool was built by agents operating under the constraints it enforces.


What running this actually looks like

homelab-agent is the reference implementation for all of this.

Five agent contexts, each scoped:

  • Dev — code repos, GitHub, PR workflow
  • Infra — Docker Compose stacks, SWAG reverse proxy configs, deploy scripts
  • Research — web search, architecture planning, vendor evaluation
  • Outreach — blog workflow, GitHub profile, content publishing
  • Homelab-ops — day-to-day system management across all hosts

The infra agent holds credentials for the hosts it manages. The research agent holds search API keys and nothing else. The outreach agent has publishing tokens for dev.to and nothing that touches infrastructure.
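
In the same hypothetical profile shape as the sketch above, the contrast between two of these agents would look something like this (server names and variables invented for illustration):

```yaml
# infra.yaml: hypothetical profile, infrastructure tools and host credentials only.
agent: infra
servers: [docker, ssh, reverse-proxy-config]
env: [DOCKER_HOST, SSH_KEY_PATH]
---
# outreach.yaml: hypothetical profile, publishing tokens and nothing infrastructural.
agent: outreach
servers: [devto-publish, github-profile]
env: [DEVTO_API_KEY, GITHUB_TOKEN]
```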

These are real operations on real systems. Not experiments. Not demos. A typical week: infra deploys a new Docker stack, dev commits code, research evaluates a tool, outreach publishes an article. The system has been running in this configuration for months.


The layer that makes it reliable over time

The safety architecture handles what agents can reach. Memory handles what they can remember.

I built the memory system early — before most of what homelab-agent became. The decision was deliberate. Agents with no persistent context make the same mistakes repeatedly. They can't build on prior decisions. Every session feels like the first one.

The memory system has three tiers: session notes auto-captured at the end of each conversation, working memory files promoted from those sessions, and a distilled long-term tier for decisions that need to survive indefinitely. Each tier has its own retention policy.
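
A sketch of how those tiers and retention policies might be declared, with every key and duration invented for illustration:

```yaml
# Hypothetical memory-tier config. Names and durations are assumptions,
# not the actual retention policies of the system described here.
tiers:
  session-notes:
    capture: auto         # written at the end of each conversation
    retention: 30d        # raw material, short-lived
  working-memory:
    capture: promoted     # curated up from session notes
    retention: 180d
  long-term:
    capture: distilled    # decisions that need to survive indefinitely
    retention: forever
```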

What's grown more recently is the search layer. There are now five ways agents can query memory — automatic context injection at session start, on-demand semantic and keyword search, a unified interface that spans all tiers at once, a knowledge graph for infrastructure topology queries, and full-text structured retrieval. They serve different access patterns: some run automatically without prompting, some are on-demand, some answer "what did we decide about this" and some answer "what in the infrastructure connects to what."
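
One way to picture the search layer is as a routing table from access pattern to retrieval mechanism. This is a descriptive sketch of the five paths just listed, not the system's real configuration:

```yaml
# Hypothetical map of the five retrieval paths. Labels are invented.
retrieval:
  context-injection: {trigger: session-start, mode: automatic}
  search:            {trigger: on-demand, modes: [semantic, keyword]}
  unified:           {trigger: on-demand, scope: all-tiers}
  knowledge-graph:   {trigger: on-demand, answers: infrastructure-topology}
  full-text:         {trigger: on-demand, mode: structured}
```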

The practical effect: when an agent is working on something that touches a decision made three weeks ago, it can find that decision. Not because I re-explained it — through search. Architecture decisions, constraints, tradeoffs — they persist. Agents don't have to re-derive them.

This is a significant part of why I could build this quickly — I'm a Windows sysadmin, not a software developer. The agents weren't starting cold. They had access to everything we'd figured out together. Early sessions generated context that later sessions could build on. The system got better through use rather than worse.


The actual answer to the guardrails question

Containment is a last resort. It's what you reach for when you can't trust the architecture.

Dedicated hardware means the blast radius of any failure is a box that exists for this purpose. Scoped credentials mean agents can't reach what they were never given. Persistent memory means agents build on prior decisions instead of guessing.

None of this eliminates risk. An agent can still make mistakes. The difference is the mistakes are bounded — by scope, by access, by what the architecture permits.

That's not a theoretical safety model. It's been running since February.

Both projects, scoped-mcp and homelab-agent, are open source.
