Philip Hern

Posted on Jun 1 • Originally published at philliant.com

the guardrails i actually use with ai agents

#ai #guardrails #security #workflow

i argued in "ai liberty" that as models get more capable they get more confident, and that confidence makes them take liberties such as pushing to a primary branch or running a database query nobody asked for. that post named the problem and promised guardrails, but it stopped short of showing them. this is the follow-up with the actual kit.

the framing i keep coming back to is speed with lane discipline. guardrails are not brakes. they are the lane markers that let me drive fast because i know exactly where the road edges are. the goal is never to slow the agent down on the 95% of work that is safe. the goal is to remove the handful of ways it can do something i cannot undo.

quick answer

the guardrails i rely on sit in four layers, namely the agent and editor, the repository, the data, and the human gate. i default the agent to read-only or ask mode, i allowlist the safe commands it runs constantly and deny the destructive ones, i give it database credentials that are read-only and never pointed at production, and i protect the main branch so nothing lands without review. on top of that, a short list of genuinely irreversible actions always requires my explicit approval. none of this slows down day-to-day work, because it only gates the rare move that is hard to reverse.

who this is for

audience: engineers and data folks running ai agents that can touch files, a terminal, or a database
prerequisites: you already use an agentic editor or cli and have either hit an unintended action or are trying to prevent one
when to use this guide: when you want the speed of autonomy without leaving open a path to unrecoverable damage

why this matters

the incidents i described in "ai liberty" were benign only by luck. an agent committed and pushed to a primary branch on its own, and another hijacked a local script to run queries against a production database. i wrote in "working with an ai model mirror" about how a fast model will take real liberties with the command line unless you stop it. the common thread is timing. the destructive action takes a second, and the recovery can take hours or may not be possible at all.

guardrails change the question from whether i trust the model not to do something into whether the model can do it at all. trust is a hope, and a guardrail is a constraint. the first fails silently and the second fails safe. this is the same lane discipline i wrote about in "the danger of trusting the ai agent", made concrete in configuration instead of intention.

layer 1: the agent and the editor

this is the cheapest place to set limits, and it catches the most.

default to read-only or ask mode: a new chat starts unable to run commands or write files until i grant it, so nothing executes while i am still describing the problem
allowlist the safe verbs, deny the destructive ones: the agent runs my constant, reversible commands without interruption, and the dangerous ones stop for confirmation
scope the workspace: the agent works inside the project root only, and i keep environment files and credential files outside its reach
reset per chat: permissions do not carry over between sessions, so a one-time grant never becomes a standing one

the allowlist is the piece that earns its keep daily. the shape of the rule matters more than the exact syntax, which varies by tool:

# run these without asking, they are safe and reversible
allow:
  - git status
  - git diff
  - git add
  - npm run test
  - npm run lint
# always stop and ask me first, these are hard to undo
deny:
  - git push
  - git reset --hard
  - rm -rf
  - "drop "
  - "delete from "

the point is not the specific list. it is that the safe path is frictionless and the destructive path is gated by default, not by my memory.
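as a rough illustration of that rule shape, the matching logic might look like the python sketch below. it is not how any particular editor implements its allowlist. the function name classify, the rule tables, and the choice to treat any chained or redirected command as unknown are all assumptions made for the example.

```python
import shlex

# Hypothetical rule tables mirroring the allow/deny shape above.
ALLOW_PREFIXES = [
    ["git", "status"],
    ["git", "diff"],
    ["git", "add"],
    ["npm", "run", "test"],
    ["npm", "run", "lint"],
]
DENY_PREFIXES = [
    ["git", "push"],
    ["git", "reset", "--hard"],
    ["rm", "-rf"],
]
DENY_SUBSTRINGS = ["drop ", "delete from "]
# Characters that chain or redirect commands. Anything containing them is
# treated as unknown, so "git status && git push" cannot ride on an allow rule.
SHELL_METACHARACTERS = ";&|`$<>"


def starts_with(tokens, prefix):
    """True when the token list begins with every token in prefix."""
    return tokens[: len(prefix)] == prefix


def classify(command: str) -> str:
    """Return "allow" for allowlisted commands and "ask" for everything else."""
    lowered = command.lower()
    if any(fragment in lowered for fragment in DENY_SUBSTRINGS):
        return "ask"
    if any(char in command for char in SHELL_METACHARACTERS):
        return "ask"
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "ask"  # unbalanced quotes: refuse to guess
    # Deny rules are checked before allow rules so they always win.
    if any(starts_with(tokens, prefix) for prefix in DENY_PREFIXES):
        return "ask"
    if any(starts_with(tokens, prefix) for prefix in ALLOW_PREFIXES):
        return "allow"
    return "ask"  # default for anything not explicitly allowlisted
```

the design choice worth noticing is the last line: a command that matches nothing falls through to "ask", so forgetting to list a dangerous command costs a confirmation prompt rather than an incident.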

layer 2: the repository and git

even with a careful editor config, i assume a command will eventually slip through. the repository is my second net.

protect the main branch: no direct pushes, so an agent cannot land anything on main without a pull request
require review and passing checks: a human review and green continuous integration are required before a merge, which puts a person and a test suite between the agent and the shared history
let the agent commit, not push: the agent can stage and commit on a feature branch, and i am the one who opens the pull request after reading the diff
keep secret scanning in pre-commit: a hook that blocks credentials is a cheap backstop for the moment an agent tries to commit a key it generated or found

the reason this layer matters is that it does not depend on the agent behaving. branch protection is enforced by the platform, not by the model's good judgment, so it holds even when an editor setting is wrong.
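the secret-scanning hook from the list above can be sketched in a few lines of python. this is a toy version under stated assumptions: the three patterns are illustrative only, the function name find_secrets is hypothetical, and a dedicated scanner will catch far more than this does.

```python
import re

# Illustrative patterns only; a real scanner ships many more.
SECRET_PATTERNS = {
    "aws access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded credential": re.compile(
        r"(?i)\b(?:password|secret|api_key|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff line number, pattern name) for each suspicious added line."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the commit adds, skipping the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

one way to wire it up, assuming a python pre-commit hook, is to feed it the output of git diff --cached and exit with a non-zero status when the list is not empty, since a non-zero exit from a pre-commit hook aborts the commit.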

layer 3: data and credentials

this is the layer i am strictest about, because data damage is the kind you cannot always undo.

give the agent a read-only role: the credentials in my development loop can select, and nothing else
never put production write access in reach: the agent's connection points at a development or staging database, and production write credentials simply do not exist in that environment
keep secrets out of the repo and the prompt: credentials live in environment variables or a secret manager, never pasted into a chat where they end up in logs and history
prefer least privilege everywhere: a separate, narrow role per use beats one powerful role shared across everything

a read-only role takes a few minutes to set up and removes an entire category of accident:

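-- snowflake-style syntax; adjust for your warehouse
create role ai_agent_readonly;
-- usage lets the role run queries on the dev warehouse and see the dev database
grant usage on warehouse dev_wh to role ai_agent_readonly;
grant usage on database analytics_dev to role ai_agent_readonly;
grant usage on all schemas in database analytics_dev to role ai_agent_readonly;
-- select only; "all tables" covers the tables that exist when the grant runs
grant select on all tables in database analytics_dev to role ai_agent_readonly;
-- intentionally no insert, update, delete, or grants, and no production access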

with a role like this, the worst case for a runaway query is a slow read, not a deleted table.
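the role is the real constraint, because the database enforces it. a small preflight check in the agent's launch script can act as an extra backstop that refuses to start when the connection settings look wrong. the sketch below is only that: the environment variable names AGENT_DB_ROLE and AGENT_DB_NAME and the _dev / _staging naming convention are assumptions for the example, not something the setup above prescribes.

```python
# Hypothetical environment variable names; the role name matches the SQL above.
EXPECTED_ROLE = "ai_agent_readonly"
SAFE_DATABASE_SUFFIXES = ("_dev", "_staging")


def check_agent_connection(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the settings look safe."""
    problems = []
    role = env.get("AGENT_DB_ROLE", "")
    database = env.get("AGENT_DB_NAME", "")
    if role.lower() != EXPECTED_ROLE:
        problems.append(f"role {role!r} is not {EXPECTED_ROLE!r}")
    if not database.lower().endswith(SAFE_DATABASE_SUFFIXES):
        problems.append(f"database {database!r} is not a dev or staging database")
    return problems
```

calling check_agent_connection(dict(os.environ)) before the agent starts, and aborting on a non-empty result, turns a misconfigured connection into a loud failure instead of a quiet one. missing variables count as problems too, so the check fails closed.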

layer 4: the human-in-the-loop gate

a few actions are dangerous enough that i never automate them, no matter how capable the model is. these always stop for me:

schema migrations and destructive ddl such as drop and truncate
deletes, updates, or anything that mutates production data
deploys, releases, and infrastructure changes
force-pushes and history rewrites
anything that spends money or sends a message to real users

for this short list, the rule is to read the actual diff and command output, not the summary the agent writes in chat. i have been burned by trusting the narrative before, when a clean git tree hid work i could not account for, and reading the diff is what would have caught it. the same racing phrase fits here: slow is smooth, smooth is fast.
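in code, a gate like this reduces to one check in front of the executor. the sketch below assumes a hypothetical Action record and a run_gated wrapper, with action kinds named after the list above. how a real agent framework labels its actions will differ, so treat the names as placeholders.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds, one per item on the always-ask list.
IRREVERSIBLE_KINDS = {
    "schema_migration",
    "destructive_ddl",
    "production_write",
    "deploy",
    "infrastructure_change",
    "force_push",
    "history_rewrite",
    "spend_money",
    "message_users",
}


@dataclass(frozen=True)
class Action:
    kind: str      # category of the action, e.g. "force_push"
    command: str   # the exact command or statement to show the human


def run_gated(
    action: Action,
    execute: Callable[[Action], None],
    approve: Callable[[Action], bool],
) -> str:
    """Run the action, asking for approval first when its kind is irreversible."""
    if action.kind in IRREVERSIBLE_KINDS and not approve(action):
        return "blocked"
    execute(action)
    return "executed"
```

the approve callback receives the full action, including the exact command, so whatever prompt it shows can display the real command rather than a summary of it. actions outside the set never call approve at all, which is what keeps the everyday work uninterrupted.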

keeping the speed

it would be easy to read all of this as a wall of friction, but in practice it is the opposite. almost every guardrail above is one-time setup that pays back forever. the editor config, the branch rules, the read-only role, and the short gate list are written once and then they just hold. they do not cry wolf, because they only interrupt me for the rare action that is actually hard to undo. the constant, reversible work that fills most of my day never hits a single one of them.

that is what speed with lane discipline means to me. i am not asking the agent to be slower or less autonomous. i am drawing the lanes so that its speed runs in a direction i can always recover from. i wrote a related piece on encoding this kind of intent directly into the tools in "how to use ai to create ai rules, skills, and commands", and the guardrails here are the safety-critical version of that same idea.

faq

do these guardrails slow the agent down?

mostly no. the safe, reversible commands are allowlisted and run without interruption, and the only things that stop are the handful of actions that are genuinely hard to reverse. the friction is concentrated exactly where i want it and absent everywhere else.

if i could only set up one guardrail, which should it be?

read-only database credentials that cannot reach production. command and branch mistakes are usually recoverable, but a destructive write against real data often is not, so removing write access removes the worst outcome first.

what about fully autonomous agents that run without me watching?

the gate scales with blast radius. the more autonomy i hand over, the tighter the rails need to be, which usually means a sandboxed environment, no production credentials at all, and an irreversible-action list that is enforced by the platform rather than by my attention.
