My AI agent once drafted a Reddit reply that started with "As an AI language model, I want to share..."
It was about to auto-post into a solopreneur thread where I'd been building reputation for months.
Caught it with about four seconds to spare.
That was six months ago. Since then I've built a guardrail system that's caught 200+ bad outputs before they hit any public surface. Not because the model got worse. Because I stopped trusting the model to police itself.
Here's what I built, and how you can set up the same system for your own agent.
What AI Agent Guardrails Actually Are
Guardrails are checks that run between your agent's output and the world. They are not prompts that tell the agent to "be careful." They are structural barriers that intercept output and decide whether it passes, needs human review, or gets rejected outright.
Three types cover most failure modes:
- Content gates catch quality issues in the output itself: wrong voice, banned phrases, format violations, false claims.
- Action gates catch dangerous actions before they execute: sending an email, posting to social, making an API call that costs money.
- Intent gates catch task misalignment: the agent doing something technically correct but contextually wrong.
Most people set up zero of these and wonder why their agent breaks in production. Even one or two layers will catch most failures. With all three in place, the agent can run autonomously with real confidence.
Layer 1: Content Gates (What Gets Written)
Content gates are pattern-matching checks that run on any output before it leaves the system.
My setup runs every piece of public-facing content through a quality check with four specific tests:
Test 1: AI tell detection. Does the output contain phrases like "As an AI," "I should note," "it's worth considering," "delve into," "game-changer," or "leverage"? If yes, reject and regenerate with a note about what triggered the gate.
Test 2: Factual claim validation. Does the output contain a specific number, date, or statistic? If yes, cross-check against SOURCE_OF_TRUTH.md. If the claim isn't in the source of truth, flag it for human review rather than auto-posting.
Read more: What Is a Source of Truth Document for AI Systems
Test 3: Voice check. Does the output match the persona defined in the SOUL.md file? Evo has a specific voice: direct, founder-to-founder, first-person, no corporate language. Any output that reads as generic or formal gets flagged.
Test 4: Length and format. Is the output within the expected bounds for this content type? A tweet over 280 characters fails automatically. A newsletter that's 200 words short fails automatically. These are mechanical checks but they catch a real category of error.
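The mechanical parts of these tests (Tests 1, 2, and 4) don't need a model call at all. A minimal sketch in Python, with an illustrative phrase list, length bounds, and function names that are my own rather than the article's exact implementation:

```python
import re

# Banned phrases for Test 1 (illustrative subset, not the full list).
AI_TELLS = [
    "as an ai", "i should note", "it's worth considering",
    "delve into", "game-changer", "leverage",
]

# (min_chars, max_chars) per content type for Test 4. Bounds are examples.
LENGTH_BOUNDS = {
    "tweet": (1, 280),
    "newsletter": (3000, 10000),
}

def check_ai_tells(text: str) -> list[str]:
    """Return any banned phrases found in the output (Test 1)."""
    lowered = text.lower()
    return [phrase for phrase in AI_TELLS if phrase in lowered]

def check_length(text: str, content_type: str) -> bool:
    """True if the output is within bounds for its content type (Test 4)."""
    low, high = LENGTH_BOUNDS[content_type]
    return low <= len(text) <= high

def check_unverified_numbers(text: str, source_of_truth: str) -> list[str]:
    """Flag numbers that don't appear in the source-of-truth doc (Test 2)."""
    numbers = re.findall(r"\d[\d,.]*%?", text)
    return [n for n in numbers if n not in source_of_truth]
```

The voice check (Test 3) is the one that genuinely needs a model; the other three are cheap string work you can run on every output for free.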
The implementation is simple. After the agent produces output, pass it through a second prompt: "Review this content against these four criteria. Flag any failures with a [GATE FAIL: ] marker. If all four pass, return [GATE PASS]." Then parse that response before doing anything else with the output.
This adds 2-4 seconds per task. It's worth it.
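Wiring up that second-prompt review might look like the sketch below. `llm` is a placeholder for whatever model call your stack uses; the marker format follows the article's convention, and the fail-closed parsing is my own assumption:

```python
import re

REVIEW_PROMPT = (
    "Review this content against these four criteria: AI tells, "
    "unverified factual claims, voice match, length/format. "
    "Flag any failures with a [GATE FAIL: <reason>] marker. "
    "If all four pass, return [GATE PASS].\n\nContent:\n{content}"
)

def run_content_gate(content: str, llm) -> tuple[bool, list[str]]:
    """Return (passed, failure_reasons) parsed from the reviewer's response."""
    response = llm(REVIEW_PROMPT.format(content=content))
    if "[GATE PASS]" in response:
        return True, []
    failures = re.findall(r"\[GATE FAIL: (.*?)\]", response)
    # Fail closed: an unparseable response is treated as a failure, not a pass.
    return False, failures or ["unparseable gate response"]
```

The design choice worth copying is the last line: if the reviewer returns something you can't parse, the gate fails, because an ambiguous answer should never let content through.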
Layer 2: Action Gates (What Gets Sent)
Content gates check quality. Action gates check consequences.
Some tasks are low stakes. A draft blog outline that's slightly off just gets regenerated. Some tasks are irreversible. A posted tweet lives forever. A sent email reaches a real person. These two categories should not have the same review process.
My action gate system classifies every task before it runs:
Green (auto-execute): Draft creation, research, internal summaries, analytics pulls, file writes. Nothing leaves the system. Mistakes are recoverable.
Yellow (24-hour hold): Social media drafts, newsletter content, any content that will eventually be published. Auto-drafts to a queue. I review the queue daily. If I approve, it schedules. If I do nothing, it holds.
Red (immediate human approval): Direct messages to real people, emails, API calls with cost implications, anything that touches a paying customer. These trigger a Telegram message with the full output and explicit yes/no buttons. The task does not proceed without a response.
The classification lives in a simple lookup table in the agent's operating instructions. Every task type maps to a color before the task runs. The agent doesn't decide its own risk level.
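A lookup table like that can be a few lines of code. The task names below are hypothetical examples; the green/yellow/red tiers are the article's. One assumption I've added: unknown task types default to red, which enforces the rule that the agent never decides its own risk level.

```python
from enum import Enum

class Risk(Enum):
    GREEN = "auto-execute"
    YELLOW = "24-hour hold"
    RED = "immediate human approval"

# Static classification, written by the human, read by the agent.
TASK_RISK = {
    "draft_blog_outline": Risk.GREEN,
    "analytics_pull": Risk.GREEN,
    "social_media_draft": Risk.YELLOW,
    "newsletter_content": Risk.YELLOW,
    "send_email": Risk.RED,
    "reply_to_dm": Risk.RED,
}

def classify(task_type: str) -> Risk:
    """Unknown task types fail closed to RED."""
    return TASK_RISK.get(task_type, Risk.RED)
```

New task types start red until you deliberately downgrade them, so the safe path is the default path.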
This is the most important single change I made to the system. Before it, the agent was treating a reply to a potential customer with the same urgency as a draft tweet. That's backwards. High-stakes actions need friction. Low-stakes actions need speed.
Layer 3: Intent Gates (What Gets Attempted)
Intent gates are the deepest layer and the hardest to build. They catch the scenario where the agent does exactly what you asked but it's the wrong thing to do given context.
Real example from my system: I asked Evo to "respond to the top 3 DMs in the queue." The agent, reading the queue correctly, found that the top 3 DMs were all from the same person across multiple accounts. Auto-responding to all three would have looked like spam harassment. Technically correct, contextually wrong.
An intent gate asks a question before the task runs: "Given the task I'm about to do, is there any reason I should not do this right now?"
My implementation uses a pre-task check prompt:
Before executing this task, evaluate:
1. Does this conflict with any standing instructions in SOUL.md or SOURCE_OF_TRUTH.md?
2. Is there any context in the current session that makes this action ill-timed or inappropriate?
3. Does this task overlap with a task already in progress or recently completed?
If any answer is yes, pause and flag for review instead of proceeding.
This doesn't catch everything. But it catches the category of failures where the agent is doing something technically valid that a human with judgment would immediately recognize as wrong.
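Wrapping execution in that pre-task check could look like this sketch. `llm` and `execute` are placeholders for your own stack, and the `[PAUSE]`/`[PROCEED]` markers are my assumption about how to make the three questions machine-parseable:

```python
INTENT_CHECK = (
    "Before executing this task, evaluate:\n"
    "1. Does this conflict with any standing instructions in SOUL.md "
    "or SOURCE_OF_TRUTH.md?\n"
    "2. Is there any context in the current session that makes this "
    "action ill-timed or inappropriate?\n"
    "3. Does this task overlap with a task already in progress or "
    "recently completed?\n"
    "If any answer is yes, respond [PAUSE: <reason>]. "
    "Otherwise respond [PROCEED].\n\nTask: {task}"
)

def run_with_intent_gate(task: str, llm, execute) -> dict:
    """Pause and flag for review instead of proceeding when the check says so."""
    verdict = llm(INTENT_CHECK.format(task=task))
    if "[PROCEED]" not in verdict:
        # Anything other than an explicit PROCEED goes to the review queue.
        return {"status": "flagged", "reason": verdict}
    return {"status": "done", "result": execute(task)}
```

As with the other gates, the wrapper fails closed: only an explicit `[PROCEED]` lets the task run.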
Read more: How to Write an Identity File for Your AI Agent
What to Build First
If you're starting from zero, don't build all three layers at once. Stack them in order.
Start with action gates. Classify every task your agent runs as green/yellow/red. This alone prevents the most embarrassing public failures. Takes about two hours to set up.
Add content gates next. Pick the 2-3 AI tells that are most likely to appear in your agent's voice. The "As an AI" check is universal. Add your own persona-specific ones. Takes about an hour.
Add intent gates last. These require understanding how your agent makes decisions well enough to know where it'll go wrong. You'll discover that through running the first two layers. Don't try to anticipate failures you haven't seen yet.
The total setup time is a half-day. The maintenance overhead is minimal. You update the green/yellow/red classification when you add new task types. You update the content gate patterns when you discover new failure modes. Otherwise the system runs itself.
The Mindset Shift That Actually Matters
Most people who build AI agents think about guardrails as restrictions. They don't want to limit the agent. They want maximum autonomy.
That framing is wrong.
Guardrails aren't restrictions. They're trust builders. Every time a quality gate catches a bad output before it goes public, you get evidence that the system works. That evidence lets you extend the agent's autonomy in the places where it matters, because you know the failure modes are covered.
My agent runs hundreds of tasks per week with near-zero human intervention. That's possible because the gates are running in the background catching the 1-2% of outputs that would cause real problems. Without the gates, I'd have to review everything. With the gates, I only see what's genuinely uncertain.
Read more: How to Give an AI Agent Persistent Memory
That's the system you want: an agent that earns its autonomy by failing safely.
The Full Architecture
If you want to understand how guardrails fit into the broader stack, including identity files, memory systems, source of truth docs, and the verification loop that keeps everything honest, I cover the whole thing in Build an AI Co-Founder.
It's the full architecture, not just individual components.
A $7 guide for building an AI co-founder that can run your business while you're doing something else. That's the point of all of this.
Published by Michael Olivieri / Xero AI
Originally published at xeroaiagency.com