Real guardrails for autonomous agents after one almost destroyed my infrastructure
I'll be straight with you: I wrote yesterday's post about agents that deploy on their own with my heart still pounding. What I didn't detail, because I was still processing it, is that the agent didn't just "break something minor". It got as far as executing a DROP TABLE against a staging table that mirrored production's structure. Railway's diff showed it to me in bright red at 11:47pm. I had exactly four seconds to cancel the pipeline before the commit reached the wrong environment.
Four seconds.
That made it crystal clear to me that running an agent that deploys on its own without a real control layer isn't "living on the edge". It's Russian roulette with your infrastructure.
My thesis, now that the adrenaline has worn off: guardrails aren't an optional feature of autonomous agents — they are the architecture. Without them, the agent isn't autonomous: it's an uncontrolled process with LLM context. That distinction matters.
AI agents in production: the concrete problem guardrails solve
The promise of agents is beautiful on paper. You give the agent a goal; it breaks the goal into steps, executes, corrects, iterates. I tested this against my real stack and there are cases where it works surprisingly well.
The problem shows up at the edges. And the edges in production are exactly where the cost of getting it wrong is highest.
What I found in my incident logs:
[2026-07-14T23:47:11Z] AGENT_STEP: Running obsolete schema cleanup
[2026-07-14T23:47:11Z] SQL_INTENT: DROP TABLE sessions_legacy
[2026-07-14T23:47:12Z] ENV_CONTEXT: staging → production (ambiguity detected in RAILWAY_ENV variable)
[2026-07-14T23:47:12Z] EXEC: psql -c "DROP TABLE sessions_legacy" $DATABASE_URL
See the problem? ENV_CONTEXT: staging → production (ambiguity detected). The agent knew there was ambiguity. It logged it. And executed anyway.
That's not an LLM bug. That's the absence of policy. The agent had no instruction to stop when facing destructive ambiguity. It had an instruction to complete the objective.
The guardrails architecture I built: real code and real decisions
After the incident I built a layer I internally call the gatekeeper. It's not fancy. It's a module that sits between the agent and any execution with real consequences.
1. Destructive intent classifier
// guardrails/intent-classifier.ts
// Classifies whether an action has destructive potential before executing it
import type { AgentContext } from './environment-context';

const DESTRUCTIVE_PATTERNS = [
  /DROP\s+(TABLE|DATABASE|SCHEMA)/i,
  /DELETE\s+FROM\s+\w+\b(?!\s+WHERE)/i, // DELETE without WHERE (the \b stops backtracking past the table name)
  /TRUNCATE/i,
  /rm\s+-rf/i,
  /railway\s+down/i,
  /docker\s+system\s+prune/i,
  /git\s+push\s+.*--force/i,
] as const;

const AMBIGUOUS_ENV_SIGNALS = [
  'staging',
  'production',
  'prod',
  'DATABASE_URL', // without environment prefix
] as const;

export type IntentRisk = 'safe' | 'review' | 'block';

export function classifyIntent(action: string, context: AgentContext): IntentRisk {
  const isDestructive = DESTRUCTIVE_PATTERNS.some(p => p.test(action));
  if (!isDestructive) return 'safe';

  // Destructive action: check the environment context
  const hasEnvAmbiguity = AMBIGUOUS_ENV_SIGNALS.some(signal =>
    context.environmentHints?.includes(signal) && !context.environmentConfirmed
  );

  // Env ambiguity + destructive action = full block
  if (hasEnvAmbiguity) return 'block';

  // Destructive action but clear environment = manual review required
  return 'review';
}
The classifier is deterministic. I don't ask the LLM whether something is dangerous — because the LLM can convince itself that it isn't. The regexes are blunt and that's exactly what I want.
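To make that concrete, here's how the classifier treats the incident's exact action. A quick sketch, not a real test suite; the context object is hand-built for illustration:

// sanity-check.ts (illustrative)
import { classifyIntent } from './guardrails/intent-classifier';
import type { AgentContext } from './guardrails/environment-context';

const ambiguousCtx: AgentContext = {
  environment: 'unknown',
  environmentConfirmed: false,
  environmentHints: ['staging', 'production'], // conflicting signals, like that night
  isProduction: false,
};

classifyIntent('SELECT * FROM sessions_legacy', ambiguousCtx); // → 'safe'
classifyIntent('DROP TABLE sessions_legacy', ambiguousCtx); // → 'block'
classifyIntent('DROP TABLE sessions_legacy', { ...ambiguousCtx, environmentConfirmed: true }); // → 'review'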
2. The execution wrapper with a stop policy
// guardrails/execution-wrapper.ts
// Intercepts every agent execution before it touches real infrastructure
import { classifyIntent } from './intent-classifier';
import { notifySlack } from '../notifications/slack';
// Helpers sketched later in this post:
import { logAgentAction } from './agent-log';
import { waitForHumanApproval } from './approval-gate';
import type { AgentContext } from './environment-context';

export interface ExecutionResult {
  executed: boolean;
  reason?: string;
  output?: string;
}

export async function safeExecute(
  action: string,
  context: AgentContext,
  executor: () => Promise<string>
): Promise<ExecutionResult> {
  const risk = classifyIntent(action, context);

  // Always log, no exceptions — the logs saved me the first time
  await logAgentAction({ action, risk, context, timestamp: new Date().toISOString() });

  if (risk === 'block') {
    await notifySlack({
      level: 'critical',
      message: `🚫 AGENT BLOCKED\nAction: ${action}\nReason: destructive ambiguity detected\nEnvironment: ${context.environment}`,
    });
    return {
      executed: false,
      reason: 'Action blocked: destructive pattern with ambiguous environment context. Requires human intervention.',
    };
  }

  if (risk === 'review') {
    // For review actions: wait for approval with timeout
    const approved = await waitForHumanApproval(action, context, { timeoutMs: 5 * 60 * 1000 });
    if (!approved) {
      return {
        executed: false,
        reason: 'Human approval not received in time (5 min). Action cancelled.',
      };
    }
  }

  // Safe or approved: execute and log output
  const output = await executor();
  await logAgentAction({ action, risk, context, output, timestamp: new Date().toISOString() });
  return { executed: true, output };
}
The key point is waitForHumanApproval. It's not a loop that blocks the process — it's a promise that resolves when a webhook arrives from Slack (an "Approve" / "Reject" button). If nothing comes in 5 minutes, it cancels.
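For reference, here's roughly what that promise looks like. A trimmed sketch, not my production module; the id plumbing and the Slack payload are illustrative:

// guardrails/approval-gate.ts
// A Slack interactive button posts back to an HTTP route, which calls
// resolveApproval() with the request id. No answer means no execution.
import { randomUUID } from 'node:crypto';
import { notifySlack } from '../notifications/slack';
import type { AgentContext } from './environment-context';

const pending = new Map<string, (approved: boolean) => void>();

export function waitForHumanApproval(
  action: string,
  context: AgentContext,
  opts: { timeoutMs: number }
): Promise<boolean> {
  const id = randomUUID();

  return new Promise<boolean>(resolve => {
    // Timeout means cancelled, never silently approved
    const timer = setTimeout(() => {
      pending.delete(id);
      resolve(false);
    }, opts.timeoutMs);

    pending.set(id, approved => {
      clearTimeout(timer);
      pending.delete(id);
      resolve(approved);
    });

    // Send the Approve / Reject buttons, carrying the id back in the payload
    void notifySlack({
      level: 'warning',
      message: `⏳ Approval needed [${id}]\nAction: ${action}\nEnvironment: ${context.environment}`,
    });
  });
}

// Called by the webhook route that receives Slack's button callback
export function resolveApproval(id: string, approved: boolean): void {
  pending.get(id)?.(approved);
}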
3. The environment context: the variable the incident agent never had
// guardrails/environment-context.ts
// Builds the environment context before handing control to the agent
export interface AgentContext {
  environment: string;
  environmentConfirmed: boolean;
  environmentHints: string[];
  isProduction: boolean;
  agentConstraints?: string;
}

export function buildAgentContext(): AgentContext {
  const env = process.env.RAILWAY_ENVIRONMENT_NAME;

  // Explicit fallback — if no variable exists, it's ambiguous
  if (!env) {
    return {
      environment: 'unknown',
      environmentConfirmed: false,
      environmentHints: [],
      isProduction: false,
    };
  }

  const isProduction = env.toLowerCase() === 'production';

  return {
    environment: env,
    environmentConfirmed: true,
    environmentHints: [env],
    isProduction,
    // In production: additional constraints in the agent's system prompt
    agentConstraints: isProduction ? PRODUCTION_CONSTRAINTS : STAGING_CONSTRAINTS,
  };
}

const PRODUCTION_CONSTRAINTS = `
ENVIRONMENT RESTRICTIONS - PRODUCTION:
- Prohibited from executing destructive database operations without explicit approval
- Prohibited from modifying environment variables without confirmation
- Prohibited from stopping services without a documented rollback plan
- When in any doubt about the scope of an action: STOP and report
- The goal of completing the task is SECONDARY to system integrity
`;

// The staging version is looser; abbreviated here, but the stop-on-doubt rule stays
const STAGING_CONSTRAINTS = `
ENVIRONMENT RESTRICTIONS - STAGING:
- Destructive operations allowed only on resources the agent itself created
- When in any doubt about the scope of an action: STOP and report
`;
That last line in PRODUCTION_CONSTRAINTS was the hardest one for me to finally write: the goal of completing the task is secondary to system integrity. Agents are trained to complete objectives. You have to explicitly rewrite their value hierarchy.
The mistakes I made (and that you'll make if you don't read this first)
Mistake 1: trusting that the agent "understands" the environment context
The incident agent had access to process.env. It could read the variables. But "reading" isn't the same as "using as a constraint". You need to inject the environment context as an explicit constraint in the system prompt, not as available data.
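In practice, that injection looks something like this. An illustrative sketch; the prompt skeleton is simplified, but the module names match the code above:

// agent/system-prompt.ts
// The environment contract goes into the system prompt as a hard constraint
// block, instead of hoping the agent reads process.env correctly.
import { buildAgentContext } from '../guardrails/environment-context';

export function buildSystemPrompt(task: string): string {
  const ctx = buildAgentContext();

  return [
    'You are an infrastructure agent.',
    ctx.agentConstraints ?? 'ENVIRONMENT UNKNOWN: treat every action as production.',
    `Current environment: ${ctx.environment} (confirmed: ${ctx.environmentConfirmed})`,
    `Task: ${task}`,
  ].join('\n\n');
}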
Mistake 2: logging only errors, not intentions
My original logs recorded outputs. After the incident I changed them to record intentions — every step the agent wants to take, before executing it. It's the difference between knowing what happened and being able to intervene before it happens.
This connects to something I noticed when inspecting Chrome installing models without asking permission: when an automated process acts without an intent log, you always arrive late. You only see consequences.
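The intent log itself doesn't need to be fancy. Here's a sketch of the shape mine takes; the JSONL file target is illustrative, yours might be a database table:

// guardrails/agent-log.ts
// One record per intention, written BEFORE execution. That ordering is
// the whole point.
import { appendFile } from 'node:fs/promises';

export interface AgentLogEntry {
  action: string;
  risk: 'safe' | 'review' | 'block';
  context: unknown;
  output?: string;
  timestamp: string;
}

export async function logAgentAction(entry: AgentLogEntry): Promise<void> {
  await appendFile('agent-intent.log', JSON.stringify(entry) + '\n', 'utf8');
}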
Mistake 3: guardrails only on the happy path
I put my first controls on the agent's normal flow. But the incident didn't happen in the normal flow — it happened in a cleanup step the agent generated itself as a subtask. Guardrails have to wrap all execution, including the actions the agent autogenerates.
// WRONG: guardrails only at the entry point
async function runAgent(task: string) {
  checkGuardrails(task); // ← only checks the initial task
  await agent.execute(task); // ← subtasks run uncontrolled
}

// RIGHT: guardrails on the executor, not the entry point
async function runAgent(task: string) {
  // The agent calls safeExecute() for EVERY action it wants to take
  await agent.execute(task, { executor: safeExecute });
}
Mistake 4: ignoring the agent's own warnings
When I reviewed the incident logs, the agent had logged ambiguity detected before executing. I had no alert on that string. Now I do:
// monitoring/agent-log-watcher.ts
// Immediate alert on warning keywords in agent logs
import type { Readable } from 'node:stream';
import { notifySlack } from '../notifications/slack';

const ALERT_KEYWORDS = [
  'ambiguity',
  'ambiguous',
  'not confirmed',
  'unconfirmed',
  'assuming',
  'inferring environment',
];

export function watchAgentLogs(logStream: Readable) {
  logStream.setEncoding('utf8'); // make 'data' deliver strings, not Buffers

  logStream.on('data', (chunk: string) => {
    const line = chunk.toLowerCase();
    const hasWarning = ALERT_KEYWORDS.some(kw => line.includes(kw));

    if (hasWarning) {
      // Immediate alert — don't wait for the next monitoring cycle
      void notifySlack({ level: 'warning', message: `⚠️ Agent reporting uncertainty:\n${chunk}` });
    }
  });
}
FAQ: Guardrails for AI agents in production
Isn't it just easier to not use autonomous agents in production?
Yes, it's easier. It's also easier to not use Docker because "it works fine without containers". It took me 6 months to really understand Docker and the productivity jump was real. Well-constrained agents give me a similar jump. The point isn't to avoid them — it's to not use them without architecture.
Aren't regex-based guardrails too blunt?
Deliberately, yes. I don't want sophistication in the blocking layer. I want it to be impossible to bypass with clever LLM reasoning. If there's a DROP TABLE in the action string, I don't care about the context: it gets blocked. Nuance can live in other layers of the system.
How do you handle human approvals when the agent runs at night?
With the 5-minute timeout configured. If no approval comes, the action is cancelled and the agent logs the reason. The next day I review what it wanted to do and if it made sense, I run it manually. I'd rather lose one automation than lose data.
Do these guardrails work with any LLM or are they Claude-specific?
The intent classifier and execution wrapper are model-agnostic — they act on the agent's output, not on the model itself. The PRODUCTION_CONSTRAINTS in the system prompt vary in effectiveness by model, but the blocking layer works the same regardless. Even if the LLM ignores the instructions, the wrapper intercepts execution.
How much overhead does this layer add to agent execution time?
In my measurements: between 80ms and 200ms per action for classification and logging. For safe actions it's just the log, nearly nothing. The real overhead is the human wait time on review actions, and that part is intentional.
What if the agent tries to evade the guardrails by generating code that bypasses them?
That's a real attack vector. I mitigated it two ways: first, the agent doesn't have access to the guardrails code (it's outside the context it receives). Second, the execution wrapper is invoked from the runtime, not from the agent — the agent can only declare intentions, not execute them directly. It's the same privilege separation as any well-designed system. If I ever find evidence of active evasion, that's a signal the model changed behavior — something I've been monitoring ever since I started thinking about how models change in production without warning.
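Concretely, the runtime side of that boundary looks something like this. A sketch with illustrative names; the real executor depends on your stack:

// agent/runtime-boundary.ts
// The agent's toolset can only DECLARE an action; the runtime owns the
// executor and the guardrails, which never enter the agent's context.
import { safeExecute } from '../guardrails/execution-wrapper';
import { buildAgentContext } from '../guardrails/environment-context';

export async function handleDeclaredAction(action: string): Promise<string> {
  const context = buildAgentContext();
  const result = await safeExecute(action, context, () => runInSandbox(action));

  // A refusal is fed back to the agent as an observation, not an exception
  return result.executed ? (result.output ?? '') : `REFUSED: ${result.reason}`;
}

// Placeholder: wire this to your real executor (shell, SQL client, etc.)
async function runInSandbox(action: string): Promise<string> {
  throw new Error(`no executor configured for: ${action}`);
}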
What I learned: the agent isn't the problem, the absence of a contract is the problem
Something became clear to me after this incident that I haven't seen articulated in any post about autonomous agents: the LLM doesn't know what's valuable to you. It knows what instructions it received. If the instructions say "complete the task", it will complete it — including the part that destroys something you considered untouchable but never explicitly told it was.
Same thing I criticize about certain tools that act without asking permission: autonomy without declared limits isn't autonomy, it's unpredictability.
My final position, unvarnished: autonomous agents in production are a technically valid bet if — and only if — you treat guardrails as first-order architecture. Not as a security feature you'll add later. Not as documentation of what the agent "shouldn't" do. As an executable contract with consequences.
Everything else I've written this week about Rust with real edge cases or about supply chain attacks in dependencies comes from the same place: production doesn't accept "I'll add it later". Those four seconds I had that night — nobody's giving them back to me.
If you're building agents, start with the gatekeeper. Then build the agent.
Got an agent incident you still haven't talked about? Send me the log. Seriously.
This article was originally published on juanchi.dev