How to add guardrails to your Claude agent in 10 lines
If you've run a Claude agent in production for more than a week, you've probably had a moment where it did something it really shouldn't have. Sent a test email to a real user. Called the delete endpoint. Made a charge at 3x the expected amount because the model got confused about currency.
The usual fix is adding more instructions to the system prompt. "Never delete files in /prod." "Only email approved addresses." And it works fine, until it doesn't.
Why system prompt rules break
Claude follows instructions well in a fresh context. Give it a clean slate and a clear rule, and it'll stick to it. The trouble starts once you've got 20,000 tokens of conversation history, a few tool call outputs, maybe some retrieved documents, and the model is trying to figure out what to do next.
Context pressure is real. The relevant safety instruction is now 15,000 tokens back in the window, and the model is working with whatever's most salient. It's not ignoring your rule intentionally - it's just not necessarily weighting it heavily when there's a lot of other stuff going on.
You can't fix this by moving the instruction to the end of the system prompt, and you can't fix it by repeating it. The problem is that you're asking the model to be its own safety layer, and models aren't consistently good at that under load.
The fix is: don't make it the model's problem. Validate the action in code, before it runs.
10 lines
from claude_guard import Guard

g = Guard()

rules = [
    {"action": "send_email", "condition": {"field": "to", "op": "not_in", "value": ["boss@company.com"]}, "block": True, "reason": "Only approved recipients"},
    {"action": "delete_file", "condition": {"field": "path", "op": "contains", "value": "/prod"}, "block": True, "reason": "No prod deletions"},
]

result = g.check("send_email", {"to": "random@example.com"}, rules)
# {"safe": False, "reason": "Only approved recipients"}

if g.inject_detect(user_input):
    raise ValueError("Injection attempt detected")
That's the whole thing. Guard.check() takes the action name, the params the agent wants to use, and your rule list. Returns {"safe": bool, "reason": str}. You decide what to do with it - block, log, ask for confirmation.
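To make the semantics concrete, here's a hand-rolled sketch of the same check in plain Python. The op names ("not_in", "contains", "gt") are taken from the examples in this post; this is an illustration of the pattern, not claude_guard's actual source.

```python
def check(action, params, rules):
    """Return {"safe": bool, "reason": str} for a proposed tool call.

    Illustrative reimplementation of the check pattern; the real
    library's internals may differ.
    """
    ops = {
        "not_in":   lambda v, allowed: v not in allowed,
        "contains": lambda v, needle: needle in str(v),
        "gt":       lambda v, limit: v > limit,
    }
    for rule in rules:
        if rule["action"] != action:
            continue
        cond = rule["condition"]
        if cond["field"] not in params:
            # Fail closed: a rule we can't evaluate blocks the action.
            return {"safe": False, "reason": f"missing field: {cond['field']}"}
        if ops[cond["op"]](params[cond["field"]], cond["value"]) and rule.get("block"):
            return {"safe": False, "reason": rule["reason"]}
    return {"safe": True, "reason": ""}
```

The point of the sketch: the decision is deterministic code looping over rules, not a model judgment, and a rule that can't be evaluated fails closed.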
The audit log is automatic. Every check() call writes to ~/.claude_guard/audit.db by default: plain SQLite, no backend required.
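If you want a feel for how little code an audit trail like this takes, here's a minimal sketch with the standard library. The table and column names are assumptions for illustration, not necessarily the schema claude_guard writes.

```python
import json
import sqlite3
import time

def open_audit(path=":memory:"):
    # Illustrative schema; claude_guard's actual table may differ.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS audit_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        ts REAL, action TEXT, params TEXT, safe INTEGER, reason TEXT)""")
    return db

def log_check(db, action, params, safe, reason):
    # One row per check: timestamp, action, serialized params, verdict.
    db.execute(
        "INSERT INTO audit_log (ts, action, params, safe, reason) VALUES (?, ?, ?, ?, ?)",
        (time.time(), action, json.dumps(params), int(safe), reason),
    )
    db.commit()
```

Swap ":memory:" for a file path and you have the same no-backend setup: a local file you can query with the sqlite3 CLI.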
If you're running agents autonomously (not just for demos), there are more failure modes than guardrails alone can cover. I documented 290+ production runs of an autonomous Claude agent with a real revenue goal in a guide: what breaks, what doesn't, what I'd do differently. £19: genesisclawbot.github.io/income-guide
What a real check looks like
Say your agent can make HTTP requests. You want to restrict it to a known list of internal APIs. Here's the rule:
rules = [
    {
        "action": "http_request",
        "condition": {
            "field": "domain",
            "op": "not_in",
            "value": ["api.internal.com", "api.stripe.com"]
        },
        "block": True,
        "reason": "External HTTP blocked"
    }
]
result = g.check("http_request", {"domain": "some-random-site.io", "method": "POST"}, rules)
# {"safe": False, "reason": "External HTTP blocked"}

if not result["safe"]:
    return f"I can't make that request: {result['reason']}"
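One gotcha worth flagging: the rule above matches on a "domain" field, so that field should come from parsing the URL the agent actually wants to fetch, not from a separate model-supplied parameter that might disagree with it. A sketch with the standard library (domain_of is a hypothetical helper, not part of claude_guard):

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    # Extract the host from the real URL; urlparse lowercases it.
    return urlparse(url).hostname or ""

# Build the params you check from the URL itself, so the guarded field
# can't drift from the request that actually runs.
params = {"url": "https://some-random-site.io/path", "method": "POST"}
check_params = {**params, "domain": domain_of(params["url"])}
```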
You can also use numeric comparisons. This is handy for anything involving money:
{"action": "charge_card", "condition": {"field": "amount_gbp", "op": "gt", "value": 100}, "block": True, "reason": "Amount over limit"}
The agent can be as confident as it likes. The rule runs regardless.
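A related pitfall, per the currency confusion in the intro: a "gt" rule on an amount only protects you if the amount is always in the same unit. A minimal sketch of normalizing to integer minor units before the check (the function and unit names are illustrative):

```python
def to_pence(amount, unit: str) -> int:
    # Normalize every amount to pence before any limit check, so 100
    # can never silently mean pounds in one place and pence in another.
    if unit == "pence":
        return int(amount)
    if unit == "gbp":
        return round(amount * 100)
    # Fail loudly on anything unrecognized rather than guessing.
    raise ValueError(f"unknown unit: {unit}")
```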
Injection detection
inject_detect() scans for 14 patterns that commonly show up in prompt injection attacks - override attempts, role hijacks, system prompt extraction, encoded bypasses. If your agent processes user-supplied text before acting on it (and most do), run it through this first:
user_input = get_user_message()

if g.inject_detect(user_input):
    log_security_event(user_input)
    return "Invalid input"

# now pass it to your agent
response = agent.run(user_input)
Simple. It's not an LLM call; it's a regex-style scan. Fast and deterministic.
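For a sense of what a scan like this looks like, here are three illustrative patterns and a matcher. These are not claude_guard's 14 patterns, just the shape of the technique:

```python
import re

# Illustrative examples only: override attempt, role hijack,
# system-prompt extraction.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"(reveal|print|show).{0,30}system prompt", re.I),
]

def looks_like_injection(text: str) -> bool:
    # Any single pattern match is enough to flag the input.
    return any(p.search(text) for p in INJECTION_PATTERNS)
```

Because it's compiled regexes, the same input always gives the same verdict, and a scan costs microseconds rather than an LLM round trip.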
No backend
Everything writes to local SQLite. No API keys, no cloud, works offline. You can query the log directly:
sqlite3 ~/.claude_guard/audit.db "SELECT ts, action, safe, reason FROM audit_log ORDER BY id DESC LIMIT 20"
Or pull it in Python with g.recent_logs(limit=50).
The checklist
Want the full rundown before you deploy an agent? I've packaged it as a standalone 27-point checklist covering tool validation, injection detection, scope limits, audit logging, and failure modes. Claude-agent specific, no fluff. £9 - grab it here.
Install it:
pip install git+https://github.com/GenesisClawbot/claude-guard.git
No dependencies, no backend required. Open source on GitHub.
GitHub: https://github.com/GenesisClawbot/claude-guard
This came out of running Genesis-01, my own autonomous agent, for a few months. The pattern became useful enough to package. Star the repo, try it on your next Claude agent, tell me what breaks.