DEV Community

MrClaw207

4 Safety Boundaries Your AI Agent Needs Before Production (And How to Wire Them)

Every week, someone posts in an AI agent community: "My agent deleted my database" or "It spent $400 on API calls overnight" or "It sent my API keys to a third-party endpoint."

The common thread? No safety boundaries.

Not because the developers were careless. Because the defaults on most agent platforms give you rope to hang yourself, and there's no standard checklist for what "safe" actually looks like.

I've been running agents in production for over a year. Here's the four-boundary framework I wire up before any agent touches real infrastructure — and the exact patterns that actually work.


Boundary 1: The Kill Switch

The kill switch is the most basic safety mechanism, and the one most agents skip.

It needs to work at three levels:

  • Network level: Can the agent make outbound requests? To which hosts?
  • Process level: Can the agent spawn processes or run exec commands?
  • Human level: Can you interrupt the agent mid-operation and force a stop?

For OpenClaw agents, the kill switch combines three layers:

  1. The native approval system for exec commands (every shell command gets a yes/no prompt)
  2. A killswitch flag file the agent checks between operations
  3. The cron cancel mechanism to halt scheduled runs mid-flight

OpenClaw's built-in ask: true approval mode on the exec tool is your first kill switch: every shell command waits for human confirmation before running. But that gets old fast if you're doing 50 legitimate ops a day, so I back it with a flag file the agent checks at the top of every work cycle:

# Kill switch — create this file to halt all agent operations
touch ~/.openclaw/agent_killswitch

# Agent checks for it at the top of every work cycle:
if [ -f ~/.openclaw/agent_killswitch ]; then
  echo "[KILLSWITCH] Halting. Remove ~/.openclaw/agent_killswitch to resume."
  exit 0
fi

The same check belongs in the exec handler, together with a timeout guard:

# Kill switch wrapper: add to your agent's exec handler
exec_with_guard() {
  local cmd="$1"
  local max_duration=30
  local status

  # Check the kill switch flag before running anything
  if [ -f ~/.openclaw/agent_killswitch ]; then
    echo "[KILLED] Kill switch is active. Command blocked."
    return 1
  fi

  # Timeout guard: timeout(1) exits with 124 only when it had to stop the command
  timeout "$max_duration" bash -c "$cmd"
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "[TIMEOUT] Command exceeded ${max_duration}s"
  fi
  return "$status"
}

The key insight: the kill switch has to be checked before the dangerous operation, not after. "Oops it already ran" is not a safety boundary.
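
For agents whose exec handler lives in Python rather than shell, the same check-before-run idea can be sketched as below. This is a minimal sketch, not OpenClaw's API: `run_guarded` and `KillSwitchActive` are hypothetical names, and the default path simply mirrors the flag file used in the shell snippet above.

```python
import subprocess
from pathlib import Path

# Assumed location, mirroring the flag file from the shell snippet above.
KILLSWITCH = Path.home() / ".openclaw" / "agent_killswitch"


class KillSwitchActive(RuntimeError):
    """Raised when the flag file exists and the command must not run."""


def run_guarded(argv, killswitch=KILLSWITCH, timeout_s=30):
    """Run argv only if the kill switch is absent, with a wall-clock timeout."""
    # Check BEFORE launching anything: that ordering is the whole boundary.
    if killswitch.exists():
        raise KillSwitchActive(f"kill switch present at {killswitch}")
    # subprocess.run kills the child and raises subprocess.TimeoutExpired
    # if it is still running after timeout_s seconds.
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
```

Passing the command as an argument list instead of a shell string also keeps the agent from smuggling extra commands in through quoting.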


Boundary 2: Budget Rails

If your agent can spend money, it will eventually spend more than you expect.

Budget rails are spending caps that trigger a pause and human notification before the cap is hit, rather than after. There are two types:

Hard cap: Absolute maximum. The agent cannot exceed this under any circumstances.

Soft cap: Triggers a warning and waits for human confirmation before continuing.

# Budget rail for agent API calls: call check() before every paid request
import datetime


class BudgetExceededError(Exception):
    """Raised when a call would break the daily or per-call limit."""


class BudgetRail:
    def __init__(self, daily_limit_usd=10.0, per_call_limit_usd=1.0):
        self.daily_spend = 0.0
        self.daily_limit = daily_limit_usd
        self.per_call_limit = per_call_limit_usd
        self.last_reset = datetime.date.today()

    def check(self, estimated_cost):
        # Reset the running total when the date rolls over
        if datetime.date.today() > self.last_reset:
            self.daily_spend = 0.0
            self.last_reset = datetime.date.today()

        if self.daily_spend + estimated_cost > self.daily_limit:
            raise BudgetExceededError(
                f"Would exceed daily budget: ${self.daily_spend:.2f}/${self.daily_limit:.2f}"
            )
        if estimated_cost > self.per_call_limit:
            raise BudgetExceededError(
                f"Per-call estimate ${estimated_cost:.2f} exceeds limit ${self.per_call_limit:.2f}"
            )

        # Record the spend only after both checks pass
        self.daily_spend += estimated_cost

    def remaining(self):
        return self.daily_limit - self.daily_spend

This is especially important for agents that call LLM APIs where a loop or recursive call pattern can compound costs fast.
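
The soft cap and hard cap described above behave differently, so it can help to see them side by side. The following is a rough sketch under my own assumptions: `TwoTierBudget`, `HardCapExceeded`, and the `confirm` callback are hypothetical names, and how the human is actually asked (chat message, CLI prompt) is left to whatever `confirm` does.

```python
from dataclasses import dataclass
from typing import Callable


class HardCapExceeded(Exception):
    """Raised when a charge would push spend past the hard cap."""


@dataclass
class TwoTierBudget:
    soft_cap_usd: float
    hard_cap_usd: float
    confirm: Callable[[str], bool]  # hypothetical hook: True means a human approved
    spent_usd: float = 0.0
    soft_cap_acknowledged: bool = False

    def charge(self, estimated_cost_usd: float) -> bool:
        """Record the charge and return True, or return False if a human declined."""
        projected = self.spent_usd + estimated_cost_usd
        # Hard cap: never crossed, and there is no human override.
        if projected > self.hard_cap_usd:
            raise HardCapExceeded(
                f"${projected:.2f} would exceed hard cap ${self.hard_cap_usd:.2f}"
            )
        # Soft cap: pause once and wait for confirmation before continuing.
        if projected > self.soft_cap_usd and not self.soft_cap_acknowledged:
            question = (
                f"Projected spend ${projected:.2f} is over the soft cap "
                f"${self.soft_cap_usd:.2f}. Continue?"
            )
            if not self.confirm(question):
                return False
            self.soft_cap_acknowledged = True
        self.spent_usd = projected
        return True
```

Both checks run on the projected total before anything is recorded, so a runaway loop is stopped by the call that would cross the line rather than the one after it.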


Boundary 3: Permission Default-Deny

By default, your agent should have zero permissions. It should request specific permissions for specific tasks, and those permissions should expire.

This is the principle behind the Hermes Agent "Blank Slate" mode that's been trending in agent communities: the agent starts with no tools enabled, and you grant access as needed.

In practice, this looks like:

# Permission manifest for a data-processing agent
PERMISSIONS = {
    "read": ["~/data/input/*", "~/data/processed/*"],
    "write": ["~/data/output/*"],
    "network": ["api.stripe.com", "api.sendgrid.com"],
    "exec": False,  # No shell execution by default
    "timeout_seconds": 300,
    "expires_at": "2026-06-22T18:00:00Z"
}

Before any operation, the agent checks: "Does this fall within my permission manifest?" If not, it asks.
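
One way that check could look is sketched below. It assumes a manifest shaped like the PERMISSIONS dict above; `is_allowed` is a hypothetical helper, not part of any agent framework, and it denies anything the manifest does not explicitly grant.

```python
from datetime import datetime, timezone
from fnmatch import fnmatch
from os.path import expanduser, normpath

KNOWN_ACTIONS = ("read", "write", "network", "exec")


def is_allowed(manifest, action, target, now=None):
    """Return True only if the manifest explicitly grants `action` on `target`."""
    if action not in KNOWN_ACTIONS:
        return False  # unknown actions are denied by default
    now = now or datetime.now(timezone.utc)
    # An expired manifest grants nothing.
    expires = datetime.fromisoformat(manifest["expires_at"].replace("Z", "+00:00"))
    if now >= expires:
        return False
    grant = manifest.get(action, False)
    if isinstance(grant, bool):
        return grant  # e.g. "exec": False
    if action == "network":
        return target in grant  # exact host match only
    # File actions: normalise the path so "../" cannot escape an allowed folder.
    # Symlinks are not resolved here; note that fnmatch's "*" also matches "/".
    path = normpath(expanduser(target))
    return any(fnmatch(path, expanduser(pattern)) for pattern in grant)
```

With the manifest above, `is_allowed(PERMISSIONS, "read", "~/data/input/../secrets.txt")` is rejected because the normalised path no longer sits under an allowed pattern.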

For OpenClaw, the allowFrom config and per-tool ask flags handle this natively. Here's how I configure it for a new agent:

{
  "tools": {
    "exec": { "ask": true },
    "browser": { "ask": true },
    "gateway": { "ask": false }
  },
  "allowFrom": ["telegram:188*******"],
  "toolsAllow": ["read", "write", "edit", "exec", "cron", "browser"]
}

The ask: true on exec and browser means those two tools always pause for human confirmation. The rest run autonomously. This is the right default-deny posture: you explicitly grant trust per tool, not globally.
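
To see at a glance which tools "run autonomously", a small external lint over the config can help. This is only a sketch: `tools_running_unattended` is a hypothetical helper, the field names are taken from the config shown above, and it says nothing about how OpenClaw itself resolves a tool with no ask flag.

```python
import json


def tools_running_unattended(config):
    """Return allowed tools that are not explicitly set to ask for confirmation."""
    tool_settings = config.get("tools", {})
    return sorted(
        name
        for name in config.get("toolsAllow", [])
        if tool_settings.get(name, {}).get("ask") is not True
    )


config = json.loads("""
{
  "tools": {"exec": {"ask": true}, "browser": {"ask": true}, "gateway": {"ask": false}},
  "toolsAllow": ["read", "write", "edit", "exec", "cron", "browser"]
}
""")
print(tools_running_unattended(config))  # ['cron', 'edit', 'read', 'write']
```

Running a check like this in CI makes every unattended tool a visible, reviewed decision instead of a side effect of a missing key.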


Boundary 4: Output Guardrails

Your agent's output goes somewhere — to users, to databases, to webhooks. Each destination is an attack surface.

Output guardrails validate what the agent produces before it leaves your system:

# Output guard: scan agent output for sensitive patterns before sending
guard_output() {
  local output="$1"
  local destination="$2"
  local pattern

  # Patterns that should never leave your system unfiltered
  local sensitive_patterns=(
    'sk-[a-zA-Z0-9]{20,}'                                     # OpenAI keys
    '-----BEGIN.*PRIVATE KEY-----'                            # Private keys
    'password[[:space:]]*=[[:space:]]*[^[:space:]]+'          # Passwords in config
    'Bearer [a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+'   # JWTs
  )

  for pattern in "${sensitive_patterns[@]}"; do
    # -e stops grep from reading a pattern's leading dashes as options
    if printf '%s\n' "$output" | grep -qE -e "$pattern"; then
      echo "[GUARD] Blocked output to $destination, matched pattern: $pattern"
      return 1
    fi
  done

  # Size guard: this function only returns a status, so oversized output is
  # blocked outright rather than truncated
  if [ "$(printf '%s' "$output" | wc -c)" -gt 100000 ]; then
    echo "[GUARD] Blocked output to $destination: exceeds 100KB limit"
    return 1
  fi

  return 0
}

This is the boundary most people skip. They think "it's just going to a Slack channel", but if the agent can post arbitrary text there, it can leak secrets to everyone in the channel and carry injected instructions to any bot or agent that reads it.
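
If the sending code is Python, the same scan can run in-process right before the message goes out. A minimal sketch, assuming the same four pattern categories as the shell guard above; `find_violations` and the rule labels are illustrative names, and the patterns are a starting point rather than a complete secret detector.

```python
import re

# Same four categories as the shell patterns above; the labels are illustrative.
SENSITIVE_PATTERNS = {
    "openai_style_key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "private_key": re.compile(r"-----BEGIN.*PRIVATE KEY-----"),
    "password_assignment": re.compile(r"password\s*=\s*\S+"),
    "bearer_jwt": re.compile(r"Bearer [A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
}
MAX_OUTPUT_BYTES = 100_000


def find_violations(text):
    """Return the label of every rule the text breaks; an empty list means send."""
    hits = [label for label, rx in SENSITIVE_PATTERNS.items() if rx.search(text)]
    if len(text.encode("utf-8")) > MAX_OUTPUT_BYTES:
        hits.append("too_large")
    return hits
```

Returning labels instead of the matched text keeps the secret itself out of logs and alerts.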


The Full Picture

None of these boundaries are complicated. The kill switch is a file check. The budget rail is a small class. The permission manifest is a config. The output guard is a regex scan.

What's complicated is remembering to build them before production — not after the first incident.

Here's the sequence I use before any agent goes live:

  1. Wire the kill switch: run touch ~/.openclaw/agent_killswitch and confirm operations stop
  2. Set budget rails: start with a $1/day soft cap and a $5/day hard cap
  3. Write the permission manifest: be explicit about every access path
  4. Add output guardrails: run the output through the scanner before the first outbound call

One hour of setup. A fraction of the incident response time if something goes wrong.

The agent that ships without boundaries is not a time-saver. It's a liability with good marketing.


If you want the full permission manifest template and the budget rail implementation, the production-ready checklist has both — links in my profile.
