Diven Rastdus
Why Your AI Agent Needs a Kill Switch (and How to Build One)

Your AI agent just spent $400 on API calls because it got stuck in a retry loop at 3 AM. Nobody was watching. The monitoring dashboard? It sent an alert to a Slack channel nobody checks on weekends.

This happens more often than anyone admits. Agents that loop endlessly, agents that send duplicate emails to clients, agents that overwrite production configs because the LLM hallucinated a file path. The failure mode of autonomous agents is not that they stop working. The failure mode is that they keep working, confidently, in the wrong direction.

If you are building agents that run without constant human supervision, you need kill switches. Not as an afterthought. As core infrastructure.

The Three Layers of Agent Safety

After running autonomous agents in production for months, I have landed on three layers that catch different failure modes:

Layer 1: Budget and rate limits (catches runaway costs)
Layer 2: Behavioral guardrails (catches wrong actions)
Layer 3: Watchdog processes (catches silent failures)

Each layer is independent. If one fails, the others still protect you.

Layer 1: Budget Circuit Breakers

The simplest kill switch is a spending cap. Before every API call, check accumulated cost against a threshold.

class BudgetExceededError extends Error {}

class BudgetCircuitBreaker {
  private spent = 0;
  private readonly limit: number;

  constructor(limitUSD: number) {
    this.limit = limitUSD;
  }

  async call<T>(fn: () => Promise<T>, estimatedCost: number): Promise<T> {
    if (this.spent + estimatedCost > this.limit) {
      throw new BudgetExceededError(
        `Budget exhausted: $${this.spent.toFixed(2)} / $${this.limit}`
      );
    }
    // Reserve the estimate before awaiting, so concurrent calls cannot
    // race past the cap while one is still in flight
    this.spent += estimatedCost;
    return fn();
  }
}

// Usage: hard cap at $5 per agent run
const budget = new BudgetCircuitBreaker(5.0);
const response = await budget.call(
  () => anthropic.messages.create({ model: 'claude-sonnet-4-6', ... }),
  0.015 // estimated cost per call
);

This is table stakes. Every production agent should have this. But cost limits alone are not enough, because an agent can do massive damage on a single cheap API call (sending a wrong email, deleting a file, posting to social media).

Layer 2: Behavioral Guardrails (Pre-execution Hooks)

This is where it gets interesting. Instead of just limiting how much the agent can spend, you limit what it can do.

The pattern: intercept every tool call before execution. Check it against a set of rules. Block or modify if needed.

# Pre-execution hook system
from typing import Callable

Rule = Callable[[str, dict], tuple[bool, str]]  # (tool_name, tool_input) -> (allow, reason)

class GuardRail:
    def __init__(self):
        self.rules: list[Rule] = []

    def add_rule(self, check_fn):
        """Each rule receives (tool_name, tool_input) and returns
        (allow: bool, reason: str)"""
        self.rules.append(check_fn)

    def check(self, tool_name: str, tool_input: dict) -> tuple[bool, str]:
        for rule in self.rules:
            allowed, reason = rule(tool_name, tool_input)
            if not allowed:
                return False, reason
        return True, "ok"


guard = GuardRail()

# Block destructive file operations
guard.add_rule(lambda tool, inp: (
    False, "Blocked: rm -rf or force delete"
) if tool == "bash" and any(
    cmd in inp.get("command", "")
    for cmd in ["rm -rf", "DROP TABLE", "git push --force"]
) else (True, "ok"))

# Block emails to external domains during testing
guard.add_rule(lambda tool, inp: (
    False, "Blocked: external email in test mode"
) if tool == "send_email" and not inp.get("to", "").endswith("@yourcompany.com")
else (True, "ok"))

# Enforce rate limits on outbound actions
from collections import defaultdict
import time

action_timestamps = defaultdict(list)

def rate_limit_rule(tool: str, inp: dict) -> tuple[bool, str]:
    now = time.time()
    # Clean old timestamps (older than 60s)
    action_timestamps[tool] = [
        t for t in action_timestamps[tool] if now - t < 60
    ]
    if len(action_timestamps[tool]) >= 5:
        return False, f"Rate limited: {tool} called 5+ times in 60s"
    action_timestamps[tool].append(now)
    return True, "ok"

guard.add_rule(rate_limit_rule)

The key insight: guardrails should be deterministic, not probabilistic. Do not ask another LLM to judge whether an action is safe. Use pattern matching, allowlists, and hard rules. An LLM judging another LLM is just adding more stochastic behavior to a system that already has too much.
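A file-path allowlist is a good example of a purely deterministic check. Here is a minimal sketch that plugs into the GuardRail pattern above; the directory names and the `write_file` tool name are illustrative:

```python
from pathlib import Path

# Illustrative allowlist: the agent may only write inside these directories
ALLOWED_WRITE_DIRS = [Path("/opt/agent/workspace"), Path("/tmp/agent")]

def path_allowlist_rule(tool: str, inp: dict) -> tuple[bool, str]:
    """Allow file writes only inside explicitly approved directories."""
    if tool != "write_file":
        return True, "ok"
    # Resolve symlinks and ".." segments before comparing
    target = Path(inp.get("path", "")).resolve()
    if any(target.is_relative_to(d) for d in ALLOWED_WRITE_DIRS):
        return True, "ok"
    return False, f"Blocked: write outside allowlist: {target}"
```

No model in the loop, no ambiguity: the same input always produces the same verdict.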

Rules I have found essential in production:

Block destructive shell commands: rm -rf /, DROP TABLE, force push
Rate limit outbound actions: emails, API calls, social posts, max N per minute
Allowlist file paths: the agent can only write to specific directories
Block duplicate actions: hash recent actions, reject exact repeats
Require human approval above a threshold: any spend over $20, any public-facing action
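The duplicate-action rule, for instance, can be a small hash check that slots straight into the GuardRail above. A minimal sketch (in production you would expire old hashes rather than keep the set forever):

```python
import hashlib
import json

# Hashes of recently executed (tool, input) pairs
_recent_hashes: set[str] = set()

def duplicate_action_rule(tool: str, inp: dict) -> tuple[bool, str]:
    """Reject a tool call that exactly repeats a recent one."""
    # Canonical JSON so key order does not change the hash
    digest = hashlib.sha256(
        json.dumps({"tool": tool, "input": inp}, sort_keys=True).encode()
    ).hexdigest()
    if digest in _recent_hashes:
        return False, f"Blocked: duplicate {tool} call"
    _recent_hashes.add(digest)
    return True, "ok"
```

This is what catches the duplicate-email failure mode from the intro: the second identical `send_email` call is rejected before it executes.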

Layer 3: Watchdog Processes

Layers 1 and 2 catch bad actions. Layer 3 catches inaction, where the agent silently dies, gets stuck in a loop, or stops making progress.

The pattern: a separate process monitors the agent's heartbeat. If the heartbeat goes stale, the watchdog takes action (restart, alert, or kill).

import json
import time
import subprocess
from pathlib import Path

HEARTBEAT_FILE = Path("./runtime/heartbeat.json")
STALE_THRESHOLD = 300  # 5 minutes
CHECK_INTERVAL = 60

def write_heartbeat():
    """Called by the agent after every successful action."""
    HEARTBEAT_FILE.write_text(json.dumps({
        "timestamp": time.time(),
        "status": "alive",
        "last_action": "tool_call_completed"
    }))

def watchdog_loop():
    """Runs as a separate process (cron job or systemd service)."""
    while True:
        try:
            heartbeat = json.loads(HEARTBEAT_FILE.read_text())
            age = time.time() - heartbeat["timestamp"]

            if age > STALE_THRESHOLD:
                print(f"STALE heartbeat ({age:.0f}s). Restarting agent...")
                # Kill the stuck process
                subprocess.run(["pkill", "-f", "agent_main.py"])
                time.sleep(2)
                # Restart
                subprocess.Popen(["python3", "agent_main.py"])

        except (FileNotFoundError, json.JSONDecodeError):
            print("No heartbeat file. Agent may not have started.")

        time.sleep(CHECK_INTERVAL)

Run the watchdog as a cron job or systemd timer, not inside the agent process itself. The whole point is that it is independent. If the agent crashes, the watchdog still runs.

# cron: check every 5 minutes
*/5 * * * * /usr/bin/python3 /opt/agent/watchdog.py --check-once
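If you prefer systemd, the equivalent is a oneshot service fired by a timer. A sketch, with illustrative unit names:

```ini
# /etc/systemd/system/agent-watchdog.service
[Unit]
Description=Agent heartbeat watchdog

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /opt/agent/watchdog.py --check-once

# /etc/systemd/system/agent-watchdog.timer
[Unit]
Description=Run the agent watchdog every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now agent-watchdog.timer`.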

Putting It All Together

Here is how the three layers compose in a real agent:

User Request
    |
    v
[Budget Check] -- over limit? --> STOP, alert human
    |
    v
[Tool Call Planned]
    |
    v
[Guardrail Check] -- blocked? --> Log, skip, try alternative
    |
    v
[Execute Tool]
    |
    v
[Write Heartbeat]
    |
    v
[Watchdog monitors heartbeat externally]

The agent itself only knows about layers 1 and 2. Layer 3 runs outside the agent, which is the whole point. You cannot trust a broken agent to report that it is broken.
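The flow above can be sketched as a single agent-side step function. The `check` and `execute` callables and the signature are illustrative, not from a real framework; `check` is any GuardRail-style rule set, and `execute` performs the tool call and returns its estimated cost:

```python
import json
import time
from pathlib import Path

HEARTBEAT_FILE = Path("./runtime/heartbeat.json")

def run_step(tool: str, inp: dict, spent: float, limit: float,
             check, execute) -> float:
    """One agent step: budget check -> guardrail -> execute -> heartbeat.

    Returns the updated running spend in USD.
    """
    # Layer 1: budget circuit breaker
    if spent >= limit:
        raise RuntimeError(f"Budget exhausted: ${spent:.2f} / ${limit:.2f}")

    # Layer 2: behavioral guardrail; a block is logged and skipped, not fatal
    allowed, reason = check(tool, inp)
    if not allowed:
        print(json.dumps({"event": "guardrail_block", "tool": tool, "reason": reason}))
        return spent

    cost = execute(tool, inp)

    # Layer 3 hook: write the heartbeat the external watchdog monitors
    HEARTBEAT_FILE.parent.mkdir(parents=True, exist_ok=True)
    HEARTBEAT_FILE.write_text(json.dumps({"timestamp": time.time(), "status": "alive"}))
    return spent + cost
```

Note what is absent: nothing in this function knows the watchdog exists. It only leaves evidence (the heartbeat) for the external process to judge.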

The Boring Stuff That Matters

A few operational details that matter more than the architecture:

Log everything. Every tool call, every guardrail check, every heartbeat write. When something goes wrong at 3 AM, the logs are all you have. Structured JSON, not print statements.
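A minimal structured-logging sketch using the stdlib logging module; the `fields` attribute convention here is my own, not a logging feature:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON line, merging in any extra fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge structured context attached via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One JSON line per event, with machine-readable context
logger.info("guardrail_block", extra={"fields": {"tool": "bash", "reason": "rm -rf"}})
```

At 3 AM, `grep '"event": "guardrail_block"'` over JSON lines beats scrolling through freeform prints.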

Test your kill switches. Intentionally trigger every guardrail in staging. Send a command that should be blocked. Let the heartbeat go stale. Verify the watchdog restarts correctly. Kill switches that have never fired are kill switches that do not work.

Degrade gracefully. When a guardrail blocks an action, the agent should not crash. It should log the block, skip the action, and continue with the next step. A blocked action is information, not an error.

Make guardrails fast. If your pre-execution check takes 500ms and the agent makes 100 tool calls per task, you have added 50 seconds of latency. Pattern matching and in-memory checks should take microseconds.

What This Does Not Solve

Kill switches protect against known failure modes. They do not protect against novel failures you have not anticipated. An agent that finds a creative way to cause damage without triggering any of your rules is still dangerous.

The mitigation is defense in depth. Budget limits catch cost overruns. Guardrails catch known bad actions. Watchdogs catch stuck processes. And behind all of them: the principle of least privilege. Give the agent access only to what it needs. Not a root shell. Not admin API keys. Not write access to production databases.

The boring answer is usually the right one. Tighter permissions beat smarter guardrails.


I build production AI agent systems. If you are working on autonomous agents and want to discuss architecture, check out astraedus.dev.
