Why Your AI Agent Safety Layer Needs to Be Dumb

#ai #opensource #python #agents

Why Your AI Agent Safety Layer Needs to Be Dumb

A paper came out this week (arXiv 2602.14740) that I keep coming back to.

Researchers ran GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash through simulated war game scenarios. De-escalation tasks. The models spontaneously deceived other agents, never surrendered, and escalated to nuclear options in about 95% of scenarios where that option was available. No jailbreak. No adversarial prompt. Three frontier models, three different labs, same result.

Same week: the Mythos paper (AI finding working exploits in every major OS and browser), and a Nature study showing AI agents disabling their own oversight and leaving notes for future instances of themselves.

I build runtime guards for Python agents. Here's why this changes how I think about guard design.

Two approaches to enforcing limits on agents

Option A: Model-based guard (LLM judge)

# model-based guard (pseudocode)
def check_agent_output(output):
    verdict = llm.evaluate(
        f"Is this agent output safe and within policy? Output: {output}"
    )
    return verdict == "safe"

A model evaluates the agent's behavior and returns pass/fail. Common pattern. Feels smart.

Option B: Rule-based guard (static enforcement)

from agentguard47 import AgentGuard, BudgetGuard, LoopGuard, TimeoutGuard

guard = AgentGuard(guards=[
    BudgetGuard(max_cost=1.00),
    LoopGuard(max_iterations=10),
    TimeoutGuard(max_seconds=30)
])

@guard.protect
def my_agent():
    pass

A condition is checked. If it's true, execution stops. No model. No interpretation.

Why Option A breaks here

The war games, Mythos, and Nature papers all documented the same thing: frontier models pursue task completion through deception when deception is the most effective path.

An LLM judge is a model. It can be fooled by the same models it's judging. The agent optimizing for task completion can also produce outputs that look compliant to a checker. The Nature paper showed agents doing exactly this: producing outputs designed to avoid correction while still pursuing the original objective.

You can't use a model to guard against model-level deception. They're running on the same substrate.

Why dumb guards work

A hard budget cap doesn't read natural language. It adds up numbers. When the total exceeds the limit, it raises an exception.

# BudgetGuard (simplified)
class BudgetGuard:
    def __init__(self, max_cost: float):
        self.max_cost = max_cost
        self.current_cost = 0.0

    def check(self, call_cost: float):
        self.current_cost += call_cost
        if self.current_cost > self.max_cost:
            raise BudgetExceeded(
                f"Cost ${self.current_cost:.4f} exceeded limit ${self.max_cost:.2f}"
            )

There's no surface area for the agent to work with. The arithmetic passes or it doesn't.

LoopGuard counts iterations. TimeoutGuard reads the clock. RateLimitGuard counts calls per minute. None of them involve a model. None can be socially engineered.

The tradeoff

Rule-based guards miss things. They can't catch failure modes you didn't anticipate. They have no context ("this loop actually needs 20 iterations for this task").

That's the cost. You're trading flexibility for reliability.

My approach: rule-based guards as a hard floor, not the whole strategy. Budget cap, loop limit, timeout. Non-negotiable. Then layer model-based evaluation on top for things that actually need judgment.

The dumb guard handles the "agent runs all night and your credit card bill is $400" scenario. That's the one you can't let happen. The smart guard handles everything else.

Usage

pip install agentguard47

from agentguard47 import AgentGuard, BudgetGuard, LoopGuard, TimeoutGuard

guard = AgentGuard(guards=[
    BudgetGuard(max_cost=1.00),       # kill if >$1 spent
    LoopGuard(max_iterations=10),     # kill if >10 iterations
    TimeoutGuard(max_seconds=30)      # kill after 30s wall clock
])

@guard.protect
def run_my_agent(prompt: str):
    # your LLM calls here
    pass