DEV Community

Msatfi
I Ran 14 Agent Failure Scenarios Through a Guard Layer (With Cost Data From Every Run)

How do you know if your agent is looping or actually working?

I've been stress-testing AI agents against known failure modes (tool loops, duplicate side-effects, retry storms) and I built a middleware layer to catch them. Then I measured what it actually caught across 14 scenarios and 25 real-model runs. Here's the data.

The problem

Agents generate activity. Tool calls, search results, reformulated queries. It looks like work. But after enough runs, I kept finding the same three things:

  • Agent searches "refund policy", then "refund policy EU", then "refund policy EU Germany 2024". Each query is different. Same results every time.
  • Agent issues a refund, gets a timeout, retries. Customer refunded twice.
  • Agent A asks Agent B for help. Agent B asks Agent A for clarification. Back and forth forever.

max_steps doesn't help. It can't tell productive calls from loops: set it too low and you kill good workflows; set it too high and you burn money.

What the guard does

AuraGuard sits between the agent and its tools. Every tool call goes through it. It checks the call signature against rolling history and returns a decision: ALLOW, CACHE, BLOCK, REWRITE, ESCALATE, or FINALIZE.

No LLM calls. Deterministic heuristics only. HMAC signatures, token-set overlap, counters, sequence matching.

8 primitives run in sequence on every call:

  1. Identical repeat detection
  2. Argument jitter detection (token-set overlap)
  3. Error retry circuit breaker
  4. Side-effect idempotency ledger
  5. Stall/no-state-change detection
  6. Cost budget enforcement
  7. Per-tool policy layer
  8. Multi-tool sequence loop detection

Zero dependencies. Stdlib Python only.
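To make the token-set idea concrete, here's a minimal sketch of what jitter detection could look like. This is my own illustration, not AuraGuard's code; the 0.6 threshold and 5-call window are assumptions:

```python
from collections import deque

def token_set_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

class JitterDetector:
    """Flags a tool call whose arguments overlap heavily with recent calls."""

    def __init__(self, threshold: float = 0.6, window: int = 5):
        self.threshold = threshold
        self.history: deque[str] = deque(maxlen=window)

    def check(self, args: str) -> bool:
        """Return True if this call looks like a jittered repeat."""
        jitter = any(token_set_overlap(args, prev) >= self.threshold
                     for prev in self.history)
        self.history.append(args)
        return jitter

det = JitterDetector()
det.check("refund policy")     # False: first call, nothing in history
det.check("refund policy EU")  # True: 2/3 token overlap with the previous call
```

Exact-string matching would pass both queries; the token-set view sees them as near-duplicates.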

What the benchmark output looks like

14 synthetic scenarios. Each one replays a specific failure pattern. No LLM involved. This measures detection accuracy, not model behavior.

aura-guard bench --all
Scenario                             No Guard   Aura Guard    Saved
────────────────────────────────────────────────────────────────────
KB Query Jitter Loop                    $0.32        $0.24      25%
Double Refund Attempt                   $0.16        $0.08      50%
Error Retry Spiral                      $0.40        $0.26      35%
CRM Lookup Cascade                      $0.36        $0.24      33%
Stall + Apology Spiral                  $0.04        $0.06     -50%
Mixed Degradation                       $0.40        $0.28      30%
RAG Retrieval Loop                      $0.40        $0.28      30%
Ticket Lookup Cascade                   $0.32        $0.20      38%
Side-Effect Storm                       $0.28        $0.16      43%
Budget Overrun                          $0.80        $0.76       5%
Healthy Workflow (FP check)             $0.20        $0.20       0%
Ping-Pong Delegation Loop               $0.40        $0.30      25%
Circular 3-Agent Delegation             $0.48        $0.40      17%
Mixed Normal + Sequence Loop            $0.44        $0.38      14%
────────────────────────────────────────────────────────────────────
TOTAL                                   $5.00        $3.94      21%

What I learned from the data

"Stall + Apology Spiral" costs more with the guard: $0.02 of overhead. The guard adds an intervention turn, then escalates. Without the guard, the agent loops forever. A small cost increase in exchange for guaranteed termination.

"Healthy Workflow" shows 0% savings. This scenario exists to verify zero false positives. Five normal tool calls, no loops. The guard allows all of them. If this ever goes above 0%, the thresholds are wrong.

Budget Overrun only saves 5%. The guard escalates on the 19th of 20 calls. Most of the budget is already spent. Budget enforcement catches the overrun. It doesn't prevent the spend leading up to it.
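The shape of that trade-off is easy to reproduce with a toy budget tracker. This is illustrative only: the cap and per-call cost are invented to mirror the scenario, and integer cents avoid float drift:

```python
class CostBudget:
    """Tracks cumulative spend in cents and escalates once it reaches the cap."""

    def __init__(self, max_cost_cents: int):
        self.max_cost_cents = max_cost_cents
        self.spent = 0

    def record(self, call_cost_cents: int) -> str:
        self.spent += call_cost_cents
        # Escalation only fires once spend reaches the cap, which is why it
        # limits the damage rather than preventing the run-up to it.
        return "ESCALATE" if self.spent >= self.max_cost_cents else "ALLOW"

budget = CostBudget(max_cost_cents=76)
decisions = [budget.record(4) for _ in range(20)]  # twenty 4-cent calls
decisions.index("ESCALATE")  # 18 → first escalation on the 19th call
```

Eighteen calls sail through before the cap trips, so a pure budget check can only ever save the tail of the run.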

Side-effect scenarios prevent business damage, not token waste. "Double Refund Attempt" blocks the duplicate. "Side-Effect Storm" blocks 3 of 6 mutations. These are prevented duplicate charges, not cost optimizations.
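The idempotency idea is simple to sketch: sign each side-effecting call with an HMAC over its canonicalized arguments, and refuse to re-run a signature you've already executed. This illustrates the concept, not the library's internals:

```python
import hashlib
import hmac
import json

class IdempotencyLedger:
    """Blocks a side-effecting call whose signature was already executed."""

    def __init__(self, secret_key: bytes):
        self.secret = secret_key
        self.seen: set[str] = set()

    def signature(self, tool: str, **kwargs) -> str:
        # Canonical JSON so argument order doesn't change the signature.
        payload = json.dumps({"tool": tool, "args": kwargs}, sort_keys=True)
        return hmac.new(self.secret, payload.encode(), hashlib.sha256).hexdigest()

    def decide(self, tool: str, **kwargs) -> str:
        sig = self.signature(tool, **kwargs)
        if sig in self.seen:
            return "BLOCK"  # duplicate side effect: refuse to re-run
        self.seen.add(sig)
        return "ALLOW"

ledger = IdempotencyLedger(secret_key=b"your-key")
ledger.decide("refund", order_id="A123")  # "ALLOW"
ledger.decide("refund", order_id="A123")  # "BLOCK" — the retry after a timeout
```

The timeout-then-retry case from the intro is exactly the second call: same tool, same arguments, same signature, blocked.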

Real-model A/B test

5 scenarios against Claude Sonnet. 5 runs per variant. Real API calls. Tools rigged to trigger failures.

Scenario               No guard   With guard   Result
─────────────────────────────────────────────────────────────────
Jitter loop               $0.28        $0.14   48% saved
Double refund             $0.14        $0.15   Duplicate prevented at +$0.01
Error retry spiral        $0.13        $0.10   29% saved
Smart reformulation       $0.86        $0.15   83% saved
Combined flagship         $0.35        $0.14   59% saved

The "smart reformulation" scenario caught me off guard. The agent reformulated queries with different word order, synonyms, and added qualifiers. String matching would never catch it, but the token-set overlap stayed above 60%, so the guard did. 83% cost reduction.
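For intuition: two invented reformulations of the same query (not queries from the benchmark) still score well above 60% Jaccard overlap, even though exact-string matching sees them as completely different:

```python
def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

q1 = "refund policy for EU customers"
q2 = "EU customers refund policy details"

q1 == q2                  # False: string matching sees no repeat
round(overlap(q1, q2), 2)  # 0.67: four shared tokens out of six total
```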

64 interventions across 25 runs, zero false positives in manual review. The full JSON report is committed in the repo.

Caveats: tools were rigged. Controlled test, not production replay.

What the failure modes look like in aggregate

After hundreds of runs, three categories:

Exploration loops. 60% of interventions. The agent explores a search space with diminishing returns. Each query is slightly different. Each result is slightly different. The agent thinks it's making progress. It's not. Jitter detection and per-tool caps catch these.

Retry spirals. 25% of interventions. Tool fails. Agent retries. Fails again. Retries with modifications. The tool is down. No modification will fix it. Circuit breaker catches these.
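A circuit breaker for this pattern can be as small as a per-tool failure counter. This sketch is mine, with an assumed threshold of three; the key detail is that retries with modified arguments still count against the same tool:

```python
class RetryCircuitBreaker:
    """Opens after N consecutive failures of the same tool."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures: dict[str, int] = {}

    def on_result(self, tool: str, ok: bool) -> str:
        if ok:
            self.failures[tool] = 0  # success resets the breaker
            return "ALLOW"
        # Counted per tool, not per argument set: modified retries
        # don't reset it, because the tool itself is down.
        self.failures[tool] = self.failures.get(tool, 0) + 1
        if self.failures[tool] >= self.max_failures:
            return "BLOCK"
        return "ALLOW"

cb = RetryCircuitBreaker(max_failures=3)
[cb.on_result("crm_lookup", ok=False) for _ in range(3)]
# third consecutive failure trips the breaker
```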

Delegation loops. 15% of interventions. Multi-agent only. A asks B. B asks A. Repeat. Sequence detection catches these after the pattern repeats 3 times.
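Sequence detection reduces to checking whether the tail of the call history is a fixed cycle repeated three times. A minimal sketch of that check (mine, not the library's; a real guard would try several cycle lengths):

```python
def has_sequence_loop(calls: list[str], period: int = 2, repeats: int = 3) -> bool:
    """True if the last `period * repeats` calls repeat one fixed cycle."""
    needed = period * repeats
    if len(calls) < needed:
        return False
    tail = calls[-needed:]
    cycle = tail[:period]
    return all(tail[i] == cycle[i % period] for i in range(needed))

history = ["ask_agent_b", "ask_agent_a"] * 3
has_sequence_loop(history)  # True: B→A→B→A→B→A, three repeats of a 2-step cycle
```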

Category 1 is the most dangerous because it looks like productivity. The agent is "working." The logs show diverse calls. The summaries report progress. Only the token counter tells the truth.

How to use it

pip install aura-guard
from aura_guard import AgentGuard

guard = AgentGuard(
    secret_key=b"your-key",                  # key for HMAC call signatures
    side_effect_tools={"refund", "cancel"},  # calls the idempotency ledger protects
    max_cost_per_run=1.00,                   # budget before escalation
)

# Wrap any callable: the guard checks the call against its rolling
# history and decides (ALLOW, CACHE, BLOCK, ...) before it runs.
result = guard.run("search_kb", search_kb, query="refund policy")

Run the benchmark against your own failure patterns:

aura-guard bench --all

MCP support (Claude Desktop, Cursor, any MCP client):

pip install aura-guard[mcp]

Who this is for

Anyone running an AI agent that calls tools. Especially tools with side effects (payments, emails, cancellations) where a duplicate call causes real damage, not just wasted tokens.

You don't need a specific framework. The guard wraps any Python callable. OpenAI, LangChain, and MCP adapters are included if you want them.

Source: github.com/auraguardhq/aura-guard. 75 tests, zero dependencies.
