Binu George

Posted on Jun 3 • Originally published at aisecuritygateway.ai

Your AI Agent Just Burned $108 in an Hour. Here's the 50-Line Fix.

#devops #ai #langchain #python

Autonomous AI agents have a failure mode that every team discovers the hard way: infinite retry loops.

The agent sends a request. The model returns something the agent can't parse. The agent retries with the same prompt. Same response. Retry. Retry. Retry — hundreds of times before anyone notices.

The math is unforgiving: a single GPT-4-class agent loop at one request per second drains over $100 in an hour. Over a weekend with no one watching, that's $2,500+ before Monday morning.

If you're running LangChain, CrewAI, AutoGPT, or any custom agent framework in production, this will happen to you. The question is whether you catch it in 30 seconds or 30 hours.

Why agents loop

The causes are predictable across every framework:

# The classic loop: model output doesn't match expected format
while True:
    response = llm.invoke(prompt)
    try:
        result = parse_json(response)  # fails
        break
    except ParseError:
        prompt = f"That wasn't valid JSON. Try again: {prompt}"
        # Same prompt → same bad response → infinite loop

The specific triggers:

Parsing failures: The model returns output that doesn't match the expected format. The agent retries, hoping for a different result. It won't be.
Tool call errors: A tool returns an error. The agent tries the same call with the same parameters.
Hallucinated tool names: The model calls a tool that doesn't exist. The error goes back, and the model calls the same non-existent tool again.
"Let me try again" behavior: Some models, when told their output was wrong, rephrase the same answer — creating an infinite feedback loop.
Missing termination conditions: max_iterations set to 1,000, or not set at all.

Why `max_iterations` doesn't save you

Most frameworks offer max_iterations or similar parameters. The limitations:

Problem	`max_iterations`	Gateway-level detection
Protects multiple frameworks	No — per-framework	Yes — one chokepoint for all
Cross-session detection	No	Yes — shared state
Default is useful	Often 100-1000	Tight defaults, configurable
Sub-agent spawning	Bypassed	Still caught
Language-agnostic	No — Python only	Yes — HTTP layer

The fundamental issue: max_iterations is a per-framework, per-language, per-deployment setting. Gateway-level detection sits below all of it. Every request passes through the same chokepoint regardless of what generated it.

The detection algorithm

Here's the approach we use in AI Security Gateway. The core idea is fingerprinting + sliding window counting:

import hashlib
import json

def make_request_fingerprint(
    caller_id: str,
    model: str,
    messages: list[dict],
) -> str:
    """Build a deterministic fingerprint for a request.

    The idea: hash the caller identity, model, and the
    recent message content into a single fixed-length key.
    If the same key appears too often, it's a loop.
    """
    # Focus on the recent tail of the conversation —
    # full history changes naturally, but loops repeat
    # the same tail over and over
    TAIL_WINDOW = 3   # tune to your workload
    recent = messages[-TAIL_WINDOW:] if len(messages) > TAIL_WINDOW else messages

    texts = []
    for msg in recent:
        content = msg.get("content", "")
        # Flatten multimodal content to text-only
        if isinstance(content, list):
            content = " ".join(
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
        texts.append(str(content).strip().lower())

    blob = json.dumps(
        {"who": caller_id, "model": model, "texts": texts},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()

Why these design choices?

Fingerprint the tail, not the full conversation. The full message history changes naturally as a conversation evolves, but a looping agent repeats the same recent messages. Focusing on the tail catches loops without flagging normal multi-turn conversations.

Caller identity in the fingerprint. Two different users sending the same prompt are independent — separate counters per caller. One user's legitimate batch job doesn't trigger detection for another user.

Model in the fingerprint. Sending the same prompt to different models (e.g., trying GPT-4.1 then Claude) is legitimate fallback behavior, not a loop.

Normalize and lowercase. Prevents trivial variations (trailing whitespace, case changes) from evading detection.

The counter: atomic increment with TTL

The fingerprint feeds into a sliding-window counter. Here's the check logic:

async def is_looping(
    fingerprint: str,
    cache,           # Redis-compatible async client
    window: int,     # sliding window in seconds
    threshold: int,  # max allowed identical requests
    cooldown: int,   # block duration after detection
) -> bool:
    """Check if a fingerprint indicates a runaway loop.

    Uses atomic INCR so this works correctly across
    horizontally-scaled instances sharing a cache.
    """
    # Fast path: already in cooldown from a previous trigger?
    if await cache.get(f"cool:{fingerprint}"):
        return True

    # Atomic increment — each call bumps the count by 1.
    # The TTL means the counter auto-expires after `window`
    # seconds, so it's a natural sliding window.
    count = await cache.incr(f"cnt:{fingerprint}")
    if count == 1:
        await cache.expire(f"cnt:{fingerprint}", window)

    if count > threshold:
        # Enter cooldown — block requests for this fingerprint
        # even after the counter key expires
        await cache.setex(f"cool:{fingerprint}", cooldown, 1)
        return True

    return False

The key properties:

Atomic INCR — no race conditions when multiple proxy instances share the same cache
TTL on the counter — the window auto-expires, no cleanup cron needed
Separate cooldown key — once a loop is detected, the block persists even after the counter key expires. This prevents the agent from resuming the loop after the window resets.
Distributed state — when backed by a Redis-compatible store, an agent sending requests to different proxy instances is still caught. For single-instance setups, an in-memory backend works too.

The response

When a loop is detected, the client gets a structured, actionable error:

{
  "detail": {
    "error": "recursive_loop_detected",
    "message": "Blocked: repetitive request pattern detected. This usually indicates an agent retry loop.",
    "cooldown_seconds": 30
  }
}

HTTP 429 (not 500) — because it's a client-side issue that the client should handle. The structured error field lets your agent framework catch it specifically:

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="your-key",
)

try:
    response = client.chat.completions.create(
        model="oah/gpt-4.1-mini",
        messages=messages,
    )
except RateLimitError as e:
    if "recursive_loop_detected" in str(e):
        # Agent is looping — stop retrying, alert the team
        notify_slack("Agent loop detected, halting execution")
        raise SystemExit(1)
    raise  # Normal rate limit — retry with backoff

What doesn't trigger detection

This matters as much as what does:

Normal conversation: Users sending different messages to the same model — the message content changes, so the fingerprint changes. Never triggered.
Batch processing: Same prompt to different models — model is part of the fingerprint, independent counters.
Different users: Two users sending the same prompt — caller identity is part of the fingerprint, independent counters.
Genuine content changes: Conversations where content evolves naturally produce different fingerprints on each turn. The system catches repetitive identical patterns, not normal dialogue.

In production across real traffic, we've seen zero false positives from legitimate usage. The fingerprinting is conservative enough that only truly identical, repeated request patterns within the detection window trigger it.

The cost math

Without loop protection, the blast radius of a single agent failure:

Model	Blended cost/1K tokens	Tokens per loop iteration	Cost per hour (1 req/sec)
GPT-4.1	~$0.012	~2,500	~$108
Claude Sonnet 4	~$0.018	~2,500	~$162
GPT-4.1-mini	~$0.002	~2,500	~$18

Blended rate assumes typical agent call token distribution (input-heavy). Actual cost depends on your input/output ratio and current provider pricing. Calculate your own: (input_tokens × input_rate + output_tokens × output_rate) × 3600.

With loop protection (default settings): the loop is caught after a small number of identical requests within the detection window. Total cost: under $1 instead of $100+. The blast radius drops by orders of magnitude.

Running it yourself

Loop detection is built into AI Security Gateway — active on every request by default, no configuration needed. It works with any OpenAI-compatible client (Python, Node, Go, curl) since it operates at the HTTP layer. The open-source core (GitHub) includes the DLP proxy and multi-provider routing; loop protection is part of the managed cloud offering.

If you're building your own loop detector, the code above is a complete starting point. The important design decisions are:

Fingerprint the tail, not the full conversation — catches loops without false positives on normal usage
Use atomic distributed counters — works across horizontally-scaled instances
Separate cooldown from detection window — prevents the loop from resuming after counter expiry
Include API key and model in the fingerprint — isolates users and legitimate multi-model usage

If your agents are running in production without this, it's not a question of if you'll hit a loop — it's when.

DEV Community

Your AI Agent Just Burned $108 in an Hour. Here's the 50-Line Fix.

Why agents loop

Why `max_iterations` doesn't save you

The detection algorithm

Why these design choices?

The counter: atomic increment with TTL

The response

What doesn't trigger detection

The cost math

Running it yourself

Top comments (0)

Why agents loop

Why max_iterations doesn't save you

The detection algorithm

Why these design choices?

The counter: atomic increment with TTL

The response

What doesn't trigger detection

The cost math

Running it yourself

Why `max_iterations` doesn't save you