Autonomous AI agents have a failure mode that every team discovers the hard way: infinite retry loops.
The agent sends a request. The model returns something the agent can't parse. The agent retries with the same prompt. Same response. Retry. Retry. Retry — hundreds of times before anyone notices.
The math is unforgiving: a single GPT-4-class agent loop at one request per second drains over $100 in an hour. Over a weekend with no one watching, that's $2,500+ before Monday morning.
If you're running LangChain, CrewAI, AutoGPT, or any custom agent framework in production, this will happen to you. The question is whether you catch it in 30 seconds or 30 hours.
Why agents loop
The causes are predictable across every framework:
# The classic loop: model output doesn't match expected format
while True:
response = llm.invoke(prompt)
try:
result = parse_json(response) # fails
break
except ParseError:
prompt = f"That wasn't valid JSON. Try again: {prompt}"
# Same prompt → same bad response → infinite loop
The specific triggers:
- Parsing failures: The model returns output that doesn't match the expected format. The agent retries, hoping for a different result. It won't be.
- Tool call errors: A tool returns an error. The agent tries the same call with the same parameters.
- Hallucinated tool names: The model calls a tool that doesn't exist. The error goes back, and the model calls the same non-existent tool again.
- "Let me try again" behavior: Some models, when told their output was wrong, rephrase the same answer — creating an infinite feedback loop.
-
Missing termination conditions:
max_iterationsset to 1,000, or not set at all.
Why max_iterations doesn't save you
Most frameworks offer max_iterations or similar parameters. The limitations:
| Problem | max_iterations |
Gateway-level detection |
|---|---|---|
| Protects multiple frameworks | No — per-framework | Yes — one chokepoint for all |
| Cross-session detection | No | Yes — shared state |
| Default is useful | Often 100-1000 | Tight defaults, configurable |
| Sub-agent spawning | Bypassed | Still caught |
| Language-agnostic | No — Python only | Yes — HTTP layer |
The fundamental issue: max_iterations is a per-framework, per-language, per-deployment setting. Gateway-level detection sits below all of it. Every request passes through the same chokepoint regardless of what generated it.
The detection algorithm
Here's the approach we use in AI Security Gateway. The core idea is fingerprinting + sliding window counting:
import hashlib
import json
def make_request_fingerprint(
caller_id: str,
model: str,
messages: list[dict],
) -> str:
"""Build a deterministic fingerprint for a request.
The idea: hash the caller identity, model, and the
recent message content into a single fixed-length key.
If the same key appears too often, it's a loop.
"""
# Focus on the recent tail of the conversation —
# full history changes naturally, but loops repeat
# the same tail over and over
TAIL_WINDOW = 3 # tune to your workload
recent = messages[-TAIL_WINDOW:] if len(messages) > TAIL_WINDOW else messages
texts = []
for msg in recent:
content = msg.get("content", "")
# Flatten multimodal content to text-only
if isinstance(content, list):
content = " ".join(
part.get("text", "")
for part in content
if isinstance(part, dict) and part.get("type") == "text"
)
texts.append(str(content).strip().lower())
blob = json.dumps(
{"who": caller_id, "model": model, "texts": texts},
sort_keys=True,
)
return hashlib.sha256(blob.encode()).hexdigest()
Why these design choices?
Fingerprint the tail, not the full conversation. The full message history changes naturally as a conversation evolves, but a looping agent repeats the same recent messages. Focusing on the tail catches loops without flagging normal multi-turn conversations.
Caller identity in the fingerprint. Two different users sending the same prompt are independent — separate counters per caller. One user's legitimate batch job doesn't trigger detection for another user.
Model in the fingerprint. Sending the same prompt to different models (e.g., trying GPT-4.1 then Claude) is legitimate fallback behavior, not a loop.
Normalize and lowercase. Prevents trivial variations (trailing whitespace, case changes) from evading detection.
The counter: atomic increment with TTL
The fingerprint feeds into a sliding-window counter. Here's the check logic:
async def is_looping(
fingerprint: str,
cache, # Redis-compatible async client
window: int, # sliding window in seconds
threshold: int, # max allowed identical requests
cooldown: int, # block duration after detection
) -> bool:
"""Check if a fingerprint indicates a runaway loop.
Uses atomic INCR so this works correctly across
horizontally-scaled instances sharing a cache.
"""
# Fast path: already in cooldown from a previous trigger?
if await cache.get(f"cool:{fingerprint}"):
return True
# Atomic increment — each call bumps the count by 1.
# The TTL means the counter auto-expires after `window`
# seconds, so it's a natural sliding window.
count = await cache.incr(f"cnt:{fingerprint}")
if count == 1:
await cache.expire(f"cnt:{fingerprint}", window)
if count > threshold:
# Enter cooldown — block requests for this fingerprint
# even after the counter key expires
await cache.setex(f"cool:{fingerprint}", cooldown, 1)
return True
return False
The key properties:
- Atomic INCR — no race conditions when multiple proxy instances share the same cache
- TTL on the counter — the window auto-expires, no cleanup cron needed
- Separate cooldown key — once a loop is detected, the block persists even after the counter key expires. This prevents the agent from resuming the loop after the window resets.
- Distributed state — when backed by a Redis-compatible store, an agent sending requests to different proxy instances is still caught. For single-instance setups, an in-memory backend works too.
The response
When a loop is detected, the client gets a structured, actionable error:
{
"detail": {
"error": "recursive_loop_detected",
"message": "Blocked: repetitive request pattern detected. This usually indicates an agent retry loop.",
"cooldown_seconds": 30
}
}
HTTP 429 (not 500) — because it's a client-side issue that the client should handle. The structured error field lets your agent framework catch it specifically:
from openai import OpenAI, RateLimitError
client = OpenAI(
base_url="https://your-gateway.example.com/v1",
api_key="your-key",
)
try:
response = client.chat.completions.create(
model="oah/gpt-4.1-mini",
messages=messages,
)
except RateLimitError as e:
if "recursive_loop_detected" in str(e):
# Agent is looping — stop retrying, alert the team
notify_slack("Agent loop detected, halting execution")
raise SystemExit(1)
raise # Normal rate limit — retry with backoff
What doesn't trigger detection
This matters as much as what does:
- Normal conversation: Users sending different messages to the same model — the message content changes, so the fingerprint changes. Never triggered.
- Batch processing: Same prompt to different models — model is part of the fingerprint, independent counters.
- Different users: Two users sending the same prompt — caller identity is part of the fingerprint, independent counters.
- Genuine content changes: Conversations where content evolves naturally produce different fingerprints on each turn. The system catches repetitive identical patterns, not normal dialogue.
In production across real traffic, we've seen zero false positives from legitimate usage. The fingerprinting is conservative enough that only truly identical, repeated request patterns within the detection window trigger it.
The cost math
Without loop protection, the blast radius of a single agent failure:
| Model | Blended cost/1K tokens | Tokens per loop iteration | Cost per hour (1 req/sec) |
|---|---|---|---|
| GPT-4.1 | ~$0.012 | ~2,500 | ~$108 |
| Claude Sonnet 4 | ~$0.018 | ~2,500 | ~$162 |
| GPT-4.1-mini | ~$0.002 | ~2,500 | ~$18 |
Blended rate assumes typical agent call token distribution (input-heavy). Actual cost depends on your input/output ratio and current provider pricing. Calculate your own: (input_tokens × input_rate + output_tokens × output_rate) × 3600.
With loop protection (default settings): the loop is caught after a small number of identical requests within the detection window. Total cost: under $1 instead of $100+. The blast radius drops by orders of magnitude.
Running it yourself
Loop detection is built into AI Security Gateway — active on every request by default, no configuration needed. It works with any OpenAI-compatible client (Python, Node, Go, curl) since it operates at the HTTP layer. The open-source core (GitHub) includes the DLP proxy and multi-provider routing; loop protection is part of the managed cloud offering.
If you're building your own loop detector, the code above is a complete starting point. The important design decisions are:
- Fingerprint the tail, not the full conversation — catches loops without false positives on normal usage
- Use atomic distributed counters — works across horizontally-scaled instances
- Separate cooldown from detection window — prevents the loop from resuming after counter expiry
- Include API key and model in the fingerprint — isolates users and legitimate multi-model usage
If your agents are running in production without this, it's not a question of if you'll hit a loop — it's when.
Top comments (0)