DEV Community: Binu George

Your AI Agent Just Burned $108 in an Hour. Here's the 50-Line Fix.

Binu George — Wed, 03 Jun 2026 20:33:57 +0000

Autonomous AI agents have a failure mode that every team discovers the hard way: infinite retry loops.

The agent sends a request. The model returns something the agent can't parse. The agent retries with the same prompt. Same response. Retry. Retry. Retry — hundreds of times before anyone notices.

The math is unforgiving: a single GPT-4-class agent loop at one request per second drains over $100 in an hour. Left running for a day, that's roughly $2,600; over a full weekend with no one watching, it's more than $5,000 before Monday morning.

If you're running LangChain, CrewAI, AutoGPT, or any custom agent framework in production, this will happen to you. The question is whether you catch it in 30 seconds or 30 hours.

Why agents loop

The causes are predictable across every framework:

# The classic loop: model output doesn't match expected format
while True:
    response = llm.invoke(prompt)
    try:
        result = parse_json(response)  # fails
        break
    except ParseError:
        prompt = f"That wasn't valid JSON. Try again: {prompt}"
        # Same prompt → same bad response → infinite loop

The specific triggers:

Parsing failures: The model returns output that doesn't match the expected format. The agent retries, hoping for a different result. It won't get one.
Tool call errors: A tool returns an error. The agent tries the same call with the same parameters.
Hallucinated tool names: The model calls a tool that doesn't exist. The error goes back, and the model calls the same non-existent tool again.
"Let me try again" behavior: Some models, when told their output was wrong, rephrase the same answer — creating an infinite feedback loop.
Missing termination conditions: max_iterations set to 1,000, or not set at all.
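The last two triggers can be closed off inside the agent itself. What follows is a minimal sketch, not taken from any framework: it caps the number of attempts and gives up early when the model repeats the same invalid output. `call_llm` is a hypothetical stand-in for whatever client your agent uses, and `RetryBudgetExceeded` is a name invented for this example.

```python
import json
from typing import Any, Callable


class RetryBudgetExceeded(RuntimeError):
    """Raised when the agent gives up instead of looping forever."""


def parse_with_bounded_retries(
    call_llm: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    max_attempts: int = 3,
) -> Any:
    """Ask for JSON, retrying at most `max_attempts` times."""
    last_output = None
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        output = call_llm(current_prompt)
        try:
            return json.loads(output)
        except json.JSONDecodeError as err:
            # An identical bad answer twice in a row means retrying is pointless.
            if output == last_output:
                raise RetryBudgetExceeded(
                    f"model repeated the same invalid output on attempt {attempt}"
                ) from err
            last_output = output
            # Feed the concrete parse error back instead of resending the same prompt.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was not valid JSON ({err.msg}). "
                "Reply with JSON only."
            )
    raise RetryBudgetExceeded(f"no valid JSON after {max_attempts} attempts")
```

This only protects the one code path that uses it, which is why a second line of defense below the framework still matters.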

Why `max_iterations` doesn't save you

Most frameworks offer max_iterations or similar parameters. The limitations:

Capability	`max_iterations`	Gateway-level detection
Protects multiple frameworks	No (set per framework)	Yes (one chokepoint for all)
Cross-session detection	No	Yes (shared state)
Useful default	Rarely (often 100-1000)	Tight defaults, configurable
Sub-agent spawning	Bypassed	Still caught
Language-agnostic	No (tied to each framework's language)	Yes (HTTP layer)

The fundamental issue: max_iterations is a per-framework, per-language, per-deployment setting. Gateway-level detection sits below all of it. Every request passes through the same chokepoint regardless of what generated it.

The detection algorithm

Here's the approach we use in AI Security Gateway. The core idea is request fingerprinting plus a time-windowed counter:

import hashlib
import json

def make_request_fingerprint(
    caller_id: str,
    model: str,
    messages: list[dict],
) -> str:
    """Build a deterministic fingerprint for a request.

    The idea: hash the caller identity, model, and the
    recent message content into a single fixed-length key.
    If the same key appears too often, it's a loop.
    """
    # Focus on the recent tail of the conversation —
    # full history changes naturally, but loops repeat
    # the same tail over and over
    TAIL_WINDOW = 3   # tune to your workload
    recent = messages[-TAIL_WINDOW:]  # slicing already copes with short lists

    texts = []
    for msg in recent:
        # content can be None (e.g. assistant turns that only carry tool calls)
        content = msg.get("content") or ""
        # Flatten multimodal content to text-only
        if isinstance(content, list):
            content = " ".join(
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
        texts.append(str(content).strip().lower())

    blob = json.dumps(
        {"who": caller_id, "model": model, "texts": texts},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()

Why these design choices?

Fingerprint the tail, not the full conversation. The full message history changes naturally as a conversation evolves, but a looping agent repeats the same recent messages. Focusing on the tail catches loops without flagging normal multi-turn conversations.

Caller identity in the fingerprint. Two different users sending the same prompt are independent — separate counters per caller. One user's legitimate batch job doesn't trigger detection for another user.

Model in the fingerprint. Sending the same prompt to different models (e.g., trying GPT-4.1 then Claude) is legitimate fallback behavior, not a loop.

Normalize and lowercase. Prevents trivial variations (trailing whitespace, case changes) from evading detection.
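These four properties are easy to check directly. The sketch below uses a condensed fingerprint helper that handles plain-string content only (it is a simplification for illustration, not the gateway's code), and the caller IDs and model names are made up.

```python
import hashlib
import json


def fingerprint(caller_id: str, model: str, messages: list[dict], tail: int = 3) -> str:
    """Condensed version of the fingerprint idea, for plain-string content only."""
    texts = [str(m.get("content") or "").strip().lower() for m in messages[-tail:]]
    blob = json.dumps({"who": caller_id, "model": model, "texts": texts}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


history = [{"role": "user", "content": f"step {i}"} for i in range(5)]
retry = {"role": "user", "content": "That wasn't valid JSON. Try again."}

base = fingerprint("key-a", "gpt-4.1", history + [retry])

# Trivial variations in case and surrounding whitespace collapse to the same key.
noisy = {"role": "user", "content": "  THAT wasn't valid json. Try again.\n"}
same_after_normalizing = fingerprint("key-a", "gpt-4.1", history + [noisy]) == base

# Older history outside the tail does not affect the key.
longer = [{"role": "user", "content": "earlier turn"}] + history + [retry]
same_tail = fingerprint("key-a", "gpt-4.1", longer) == base

# A different caller or a different model gets its own counter.
other_caller = fingerprint("key-b", "gpt-4.1", history + [retry]) != base
other_model = fingerprint("key-a", "claude", history + [retry]) != base

print(same_after_normalizing, same_tail, other_caller, other_model)
# prints: True True True True
```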

The counter: atomic increment with TTL

The fingerprint feeds into a TTL-bounded counter: a fixed window that opens at the first matching request and expires on its own. Here's the check logic:

async def is_looping(
    fingerprint: str,
    cache,           # Redis-compatible async client
    window: int,     # counting window in seconds
    threshold: int,  # max allowed identical requests
    cooldown: int,   # block duration after detection
) -> bool:
    """Check if a fingerprint indicates a runaway loop.

    Uses atomic INCR so this works correctly across
    horizontally-scaled instances sharing a cache.
    """
    # Fast path: already in cooldown from a previous trigger?
    if await cache.get(f"cool:{fingerprint}"):
        return True

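    # Atomic increment: each call bumps the count by 1.
    # The TTL is set once, on the first hit, so the counter
    # expires `window` seconds after the loop's first request.
    # Strictly speaking that is a fixed window, not a sliding one.
    count = await cache.incr(f"cnt:{fingerprint}")
    if count == 1:
        # INCR and EXPIRE are two separate commands; a crash between
        # them would leave a counter with no TTL.
        await cache.expire(f"cnt:{fingerprint}", window)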

    if count > threshold:
        # Enter cooldown — block requests for this fingerprint
        # even after the counter key expires
        await cache.setex(f"cool:{fingerprint}", cooldown, 1)
        return True

    return False

The key properties:

Atomic INCR — no race conditions when multiple proxy instances share the same cache
TTL on the counter — the window auto-expires, no cleanup cron needed
Separate cooldown key — once a loop is detected, the block persists even after the counter key expires. This prevents the agent from resuming the loop after the window resets.
Distributed state — when backed by a Redis-compatible store, an agent sending requests to different proxy instances is still caught. For single-instance setups, an in-memory backend works too.
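For the single-instance case, an in-memory backend only needs the four calls the check uses: `get`, `incr`, `expire`, and `setex`. The class below is a sketch under that assumption; `InMemoryCache` and its injectable `clock` are names invented here, and it offers no protection across processes.

```python
import asyncio
import time
from typing import Callable, Optional


class InMemoryCache:
    """Tiny single-process stand-in for the Redis-style calls used above.

    Not safe across processes; meant for local runs and tests only.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (value, expires_at or None)
        self._data: dict[str, tuple[int, Optional[float]]] = {}

    def _live(self, key: str) -> Optional[tuple[int, Optional[float]]]:
        entry = self._data.get(key)
        if entry is not None and entry[1] is not None and entry[1] <= self._clock():
            del self._data[key]  # lazily drop expired keys
            return None
        return entry

    async def get(self, key: str) -> Optional[int]:
        entry = self._live(key)
        return None if entry is None else entry[0]

    async def incr(self, key: str) -> int:
        entry = self._live(key)
        value, expires_at = (0, None) if entry is None else entry
        self._data[key] = (value + 1, expires_at)  # keep any existing TTL
        return value + 1

    async def expire(self, key: str, seconds: int) -> None:
        entry = self._live(key)
        if entry is not None:
            self._data[key] = (entry[0], self._clock() + seconds)

    async def setex(self, key: str, seconds: int, value: int) -> None:
        self._data[key] = (value, self._clock() + seconds)


async def demo() -> list[int]:
    now = [0.0]  # fake clock so the example runs instantly
    cache = InMemoryCache(clock=lambda: now[0])
    counts = []
    for _ in range(3):
        count = await cache.incr("cnt:abc")
        if count == 1:
            await cache.expire("cnt:abc", 60)
        counts.append(count)
    now[0] = 61.0  # the window has passed, so the counter starts over
    counts.append(await cache.incr("cnt:abc"))
    return counts


print(asyncio.run(demo()))  # prints: [1, 2, 3, 1]
```

An instance of this class can be passed as the `cache` argument of `is_looping` for local testing.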

The response

When a loop is detected, the client gets a structured, actionable error:

{
  "detail": {
    "error": "recursive_loop_detected",
    "message": "Blocked: repetitive request pattern detected. This usually indicates an agent retry loop.",
    "cooldown_seconds": 30
  }
}

The gateway returns HTTP 429 rather than 500 because a loop is a client-side problem that the client should handle. The structured error field lets your agent framework catch it specifically:

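import sys

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="your-key",
)

# Example payload; in an agent this is the running conversation
messages = [{"role": "user", "content": "Return the result as JSON."}]


def notify_slack(text: str) -> None:
    """Placeholder alert hook; swap in your own Slack or pager integration."""
    print(text, file=sys.stderr)


try:
    response = client.chat.completions.create(
        model="oah/gpt-4.1-mini",
        messages=messages,
    )
except RateLimitError as e:
    # The gateway's 429 body carries the "recursive_loop_detected" error code
    if "recursive_loop_detected" in str(e):
        # Agent is looping: stop retrying, alert the team
        notify_slack("Agent loop detected, halting execution")
        raise SystemExit(1)
    raise  # Normal rate limit: retry with backoff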
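Substring matching on the exception text works, but it is brittle. A stricter option, sketched here on the assumption that the body has exactly the shape shown above, is to inspect the parsed payload. `loop_cooldown_seconds` is a helper name made up for this example.

```python
import json
from typing import Any, Optional


def loop_cooldown_seconds(status_code: int, body: Any) -> Optional[int]:
    """Return the cooldown if this response is a loop block, else None.

    Assumes the body shape {"detail": {"error": ..., "cooldown_seconds": ...}}.
    """
    if status_code != 429:
        return None
    if isinstance(body, (str, bytes)):
        try:
            body = json.loads(body)
        except ValueError:  # covers invalid JSON and undecodable bytes
            return None
    detail = body.get("detail") if isinstance(body, dict) else None
    if not isinstance(detail, dict) or detail.get("error") != "recursive_loop_detected":
        return None
    cooldown = detail.get("cooldown_seconds")
    return cooldown if isinstance(cooldown, int) else 0


blocked = (
    '{"detail": {"error": "recursive_loop_detected", '
    '"message": "Blocked", "cooldown_seconds": 30}}'
)
print(loop_cooldown_seconds(429, blocked))  # prints: 30
print(loop_cooldown_seconds(429, '{"error": {"message": "Rate limit reached"}}'))  # prints: None
```

With the OpenAI Python SDK, the caught exception's `status_code` and `body` attributes are the natural inputs, though how the body is parsed may differ between SDK versions.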

What doesn't trigger detection

This matters as much as what does:

Normal conversation: Users sending different messages to the same model — the message content changes, so the fingerprint changes. Never triggered.
Batch processing: Same prompt to different models — model is part of the fingerprint, independent counters.
Different users: Two users sending the same prompt — caller identity is part of the fingerprint, independent counters.
Genuine content changes: Conversations where content evolves naturally produce different fingerprints on each turn. The system catches repetitive identical patterns, not normal dialogue.

In production across real traffic, we've seen zero false positives from legitimate usage. The fingerprinting is conservative enough that only truly identical, repeated request patterns within the detection window trigger it.

The cost math

Without loop protection, the blast radius of a single agent failure:

Model	Blended cost/1K tokens	Tokens per loop iteration	Cost per hour (1 req/sec)
GPT-4.1	~$0.012	~2,500	~$108
Claude Sonnet 4	~$0.018	~2,500	~$162
GPT-4.1-mini	~$0.002	~2,500	~$18

Blended rate assumes a typical, input-heavy agent call. Actual cost depends on your input/output ratio and current provider pricing. To calculate your own hourly cost at one request per second, with rates expressed per token: (input_tokens × input_rate + output_tokens × output_rate) × 3600.
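The same arithmetic as a small function, using per-1K rates to match the table. The 2,000/500 input/output split and the `threshold` value are assumptions for illustration; the article gives neither.

```python
def hourly_loop_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_1k: float,
    output_rate_per_1k: float,
    requests_per_second: float = 1.0,
) -> float:
    """Dollars per hour burned by a loop that repeats the same request."""
    per_request = (
        input_tokens / 1000 * input_rate_per_1k
        + output_tokens / 1000 * output_rate_per_1k
    )
    return per_request * requests_per_second * 3600


# Reproduce the GPT-4.1 row: ~2,500 tokens per iteration at a blended ~$0.012 per 1K.
# With a single blended rate, the input/output split does not change the result.
unprotected = hourly_loop_cost(2000, 500, 0.012, 0.012)
print(f"${unprotected:.2f} per hour")  # prints: $108.00 per hour

# If a detector blocks after N identical requests, spend is bounded by N requests.
threshold = 5  # hypothetical value
print(f"${unprotected / 3600 * threshold:.2f} before the loop is cut off")  # prints: $0.15 ...
```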

With loop protection (default settings): the loop is caught after a small number of identical requests within the detection window. Total cost: under $1 instead of $100+. The blast radius drops by orders of magnitude.

Running it yourself

Loop detection is built into AI Security Gateway — active on every request by default, no configuration needed. It works with any OpenAI-compatible client (Python, Node, Go, curl) since it operates at the HTTP layer. The open-source core (GitHub) includes the DLP proxy and multi-provider routing; loop protection is part of the managed cloud offering.

If you're building your own loop detector, the code above is a complete starting point. The important design decisions are:

Fingerprint the tail, not the full conversation — catches loops without false positives on normal usage
Use atomic distributed counters — works across horizontally-scaled instances
Separate cooldown from detection window — prevents the loop from resuming after counter expiry
Include API key and model in the fingerprint — isolates users and legitimate multi-model usage

If your agents are running in production without this, it's not a question of if you'll hit a loop — it's when.

I Built an Open-Source AI Firewall Because Every LLM App Leaks Data

Binu George — Fri, 08 May 2026 02:39:23 +0000

Every LLM app I audited had the same problem.

Users type real data into AI features. Names, emails, social security numbers, credit card numbers, medical details. The app takes that input, wraps it in a prompt, and sends it straight to OpenAI or Anthropic. No filtering. No redaction. Nothing.

The developer didn't plan for it. The product manager didn't think about it. The compliance team doesn't even know AI features exist yet.

I built AI Security Gateway to fix this. It's an open-source proxy that sits between your app and any LLM provider. Every prompt passes through a security layer before it reaches the model.

What It Does

The proxy inspects every request in real-time and applies four layers of governance:

1. PII Redaction

Before your prompt reaches OpenAI, Anthropic, Google, or anyone else, the proxy detects and redacts 28+ PII entity types:

Personal identifiers — names, emails, phone numbers, dates of birth
Financial data — credit card numbers, IBANs, bank accounts
Government IDs — SSNs, passport numbers, driver's licenses
Medical identifiers — medical record numbers, NPI numbers
Locations — physical addresses, IP addresses
Custom patterns — your own regex for internal codes, customer IDs, etc.
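To make the custom-pattern idea concrete, here is a regex-only sketch. It is far simpler than a real detection pipeline, the patterns are deliberately naive, and the `CUST-` ID format is invented for the example.

```python
import re

# Deliberately simple patterns for illustration; real detectors need far more care.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Hypothetical internal ID format, standing in for a custom rule.
    "CUSTOMER_ID": re.compile(r"\bCUST-\d{6}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with <LABEL> placeholders and count them per entity type."""
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        if n:
            counts[label] = n
    return text, counts


clean, counts = redact("Email jane.doe@example.com about SSN 123-45-6789 (CUST-004213).")
print(clean)   # prints: Email <EMAIL> about SSN <US_SSN> (<CUSTOMER_ID>).
print(counts)  # prints: {'EMAIL': 1, 'US_SSN': 1, 'CUSTOMER_ID': 1}
```

Returning per-type counts alongside the redacted text means the proxy can report what it found without keeping the original values.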

It also handles images. If a user uploads a screenshot to a vision model (GPT-4o, Claude, Gemini), our OCR pipeline extracts text from the image and scans it for PII before the image reaches the provider.

2. Prompt Injection Blocking

Heuristic detection catches jailbreak attempts, role override attacks, and instruction extraction — combined with custom regex rules for your specific application patterns.

3. Budget Enforcement

Set hard spend caps per API key. When a key hits its limit, the proxy returns HTTP 402. Not a warning — a hard stop.
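A hard cap of this kind can be sketched in a few lines. `BudgetGuard`, the key name, and the dollar figures below are all hypothetical; the point is only that the check happens before the request is forwarded.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetGuard:
    """Tracks spend per API key and refuses requests once the cap is reached."""

    caps_usd: dict[str, float]
    spent_usd: dict[str, float] = field(default_factory=dict)

    def check(self, api_key: str) -> int:
        """Return the HTTP status the proxy would answer with before forwarding."""
        cap = self.caps_usd.get(api_key)
        if cap is not None and self.spent_usd.get(api_key, 0.0) >= cap:
            return 402  # hard stop: the request is never forwarded upstream
        return 200

    def record(self, api_key: str, cost_usd: float) -> None:
        """Add the cost of a completed request to the key's running total."""
        self.spent_usd[api_key] = self.spent_usd.get(api_key, 0.0) + cost_usd


guard = BudgetGuard(caps_usd={"team-a": 0.10})  # hypothetical key and cap
statuses = []
for _ in range(5):
    status = guard.check("team-a")
    statuses.append(status)
    if status == 200:
        guard.record("team-a", 0.03)  # pretend each call cost 3 cents
print(statuses)  # prints: [200, 200, 200, 200, 402]
```

Because cost is only known after a call completes, a guard like this can overshoot the cap by one request; whether that is acceptable depends on how large a single request can get.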

This exists because I watched an agent loop burn through $3,000 in a single night during testing.

4. Smart Cost Routing

Configure multiple providers and the proxy automatically routes each request to the cheapest available model. We track live pricing across 600+ models and 8+ providers. Teams typically see 30-60% cost reduction from routing alone.
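Stripped to its core, cheapest-route selection is a filtered minimum over a price table. The sketch below assumes a static table with placeholder providers, models, and prices; it says nothing about how live pricing is actually tracked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    provider: str
    model: str
    usd_per_1k_tokens: float  # placeholder prices, not real quotes
    available: bool = True


def cheapest_route(routes: list[Route], allowed_models: set[str]) -> Optional[Route]:
    """Pick the lowest-priced route that is up and serves an acceptable model."""
    candidates = [r for r in routes if r.available and r.model in allowed_models]
    return min(candidates, key=lambda r: r.usd_per_1k_tokens, default=None)


routes = [
    Route("provider-a", "model-x", 0.012),
    Route("provider-b", "model-x", 0.009, available=False),
    Route("provider-c", "model-x", 0.010),
    Route("provider-c", "model-y", 0.002),
]
best = cheapest_route(routes, allowed_models={"model-x"})
print(best.provider if best else "no route available")  # prints: provider-c
```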

The Architecture Decision That Matters Most

AISG (AI Security Gateway) is stateless with respect to prompt content. This isn't a feature toggle; it's the architecture.

Prompts pass through memory and are discarded. Only metadata survives: cost, latency, token counts, PII entity counts, policy violations. The proxy has nowhere to retain prompt content: no database table stores it, no log records it, no queue buffers it.
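One way to picture "only metadata survives" is a record type with no field that could hold prompt text. This is an illustrative sketch, not the project's schema: the field names are assumptions, and whitespace word counts stand in for a real tokenizer.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RequestMetadata:
    """What survives a request. There is deliberately no field for prompt text."""

    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    pii_entity_counts: dict[str, int]
    policy_violations: tuple[str, ...] = ()


def summarize(
    model: str,
    prompt: str,
    completion: str,
    latency_ms: float,
    cost_usd: float,
    pii_counts: dict[str, int],
) -> dict:
    """Reduce a request to metadata; prompt and completion are not copied into it."""
    record = RequestMetadata(
        model=model,
        latency_ms=latency_ms,
        # Whitespace word counts stand in for a real tokenizer here.
        prompt_tokens=len(prompt.split()),
        completion_tokens=len(completion.split()),
        cost_usd=cost_usd,
        pii_entity_counts=dict(pii_counts),
    )
    return asdict(record)


row = summarize("gpt-4o", "My SSN is <US_SSN>", "Noted.", 48.0, 0.0004, {"US_SSN": 1})
print(sorted(row))
```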

I made this decision early because the alternative — a proxy that logs everything "for observability" — creates exactly the problem it claims to solve. You're trying to prevent data leaking to third parties, so you route it through a proxy that... stores all the data? That never made sense to me.

This matters for compliance:

Standard	What it means with AISG
HIPAA	Patient data in prompts never persists outside your app
PCI DSS	Credit card numbers redacted before any third-party API call
GDPR	No personal data stored by the proxy layer
SOC 2	Audit logs capture what happened without capturing what was said

The Tech Stack

For anyone interested in what's under the hood:

Python + FastAPI — async proxy layer, handles streaming responses
Presidio + custom NER — multi-layered PII detection pipeline
Database — metadata only (costs, violations, never prompts)
Docker Compose — single command self-hosting
AWS — managed cloud version

Integration

If you're using the OpenAI SDK, it's two lines:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aisecuritygateway.ai/v1",
    api_key="your-aisg-key"
)

# Your existing code stays exactly the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this contract..."}]
)

No new SDK.

No wrapper library.

Your existing OpenAI calls now go through:

PII redaction
Injection blocking
Budget enforcement
Smart routing

All transparent to your application.

What I Learned Building This

1. PII Detection Is Harder Than You Think

"John Smith" is a name. "Smith & Wesson" is not. "Call me at 555-1234" contains a phone number. "Error code 555-1234" does not. Context matters enormously. Regex alone gets you maybe 60% accuracy. You need NER models layered on top.

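The phone-number example above is easy to reproduce. In this sketch a bare regex flags both sentences, while a crude context check keeps only the real phone number. The cue-word list is a hypothetical stand-in for what an NER model would learn, not a recommended approach.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{4}\b")
# Hypothetical cue words; a real pipeline would use an NER model instead of a word list.
PHONE_CUES = re.compile(r"\b(call|phone|text|reach|tel)\b", re.IGNORECASE)


def naive_hits(text: str) -> list[str]:
    """Regex only: every 3-4 digit group is flagged."""
    return PHONE.findall(text)


def context_hits(text: str, window: int = 20) -> list[str]:
    """Keep a match only if a cue word appears shortly before it."""
    hits = []
    for m in PHONE.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        if PHONE_CUES.search(before):
            hits.append(m.group())
    return hits


for sentence in ("Call me at 555-1234", "Error code 555-1234"):
    print(sentence, "|", naive_hits(sentence), "|", context_hits(sentence))
```
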
2. Latency Budgets Are Brutal

Every millisecond of proxy overhead is overhead users feel. We got text inspection down to ~50ms. Image OCR still costs ~0.5–1 second. That's the trade-off, and for images containing PII, it's worth it.

3. Budget Enforcement Became the Killer Feature

I originally built this for PII redaction. But the feature people ask about most is budget caps. Turns out "My agent loop burned $2,000 overnight" is a more common pain point than "My prompts contain SSNs."

4. Self-Hosting Is a Trust Multiplier

Making the entire stack open-source under Apache 2.0 was the best decision I made. Enterprise security teams don't trust a proxy they can't inspect. Open source removes that objection immediately.

Try It

Managed Cloud

Website: https://aisecuritygateway.ai
Free credits: 1M credits
Credit card required: No

Self-Host

docker compose up

GitHub: https://github.com/aisecuritygateway/aisecuritygateway

Documentation

https://aisecuritygateway.ai/docs

The project is Apache 2.0 licensed. Stars, issues, and PRs are all welcome.

Final Thought

I'd love to hear from anyone dealing with PII in LLM prompts.

What's your current approach?

Filtering at the application layer?
Using a proxy?
Ignoring it and hoping for the best?