<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Binu George</title>
    <description>The latest articles on DEV Community by Binu George (@bgp).</description>
    <link>https://dev.to/bgp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3918991%2Ff18ebbbb-d88f-4735-ae29-e6928b36b858.jpg</url>
      <title>DEV Community: Binu George</title>
      <link>https://dev.to/bgp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bgp"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Just Burned $108 in an Hour. Here's the 50-Line Fix.</title>
      <dc:creator>Binu George</dc:creator>
      <pubDate>Wed, 03 Jun 2026 20:33:57 +0000</pubDate>
      <link>https://dev.to/bgp/your-ai-agent-just-burned-108-in-an-hour-heres-the-50-line-fix-4ifl</link>
      <guid>https://dev.to/bgp/your-ai-agent-just-burned-108-in-an-hour-heres-the-50-line-fix-4ifl</guid>
      <description>&lt;p&gt;Autonomous AI agents have a failure mode that every team discovers the hard way: infinite retry loops.&lt;/p&gt;

&lt;p&gt;The agent sends a request. The model returns something the agent can't parse. The agent retries with the same prompt. Same response. Retry. Retry. Retry — hundreds of times before anyone notices.&lt;/p&gt;

&lt;p&gt;The math is unforgiving: &lt;strong&gt;a single GPT-4-class agent loop at one request per second drains over $100 in an hour.&lt;/strong&gt; Over a weekend with no one watching, that's $2,500+ before Monday morning.&lt;/p&gt;

&lt;p&gt;If you're running LangChain, CrewAI, AutoGPT, or any custom agent framework in production, this will happen to you. The question is whether you catch it in 30 seconds or 30 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why agents loop
&lt;/h2&gt;

&lt;p&gt;The causes are predictable across every framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The classic loop: model output doesn't match expected format
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# fails
&lt;/span&gt;        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ParseError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;That wasn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t valid JSON. Try again: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="c1"&gt;# Same prompt → same bad response → infinite loop
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The specific triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parsing failures:&lt;/strong&gt; The model returns output that doesn't match the expected format. The agent retries, hoping for a different result. It won't be.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool call errors:&lt;/strong&gt; A tool returns an error. The agent tries the same call with the same parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated tool names:&lt;/strong&gt; The model calls a tool that doesn't exist. The error goes back, and the model calls the same non-existent tool again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Let me try again" behavior:&lt;/strong&gt; Some models, when told their output was wrong, rephrase the same answer — creating an infinite feedback loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing termination conditions:&lt;/strong&gt; &lt;code&gt;max_iterations&lt;/code&gt; set to 1,000, or not set at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why &lt;code&gt;max_iterations&lt;/code&gt; doesn't save you
&lt;/h2&gt;

&lt;p&gt;Most frameworks offer &lt;code&gt;max_iterations&lt;/code&gt; or similar parameters. The limitations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;&lt;code&gt;max_iterations&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Gateway-level detection&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Protects multiple frameworks&lt;/td&gt;
&lt;td&gt;No — per-framework&lt;/td&gt;
&lt;td&gt;Yes — one chokepoint for all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-session detection&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes — shared state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default is useful&lt;/td&gt;
&lt;td&gt;Often 100-1000&lt;/td&gt;
&lt;td&gt;Tight defaults, configurable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sub-agent spawning&lt;/td&gt;
&lt;td&gt;Bypassed&lt;/td&gt;
&lt;td&gt;Still caught&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language-agnostic&lt;/td&gt;
&lt;td&gt;No — Python only&lt;/td&gt;
&lt;td&gt;Yes — HTTP layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fundamental issue: &lt;code&gt;max_iterations&lt;/code&gt; is a per-framework, per-language, per-deployment setting. Gateway-level detection sits below all of it. Every request passes through the same chokepoint regardless of what generated it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The detection algorithm
&lt;/h2&gt;

&lt;p&gt;Here's the approach we use in &lt;a href="https://aisecuritygateway.ai" rel="noopener noreferrer"&gt;AI Security Gateway&lt;/a&gt;. The core idea is fingerprinting + sliding window counting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_request_fingerprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;caller_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build a deterministic fingerprint for a request.

    The idea: hash the caller identity, model, and the
    recent message content into a single fixed-length key.
    If the same key appears too often, it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s a loop.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Focus on the recent tail of the conversation —
&lt;/span&gt;    &lt;span class="c1"&gt;# full history changes naturally, but loops repeat
&lt;/span&gt;    &lt;span class="c1"&gt;# the same tail over and over
&lt;/span&gt;    &lt;span class="n"&gt;TAIL_WINDOW&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="c1"&gt;# tune to your workload
&lt;/span&gt;    &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;TAIL_WINDOW&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TAIL_WINDOW&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;

    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Flatten multimodal content to text-only
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;who&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;caller_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;texts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why these design choices?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fingerprint the tail, not the full conversation.&lt;/strong&gt; The full message history changes naturally as a conversation evolves, but a looping agent repeats the same recent messages. Focusing on the tail catches loops without flagging normal multi-turn conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caller identity in the fingerprint.&lt;/strong&gt; Two different users sending the same prompt are independent — separate counters per caller. One user's legitimate batch job doesn't trigger detection for another user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model in the fingerprint.&lt;/strong&gt; Sending the same prompt to different models (e.g., trying GPT-4.1 then Claude) is legitimate fallback behavior, not a loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normalize and lowercase.&lt;/strong&gt; Prevents trivial variations (trailing whitespace, case changes) from evading detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  The counter: atomic increment with TTL
&lt;/h2&gt;

&lt;p&gt;The fingerprint feeds into a sliding-window counter. Here's the check logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_looping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Redis-compatible async client
&lt;/span&gt;    &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# sliding window in seconds
&lt;/span&gt;    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# max allowed identical requests
&lt;/span&gt;    &lt;span class="n"&gt;cooldown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# block duration after detection
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a fingerprint indicates a runaway loop.

    Uses atomic INCR so this works correctly across
    horizontally-scaled instances sharing a cache.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Fast path: already in cooldown from a previous trigger?
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cool:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="c1"&gt;# Atomic increment — each call bumps the count by 1.
&lt;/span&gt;    &lt;span class="c1"&gt;# The TTL means the counter auto-expires after `window`
&lt;/span&gt;    &lt;span class="c1"&gt;# seconds, so it's a natural sliding window.
&lt;/span&gt;    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cnt:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cnt:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Enter cooldown — block requests for this fingerprint
&lt;/span&gt;        &lt;span class="c1"&gt;# even after the counter key expires
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cool:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fingerprint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cooldown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomic INCR&lt;/strong&gt; — no race conditions when multiple proxy instances share the same cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL on the counter&lt;/strong&gt; — the window auto-expires, no cleanup cron needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate cooldown key&lt;/strong&gt; — once a loop is detected, the block persists even after the counter key expires. This prevents the agent from resuming the loop after the window resets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed state&lt;/strong&gt; — when backed by a Redis-compatible store, an agent sending requests to different proxy instances is still caught. For single-instance setups, an in-memory backend works too.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The response
&lt;/h2&gt;

&lt;p&gt;When a loop is detected, the client gets a structured, actionable error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"recursive_loop_detected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Blocked: repetitive request pattern detected. This usually indicates an agent retry loop."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cooldown_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HTTP 429 (not 500) — because it's a client-side issue that the client should handle. The structured &lt;code&gt;error&lt;/code&gt; field lets your agent framework catch it specifically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-gateway.example.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oah/gpt-4.1-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recursive_loop_detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent is looping — stop retrying, alert the team
&lt;/span&gt;        &lt;span class="nf"&gt;notify_slack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent loop detected, halting execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt;  &lt;span class="c1"&gt;# Normal rate limit — retry with backoff
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What doesn't trigger detection
&lt;/h2&gt;

&lt;p&gt;This matters as much as what does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normal conversation:&lt;/strong&gt; Users sending different messages to the same model — the message content changes, so the fingerprint changes. Never triggered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing:&lt;/strong&gt; Same prompt to different models — model is part of the fingerprint, independent counters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different users:&lt;/strong&gt; Two users sending the same prompt — caller identity is part of the fingerprint, independent counters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Genuine content changes:&lt;/strong&gt; Conversations where content evolves naturally produce different fingerprints on each turn. The system catches repetitive identical patterns, not normal dialogue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production across real traffic, we've seen &lt;strong&gt;zero false positives&lt;/strong&gt; from legitimate usage. The fingerprinting is conservative enough that only truly identical, repeated request patterns within the detection window trigger it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost math
&lt;/h2&gt;

&lt;p&gt;Without loop protection, the blast radius of a single agent failure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Blended cost/1K tokens&lt;/th&gt;
&lt;th&gt;Tokens per loop iteration&lt;/th&gt;
&lt;th&gt;Cost per hour (1 req/sec)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;~$0.012&lt;/td&gt;
&lt;td&gt;~2,500&lt;/td&gt;
&lt;td&gt;~$108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;~$0.018&lt;/td&gt;
&lt;td&gt;~2,500&lt;/td&gt;
&lt;td&gt;~$162&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;~$0.002&lt;/td&gt;
&lt;td&gt;~2,500&lt;/td&gt;
&lt;td&gt;~$18&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Blended rate assumes typical agent call token distribution (input-heavy). Actual cost depends on your input/output ratio and current provider pricing. Calculate your own: &lt;code&gt;(input_tokens × input_rate + output_tokens × output_rate) × 3600&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With loop protection (default settings): the loop is caught after a small number of identical requests within the detection window. Total cost: &lt;strong&gt;under $1&lt;/strong&gt; instead of $100+. The blast radius drops by orders of magnitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running it yourself
&lt;/h2&gt;

&lt;p&gt;Loop detection is built into &lt;a href="https://aisecuritygateway.ai" rel="noopener noreferrer"&gt;AI Security Gateway&lt;/a&gt; — active on every request by default, no configuration needed. It works with any OpenAI-compatible client (Python, Node, Go, curl) since it operates at the HTTP layer. The open-source core (&lt;a href="https://github.com/aisecuritygateway/aisecuritygateway" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;) includes the DLP proxy and multi-provider routing; loop protection is part of the managed cloud offering.&lt;/p&gt;

&lt;p&gt;If you're building your own loop detector, the code above is a complete starting point. The important design decisions are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fingerprint the tail, not the full conversation&lt;/strong&gt; — catches loops without false positives on normal usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use atomic distributed counters&lt;/strong&gt; — works across horizontally-scaled instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate cooldown from detection window&lt;/strong&gt; — prevents the loop from resuming after counter expiry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include API key and model in the fingerprint&lt;/strong&gt; — isolates users and legitimate multi-model usage&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your agents are running in production without this, it's not a question of &lt;em&gt;if&lt;/em&gt; you'll hit a loop — it's &lt;em&gt;when&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>langchain</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Built an Open-Source AI Firewall Because Every LLM App Leaks Data</title>
      <dc:creator>Binu George</dc:creator>
      <pubDate>Fri, 08 May 2026 02:39:23 +0000</pubDate>
      <link>https://dev.to/bgp/i-built-an-open-source-ai-firewall-because-every-llm-app-leaks-data-5468</link>
      <guid>https://dev.to/bgp/i-built-an-open-source-ai-firewall-because-every-llm-app-leaks-data-5468</guid>
      <description>&lt;p&gt;Every LLM app I audited had the same problem.&lt;/p&gt;

&lt;p&gt;Users type real data into AI features. Names, emails, social security numbers, credit card numbers, medical details. The app takes that input, wraps it in a prompt, and sends it straight to OpenAI or Anthropic. No filtering. No redaction. Nothing.&lt;/p&gt;

&lt;p&gt;The developer didn't plan for it. The product manager didn't think about it. The compliance team doesn't even know AI features exist yet.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://aisecuritygateway.ai" rel="noopener noreferrer"&gt;AI Security Gateway&lt;/a&gt; to fix this. It's an open-source proxy that sits between your app and any LLM provider. Every prompt passes through a security layer before it reaches the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;The proxy inspects every request in real-time and applies four layers of governance:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. PII Redaction
&lt;/h3&gt;

&lt;p&gt;Before your prompt reaches OpenAI, Anthropic, Google, or anyone else, the proxy detects and redacts 28+ PII entity types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal identifiers&lt;/strong&gt; — names, emails, phone numbers, dates of birth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial data&lt;/strong&gt; — credit card numbers, IBANs, bank accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Government IDs&lt;/strong&gt; — SSNs, passport numbers, driver's licenses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medical identifiers&lt;/strong&gt; — medical record numbers, NPI numbers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locations&lt;/strong&gt; — physical addresses, IP addresses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom patterns&lt;/strong&gt; — your own regex for internal codes, customer IDs, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also handles images. If a user uploads a screenshot to a vision model (GPT-4o, Claude, Gemini), our OCR pipeline extracts text from the image and scans it for PII before the image reaches the provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Prompt Injection Blocking
&lt;/h3&gt;

&lt;p&gt;Heuristic detection catches jailbreak attempts, role override attacks, and instruction extraction — combined with custom regex rules for your specific application patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Budget Enforcement
&lt;/h3&gt;

&lt;p&gt;Set hard spend caps per API key. When a key hits its limit, the proxy returns &lt;code&gt;HTTP 402&lt;/code&gt;. Not a warning — a hard stop.&lt;/p&gt;

&lt;p&gt;This exists because I watched an agent loop burn through $3,000 in a single night during testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Smart Cost Routing
&lt;/h3&gt;

&lt;p&gt;Configure multiple providers and the proxy automatically routes each request to the cheapest available model. We track live pricing across 600+ models and 8+ providers. Teams typically see 30-60% cost reduction from routing alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Decision That Matters Most
&lt;/h2&gt;

&lt;p&gt;AISG is fully stateless. This isn't a feature toggle — it's the architecture.&lt;/p&gt;

&lt;p&gt;Prompts pass through memory and are discarded. Only metadata survives: cost, latency, token counts, PII entity counts, policy violations. The proxy physically cannot retain prompt content. There's no database to store it, no log to write it to, no queue to buffer it.&lt;/p&gt;

&lt;p&gt;I made this decision early because the alternative — a proxy that logs everything "for observability" — creates exactly the problem it claims to solve. You're trying to prevent data leaking to third parties, so you route it through a proxy that... stores all the data? That never made sense to me.&lt;/p&gt;

&lt;p&gt;This matters for compliance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Standard&lt;/th&gt;
&lt;th&gt;What it means with AISG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HIPAA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Patient data in prompts never persists outside your app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PCI DSS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Credit card numbers redacted before any third-party API call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GDPR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No personal data stored by the proxy layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SOC 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Audit logs capture what happened without capturing what was said&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;p&gt;For anyone interested in what's under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python + FastAPI&lt;/strong&gt; — async proxy layer, handles streaming responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Presidio + custom NER&lt;/strong&gt; — multi-layered PII detection pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt; — metadata only (costs, violations, never prompts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; — single command self-hosting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS&lt;/strong&gt; — managed cloud version&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Integration
&lt;/h2&gt;

&lt;p&gt;If you're using the OpenAI SDK, it's two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.aisecuritygateway.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-aisg-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Your existing code stays exactly the same
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this contract...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No new SDK.&lt;/p&gt;

&lt;p&gt;No wrapper library.&lt;/p&gt;

&lt;p&gt;Your existing OpenAI calls now go through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII redaction&lt;/li&gt;
&lt;li&gt;Injection blocking&lt;/li&gt;
&lt;li&gt;Budget enforcement&lt;/li&gt;
&lt;li&gt;Smart routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All transparent to your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. PII Detection Is Harder Than You Think
&lt;/h3&gt;

&lt;p&gt;"John Smith" is a name. "Smith &amp;amp; Wesson" is not. "Call me at 555-1234" contains a phone number. "Error code 555-1234" does not. Context matters enormously. Regex alone gets you maybe 60% accuracy. You need NER models layered on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Latency Budgets Are Brutal
&lt;/h3&gt;

&lt;p&gt;Every millisecond of proxy overhead is overhead users feel.We got text inspection down to ~50ms. Image OCR still costs ~0.5–1 second. That's the trade-off — and for images containing PII, it's worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Budget Enforcement Became the Killer Feature
&lt;/h3&gt;

&lt;p&gt;I originally built this for PII redaction. But the feature people ask about most is budget caps. Turns out, "My agent loop burned $2,000 overnight" is a more common pain point than, "My prompts contain SSNs."&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Self-Hosting Is a Trust Multiplier
&lt;/h3&gt;

&lt;p&gt;Making the entire stack open-source under Apache 2.0 was the best decision I made. Enterprise security teams don't trust a proxy they can't inspect. Open source removes that objection immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Managed Cloud
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://aisecuritygateway.ai" rel="noopener noreferrer"&gt;https://aisecuritygateway.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free credits:&lt;/strong&gt; 1M credits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credit card required:&lt;/strong&gt; No&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Self-Host
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/aisecuritygateway/aisecuritygateway" rel="noopener noreferrer"&gt;https://github.com/aisecuritygateway/aisecuritygateway&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aisecuritygateway.ai/docs" rel="noopener noreferrer"&gt;https://aisecuritygateway.ai/docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The project is Apache 2.0 licensed. Stars, issues, and PRs are all welcome.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;I'd love to hear from anyone dealing with PII in LLM prompts.&lt;/p&gt;

&lt;p&gt;What's your current approach?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filtering at the application layer?&lt;/li&gt;
&lt;li&gt;Using a proxy?&lt;/li&gt;
&lt;li&gt;Ignoring it and hoping for the best?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
