<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gs. Sanjana</title>
    <description>The latest articles on DEV Community by Gs. Sanjana (@gs_sanjana_3e822112e14f8).</description>
    <link>https://dev.to/gs_sanjana_3e822112e14f8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4003482%2F090b09ca-52cc-4b7c-a609-9b77c2efc574.jpg</url>
      <title>DEV Community: Gs. Sanjana</title>
      <link>https://dev.to/gs_sanjana_3e822112e14f8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gs_sanjana_3e822112e14f8"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Doesn't Need to Be Smarter. It Needs to Be Idempotent</title>
      <dc:creator>Gs. Sanjana</dc:creator>
      <pubDate>Sat, 27 Jun 2026 19:22:08 +0000</pubDate>
      <link>https://dev.to/gs_sanjana_3e822112e14f8/your-ai-agent-doesnt-need-to-be-smarter-it-needs-to-be-idempotent-2736</link>
      <guid>https://dev.to/gs_sanjana_3e822112e14f8/your-ai-agent-doesnt-need-to-be-smarter-it-needs-to-be-idempotent-2736</guid>
      <description>&lt;p&gt;Most of the failures I see in production AI agents aren't reasoning failures. The model picks the right tool, fills in the right arguments, and makes a perfectly sensible decision. Then the agent charges the customer twice.&lt;/p&gt;

&lt;p&gt;The reason is mundane and has nothing to do with intelligence. A write-capable agent — one that can send an email, create a ticket, move money, or update a database — lives inside the same unreliable network as any other distributed system. Requests time out. Connections drop after the server already committed the write but before the response came back. An orchestration framework retries a step that looked like it failed but didn't. And because the agent is a loop that re-plans on every observation, a single ambiguous outcome can send it down the path of just trying the action again.&lt;/p&gt;

&lt;p&gt;In a read-only agent, a retry is free. In a write-capable agent, a retry is a second irreversible action in the real world. That asymmetry is the whole game, and the fix is older than LLMs: idempotency.&lt;/p&gt;

&lt;h2&gt;The shape of the bug&lt;/h2&gt;

&lt;p&gt;Here's the sequence that bites teams over and over. The agent calls &lt;code&gt;send_invoice&lt;/code&gt;. The downstream service receives it, creates the invoice, and starts sending the response. Somewhere on the way back, the connection dies. From the agent's point of view, the call &lt;em&gt;failed&lt;/em&gt; — it got a timeout, not a 200. So the agent, doing exactly what a resilient system is supposed to do, retries. Now there are two invoices.&lt;/p&gt;
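
&lt;p&gt;The sequence above can be reproduced in a few lines. This is a minimal sketch, not a real client: &lt;code&gt;FlakyInvoiceService&lt;/code&gt; and &lt;code&gt;call_with_retry&lt;/code&gt; are hypothetical names invented for illustration, and the "network" is just an exception raised after the write has already happened.&lt;/p&gt;

```python
class FlakyInvoiceService:
    """Hypothetical downstream service that commits, then loses its first response."""

    def __init__(self):
        self.invoices = []
        self._drop_next_response = True

    def send_invoice(self, customer_id):
        self.invoices.append(customer_id)  # the write is committed here
        if self._drop_next_response:
            self._drop_next_response = False
            raise TimeoutError("response lost after commit")
        return "ok"


def call_with_retry(call, attempts=3):
    """Naive retry: treats every timeout as 'nothing happened'."""
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except TimeoutError as error:
            last_error = error
    raise last_error


service = FlakyInvoiceService()
status = call_with_retry(lambda: service.send_invoice("cus_42"))
print(status, len(service.invoices))  # ok 2
```

&lt;p&gt;The caller sees a single successful call, while the service holds two invoices. Nothing in the retry loop is wrong on its own terms; it simply cannot tell "failed" apart from "succeeded, but the reply got lost."&lt;/p&gt;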

&lt;p&gt;Notice that nothing here is the model's fault. Swapping in a smarter model doesn't help, and it can make the bug &lt;em&gt;worse&lt;/em&gt;, because a more capable agent tends to be more persistent about recovering from apparent failures. The intelligence layer and the reliability layer are different problems, and you cannot prompt your way out of a network partition.&lt;/p&gt;

&lt;h2&gt;Borrow the pattern that already won&lt;/h2&gt;

&lt;p&gt;Payments infrastructure solved this years ago, and the solution is worth copying wholesale. Stripe's API lets a client attach an &lt;code&gt;Idempotency-Key&lt;/code&gt; header to any POST request. Per &lt;a href="https://docs.stripe.com/api/idempotent_requests" rel="noopener noreferrer"&gt;Stripe's API reference&lt;/a&gt;, the server saves the status code and body of the first request made for a given key, and subsequent requests with the same key return that same stored result — even if the original was a failure. Stripe recommends a V4 UUID or another random string with enough entropy to avoid collisions, and notes that keys can be pruned automatically once they're at least 24 hours old.&lt;/p&gt;

&lt;p&gt;The mechanism is simple, but the &lt;em&gt;insight&lt;/em&gt; is the part to internalize: the safety guarantee lives at the boundary, keyed on the caller's stated intent, not on the model's judgment. The agent is allowed to be flaky. The boundary is what makes flakiness safe.&lt;/p&gt;

&lt;p&gt;For an agent, the only adaptation is where the key comes from. A human checkout flow generates one fresh key per user click. An agent has no clicks — so you derive the key from the &lt;em&gt;content of the intended action&lt;/em&gt;. Same logical action, same key, every time, even across retries and process restarts.&lt;/p&gt;

&lt;h2&gt;A minimal, working guard&lt;/h2&gt;

&lt;p&gt;Here's the entire idea in runnable Python. An &lt;code&gt;IdempotentStore&lt;/code&gt; wraps the side-effecting action; the key is a hash of the tool name plus its parameters, so a retried call collapses onto the original.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IdempotentStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;side_effects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# times the REAL action ran
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replayed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# no downstream call
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                       &lt;span class="c1"&gt;# the irreversible part
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;side_effects&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;intent_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drive it with an agent that retries the same logical charge three times:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
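&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;charge_customer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Stand-in for the real payment call (hypothetical; swap in your own client).
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cents&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IdempotentStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;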
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cus_42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4999&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;intent_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;charge_customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;charge_customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attempt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: mode=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running it prints &lt;code&gt;executed&lt;/code&gt; once and &lt;code&gt;replayed&lt;/code&gt; twice, and the downstream system records exactly one charge. The agent still &lt;em&gt;thinks&lt;/em&gt; it acted three times — and that's fine. Its job is to decide; the store's job is to make sure deciding twice doesn't cost twice.&lt;/p&gt;

&lt;p&gt;In real systems you'd back &lt;code&gt;_results&lt;/code&gt; with Redis or a Postgres table (with a unique constraint on the key, so even two concurrent workers race safely), set a TTL, and store enough of the response to replay it faithfully. The structure stays the same.&lt;/p&gt;
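
&lt;p&gt;As a rough illustration of the unique-constraint idea, here is a sketch that uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for Postgres or Redis. The class name &lt;code&gt;SqliteIdempotentStore&lt;/code&gt;, the table layout, and the extra &lt;code&gt;"in_progress"&lt;/code&gt; mode are assumptions made for this example, not part of the store above. It claims the key before running the action, so a second caller with the same key fails the insert instead of repeating the side effect.&lt;/p&gt;

```python
import json
import sqlite3


class SqliteIdempotentStore:
    """Durable variant of the in-memory store; sqlite3 stands in for Postgres."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        # PRIMARY KEY is the unique constraint that makes a duplicate claim fail.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, body TEXT)"
        )

    def run(self, key, action, *args):
        try:
            # Claim the key first; a second claim raises IntegrityError.
            with self._db:
                self._db.execute(
                    "INSERT INTO results (key, body) VALUES (?, NULL)", (key,)
                )
        except sqlite3.IntegrityError:
            row = self._db.execute(
                "SELECT body FROM results WHERE key = ?", (key,)
            ).fetchone()
            # body stays NULL while the first attempt has not stored a result.
            if row[0] is None:
                return None, "in_progress"
            return json.loads(row[0]), "replayed"
        result = action(*args)  # the irreversible part, run once per claimed key
        with self._db:
            self._db.execute(
                "UPDATE results SET body = ? WHERE key = ?",
                (json.dumps(result), key),
            )
        return result, "executed"
```

&lt;p&gt;Two things this sketch leaves out on purpose: there is no TTL, and an attempt that crashes after claiming the key leaves the row stuck in &lt;code&gt;"in_progress"&lt;/code&gt;. A real deployment needs a timeout or a reconciliation step for those rows, because deleting the claim blindly would reopen the double-write.&lt;/p&gt;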

&lt;h2&gt;Choosing the key is the real design work&lt;/h2&gt;

&lt;p&gt;The hash-the-params trick has a sharp edge worth naming. Your key is only as good as your definition of "the same action."&lt;/p&gt;

&lt;p&gt;If two genuinely distinct actions hash to the same key, you've created a false duplicate and the second one silently no-ops — a &lt;code&gt;send_reminder&lt;/code&gt; that quietly never sends. If two &lt;em&gt;retries&lt;/em&gt; of the same action hash to different keys — because you included a timestamp, a freshly generated request ID, or the model rephrased a free-text field — your guard does nothing and the double-write sails through. The model's nondeterminism makes this trap easy to fall into: ask an LLM to "email the customer about their late payment" twice and you may get two different message bodies, and therefore two different keys.&lt;/p&gt;

&lt;p&gt;The fix is to key on the &lt;em&gt;stable&lt;/em&gt; part of the intent — the customer ID, the invoice ID, the logical operation — and deliberately exclude anything the model might reword or anything that varies per call. Treat the key as a first-class part of your tool's contract, designed by you, not as an incidental hash of whatever arguments happened to show up.&lt;/p&gt;
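
&lt;p&gt;One way to make that contract explicit is to declare, per tool, which parameters count toward identity and hash only those. The sketch below is illustrative: &lt;code&gt;STABLE_FIELDS&lt;/code&gt;, &lt;code&gt;stable_intent_key&lt;/code&gt;, and the field names are hypothetical, and which fields belong in the key is a product decision rather than something the code can infer.&lt;/p&gt;

```python
import hashlib
import json

# Hypothetical per-tool contract: which parameters define "the same action".
STABLE_FIELDS = {
    "send_reminder": ("customer_id", "invoice_id"),
    "charge_customer": ("customer_id", "invoice_id", "cents"),
}


def stable_intent_key(tool_name, params):
    # A KeyError for an unregistered tool is deliberate: no contract, no key.
    fields = STABLE_FIELDS[tool_name]
    stable = {name: params[name] for name in fields}
    payload = json.dumps({"tool": tool_name, "params": stable}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


first = stable_intent_key(
    "send_reminder",
    {"customer_id": "cus_42", "invoice_id": "inv_7", "body": "Your payment is late."},
)
retry = stable_intent_key(
    "send_reminder",
    {"customer_id": "cus_42", "invoice_id": "inv_7", "body": "Friendly nudge: invoice overdue."},
)
other = stable_intent_key(
    "send_reminder",
    {"customer_id": "cus_42", "invoice_id": "inv_8", "body": "Your payment is late."},
)

print(first == retry)  # True: the reworded body does not change the key
print(first == other)  # False: a different invoice is a different action
```

&lt;p&gt;Note the trade-off baked into the &lt;code&gt;send_reminder&lt;/code&gt; entry: with only the customer and invoice in the key, a legitimate second reminder for the same invoice would be treated as a duplicate. If that should be allowed, the contract needs another stable field, such as a reminder sequence number chosen by your code rather than by the model.&lt;/p&gt;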

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;Before you reach for a bigger model, a longer prompt, or another layer of self-reflection, ask a cheaper question: &lt;em&gt;if my agent does this exact action twice, what breaks?&lt;/em&gt; For every write-capable tool, the answer should be "nothing," and the way you get there is an idempotency key derived from intent and enforced at the boundary.&lt;/p&gt;

&lt;p&gt;Reliability in agents isn't mostly about making better decisions. It's about making the cost of a repeated decision zero. Get that right and you can let the agent be as flaky as the network it lives on — which it will be, whether you plan for it or not.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>distributedsystems</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Everyone asks if AI will replace engineers. After a year of coding with it daily, that's the wrong question.</title>
      <dc:creator>Gs. Sanjana</dc:creator>
      <pubDate>Fri, 26 Jun 2026 07:47:45 +0000</pubDate>
      <link>https://dev.to/gs_sanjana_3e822112e14f8/everyone-asks-if-ai-will-replace-engineers-after-a-year-of-coding-with-it-daily-thats-the-wrong-6jm</link>
      <guid>https://dev.to/gs_sanjana_3e822112e14f8/everyone-asks-if-ai-will-replace-engineers-after-a-year-of-coding-with-it-daily-thats-the-wrong-6jm</guid>
      <description>&lt;p&gt;I've used AI coding tools every single working day for about a year. Not for demos — for real, shipped, production work. Long enough to get past both the hype and the backlash. Here's the honest version, the one I'd tell a friend over coffee.&lt;/p&gt;

&lt;h2&gt;What actually got faster&lt;/h2&gt;

&lt;p&gt;The boring stuff. Boilerplate, glue code, the first draft of a function, translating an idea into a framework I half-know, writing the test I was going to skip. The "I know exactly what I want, I just don't want to type it all" tasks collapsed from an afternoon to a few minutes.&lt;/p&gt;

&lt;p&gt;That part is real, and it's not small. A surprising amount of engineering is typing things you already understand.&lt;/p&gt;

&lt;h2&gt;What did NOT get faster (and got a little harder)&lt;/h2&gt;

&lt;p&gt;Knowing &lt;em&gt;what&lt;/em&gt; to build. Deciding the tradeoff. Holding the whole system in your head. Figuring out why the thing is actually slow. Saying "no" to the clever solution and picking the boring one that survives.&lt;/p&gt;

&lt;p&gt;If anything, AI made these &lt;em&gt;more&lt;/em&gt; important, because it removes the friction that used to slow you down before you'd thought it through. It'll happily generate 200 lines of confidently wrong code. The bottleneck moved from &lt;em&gt;writing&lt;/em&gt; to &lt;em&gt;judging&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;The skill that quietly became everything: review&lt;/h2&gt;

&lt;p&gt;A year ago, my main skill was writing code. Now my main skill is reading it — fast, skeptically, deciding in seconds whether to keep it, fix it, or throw it out. AI made me a senior reviewer of a very fast, very eager junior who never gets tired and never gets offended.&lt;/p&gt;

&lt;p&gt;The engineers getting the most out of this aren't the ones who trust it most. They're the ones who trust it &lt;em&gt;least&lt;/em&gt; by default, and verify quickly.&lt;/p&gt;

&lt;h2&gt;The honest failure mode&lt;/h2&gt;

&lt;p&gt;The trap isn't that AI writes bad code. It's that it writes &lt;em&gt;plausible&lt;/em&gt; code, and plausible is exactly what slips through when you're tired. The worst bugs I've seen this year weren't typos — they were confident, well-formatted, completely reasonable-looking lines that were subtly wrong. You only catch those if you still understand the thing yourself.&lt;/p&gt;

&lt;p&gt;So my one rule: never ship code you couldn't have written and can't fully explain. The day you do, you've stopped being the engineer and started being the rubber stamp.&lt;/p&gt;

&lt;h2&gt;So, replacement?&lt;/h2&gt;

&lt;p&gt;Wrong question. It's not replacing engineers; it's &lt;strong&gt;deleting the gap between knowing and doing.&lt;/strong&gt; That rewards people who know things and have taste, and punishes hand-waving. The leverage is enormous if you bring judgment, and dangerous if you use it &lt;em&gt;instead of&lt;/em&gt; judgment.&lt;/p&gt;

&lt;p&gt;I'm more productive than I've ever been. I'm also reading more carefully than I ever have. Both are true, and I don't think that's a coincidence.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you code with AI daily: what's the one thing you refuse to let it do for you?&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Blocklists Leak, Allowlists Hold: a tiny benchmark for stopping hijacked AI agents</title>
      <dc:creator>Gs. Sanjana</dc:creator>
      <pubDate>Fri, 26 Jun 2026 07:38:39 +0000</pubDate>
      <link>https://dev.to/gs_sanjana_3e822112e14f8/blocklists-leak-allowlists-hold-a-tiny-benchmark-for-stopping-hijacked-ai-agents-4obp</link>
      <guid>https://dev.to/gs_sanjana_3e822112e14f8/blocklists-leak-allowlists-hold-a-tiny-benchmark-for-stopping-hijacked-ai-agents-4obp</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Once an AI agent can &lt;em&gt;act&lt;/em&gt;, a single injected instruction can make it delete data, move money, or leak secrets. I built a tiny, reproducible benchmark for the layer that actually executes actions. An undefended agent let through &lt;strong&gt;100%&lt;/strong&gt; of attacks; a blocklist still leaked &lt;strong&gt;20%&lt;/strong&gt;; a default-deny &lt;strong&gt;allowlist&lt;/strong&gt; blocked &lt;strong&gt;100%&lt;/strong&gt; with zero false positives.&lt;/p&gt;

&lt;h2&gt;The shift that changes everything&lt;/h2&gt;

&lt;p&gt;Everyone's talking about smarter models. The bigger change is quieter: agents that don't just &lt;em&gt;answer&lt;/em&gt; but &lt;em&gt;act&lt;/em&gt; — send the email, issue the refund, run the query. The day an agent can act, a wrong answer becomes a wrong &lt;em&gt;action&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That's why &lt;strong&gt;prompt injection&lt;/strong&gt; and &lt;strong&gt;agent goal-hijacking&lt;/strong&gt; sit at the top of the OWASP risk lists for agentic systems. Hostile instructions hide in something the agent reads — a doc, a tool result, a web page — and the agent, trying to help, follows them.&lt;/p&gt;

&lt;h2&gt;A deliberately pessimistic question&lt;/h2&gt;

&lt;p&gt;Most defenses try to stop the model from being fooled. I asked the opposite: &lt;strong&gt;assume it already has been — does the layer that executes actions stop the harmful ones?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fub3eguvxsuf4hgul5bb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fub3eguvxsuf4hgul5bb7.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I compared three postures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Undefended&lt;/strong&gt;: the agent runs whatever it's driven to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocklist&lt;/strong&gt;: block obviously dangerous capabilities and enforce a payment cap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny allowlist&lt;/strong&gt;: only explicitly safe actions auto-run; everything else pauses for a human.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fub7m0x8jqzna9cxbvxla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fub7m0x8jqzna9cxbvxla.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Posture&lt;/th&gt;
&lt;th&gt;Attack success rate&lt;/th&gt;
&lt;th&gt;Benign blocked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Undefended&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blocklist&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default-deny allowlist&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The blocklist leaked on the &lt;strong&gt;sneaky, low-impact&lt;/strong&gt; attacks — "post a summary to this public link," "turn off the audit log." Nothing looked dangerous, so they slipped through. The allowlist caught them, because they weren't on the safe list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson fits on a sticky note: you can't enumerate every dangerous action, but you can enumerate the safe ones. Blocklists fail open on novelty; allowlists fail closed.&lt;/strong&gt;&lt;/p&gt;

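&lt;p&gt;The hold-for-review step itself is a few lines. This snippet gates on an impact threshold; a default-deny allowlist swaps that check for membership in an explicit safe list:&lt;/p&gt;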

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;needs_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confirm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;needs_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;confirm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HELD for review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# waits for a human
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
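
&lt;p&gt;For comparison, a default-deny gate might look like the following. This is a sketch under assumptions, not the benchmark's actual code: the action names in &lt;code&gt;SAFE_ACTIONS&lt;/code&gt;, the &lt;code&gt;gate&lt;/code&gt; function, and the &lt;code&gt;"name"&lt;/code&gt; field are all hypothetical. The point is that the check is membership in a list you wrote, so an action nobody anticipated is held rather than run.&lt;/p&gt;

```python
# Hypothetical action names; the safe list is whatever you choose to enumerate.
SAFE_ACTIONS = frozenset({"read_document", "search_tickets", "draft_reply"})


def gate(action, execute, confirm):
    """Default-deny: run allowlisted actions, hold everything else for a human."""
    if action["name"] in SAFE_ACTIONS:
        return execute(action)
    if confirm(action):  # a human explicitly approved this one
        return execute(action)
    return "HELD for review"


def execute(action):
    return "ran " + action["name"]


def nobody_approves(action):
    return False


print(gate({"name": "read_document"}, execute, nobody_approves))      # ran read_document
print(gate({"name": "disable_audit_log"}, execute, nobody_approves))  # HELD for review
```

&lt;p&gt;Here &lt;code&gt;disable_audit_log&lt;/code&gt; is held even though nothing marks it as dangerous. It is held only because it is absent from the safe list, which is the fail-closed behavior described above.&lt;/p&gt;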



&lt;p&gt;Honest limitations: it's a small, hand-built suite, and it assumes the worst case (the model is already hijacked), so it measures only the gating layer — not whether your agent resists injection in the first place. It complements bigger dynamic benchmarks like AgentDojo rather than replacing them.&lt;/p&gt;

&lt;p&gt;The benchmark and a small guardrail library are open source (MIT) — repo link in the comments.&lt;/p&gt;

&lt;p&gt;How do you draw the line between what your agents do on their own and what waits for a human? Curious what others are doing.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
