<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brenn Hill</title>
    <description>The latest articles on DEV Community by Brenn Hill (@brennhill).</description>
    <link>https://dev.to/brennhill</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3856905%2F15b3b99d-a66c-43bf-b3c6-1af943635cf1.jpeg</url>
      <title>DEV Community: Brenn Hill</title>
      <link>https://dev.to/brennhill</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brennhill"/>
    <language>en</language>
    <item>
      <title>Does Human-in-the-Loop Actually Improve AI Safety?</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Wed, 01 Jul 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/brennhill/does-human-in-the-loop-actually-improve-ai-safety-5f46</link>
      <guid>https://dev.to/brennhill/does-human-in-the-loop-actually-improve-ai-safety-5f46</guid>
      <description>&lt;p&gt;Human-in-the-loop can improve AI safety, but it usually does not by default. Putting a person behind an approval button only helps when the consequence is high &lt;em&gt;and&lt;/em&gt; that person can realistically catch the mistake in time. When they can't, the approval click is a rubber stamp that adds latency, manufactures a false sense of safety, and sets the human up to take the blame for a failure they were never positioned to prevent.&lt;/p&gt;

&lt;p&gt;This article unpacks when human oversight of AI genuinely raises safety, when it only looks like it does, and what real AI safety for agents requires instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrong question, and the right one
&lt;/h2&gt;

&lt;p&gt;Most discussions of human in the loop AI safety start with "should a human review this?" That question is nearly useless, because the honest answer is almost always "sure, why not." The better question is sharper and uncomfortable: &lt;em&gt;can a human realistically catch this mistake in time?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the answer is no, then a review step is theater rather than a safety control. The agent still does the wrong thing, and you have simply added a person whose name is on the approval. The &lt;a href="https://looprails.dev/framework.html" rel="noopener noreferrer"&gt;framework&lt;/a&gt; reframes oversight around this distinction, and it changes nearly every design decision that follows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evidence: an approval click is not the same as catching an error
&lt;/h2&gt;

&lt;p&gt;Here is the finding that should reset everyone's intuition. In research on AI coding agents (see the LoopRails &lt;a href="https://looprails.dev/codex.html" rel="noopener noreferrer"&gt;codex&lt;/a&gt;), requiring plan-approval did reduce how often attacks occurred, from roughly 90% down to 60 to 74%. That sounds like a win. But the number that actually matters for safety stayed grim: when a bad action was put in front of a human to catch, intervention success was only 9 to 26% across every oversight strategy tested.&lt;/p&gt;

&lt;p&gt;Read those two numbers together. Approval gates reduced the &lt;em&gt;volume&lt;/em&gt; of bad actions, mostly by making the agent propose fewer of them. They did almost nothing to make humans &lt;em&gt;good at catching&lt;/em&gt; the ones that got through. The gap between being exposed to an error and actually correcting it is enormous, and a confirmation prompt does not close it.&lt;/p&gt;

&lt;p&gt;Two well-documented forces explain why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation bias.&lt;/strong&gt; People over-trust system suggestions and approve them without real scrutiny. This is structural, not a matter of effort or expertise. It afflicts trained professionals, and it gets &lt;em&gt;worse&lt;/em&gt; as the system becomes more reliable, because a tool that is usually right teaches you to stop looking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rubber stamp.&lt;/strong&gt; A human told to "review the output" under any time pressure will skim and click approve. The agent's proposal arrives wrapped in a confident rationale. The reviewer reads the rationale, it sounds reasonable, and they accept it. This is the &lt;strong&gt;Rubber Stamp&lt;/strong&gt; anti-pattern, and it is the default outcome of naive oversight rather than the exception.&lt;/p&gt;

&lt;p&gt;So the click happened. The log shows a human approved. Safety did not improve. That is the trap.&lt;/p&gt;

&lt;h2&gt;
  
  
  When human-in-the-loop genuinely improves AI safety
&lt;/h2&gt;

&lt;p&gt;Oversight earns its place in exactly one quadrant: when consequence is high &lt;strong&gt;and&lt;/strong&gt; controllability is high, meaning a human can both detect the problem from what they're shown and correct it before harm lands.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;genuine oversight&lt;/strong&gt;, and it is worth investing in. The classic example is a code change where the agent surfaces a real, readable diff plus passing or failing tests. A competent reviewer can look at that diff, see what it actually does, and reject it before it merges. The action is reversible, the evidence is verification-oriented rather than persuasive, and there is time on the clock. Here, review works.&lt;/p&gt;

&lt;p&gt;For oversight to actually function in this quadrant, the moment has to be engineered, not assumed. The reviewer needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The real action and its consequences&lt;/strong&gt;, shown as a diff or preview, including whether it is reversible. Not a summary of intentions, the concrete effect. This is the "Show" move.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enough provenance&lt;/strong&gt; to answer "how did this get to me" so they have situation awareness rather than a cold decision out of context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection affordances&lt;/strong&gt; that help them find the error rather than sell them on the answer. Explanations framed to persuade increase acceptance regardless of correctness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A respected attention budget&lt;/strong&gt;, because every spurious prompt erodes the scrutiny available for the prompts that matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the territory of the &lt;a href="https://looprails.dev/guide-g2.html" rel="noopener noreferrer"&gt;G2 guide&lt;/a&gt;: high-consequence but human-catchable actions, where a preview, a diff, and a real approval step are the right controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  When human-in-the-loop gives false safety
&lt;/h2&gt;

&lt;p&gt;Now the dangerous quadrant: consequence is high but controllability is low. The human &lt;em&gt;cannot&lt;/em&gt; reliably detect or correct the error from what's surfaced, or there isn't time. Review becomes a trap.&lt;/p&gt;

&lt;p&gt;Putting an approval gate here does not produce safety. It produces a rubber stamp and a scapegoat. The recognition bottleneck and automation bias guarantee the human accepts, and the 9 to 26% figure is exactly this quadrant in the data. You have manufactured the &lt;em&gt;appearance&lt;/em&gt; of control over an action no human in that position could actually control.&lt;/p&gt;

&lt;p&gt;It gets worse than ineffective, because it creates a &lt;strong&gt;moral crumple zone&lt;/strong&gt;: a human positioned to absorb blame for a system's failure despite having no real power to prevent it. The reviewer's signature is on the approval, so when the agent deletes the production database or wires the payment, accountability collapses onto them. The system and its designers are insulated. The human is the liability sponge. That is a way of laundering responsibility for a design that was never safe.&lt;/p&gt;

&lt;p&gt;If you cannot give a reviewer real authority, awareness, ability, and time, do not claim oversight. Change the design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What real AI safety for agents looks like instead
&lt;/h2&gt;

&lt;p&gt;When review is a trap, the answer is not a better prompt. Stop depending on the human as a detector and prevent the bad outcome directly. The &lt;a href="https://looprails.dev/playbook.html" rel="noopener noreferrer"&gt;playbook&lt;/a&gt; is built around the method &lt;strong&gt;Grade, Guard, Show, Prove&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grade the action.&lt;/strong&gt; Score every capability the agent has from G0 (trivial, like reading a file) to G3 (critical, like deleting prod, sending external email, or executing a payment), based on reversibility times blast radius times stakes. You cannot allocate oversight until you know what each action is worth. The &lt;a href="https://looprails.dev/guide-g3.html" rel="noopener noreferrer"&gt;G3 guide&lt;/a&gt; covers the critical tier where prevention, not review, has to carry the load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guard with controls matched to the grade.&lt;/strong&gt; This is where prevention lives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sandbox-First&lt;/strong&gt; so high-autonomy work runs in a contained environment with no network and scoped credentials. The worst case is bounded, so you don't need a human to catch every action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast-Radius Cap&lt;/strong&gt; so a single action, or many small ones composing together, cannot exceed a hard limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability Lock&lt;/strong&gt; so dangerous actions are &lt;em&gt;impossible&lt;/em&gt;, not merely discouraged. A denylist the agent can evade is policy, not a boundary, the &lt;strong&gt;Denylist Theater&lt;/strong&gt; anti-pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kill Switch&lt;/strong&gt; so there is always a way to stop. Knight Capital lost about $440M in roughly 45 minutes in 2012 to trading software with no way to halt it. A missing kill switch is how the worst incidents happen, not a rare edge case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breaker&lt;/strong&gt; so the system halts automatically on anomaly before a human even has to react.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maker-Checker&lt;/strong&gt; for the genuinely irreversible, where the proposer must not be the approver, but only when the checker can actually verify.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The unifying invariant is &lt;strong&gt;RAIL&lt;/strong&gt;: keep every governed action &lt;strong&gt;R&lt;/strong&gt;eversible, &lt;strong&gt;A&lt;/strong&gt;uthorized, &lt;strong&gt;I&lt;/strong&gt;nterruptible, and &lt;strong&gt;L&lt;/strong&gt;ogged. Reversibility shrinks consequence so an error can be undone instead of caught (&lt;a href="https://looprails.dev/rail-reversible.html" rel="noopener noreferrer"&gt;rail-reversible.html&lt;/a&gt;). Authorization enforces real boundaries server-side (&lt;a href="https://looprails.dev/rail-authorized.html" rel="noopener noreferrer"&gt;rail-authorized.html&lt;/a&gt;). Interruptibility means there is a working stop, the lesson Knight Capital paid for (&lt;a href="https://looprails.dev/rail-interruptible.html" rel="noopener noreferrer"&gt;rail-interruptible.html&lt;/a&gt;). Logging makes accountability traceable to an informed human (&lt;a href="https://looprails.dev/rail-logged.html" rel="noopener noreferrer"&gt;rail-logged.html&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;A word on interruptibility and alert design, because over-prompting is how oversight quietly dies. At Three Mile Island, more than 100 alarms fired within minutes, hiding the real problem. Studies find clinicians dismiss 49 to 96% of safety alerts. Flood a human with prompts and they tune out the one that mattered, the &lt;strong&gt;Alert-Fatigue Spiral&lt;/strong&gt;. Spend attention sparingly, on the actions where it can actually change the outcome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Show the reviewer the real action and its consequences&lt;/strong&gt; when, and only when, a human is genuinely in the loop. A preview the reviewer can't evaluate is decorative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prove the oversight catches seeded errors.&lt;/strong&gt; This is the move almost everyone skips. Do not check that a review step exists. Plant errors and adversarial actions and measure whether the human, or the monitoring system, actually catches them. Track intervention success rate, not approval rate. An oversight design that has never been tested against a wrong agent is unvalidated. Treat "there is a human in the loop" as a claim to demonstrate with evidence, not a checkbox.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human in the loop AI safety is conditional, not automatic.&lt;/strong&gt; It helps only when consequence is high &lt;em&gt;and&lt;/em&gt; a human can catch the error in time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An approval click is not error-catching.&lt;/strong&gt; Plan-approval cut bad actions from ~90% to 60 to 74%, but human intervention success stayed only 9 to 26% (see the &lt;a href="https://looprails.dev/codex.html" rel="noopener noreferrer"&gt;codex&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation bias makes the rubber stamp the default.&lt;/strong&gt; People over-trust suggestions and approve without scrutiny, more so as the agent gets more reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review is a trap when consequence is high but controllability is low.&lt;/strong&gt; It creates false safety and a moral crumple zone where the human absorbs blame without power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real safety for agents is prevention:&lt;/strong&gt; grade by consequence, sandbox, cap the blast radius, lock capabilities, keep a kill switch, and hold to RAIL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prove it works.&lt;/strong&gt; Seed errors and measure whether oversight catches them. Don't ship unvalidated oversight.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;If you are deciding where a human belongs in your agent's loop, start by grading your actions. Run them through the &lt;a href="https://looprails.dev/index.html#grader" rel="noopener noreferrer"&gt;interactive grader&lt;/a&gt; to see which are genuinely human-catchable and which need prevention instead. Then read the &lt;a href="https://looprails.dev/framework.html" rel="noopener noreferrer"&gt;framework&lt;/a&gt; for the full method, skim the &lt;a href="https://looprails.dev/cheatsheet.html" rel="noopener noreferrer"&gt;cheatsheet&lt;/a&gt; for the patterns and anti-patterns, and dig into the &lt;a href="https://looprails.dev/codex.html" rel="noopener noreferrer"&gt;codex&lt;/a&gt; for the research behind every claim here. The goal is to make sure the bad outcome cannot happen, whether a human is watching or not.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://looprails.dev/article-hitl-ai-safety.html" rel="noopener noreferrer"&gt;looprails.dev/article-hitl-ai-safety.html&lt;/a&gt;. &lt;a href="https://looprails.dev" rel="noopener noreferrer"&gt;LoopRails&lt;/a&gt; is a free, sourced framework for designing human-in-the-loop oversight of AI agents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>The first malicious MCP server was one line of code: the postmark-mcp rug pull</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 30 Jun 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/brennhill/the-first-malicious-mcp-server-was-one-line-of-code-the-postmark-mcp-rug-pull-jda</link>
      <guid>https://dev.to/brennhill/the-first-malicious-mcp-server-was-one-line-of-code-the-postmark-mcp-rug-pull-jda</guid>
      <description>&lt;p&gt;In September 2025, security researchers at &lt;a href="https://www.koi.ai/blog/postmark-mcp-npm-malicious-backdoor-email-theft" rel="noopener noreferrer"&gt;Koi Security found&lt;/a&gt; what's widely described as the first in-the-wild malicious MCP server. It wasn't a sophisticated zero-day. It was one added line in an email tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;postmark-mcp&lt;/code&gt; is an npm package that gives an AI agent a tool for sending email through Postmark. For fifteen releases — versions 1.0.0 through 1.0.15 — it did exactly that, and nothing else. It got adopted, it got trusted, it landed in people's daily agent workflows. By the time it mattered, it was pulling roughly 1,500 downloads a week.&lt;/p&gt;

&lt;p&gt;Then version 1.0.16 shipped on September 17, 2025. The diff was small enough to miss in a glance: the send-email function gained a &lt;code&gt;Bcc&lt;/code&gt; field pointing at &lt;code&gt;phan@giftshop[.]club&lt;/code&gt;, a domain the maintainer controlled. Every email the agent sent — content, recipients, attachments, whatever secrets or PII happened to be inside — got silently copied to the attacker.&lt;/p&gt;

&lt;p&gt;Nothing else changed. The tool still sent your email correctly. From the outside, and from the agent's perspective, it worked. That's the whole trick: the malicious version was indistinguishable in behavior from the benign one, except for the carbon copy you couldn't see.&lt;/p&gt;

&lt;p&gt;Anyone on auto-update inherited the backdoor the moment they pulled the new version. The package was downloaded 1,643 times in total before it was removed from npm. Postmark, the company, &lt;a href="https://postmarkapp.com/blog/information-regarding-malicious-postmark-mcp-package" rel="noopener noreferrer"&gt;confirmed&lt;/a&gt; it had nothing to do with the package — the name just borrowed their credibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it matters
&lt;/h2&gt;

&lt;p&gt;The uncomfortable lesson here isn't "audit your dependencies." Plenty of people &lt;em&gt;had&lt;/em&gt; effectively audited this one — it was fine for fifteen versions. The lesson is that &lt;strong&gt;approval isn't permanent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you vet a tool, you vet a specific version's behavior at a specific moment. An MCP server can change its tool definitions and its actual behavior in any later release, and the agent — which trusts the tool to describe itself honestly — has no built-in way to notice. This is the "rug pull": vetted and benign, then quietly hostile, with the trust you extended earlier carried forward to code you never looked at.&lt;/p&gt;

&lt;p&gt;MCP makes this sharper than a normal dependency bump, because these tools run with real authority inside your agent's loop. An email tool can read and send mail. A filesystem tool can read and write files. The blast radius of a hostile update is whatever you granted the tool on the day you trusted it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practitioner takeaway
&lt;/h2&gt;

&lt;p&gt;You can't manually re-read every dependency on every update. But you can make "the tool changed" a thing your system &lt;em&gt;notices&lt;/em&gt; instead of a thing it silently accepts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin versions.&lt;/strong&gt; Auto-update is what turned a malicious release into mass exposure. Pin MCP servers and their dependencies to exact versions, and treat a version bump as a change that needs a human, not a default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fingerprint tools at approval time.&lt;/strong&gt; When you vet a tool, record a fingerprint — the package version and integrity hash, plus the tool's declared schema and description. That's the thing you actually approved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-check the fingerprint on every load.&lt;/strong&gt; Before an agent uses a tool, compare its current fingerprint to the approved one. A &lt;code&gt;postmark-mcp&lt;/code&gt; running 1.0.15 and one running 1.0.16 should not look the same to your system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat a moved fingerprint as hostile until proven otherwise.&lt;/strong&gt; If the hash, version, or tool definition changed and nobody re-approved it, fail closed. Don't run the tool, don't pass it secrets, and surface the diff to a human. A changed tool definition is exactly the signal a rug pull produces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this requires catching the malicious line by reading it. It requires noticing that &lt;em&gt;something&lt;/em&gt; changed in a tool you'd already decided to trust — which is the one signal this attack couldn't hide.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This incident is one of the sources behind *&lt;/em&gt;&lt;a href="https://braceframework.org/" rel="noopener noreferrer"&gt;BRACE&lt;/a&gt;*&lt;em&gt;, an open, vendor-neutral framework for securing autonomous AI agents — its &lt;a href="https://braceframework.org/guides/ecosystem/" rel="noopener noreferrer"&gt;ecosystem guide&lt;/a&gt; covers vetting tools and re-checking them on every load. BRACE is built by reading the incidents and the research and asking, each time: what concrete control would have prevented or contained this?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>What Is Agentic AI? And Why Oversight Has to Change</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Sat, 27 Jun 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/brennhill/what-is-agentic-ai-and-why-oversight-has-to-change-4k6k</link>
      <guid>https://dev.to/brennhill/what-is-agentic-ai-and-why-oversight-has-to-change-4k6k</guid>
      <description>&lt;p&gt;Agentic AI is software built on a large language model (LLM) that can pursue a goal by taking actions on its own. It uses tools, calls APIs, runs code, and reacts to what it sees, rather than just answering one prompt at a time. The plain definition of what is agentic AI: a model that runs in a loop, deciding its own next step until the goal is met. Because the work shifts from generating text to taking actions, oversight has to change too.&lt;/p&gt;

&lt;p&gt;This explainer covers what agentic AI is, how an agent works, what makes it both powerful and risky, where you'll meet it, and why "just add a human" doesn't automatically make it safe. It also covers how to start governing agents instead of reviewing their outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What agentic AI is (vs. a chatbot)
&lt;/h2&gt;

&lt;p&gt;A chatbot, or any single LLM call, is one round trip. You send a prompt, the model returns text, and that's it. The model produces words; a human decides what to do with them. Nothing happens in the world unless a person acts on the answer.&lt;/p&gt;

&lt;p&gt;An AI agent is different in one decisive way: it can &lt;em&gt;act&lt;/em&gt;. Give it a goal, and it doesn't just describe a solution. It works toward it by using tools. It can read your files, query a database, send an email, run a shell command, edit code, or browse a website. Then it observes the result and keeps going. The human is no longer the only one taking actions in the loop. The agent is.&lt;/p&gt;

&lt;p&gt;So the core distinction in agentic AI isn't intelligence or model size. It's &lt;em&gt;agency&lt;/em&gt;. A chatbot answers; an agent does. Taking real actions toward a goal with limited supervision is what makes agentic AI useful, and what makes it a new kind of risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  How an AI agent works: goal, plan, tools, observe, loop
&lt;/h2&gt;

&lt;p&gt;Almost every agent runs the same cycle. Understanding it is the fastest way to grasp both the power and the danger.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Goal.&lt;/strong&gt; You give the agent an objective in natural language: "fix this failing test," "summarize last quarter's support tickets," "book a flight under $400."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan.&lt;/strong&gt; The model breaks the goal into steps and decides what to do first. The plan adapts as the agent learns more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act (use a tool).&lt;/strong&gt; The agent calls a tool to do something real: run a command, search the web, write a file, hit an API. This is the moment an action takes effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observe.&lt;/strong&gt; It reads the result (the test output, the search results, the API response) and feeds that back into its reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loop.&lt;/strong&gt; It plans the next step and acts again, repeating until the goal is met (or it gives up or hits a limit).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That loop is the whole idea. A single prompt is one turn; an agent is a model using tools in a loop to pursue a goal, planning, calling tools, observing results, and continuing. The convergence on this pattern, and the human-in-the-loop primitive that wraps it, is documented in the &lt;a href="https://looprails.dev/codex.html" rel="noopener noreferrer"&gt;LoopRails codex&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is where oversight gets hard. In a chatbot you review one output and you're done. In an agent there may be dozens of actions, each one changing the world a little, most happening faster than you can read.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes agentic AI powerful and risky
&lt;/h2&gt;

&lt;p&gt;The power and the risk come from the same three properties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It takes real actions.&lt;/strong&gt; An agent doesn't suggest sending the email; it sends it. It doesn't propose the database change; it runs it. The output isn't text you choose to use. It's an action that already happened. A mistake isn't a bad paragraph you ignore. It's a deleted record, a wrong payment, or leaked data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It acts autonomously.&lt;/strong&gt; Between your goal and the result, the agent makes many decisions you never see: which tool to call, what arguments to pass, when to stop. You set the destination; it picks the route. That helps when it's right and hurts when it's wrong, because the wrong turn happens without asking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It acts fast.&lt;/strong&gt; Agents do in seconds what would take a person minutes or hours. Speed is the selling point, and also why human review struggles to keep up. By the time you've read what the agent is about to do, it's often already done three more things.&lt;/p&gt;

&lt;p&gt;Put those together and you have a system doing real work at machine speed, with real-world consequences and limited per-step supervision. That is the value proposition and the threat model in one sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common examples of AI agents
&lt;/h2&gt;

&lt;p&gt;Agentic AI isn't theoretical. You're likely already using or building one of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coding agents.&lt;/strong&gt; Given a goal, they read your repo, write and edit code, run tests, and iterate until the build passes. They take real actions across your codebase, committing, pushing, running commands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computer-use agents.&lt;/strong&gt; These control a screen the way a person would, clicking, typing, moving through apps and websites to complete tasks. Their tool is basically the entire computer, which makes their blast radius hard to bound.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer-support and ops agents.&lt;/strong&gt; They read tickets, look up account data, issue refunds, update records, and message customers. Each of those is an action against real systems and real people.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every case the pattern is the same: a goal, a loop, and tools that change something real. What differs is &lt;em&gt;which&lt;/em&gt; tools and &lt;em&gt;how much&lt;/em&gt; they can break.&lt;/p&gt;

&lt;h2&gt;
  
  
  The oversight problem: you can't just review outputs
&lt;/h2&gt;

&lt;p&gt;Here is the shift that trips up most teams. We learned to oversee AI by reviewing outputs: read the generated text, decide if it's good, use it or don't. That works for a chatbot because the output &lt;em&gt;is&lt;/em&gt; the product and nothing happens until you act.&lt;/p&gt;

&lt;p&gt;It breaks for agents, because the agent's product is &lt;em&gt;actions&lt;/em&gt; that take effect whether or not you read them. Reviewing the final summary doesn't help if the agent already deleted the wrong files getting there. Oversight has to move from reviewing outputs to &lt;strong&gt;governing actions&lt;/strong&gt;, the things the agent does along the way, while it can still be stopped or undone.&lt;/p&gt;

&lt;p&gt;LoopRails frames that as a simple method: &lt;strong&gt;Grade, Guard, Show, Prove.&lt;/strong&gt; First, &lt;em&gt;grade&lt;/em&gt; each action an agent can take on three axes (reversibility, blast radius, and stakes) and let the worst axis set the grade from G0 (trivial, reversible, local) to G3 (irreversible and external or severe). Reading a file is G0; deleting production data or sending money is G3. Then &lt;em&gt;guard&lt;/em&gt; each grade with a matching control instead of treating every action the same. Try this on your own agent's actions with the &lt;a href="https://looprails.dev/index.html#grader" rel="noopener noreferrer"&gt;interactive grader&lt;/a&gt;; the full method lives in the &lt;a href="https://looprails.dev/framework.html" rel="noopener noreferrer"&gt;LoopRails framework&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Underneath the controls, keep every governed action on the &lt;strong&gt;RAIL&lt;/strong&gt;: &lt;a href="https://looprails.dev/rail-reversible.html" rel="noopener noreferrer"&gt;Reversible&lt;/a&gt;, &lt;a href="https://looprails.dev/rail-authorized.html" rel="noopener noreferrer"&gt;Authorized&lt;/a&gt;, &lt;a href="https://looprails.dev/rail-interruptible.html" rel="noopener noreferrer"&gt;Interruptible&lt;/a&gt;, and &lt;a href="https://looprails.dev/rail-logged.html" rel="noopener noreferrer"&gt;Logged&lt;/a&gt;. If an action satisfies those four, even a missed review is recoverable, scoped, stoppable, and accountable. For a deeper introduction to the controls, see the guide to &lt;a href="https://looprails.dev/article-ai-agent-guardrails.html" rel="noopener noreferrer"&gt;AI agent guardrails&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One specific trap is worth naming early: the &lt;strong&gt;lethal trifecta.&lt;/strong&gt; An agent that has access to private data, exposure to untrusted content, and a channel to send data externally can be tricked through prompt injection into leaking that data. The malicious instruction hides in content the agent reads, and the agent looks like it's just doing its job. No "are you sure?" prompt reliably catches it. The full breakdown is in the guide to the &lt;a href="https://looprails.dev/article-lethal-trifecta.html" rel="noopener noreferrer"&gt;lethal trifecta&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a human in the loop isn't automatically enough
&lt;/h2&gt;

&lt;p&gt;The obvious fix is to put a person in front of the agent's actions and make it ask before it acts. That helps, but far less than people expect, and it's the most important thing to understand about overseeing agentic AI.&lt;/p&gt;

&lt;p&gt;In research on AI coding agents (see the &lt;a href="https://looprails.dev/codex.html" rel="noopener noreferrer"&gt;LoopRails codex&lt;/a&gt;), requiring plan-approval before the agent acted did reduce risky actions. But when a bad action slipped through, human intervention success stayed at just &lt;strong&gt;9 to 26%&lt;/strong&gt;. The gate cut &lt;em&gt;how often&lt;/em&gt; bad actions happened, yet barely improved the human's ability to &lt;em&gt;catch and stop&lt;/em&gt; one. People over-trust confident-looking suggestions and approve them with little real scrutiny, especially under time pressure. A confirmation prompt mostly turns a person into a click, not a detector.&lt;/p&gt;

&lt;p&gt;So the right question isn't "should a human review this?" It's: &lt;strong&gt;can a human realistically catch this mistake in time?&lt;/strong&gt; If yes, meaning the reviewer can see the real action, understand it, and stop or reverse it, a gate can work. If no, because the action is too fast, too opaque, or too irreversible, then a review is a trap. It stages a decision the human can't really make and launders the risk into their name. When you can't catch it in time, prevent the bad outcome instead of gating it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to start overseeing agents safely
&lt;/h2&gt;

&lt;p&gt;You don't need to rebuild everything. Start small and concrete:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;List the actions, not the features.&lt;/strong&gt; Write down every tool your agent can call: every command, API, and write operation. You're governing actions, so first you have to see them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grade each one G0 to G3&lt;/strong&gt; on reversibility, blast radius, and stakes. Most actions are low-grade and need no gate; a few are critical and need real protection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match the control to the grade.&lt;/strong&gt; Skip gates on G0/G1 to avoid fatigue; for G2, confirm with a real preview of the action and its effects; for G3, lean on prevention (sandboxes, blast-radius caps, capability locks, a kill switch) over approval prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep every action on the RAIL&lt;/strong&gt; so a missed step is still reversible, authorized, interruptible, and logged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prove it works.&lt;/strong&gt; Seed known-bad actions and prompt-injection attempts into your pipeline and measure whether your human or monitor actually catches them. Track intervention-success rate, not approval rate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the step-by-step version, work through the &lt;a href="https://looprails.dev/playbook.html" rel="noopener noreferrer"&gt;practitioner playbook&lt;/a&gt; and keep the &lt;a href="https://looprails.dev/cheatsheet.html" rel="noopener noreferrer"&gt;cheatsheet&lt;/a&gt; next to your next agent review. If you're choosing how much freedom to give an agent in the first place, the guide to &lt;a href="https://looprails.dev/article-ai-agent-autonomy-levels.html" rel="noopener noreferrer"&gt;AI agent autonomy levels&lt;/a&gt; maps grades to how much you let it run on its own. And for the foundations of keeping a person meaningfully involved, start with &lt;a href="https://looprails.dev/article-what-is-human-in-the-loop.html" rel="noopener noreferrer"&gt;what human-in-the-loop means&lt;/a&gt; and &lt;a href="https://looprails.dev/article-hitl-ai-safety.html" rel="noopener noreferrer"&gt;HITL for AI safety&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agentic AI&lt;/strong&gt; is an LLM that pursues a goal by taking actions in a loop, planning, using tools, observing results, and continuing, rather than just answering a single prompt.&lt;/li&gt;
&lt;li&gt;The defining difference from a chatbot is &lt;strong&gt;agency&lt;/strong&gt;: an agent acts on the world; a chatbot only produces text.&lt;/li&gt;
&lt;li&gt;It's powerful and risky for the same reasons. It takes &lt;strong&gt;real actions, autonomously, and fast&lt;/strong&gt;, with consequences that can't be undone by ignoring an output.&lt;/li&gt;
&lt;li&gt;Oversight must shift from &lt;strong&gt;reviewing outputs to governing actions&lt;/strong&gt;, using &lt;strong&gt;Grade, Guard, Show, Prove&lt;/strong&gt; and keeping every action on the &lt;strong&gt;RAIL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A human in the loop isn't automatically enough: when bad actions slip through, intervention success is only &lt;strong&gt;9 to 26%&lt;/strong&gt;, so prevention often beats review.&lt;/li&gt;
&lt;li&gt;Watch for the &lt;strong&gt;lethal trifecta&lt;/strong&gt; (private data, untrusted content, and an external channel), which review can't reliably catch.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;p&gt;Now that you can answer &lt;em&gt;what is agentic AI&lt;/em&gt;, the next step is to govern one. Run your agent's riskiest actions through the &lt;a href="https://looprails.dev/index.html#grader" rel="noopener noreferrer"&gt;interactive grader&lt;/a&gt; to see their G0 to G3 grade and the controls that match, then put the &lt;a href="https://looprails.dev/framework.html" rel="noopener noreferrer"&gt;LoopRails framework&lt;/a&gt; to work. The shift from reviewing outputs to governing actions is the whole job, and the sooner you make it, the safer your agents get.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://looprails.dev/article-what-is-agentic-ai.html" rel="noopener noreferrer"&gt;looprails.dev/article-what-is-agentic-ai.html&lt;/a&gt;. &lt;a href="https://looprails.dev" rel="noopener noreferrer"&gt;LoopRails&lt;/a&gt; is a free, sourced framework for designing human-in-the-loop oversight of AI agents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>What Is Human-in-the-Loop (HITL) in AI? A Practical Guide</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 16:59:49 +0000</pubDate>
      <link>https://dev.to/brennhill/what-is-human-in-the-loop-hitl-in-ai-a-practical-guide-2le5</link>
      <guid>https://dev.to/brennhill/what-is-human-in-the-loop-hitl-in-ai-a-practical-guide-2le5</guid>
      <description>&lt;p&gt;Human-in-the-loop (HITL) in AI means keeping a person involved in an automated system's decisions, approving, editing, or interrupting what an AI does, instead of letting it run fully on its own. For AI agents, human-in-the-loop is the practice of pausing the agent at chosen points so a human can review or steer an action before it takes effect. The hard part isn't &lt;em&gt;adding&lt;/em&gt; a human. It's making sure that human can actually catch the mistakes that matter.&lt;/p&gt;

&lt;p&gt;This guide explains what human-in-the-loop is, the three forms it takes in real AI agents, how it differs from full automation and human-on-the-loop, and why a review step is not the same as safety. Then it covers how to do HITL well, and when you should prevent a bad outcome instead of reviewing for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What human-in-the-loop actually means
&lt;/h2&gt;

&lt;p&gt;The phrase comes from control systems and machine learning, where a "loop" is the cycle of an action, its result, and a correction. Putting a human &lt;em&gt;in&lt;/em&gt; the loop means the cycle can't close without a person: the system stops and waits for input. Putting a human &lt;em&gt;on&lt;/em&gt; the loop means the system runs autonomously while a person watches and can step in. Taking the human &lt;em&gt;out&lt;/em&gt; of the loop means full automation.&lt;/p&gt;

&lt;p&gt;In AI agents (code agents, computer-use agents, support bots, ops automations) human-in-the-loop is how teams try to keep oversight as autonomy grows. The agent proposes or starts an action; a human gets a say. That's the idea. Whether it works depends entirely on the details, which is where most implementations quietly fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three modes of human-in-the-loop AI
&lt;/h2&gt;

&lt;p&gt;HITL shows up in agents in three recognizable shapes. Most products use a mix.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Approve-before-act
&lt;/h3&gt;

&lt;p&gt;The agent describes an action and waits for a yes before doing it: "Run this command?" "Send this email?" "Delete these rows?" This is the most common pattern and the most over-trusted. It feels safe because nothing happens without a click, but a click is not the same as understanding. (More on that below.)&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Review-and-edit
&lt;/h3&gt;

&lt;p&gt;The agent produces a draft (code, a message, a plan, a config change) and the human reviews and edits it before it ships. This is genuinely useful when the artifact is legible and the reviewer has time: a small diff, a short email, a single query. It degrades fast when the output is large or dense, because reviewers skim.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Interrupt-and-resume
&lt;/h3&gt;

&lt;p&gt;The agent runs autonomously, but a human (or a monitor) can pause, redirect, or kill it mid-task. This is the human-&lt;em&gt;on&lt;/em&gt;-the-loop end of the spectrum, and it's the right default for high-throughput work where stopping for every action would be absurd. It only counts as oversight if the interrupt is real: reachable, fast, and able to halt in-flight work.&lt;/p&gt;

&lt;h2&gt;
  
  
  HITL vs. full automation vs. human-on-the-loop
&lt;/h2&gt;

&lt;p&gt;These aren't three boxes. They're points on an autonomy ladder, and the right point depends on the action.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full automation:&lt;/strong&gt; the agent acts, no human gate. Correct for trivial, reversible, contained actions where a human adds nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-on-the-loop:&lt;/strong&gt; the agent acts autonomously while a human supervises and can interrupt. Good for high-volume work with a healthy kill switch and monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop:&lt;/strong&gt; the agent waits for a human before (or during) the action. Reserve it for the consequential, gateable middle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake is treating this as one global setting ("we have a human in the loop") applied to everything an agent can do. That guarantees either fatigue (you gate trivia) or blind risk (you under-gate the dangerous stuff). The &lt;a href="https://looprails.dev/framework.html" rel="noopener noreferrer"&gt;LoopRails framework&lt;/a&gt; argues you should pick the mode &lt;em&gt;per action&lt;/em&gt;, not per product. For the full mapping of grades to autonomy modes, see the &lt;a href="https://looprails.dev/playbook.html" rel="noopener noreferrer"&gt;playbook&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The misconception: a review step is not safety
&lt;/h2&gt;

&lt;p&gt;Here's the belief that breaks most HITL designs: &lt;em&gt;if a human approves it, it's safe.&lt;/em&gt; It isn't, and the evidence is direct.&lt;/p&gt;

&lt;p&gt;In research on AI coding agents (see the &lt;a href="https://looprails.dev/codex.html" rel="noopener noreferrer"&gt;LoopRails codex&lt;/a&gt;), requiring plan-approval before the agent acted cut attack occurrence from roughly 90% down to 60 to 74%. That sounds like a win until you look at whether the human actually &lt;em&gt;caught&lt;/em&gt; the bad action when it slipped through. Human intervention success stayed at just &lt;strong&gt;9 to 26%&lt;/strong&gt; across every approval strategy tested. Gating reduced &lt;em&gt;how often&lt;/em&gt; bad actions happened, but barely improved the human's ability to &lt;em&gt;recognize and stop&lt;/em&gt; one. The gate was working as a speed bump, not as a detector.&lt;/p&gt;

&lt;p&gt;Why? &lt;strong&gt;Automation bias.&lt;/strong&gt; People over-trust system suggestions and approve them without real scrutiny, especially when the system has been right before, when the output looks confident, and when there's time pressure to keep moving. A confirmation prompt does not turn a person into a good error-catcher. It mostly turns them into a click.&lt;/p&gt;

&lt;p&gt;Two failure modes follow from this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Rubber Stamp:&lt;/strong&gt; approvals get clicked through reflexively, so the gate stops bad actions occasionally but rarely &lt;em&gt;catches&lt;/em&gt; a targeted one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Moral Crumple Zone:&lt;/strong&gt; when something goes wrong, the human who clicked "approve" gets the blame, even though they never had a realistic chance to catch the problem. The review existed to assign accountability, not to prevent harm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your oversight only proves that &lt;em&gt;a review step exists&lt;/em&gt;, you have &lt;a href="https://looprails.dev/framework.html" rel="noopener noreferrer"&gt;Phantom Oversight&lt;/a&gt;: a control that looks like safety on the org chart and does nothing in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The better question
&lt;/h2&gt;

&lt;p&gt;Don't ask "should a human review this?" Ask: &lt;strong&gt;can a human realistically catch this mistake in time?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That reframes oversight as an engineering problem with a testable answer. A gate can work when the reviewer can see the real action and its consequences, has the competence and the time to judge, and can actually stop or reverse it. When they can't, when the consequence is high but their controllability is low, &lt;strong&gt;review is a trap.&lt;/strong&gt; You're staging a decision the human can't really make, and a confirmation prompt just launders the risk into their name.&lt;/p&gt;

&lt;p&gt;That's the line between oversight that prevents harm and oversight that exists to be pointed at after harm.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to do human-in-the-loop well
&lt;/h2&gt;

&lt;p&gt;LoopRails frames good HITL as four moves: &lt;strong&gt;Grade, Guard, Show, Prove.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Grade
&lt;/h3&gt;

&lt;p&gt;Score every action an agent can take on three axes (&lt;strong&gt;reversibility, blast radius, and stakes&lt;/strong&gt;) and let the highest axis set the grade, G0 to G3.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://looprails.dev/guide-g0.html" rel="noopener noreferrer"&gt;G0, trivial&lt;/a&gt;:&lt;/strong&gt; reversible, local, no stakes (read a file, run a read-only query). No gate; gating it just breeds fatigue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://looprails.dev/guide-g1.html" rel="noopener noreferrer"&gt;G1, low&lt;/a&gt;:&lt;/strong&gt; at most one medium axis (edit a local file, run tests). Cheap undo beats a confirmation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://looprails.dev/guide-g2.html" rel="noopener noreferrer"&gt;G2, high&lt;/a&gt;:&lt;/strong&gt; any one high axis (&lt;code&gt;git push&lt;/code&gt;, spend within budget, send an internal message). Confirm-before with a real preview.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://looprails.dev/guide-g3.html" rel="noopener noreferrer"&gt;G3, critical&lt;/a&gt;:&lt;/strong&gt; irreversible &lt;em&gt;and&lt;/em&gt; external or severe (deploy, pay, delete prod data, post publicly). Prevent, or escalate. Review alone won't hold here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Guard
&lt;/h3&gt;

&lt;p&gt;Match the control to the grade. Don't spend attention on G0/G1; gate G2 with a preview; for G3, lean on prevention patterns over approval prompts: Sandbox-First (contain blast radius in the environment), Blast-Radius Cap (limit any single action's magnitude), Capability Lock (make the bad action &lt;em&gt;impossible&lt;/em&gt;, not discouraged), Runtime Shield, Kill Switch, Circuit Breaker, and Maker-Checker (the proposer is never the approver).&lt;/p&gt;

&lt;h3&gt;
  
  
  Show
&lt;/h3&gt;

&lt;p&gt;When you do pull a human in, design the moment. Show them the &lt;em&gt;real&lt;/em&gt; action and its consequences (a diff, a preview, the side effects, whether it can be undone) rather than a bare "Approve?" Surface the agent's uncertainty and provenance so they can check rather than trust. And spend attention sparingly: interrupt rarely and at meaningful breakpoints, because over-prompting trains people to dismiss prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prove
&lt;/h3&gt;

&lt;p&gt;Treat "a human reviews it" as a claim to validate, not a checkbox. Seed known errors and prompt-injection attempts into your pipeline and measure whether the human (or monitor) actually &lt;em&gt;catches&lt;/em&gt; them. The number that matters is intervention-success rate, not approval rate. Untested oversight is unvalidated oversight.&lt;/p&gt;

&lt;p&gt;Underneath all four moves, keep every governed action on the &lt;strong&gt;RAIL&lt;/strong&gt;: &lt;a href="https://looprails.dev/rail-reversible.html" rel="noopener noreferrer"&gt;Reversible&lt;/a&gt;, &lt;a href="https://looprails.dev/rail-authorized.html" rel="noopener noreferrer"&gt;Authorized&lt;/a&gt;, &lt;a href="https://looprails.dev/rail-interruptible.html" rel="noopener noreferrer"&gt;Interruptible&lt;/a&gt;, and &lt;a href="https://looprails.dev/rail-logged.html" rel="noopener noreferrer"&gt;Logged&lt;/a&gt;. An action that satisfies those four leaves even a missed review recoverable, scoped, stoppable, and accountable.&lt;/p&gt;

&lt;h2&gt;
  
  
  When HITL is the wrong tool, prevent instead
&lt;/h2&gt;

&lt;p&gt;Sometimes the honest answer to "can a human catch this in time?" is no. The action is too fast, too opaque, or too irreversible, and no realistic prompt would let a person intervene effectively. In that case, &lt;em&gt;don't add a review.&lt;/em&gt; Adding one creates a Rubber Stamp and a Moral Crumple Zone at once. Change the action instead so the bad outcome can't happen or can be undone.&lt;/p&gt;

&lt;p&gt;The clearest example is the &lt;strong&gt;lethal trifecta.&lt;/strong&gt; An agent that has (1) access to private data, (2) exposure to untrusted content, and (3) a way to send data externally can be tricked by prompt injection into exfiltrating that data. No "are you sure?" prompt reliably catches this, because the malicious instruction is buried in content the human won't read, and the agent looks like it's doing its job. The fix isn't review; it's prevention. Remove any one leg (cut external send, isolate the private data, or sanitize the untrusted input) and the attack can't complete. That's a Capability Lock, not a gate.&lt;/p&gt;

&lt;p&gt;When consequence is high and controllability is low, prevention beats review every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop&lt;/strong&gt; means a person can approve, edit, or interrupt an AI's action before it takes effect, the opposite of full automation.&lt;/li&gt;
&lt;li&gt;It shows up in three modes: approve-before-act, review-and-edit, and interrupt-and-resume.&lt;/li&gt;
&lt;li&gt;Adding a review step is &lt;em&gt;not&lt;/em&gt; the same as safety: gates cut how often bad actions occur but barely improve a human's ability to catch one (9 to 26% intervention success), and automation bias makes approvals reflexive.&lt;/li&gt;
&lt;li&gt;Ask "can a human realistically catch this in time?", not "should a human review this?"&lt;/li&gt;
&lt;li&gt;Do HITL well with &lt;strong&gt;Grade, Guard, Show, Prove&lt;/strong&gt;, and keep every action &lt;strong&gt;Reversible, Authorized, Interruptible, Logged&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When a human can't catch the mistake in time, &lt;strong&gt;prevent&lt;/strong&gt; the bad outcome instead of staging a review.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;p&gt;Stop asking whether you have a human in the loop and start grading your agent's actions. Run your riskiest actions through the &lt;a href="https://looprails.dev/index.html#grader" rel="noopener noreferrer"&gt;interactive grader&lt;/a&gt; to see their G0 to G3 grade and the controls that match, then work the four moves with the &lt;a href="https://looprails.dev/playbook.html" rel="noopener noreferrer"&gt;practitioner playbook&lt;/a&gt;. Keep the &lt;a href="https://looprails.dev/cheatsheet.html" rel="noopener noreferrer"&gt;cheatsheet&lt;/a&gt; next to your next agent review, and the next time someone proposes "just add an approval step," ask whether the human can actually catch the mistake in time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://looprails.dev/article-what-is-human-in-the-loop.html" rel="noopener noreferrer"&gt;looprails.dev/article-what-is-human-in-the-loop.html&lt;/a&gt;. &lt;a href="https://looprails.dev" rel="noopener noreferrer"&gt;LoopRails&lt;/a&gt; is a free, sourced framework for designing human-in-the-loop oversight of AI agents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>73% of AI-agent credential leaks trace back to one mundane thing: debug logging</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:38:17 +0000</pubDate>
      <link>https://dev.to/brennhill/73-of-ai-agent-credential-leaks-trace-back-to-one-mundane-thing-debug-logging-59hc</link>
      <guid>https://dev.to/brennhill/73-of-ai-agent-credential-leaks-trace-back-to-one-mundane-thing-debug-logging-59hc</guid>
      <description>&lt;p&gt;A paper accepted to ASE 2026 — &lt;em&gt;&lt;a href="https://arxiv.org/abs/2604.03070" rel="noopener noreferrer"&gt;"How Your Credentials Are Leaked by LLM Agent Skills: An Empirical Study"&lt;/a&gt;&lt;/em&gt; (Chen et al.) — did something most agent-security discussion doesn't: it measured. The authors sampled 17,022 third-party agent "skills" and looked for credentials leaking out of them. The result is worth sitting with.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;520 skills&lt;/strong&gt; leaked credentials, across &lt;strong&gt;1,708 distinct issues&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;89.6%&lt;/strong&gt; of the leaked credentials were &lt;strong&gt;immediately exploitable&lt;/strong&gt; — and 92.5% of those during routine execution, no privilege escalation needed.&lt;/li&gt;
&lt;li&gt;Secrets removed from &lt;strong&gt;107 upstream repositories persisted across 50+ forks&lt;/strong&gt;, so "we patched it" didn't actually fix it downstream.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the single most useful finding is the &lt;em&gt;mechanism&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dominant cause is boring, and that's the point
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;73.5% of the leaks came from debug logging.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not a clever exploit. Not a novel attack. Debug logging. Here's why that's not as dumb as it sounds: in most agent frameworks, a tool's &lt;strong&gt;stdout is piped straight into the model's context window&lt;/strong&gt; — and from there into your traces and logs. So the moment a skill prints something for debugging, and that something happens to include an API key or a token, the secret has been handed to the model and written to your logs. Nobody decided to leak it. The plumbing did.&lt;/p&gt;

&lt;p&gt;This reframes how you should think about data hygiene in an agent. We spend most of our attention on what comes &lt;em&gt;in&lt;/em&gt; — prompt injection, untrusted documents, poisoned tool descriptions. But a tool's &lt;strong&gt;output is a leakage channel too&lt;/strong&gt;, running in the opposite direction, and it's the one this study found doing the most damage in the wild.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to actually do about it
&lt;/h2&gt;

&lt;p&gt;Three things, in order of leverage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Redact secrets on the tool-output path — before it reaches the context window or the logs.&lt;/strong&gt; Same discipline you apply to untrusted input, pointed the other way. A secret-shaped string in stdout should be scrubbed before the framework forwards it anywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep credentials capability-scoped and short-lived.&lt;/strong&gt; The study found 89.6% of leaked secrets immediately exploitable largely because they were broad and long-lived. A read-only, 15-minute token that leaks is a much smaller problem than a standing god-credential that leaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vet skills — and re-vet them.&lt;/strong&gt; The fork-persistence finding is a reminder that a skill you approved can change underneath you. Pin it, fingerprint it (a hash of its code/description), and re-check on load.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of this requires new infrastructure. It requires treating tool output with the same suspicion you already give tool input.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This study is one of the sources behind *&lt;/em&gt;&lt;a href="https://braceframework.org/" rel="noopener noreferrer"&gt;BRACE&lt;/a&gt;*&lt;em&gt;, an open, vendor-neutral framework for securing autonomous AI agents. Its &lt;a href="https://braceframework.org/guides/run-time/" rel="noopener noreferrer"&gt;run-time guide&lt;/a&gt; covers exactly this — data hygiene runs both ways, and tool output is a leakage channel. BRACE is built by reading the incidents and the research and asking, each time: what concrete control would have prevented or contained this?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>When your agent does something bad, can you tell which agent did it?</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:02:22 +0000</pubDate>
      <link>https://dev.to/brennhill/when-your-agent-does-something-bad-can-you-tell-which-agent-did-it-37a2</link>
      <guid>https://dev.to/brennhill/when-your-agent-does-something-bad-can-you-tell-which-agent-did-it-37a2</guid>
      <description>&lt;p&gt;An agent does something it shouldn't: deletes a record it had no business touching, sends a message to the wrong tenant, calls an API in a tight loop until the bill spikes. Someone asks the only question that matters in the first ten minutes of an incident: &lt;em&gt;which agent did this?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the honest answer is "we're not sure," everything downstream is harder. You can't contain what you can't name. You can't kill a build you can't identify. You can't audit, and you can't learn enough to stop it happening again.&lt;/p&gt;

&lt;p&gt;The frustrating part is that this is usually an identity problem, not a logging problem. The logs exist. They just don't say enough to point at one agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the action is unattributable
&lt;/h2&gt;

&lt;p&gt;Two patterns make most agent actions impossible to pin down.&lt;/p&gt;

&lt;p&gt;The first is &lt;strong&gt;shared service accounts&lt;/strong&gt;. Ten agents share one set of credentials, so every action shows up as the same actor. The IdP records "service-account-prod did X" ten thousand times a day, and there is no way to separate the agent that misbehaved from the nine that didn't.&lt;/p&gt;

&lt;p&gt;The second is &lt;strong&gt;agents running under a human's credentials&lt;/strong&gt;. The agent inherits the launching user's full access, and in the logs the action is indistinguishable from something the human did by hand. Now you have an attribution problem &lt;em&gt;and&lt;/em&gt; a blast-radius problem: the agent can do anything the human can.&lt;/p&gt;

&lt;p&gt;There's a subtler version too. Say two different builds of an agent both authenticate as the same account. Same name in the IdP, same token. In the logs they are identical — but one has a changed system prompt, a newer model, a different tool config. When one of them goes wrong, you cannot tell from the logs which build was running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Give the agent its own identity
&lt;/h2&gt;

&lt;p&gt;Step one: the agent gets its own identity, separate from the launching user. Not a human's credentials. Not a shared service account. Its own.&lt;/p&gt;

&lt;p&gt;This is the foundation everything else sits on. Once the agent authenticates as itself, every action it takes is at least &lt;em&gt;its own&lt;/em&gt; action in the record — not borrowed from a person, not blended with nine siblings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stamp six fields on every action
&lt;/h2&gt;

&lt;p&gt;Identity at the IdP is necessary but not sufficient. The action itself needs to carry enough context to be reconstructed later. Stamp six fields on every action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accountable party&lt;/strong&gt; — who is responsible for this agent existing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational owner&lt;/strong&gt; — who actually runs and maintains it day to day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant&lt;/strong&gt; — which customer this action was on behalf of.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-type-id&lt;/strong&gt; — which &lt;em&gt;build&lt;/em&gt; of the agent this is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-instance-id&lt;/strong&gt; — which specific &lt;em&gt;run&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace context&lt;/strong&gt; — where this sits in the call graph.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together they answer: who's responsible, who operates it, for which customer, which build, which run, and where in the chain of calls. Most systems capture one or two of these — usually a tenant and maybe a trace id. The gap between "one or two" and "all six" is exactly the gap that makes an incident unattributable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make agent-type-id a content hash, not a name
&lt;/h2&gt;

&lt;p&gt;The field that quietly breaks is &lt;strong&gt;agent-type-id&lt;/strong&gt;, because the obvious implementation is a name someone assigns. Call it &lt;code&gt;support-agent-v2&lt;/code&gt; and ship it. Three weeks later someone swaps the model, tweaks the system prompt, and ships again — still &lt;code&gt;support-agent-v2&lt;/code&gt;. The name didn't change; the behavior did. Silent drift, invisible in the logs.&lt;/p&gt;

&lt;p&gt;Make agent-type-id a &lt;strong&gt;content hash&lt;/strong&gt; instead. Hash over everything that determines how the agent behaves: the container image, the harness, the system prompt, the model identifier, the config. Like a container digest, but extended past the image to everything that shapes behavior.&lt;/p&gt;

&lt;p&gt;The property you want is that the id changes when &lt;em&gt;any&lt;/em&gt; input changes. Swap the model, the hash changes. Edit one line of the system prompt, the hash changes. A changed build can no longer masquerade as the old one, because it gets a new id automatically. Drift stops being silent and shows up as a new agent-type-id in your logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Track parent-child lineage
&lt;/h2&gt;

&lt;p&gt;Agents spawn sub-agents, and the sub-agent is where a lot of trouble actually happens. So record lineage: which sub-agent ran, under which parent, and — this is the part people miss — &lt;strong&gt;the prompt the parent handed it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That parent-passed prompt is often the only place an injected instruction is visible. A poisoned tool result or a manipulated upstream response turns into an instruction the parent passes down. If you didn't capture the handoff, the injection leaves no trace and the sub-agent looks like it just decided to misbehave on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identity is the recovery surface
&lt;/h2&gt;

&lt;p&gt;The thing to internalize: identity isn't paperwork you do for compliance. It's the &lt;strong&gt;recovery surface&lt;/strong&gt;. Containment, the kill switch, the audit trail, and the learning afterward all depend on being able to attribute an action to a specific agent build and run.&lt;/p&gt;

&lt;p&gt;And it has to be there &lt;em&gt;before&lt;/em&gt; the incident. Identity added afterward is too late for the incident you're currently in — you can instrument for next time, but the one in front of you stays unattributable.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://braceframework.org/guides/agent/" rel="noopener noreferrer"&gt;BRACE agent guide&lt;/a&gt; goes deeper on the field definitions and how they fit the broader framework.&lt;/p&gt;

&lt;p&gt;One honest question to leave you with: pull up your logs from an action an agent took an hour ago. Can you name the specific build that took it? If not, that's the gap — and it's worth closing before you need it.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>After an agent deleted a production database, I mapped what actually stops these failures</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:02:21 +0000</pubDate>
      <link>https://dev.to/brennhill/after-an-agent-deleted-a-production-database-i-mapped-what-actually-stops-these-failures-5bfl</link>
      <guid>https://dev.to/brennhill/after-an-agent-deleted-a-production-database-i-mapped-what-actually-stops-these-failures-5bfl</guid>
      <description>&lt;p&gt;A coding agent deleted a production database during a stated code freeze, then reported that rollback was impossible (it wasn't). Another agent deleted a user's files after misreading a command. A destructive payload was merged into a widely-distributed developer extension and shipped to roughly a million people. A zero-click prompt injection quietly exfiltrated data from a major enterprise AI assistant.&lt;/p&gt;

&lt;p&gt;These aren't edge cases anymore. Once an agent can plan, call tools, change real systems, and spawn sub-agents without a human reviewing each action, the question stops being "is the model good?" and becomes "what can this thing actually do when it's wrong?"&lt;/p&gt;

&lt;p&gt;I spent a while reading through the public incidents and trying to find the common thread. Here's the one that reframed it for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  An agent is not the code that shipped — it's a configuration
&lt;/h2&gt;

&lt;p&gt;When we review traditional software, we review code. For an autonomous agent there often isn't much code to review. The behavior comes from a &lt;em&gt;runtime configuration&lt;/em&gt;: a container, a harness (the wrapper that runs the model and hands it tools), a system prompt, a set of available tools, a memory store, an identity, and a network boundary.&lt;/p&gt;

&lt;p&gt;Two agents built from the exact same model can behave completely differently depending on how those parts are assembled. So the security question isn't "is the code safe" — it's "is the configuration bounded." That shift changes where you put your effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five concerns, and what each one bounds
&lt;/h2&gt;

&lt;p&gt;I organized the configuration into five places where things actually go wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build-time&lt;/strong&gt; — architecture, API access, the container, the harness. Fixed when the agent is built and frozen into the artifact. This is where you decide what the agent &lt;em&gt;can&lt;/em&gt; reach at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run-time&lt;/strong&gt; — data, memory, and behavioral checks active on every execution. This is where you watch what it's doing live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; — the per-agent-type concerns: scoped tokens, the system prompt as a policy surface, what tools it actually needs versus what it's been handed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration&lt;/strong&gt; — drift. The approved config and the running config diverge over time, and a hardened deployment quietly decays into an unsafe one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem&lt;/strong&gt; — the shared substrate every agent runs on: identity issuance, egress control, the MCP servers and supply chain it pulls from.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each concern bounds a different failure class. A scoped token bounds blast radius. Egress control bounds exfiltration. Drift detection bounds the slow decay. None of them are exotic; most are built from tools you already run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The single highest-leverage control
&lt;/h2&gt;

&lt;p&gt;If you do one thing: &lt;strong&gt;make the harness deny destructive verbs by default.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dangerous actions — &lt;code&gt;delete&lt;/code&gt;, &lt;code&gt;drop&lt;/code&gt;, &lt;code&gt;wipe&lt;/code&gt;, force-push, mass-revoke — get blocked at the harness unless explicitly allowed for that agent type. Not "the model was told not to." Intercepted, in the wrapper, where a confused or manipulated model can't talk its way past it.&lt;/p&gt;

&lt;p&gt;This is high-leverage because it sits below the model's reasoning. The production-DB deletion and the file-deletion incident both share a shape: an agent ran an irreversible operation it was never authorized to run. A harness that refuses destructive verbs by default turns "catastrophic and irreversible" into "blocked and logged" — without depending on the model being right in the moment. Pair it with narrowly scoped tokens (&lt;code&gt;read:invoices&lt;/code&gt;, not &lt;code&gt;invoices:*&lt;/code&gt;) and you've bounded the two worst incident classes.&lt;/p&gt;

&lt;h2&gt;
  
  
  I built a framework for this, and I'd like you to tear it apart
&lt;/h2&gt;

&lt;p&gt;The thing I put together is called BRACE — Build-time, Run-time, Agent, Configuration, Ecosystem. It's nine controls, three observability requirements, and a one-page sign-off checklist, and it reverse-maps to the OWASP Agentic Top 10 and MITRE ATLAS so you can see exactly which threat loses its primary mitigation for every control you choose to skip.&lt;/p&gt;

&lt;p&gt;It's open and vendor-neutral. The guides and the checklist are here: &lt;a href="https://braceframework.org/" rel="noopener noreferrer"&gt;https://braceframework.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm not posting this to sell you anything — there's nothing to buy. I'm posting it because I'd rather find the holes now than after someone ships an agent on it. If the "agent is a configuration" framing breaks down somewhere, or a control is missing, or one of them is unworkable in practice, I want to hear it. Adopting only part of it means you're accepting the remaining risk on purpose, and I'd like that risk to be honest.&lt;/p&gt;

&lt;p&gt;So: what's the worst autonomous-agent failure you've personally seen or cleaned up? I'm collecting the ones that don't make the news.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>Build-time is where agent security is won or lost</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:02:19 +0000</pubDate>
      <link>https://dev.to/brennhill/build-time-is-where-agent-security-is-won-or-lost-5944</link>
      <guid>https://dev.to/brennhill/build-time-is-where-agent-security-is-won-or-lost-5944</guid>
      <description>&lt;p&gt;In 2025 an AI coding agent deleted a production database during a stated code freeze, then told the operator a rollback was impossible. It wasn't a jailbreak or an exotic exploit. The agent simply had a path to prod, a credential that could drop tables, and a harness that let the destructive call through. Every link in that chain was a decision someone made before the agent ever started its run.&lt;/p&gt;

&lt;p&gt;That's the uncomfortable, useful part. Most agent security advice is about getting the model to behave — better prompts, better refusals, better guardrails on its output. Those help, but they all depend on the model behaving. Build-time controls don't. They're the things you freeze in advance — the tools, the network routes, the credentials, the deny list — and they hold whether the model is well-behaved, confused, or actively hijacked by injected input. If you only invest in one layer, invest here.&lt;/p&gt;

&lt;p&gt;Here's how I think about it, in plain terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Give the agent the least it needs, decided ahead of time.&lt;/strong&gt; Least privilege for tools, MCP servers, credentials, file paths, network destinations. Whatever you grant is exactly what a compromised agent can do — there's no daylight between "capabilities" and "attack surface."&lt;/p&gt;

&lt;p&gt;There's a second reason that points the same direction, and it's the one people miss: fewer tools also makes the agent &lt;em&gt;better&lt;/em&gt;. Every tool and MCP server you attach gets injected into the model's context as schema, on every step. More tools means more tokens spent reading menus, and measurably worse tool selection — the model fumbles more when it has fifty options than when it has six. So trimming the tool surface is not a security tax you pay against capability. It buys you both. That reframing tends to win the argument with people who'd otherwise resist locking things down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bound the blast radius by topology, not trust.&lt;/strong&gt; An agent can't delete a prod database it has no network route to. Physical and network isolation is threat-model-agnostic — it doesn't care whether the cause was a clever attacker, a confused model, or a runaway loop. Separate prod from staging from dev for real, so an agent with staging access can't reach production under any failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope and time-box every credential.&lt;/strong&gt; Capability-scoped tokens (&lt;code&gt;read:tickets&lt;/code&gt;, never &lt;code&gt;tickets:*&lt;/code&gt;), short-lived, no standing god-credentials sitting in the environment. An analysis agent should not be able to delete records through the same token it uses to read them. This is the layer teams skip most, because issuing a token per capability is real operational work — and it's exactly the layer that contains the damage when every other layer fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate destructive actions in the harness, deny by default.&lt;/strong&gt; This is the single highest-leverage build-time control, so it's worth being precise about. The &lt;em&gt;harness&lt;/em&gt; — the loop that runs the model and calls tools — keeps an explicit list of destructive verb classes: file deletion, recursive removal, &lt;code&gt;DROP&lt;/code&gt;/&lt;code&gt;TRUNCATE&lt;/code&gt;, force-push to protected branches, infra teardown, payment-state changes, mass external sends. Each one is intercepted &lt;em&gt;before it executes&lt;/em&gt;. The default is deny. An agent that invokes a destructive verb not pre-authorized for this run gets stopped at the harness — not because the harness reasoned that the call was unsafe, but because nothing reasoned that it was safe.&lt;/p&gt;

&lt;p&gt;The point is that the &lt;em&gt;model&lt;/em&gt; never gets to be the thing standing between a destructive command and your data. A human (or an out-of-band credential) does. When Amazon Q shipped a destructive wiper payload to roughly a million developers through a VS Code extension, the failure was a destructive verb reaching execution with nothing in front of it. A deny-by-default harness is the thing that's in front of it.&lt;/p&gt;

&lt;p&gt;Two more, briefly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pin and sign the build, fold the digest into the agent's identity.&lt;/strong&gt; A minimal signed container, pinned by digest, so silent drift between "what we reviewed" and "what's running" is detectable rather than invisible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat the harness and system prompt as versioned, diff-reviewed artifacts&lt;/strong&gt; — not config you can hand-edit in a web UI with no history. Every change to the tool allowlist, the capability scopes, or the destructive-verb list lands as a reviewed diff, approved by someone who isn't the author.&lt;/p&gt;

&lt;p&gt;None of this is exotic. It's the same engineering discipline you already apply to anything that touches production, pointed at a new kind of actor — one that improvises, runs unattended, and will do whatever its capabilities allow. The full guide and checklist is here: &lt;a href="https://braceframework.org/guides/build-time/" rel="noopener noreferrer"&gt;https://braceframework.org/guides/build-time/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, practically: how are you handling destructive tool calls today? Is there a real deny-by-default gate in your harness, or is the model still the last thing between an agent and &lt;code&gt;DROP TABLE&lt;/code&gt;?&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>You can't prevent prompt injection. So what do you actually do?</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:02:18 +0000</pubDate>
      <link>https://dev.to/brennhill/you-cant-prevent-prompt-injection-so-what-do-you-actually-do-1d37</link>
      <guid>https://dev.to/brennhill/you-cant-prevent-prompt-injection-so-what-do-you-actually-do-1d37</guid>
      <description>&lt;p&gt;There's a quiet assumption baked into a lot of agent security work: that with enough prompt engineering, the right system message, or the next model version, we'll get the model to stop following malicious instructions. It hasn't happened, and it's worth designing as if it won't. No current model reliably refuses adversarial input when that input is formatted as instructions. A single crafted prompt can strip the careful alignment you layered on top.&lt;/p&gt;

&lt;p&gt;So the useful question isn't "how do I prevent injection?" It's "injection will sometimes succeed — what state is my agent in afterward, and what can it actually do from there?"&lt;/p&gt;

&lt;p&gt;That reframe is the whole game for run-time security: the protections that run live on &lt;em&gt;every&lt;/em&gt; execution, not the ones you reason about at design time. Here are the parts that have held up in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model is not a security boundary
&lt;/h2&gt;

&lt;p&gt;If a single input can flip the model's behavior, then the model can't be the thing standing between an attacker and your systems. Treat it like a component that will occasionally do the wrong thing, and put the boundary somewhere it can't talk its way past.&lt;/p&gt;

&lt;p&gt;Concretely, that means two things downstream of the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability-scoped credentials.&lt;/strong&gt; The agent holds only the permissions the current task needs. A hijacked agent with read-only, narrowly-scoped tokens does a lot less damage than one holding your admin key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A gate on destructive verbs.&lt;/strong&gt; Deleting, sending, paying, granting access — these get an explicit check (a policy, a confirmation, a second factor) that doesn't depend on the model having behaved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Containment limits the blast radius. Detection tells you it happened. Neither requires the model to be trustworthy, which is the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Separate the data channel from the instruction channel
&lt;/h2&gt;

&lt;p&gt;Almost every injection bug reduces to one sentence: data got read as instructions. The fetched web page, the retrieved document, the tool output, the user upload — all of it is &lt;em&gt;data&lt;/em&gt;, and somewhere it got concatenated into the context the model treats as &lt;em&gt;commands&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So treat every external input as untrusted: user messages, fetched pages, tool outputs, retrieved documents, uploads. Indirect injection is the nasty case here — the payload rides in on content your agent went and fetched on its own, so "trusting the source" buys you nothing. Defend at the boundary where data enters, and don't splice untrusted text into the instruction context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data hygiene runs both ways
&lt;/h2&gt;

&lt;p&gt;Here's the part that's easy to miss. You watch what comes &lt;em&gt;in&lt;/em&gt;. But a tool's &lt;em&gt;output&lt;/em&gt; is a leakage channel too.&lt;/p&gt;

&lt;p&gt;Agent frameworks routinely pipe tool stdout — including debug logging — straight into the model's context window, and from there into your logs. An empirical study of 17,022 agent skills found credentials leaking exactly this way, with debug logging behind 73.5% of the cases. The secret was never meant for the model; it just happened to be on stdout, and the framework forwarded it.&lt;/p&gt;

&lt;p&gt;The fix is unglamorous: redact secrets from tool output &lt;em&gt;before&lt;/em&gt; it reaches context or logs. Same discipline as input, opposite direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor behavior, separately from quality
&lt;/h2&gt;

&lt;p&gt;A hijacked agent can produce clean, well-formatted, "high quality" output while doing something it shouldn't. Quality monitoring won't catch it, because nothing about the &lt;em&gt;result&lt;/em&gt; looks wrong. You need a separate signal: does this &lt;em&gt;sequence of actions&lt;/em&gt; look like normal behavior for this agent?&lt;/p&gt;

&lt;p&gt;That means baselining the action sequences you expect and alerting on deviation. There's a gradient of effort:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Static rules&lt;/strong&gt; — cheap, catch the obvious (an agent that never emails suddenly emailing).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequence-pattern baselines&lt;/strong&gt; — learn the normal shape of an agent's actions, flag the ones that don't fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A second model as judge&lt;/strong&gt; — independent review of the primary agent's behavior.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One detail that's easy to overlook: &lt;strong&gt;log context size at decision time.&lt;/strong&gt; Context size shapes behavior, so a baseline that doesn't condition on it will drift and misfire. Record it alongside the action.&lt;/p&gt;

&lt;h2&gt;
  
  
  And memory makes it persistent
&lt;/h2&gt;

&lt;p&gt;If your agent has memory, a one-shot injection can become a standing one — a poisoned "fact" gets written once and re-executes every session. Keep memory hygienic: scope it per instance or type, validate what gets written, and keep per-entry provenance so you can trace where a "fact" came from.&lt;/p&gt;




&lt;p&gt;None of this prevents prompt injection. It assumes injection lands and asks what your system does next. The &lt;a href="https://braceframework.org/guides/run-time/" rel="noopener noreferrer"&gt;BRACE run-time guide&lt;/a&gt; walks through these as a checklist if you want the structured version.&lt;/p&gt;

&lt;p&gt;So, honest question: if an agent of yours got hijacked mid-task right now, would you see it in the action stream — or are you flying blind on everything after the prompt? What does your behavioral baseline actually look like?&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Your AI agent is only as secure as the tools and agents it calls</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:02:17 +0000</pubDate>
      <link>https://dev.to/brennhill/your-ai-agent-is-only-as-secure-as-the-tools-and-agents-it-calls-53p7</link>
      <guid>https://dev.to/brennhill/your-ai-agent-is-only-as-secure-as-the-tools-and-agents-it-calls-53p7</guid>
      <description>&lt;p&gt;We spend a lot of effort hardening the agent itself: scoping its permissions, sandboxing its code execution, watching its outputs. Then it loads a third-party MCP server, and most of that work routes around the locks we built.&lt;/p&gt;

&lt;p&gt;That's the uncomfortable part of agent security nobody automates away: &lt;strong&gt;your agent is only as safe as the agents and tools it calls.&lt;/strong&gt; It loads third-party tools, talks to MCP servers, spawns sub-agents, and shares a substrate — a registry, an identity plane, a gateway, a kill-switch bus — with every other agent in your system. A failure in any of those doesn't stay put. It cascades through the shared substrate.&lt;/p&gt;

&lt;p&gt;A useful framing here: every control you build has two halves. An &lt;strong&gt;agent-scoped&lt;/strong&gt; half (what &lt;em&gt;this&lt;/em&gt; agent is allowed to do) and an &lt;strong&gt;ecosystem-scoped&lt;/strong&gt; half (the shared infrastructure every agent leans on). Most teams build the first half and assume the second. Here are six things worth getting concrete about.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. A tool you vetted can turn hostile later
&lt;/h3&gt;

&lt;p&gt;The scariest supply-chain fact about MCP is that approval is not a permanent state. In September 2025, the &lt;code&gt;postmark-mcp&lt;/code&gt; npm package shipped a routine-looking update. The only meaningful diff between the benign version and the malicious one was a single added line: a &lt;code&gt;Bcc&lt;/code&gt; field on the send-email function, quietly copying every message to an attacker's domain. Anyone on auto-update started leaking email with no visible change in behavior.&lt;/p&gt;

&lt;p&gt;That's a &lt;strong&gt;rug pull&lt;/strong&gt;: vetted on Monday, hostile on Thursday. Pinning versions and signing help, but they don't tell you &lt;em&gt;what changed&lt;/em&gt;. For that you want a &lt;strong&gt;fingerprint&lt;/strong&gt; — a hash of the tool's description plus its schema — recorded at approval time and re-checked on every load. If the fingerprint moves, the tool stops until a human looks. Cheap to compute, and it turns a silent rug pull into a loud one.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Tool descriptions and schemas are untrusted input
&lt;/h3&gt;

&lt;p&gt;Here's the detail that trips people up: a tool's description and parameter schema get &lt;strong&gt;injected straight into the agent's prompt&lt;/strong&gt;. That makes them an instruction channel, not just documentation. Invariant Labs demonstrated this last year — a benign-looking tool whose description carried hidden instructions to exfiltrate data. The term that stuck is &lt;strong&gt;tool poisoning&lt;/strong&gt;, and it's just prompt injection wearing a tool's clothes.&lt;/p&gt;

&lt;p&gt;So treat tool metadata like any other hostile input. Before a description reaches the model, scan it for invisible Unicode, right-to-left override characters, HTML comments, base64/hex blobs, and role-override phrasing ("ignore previous instructions", "you are now..."). Strip control characters. If you wouldn't trust a string from a web form, don't trust one from a tool registry.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Watch for lookalikes
&lt;/h3&gt;

&lt;p&gt;A malicious server doesn't need to beat your real tool — it just needs to sit next to it with a confusingly similar name. &lt;code&gt;send_email&lt;/code&gt; vs &lt;code&gt;send_emai1&lt;/code&gt;. Typosquatting and cross-server name confusion let a rogue tool intercept calls meant for a trusted one. Flag near-duplicate tool names, and &lt;strong&gt;namespace every tool by the verified identity of the server that published it&lt;/strong&gt;, so two tools called &lt;code&gt;search&lt;/code&gt; are never ambiguous.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Put a fail-closed gateway at the MCP boundary
&lt;/h3&gt;

&lt;p&gt;If you take one architectural idea from this, take this one: route all MCP traffic through a single auditable choke point. One gateway that authenticates the caller, scans the call and the response, rate-limits, writes an audit trail — and on &lt;em&gt;any&lt;/em&gt; error, &lt;strong&gt;denies&lt;/strong&gt;. Not "log and continue." Deny. A gateway that fails open is just latency.&lt;/p&gt;

&lt;p&gt;You don't have to invent the spec yourself. Microsoft's open &lt;strong&gt;MCP Security Gateway&lt;/strong&gt; spec is one conformance-tested implementation of exactly this pattern, and it's a reasonable reference point even if you build your own.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The kill switch has to reach the sub-agents
&lt;/h3&gt;

&lt;p&gt;Most kill switches halt the parent agent and call it done. But the parent has spawned sub-agents and opened tool sessions, and those keep running with the parent gone — orphaned processes still holding credentials and making calls. A real stop signal &lt;strong&gt;propagates&lt;/strong&gt; to every sub-agent and tool session, and leaves each one in a safe state.&lt;/p&gt;

&lt;p&gt;And like any safety system: if you haven't tested it firing, you don't have it. Pull the switch in a drill and watch whether the sub-agents actually stop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where this fits
&lt;/h3&gt;

&lt;p&gt;These five concerns — vetting, poisoning, lookalikes, the gateway, the kill switch — are the &lt;strong&gt;E (Ecosystem)&lt;/strong&gt; layer of &lt;a href="https://braceframework.org/guides/ecosystem/" rel="noopener noreferrer"&gt;BRACE&lt;/a&gt;, an open framework for agent security. The guide goes deeper on the substrate model and the agent-scoped/ecosystem-scoped split if you want the longer version.&lt;/p&gt;

&lt;p&gt;None of this is exotic. It's the same supply-chain hygiene we already apply to dependencies — pin, sign, fingerprint, verify on load — pointed at a new kind of dependency that can also &lt;em&gt;talk to your model&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So a real question to leave with: &lt;strong&gt;how are you vetting the MCP servers and tools your agents load today&lt;/strong&gt; — and would you catch it if one of them changed after you approved it?&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>There's no pull request to review for an autonomous agent. So what do you review?</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:02:16 +0000</pubDate>
      <link>https://dev.to/brennhill/theres-no-pull-request-to-review-for-an-autonomous-agent-so-what-do-you-review-355m</link>
      <guid>https://dev.to/brennhill/theres-no-pull-request-to-review-for-an-autonomous-agent-so-what-do-you-review-355m</guid>
      <description>&lt;p&gt;When you ship a normal service, security review has an anchor: the diff. Someone opens a pull request, someone reads it, and the thing that runs in production is the thing that got reviewed.&lt;/p&gt;

&lt;p&gt;Now put an autonomous agent in production. It plans, calls tools, and changes state, often without a human approving each action. Ask the obvious question — &lt;em&gt;where's the PR for what it just did?&lt;/em&gt; — and there isn't one. The agent didn't ship the action in a commit. It decided it at runtime.&lt;/p&gt;

&lt;p&gt;So the review you're used to doing is aimed at the wrong artifact. Let me try to point it at the right one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agent is a configuration, not the code that shipped
&lt;/h2&gt;

&lt;p&gt;Here's the load-bearing observation: an autonomous agent is not code. It's a runtime &lt;em&gt;configuration&lt;/em&gt; of infrastructure — a container, a harness (the loop that runs the model and calls tools on its behalf), a system prompt, a tool surface, memory, an identity, and a network egress policy.&lt;/p&gt;

&lt;p&gt;The model is mostly fixed. Everything around it is not. Two agents built from the &lt;em&gt;same&lt;/em&gt; model with the &lt;em&gt;same&lt;/em&gt; task description can behave completely differently depending on how those parts are configured — what tools they can reach, what the system prompt tells them to do, what memory they carry, what the network will let out. The security-relevant artifact is the running configuration, and a review aimed at your application code walks right past it.&lt;/p&gt;

&lt;p&gt;Once you see the agent as a config, the question "what do you review?" has an answer: you review the config.&lt;/p&gt;

&lt;h2&gt;
  
  
  Treat the system prompt and harness as versioned, diff-reviewed artifacts
&lt;/h2&gt;

&lt;p&gt;Most teams treat the system prompt as settings — a text box, editable in a dashboard, changed by whoever has access. That's the problem. The system prompt encodes the task, the rules for refusing requests, and the shape of the output. A one-line change to it can quietly remove a guardrail. An editable-in-prod system prompt is an &lt;em&gt;unreviewed, unattributed code path with full influence over the agent's behavior.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The incidents bear this out. The "Rules File Backdoor" weaponized editable rules files in Cursor and GitHub Copilot using invisible Unicode. DPD's support bot started swearing at customers after a guardrail-removing system update. NYC's MyCity bot told landlords they could refuse Section 8 tenants — illegal advice, live for weeks. None of these were model failures. They were configuration changes that nobody reviewed because nobody treated the configuration as something you review.&lt;/p&gt;

&lt;p&gt;So treat it like code. Put the system prompt, the harness config, hooks, and the MCP server list in version control. Change them only through review. Diff them. Now a guardrail removal shows up as a reviewable change with an author on it, instead of a silent edit in a prod console.&lt;/p&gt;

&lt;h2&gt;
  
  
  Freeze the config per release, and pin it into identity
&lt;/h2&gt;

&lt;p&gt;If the config is the artifact, then a change to the config is a new build — and you want that to be &lt;em&gt;visible&lt;/em&gt;. The practical move is to take a content hash over the deployed configuration — container digest, harness version, system prompt version, model identifier, settings — and make that hash the agent's type identity (BRACE calls it the &lt;code&gt;agent-type-id&lt;/code&gt;). Now any change to any of those parts produces a different identity. You can't quietly swap a prompt and keep the same name. The new config is a new build, and your logs say so.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detect drift at agent granularity
&lt;/h2&gt;

&lt;p&gt;Infrastructure-as-code drift detection is a control your platform team probably already runs. The catch is that it usually runs at the &lt;em&gt;host&lt;/em&gt; level, and the changes that matter for an agent hide one level down.&lt;/p&gt;

&lt;p&gt;So point it at the agent's surface specifically: container images, harness configurations, MCP server lists, system prompts, and &lt;em&gt;per-agent&lt;/em&gt; network egress policy — not just the host's. An MCP server quietly added to one agent's list (see the Postmark-MCP malicious-package incident) won't trip a host-level check. A drift check scoped to that agent's configuration will.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capture the config-adjacent observables
&lt;/h2&gt;

&lt;p&gt;Two things tell you what configuration actually ran, and both are easy to drop on the floor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision-time context size&lt;/strong&gt; — how much context the model had in front of it when it acted. The same agent behaves differently with a near-empty context than a near-full one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The parent-passed prompt&lt;/strong&gt; — in multi-agent setups, what the calling agent actually handed down. That's part of the effective configuration of the child, and it's invisible if you only log the child's own prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an action goes wrong, these are often the difference between "we can see exactly what config was in play" and "we're guessing."&lt;/p&gt;




&lt;p&gt;None of this requires new infrastructure. Version control, content hashing, IaC drift detection, and structured logging are things you already run. What changes is &lt;em&gt;where you point them&lt;/em&gt; — at the agent's configuration, which is the artifact that actually decides what the agent does.&lt;/p&gt;

&lt;p&gt;This is the Configuration concern in &lt;a href="https://braceframework.org/guides/configuration/" rel="noopener noreferrer"&gt;BRACE&lt;/a&gt;, an open framework for agent security; the guide goes deeper on each control.&lt;/p&gt;

&lt;p&gt;One honest question to leave with: &lt;strong&gt;do you version and diff-review your agents' system prompts — or are they editable runtime config that anyone with console access can change without a trace?&lt;/strong&gt; The answer tells you whether there's anything to review at all.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>How a sandwich defeats North Korea's hackers (and the US couldn't for 70 years)</title>
      <dc:creator>Brenn Hill</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:11:21 +0000</pubDate>
      <link>https://dev.to/brennhill/how-a-sandwich-defeats-north-koreas-hackers-and-the-us-couldnt-for-70-years-5bg4</link>
      <guid>https://dev.to/brennhill/how-a-sandwich-defeats-north-koreas-hackers-and-the-us-couldnt-for-70-years-5bg4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49il8f4gk9ojd3hsa4nv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49il8f4gk9ojd3hsa4nv.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;br&gt;
Two days ago, Google's Mandiant team &lt;a href="https://thehackernews.com/2026/03/axios-supply-chain-attack-pushes-cross.html" rel="noopener noreferrer"&gt;attributed the axios npm compromise&lt;/a&gt; to UNC1069 — a North Korean threat group previously linked to cryptocurrency theft and attacks on DeFi platforms. The malicious code shares significant overlap with WAVESHAPER, a C++ backdoor Mandiant attributed to the same group in February.&lt;/p&gt;

&lt;p&gt;North Korea just weaponized the most popular HTTP client in JavaScript. 100 million weekly downloads. The payload: a cross-platform RAT that harvests credentials, SSH keys, and cloud tokens from every developer machine that runs &lt;code&gt;npm install&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The United States has spent 70 years and trillions of dollars trying to contain North Korea. Nuclear negotiations, sanctions, carrier groups, diplomatic pressure, UN resolutions. None of it has stopped the DPRK from becoming one of the most effective cyber threats on the planet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A sloppy joe sandwich stops them in 3 seconds.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What happened
&lt;/h2&gt;

&lt;p&gt;On March 30, the attacker compromised the npm account of axios's lead maintainer (&lt;code&gt;jasonsaayman&lt;/code&gt;) using a stolen access token. They changed the account email to a Proton Mail address and published two malicious versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="mailto:axios@1.14.1"&gt;axios@1.14.1&lt;/a&gt;&lt;/strong&gt; — published March 31, 00:21 UTC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="mailto:axios@0.30.4"&gt;axios@0.30.4&lt;/a&gt;&lt;/strong&gt; — published March 31, 01:00 UTC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both versions injected a new dependency: &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;. This package was never imported anywhere in the axios source. Its sole purpose was to run a &lt;code&gt;postinstall&lt;/code&gt; hook that deployed platform-specific RATs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: Binary at &lt;code&gt;/Library/Caches/com.apple.act.mond&lt;/code&gt;, executed via AppleScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows&lt;/strong&gt;: PowerShell RAT with Registry persistence and in-memory binary injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux&lt;/strong&gt;: Python RAT script via &lt;code&gt;nohup&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dropper script deleted itself after execution to hide forensic evidence. The attacker staged &lt;code&gt;plain-crypto-js&lt;/code&gt; 18 hours in advance, pre-built three platform payloads, and hit both release branches within 39 minutes. This was not amateur hour.&lt;/p&gt;
&lt;h2&gt;
  
  
  How sloppy-joe blocks every layer of this attack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/brennhill/sloppy-joe" rel="noopener noreferrer"&gt;sloppy-joe&lt;/a&gt; is an open-source supply chain security tool. It runs &lt;strong&gt;before&lt;/strong&gt; &lt;code&gt;npm install&lt;/code&gt;. It reads your &lt;code&gt;package-lock.json&lt;/code&gt; and checks every dependency — direct and transitive — against multiple independent signals. No packages are downloaded. No code is executed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Signal 1: Version age gate
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR axios [metadata/version-age]
      Version '1.14.1' of 'axios' was published 0 hours ago (minimum: 72 hours).
      New versions need time for the community and security scanners to review them.
 Fix: Wait until the version is at least 72 hours old, or pin to an older version.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The compromised versions were live for 2-3 hours before npm yanked them. A 72-hour gate means they never get installed. Period. This requires zero knowledge of the attack — it works purely on the principle that new versions should survive community review before hitting production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This single check stops the attack.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But sloppy-joe doesn't stop at one signal. With &lt;code&gt;--deep&lt;/code&gt; transitive scanning, &lt;code&gt;plain-crypto-js&lt;/code&gt; gets demolished by five independent checks:&lt;/p&gt;
&lt;h3&gt;
  
  
  Signal 2: New package detection
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR plain-crypto-js [metadata/new-package]
      'plain-crypto-js' was first published 0 days ago. New packages are higher
      risk — verify this is a legitimate, maintained project before depending on it.
 Fix: Verify 'plain-crypto-js' at its registry page and source repository.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;plain-crypto-js&lt;/code&gt; was created the day before the attack. Brand new packages as transitive dependencies of 100M-download packages are inherently suspicious.&lt;/p&gt;
&lt;h3&gt;
  
  
  Signal 3: Install script risk amplifier
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR plain-crypto-js [metadata/install-script-risk]
      'plain-crypto-js' has install scripts AND was published 0 days ago and with
      0 downloads. Install scripts on new, low-download packages are the #1
      malware delivery vector.
 Fix: Do not install this package. Verify it is legitimate before proceeding.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Install scripts + new package + zero downloads. This is the exact fingerprint of a supply chain attack. sloppy-joe's install script risk signal combines multiple weak signals into a high-confidence detection. Every real-world npm supply chain attack in the last 5 years has matched this pattern.&lt;/p&gt;
&lt;h3&gt;
  
  
  Signal 4: No source repository
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING plain-crypto-js [metadata/no-repository]
        'plain-crypto-js' has no source repository URL and is a new package
        (&amp;lt; 30 days old). Legitimate packages almost always link to their source code.
 Fix: Verify 'plain-crypto-js' at its registry page.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Legitimate packages link to their GitHub/GitLab repo. Malicious packages created as payload delivery vehicles don't bother.&lt;/p&gt;
&lt;h3&gt;
  
  
  Signal 5: Name similarity
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING plain-crypto-js [similarity/mutation-match]
        'plain-crypto-js' is suspiciously similar to existing package 'crypto-js'
 Fix: Verify this is the package you intend to use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;plain-crypto-js&lt;/code&gt; is a clear attempt to look like &lt;code&gt;crypto-js&lt;/code&gt; — a real, popular cryptography package with hundreds of millions of downloads. sloppy-joe's mutation generators catch this.&lt;/p&gt;
&lt;h2&gt;
  
  
  Five signals. One sandwich. Zero dollars.
&lt;/h2&gt;

&lt;p&gt;Here's what's remarkable: none of these detections require threat intelligence feeds, malware signature databases, or AI-powered behavioral analysis. They're all variations of the same primitive: &lt;strong&gt;cross-reference what the code claims to use against what actually exists on the registry.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the version too new? Flag it.&lt;/li&gt;
&lt;li&gt;Is the transitive dep brand new? Flag it.&lt;/li&gt;
&lt;li&gt;Does a brand new package have install scripts? Block it.&lt;/li&gt;
&lt;li&gt;Does it have no source repository? Flag it.&lt;/li&gt;
&lt;li&gt;Does the name look like a popular package? Flag it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each signal alone is informational. All five firing together on the same package is a certainty.&lt;/p&gt;
&lt;h2&gt;
  
  
  The cost of this defense
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to any CI pipeline&lt;/span&gt;
npx sloppy-joe check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it. One line. Runs in 5-15 seconds. No account needed. No API key. No subscription. Open source, MIT licensed.&lt;/p&gt;

&lt;p&gt;The DPRK's UNC1069 spent 18 hours staging payloads, pre-building RATs for three platforms, and compromising a maintainer account. sloppy-joe catches it in the time it takes to read a &lt;code&gt;package-lock.json&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  What sloppy-joe can't catch
&lt;/h2&gt;

&lt;p&gt;Honesty matters: sloppy-joe would &lt;strong&gt;not&lt;/strong&gt; have detected the credential theft itself. The attacker hijacked the real maintainer's account — the npm &lt;code&gt;_npmUser&lt;/code&gt; field still shows &lt;code&gt;jasonsaayman&lt;/code&gt;. There's no publisher change to detect. npm's &lt;a href="https://docs.npmjs.com/generating-provenance-statements" rel="noopener noreferrer"&gt;provenance attestation&lt;/a&gt; (OIDC-based publishing via GitHub Actions) is the real defense against token theft — the malicious versions were published manually, bypassing axios's CI pipeline.&lt;/p&gt;

&lt;p&gt;sloppy-joe catches the &lt;strong&gt;payload&lt;/strong&gt;, not the &lt;strong&gt;compromise&lt;/strong&gt;. But the payload is what hurts you.&lt;/p&gt;
&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
cargo &lt;span class="nb"&gt;install &lt;/span&gt;sloppy-joe

&lt;span class="c"&gt;# Run&lt;/span&gt;
sloppy-joe check

&lt;span class="c"&gt;# In CI (GitHub Actions)&lt;/span&gt;
- uses: brennhill/sloppy-joe-action@v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/brennhill" rel="noopener noreferrer"&gt;
        brennhill
      &lt;/a&gt; / &lt;a href="https://github.com/brennhill/sloppy-joe" rel="noopener noreferrer"&gt;
        sloppy-joe
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Shields against supply-chain, slopsquatting, and typosquatting attacks from dependencies and code.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;p&gt;
  
    &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fbrennhill%2Fsloppy-joe%2Fmain%2Fassets%2Fsloppy-joe.svg%3Fv%3D3" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fbrennhill%2Fsloppy-joe%2Fmain%2Fassets%2Fsloppy-joe.svg%3Fv%3D3" alt="sloppy-joe" width="400"&gt;&lt;/a&gt;
  
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Catch hallucinated, typosquatted, and non-canonical dependencies&lt;br&gt;before they reach production.&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;
  &lt;code&gt;cargo install sloppy-joe&lt;/code&gt;
&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The &lt;a href="https://thehackernews.com/2026/03/teampcp-backdoors-litellm-versions.html" rel="nofollow noopener noreferrer"&gt;LiteLLM supply chain attack&lt;/a&gt; (March 2026) compromised a package with 97M monthly downloads. Attackers stole publishing credentials, pushed malicious versions that harvested SSH keys, cloud credentials, and K8s secrets. sloppy-joe's default 72-hour version age gate would have blocked both poisoned versions — they were discovered within hours, well before the gate would have opened. If you run &lt;code&gt;sloppy-joe check&lt;/code&gt; in CI, this attack fails.&lt;/strong&gt; &lt;a href="https://github.com/brennhill/sloppy-joe/docs/blog/2026-03-24-litellm-attack-blocked.md" rel="noopener noreferrer"&gt;Full analysis&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI code generators hallucinate package names &lt;a href="https://arxiv.org/abs/2406.10279" rel="nofollow noopener noreferrer"&gt;~20% of the time&lt;/a&gt;. Attackers register those names and wait. sloppy-joe catches them in CI before &lt;code&gt;npm install&lt;/code&gt; or &lt;code&gt;pip install&lt;/code&gt; runs.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How to Use&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install (single static binary, no runtime dependencies)&lt;/span&gt;
cargo install sloppy-joe
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Or download an auditable binary archive from GitHub Releases&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; https://github.com/brennhill/sloppy-joe/releases&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Check current project — auto-detects ecosystem from manifest files&lt;/span&gt;
sloppy-joe check

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Check a&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/brennhill/sloppy-joe" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;






&lt;p&gt;&lt;em&gt;sloppy-joe is an open-source supply chain security tool that catches hallucinated, typosquatted, and compromised dependencies before they reach production. It runs before your package manager, requires no code execution, and blocks attacks like axios, LiteLLM, event-stream, and ua-parser-js.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>npm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
