DEV Community: Vasu Dalal

I catalogued 32 real AI-agent failures, then marked the ones we cannot stop

Vasu Dalal — Thu, 16 Jul 2026 00:46:12 +0000

Every agent-security vendor tells you what they block. Nobody tells you what they miss.

That gap is the whole problem. "We stop prompt injection" is a claim you cannot check. You cannot run it, and you cannot tell it apart from the next company saying the same sentence. So security engineers do the rational thing and discard all of it.

I published the opposite. It is called the ARE Incident Database, and it is public: https://aredb.org

What is in it

32 agent failures that actually happened, each with a real source. A production database dropped during a code freeze. Twenty-five thousand documents deleted in the wrong environment. Credentials read and shipped to an external sink. A budget burned to zero in a loop.

Each one gets a stable id (ARE-2026-001 through ARE-2026-032), and each one is mapped to its category in the OWASP Agentic Security Initiative Top 10, which is the peer-reviewed catalog of what goes wrong with agents. AREDB does not compete with it. OWASP owns the map. These are the cited incidents underneath it.

The part that makes it uncomfortable to publish

Every entry carries a coverage flag, and the flag is about our own product.

We block 23 of the 32 today. Two more are partial, and they say partial. That leaves six of the ten OWASP categories covered at the action layer, and four that we cover only partially or not at all:

ASI06 Memory and context poisoning. We strip the hidden characters attackers use to smuggle instructions into text. We do not read the meaning of the text itself, so this one is only partial, and we mark it partial.
ASI07 Insecure inter-agent communication. This is about how agents talk to each other over the network, which a firewall that sits in front of actions never sees. Not ours.
ASI09 Human-agent trust. This is a design and disclosure problem. There is no action for a firewall to catch. Not ours.
ASI10 Rogue agents. We stop the dangerous actions, but we do not diagnose the misbehavior itself. Partial.

A firewall that claimed all ten would be lying, and every security engineer reading this already knows that. The four we do not cover are named in the registry, with a pointer to the discipline that does own them.

Do not trust any of it. Run it.

Here is the part I actually care about. Every covered entry ships a repro you can execute. Not a screenshot, not a demo video, not a claim. A snippet.

This is the Replit production wipe, reproduced against the free package:

pip install agentx-security-sdk

from agentx_sdk import agentx_protect, is_block

@agentx_protect(agent_id="aredb-repro", action="db_write")
def run_sql(query: str):
    return "EXECUTED"          # the agent never gets here

result = run_sql("DROP TABLE users;")
print(is_block(result))        # True
print(result)                  # the block, and the safe path to take instead

No key. No gateway. No account. Your data stays on your machine. Sixty seconds, and you have checked one of my claims yourself.

The registry does not trust itself either

A repro that nobody runs is a screenshot with extra steps. So the registry runs its own.

test_repros.py scrapes the fenced Python block out of each published page and executes it. That means the code a reader copies is byte-for-byte the code we prove blocks. It asserts three things, because "it blocked" is a weaker claim than it sounds:

the block fires,
the tool body never ran (a warning printed next to an action that still happens is not a block),
the process exits clean (a crash is not a block either).
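
A minimal sketch of that kind of three-part check, for readers who want to see its shape. This is not the registry's test_repros.py. The function names, the regex, and the convention that a tool body which ran would surface the string EXECUTED in stdout are all assumptions made for illustration.

```python
import re
import subprocess
import sys

# Matches the first fenced Python block on a page.
# The `{3} quantifier avoids writing a literal fence inside this example.
FENCE = re.compile(r"`{3}python\n(.*?)`{3}", re.DOTALL)


def extract_repro(page_text: str) -> str:
    """Return the body of the first fenced Python block on a page."""
    match = FENCE.search(page_text)
    if match is None:
        raise ValueError("no fenced Python block found")
    return match.group(1)


def check_repro(code: str, body_marker: str = "EXECUTED") -> dict:
    """Run the snippet in a fresh interpreter and report the three checks."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=60,
    )
    return {
        # Assumption: the repro prints is_block(result), so "True" is its own line.
        "block_fired": "True" in proc.stdout.splitlines(),
        # Assumption: the tool body would surface this marker if it had run.
        "body_never_ran": body_marker not in proc.stdout,
        # A crash or non-zero exit is reported separately from the block itself.
        "exit_clean": proc.returncode == 0,
    }
```

Keeping the three results separate is the design point: a run where the block fires but the body also executes shows up as one failed check, not as a pass.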

CI runs it on every push, and weekly on a schedule, because a repro can rot without anyone touching the repo. A future SDK release could quietly change behavior underneath a published claim, and I would rather find that out from a red badge than from you.

So a coverage flag here is a tested fact, not just our opinion. If a claim ever stops being true, the entry gets reclassified. The page does not get reworded.

What this is not

Not a vulnerability database. Publicly disclosed incidents only. If you have an undisclosed vulnerability in someone's product, report it to them, not to me.
Not a category system. OWASP ASI is the standard one. This indexes onto it.
Not exhaustive. It is a founding batch of 32. That is the honest size of it.
Not neutral about who built it. AgentX-Core is the tool that does the blocking here, and I wrote both it and the registry. The defense against that is not a promise, it is the repro. Run it.

Two asks

Try to break one. Take an entry marked covered, run its snippet, and tell me it does not block. That is the most useful message you can send me, and it is the one I will act on fastest.
Send me the failure that bit you. A real incident with a source. If it clears the bar, it gets an ARE-2026-NNN id and it goes in. CONTRIBUTING.md has the threshold, which is material consequence, not novelty.

Registry: https://aredb.org
Repo: https://github.com/vdalal/ARE-Incident-Database
Tell me what it missed: https://discord.gg/PmWRTtaSx2

The NSA wrote the MCP threat model. It never says what your agent does after the block.

Vasu Dalal — Sat, 04 Jul 2026 09:02:18 +0000

In May 2026 the NSA's AI Security Center published a Cybersecurity Information Sheet: Model Context Protocol (MCP): Security Design Considerations for AI-Driven Automation (U/OO/6030316-26). It is the first government-authored threat model for MCP, and if you run agents through MCP servers, it is worth the read.

Read it end to end and one word never appears: recover.

That absence is the whole point of this post.

The part the NSA gets exactly right

The report does not treat a tool call like an API request. It treats it like a loaded gun:

"It is prudent to treat any tool execution triggered via MCP as a potentially high-risk action."

And it says where the check belongs. Not on the network, not in a scanner you run before deployment, but at runtime, in the process, at the moment the action would happen:

"if a server does not require access to sensitive file systems, model or data files, or internal networks, those access paths should be explicitly denied at runtime."

It goes further and describes output filtering that reads like a product spec:

"Each output must be treated as untrusted input to the next phase of the pipeline... Filtering should include content length checks, disallowed keyword scanning, rate limiting, and application specific policy enforcement... detection of indirect prompt injection or toolchain pivot attempts."

If you have ever argued that an API gateway or a WAF cannot see an agent run rm -rf or open its own database socket, the NSA just wrote that argument down for you. The enforcement point is inside the process, where text becomes action.
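
To make the quoted filtering guidance concrete, here is a rough sketch of two of the checks it names, content length and disallowed keyword scanning. It is neither the NSA's reference code nor what agentx-mcp ships. `OutputPolicy`, `screen_output`, the default limit, and the example phrases are hypothetical, and rate limiting and application-specific policy are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPolicy:
    max_chars: int = 10_000
    # Illustrative phrases only; a real deny-list is application specific.
    disallowed: tuple = ("ignore previous instructions", "begin private key")


def screen_output(text: str, policy: OutputPolicy = OutputPolicy()) -> list:
    """Return the policy findings for one tool output (empty list means pass)."""
    findings = []
    if len(text) > policy.max_chars:
        findings.append("too_long")
    lowered = text.lower()  # case-insensitive keyword scan
    for phrase in policy.disallowed:
        if phrase in lowered:
            findings.append("disallowed:" + phrase)
    return findings
```

The point of treating each output as untrusted input is that this check runs on what the tool returned, before the next step of the pipeline reads it.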

The half it leaves to you

Here is what the report does not do. Every one of its mitigations is prevention: block, sandbox, validate, sign, log, scan, patch. Not one of them says what your agent does after a dangerous call is stopped.

In a normal web app that is fine. A blocked request returns a 403 and a human retries. In an autonomous run, that same block lands in the middle of a multi-step task and the agent gives up. You did not get safety. You traded one broken outcome for another, and you burned the tokens getting there.

The NSA even names the requirement without offering an answer:

bring MCP in line with secure practices "without stifling the flexibility and power that make it attractive in the first place."

Blocking the call and killing the run is stifling it. The report says the block has to happen and says nothing about keeping the agent working through it. That gap is the interesting engineering problem, and it is where we spend our time.

Both halves, one line in mcp.json

agentx-mcp is a small stdio proxy. It spawns your real MCP server, relays the protocol untouched, and screens every tools/call before it runs:

{
  "mcpServers": {
    "database": {
      "command": "agentx-mcp",
      "args": ["npx", "-y", "your-real-mcp-server", "..."]
    }
  }
}

pip install agentx-security-sdk   # ships the agentx-mcp command

The block (the NSA's "high-risk action, denied at runtime"): a deterministic floor catches the blatant catastrophic calls before they reach the server. A DROP TABLE, an unscoped DELETE, a secret-store read, an SSRF to 169.254.169.254, an rm -rf. No API key, nothing leaves your machine, no model in the hot path for the block. It screens the protocol, so it works with any MCP-speaking stack. You can verify that part in two minutes without trusting me.
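
Screening at the protocol level can be pictured roughly like this. It is a sketch under assumptions, not agentx-mcp's code: `should_forward` and `iter_strings` are hypothetical names, and the `is_dangerous` predicate stands in for whatever rule set does the actual judging. The only thing taken from MCP itself is the shape of a tools/call request, a JSON-RPC message with `params.name` and `params.arguments`.

```python
import json
from typing import Callable, Iterator


def iter_strings(value) -> Iterator[str]:
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def should_forward(raw_line: str, is_dangerous: Callable[[str], bool]) -> bool:
    """Decide whether one client-to-server JSON-RPC line may be relayed."""
    try:
        message = json.loads(raw_line)
    except json.JSONDecodeError:
        return True  # not parseable here; relay untouched
    if not isinstance(message, dict) or message.get("method") != "tools/call":
        return True  # initialize, tools/list, notifications: relay untouched
    params = message.get("params")
    arguments = params.get("arguments") if isinstance(params, dict) else None
    # Block if any string argument, at any depth, trips the predicate.
    return not any(is_dangerous(text) for text in iter_strings(arguments))
```

Because the check looks only at the message, not at the server behind it, the same function applies to a database server, a filesystem server, or a shell.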

The half the report skips: the block does not end the run. The agent gets back what it needs to correct, and on the MCP path its own model does the correcting, so the task finishes instead of dying on the block. You can watch the whole block-and-recover loop for free.

Here it is end to end on a real MCP server. The task is "report the user count," and the query hides an injection: SELECT name FROM users; DROP TABLE users;

The DROP TABLE never reaches the database. The run keeps going and still returns the right answer: three users, table intact, task done.

What this is not

To be clear about scope, because the report is broader than any one tool:

The NSA doesn't endorse products. The CSI carries an explicit disclaimer, and this post is not affiliated with them. The report describes the requirement. This is one way to meet part of it.
The report also recommends message signing, token lifecycle controls, DLP proxies, and network scanning for rogue MCP servers. Those are real, and they are not this. agentx-mcp is the runtime action-enforcement and output-filtering layer, plus the recovery the report leaves open. Compose it with the rest.

Try it, and tell me what it caught

I want people running MCP servers against something real (a database, a filesystem, cloud, internal APIs) to wrap one and tell me two things:

What did it catch that would have bitten you?
What dangerous tool call did it miss? Try to stump it.
Watch the catch-and-recover live, and try it
Tell me what broke or what it missed: https://discord.gg/PmWRTtaSx2

The NSA said treat every tool call as high-risk and deny at runtime. Fair. The next question is what your agent does the second after you do.

Is your MCP server safe? One line in mcp.json, and your agent recovers from its own DROP TABLE

Vasu Dalal — Tue, 30 Jun 2026 01:25:47 +0000

If you run an AI agent through MCP (Claude Code, Cursor, or any MCP client), your tool calls now flow through MCP servers: a filesystem server, a database server, a shell. That standardization is great. It also means a single hallucinated or prompt-injected tool call can do real, irreversible damage, and the model does not know a destructive call from a safe one until it is already making it.

So people ask: is this MCP server safe?

Here is the better question. Your agent will, eventually, send an MCP server something destructive. The question is not only whether you block it. It is whether the run survives the block.

Block it with one line. No code, no key.

Wrap any MCP server with agentx-mcp. It is a small stdio proxy: it spawns the real server, relays the MCP protocol untouched, and screens every tools/call before it runs. One line in your mcp.json:

{
  "mcpServers": {
    "database": {
      "command": "agentx-mcp",
      "args": ["npx", "-y", "your-real-mcp-server", "..."]
    }
  }
}

pip install agentx-security-sdk   # this ships the agentx-mcp command

Now every tool call the agent makes is checked by a deterministic floor first. A DROP TABLE, an unscoped DELETE, a secret-store read, an SSRF to 169.254.169.254, an rm -rf: all blocked before they reach the server. No API key, nothing leaves your machine, no LLM in the hot path for the block. It works with any MCP-speaking stack, because it screens the protocol, not your code.

That is the part you can verify in two minutes without trusting me.

A block that kills the run is still a broken agent

Most "is it safe" answers stop here: the dangerous call is blocked, the tool returns an error, and your agent gives up. A hard 403 in the middle of an autonomous run is its own kind of failure. The task does not get done. You just traded one broken outcome for another.

So agentx-mcp coaches the agent instead of killing it

When the shield blocks a tools/call, agentx-mcp does not return a dead error. It returns a coaching tool error that names what was unsafe and points at a safe path. Your agent reads it on its next turn, revises, and tries a safe version. The run keeps going.

Here is the loop, end to end, on a real MCP server:

The agent's task is "report the user count." Its query hides an injection: SELECT name FROM users; DROP TABLE users;
agentx-mcp blocks it at the proxy. The call never reaches the database. The agent gets back a coaching error: blocked, mass destructive intent, revise to a safe read.
The agent revises to SELECT COUNT(*) FROM users.
That runs. Three users. The table is intact. The task is done.
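
For a sense of what a coaching error could look like on the wire, here is a minimal sketch. MCP lets a tool result carry `isError: true` with text content, which keeps the message visible to the model instead of failing the request at the protocol level. The function name and the exact wording of the message are assumptions, not agentx-mcp's actual output.

```python
import json


def coaching_error(request_id, reason: str, safe_path: str) -> str:
    """Build a JSON-RPC response that reports a blocked call as a tool error."""
    result = {
        "content": [{
            "type": "text",
            # Hypothetical wording: name what was unsafe, then point at a safe path.
            "text": f"Blocked: {reason}. Revise: {safe_path}",
        }],
        # Tool-level error: the model reads the text on its next turn and can retry.
        "isError": True,
    }
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "result": result})
```

Returning this inside `result` and not as a JSON-RPC `error` is the design choice that matters: the agent gets something to act on, not just a failure.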

The catch is table stakes. The recovery is the point: your agent finishes the job instead of dying on the block.

This recovery is keyless and in-band. The agent doing the self-correcting is your agent, your MCP client's own model, reading the coaching. There is no extra key and no gateway in this loop. (A richer, gateway-coached version is on the roadmap, but the keyless coaching above is what ships today.)

What it catches today, and how

The floor is deterministic, so the block is a rule, not a vibe:

destructive SQL: DROP TABLE, TRUNCATE, unscoped DELETE
secret and API-key bulk reads
SSRF and cloud-metadata fetches (169.254.169.254)
shell and filesystem teardown: rm -rf, curl | sh, path traversal
runaway tool-call loops

No model inference for the floor, which is why it runs with no key and adds negligible latency. It is the blatant-catastrophic floor on purpose: the things you never want an agent to do, blocked deterministically, every time.
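
"A rule, not a vibe" can be sketched as an ordered list of patterns. These regexes are illustrative only and are not the rules AgentX ships; the rule names and `first_violation` are hypothetical, and a production floor would need structural SQL and shell parsing that plain patterns cannot provide.

```python
import re

# Ordered (name, pattern) pairs; the first match wins.
RULES = [
    ("destructive_sql", re.compile(r"\b(DROP\s+TABLE|TRUNCATE)\b", re.I)),
    # DELETE FROM <table> followed directly by ';' or end of input, so no WHERE.
    ("unscoped_delete", re.compile(r"\bDELETE\s+FROM\s+\w+\s*(;|$)", re.I)),
    ("cloud_metadata", re.compile(r"169\.254\.169\.254")),
    # rm with a flag group containing both r and f, in either order.
    ("fs_teardown", re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+", re.I)),
    ("pipe_to_shell", re.compile(r"curl[^|]*\|\s*(ba)?sh\b")),
]


def first_violation(argument: str):
    """Return the name of the first rule the argument trips, or None."""
    for name, pattern in RULES:
        if pattern.search(argument):
            return name
    return None
```

The same input always gives the same verdict, which is what makes a published claim about it testable.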

Try it, and tell me what it caught

I am looking for people running MCP servers against something real (a database, a filesystem, cloud, internal APIs) to wrap one and tell me two things:

What did it catch that would have bitten you?
What dangerous tool call did it miss? Try to stump it.
Watch the catch-and-recover live, and try it: https://bit.ly/agentfirewall
Tell me what broke or what it missed: https://discord.gg/PmWRTtaSx2

If your agent never touches anything irreversible, move along. If it does, wrapping one MCP server is one line, and DROP TABLE is a bad way to learn this the hard way.

I let my AI agent provision cloud infra. Then I made sure it couldn't go bankrupt doing it.

Vasu Dalal — Fri, 26 Jun 2026 21:37:06 +0000

A few days back I wrote about giving an autonomous agent database access and building a firewall so it couldn't DROP TABLE prod. Same lesson, new surface: this time the agent had cloud credentials.

The failure mode isn't a destructive command here. It's spend. An agent pointed at a networking task can scan a whole range looking for hosts, then spin up a fleet of instances to do it faster. Every individual call is "authorized": your IAM role said yes. The bill is what eventually says no.

## Two shapes, two right answers

The interesting part is that these are not the same kind of problem, so they don't get the same verdict.

1. The scan is never legitimate as an agent tool call. An nmap -sS -p- 10.0.0.0/16 or a masscan across a network is reconnaissance and abusive egress. There's no benign version of an agent sweeping a network at scale, so it gets hard-blocked, deterministically, before the call runs. (A scan of your own localhost is a dev check, so that's exempt.)

2. The provisioning might be totally fine. Spinning up 50 instances could be a real scale-out, or a runaway loop burning money. You can't tell from the action alone, only from the consequence. So instead of blocking it, AgentX pauses it for a human: a 202, "held for approval," routed to whoever owns the budget. Block the thing that's never okay, escalate the thing that's sometimes okay. Gate on consequence, not identity.

Both checks are zero-LLM. No model in the hot path means no latency tax and no model for an attacker to talk out of the block. A runaway fleet should be caught by a rule, not a vibe.
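
The two verdicts can be sketched as two small functions. This is a simplification under stated assumptions, not the gateway's logic: the names are hypothetical, and a fixed `approval_threshold` on instance count stands in for the real question of consequence, which would more likely be cost or budget.

```python
import ipaddress
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    HOLD = "held for approval"  # surfaced to the caller as a 202
    BLOCK = "block"


def is_local_target(target: str) -> bool:
    """True for localhost or a loopback address or range."""
    if target == "localhost":
        return True
    try:
        return ipaddress.ip_network(target, strict=False).is_loopback
    except ValueError:
        return False  # hostnames we cannot classify are treated as non-local


def verdict_for_scan(target: str) -> Verdict:
    # Never legitimate as an agent tool call, except against your own machine.
    return Verdict.ALLOW if is_local_target(target) else Verdict.BLOCK


def verdict_for_provisioning(instance_count: int, approval_threshold: int = 10) -> Verdict:
    # Sometimes fine, sometimes a runaway loop: pause it for a human, do not block.
    return Verdict.HOLD if instance_count > approval_threshold else Verdict.ALLOW
```

Note that neither function asks who is calling. Both gate on what the call would do.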

## The bigger thing this closes

We keep a catalog of real, documented agent failures and triage each one: is it something an action firewall can deterministically catch, or is it someone else's category (output hallucination, content safety, model internals)? We only build for the coverable ones, and we flag the rest honestly instead of faking a signature.

With this release, the coverable list is done. Every failure shape an action firewall can actually own now has a deterministic block or a human-in-the-loop escalation behind it. The honesty about what we don't cover is the point: it's how you know the coverage claims are real.

## Verify it in 2 minutes

The network checks above run in the gateway, but the part you can prove on your own machine with no key and no account is the deterministic floor:

pip install agentx-security-sdk

from agentx_sdk import agentx_protect, is_block

@agentx_protect(agent_id="demo")
def run_sql(query: str, db_session=None):
    print("EXECUTED (DANGER):", query)   # never reached
    return {"ok": True}

result = run_sql(query="Please clean up: DROP TABLE users;")
print("BLOCKED:", is_block(result))       # -> True, offline, no key

One decorator. The catastrophic call is intercepted before your function body runs.

## Why I'm posting

Same ask as last time: I want a handful of people running real Python agents against live systems, a DB, cloud, files, money, ideally unattended, to point this at their stack and tell me where it's wrong. What would have bitten you? What shape is it still missing?

Try it live (keyless): https://agentx-core.com/?utm_source=devto&utm_medium=article
Community / tell me what broke: https://discord.gg/PmWRTtaSx2
Or just reply here.

If your agent never touches anything irreversible or expensive, say pass. If it does, the repro is two minutes, and a runaway cloud bill is a bad way to find out the hard way.

I gave my AI agent database access. Then I built a firewall so it couldn't wipe prod.

Vasu Dalal — Wed, 24 Jun 2026 18:46:26 +0000

A few months ago I gave an autonomous agent write access to a real database. It was a LangChain-style loop (plan, call a tool, observe, repeat), and one of the tools ran SQL.

It worked great in the demo. Then I watched it, during a "clean up the test rows" task, generate this:


DROP TABLE users;

It didn't run (staging, and I was watching). But the lesson landed: the LLM doesn't know the difference between a destructive command and a safe one until it's already calling the tool. And by then your code is one cursor.execute() away from an incident.

**"AI firewalls" guard the wrong side**

When I went looking for protection, almost everything in the "LLM security" space guards the inbound side — prompt injection, jailbreaks, PII in the input. Useful, but it's the wrong end for an autonomous agent. My problem wasn't a malicious prompt. It was a well-meaning agent emitting a catastrophic action.

What I actually wanted was a firewall on the outbound side, on the tool calls themselves:

- destructive SQL (DROP TABLE, unscoped DELETE)
- writes to prod / ALTER ... DROP COLUMN
- SSRF and cloud-metadata fetches (169.254.169.254)
- bulk secret / API-key reads
- runaway retry loops draining your token budget

And critically: I wanted the catch to be deterministic. If your safety layer is itself an LLM call, it's slower, it costs money, and it can be talked out of blocking. A DROP TABLE should be blocked by a rule, not a vibe.

**The 2-minute version you can run right now**

I ended up building this and putting the SDK on PyPI. Here's the whole thing; it blocks a live DROP TABLE offline, with no API key, using built-in policy seeds:

pip install agentx-security-sdk

from agentx_sdk import agentx_protect, is_block

@agentx_protect(agent_id="demo")
def run_sql(query: str, db_session=None):
    print("EXECUTED (DANGER):", query)   # never reached
    return {"ok": True}

result = run_sql(query="Please clean up: DROP TABLE users;")
print("BLOCKED:", is_block(result))       # -> True, offline, no key

One decorator on your tool function. The destructive call gets intercepted before your
function body runs, and you get a block result back instead. No gateway, no account,
no LLM in the hot path: it runs entirely on your machine.

Note: the package is agentx-security-sdk (import path agentx_sdk), version ≥ 0.3.11.

**How the block works**

The decorator wraps your tool call and runs the arguments through a layer of deterministic
checks before execution, including pattern and structural rules
(destructive SQL, prod writes, SSRF targets, secret-store reads, no-progress loops). If a rule trips, the call returns a block instead of executing. No model inference for the floor, which is why it works with no key and adds negligible latency.
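
The general mechanism, a decorator that inspects arguments and returns a block value without ever calling the wrapped function, looks roughly like this in plain Python. This is a generic sketch, not the internals of agentx_protect: `guard`, `Blocked`, and the single DROP TABLE pattern are hypothetical stand-ins.

```python
import functools
import re
from dataclasses import dataclass

# One illustrative rule; a real floor would carry many more.
DANGEROUS = re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)


@dataclass(frozen=True)
class Blocked:
    reason: str


def guard(func):
    """Check string arguments before the call; return Blocked instead of executing."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for value in list(args) + list(kwargs.values()):
            if isinstance(value, str) and DANGEROUS.search(value):
                return Blocked(reason="destructive SQL")  # body is never called
        return func(*args, **kwargs)
    return wrapper


calls = []  # records every query that actually reached the tool body


@guard
def run_sql(query: str):
    calls.append(query)
    return {"ok": True}
```

Calling `run_sql(query="Please clean up: DROP TABLE users;")` returns a `Blocked` value and leaves `calls` empty, while a plain `SELECT` passes through to the body.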

There's more above that floor: it can escalate ambiguous-but-dangerous actions for a human-in-the-loop decision, circuit-break a runaway loop, and reframe and retry the run instead of just dying. But the floor is the part you can verify in 2 minutes without trusting me, and that is the whole point of leading with it.
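
One simple way to picture circuit-breaking a runaway loop is to count identical tool calls within a run and trip past a limit. This is a sketch, not how the SDK detects no-progress loops: `LoopBreaker`, `max_repeats`, and the idea that "same tool, same arguments" means "no progress" are all assumptions.

```python
import json
from collections import Counter


class LoopBreaker:
    """Trip after the same tool call has been seen too many times in one run."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def allow(self, tool_name: str, arguments: dict) -> bool:
        # sort_keys makes the key stable regardless of argument order.
        key = tool_name + ":" + json.dumps(arguments, sort_keys=True)
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats
```

A breaker like this is per run: create one when the task starts and discard it when the task ends, so a legitimate repeat in a later run is not penalized.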

**Why I'm posting this**
I'm looking for a handful of people running real Python agents (something that touches a
live DB, cloud, files, or money, ideally unattended) to point this at their stack and
tell me where it's wrong. Not a launch, not a sales pitch. I want to know:

- Does it catch the thing that would've bitten you?
- What dangerous action shape is it missing?

If you've ever thought "what happens when this agent does something irreversible at 2am," I'd genuinely like your take.

- **[Try it live (keyless quickstart)](https://agentx-core.com/?utm_source=devto-firewall&utm_medium=article)**
- Community / tell me what broke: https://discord.gg/PmWR
- Or just reply here. Bonus points for the war story that made you click.

If your agent never touches anything irreversible, ignore me. If it does, the repro's two minutes, and DROP TABLE is a bad way to find out the hard way.