mkmkkkkk
AI Agents Lost $600K+ to Prompt Injection — Attack Taxonomy & Code-Level Defenses

The Problem

AI agents are spending real money. When they get prompt-injected, it's not just data leakage — it's direct financial loss.

Here are documented incidents:

| Attack | Loss / Impact | Vector |
| --- | --- | --- |
| Freysa AI | $47K | Function redefinition |
| AIXBT | $106K | Control plane compromise |
| Lobstar Wilde | $441K | State amnesia |
| EchoLeak | CVSS 9.3 | Zero-click document poisoning |
| MCPTox | 72.8% success rate | MCP tool poisoning |

The Pattern

Every attack follows the same structure: the agent cannot distinguish trusted instructions from injected ones. The attacker doesn't break the code — they break the agent's judgment. And since the agent holds payment credentials, broken judgment means broken wallets.

Why Prompts Can't Fix This

Telling an LLM "don't send money to attackers" is like telling a human "don't get phished." The whole point of injection is that the agent doesn't know it's being attacked.

The fix has to be at the code level — deterministic policy engines that run outside the LLM context:

```typescript
// This code runs BEFORE any payment executes.
// No prompt injection can override TypeScript.
const result = securityGate.validatePayment({
  token: agentToken,        // Auth: agent can't modify policies
  manifestId: manifest.id,  // Manifest: pre-approved recipients only
  recipient: "attacker.com",
  amount: 500,
  currency: "USDC"
});
// result.allowed === false
// Reason: recipient not in manifest allowlist
```

The key insight: defense doesn't depend on the LLM's judgment. It's like an ATM — no matter what a scammer tells you, the machine won't dispense without the right PIN.

Defense Layers

  1. Session Manifest — Immutable payment envelope created by user BEFORE agent touches untrusted content. Defines allowed recipients, max amounts, currency, TTL.
  2. Auth Scope Separation — Agents get payment:execute tokens. Only admins get policy:write. Agent can't modify its own rules.
  3. Rate Limiter + Auto-Freeze — Frequency anomaly → auto-freeze → requires manual admin unfreeze.
  4. Behavioral Anomaly Detection — Welford's algorithm maintains running mean and variance per feature. A single strong deviation triggers a block (MAX score across features, not the average).
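
The anomaly layer can be sketched like this — a minimal TypeScript version of Welford's online algorithm with MAX-score triggering. Class and function names here (`RunningStats`, `isAnomalous`) are illustrative, not the actual paysentry API:

```typescript
// Welford's online algorithm: numerically stable running mean/variance,
// updated one observation at a time (no history stored).
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations

  update(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  // How many standard deviations x sits from the running mean.
  zScore(x: number): number {
    if (this.n < 2) return 0; // not enough data to judge
    const std = Math.sqrt(this.m2 / (this.n - 1));
    return std === 0 ? 0 : Math.abs(x - this.mean) / std;
  }
}

// MAX score, not average: one strongly anomalous feature is enough
// to trip the detector, even if every other feature looks normal.
function isAnomalous(
  stats: Record<string, RunningStats>,
  features: Record<string, number>,
  threshold = 3
): boolean {
  const maxScore = Math.max(
    ...Object.keys(features).map((k) => stats[k].zScore(features[k]))
  );
  return maxScore > threshold;
}
```

Using MAX rather than the mean matters: averaging scores lets an attacker hide one wildly anomalous amount behind several perfectly ordinary-looking features.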

All four layers run on every payment, and all must pass. The pipeline lives in code, not in the prompt — unskippable.
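
The "all must pass" pipeline reduces to a short deterministic loop. This is a sketch under assumed names — `PaymentRequest`, `Check`, and the example `manifestCheck` (with a hypothetical `shop.example.com` allowlist) are illustrative, not the real paysentry types:

```typescript
interface PaymentRequest {
  recipient: string;
  amount: number;
  currency: string;
}

type Check = (req: PaymentRequest) => { allowed: boolean; reason?: string };

// Sequential and unskippable: the first failing layer vetoes the payment.
// No LLM output is consulted anywhere in this loop.
function validatePayment(
  req: PaymentRequest,
  checks: Check[]
): { allowed: boolean; reason?: string } {
  for (const check of checks) {
    const result = check(req);
    if (!result.allowed) return result;
  }
  return { allowed: true };
}

// Example of layer 1 (session manifest): allowlisted recipients and a cap.
const manifestCheck: Check = (req) =>
  ["shop.example.com"].includes(req.recipient) && req.amount <= 100
    ? { allowed: true }
    : { allowed: false, reason: "recipient or amount outside manifest" };
```

A call like `validatePayment({ recipient: "attacker.com", amount: 500, currency: "USDC" }, [manifestCheck, ...])` fails at the first layer, regardless of what the injected prompt convinced the model to attempt.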

Self-Attack Results

We built 18 attack scenarios (39 test cases) mapping to every real-world case above. Results: 39/39 pass. Every attack blocked.

Full writeup with bilingual (English/Chinese) analysis: mkyang.ai/blog/agent-payment-security.html

Open source (MIT): github.com/mkmkkkkk/paysentry


Built with TypeScript. 12 packages, 217 tests, zero dependencies on LLM judgment for security decisions.
