Every AI agent that can read private data, fetch external content, and send
outbound messages is one injected instruction away from exfiltrating everything
it knows.
This isn't theoretical. Here's the attack in three tool calls:
Turn 0: readPrivateData() → 5 customer records loaded (SSNs, emails, phones)
fetchExternalContent(url) → attacker's webpage, payload embedded in HTML
Turn 1: sendOutboundReport() → all PII sent to attacker's address
Turn 2: "Report sent successfully!"
Total time: ~12 seconds. Cost: $0.001. No exploits. No credentials. Just a
fetched webpage and a compliant model.
We measured it. Rigorously.
30 injection payloads across 6 categories — direct injection, encoded/obfuscated
(Base64, ROT13, hex, Unicode), social engineering (CEO fraud, IT impersonation,
legal threats), multi-turn (persistent rules, delayed triggers, context poisoning),
multilingual (Spanish, Mandarin, Arabic, Russian), and advanced techniques.
Tested against three major LLM providers. N=285 total runs with Wilson 95%
confidence intervals:
| Provider | Attack Success | 95% CI |
|---|---|---|
| GPT-4o-mini | 93.3% | [86.2%, 96.9%] |
| Gemini 2.5 Flash | 92.2% | [84.8%, 96.2%] |
| Claude Sonnet | 13.3% | [7.8%, 21.9%] |
Two of the three most widely deployed AI providers are fully exploitable today.
Claude resists — but its 7.8% CI floor is not zero, and not acceptable for
enterprise PII. Its resistance reflects training against known payload patterns,
not elimination of the underlying architectural condition.
The architectural condition is what matters
I call it the Lethal Trifecta. Any agent that can:
- Access privileged data
- Process untrusted external content
- Take outbound actions
...is exploitable. Not because of a bug. Because of what makes it useful.
We also built the defense. And proved it works.
Cerberus is a runtime security platform that wraps your tool executors —
one function call — and detects this attack pattern in real time.
typescript
import { guard } from '@cerberus-ai/core';
const { executors: secured } = guard(
{ readDatabase, fetchUrl, sendEmail },
{
alertMode: 'interrupt',
threshold: 3,
trustOverrides: [
{ toolName: 'readDatabase', trustLevel: 'trusted' },
{ toolName: 'fetchUrl', trustLevel: 'untrusted' },
],
},
['sendEmail'] // outbound tools Cerberus monitors
);
// Use secured.readDatabase(), secured.fetchUrl(), secured.sendEmail()
// Cerberus intercepts transparently. No framework changes required.
We ran the same 30-payload suite a second time with Cerberus in observe-only
mode (N=480 runs):
0.0% false positive rate [0.0%, 11.4%] — zero false alerts on 30 clean sessions
100% accuracy on L1 and L2 — every privileged data read and untrusted content fetch tagged, deterministically
L3 catches every confirmed exfiltration — fires when PII actually flows to an unauthorized destination, not before
No prior prompt injection study has paired attack measurement with defensive
validation in the same experimental framework. We didn't want to just claim
detection — we wanted to prove it with the same rigor we used to prove the attack.
What's inside
Four detection layers sharing one correlation engine:
L1 — Tags every tool call by data trust level at access time. Detects secrets (AWS keys, JWTs, API tokens) in tool results.
L2 — Labels context tokens by origin before the LLM call. Detects injection patterns, encoding/obfuscation, and MCP tool poisoning.
L3 — Catches PII flowing to unauthorized destinations. Classifies suspicious domains (disposable emails, webhook services, IP addresses).
L4 — Tracks taint propagation through persistent memory across sessions. The first deployable defense against the MINJA (NeurIPS 2025) memory contamination attack class.
A correlation engine builds a 4-bit risk vector per turn, scores it 0-4, and
interrupts tool calls that cross the threshold.
Get it
npm install @cerberus-ai/core
MIT licensed. 718 tests at 98%+ coverage. Works with LangChain, Vercel AI SDK,
and OpenAI Agents SDK out of the box.
Cerberus
Agentic AI Runtime Security Platform
Cerberus detects, correlates, and interrupts the Lethal Trifecta attack pattern across all agentic AI systems — in real time, at the tool-call level, before data leaves your perimeter.
The Problem: The Lethal Trifecta
Every AI agent that can (1) access private data, (2) process external content, and (3) take outbound actions is vulnerable to the same fundamental attack pattern:
1. PRIVILEGED ACCESS — Agent reads sensitive data (CRM, PII, internal docs)
2. INJECTION — Untrusted external content manipulates the agent's behavior
3. EXFILTRATION — Agent sends private data to an attacker-controlled endpoint
This is not theoretical. It is reproducible today with free-tier API access and three function calls.
Layer 4 — Memory Contamination extends this across sessions: an attacker injects malicious content into persistent memory in Session 1, and the payload triggers exfiltration in Session 3. No existing tool detects this.
Architecture
Cerberus is…
Full methodology, per-payload results, and execution traces are in
docs/research-results.md in the repo. All numbers are reproducible.
Top comments (0)