I ran 765 controlled experiments to prove AI agents are leaking your data — and built the tool that catches it

We measured it. Rigorously.

30 injection payloads across 6 categories — direct injection, encoded/obfuscated
(Base64, ROT13, hex, Unicode), social engineering (CEO fraud, IT impersonation,
legal threats), multi-turn (persistent rules, delayed triggers, context poisoning),
multilingual (Spanish, Mandarin, Arabic, Russian), and advanced techniques.

Tested against three major LLM providers. N=285 total runs with Wilson 95%
confidence intervals:

Provider	Attack Success	95% CI
GPT-4o-mini	93.3%	[86.2%, 96.9%]
Gemini 2.5 Flash	92.2%	[84.8%, 96.2%]
Claude Sonnet	13.3%	[7.8%, 21.9%]

Two of the three most widely deployed AI providers are fully exploitable today.

Claude resists — but its 7.8% CI floor is not zero, and not acceptable for
enterprise PII. Its resistance reflects training against known payload patterns,
not elimination of the underlying architectural condition.

We also built the defense. And proved it works.

Cerberus is a runtime security platform that wraps your tool executors —
one function call — and detects this attack pattern in real time.


typescript
import { guard } from '@cerberus-ai/core';

const { executors: secured } = guard(
  { readDatabase, fetchUrl, sendEmail },
  {
    alertMode: 'interrupt',
    threshold: 3,
    trustOverrides: [
      { toolName: 'readDatabase', trustLevel: 'trusted' },
      { toolName: 'fetchUrl', trustLevel: 'untrusted' },
    ],
  },
  ['sendEmail'] // outbound tools Cerberus monitors
);

// Use secured.readDatabase(), secured.fetchUrl(), secured.sendEmail()
// Cerberus intercepts transparently. No framework changes required.

We ran the same 30-payload suite a second time with Cerberus in observe-only
mode (N=480 runs):

0.0% false positive rate [0.0%, 11.4%] — zero false alerts on 30 clean sessions
100% accuracy on L1 and L2 — every privileged data read and untrusted content fetch tagged, deterministically
L3 catches every confirmed exfiltration — fires when PII actually flows to an unauthorized destination, not before
No prior prompt injection study has paired attack measurement with defensive
validation in the same experimental framework. We didn't want to just claim
detection — we wanted to prove it with the same rigor we used to prove the attack.

What's inside
Four detection layers sharing one correlation engine:

L1 — Tags every tool call by data trust level at access time. Detects secrets (AWS keys, JWTs, API tokens) in tool results.
L2 — Labels context tokens by origin before the LLM call. Detects injection patterns, encoding/obfuscation, and MCP tool poisoning.
L3 — Catches PII flowing to unauthorized destinations. Classifies suspicious domains (disposable emails, webhook services, IP addresses).
L4 — Tracks taint propagation through persistent memory across sessions. The first deployable defense against the MINJA (NeurIPS 2025) memory contamination attack class.
A correlation engine builds a 4-bit risk vector per turn, scores it 0-4, and
interrupts tool calls that cross the threshold.

Get it

npm install @cerberus-ai/core
MIT licensed. 718 tests at 98%+ coverage. Works with LangChain, Vercel AI SDK,
and OpenAI Agents SDK out of the box.




  
    
      
      
        Odingard
       / 
        cerberus
      
    
    
      Agentic AI runtime security — detects and interrupts prompt injection, data exfiltration, and memory contamination attacks in real-time.
    
  
  
    



Cerberus


Runtime Security For AI Agent Tool Execution








Embeddable runtime enforcement for AI agents. Cerberus correlates privileged data access, untrusted content ingestion, and outbound behavior at the tool-call level, then interrupts guarded outbound actions before they execute.

Docs · npm · PyPI · Enterprise








Note
Cerberus is the agentic AI security layer of Six Sense Enterprise Services. The core detection library (@cerberus-ai/core) is MIT licensed and free. The Enterprise edition adds a self-hosted Gateway, Grafana monitoring stack, and production deployment tooling for teams running AI agents in production.






Table of Contents



🎯 What is Cerberus?
🎬 In Action
✨ What It Detects
📦 Editions
🚀 Quickstart
📊 Empirical Results
🏗️ Architecture
OWASP Alignment
🔌 Framework Integrations
⚡ Performance
🗺️ Roadmap
⚠️ Honest Limitations
📜 License






🎯 What is Cerberus?


Every AI agent that can (1) access private data, (2) read external content, and (3) send data outbound…





  View on GitHub






Full methodology, per-payload results, and execution traces are in

docs/research-results.md in the repo. All numbers are reproducible.