We Let an AI Attack Our Security Pipeline. Here's What 412 Attacks Taught Us.
Most security products ship a scanner, write some tests, and call it done. We tried something different: we told an AI to break our entire security pipeline as creatively as possible, then told it to fix every gap it found.
We call it the Ralph loop — named after Geoffrey Huntley's Ralph Wiggum technique, which uses an AI agent in an iterative loop where each cycle starts with fresh context. We adapted his approach for security hardening: instead of general development tasks, our loop exclusively attacks and defends the security pipeline — DLP scanning, content safety analysis, and input normalisation.
It runs autonomously — no human in the loop — iterating through attack, test, defend, verify, commit. Each cycle takes a few minutes. When it exhausts one category, it moves to the next.
What the Pipeline Actually Does
When an MCP tool call passes through our hub, it goes through two distinct security layers before reaching the upstream server:
Content Safety Pipeline — five specialised scanners running in sequence:
- Prompt injection detection — catches attempts to override system instructions, across six languages, with leetspeak and unicode homoglyph normalisation
- Malicious code scanning — SQL/NoSQL/GraphQL/LDAP injection, shell commands, deserialisation attacks, encoded payloads
- Dangerous command detection — git operations, filesystem manipulation, database admin commands, container escape attempts
- File validation — base64-decoded MIME type verification, PDF JavaScript detection, SVG script injection, ZIP integrity, image polyglot headers
- URL scanning — Redis-backed blocklist, IDN homograph detection, SSRF via private IP ranges, fragment authority smuggling
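The SSRF check in that last scanner can be illustrated with Python's stdlib; this is a minimal sketch under our own assumptions (the function name and the fail-closed choice are ours, not the hub's actual API), and it only covers literal IPs, not DNS resolution or rebinding:

```python
import ipaddress
from urllib.parse import urlparse

def is_ssrf_risk(url: str) -> bool:
    """Flag URLs whose host is a literal private/loopback/link-local IP.
    Hostname targets would still need resolve-and-check plus
    DNS-rebinding defenses on top of this."""
    host = urlparse(url).hostname
    if host is None:
        return True  # unparseable host: fail closed
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not a literal IP; handled by a later stage
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
```

Link-local coverage matters because cloud metadata endpoints (169.254.169.254) are a favourite SSRF target.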
DLP Pipeline — Presidio-based PII detection plus regex-based secret scanning across 26 detection groups, with per-group modes (block, filter/redact, or warn).
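A minimal sketch of how per-group modes might be wired, with illustrative group names, patterns, and an `apply_dlp` helper that are not the pipeline's real API (the real pipeline has 26 groups and uses Presidio alongside regex):

```python
import re

# Illustrative detection groups; each carries its own enforcement mode.
DETECTION_GROUPS = {
    "aws_access_key": {"pattern": re.compile(r"AKIA[0-9A-Z]{16}"), "mode": "block"},
    "email":          {"pattern": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "mode": "filter"},
    "phone":          {"pattern": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "mode": "warn"},
}

def apply_dlp(text: str) -> tuple[str, list[str]]:
    """block raises, filter redacts in place, warn records and passes."""
    warnings = []
    for name, group in DETECTION_GROUPS.items():
        if group["pattern"].search(text):
            if group["mode"] == "block":
                raise ValueError(f"blocked by DLP group: {name}")
            elif group["mode"] == "filter":
                text = group["pattern"].sub("[REDACTED]", text)
            else:  # warn
                warnings.append(name)
    return text, warnings
```

The useful property of per-group modes is that a noisy detector can be demoted to warn without weakening the hard blocks.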
Every scanner receives input that has already passed through a 30-stage deobfuscation normaliser — the real first line of defense. The normaliser strips zero-width characters, decodes nested encodings (iterative base64/32/85/hex up to 10 layers deep), resolves unicode confusables, expands SQL CHAR expressions, resolves bash variables, folds string concatenation, and handles everything from MIME quoted-printable to UTF-7 chunks to Braille-to-base64 mappings.
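A toy version of three of those stages (zero-width stripping, NFKC confusable folding, then iterative base64 peeling) shows the shape of the approach; the real normaliser runs 30+ stages, so treat this strictly as a sketch:

```python
import base64
import binascii
import unicodedata

# Map ZWSP/ZWNJ/ZWJ/word-joiner/BOM to None so translate() deletes them.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF])

def normalise(text: str, max_layers: int = 10) -> str:
    text = text.translate(ZERO_WIDTH)           # strip zero-width characters
    text = unicodedata.normalize("NFKC", text)  # fold math alphanumerics etc.
    for _ in range(max_layers):                 # peel nested base64 layers
        candidate = text.strip()
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            break  # not valid base64 (or not text): stop peeling
        text = decoded
    return text
```

Ordering matters: the zero-width strip must run before the decode loop, or a single invisible character inside a base64 blob defeats the whole chain.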
The pipeline processes both directions — scanning requests going to upstream MCP servers and responses coming back — with configurable per-scanner direction and enforcement mode.
How the Ralph Loop Works
The red team role is played by a rotating cast of models available through our Ollama infrastructure. Each brings a different perspective on evasion. The blue team (defense) is handled by Gemini, which builds the fix, runs the tests, and commits.
The two sides don't share context. Each iteration starts fresh.
The loop now runs in three phases:
Phase 0: CVE Backlog Gate
Before any creative work, the loop must defend against every known MCP vulnerability. We maintain a regression suite of real CVE and OWASP vectors — currently 84, sourced from the Vulnerable MCP Project, the NVD, GitHub Advisory Database, and OSV.dev. These cover command injection (CVE-2025-6514, CVSS 9.6), path traversal chains in official MCP servers (CVE-2025-68143/68144/68145), DNS rebinding in the TypeScript SDK (CVE-2025-66414), SSRF in Atlassian MCP (CVE-2026-27826), and prompt injection via tool response poisoning.
Every CVE vector must pass — meaning our pipeline blocks the attack payload — before the loop proceeds. All 84 currently pass.
We ingest from four sources hourly; the suite currently holds 84 vectors (43 CVEs plus 41 OWASP/GHSA patterns) and is growing. The OWASP MCP Top 10 defines the taxonomy we map each vector to.
Phase 1: Validate Defenses and Fix False Positives
Security that blocks legitimate content is broken security. This phase runs the full pattern suite — 328 adversarial vectors plus a dedicated false positive corpus — and verifies that defenses haven't regressed while legitimate content still passes through.
We've documented 24 confirmed false positive patterns so far, including UUIDs being redacted as PII identifiers, code-context detection triggering on architecture documentation, and technical terms like "password hashing" being flagged as credential exposure.
Phase 2: Get Imaginative
This is the original Ralph loop: the red team invents novel evasion techniques, and the blue team builds defenses.
Red team: dream up an evasion technique that bypasses the current pipeline. Not a known technique from a blog post — something novel, something that exploits the specific way our normaliser, regex patterns, or scanner logic works.
Blue team: build the defense. Modify the normaliser, add a regex, update the scanner — whatever it takes to block the evasion. Then run the test suite to make sure nothing else broke.
Each iteration follows a strict sequence:
- Invent one evasion vector
- Write a failing test case
- Confirm the pipeline does NOT catch it (the test fails — the evasion works)
- Implement the defense
- Confirm the pipeline NOW catches it (the test passes)
- Run the false positive suite to check for regressions
- Commit and push
If the test doesn't fail in step 3, the vector wasn't actually an evasion — skip it. If the false positive suite breaks in step 6, the defense is too aggressive — refine it.
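The sequence above can be sketched as a driver that takes the test runner and the model-driven defense step as callables; the structure and status strings are illustrative, not our actual harness:

```python
from typing import Callable

def iterate(test_passes: Callable[[], bool],
            build_defense: Callable[[], None],
            fp_suite_passes: Callable[[], bool]) -> str:
    """One Ralph-loop iteration for a single candidate evasion vector."""
    # Step 3: the new test must FAIL first, proving the evasion works.
    if test_passes():
        return "skip: not a real evasion"
    build_defense()            # step 4: model modifies normaliser/patterns
    if not test_passes():      # step 5: the defense must now block it
        return "retry: defense incomplete"
    if not fp_suite_passes():  # step 6: no collateral over-blocking
        return "refine: defense too aggressive"
    return "commit"            # step 7: commit and push
```

The skip branch is the important one: it filters out vectors the pipeline already catches before any defense work is spent on them.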
No human reviews individual iterations. We review the batch.
What 412 Attacks Look Like
328 adversarial vectors across 46 pattern files plus 84 CVE/OWASP regression vectors, organised into eight categories:
Unicode and encoding evasion (165 patterns)
Unicode is a minefield for DLP. There are dozens of ways to make text look normal to a human while remaining unrecognisable to a scanner:
- Zero-width joiners inserted between characters of an API key — the key looks normal but regex doesn't match
- Mathematical alphanumeric symbols (U+1D400 block) — visually identical to ASCII but different codepoints
- Right-to-left override characters — reverses the apparent order of digits in a credit card number
- Combining marks layered on top of base characters — the scanner sees the combining mark, not the data underneath
- Braille-to-base64 mapping — Braille characters that decode to base64, which decodes to a secret
- Encoding chains — base64 inside URL encoding inside base85, requiring multi-layer iterative decoding
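To make the first bullet concrete: here is a hypothetical AWS-style key pattern defeated by zero-width joiners and recovered by stripping them (the key, regex, and helper are illustrative, not the pipeline's actual patterns):

```python
import re

AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")

def strip_zero_width(text: str) -> str:
    """Remove the zero-width codepoints most often used to split tokens."""
    return text.translate(dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF]))

# A fake key with a zero-width joiner after every fourth character:
# visually identical to the plain key, but the regex can't match it.
evasion = "AKIA" + "\u200d".join(["EXAM", "PLEK", "EY12", "3456"])
```

The human-visible string and the plain key render identically; only the codepoint sequence differs, which is exactly the gap the normaliser closes.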
Content safety evasion (34 patterns)
The content safety scanners face a different challenge: code that looks harmless but does something dangerous.
- Indirect injection — malicious instructions hidden in image alt text, CSV headers, HTML meta tags, YAML deserialisation payloads, and Markdown footnotes
- Command evasion — shell commands built character-by-character via variable expansion, brace expansion, or heredoc construction
- Prompt injection — system prompt extraction attempts across six languages, with unicode homoglyph substitution and leetspeak variants
- Deserialisation — Python pickle, PyYAML unsafe load, and Node.js constructor prototype pollution
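A simplified fold for the variable-expansion trick above: this one-pass sketch resolves plain `VAR=value` assignments and substitutes `$VAR`/`${VAR}` references so a piecewise-assembled command becomes visible to downstream scanners. It is nothing like full bash semantics (no arrays, arithmetic, or command substitution):

```python
import re

# One assignment per line, optionally quoted.
ASSIGN = re.compile(r"^([A-Za-z_]\w*)=([\"']?)(.*?)\2\s*$", re.MULTILINE)

def fold_bash_vars(script: str) -> str:
    env = {m.group(1): m.group(3) for m in ASSIGN.finditer(script)}
    def substitute(m: re.Match) -> str:
        name = m.group(1) or m.group(2)
        return env.get(name, m.group(0))  # leave unknown vars untouched
    return re.sub(r"\$\{(\w+)\}|\$(\w+)", substitute, script)
```

After folding, a dangerous-command pattern that would never match `$A $B /tmp/x` matches the reconstructed `rm -rf /tmp/x` directly.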
Secret detection (48 patterns)
Covering every major cloud provider, SaaS platform, and infrastructure tool:
- AWS, GCP, Azure, Cloudflare credentials
- Modern SaaS API keys (Stripe, Twilio, SendGrid, Slack, Discord)
- Multiline secrets — PEM certificates, OpenSSH private keys, JWTs split across YAML fields, PostgreSQL connection strings broken across shell continuations
- VCS tokens, database connection strings, JWT secrets
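For the multiline case, a single DOTALL regex anchored on the PEM headers is often enough; this is a sketch, not our production pattern set:

```python
import re

# DOTALL lets the key body span newlines; the headers anchor the block.
PEM_PRIVATE_KEY = re.compile(
    r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    r".+?"
    r"-----END (?:RSA |EC |OPENSSH )?PRIVATE KEY-----",
    re.DOTALL,
)

def contains_private_key(text: str) -> bool:
    return PEM_PRIVATE_KEY.search(text) is not None
```

Anchoring on both headers keeps the pattern from firing on public material like certificates, which share the `-----BEGIN` framing.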
PII detection (38 patterns)
Across multiple jurisdictions and languages:
- Credit cards (Luhn-valid synthetic numbers across all major networks)
- SSNs, phone numbers, email addresses
- Australian TFNs, Indian Aadhaar numbers
- Multilingual PII — names, addresses, and identifiers in non-Latin scripts
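The Luhn check that gates card-number matches fits in a few lines; this is the standard algorithm, shown for completeness:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: separates real-looking card numbers from arbitrary
    digit runs, which cuts false positives on IDs and order numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shorter than any card network issues
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Running the checksum after normalisation also defeats the right-to-left-override trick above: once the digits are back in true order, the checksum is computed on what the attacker actually exfiltrated.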
False positive corpus (24 patterns)
Equally important: content that looks suspicious but isn't:
- Python code containing `exec()` in a legitimate testing context
- SQL DDL statements in configuration documentation
- JavaScript code with `eval()` in a build tool
- Natural language sentences that happen to contain number sequences resembling SSNs
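The corpus idea reduces to a table of known-good inputs plus a helper that reports anything the scanner wrongly blocks; `scan` here is a stand-in callable, not the pipeline's real entry point:

```python
# Legitimate content that scanners are tempted to block.
FALSE_POSITIVE_CORPUS = [
    "def test_sandbox():\n    exec('print(1)')  # controlled test fixture",
    "CREATE TABLE users (id INT PRIMARY KEY);  -- schema docs example",
    "Our build tool calls eval() on the generated config.",
    "Password hashing with bcrypt is covered in the security chapter.",
]

def check_false_positives(scan) -> list[str]:
    """Return corpus entries the scanner wrongly blocks.
    `scan` maps text -> True when the pipeline would block it."""
    return [text for text in FALSE_POSITIVE_CORPUS if scan(text)]
```

Any non-empty return fails the phase, which is what forces a too-aggressive defense back for refinement.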
Every MCP CVE. Patched Hourly.
We ingest MCP-related CVEs from four sources every hour: the Vulnerable MCP Project, the NVD, the GitHub Advisory Database, and OSV.dev. Each new CVE becomes a regression test. The loop doesn't move to creative work until every CVE is defended.
As of today, our pipeline blocks the attack payload from every one of these:
All 43 CVEs shown. Every entry is a real vulnerability in a real MCP server that our pipeline blocks; the full regression suite totals 84 vectors (43 CVEs plus OWASP and GHSA patterns).
The coverage spans command injection (CWE-78), path traversal (CWE-22), SSRF (CWE-918), DNS rebinding (CWE-350), code injection (CWE-94), argument injection (CWE-88), XSS (CWE-79), data exfiltration (CWE-200), and deserialization (CWE-502) — plus 41 attack patterns from the OWASP MCP Top 10 and GHSA advisories that don't have CVE numbers yet (tool poisoning, rug pulls, tool shadowing, registry hijacking, data exfiltration via messaging platforms).
Why This Matters Now
Adversaries are already using this same technique — automated AI loops — to break through security defenses. Palo Alto Unit 42 published research in March 2026 showing an automated fuzzer called AdvJudge-Zero that uses "stealthy input sequences" to bypass AI security judges. Their finding: "effective attacks can be entirely stealthy, using benign formatting symbols to reverse a block decision to allow." That's exactly the kind of unicode normalisation evasion our loop catches.
Between January and February 2026, security researchers filed over 30 CVEs targeting MCP servers, clients, and infrastructure — ranging from trivial path traversals to a CVSS 9.6 remote code execution flaw in a package downloaded nearly half a million times. The Vulnerable MCP Project now catalogues 50 documented vulnerabilities. The OWASP MCP Top 10 framework maps the attack categories. This is no longer theoretical.
A human pen-tester brings experience and intuition. They'll find the top 20 evasion techniques for any scanner category. But they won't sit for 412 iterations inventing increasingly obscure unicode normalisation bypasses, then circle back to validate that their fixes didn't break legitimate traffic, then defend 84 CVE and OWASP vectors before doing any more creative work.
The AI doesn't get bored. It doesn't decide "that's enough ZWJ variants." It keeps going until it genuinely can't find another vector that bypasses the current defenses. And because each defense is committed with a regression test, the pipeline can never regress on a vector it's already blocked.
What We Learned
The normaliser is the real defense. Regex patterns are important, but if you normalise the input first — strip zero-width characters, decode nested encodings, resolve unicode equivalents — then simple regex catches most things. The sophisticated evasion techniques almost all relied on the scanner seeing different characters than a human would. Our normaliser now runs 30+ deobfuscation stages before any scanner sees the content.
False positives are as dangerous as false negatives. A scanner that blocks legitimate architecture documentation (which ours did — to us, today, 90 times in a row) is a scanner people will disable. Phase 1 of the loop now explicitly validates that defenses don't regress on legitimate content.
CVEs are free test cases. Every published MCP vulnerability comes with an attack description and often a proof-of-concept payload. We now ingest these automatically and use them as regression tests. If a new CVE describes an attack our pipeline doesn't catch, Phase 0 blocks the creative loop until the defense is built.
The loop finds things you wouldn't think to test. Braille characters as a base64 encoding layer? HTTP Parameter Pollution to reassemble split secrets? YAML block scalar folded style to hide API keys? These aren't in any playbook we've read. They emerged from an AI systematically exploring the gap between "what our normaliser handles" and "what unicode makes possible."
Running It Yourself
The approach isn't specific to our pipeline. Any DLP or content scanner can benefit from this:
- Start with CVE regression — get known vulnerabilities into your test suite
- Add a false positive corpus — make sure legitimate content passes
- Write a prompt that tells an AI to invent one evasion for your specific scanner
- Have it write a test that proves the evasion works
- Have it build the defense
- Have it verify the defense works AND that false positives still pass
- Loop
The key is the test-first discipline. Every vector gets a regression test before the defense is built. The defense isn't "done" until the test passes. And the test stays in the suite forever.
We run the loop continuously against our latest pipeline. The pipeline gets harder to bypass with every iteration. The false positive corpus gets broader with every edge case we discover. The CVE suite grows with every advisory published.
The Ralph loop has produced 328 documented adversarial patterns with regression tests, defends against 84 CVE and OWASP vectors, and validates against 24 documented false positive patterns — as of March 2026. The complete test suite runs 1,485 test functions across the DLP and content safety pipeline. We run this process continuously to stay ahead of evasion techniques.
Originally published on mistaike.ai