We Let an AI Attack Our Security Pipeline. Here's What 412 Attacks Taught Us.
Most security products ship a scanner, write some tests, and call it done. We tried something different: we told an AI to break our entire security pipeline as creatively as possible, then told it to fix every gap it found.
We call it the Ralph loop — named after Geoffrey Huntley's Ralph Wiggum technique, which uses an AI agent in an iterative loop where each cycle starts with fresh context. We adapted his approach for security hardening: instead of general development tasks, our loop exclusively attacks and defends the security pipeline — DLP scanning, content safety analysis, and input normalisation.
It runs autonomously — no human in the loop — iterating through attack, test, defend, verify, commit. Each cycle takes a few minutes. When it exhausts one category, it moves to the next.
What the Pipeline Actually Does
When an MCP tool call passes through our hub, it goes through two distinct security layers before reaching the upstream server:
Content Safety Pipeline — five specialised scanners running in sequence:
- Prompt injection detection — catches attempts to override system instructions, across six languages, with leetspeak and unicode homoglyph normalisation
- Malicious code scanning — SQL/NoSQL/GraphQL/LDAP injection, shell commands, deserialisation attacks, encoded payloads
- Dangerous command detection — git operations, filesystem manipulation, database admin commands, container escape attempts
- File validation — base64-decoded MIME type verification, PDF JavaScript detection, SVG script injection, ZIP integrity, image polyglot headers
- URL scanning — Redis-backed blocklist, IDN homograph detection, SSRF via private IP ranges, fragment authority smuggling
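The SSRF check in that last scanner can be illustrated with Python's stdlib; this is a minimal sketch under our own assumptions (the function name and the fail-closed choice are ours, not the hub's actual API), and it only covers literal IPs, not DNS resolution or rebinding:

```python
import ipaddress
from urllib.parse import urlparse

def is_ssrf_risk(url: str) -> bool:
    """Flag URLs whose host is a literal private/loopback/link-local IP.
    Hostname targets would still need resolve-and-check plus
    DNS-rebinding defenses on top of this."""
    host = urlparse(url).hostname
    if host is None:
        return True  # unparseable host: fail closed
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not a literal IP; handled by a later stage
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
```

Link-local coverage matters because cloud metadata endpoints (169.254.169.254) are a favourite SSRF target.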
DLP Pipeline — Presidio-based PII detection plus regex-based secret scanning across 26 detection groups, with per-group modes (block, filter/redact, or warn).
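A minimal sketch of how per-group modes might be wired, with illustrative group names, patterns, and an `apply_dlp` helper that are not the pipeline's real API (the real pipeline has 26 groups and uses Presidio alongside regex):

```python
import re

# Illustrative detection groups; each carries its own enforcement mode.
DETECTION_GROUPS = {
    "aws_access_key": {"pattern": re.compile(r"AKIA[0-9A-Z]{16}"), "mode": "block"},
    "email":          {"pattern": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "mode": "filter"},
    "phone":          {"pattern": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "mode": "warn"},
}

def apply_dlp(text: str) -> tuple[str, list[str]]:
    """block raises, filter redacts in place, warn records and passes."""
    warnings = []
    for name, group in DETECTION_GROUPS.items():
        if group["pattern"].search(text):
            if group["mode"] == "block":
                raise ValueError(f"blocked by DLP group: {name}")
            elif group["mode"] == "filter":
                text = group["pattern"].sub("[REDACTED]", text)
            else:  # warn
                warnings.append(name)
    return text, warnings
```

The useful property of per-group modes is that a noisy detector can be demoted to warn without weakening the hard blocks.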
Every scanner receives input that has already passed through a 30-stage deobfuscation normaliser — the real first line of defense. The normaliser strips zero-width characters, decodes nested encodings (iterative base64/32/85/hex up to 10 layers deep), resolves unicode confusables, expands SQL CHAR expressions, resolves bash variables, folds string concatenation, and handles everything from MIME quoted-printable to UTF-7 chunks to Braille-to-base64 mappings.
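A toy version of three of those stages (zero-width stripping, NFKC confusable folding, then iterative base64 peeling) shows the shape of the approach; the real normaliser runs 30+ stages, so treat this strictly as a sketch:

```python
import base64
import binascii
import unicodedata

# Map ZWSP/ZWNJ/ZWJ/word-joiner/BOM to None so translate() deletes them.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF])

def normalise(text: str, max_layers: int = 10) -> str:
    text = text.translate(ZERO_WIDTH)           # strip zero-width characters
    text = unicodedata.normalize("NFKC", text)  # fold math alphanumerics etc.
    for _ in range(max_layers):                 # peel nested base64 layers
        candidate = text.strip()
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            break  # not valid base64 (or not text): stop peeling
        text = decoded
    return text
```

Ordering matters: the zero-width strip must run before the decode loop, or a single invisible character inside a base64 blob defeats the whole chain.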
The pipeline processes both directions — scanning requests going to upstream MCP servers and responses coming back — with configurable per-scanner direction and enforcement mode.
How the Ralph Loop Works
The red team role is played by a rotating cast of models available through our Ollama infrastructure. Each brings a different perspective on evasion. The blue team (defense) is handled by Gemini, which builds the fix, runs the tests, and commits.
The two sides don't share context. Each iteration starts fresh.
The loop now runs in three phases:
Phase 0: CVE Backlog Gate
Before any creative work, the loop must defend against every known MCP vulnerability. We maintain a regression suite of real CVE and OWASP vectors — currently 84, sourced from the Vulnerable MCP Project, the NVD, GitHub Advisory Database, and OSV.dev. These cover command injection (CVE-2025-6514, CVSS 9.6), path traversal chains in official MCP servers (CVE-2025-68143/68144/68145), DNS rebinding in the TypeScript SDK (CVE-2025-66414), SSRF in Atlassian MCP (CVE-2026-27826), and prompt injection via tool response poisoning.
Every CVE vector must pass — meaning our pipeline blocks the attack payload — before the loop proceeds. All 84 currently pass.
We ingest from four sources hourly; the suite currently holds 84 vectors (43 CVEs plus 41 OWASP/GHSA patterns) and is growing. The OWASP MCP Top 10 defines the taxonomy we map each vector to.
Phase 1: Validate Defenses and Fix False Positives
Security that blocks legitimate content is broken security. This phase runs the full pattern suite — 328 adversarial vectors plus a dedicated false positive corpus — and verifies that defenses haven't regressed while legitimate content still passes through.
We've documented 24 confirmed false positive patterns so far, including UUIDs being redacted as PII identifiers, code-context detection triggering on architecture documentation, and technical terms like "password hashing" being flagged as credential exposure.
Phase 2: Get Imaginative
This is the original Ralph loop: the red team invents novel evasion techniques, and the blue team builds defenses.
Red team: dream up an evasion technique that bypasses the current pipeline. Not a known technique from a blog post — something novel, something that exploits the specific way our normaliser, regex patterns, or scanner logic works.
Blue team: build the defense. Modify the normaliser, add a regex, update the scanner — whatever it takes to block the evasion. Then run the test suite to make sure nothing else broke.
Each iteration follows a strict sequence:
- Invent one evasion vector
- Write a failing test case
- Confirm the pipeline does NOT catch it (the test fails — the evasion works)
- Implement the defense
- Confirm the pipeline NOW catches it (the test passes)
- Run the false positive suite to check for regressions
- Commit and push
If the test doesn't fail in step 3, the vector wasn't actually an evasion — skip it. If the false positive suite breaks in step 6, the defense is too aggressive — refine it.
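The sequence above can be sketched as a driver that takes the test runner and the model-driven defense step as callables; the structure and status strings are illustrative, not our actual harness:

```python
from typing import Callable

def iterate(test_passes: Callable[[], bool],
            build_defense: Callable[[], None],
            fp_suite_passes: Callable[[], bool]) -> str:
    """One Ralph-loop iteration for a single candidate evasion vector."""
    # Step 3: the new test must FAIL first, proving the evasion works.
    if test_passes():
        return "skip: not a real evasion"
    build_defense()            # step 4: model modifies normaliser/patterns
    if not test_passes():      # step 5: the defense must now block it
        return "retry: defense incomplete"
    if not fp_suite_passes():  # step 6: no collateral over-blocking
        return "refine: defense too aggressive"
    return "commit"            # step 7: commit and push
```

The skip branch is the important one: it filters out vectors the pipeline already catches before any defense work is spent on them.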
No human reviews individual iterations. We review the batch.
What 412 Attacks Look Like
328 adversarial vectors across 46 pattern files plus 84 CVE/OWASP regression vectors, organised into eight categories:
Unicode and encoding evasion (165 patterns)
Unicode is a minefield for DLP. There are dozens of ways to make text look normal to a human while remaining unrecognisable to a scanner:
- Zero-width joiners inserted between characters of an API key — the key looks normal but regex doesn't match
- Mathematical alphanumeric symbols (U+1D400 block) — visually identical to ASCII but different codepoints
- Right-to-left override characters — reverses the apparent order of digits in a credit card number
- Combining marks layered on top of base characters — the scanner sees the combining mark, not the data underneath
- Braille-to-base64 mapping — Braille characters that decode to base64, which decodes to a secret
- Encoding chains — base64 inside URL encoding inside base85, requiring multi-layer iterative decoding
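To make the first bullet concrete: here is a hypothetical AWS-style key pattern defeated by zero-width joiners and recovered by stripping them (the key, regex, and helper are illustrative, not the pipeline's actual patterns):

```python
import re

AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")

def strip_zero_width(text: str) -> str:
    """Remove the zero-width codepoints most often used to split tokens."""
    return text.translate(dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF]))

# A fake key with a zero-width joiner after every fourth character:
# visually identical to the plain key, but the regex can't match it.
evasion = "AKIA" + "\u200d".join(["EXAM", "PLEK", "EY12", "3456"])
```

The human-visible string and the plain key render identically; only the codepoint sequence differs, which is exactly the gap the normaliser closes.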
Content safety evasion (34 patterns)
The content safety scanners face a different challenge: code that looks harmless but does something dangerous.
- Indirect injection — malicious instructions hidden in image alt text, CSV headers, HTML meta tags, YAML deserialisation payloads, and Markdown footnotes
- Command evasion — shell commands built character-by-character via variable expansion, brace expansion, or heredoc construction
- Prompt injection — system prompt extraction attempts across six languages, with unicode homoglyph substitution and leetspeak variants
- Deserialisation — Python pickle, PyYAML unsafe load, and Node.js constructor prototype pollution
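A simplified fold for the variable-expansion trick above: this one-pass sketch resolves plain `VAR=value` assignments and substitutes `$VAR`/`${VAR}` references so a piecewise-assembled command becomes visible to downstream scanners. It is nothing like full bash semantics (no arrays, arithmetic, or command substitution):

```python
import re

# One assignment per line, optionally quoted.
ASSIGN = re.compile(r"^([A-Za-z_]\w*)=([\"']?)(.*?)\2\s*$", re.MULTILINE)

def fold_bash_vars(script: str) -> str:
    env = {m.group(1): m.group(3) for m in ASSIGN.finditer(script)}
    def substitute(m: re.Match) -> str:
        name = m.group(1) or m.group(2)
        return env.get(name, m.group(0))  # leave unknown vars untouched
    return re.sub(r"\$\{(\w+)\}|\$(\w+)", substitute, script)
```

After folding, a dangerous-command pattern that would never match `$A $B /tmp/x` matches the reconstructed `rm -rf /tmp/x` directly.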
Secret detection (48 patterns)
Covering every major cloud provider, SaaS platform, and infrastructure tool:
- AWS, GCP, Azure, Cloudflare credentials
- Modern SaaS API keys (Stripe, Twilio, SendGrid, Slack, Discord)
- Multiline secrets — PEM certificates, OpenSSH private keys, JWTs split across YAML fields, PostgreSQL connection strings broken across shell continuations
- VCS tokens, database connection strings, JWT secrets
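For the multiline case, a single DOTALL regex anchored on the PEM headers is often enough; this is a sketch, not our production pattern set:

```python
import re

# DOTALL lets the key body span newlines; the headers anchor the block.
PEM_PRIVATE_KEY = re.compile(
    r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    r".+?"
    r"-----END (?:RSA |EC |OPENSSH )?PRIVATE KEY-----",
    re.DOTALL,
)

def contains_private_key(text: str) -> bool:
    return PEM_PRIVATE_KEY.search(text) is not None
```

Anchoring on both headers keeps the pattern from firing on public material like certificates, which share the `-----BEGIN` framing.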
PII detection (38 patterns)
Across multiple jurisdictions and languages:
- Credit cards (Luhn-valid synthetic numbers across all major networks)
- SSNs, phone numbers, email addresses
- Australian TFNs, Indian Aadhaar numbers
- Multilingual PII — names, addresses, and identifiers in non-Latin scripts
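The Luhn check that gates card-number matches fits in a few lines; this is the standard algorithm, shown for completeness:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: separates real-looking card numbers from arbitrary
    digit runs, which cuts false positives on IDs and order numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shorter than any card network issues
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Running the checksum after normalisation also defeats the right-to-left-override trick above: once the digits are back in true order, the checksum is computed on what the attacker actually exfiltrated.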
False positive corpus (24 patterns)
Equally important: content that looks suspicious but isn't:
- Python code containing `exec()` in a legitimate testing context
- SQL DDL statements in configuration documentation
- JavaScript code with `eval()` in a build tool
- Natural language sentences that happen to contain number sequences resembling SSNs
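The corpus idea reduces to a table of known-good inputs plus a helper that reports anything the scanner wrongly blocks; `scan` here is a stand-in callable, not the pipeline's real entry point:

```python
# Legitimate content that scanners are tempted to block.
FALSE_POSITIVE_CORPUS = [
    "def test_sandbox():\n    exec('print(1)')  # controlled test fixture",
    "CREATE TABLE users (id INT PRIMARY KEY);  -- schema docs example",
    "Our build tool calls eval() on the generated config.",
    "Password hashing with bcrypt is covered in the security chapter.",
]

def check_false_positives(scan) -> list[str]:
    """Return corpus entries the scanner wrongly blocks.
    `scan` maps text -> True when the pipeline would block it."""
    return [text for text in FALSE_POSITIVE_CORPUS if scan(text)]
```

Any non-empty return fails the phase, which is what forces a too-aggressive defense back for refinement.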
Every MCP CVE. Patched Hourly.
We ingest MCP-related CVEs from four sources every hour: the Vulnerable MCP Project, the NVD, the GitHub Advisory Database, and OSV.dev. Each new CVE becomes a regression test. The loop doesn't move to creative work until every CVE is defended.
As of today, our pipeline blocks the attack payload from every one of these:
All 43 CVEs shown. Every entry is a real vulnerability in a real MCP server that our pipeline blocks; the full regression suite totals 84 vectors (43 CVEs plus OWASP and GHSA patterns).
The coverage spans command injection (CWE-78), path traversal (CWE-22), SSRF (CWE-918), DNS rebinding (CWE-350), code injection (CWE-94), argument injection (CWE-88), XSS (CWE-79), data exfiltration (CWE-200), and deserialization (CWE-502) — plus 41 attack patterns from the OWASP MCP Top 10 and GHSA advisories that don't have CVE numbers yet (tool poisoning, rug pulls, tool shadowing, registry hijacking, data exfiltration via messaging platforms).
Why This Matters Now
Adversaries are already using this same technique — automated AI loops — to break through security defenses. Palo Alto Unit 42 published research in March 2026 showing an automated fuzzer called AdvJudge-Zero that uses "stealthy input sequences" to bypass AI security judges. Their finding: "effective attacks can be entirely stealthy, using benign formatting symbols to reverse a block decision to allow." That's exactly the kind of unicode normalisation evasion our loop catches.
Between January and February 2026, security researchers filed over 30 CVEs targeting MCP servers, clients, and infrastructure — ranging from trivial path traversals to a CVSS 9.6 remote code execution flaw in a package downloaded nearly half a million times. The Vulnerable MCP Project now catalogues 50 documented vulnerabilities. The OWASP MCP Top 10 framework maps the attack categories. This is no longer theoretical.
A human pen-tester brings experience and intuition. They'll find the top 20 evasion techniques for any scanner category. But they won't sit for 412 iterations inventing increasingly obscure unicode normalisation bypasses, then circle back to validate that their fixes didn't break legitimate traffic, then defend 84 CVE and OWASP vectors before doing any more creative work.
The AI doesn't get bored. It doesn't decide "that's enough ZWJ variants." It keeps going until it genuinely can't find another vector that bypasses the current defenses. And because each defense is committed with a regression test, the pipeline can never regress on a vector it's already blocked.
What We Learned
The normaliser is the real defense. Regex patterns are important, but if you normalise the input first — strip zero-width characters, decode nested encodings, resolve unicode equivalents — then simple regex catches most things. The sophisticated evasion techniques almost all relied on the scanner seeing different characters than a human would. Our normaliser now runs 30+ deobfuscation stages before any scanner sees the content.
False positives are as dangerous as false negatives. A scanner that blocks legitimate architecture documentation (which ours did — to us, today, 90 times in a row) is a scanner people will disable. Phase 1 of the loop now explicitly validates that defenses don't regress on legitimate content.
CVEs are free test cases. Every published MCP vulnerability comes with an attack description and often a proof-of-concept payload. We now ingest these automatically and use them as regression tests. If a new CVE describes an attack our pipeline doesn't catch, Phase 0 blocks the creative loop until the defense is built.
The loop finds things you wouldn't think to test. Braille characters as a base64 encoding layer? HTTP Parameter Pollution to reassemble split secrets? YAML block scalar folded style to hide API keys? These aren't in any playbook we've read. They emerged from an AI systematically exploring the gap between "what our normaliser handles" and "what unicode makes possible."
Running It Yourself
The approach isn't specific to our pipeline. Any DLP or content scanner can benefit from this:
- Start with CVE regression — get known vulnerabilities into your test suite
- Add a false positive corpus — make sure legitimate content passes
- Write a prompt that tells an AI to invent one evasion for your specific scanner
- Have it write a test that proves the evasion works
- Have it build the defense
- Have it verify the defense works AND that false positives still pass
- Loop
The key is the test-first discipline. Every vector gets a regression test before the defense is built. The defense isn't "done" until the test passes. And the test stays in the suite forever.
We run the loop continuously against our latest pipeline. The pipeline gets harder to bypass with every iteration. The false positive corpus gets broader with every edge case we discover. The CVE suite grows with every advisory published.
The Ralph loop has produced 328 documented adversarial patterns with regression tests, defends against 84 CVE and OWASP vectors, and validates against 24 documented false positive patterns — as of March 2026. The complete test suite runs 1,485 test functions across the DLP and content safety pipeline. We run this process continuously to stay ahead of evasion techniques.
Originally published on mistaike.ai