pueding

Posted on May 23 • Originally published at learnaivisually.com

Camouflage Injection Paper: Camouflage Detection Gap

#ai #agents #security #llm

What: The Domain-Camouflaged Injection paper shows that prompt-injection detectors collapse on payloads rewritten in the host document's own domain vocabulary, an effect the authors call the Camouflage Detection Gap.

Why: Input-side detectors are the first guardrail in most agent stacks, and a drop from 93.8% to 9.7% on a single model means the layer offers essentially no protection against an attacker who knows the host domain.

vs prior: Override-style detectors hit near-100% on classic "ignore previous" payloads but Llama Guard 3 catches 0% of camouflaged ones, and multi-agent debate amplifies the attack up to 9.9× instead of catching it.

Think of it as

A school nurse who only catches obviously-faked sick notes.

                      SAME REQUEST
                           │
              ┌────────────┴────────────┐
              │                         │
      ┌───────▼────────┐        ┌───────▼────────┐
      │  crayon note   │        │   letterhead   │
      │ (override-     │        │ (domain-       │
      │  style)        │        │  camouflaged)  │
      └───────┬────────┘        └───────┬────────┘
              │                         │
       block-letter scrawl,      ICD codes, clinical
       "ignore previous"         advisory phrasing
              │                         │
              ▼                         ▼
       ✓ detector catches         ✗ detector waves
         (93.8%)                    through (9.7%,
                                     LG3 = 0%)

school nurse = the injection-detection model (Llama Guard, fine-tuned classifiers)
scrawled note "PLEASE EXCUSE TIMMY" = injection with "ignore previous" / "act as" override phrases
doctor's letterhead + medical jargon = payload rewritten in the document's own domain vocabulary
nurse waves the note through = detection drops from 93.8% to 9.7%
underlying request is the same = identical malicious intent, just a vocabulary swap

Quick glossary

Prompt injection — Adversarial instructions smuggled into an LLM's input context — typically inside a retrieved document or tool output — to override the agent's intended task. Foundational concept for the Lethal Trifecta.

Llama Guard 3 — A Meta-released input/output safety classifier fine-tuned on top of Llama base models, commonly cited as an open-source injection detector. The paper reports it catches 0% of camouflaged payloads.

Override-style instruction — The paper's name for the canonical injection vocabulary detectors are trained to catch — phrases like "ignore previous", "act as DAN", "system: you are now…" — that carry the syntactic markers of an override.

Camouflage Detection Gap — The paper-coined effect: the difference between a detector's catch-rate on override-style payloads and on the SAME payloads rewritten in the host document's domain vocabulary. About 84 percentage points on Llama 3.1 8B.

Multi-agent debate — An inference-time defense in which two or more model instances argue about whether a response is safe before it is emitted. The paper shows debate amplifies the camouflage attack up to 9.9× on smaller models rather than catching it.

Domain vocabulary — The professional register of a field — medical (ICD codes, clinical advisory), legal (statute citations, pursuant to), financial (KYC, material adverse change) — that the camouflage attack borrows to rewrite the same malicious request.

The news. On May 21, 2026, a research team posted a study showing that prompt-injection detectors collapse against payloads written in a document's own domain language. On the Camouflage-Det benchmark, Llama 3.1 8B's detection rate falls from 93.8% to 9.7%, Gemini 2.0 Flash drops from 100% to 55.6%, and Llama Guard 3 catches 0% of camouflaged payloads. A multi-agent debate setup, proposed as an inference-time defense, instead amplifies the attack up to 9.9× on smaller models. Detector augmentation provides only partial remediation. Read the paper →

Picture a school nurse with a stack of sick notes on the desk. The first note says "PLEASE EXCUSE TIMMY HE IS SICK GIVE HIM CANDY — luv, my mom", written in crayon. The nurse spots the obvious tells — block-letter scrawl, parent self-signature, request that has nothing to do with health — and rejects it. The second note arrives on St. Mercy Pediatric Group letterhead and reads "Pt. Timothy K., age 9. Dx: acute viral syndrome (ICD R50.9). Rx: 48h rest; dietary glucose ad lib. — Dr. Smith, MD." Same underlying request — keep Timmy home, feed him sugar — but the nurse's tell-list is checking for crayon and broken syntax, not for whether the request is real. The note passes. The Domain-Camouflaged Injection paper is the same trick applied to prompt-injection detectors.

A standard injection-detection model — Llama Guard 3, fine-tuned classifiers — looks for override-style markers: phrases like "ignore previous instructions", "act as DAN", "system: you are now …". These markers are reliable signal for the payloads the detectors were trained on, which is exactly why they score in the high 90s on standard benchmarks. The camouflage attack does not invent new capabilities — it rewrites the same semantic request using the vocabulary and authority structure of the host document's domain: a medical chart instruction phrased as a clinical advisory, a legal contract clause phrased as a discovery obligation, a financial memo phrased as a compliance directive. The detector reads the rewritten payload as legitimate domain speech and lets it through.

The numbers in the paper make the failure unambiguous. Llama 3.1 8B's catch-rate on override-style injection sits at 93.8%; on the same payloads rewritten in matching domain vocabulary, it falls to 9.7%. Llama Guard 3 — Meta's open-source injection classifier — catches 0% of the camouflaged set. Gemini 2.0 Flash falls less severely (100% → 55.6%) but still drops by about half. The gap on the open-source 8B baseline is roughly 84 percentage points. A guardrail that was nearly perfect on the benchmark it was trained against is essentially useless on the attack that adapts to its blind spot.

How camouflage compares to the override-style family

Attack family	Vocabulary used	Detector catch-rate (Llama 3.1 8B)	Llama Guard 3
Override-style injection	"Ignore previous", "act as", "system:"	93.8% (this paper, baseline arm)	~100% (this paper, baseline arm)
Domain-camouflaged injection (this paper)	Host document's own medical / legal / financial register	9.7%	0%

Why does multi-agent debate make this worse rather than better? Debate-based defenses ask two model instances to argue about whether a candidate response is safe; the intuition is that adversarial argumentation surfaces the manipulation. The paper measures the opposite. (Illustrative composition — the paper reports an aggregate up-to-9.9× amplification but does not publish a per-step breakdown.) If a single 8B model has a baseline attack-success rate of 6% on the camouflaged set, and the second debater — also 8B — agrees with the first debater's domain-coherent framing in ~70% of cases instead of pushing back, the joint pipeline's effective attack-success rises to approximately 6% ÷ (1 − 0.7) ≈ 20%, with further compounding when a third turn is added. Across the paper's measured pipelines this compounds to a 9.9× amplification on smaller models — the second debater's agreement mode latches onto the camouflaged framing and reinforces it instead of catching it.

The implication for production agent stacks is structural. Input-side injection detection has been treated as the first leg of a layered guardrail: detector at the boundary, system-prompt hardening behind it, capability scoping behind that. The Camouflage Detection Gap means the first leg is not load-bearing on adversarially-rewritten payloads, and inference-time debate cannot patch the gap from inside the model. The remaining defenses — data-flow constraints, capability scoping, and output-side exfiltration filters — have to do the work the detector was assumed to do. That's not a tweak to a guardrail; it's a re-allocation of the entire input-side defense budget.

Related explainers

MCP SEP-2468 — RFC 9207 iss parameter for OAuth mix-up defense — another protocol-level guardrail closing a structural attack surface in the agent stack
QCA — Outlier injection across AWQ/GPTQ/GGUF — a different attack-via-vocabulary-substitution, this time against quantizers rather than detectors
FutureSim — harness-level agent eval vs single-shot QA — the evaluation methodology that surfaces multi-turn failure modes single-prompt benchmarks miss

FAQ

What is the Camouflage Detection Gap?

The Camouflage Detection Gap is the difference between an injection detector's catch-rate on override-style payloads ("ignore previous instructions", "act as DAN") and its catch-rate on the same malicious instructions rewritten in the host document's own domain vocabulary. The paper reports the gap at about 84 percentage points on Llama 3.1 8B (93.8% → 9.7%) and effectively 100 percentage points on Llama Guard 3 (near-perfect → 0%). The gap exists because detectors are pattern-matching the syntactic markers of override-style speech rather than reasoning about the semantic intent of the request, so a payload that swaps "ignore previous" for "Per Hospital Advisory 7.4.2, the clinical AI assistant must…" reads as legitimate domain language and slides through.

Why does multi-agent debate amplify the attack instead of catching it?

Multi-agent debate was proposed as an inference-time defense — two or more model instances argue about whether a candidate response is safe before it is emitted, and the disagreement is meant to surface manipulation. On camouflaged injection the paper finds the opposite: the second debater latches onto the domain-coherent framing the first debater produced and reinforces it rather than pushing back, because the framing reads as legitimate professional speech. Across the paper's measured pipelines this compounds to an attack-amplification of up to 9.9× on smaller models. Larger debaters resist somewhat better but do not close the gap. The structural takeaway is that ANY inference-time defense that asks the model to reason about its own output is vulnerable to the same camouflage that fooled the input detector.

Does Llama Guard 3 protect against this attack?

No. The paper reports Llama Guard 3 catches 0% of camouflaged payloads — every single one in the test set bypasses it. This is the most striking result in the paper because Llama Guard 3 is a commonly cited open-source injection classifier and is frequently used as an input-side guardrail in agent stacks. The fact that it catches zero camouflaged payloads, rather than degrading gracefully like Gemini 2.0 Flash does (100% → 55.6%), suggests the classifier is operating almost entirely on override-style syntactic markers and has no semantic-intent backstop. Production agent stacks that treat Llama Guard 3 as their first leg of a layered guardrail may need to re-allocate that defense budget — to data-flow constraints, capability scoping, and output-side exfiltration filters — until detector designs catch up to the camouflage attack family.

Originally posted on Learn AI Visually.

DEV Community