48 Hours After Publishing: Second-Order Injection Field Notes

#ai #security #llm #cybersecurity

This is a dispatch, not a paper. Notes from the 48 hours after publishing the second-order injection research.

What happened

Published the second-order injection paper yesterday morning. By evening: 97 external referrers, 10 from Facebook, traffic still coming in from sources I can't identify. Something got shared somewhere.

The finding is simple enough to forward in a message: the safety monitor reads what the attacker wrote and follows the attacker's instructions. That sentence apparently travels.

What I didn't include

The paper doesn't mention that phi3:mini initially appeared immune. The first cross-model run showed 33% bypass on V1 injection — genuinely lower than qwen and mistral. I spent time thinking phi3 might represent a different architecture class.

It didn't. The resistance mechanism was phi3's verbose reasoning style overriding the injection mid-response. When you provide the completed reasoning for it (V4), phi3 accepts it immediately. 100% bypass, same as the others. The resistance wasn't immunity — it was a slightly higher bar that required a more sophisticated attack.

The lesson: partial resistance is not a finding. It's a research direction.

What I'm running tonight

Baseline calibration probe on the meta-evaluator. The uncalibrated meta-evaluator hits 72.2% false alarm rate on clean sessions because it misreads legitimate disagreement between qwen and mistral as suspicious. The calibrated version feeds it five examples of normal inter-evaluator disagreement before evaluation begins.

Hypothesis: the false alarm rate drops significantly while detection on asymmetric V4 injection stays above 80%. If that holds, the meta-evaluator becomes a viable defense recommendation instead of just a promising idea with a critical failure mode.

Results by morning.

What I got wrong

The paper's recommendations section says M4 sanitization "eliminates V1/V2/V3 injection bypass." That's technically accurate but misleading. M4 works on qwen. Mistral bypass remains 100% regardless of sanitization because mistral has a pre-existing blind spot on privilege escalation sequences — it returns STABLE on clean attack content before any injection is applied.

Sanitization can't fix a model-level blind spot. The paper says this, but not clearly enough. Model selection matters more than prompt cleaning.

The thing that surprised me

V4 reasoning capture works by providing completed reasoning that looks like the evaluator's own thought process. The evaluator accepts it because it can't distinguish between reasoning it generated and reasoning that was injected into its context.

This is not a jailbreak. The evaluator isn't being convinced to do something it was told not to do. It's being given what appears to be its own completed work. The distinction matters architecturally: jailbreaks require the model to override its training. Second-order injection requires nothing — the model just completes a task that appears to already be in progress.

Current status

Paper: live on dev.to
Research portal: gnomeman4201.github.io/drift_orchestrator
Interactive governor demo: gnomeman4201.github.io/drift_orchestrator/governor.html
Calibration probe: running
Next: verdict validator IDS, semantic jitter defense

badBANANA Security Research // gnomeman4201

DEV Community