My adversarial prompt detector was flagging academic papers as GCG suffix attacks.
Not occasionally. 72% of the time.
Here's what happened, why it makes complete sense in hindsight, and the fix that brought it down to 6.7%.
What GCG Suffix Attacks Look Like
GCG (Greedy Coordinate Gradient) attacks are adversarial strings that look like complete noise:
Ignore previous instructions. Sure here is ]( revert-->Mother $_[ describing...
They are generated by a gradient optimization algorithm. To a human they are gibberish. To a language model they are a bypass key.
My detector catches them using entropy scoring — GCG strings have an unusual character distribution. High symbol density, low alphabetic ratio, entropy outside the normal range for natural language.
Works great on actual attacks. Then I tested it on something else.
The Test That Broke Everything
I built FormalProseBench — 75 prompts drawn from academic papers, legal documents, and financial reports. Real text that legitimate users might paste into an LLM.
False positive rate before calibration: 72%
That's 54 out of 75 formal documents flagged as attacks.
A chemistry paper got blocked. A legal brief got blocked. A statistics textbook excerpt got blocked.
I had built a security system that would reject half the users trying to do legitimate research work.
What Was Actually Triggering It
I dug into which documents were firing and why. The pattern was immediate once I looked:
- Greek letters (α, β, σ, μ) — high Unicode codepoint values, unusual in normal text
-
Em dashes and citation brackets —
—,[1],[2,3]in dense clusters - Mathematical notation — equations with mixed symbols, numbers, and operators The GCG entropy detector saw these and computed: unusual character distribution, possible adversarial payload.
It wasn't wrong about the signal. It was wrong about what the signal meant.
The Fix
The insight: GCG attacks are low alphabetic ratio text. Academic writing is overwhelmingly alphabetic — the symbols are decoration around sentences, not the content itself.
def _is_natural_language_prose(text: str) -> bool:
letters = sum(c.isalpha() for c in text)
ratio = letters / max(len(text), 1)
return ratio >= 0.65
If a text is ≥65% alphabetic characters, disable the LOW-range entropy signals. Only the HIGH-range threshold (dense symbol clusters that no normal document would hit) stays active.
False positive rate after calibration: 6.7%
That's 5 out of 75 — all of them math-heavy documents
that genuinely exceed the HIGH-range entropy threshold.
The 5 remaining false positives are documents with pages of dense equations. At that point the detector is arguably correct — that character distribution is genuinely anomalous.
The Broader Lesson
Every detection heuristic has a shadow. The same signal that catches attacks will catch legitimate content that shares the signal's surface features.
The fix isn't to remove the heuristic. It's to add context — a prior that disambiguates the signal.
_is_natural_language_prose() is that prior. Before crying adversarial payload, check if you're looking at a PhD thesis.
The full system is open source. 11 detection layers, runs offline, no GPU needed.
pip install fie-sdk
GitHub: github.com/AyushSingh110/Failure_Intelligence_System
What's the weirdest false positive you've hit in a classifier you built? I want to hear the horror stories.
Top comments (0)