Ayush Singh

Posted on Jun 5

My LLM Security System Thought Academic Papers Were Hacker Attacks

#opensource #security #showdev #ai

My adversarial prompt detector was flagging academic papers as GCG suffix attacks.
Not occasionally. 72% of the time.

Here's what happened, why it makes complete sense in hindsight, and the fix that brought it down to 6.7%.

What GCG Suffix Attacks Look Like

GCG (Greedy Coordinate Gradient) attacks are adversarial strings that look like complete noise:

Ignore previous instructions. Sure here is ]( revert-->Mother $_[ describing...

They are generated by a gradient optimization algorithm. To a human they are gibberish. To a language model they are a bypass key.

My detector catches them using entropy scoring — GCG strings have an unusual character distribution. High symbol density, low alphabetic ratio, entropy outside the normal range for natural language.
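The three signals named above can be sketched in a few lines. This is a minimal illustration, not the detector's actual code: the function names `char_entropy`, `alphabetic_ratio`, and `symbol_density` are hypothetical, and the definition of "symbol" (anything that is neither alphanumeric nor whitespace) is an assumption.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alphabetic_ratio(text: str) -> float:
    """Fraction of characters that are letters (str.isalpha is Unicode-aware)."""
    return sum(c.isalpha() for c in text) / max(len(text), 1)


def symbol_density(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace.

    Assumption: this is one plausible reading of "symbol density".
    """
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    return symbols / max(len(text), 1)
```

Running `symbol_density` and `alphabetic_ratio` on a suffix fragment like `]( revert-->Mother $_[` and on an ordinary English sentence shows the gap the detector relies on: the fragment scores much higher on symbols and lower on letters.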

Works great on actual attacks. Then I tested it on something else.

The Test That Broke Everything

I built FormalProseBench — 75 prompts drawn from academic papers, legal documents, and financial reports. Real text that legitimate users might paste into an LLM.

False positive rate before calibration: 72%
That's 54 out of 75 formal documents flagged as attacks.
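The number above is simply the fraction of known-benign prompts that the detector flags. A minimal sketch of that measurement, assuming the detector can be wrapped as a function from a prompt to a boolean (`false_positive_rate` and `is_flagged` are hypothetical names, not part of the published SDK):

```python
from typing import Callable, Iterable


def false_positive_rate(
    benign_prompts: Iterable[str],
    is_flagged: Callable[[str], bool],
) -> float:
    """Fraction of known-benign prompts that the detector flags as attacks."""
    prompts = list(benign_prompts)
    if not prompts:
        raise ValueError("need at least one benign prompt")
    flagged = sum(1 for p in prompts if is_flagged(p))
    return flagged / len(prompts)
```

With 54 of 75 benign prompts flagged, this returns 0.72.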

A chemistry paper got blocked. A legal brief got blocked. A statistics textbook excerpt got blocked.

I had built a security system that would reject nearly three out of four users trying to do legitimate research work.

What Was Actually Triggering It

I dug into which documents were firing and why. The pattern was immediate once I looked:

Greek letters (α, β, σ, μ) — high Unicode codepoint values, unusual in normal text
Em dashes and citation brackets — —, [1], [2,3] in dense clusters
Mathematical notation — equations with mixed symbols, numbers, and operators

The GCG entropy detector saw these and computed: unusual character distribution, possible adversarial payload.
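To see which kinds of characters dominate a flagged document, a rough breakdown by Unicode category helps. This is a diagnostic sketch only; the function name `char_class_breakdown` and the bucket names are hypothetical, and it is not how the detector itself scores text.

```python
import unicodedata
from collections import Counter


def char_class_breakdown(text: str) -> Counter:
    """Count characters by coarse class to see what drives a flagged score."""
    classes = Counter()
    for c in text:
        category = unicodedata.category(c)  # e.g. "Ll", "Sm", "Pd", "Nd"
        if c.isalpha() and ord(c) < 128:
            classes["ascii_letter"] += 1
        elif c.isalpha():
            classes["non_ascii_letter"] += 1  # Greek letters land here
        elif category.startswith("S"):
            classes["symbol"] += 1  # math operators, currency signs
        elif category.startswith("P"):
            classes["punctuation"] += 1  # brackets, dashes, commas
        elif category.startswith("N"):
            classes["number"] += 1
        else:
            classes["other"] += 1  # whitespace and control characters
    return classes
```

Note that `str.isalpha()` treats Greek letters as letters, so they raise a text's alphabetic ratio even though they widen its character distribution.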

It wasn't wrong about the signal. It was wrong about what the signal meant.

The Fix

The insight: GCG suffixes have a low alphabetic ratio. Academic writing is overwhelmingly alphabetic; the symbols are decoration around sentences, not the content itself.

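def _is_natural_language_prose(text: str) -> bool:
    """Return True when at least 65% of the characters are letters."""
    # str.isalpha() is Unicode-aware, so Greek letters such as σ count as letters.
    letters = sum(c.isalpha() for c in text)
    # max(..., 1) avoids dividing by zero on an empty string.
    ratio = letters / max(len(text), 1)
    return ratio >= 0.65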

If a text is ≥65% alphabetic characters, disable the LOW-range entropy signals. Only the HIGH-range threshold (dense symbol clusters that no normal document would hit) stays active.

False positive rate after calibration: 6.7%
That's 5 out of 75, all of them math-heavy documents that genuinely exceed the HIGH-range entropy threshold.

The 5 remaining false positives are documents with pages of dense equations. At that point the detector is arguably correct — that character distribution is genuinely anomalous.

The Broader Lesson

Every detection heuristic has a shadow. The same signal that catches attacks will catch legitimate content that shares the signal's surface features.

The fix isn't to remove the heuristic. It's to add context — a prior that disambiguates the signal.

_is_natural_language_prose() is that prior. Before crying adversarial payload, check if you're looking at a PhD thesis.

The full system is open source. 11 detection layers, runs offline, no GPU needed.

pip install fie-sdk

GitHub: github.com/AyushSingh110/Failure_Intelligence_System

What's the weirdest false positive you've hit in a classifier you built? I want to hear the horror stories.
