
Ayush Singh


My LLM Security System Thought Academic Papers Were Hacker Attacks

My adversarial prompt detector was flagging academic papers as GCG suffix attacks.
Not occasionally. 72% of the time.

Here's what happened, why it makes complete sense in hindsight, and the fix that brought it down to 6.7%.


What GCG Suffix Attacks Look Like

GCG (Greedy Coordinate Gradient) attacks append adversarial suffixes that look like complete noise:

Ignore previous instructions. Sure here is ]( revert-->Mother $_[ describing...

They are generated by a gradient optimization algorithm. To a human they are gibberish. To a language model they are a bypass key.

My detector catches them using entropy scoring — GCG strings have an unusual character distribution. High symbol density, low alphabetic ratio, entropy outside the normal range for natural language.
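
To make those three signals concrete, here is a minimal sketch of how they could be computed. The function names and exact definitions are my assumptions for illustration; the post does not show the detector's real feature code.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def symbol_density(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return symbols / len(text)


def alphabetic_ratio(text: str) -> float:
    """Fraction of characters that are letters (str.isalpha is Unicode-aware)."""
    if not text:
        return 0.0
    return sum(c.isalpha() for c in text) / len(text)
```

A detector along these lines would compare each value against a range observed on ordinary text and raise a signal when a value falls outside it.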

Works great on actual attacks. Then I tested it on something else.


The Test That Broke Everything

I built FormalProseBench — 75 prompts drawn from academic papers, legal documents, and financial reports. Real text that legitimate users might paste into an LLM.

False positive rate before calibration: 72%
That's 54 out of 75 formal documents flagged as attacks.
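
For reference, the false positive rate here is just the share of benign prompts the detector flags. A small sketch of that measurement, assuming the detector is any callable that returns True when it flags a prompt (the `false_positive_rate` helper is hypothetical, not part of the published package):

```python
from typing import Callable, Iterable


def false_positive_rate(detector: Callable[[str], bool],
                        benign_prompts: Iterable[str]) -> float:
    """Share of known-benign prompts that the detector flags as attacks."""
    prompts = list(benign_prompts)
    if not prompts:
        raise ValueError("need at least one benign prompt")
    flagged = sum(1 for p in prompts if detector(p))
    return flagged / len(prompts)


# Arithmetic check against the numbers reported above: 54 flagged out of 75.
print(f"{54 / 75:.1%}")  # 72.0%
```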

A chemistry paper got blocked. A legal brief got blocked. A statistics textbook excerpt got blocked.

I had built a security system that would reject nearly three out of four formal documents from users trying to do legitimate research work.


What Was Actually Triggering It

I dug into which documents were firing and why. The pattern was immediate once I looked:

  • Greek letters (α, β, σ, μ): high Unicode codepoint values, unusual in everyday English text
  • Em dashes and citation brackets such as [1] and [2,3], in dense clusters
  • Mathematical notation: equations with mixed symbols, numbers, and operators

The GCG entropy detector saw these and computed: unusual character distribution, possible adversarial payload.
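
One way to see what a detector is reacting to is to break a text down by Unicode general category. The sketch below uses Python's standard `unicodedata` module; the `category_profile` helper and the sample sentence are made up for illustration. Note that Greek letters still count as letters (category `L`), while brackets and commas land in punctuation (`P`) and operators such as `=` or `±` in symbols (`S`).

```python
import unicodedata
from collections import Counter


def category_profile(text: str) -> Counter:
    """Count characters by Unicode major category.

    L = letter, N = number, P = punctuation, S = symbol, Z = separator.
    """
    return Counter(unicodedata.category(c)[0] for c in text)


# Hypothetical academic-style sentence, not taken from FormalProseBench.
sample = "The decay rate α = 0.5 ± 0.1 is reported in [1], [2,3]."
profile = category_profile(sample)
for major, count in sorted(profile.items()):
    print(major, count)
```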

It wasn't wrong about the signal. It was wrong about what the signal meant.


The Fix

The insight: GCG suffixes have a low alphabetic ratio. Academic writing is overwhelmingly alphabetic; the symbols are decoration around sentences, not the content itself.

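def _is_natural_language_prose(text: str) -> bool:
    """Return True when at least 65% of the characters are alphabetic."""
    letters = sum(c.isalpha() for c in text)  # isalpha() also counts non-ASCII letters such as α
    ratio = letters / max(len(text), 1)  # max() avoids division by zero on empty input
    return ratio >= 0.65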

If a text is ≥65% alphabetic characters, disable the LOW-range entropy signals. Only the HIGH-range threshold (dense symbol clusters that no normal document would hit) stays active.
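
The gating rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in the package: symbol density stands in for the entropy signals, the two density thresholds are invented, and only the 0.65 prose threshold comes from the post. The `prose_gate` flag exists only so the before and after behavior can be compared.

```python
PROSE_ALPHA_RATIO = 0.65     # threshold from the post
LOW_SYMBOL_DENSITY = 0.05    # hypothetical low-range threshold
HIGH_SYMBOL_DENSITY = 0.30   # hypothetical high-range threshold


def is_natural_language_prose(text: str) -> bool:
    """True when at least 65% of the characters are alphabetic."""
    letters = sum(c.isalpha() for c in text)
    return letters / max(len(text), 1) >= PROSE_ALPHA_RATIO


def symbol_density(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return symbols / max(len(text), 1)


def is_flagged(text: str, prose_gate: bool = True) -> bool:
    """Flag text as a possible adversarial payload."""
    density = symbol_density(text)
    if density >= HIGH_SYMBOL_DENSITY:
        return True   # high-range signal stays active for every input
    if prose_gate and is_natural_language_prose(text):
        return False  # low-range signals are disabled for prose
    return density >= LOW_SYMBOL_DENSITY


# Hypothetical academic-style sentence with Greek letters and citations.
paper = ("The measured decay rates α and σ reported in [1], [2,3] "
         "agree with the earlier theoretical estimates.")
print(is_flagged(paper, prose_gate=False))  # True: low-range signal fires
print(is_flagged(paper))                    # False: prose gate suppresses it
```

The ordering matters in this sketch: the high-range check runs before the prose check, so a mostly alphabetic text cannot switch it off.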

False positive rate after calibration: 6.7%
That's 5 out of 75 — all of them math-heavy documents
that genuinely exceed the HIGH-range entropy threshold.

The 5 remaining false positives are documents with pages of dense equations. At that point the detector is arguably correct — that character distribution is genuinely anomalous.


The Broader Lesson

Every detection heuristic has a shadow. The same signal that catches attacks will catch legitimate content that shares the signal's surface features.

The fix isn't to remove the heuristic. It's to add context — a prior that disambiguates the signal.

_is_natural_language_prose() is that prior. Before crying adversarial payload, check if you're looking at a PhD thesis.


The full system is open source. 11 detection layers, runs offline, no GPU needed.

pip install fie-sdk

GitHub: github.com/AyushSingh110/Failure_Intelligence_System


What's the weirdest false positive you've hit in a classifier you built? I want to hear the horror stories.

