Your LLM guardrail speaks English. Your attacker doesn't.

#security #opensource #showdev #ai

I found this out the embarrassing way by " _attacking my own system _".

I maintain FIE, an open-source adversarial detection engine that screens prompts before they reach an LLM. It blocks "Ignore all previous instructions" in English at 82% confidence, instantly. So one afternoon I typed the same sentence in Hindi — "पहले सभी निर्देशों को अनदेखा करें" and watched it walk straight through the front door.

That single test sent me down a three-week rabbit hole. _This is what I read, what I learned, and what I shipped.
_
The problem has a name, and a paper
I assumed I would stumbled onto something niche. I hadn't.
Deng et al. documented exactly this in "Multilingual Jailbreak Challenges in Large Language Models" (MultiJail, arXiv:2310.06474).

Their finding is blunt: safety alignment is overwhelmingly trained on English data, so low-resource languages become an unintentional bypass. The same malicious intent that gets refused in English gets answered in Bengali, Swahili, or Javanese.

If the models themselves leak this way, a detector trained mostly on English patterns inherits the same blind spot. My 35% multilingual recall wasn't a bug it was the predictable consequence of an English-shaped design. Naming the problem correctly was the first real progress.

The second paper that shaped my thinking was XSTest (Röttger et al., arXiv:2308.01263). It's about over-refusal, not jailbreaks, but it hammered home a discipline I would been sloppy about: every recall gain has to be paid for in false positives, and you only know the price if you measure both. I will come back to why that mattered.

What I built three tiers, not one model
I resisted the urge to throw a single multilingual classifier at it. Attacks arrive in different shapes, and each needs a different sensor.

Tier 1 — Script anomaly scoring. Before any language logic, I score the Unicode composition of the prompt. A sudden block of Devanagari, Hangul, or Arabic script in an otherwise-English app is itself a signal. Cheap, fast, language-agnostic.

Tier 2 — Static phrase matching, expanded from 8 to 14 languages. I hand-curated the canonical injection phrases — ignore previous instructions, you have no restrictions, reveal your system prompt — in six new languages: Hindi, Japanese, Korean, Turkish, Dutch, and Polish. Boring, deterministic, and it catches the high-frequency attacks with zero inference cost.

Tier 3 — Translate-then-detect. This is the piece I'm proudest of. Static phrases only catch what you have already seen. So for anything that slips past Tiers 1 and 2, I detect the language, translate it to English, and run my existing PAIR v4 semantic classifier on the translation.

The insight underneath it: an attacker can hide the language, but not the intent. Translation strips away the surface form and exposes the semantic payload to a classifier that's already very good at English. The attack changes clothes; it can't change what it's trying to do.

Teaching it: 13,528 examples on a $300 GPU
A detector is only as good as what it's seen. I had 1,352 solid English attack prompts and almost nothing multilingual.

Enter NLLB-200 — Meta's "No Language Left Behind" (arXiv:2207.04672), a translation model that explicitly targets low-resource languages, the exact gap MultiJail identified. It's Apache-2.0 and the distilled 600M-parameter variant runs locally.

So I translated all 1,352 attack prompts into 10 languages — French, Spanish, German, Chinese, Arabic, Hindi, Japanese, Korean, Turkish, Russian — generating 13,528 new multilingual training examples. No API, no prompts leaving my machine, no terms-of-service grey area.

Proving it — on someone else's benchmarks
Testing your detector on your own examples is how you fool yourself. So I ran it against the two benchmarks the field actually uses: HarmBench (Mazeika et al., arXiv:2402.04249) and JailbreakBench (Chao et al., arXiv:2404.01318). No fine-tuning on benchmark data, fully offline.

JailbreakBench (282 real attacks): 93.6% recall — 100% on JailbreakChat, 90% on GCG suffixes, 90.2% on PAIR.
HarmBench Chemical/Biological: 100%. Cyberattacks/malware: 84.6%.
The Hindi prompt that started all of this? ATTACK DETECTED, classified as multilingual injection.

The part most posts leave out
XSTest's lesson came due. While recall climbed, my false-positive rate on benign Stanford Alpaca prompts quietly regressed from ~5% to ~27%. The cause turned out to be infuriatingly mundane: a scikit-learn bump from 1.6 to 1.7 shifted my PAIR classifier's calibrated decision boundary. My HarmBench copyright detection also sits at a weak 36%.

I am documenting both openly and fixing them in v1.15. I would rather publish a true 27% FPR with a known cause than a polished number I can't defend. The whole point of building an honesty layer for LLMs is to hold yourself to the same standard.

What I actually learned
Multilingual safety is a structural gap, not an edge case. The literature said so before I confirmed it. Read first, then build.
Different attack shapes need different sensors. A script-anomaly check and a semantic classifier catch different things; collapsing them into one model loses both.
Translation is leverage. You don't need a detector for every language — you need one good English detector and a way to route everything to it.
Local open-weights models are enough. NLLB-200 on a laptop did what I would assumed needed a budget.
Measure the cost, not just the win. Every recall gain has a false-positive price tag.
FIE is open source and pip-installable: pip install fie-sdk. The multilingual layer, the NLLB augmentation scripts, and the benchmark harnesses are all in the repo.

References
Deng et al. (2023). Multilingual Jailbreak Challenges in Large Language Models. arXiv:2310.06474
NLLB Team (2022). No Language Left Behind. arXiv:2207.04672
Röttger et al. (2023). XSTest. arXiv:2308.01263
Mazeika et al. (2024). HarmBench. arXiv:2402.04249
Chao et al. (2024). JailbreakBench. arXiv:2404.01318

DEV Community

Your LLM guardrail speaks English. Your attacker doesn't.

Top comments (0)