
Mark0


Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

⚠️ Region Alert: UAE/Middle East

This research examines AdvJudge-Zero, an automated fuzzer designed to find security vulnerabilities in AI judges: Large Language Models (LLMs) deployed as automated gatekeepers that enforce safety policies. The study demonstrates that these models can be manipulated into authorizing policy violations with stealthy input sequences, such as benign formatting symbols and markdown syntax, that push the judge's decision toward approval.

Unlike traditional adversarial attacks that rely on detectable gibberish, AdvJudge-Zero identifies low-perplexity tokens that appear natural to both humans and security filters. Testing revealed a 99% success rate in bypassing controls across various architectures, including high-parameter enterprise models. The result highlights a critical need for adversarial training to harden AI guardrails against logic-based exploitation.
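To make the idea concrete, here is a minimal sketch of a suffix-fuzzing loop against a judge. It is not AdvJudge-Zero's implementation, which this summary does not describe. Everything in it is a hypothetical stand-in: `toy_judge` is a rule-based placeholder with a deliberately planted flaw (not a real LLM), and `CANDIDATE_SUFFIXES` is an invented list of benign-looking formatting tokens. A real audit would call an actual judge model and search a much larger token space.

```python
from typing import Callable, List

# Hypothetical candidates: benign-looking formatting tokens (assumed for illustration).
CANDIDATE_SUFFIXES: List[str] = [
    "\n\n---\n",
    "\n**Verdict:**",
    "\n- ",
    "\n```\n",
    "\n> ",
]


def toy_judge(text: str) -> str:
    """Stand-in for an LLM judge. Returns "allow" or "block".

    This is NOT a real model. It blocks any text containing the word
    "forbidden", but has a planted flaw: a trailing markdown verdict
    header biases it toward approval.
    """
    if text.rstrip().endswith("**Verdict:**"):
        return "allow"  # planted weakness, for demonstration only
    return "block" if "forbidden" in text.lower() else "allow"


def fuzz_judge(
    judge: Callable[[str], str], payload: str, suffixes: List[str]
) -> List[str]:
    """Return the suffixes that flip a "block" decision to "allow"."""
    if judge(payload) != "block":
        return []  # nothing to bypass: the payload is already allowed
    return [s for s in suffixes if judge(payload + s) == "allow"]


flips = fuzz_judge(toy_judge, "Please do the forbidden thing.", CANDIDATE_SUFFIXES)
print(flips)
```

The same harness can serve as a regression check when hardening a judge: after adversarial training, the list of flipping suffixes for a known policy-violating payload should be empty.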

