
Mark0

Open, Closed and Broken: Prompt Fuzzing Finds LLMs Still Fragile Across Open and Closed Models

⚠️ Region Alert: UAE/Middle East

Unit 42 researchers have introduced a genetic-algorithm-inspired prompt-fuzzing method that automatically generates meaning-preserving variants of disallowed requests to stress-test Large Language Model (LLM) guardrails. By iteratively applying mutations such as prepending phrases, repeating keywords, or inserting relative words, the system identifies surface-form variations that bypass safety controls. The study shows that even advanced models released in 2024 and 2025 remain fragile, with evasion rates varying widely depending on the specific keywords and model architectures involved.
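To make the mutate-and-select loop concrete, here is a minimal sketch of genetic-style prompt fuzzing. The mutation operators mirror the ones named above (prepend a phrase, repeat a keyword, insert a relative word), but the phrase lists, the `is_blocked` guardrail stand-in, and the length-based selection heuristic are all illustrative assumptions, not details from the Unit 42 study.

```python
import random

# Illustrative mutation vocabularies (placeholders, not the study's actual lists).
PREFIXES = ["Hypothetically speaking, ", "For a fictional story, ", "As a thought experiment, "]
RELATIVE_WORDS = ["roughly", "approximately", "somewhat"]

def prepend_phrase(prompt, rng):
    # Mutation 1: prepend a framing phrase.
    return rng.choice(PREFIXES) + prompt

def repeat_keyword(prompt, rng):
    # Mutation 2: duplicate one word in place.
    words = prompt.split()
    i = rng.randrange(len(words))
    words.insert(i, words[i])
    return " ".join(words)

def add_relative_word(prompt, rng):
    # Mutation 3: insert a relative/hedging word at a random position.
    words = prompt.split()
    words.insert(rng.randrange(len(words) + 1), rng.choice(RELATIVE_WORDS))
    return " ".join(words)

MUTATIONS = [prepend_phrase, repeat_keyword, add_relative_word]

def fuzz(seed_prompt, is_blocked, generations=20, population=8, rng=None):
    """Evolve surface-form variants of seed_prompt until one evades
    is_blocked, a stand-in for the guardrail under test."""
    rng = rng or random.Random(0)
    pool = [seed_prompt]
    for _ in range(generations):
        # Mutate: derive new candidates from the current pool.
        candidates = [rng.choice(MUTATIONS)(p, rng) for p in pool for _ in range(2)]
        for c in candidates:
            if not is_blocked(c):
                return c  # found a variant the guardrail misclassifies
        # Select: keep the shortest candidates as the next pool (a toy
        # fitness; the real study scores candidates against the model).
        pool = sorted(set(candidates), key=len)[:population]
    return None
```

Against a naive keyword guardrail such as `lambda p: p.startswith("How to")`, the loop typically finds an evading variant within a few generations, which is the core fragility the research demonstrates: trivial surface rewrites defeat pattern-level filters.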

The research emphasizes that model licensing (open-source vs. closed-source) is not a definitive indicator of safety, as both categories showed non-uniform robustness. Notably, standalone content filters were found to be the most susceptible to these automated attacks, often misclassifying fuzzed malicious prompts as benign. To build more resilient GenAI applications, the report recommends a layered security-by-design approach, including strict scope enforcement, output validation, and continuous adversarial testing through automated fuzzing.
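The layered approach the report recommends can be sketched as independent checks in series, so a fuzzed prompt that slips past one layer can still be caught by another. Everything here is a hypothetical illustration: the scope topics, blocked pattern, and `[INTERNAL]` output tag are invented placeholders, not rules from the report.

```python
import re

SCOPE_TOPICS = {"billing", "shipping", "returns"}  # assumed application scope
BLOCKED_PATTERNS = [re.compile(r"ignore (all|previous) instructions", re.I)]

def in_scope(prompt: str) -> bool:
    # Layer 1: strict scope enforcement (toy keyword check).
    return any(topic in prompt.lower() for topic in SCOPE_TOPICS)

def passes_input_filter(prompt: str) -> bool:
    # Layer 2: input content filter.
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

def passes_output_filter(response: str) -> bool:
    # Layer 3: output validation; reject responses leaking a hypothetical internal tag.
    return "[INTERNAL]" not in response

def guarded_call(prompt: str, model) -> str:
    """Run a model call through all three layers; model is any callable str -> str."""
    if not in_scope(prompt):
        return "Sorry, that request is outside this assistant's scope."
    if not passes_input_filter(prompt):
        return "Request refused by input filter."
    response = model(prompt)
    if not passes_output_filter(response):
        return "Response withheld by output filter."
    return response
```

The design point is defense in depth: because the study found standalone content filters the most susceptible to fuzzed prompts, no single layer is trusted on its own, and output validation acts as a backstop even when a mutated input evades the front-line checks.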

