DEV Community

Mark0
Mark0

Posted on

Benchmarking Self-Hosted LLMs for Offensive Security

This article explores the effectiveness of self-hosted Large Language Models (LLMs) in offensive security scenarios, specifically benchmarking local models against the OWASP Juice Shop. Using a minimal harness and basic HTTP tools, the study evaluates models like gemma4:31b, qwen3.5:27b, and devstral-small-2:24b across challenges involving SQL injection, JWT manipulation, and path traversal.

The findings indicate that while local models excel at single-step exploit validation—reaching pass rates as high as 98.5%—they falter during complex, multi-step operations such as UNION-based extraction or algorithm confusion attacks. The research highlights a significant knowledge-execution gap, suggesting that model performance is heavily influenced by the surrounding agent framework and tool design rather than raw parameter count alone.


Read Full Article

Top comments (0)