Benchmarking Self-Hosted LLMs for Offensive Security

#cybersecurity #infosec #ai #llm

This article examines how effective self-hosted Large Language Models (LLMs) are in offensive security scenarios by benchmarking local models against OWASP Juice Shop. Using a minimal harness and basic HTTP tools, the study evaluates models such as gemma4:31b, qwen3.5:27b, and devstral-small-2:24b on challenges involving SQL injection, JWT manipulation, and path traversal.
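To make "a minimal harness and basic HTTP tools" concrete, here is a rough sketch of what such a loop might look like. It is not the article's harness: the tool name `http_get`, the `scripted_model` stand-in, the placeholder `localhost:3000` URLs, and the `flag{demo}` success marker are all assumptions for illustration, and the HTTP call is faked so the example runs without a target.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    """A single tool request emitted by the model (hypothetical shape)."""
    name: str
    args: Dict[str, str]


def fake_http_get(args: Dict[str, str]) -> str:
    # Stand-in for a real HTTP GET; a real harness would contact the target here.
    url = args["url"]
    return "200 OK: flag{demo}" if url.endswith("/solved") else "404 Not Found"


# Registry of tools the model is allowed to call (assumed, single-tool setup).
TOOLS: Dict[str, Callable[[Dict[str, str]], str]] = {"http_get": fake_http_get}


def scripted_model(transcript: List[str]) -> ToolCall:
    # Stand-in for an LLM: probes once, then requests the solving URL.
    if not transcript:
        return ToolCall("http_get", {"url": "http://localhost:3000/probe"})
    return ToolCall("http_get", {"url": "http://localhost:3000/solved"})


def run_episode(
    model: Callable[[List[str]], ToolCall],
    is_solved: Callable[[str], bool],
    max_steps: int = 5,
) -> bool:
    """Run one challenge attempt: ask the model, execute the tool, check success."""
    transcript: List[str] = []
    for _ in range(max_steps):
        call = model(transcript)
        tool = TOOLS.get(call.name)
        observation = tool(call.args) if tool else f"unknown tool: {call.name}"
        transcript.append(observation)
        if is_solved(observation):
            return True
    return False


if __name__ == "__main__":
    print(run_episode(scripted_model, lambda obs: "flag{" in obs))
```

The step budget (`max_steps`) matters in a design like this: a challenge that needs two tool calls fails under a budget of one, which is one way a harness choice can shape measured results independently of the model.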

The findings indicate that local models excel at single-step exploit validation, reaching pass rates as high as 98.5%, but falter on complex, multi-step operations such as UNION-based extraction or algorithm confusion attacks. The research highlights a significant gap between what the models know and what they can execute, and suggests that performance depends heavily on the surrounding agent framework and tool design rather than on raw parameter count alone.
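A pass rate of this kind is simply passed runs divided by total runs, grouped by model and challenge type. The sketch below shows one way to compute it. The record layout and the `model-a` entries are placeholder data invented for the example, not results from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (model, challenge kind, passed) -- hypothetical record layout.
Run = Tuple[str, str, bool]


def pass_rates(runs: Iterable[Run]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of passed runs per (model, challenge kind)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for model, kind, passed in runs:
        totals[(model, kind)][0] += int(passed)
        totals[(model, kind)][1] += 1
    return {key: p / n for key, (p, n) in totals.items()}


# Placeholder data only; these are NOT the benchmark's measurements.
example_runs = [
    ("model-a", "single-step", True),
    ("model-a", "single-step", True),
    ("model-a", "single-step", True),
    ("model-a", "single-step", False),
    ("model-a", "multi-step", True),
    ("model-a", "multi-step", False),
    ("model-a", "multi-step", False),
    ("model-a", "multi-step", False),
]

if __name__ == "__main__":
    for key, rate in sorted(pass_rates(example_runs).items()):
        print(key, f"{rate:.1%}")
```

Reporting single-step and multi-step challenges separately, rather than as one aggregate number, is what makes a gap between the two visible.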


DEV Community
