<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vext Labs Inc</title>
    <description>The latest articles on DEV Community by Vext Labs Inc (@vextlabs).</description>
    <link>https://dev.to/vextlabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3899344%2F8463ecc3-18d5-4bf2-84c7-403064cee388.png</url>
      <title>DEV Community: Vext Labs Inc</title>
      <link>https://dev.to/vextlabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vextlabs"/>
    <language>en</language>
    <item>
      <title>Trained, Not Prompted: Why Fine-Tuned Models Beat LLM Wrappers for Offensive Security</title>
      <dc:creator>Vext Labs Inc</dc:creator>
      <pubDate>Sun, 26 Apr 2026 20:42:44 +0000</pubDate>
      <link>https://dev.to/vextlabs/trained-not-prompted-why-fine-tuned-models-beat-llm-wrappers-for-offensive-security-5h03</link>
      <guid>https://dev.to/vextlabs/trained-not-prompted-why-fine-tuned-models-beat-llm-wrappers-for-offensive-security-5h03</guid>
      <description>&lt;h1&gt;The GPT Wrapper Problem&lt;/h1&gt;

&lt;p&gt;Here's a secret the "AI security" industry doesn't want you to know: most products in this space are thin wrappers around commercial LLM APIs. They send prompts like "You are a penetration tester. Analyze this HTTP response for vulnerabilities" to GPT-4 or Claude, parse the output, and call it autonomous pentesting.&lt;/p&gt;

&lt;p&gt;This approach has three fatal flaws.&lt;/p&gt;

&lt;h2&gt;Flaw 1: Generic Models Hallucinate in Security Contexts&lt;/h2&gt;

&lt;p&gt;Large language models trained on general internet data will confidently report vulnerabilities that don't exist. They've seen enough security blog posts to know what SQL injection &lt;em&gt;looks like&lt;/em&gt;, but they lack the specialized training to distinguish a real vulnerability from a false positive. In security, false positives aren't just annoying — they waste your team's time and erode trust in the tool.&lt;/p&gt;

&lt;h2&gt;Flaw 2: Prompt Engineering is Fragile&lt;/h2&gt;

&lt;p&gt;Prompt-based approaches break when the target doesn't match the template. A carefully crafted prompt for testing REST APIs will fail on GraphQL endpoints. A prompt designed for standard HTML forms won't handle React single-page applications. Real applications are messy, and prompt templates can't handle that messiness.&lt;/p&gt;

&lt;h2&gt;Flaw 3: No Learning Loop&lt;/h2&gt;

&lt;p&gt;When a prompt-wrapped LLM fails to find a vulnerability, nothing changes. The next engagement uses the same prompts with the same limitations. There is no mechanism for improvement.&lt;/p&gt;

&lt;h1&gt;VEXT's Approach: Fine-Tuned Offensive Models&lt;/h1&gt;

&lt;p&gt;VEXT takes a fundamentally different approach. Our agents are purpose-built for offensive security, trained on real exploit data from thousands of security engagements.&lt;/p&gt;

&lt;p&gt;What does this mean in practice?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attack patterns are in the weights, not the prompts.&lt;/strong&gt; Our injection workers don't need to be told what SQL injection looks like — they have internalized thousands of real injection patterns, bypass techniques, and exploitation chains from training data. This is the difference between reading about swimming and actually knowing how to swim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The feedback loop is real.&lt;/strong&gt; Every engagement generates training signal — 326K+ curated examples and growing. Brain v4 retrains continuously via RLAF. DPO alignment runs on validated vs false-positive pairs. When an agent discovers a new bypass technique, it propagates to all agents within the same run via Redis streams, and persists across runs via the VAULT knowledge graph.&lt;/p&gt;
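&lt;p&gt;To make the propagation idea concrete, here is a minimal in-process sketch of that pattern: an agent publishes a newly discovered bypass technique to a shared stream, and every other agent in the run consumes it. The class below fakes Redis stream semantics (an append-only log plus per-consumer offsets); a real deployment would use &lt;code&gt;redis.Redis().xadd(...)&lt;/code&gt; / &lt;code&gt;xread(...)&lt;/code&gt; instead. All names here (&lt;code&gt;TechniqueBus&lt;/code&gt;, the field layout) are illustrative, not VEXT's actual API.&lt;/p&gt;

```python
# In-process stand-in for a Redis stream: agents publish discovered
# techniques, other agents read everything they have not yet seen.
# Production would swap this for redis.Redis().xadd / xread.

class TechniqueBus:
    def __init__(self):
        self._log = []        # append-only list of entries, like a stream
        self._offsets = {}    # consumer name -> next index to read

    def publish(self, agent, technique, payload):
        """Append a technique entry, analogous to XADD."""
        entry = {"agent": agent, "technique": technique, "payload": payload}
        self._log.append(entry)
        return len(self._log) - 1   # entry id (stream position)

    def consume(self, consumer):
        """Return unseen entries for this consumer, analogous to XREAD."""
        start = self._offsets.get(consumer, 0)
        fresh = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return fresh

bus = TechniqueBus()
bus.publish("injection-worker-3", "waf-bypass", "unicode-overlong-encoding")
# A different agent in the same run picks up the technique immediately:
seen = bus.consume("xss-worker-1")
```

&lt;p&gt;Cross-run persistence would layer on top of this: entries that survive validation get written to the knowledge graph rather than expiring with the stream.&lt;/p&gt;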

&lt;p&gt;&lt;strong&gt;Three-tier ML stack.&lt;/strong&gt; Brain v4 (6M params, 15ms) handles tool selection via GNN + MCTS. Specialist-7B (7B params, 200ms) handles tool output parsing and payload generation. Sentry v4 (100B class, 2s) handles complex hypothesis generation and novel exploit reasoning. Six-stage training: SFT, DPO, GRPO, RLAF, self-play, continuous learning.&lt;/p&gt;
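&lt;p&gt;The tiering logic can be sketched as a routing function: send each task to the cheapest tier that can handle it, and escalate only when it cannot. The tier names and figures come from the description above; the routing rule and confidence threshold are assumptions for illustration, not VEXT's actual policy.&lt;/p&gt;

```python
# Hypothetical sketch of three-tier routing: cheap Brain v4 for routine
# tool selection, Specialist-7B for structured security tasks, and the
# large Sentry v4 model reserved for novel reasoning.

TIERS = {
    "brain_v4":      {"params": "6M",   "latency_ms": 15},
    "specialist_7b": {"params": "7B",   "latency_ms": 200},
    "sentry_v4":     {"params": "100B", "latency_ms": 2000},
}

def route(task_kind, brain_confidence):
    """Pick the cheapest tier that can handle the task."""
    if task_kind == "tool_selection" and brain_confidence >= 0.9:
        return "brain_v4"        # fast GNN + MCTS path
    if task_kind in ("parse_output", "generate_payload", "map_compliance"):
        return "specialist_7b"   # structured security tasks
    return "sentry_v4"           # novel hypotheses, complex exploit chains

tier = route("parse_output", brain_confidence=0.7)
```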

&lt;h1&gt;Why This Matters for Your Security&lt;/h1&gt;

&lt;p&gt;The difference between a prompted model and a fine-tuned model is the difference between a contractor who read the manual yesterday and an expert who has done the job a thousand times. Both can follow instructions. Only one has intuition.&lt;/p&gt;

&lt;p&gt;When your next compliance audit requires a penetration test, ask your vendor one question: &lt;strong&gt;are your models trained on real exploit data, or are they prompting a general-purpose LLM?&lt;/strong&gt; The answer tells you everything you need to know about the quality of findings you'll receive.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>security</category>
      <category>llm</category>
    </item>
    <item>
      <title>VEXT Specialist-7B: How a 7B Model Beats Frontier AI on Security Benchmarks</title>
      <dc:creator>Vext Labs Inc</dc:creator>
      <pubDate>Sun, 26 Apr 2026 20:41:56 +0000</pubDate>
      <link>https://dev.to/vextlabs/vext-specialist-7b-how-a-7b-model-beats-frontier-ai-on-security-benchmarks-240f</link>
      <guid>https://dev.to/vextlabs/vext-specialist-7b-how-a-7b-model-beats-frontier-ai-on-security-benchmarks-240f</guid>
      <description>&lt;h1&gt;VEXT Specialist-7B: How a 7B Model Beats Frontier AI on Security Benchmarks&lt;/h1&gt;

&lt;p&gt;The conventional wisdom in AI is that bigger equals better. More parameters, more training data, more compute. For general tasks, this holds. For offensive security, it does not. Specialist-7B proves that a purpose-trained 7B model can outperform frontier models 10-100x its size on the tasks that actually matter for penetration testing.&lt;/p&gt;

&lt;h2&gt;The Benchmark Results&lt;/h2&gt;

&lt;p&gt;We evaluated Specialist-7B against Claude Opus, GPT-4o, and Llama 3.1 70B across eight security-specific benchmark categories. The results were decisive.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark Category&lt;/th&gt;
&lt;th&gt;Specialist-7B (7B)&lt;/th&gt;
&lt;th&gt;Claude Opus&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;Llama 3.1 70B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Practical Pentesting Tasks&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;90%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance Mapping Accuracy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web Security Exploits&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;79%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Output Parsing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payload Generation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;87%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;54%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assessment Plan Sequencing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;td&gt;63%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False Positive Detection&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;79%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall Security Score&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;63%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Specialist-7B achieves a 92% overall security score compared to Claude Opus at 80%, GPT-4o at 75%, and Llama 3.1 70B at 63%. On compliance mapping specifically, Specialist-7B achieves perfect 100% accuracy — correctly mapping every finding to the right PCI DSS 4.0, SOC 2, HIPAA, GDPR, ISO 27001, NIST CSF, and FedRAMP controls.&lt;/p&gt;

&lt;h2&gt;Why Smaller Beats Bigger&lt;/h2&gt;

&lt;p&gt;Frontier models are trained on internet-scale general text. They know a little about everything. Specialist-7B is trained on 326K+ examples from real penetration testing engagements — tool outputs, exploit chains, vulnerability reports, compliance mappings, and assessment plans.&lt;/p&gt;

&lt;p&gt;This specialization creates three advantages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Domain-specific pattern recognition.&lt;/strong&gt; Specialist-7B has seen thousands of real nmap outputs, nuclei scan results, and sqlmap exploitation logs. It does not need to reason from first principles about what a port scan result means — it has internalized the patterns. This is why tool output parsing hits 95% accuracy versus 82% for Claude Opus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Security-aware false positive filtering.&lt;/strong&gt; General models frequently hallucinate vulnerabilities because they pattern-match against security blog posts rather than real exploitation data. Specialist-7B was fine-tuned with DPO on validated-vs-false-positive pairs from real bug bounty programs, giving it 94% accuracy on false positive detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Compliance control internalization.&lt;/strong&gt; Mapping a finding to the correct PCI DSS 4.0 control requires deep knowledge of the control framework — not just keyword matching. Specialist-7B was trained on thousands of auditor-validated compliance mappings, achieving 100% accuracy where larger models score 74-89%.&lt;/p&gt;

&lt;h2&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;Specialist-7B sits in the middle tier of VEXT's three-tier AI architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1 — Brain v4 (5ms, 80% of decisions)&lt;/strong&gt;: A 6M parameter neural engine using GNN + Multi-Head Q-Net + MCTS. Handles tool selection and attack routing at 99.7% accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2 — Specialist-7B (200ms, 15% of decisions)&lt;/strong&gt;: The workhorse. Tool output parsing, payload generation, assessment plan sequencing, compliance mapping. Fast enough for structured tasks, smart enough for complex security reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 — Sentry v4 (2s, 5% of decisions)&lt;/strong&gt;: A 100B class model for complex hypothesis generation, novel exploit chain reasoning, and deep analysis. Called only when the smaller tiers cannot handle the task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tiered approach reduces inference cost by 95% (from $251K/month on Bedrock to $12K/month self-hosted) while maintaining or improving quality.&lt;/p&gt;
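&lt;p&gt;The headline numbers check out arithmetically. Using the decision shares and per-tier latencies quoted in the architecture list:&lt;/p&gt;

```python
# Sanity-check the quoted figures: expected per-decision latency from the
# decision shares, and the claimed cost reduction from the monthly bills.

shares  = {"brain_v4": 0.80, "specialist_7b": 0.15, "sentry_v4": 0.05}
latency = {"brain_v4": 5,    "specialist_7b": 200,  "sentry_v4": 2000}  # ms

expected_ms = sum(shares[t] * latency[t] for t in shares)
# 0.80*5 + 0.15*200 + 0.05*2000 = ~134 ms expected per decision

cost_reduction = 1 - 12_000 / 251_000
# ~0.952, i.e. the quoted ~95% reduction from $251K/mo to $12K/mo
```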

&lt;h2&gt;Training Pipeline&lt;/h2&gt;

&lt;p&gt;Specialist-7B was trained through a multi-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Base model selection&lt;/strong&gt;: Started from a 7B parameter base model selected for strong code understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SFT (Supervised Fine-Tuning)&lt;/strong&gt;: 326K+ examples from real security engagements — tool outputs, assessment plans, compliance mappings, vulnerability reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DPO (Direct Preference Optimization)&lt;/strong&gt;: 2,049 validated-vs-false-positive pairs from real bug bounty findings, teaching the model to distinguish real vulnerabilities from noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task-specific fine-tuning&lt;/strong&gt;: Separate fine-tuning rounds for tool output parsing, payload generation, and compliance mapping using domain-specific datasets&lt;/li&gt;
&lt;/ol&gt;
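&lt;p&gt;The DPO stage above is worth making concrete: each training example pairs a validated finding (preferred) with a false positive (rejected) for the same evidence. The sketch below uses the common &lt;code&gt;prompt&lt;/code&gt;/&lt;code&gt;chosen&lt;/code&gt;/&lt;code&gt;rejected&lt;/code&gt; convention used by DPO trainers such as HuggingFace TRL; the example content is invented, not drawn from VEXT's dataset.&lt;/p&gt;

```python
# Build one DPO preference pair: the "chosen" completion is the report
# confirmed by reproduction, the "rejected" one is the plausible-looking
# hallucination the model should learn to avoid.

def make_dpo_pair(evidence, validated_report, false_positive_report):
    return {
        "prompt": "Evidence:\n" + evidence + "\nIs this exploitable? Report:",
        "chosen": validated_report,        # confirmed by reproduction
        "rejected": false_positive_report, # plausible-looking but wrong
    }

pair = make_dpo_pair(
    evidence="id=1' OR '1'='1 returned an identical page, no error delta",
    validated_report="Not exploitable: response is invariant under injection.",
    false_positive_report="SQL injection confirmed via error-based probe.",
)
```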

&lt;p&gt;The training data comes from real penetration testing across 17 bug bounty programs — not synthetic data, not CTF solutions, not blog post examples.&lt;/p&gt;

&lt;h2&gt;What Specialist-7B Does&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tool output parsing&lt;/strong&gt;: Feed Specialist-7B raw output from nmap, nuclei, sqlmap, burp, gobuster, or any of 24+ supported security tools. Get structured findings with severity, CWE mapping, and recommended next steps.&lt;/p&gt;
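&lt;p&gt;Here is what "structured findings" means in practice: raw tool output in, normalized finding dicts out. A deterministic regex parser stands in for Specialist-7B below; the schema fields (&lt;code&gt;severity&lt;/code&gt;, &lt;code&gt;next_steps&lt;/code&gt;) mirror the ones described above, but the exact field names are assumptions.&lt;/p&gt;

```python
import re

# Toy parser for nmap port lines, standing in for the model: it shows the
# target shape of a structured finding, not the model's actual behavior.
NMAP_LINE = re.compile(r"^(\d+)/tcp\s+open\s+(\S+)", re.MULTILINE)

def parse_nmap(raw):
    findings = []
    for port, service in NMAP_LINE.findall(raw):
        findings.append({
            "port": int(port),
            "service": service,
            "severity": "info",
            "next_steps": ["service fingerprinting", "version-specific checks"],
        })
    return findings

raw = """\
22/tcp   open  ssh
80/tcp   open  http
3306/tcp open  mysql
"""
findings = parse_nmap(raw)   # three structured findings
```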

&lt;p&gt;&lt;strong&gt;Payload generation&lt;/strong&gt;: Context-aware payload crafting for SQL injection, XSS, SSRF, command injection, deserialization, and IDOR vectors. Specialist-7B considers WAF presence, technology stack, and prior failed attempts.&lt;/p&gt;
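&lt;p&gt;A toy version of that context-awareness: filter candidate SQLi payloads by WAF presence and drop anything that already failed against this target. The candidate strings and &lt;code&gt;waf_safe&lt;/code&gt; flags are illustrative only; the real system generates payloads rather than selecting from a fixed list.&lt;/p&gt;

```python
# Candidate payloads with a hypothetical flag for WAF evasion.
CANDIDATES = [
    {"payload": "' OR 1=1--",         "waf_safe": False},
    {"payload": "'/**/OR/**/1=1--",   "waf_safe": True},
    {"payload": "%27%20OR%201%3D1--", "waf_safe": True},
]

def next_payloads(waf_present, failed):
    """Return payloads worth trying, given context and prior failures."""
    usable = [c for c in CANDIDATES if c["payload"] not in failed]
    if waf_present:
        usable = [c for c in usable if c["waf_safe"]]
    return [c["payload"] for c in usable]

# Behind a WAF, after the comment-obfuscated variant already failed,
# only the URL-encoded candidate remains:
remaining = next_payloads(waf_present=True, failed={"'/**/OR/**/1=1--"})
```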

&lt;p&gt;&lt;strong&gt;Assessment plan sequencing&lt;/strong&gt;: Given reconnaissance data, Specialist-7B generates prioritized assessment plans with dependency ordering — which tests to run first, which findings to chain together, and which kill chains to activate.&lt;/p&gt;
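&lt;p&gt;"Dependency ordering" here is essentially a topological sort: a test runs only after the steps it depends on. The step names and dependencies below are hypothetical; Kahn's algorithm does the ordering.&lt;/p&gt;

```python
from collections import deque

def sequence(deps):
    """deps: step -> set of prerequisite steps. Returns an execution order."""
    indegree = {s: len(d) for s, d in deps.items()}
    dependents = {s: [] for s in deps}
    for step, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(step)
    ready = deque(s for s, n in indegree.items() if n == 0)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

plan = sequence({
    "port_scan": set(),
    "dir_brute": {"port_scan"},
    "auth_test": {"dir_brute"},
    "sqli_probe": {"dir_brute"},
    "chain_sqli_to_auth": {"sqli_probe", "auth_test"},
})
# Recon runs first; the kill chain activates only once both prerequisites pass.
```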

&lt;p&gt;&lt;strong&gt;Compliance mapping&lt;/strong&gt;: Submit any finding and get it mapped to the correct controls across PCI DSS 4.0, SOC 2, HIPAA, GDPR, ISO 27001, NIST CSF, and FedRAMP with 100% accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False positive filtering&lt;/strong&gt;: Two-pass validation where Specialist-7B evaluates evidence quality, reproduction reliability, and exploit chain viability to filter false positives before they reach the report.&lt;/p&gt;
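&lt;p&gt;The two-pass idea can be sketched in a few lines: pass one scores evidence quality, pass two checks reproducibility, and a finding survives only if both pass. The scoring signals, thresholds, and &lt;code&gt;replay&lt;/code&gt; hook below are invented stand-ins for model calls.&lt;/p&gt;

```python
def evidence_score(finding):
    # Pass 1: does the evidence actually demonstrate impact?
    signals = ("response_delta", "error_message", "data_extracted")
    return sum(1 for s in signals if finding.get(s))

def reproduces(finding, attempts=3):
    # Pass 2: replay must succeed on a majority of attempts.
    successes = sum(finding["replay"]() for _ in range(attempts))
    return successes * 2 > attempts

def validate(findings):
    """Keep only findings with strong evidence that reliably reproduce."""
    kept = []
    for f in findings:
        if evidence_score(f) >= 2 and reproduces(f):
            kept.append(f)
    return kept

real  = {"response_delta": True, "data_extracted": True, "replay": lambda: True}
noisy = {"error_message": True, "replay": lambda: False}
kept = validate([real, noisy])   # only the reproducible finding survives
```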

&lt;h2&gt;Open Source&lt;/h2&gt;

&lt;p&gt;Specialist-7B is available on HuggingFace under the Apache 2.0 license. We believe security tooling improves when the community can inspect, contribute to, and build on top of the models. Download it, fine-tune it for your use case, and integrate it into your pipeline.&lt;/p&gt;

&lt;p&gt;The model weights, evaluation benchmarks, and training methodology documentation are all open. The training data itself is proprietary (it comes from real engagements), but the model is fully open-weight.&lt;/p&gt;

&lt;h2&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;The cybersecurity industry has been flooded with "AI security tools" that are thin wrappers around commercial LLM APIs. They send a prompt to GPT-4 asking it to "analyze this HTTP response for vulnerabilities" and call it autonomous pentesting.&lt;/p&gt;

&lt;p&gt;Specialist-7B proves that purpose-trained models — even small ones — dramatically outperform general-purpose frontier models on real security tasks. A 7B model running at 200ms on a single GPU achieves 90% on practical pentesting tasks where Claude Opus scores 72% and GPT-4o scores 68%.&lt;/p&gt;

&lt;p&gt;The lesson: for offensive security, training on real exploit data matters more than parameter count. Purpose-built beats general-purpose, every time.&lt;/p&gt;

&lt;h2&gt;Get Started&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Download Specialist-7B&lt;/strong&gt;: Available on HuggingFace (Apache 2.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try VEXT Platform&lt;/strong&gt;: &lt;a href="https://tryvext.com/access" rel="noopener noreferrer"&gt;https://tryvext.com/access&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the benchmarks&lt;/strong&gt;: &lt;a href="https://tryvext.com/benchmarks" rel="noopener noreferrer"&gt;https://tryvext.com/benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore the architecture&lt;/strong&gt;: &lt;a href="https://tryvext.com/technology" rel="noopener noreferrer"&gt;https://tryvext.com/technology&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
