<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff J. Bowie</title>
    <description>The latest articles on DEV Community by Jeff J. Bowie (@cybersecurity_jeff).</description>
    <link>https://dev.to/cybersecurity_jeff</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3892822%2Ff3a68e93-4e61-47de-95ae-e01977c2adf5.png</url>
      <title>DEV Community: Jeff J. Bowie</title>
      <link>https://dev.to/cybersecurity_jeff</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cybersecurity_jeff"/>
    <language>en</language>
    <item>
      <title>Evaluating Open-Weight LLMs for Phishing Simulation and Red Teaming</title>
      <dc:creator>Jeff J. Bowie</dc:creator>
      <pubDate>Wed, 22 Apr 2026 16:09:22 +0000</pubDate>
      <link>https://dev.to/cybersecurity_jeff/evaluating-open-weight-llms-for-phishing-simulation-and-red-teaming-4m5m</link>
      <guid>https://dev.to/cybersecurity_jeff/evaluating-open-weight-llms-for-phishing-simulation-and-red-teaming-4m5m</guid>
      <description>&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: This content is for educational and authorized security testing in controlled environments only. Do not use any techniques described here against systems you do not own or lack explicit permission to test. Unauthorized use is strictly prohibited.&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Scenario:&lt;/strong&gt; You're tasked by your CISO with performing an ad-hoc &lt;strong&gt;phishing engagement&lt;/strong&gt; for a client with over 1,000 users...
&lt;/h4&gt;

&lt;p&gt;It's easy to hypothesize inbound e-mail filtering logic: &lt;em&gt;'If more than 10 e-mails with the exact same body are sent to X users within X seconds, flag the e-mail as malicious, and notify the SIEM!'&lt;/em&gt;&lt;/p&gt;
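&lt;p&gt;That hypothetical rule can be sketched in a few lines of Python. The names and thresholds below are illustrative only, not a real SIEM integration:&lt;/p&gt;

```python
import hashlib
import time
from collections import defaultdict

# Hypothetical thresholds for the rule described above.
MAX_DUPLICATES = 10   # identical bodies tolerated...
WINDOW_SECONDS = 60   # ...within this sliding window

seen = defaultdict(list)  # body hash -> arrival timestamps

def is_suspicious(body: str, now=None) -> bool:
    """Flag an e-mail when too many identical bodies arrive too quickly."""
    now = time.time() if now is None else now
    digest = hashlib.sha256(body.encode()).hexdigest()
    arrivals = seen[digest]
    # Drop arrivals that have aged out of the window.
    arrivals[:] = [t for t in arrivals if now - t <= WINDOW_SECONDS]
    arrivals.append(now)
    return len(arrivals) > MAX_DUPLICATES
```

&lt;p&gt;Note that this defense keys on &lt;em&gt;identical&lt;/em&gt; bodies: every polymorphic lure hashes differently, which is exactly the gap this article exploits.&lt;/p&gt;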

&lt;p&gt;During a phishing engagement, we want to cover a large surface area while simultaneously blending in with routine traffic. &lt;strong&gt;Large language models&lt;/strong&gt; can generate &lt;em&gt;polymorphic&lt;/em&gt; phishing lures: each message unique in wording, so no two bodies match exactly.&lt;/p&gt;

&lt;p&gt;It's impossible to know &lt;em&gt;exactly&lt;/em&gt; what occurs in the environment on the other side of an engagement, but we can use open-source &lt;em&gt;and&lt;/em&gt; human (HUMINT) &lt;a href="https://jeffjbowie.us/posts/asynchronous-intelligence-gathering-with-python/" rel="noopener noreferrer"&gt;&lt;strong&gt;intelligence gathering&lt;/strong&gt;&lt;/a&gt; to improve our odds.&lt;/p&gt;

&lt;p&gt;Organizations often rely on a combination of one or more of the following services: AWS, Microsoft Azure, Google Cloud Platform (GCP), Dropbox, or Slack. Crafting your lure guided by the design, phrasing, and timing of legitimate messages from a major provider is often an &lt;em&gt;easy&lt;/em&gt; in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faulavdpw94r73oxlfys9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faulavdpw94r73oxlfys9.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Seasoned developers using frontier labs' LLMs have reported instances of a model suddenly 'playing stupid' or being throttled. For consistency and reproducibility, open-weight models are preferable in &lt;strong&gt;red team&lt;/strong&gt; workflows.&lt;/p&gt;

&lt;p&gt;We will be working with &lt;strong&gt;open-weight models&lt;/strong&gt;: models whose trained parameters (weights) are publicly released and available for download. Although the output is non-deterministic, the underlying weights remain fixed for a given version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;At the time of this writing, &lt;strong&gt;artificial intelligence&lt;/strong&gt; models span a plethora of modalities, yet are typically classified as either &lt;strong&gt;Generative&lt;/strong&gt; &lt;em&gt;or&lt;/em&gt; &lt;strong&gt;Agentic&lt;/strong&gt;. For our purposes, let's head over to &lt;a href="https://huggingface.co" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; to find a Text Generation model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9epqtv39u9ocjs40a4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9epqtv39u9ocjs40a4b.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, there are &lt;em&gt;352,721&lt;/em&gt; models for generating text. Examining a model card will let you find various quantizations: reduced-precision versions of the model for use on devices with less compute power.&lt;/p&gt;
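&lt;p&gt;Quantization itself is easy to picture. Here is a toy symmetric int8 scheme, purely illustrative; GGUF's K-quant formats are considerably more sophisticated:&lt;/p&gt;

```python
def quantize_int8(weights):
    """Map float weights onto signed 8-bit integers (symmetric scheme)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 representation."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered weight is close to the original,
# at roughly a quarter of the storage of float32.
```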

&lt;p&gt;Let's download &lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/a&gt; and a quantized 0.6B-parameter version of Qwen3, &lt;a href="https://huggingface.co/unsloth/Qwen3-0.6B-GGUF" rel="noopener noreferrer"&gt;Qwen3-0.6B-Q6_K.gguf&lt;/a&gt; (495 MB), saving the file to your local workspace.&lt;/p&gt;

&lt;p&gt;Once you've installed llama.cpp and downloaded the &lt;strong&gt;GGUF&lt;/strong&gt;, initiate a CLI session with the following command:&lt;br&gt;
&lt;code&gt;./llama-cli --model Qwen3-0.6B-Q6_K.gguf&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw80immbxxo5hp69ak67h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw80immbxxo5hp69ak67h.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxa9in45za7zlk0e13pzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxa9in45za7zlk0e13pzk.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ah! We've successfully created our lure. The only issue is that we have over 1,000 users to target and a limited attack window. Let's use a &lt;code&gt;for&lt;/code&gt; loop in Python 3 to repeatedly prompt the model to generate our lures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; Since Qwen3 is a Reasoning model, we will need to instruct our script to omit the content of &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags, while looping over the same prompt to generate unique variations of our lure:&lt;/p&gt;

&lt;h3&gt;
  
  
  Utilization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_cpp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Llama&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Create a Llama instance, disabling verbosity.
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Llama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./Qwen3-0.6B-Q6_K.gguf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Our templates' template. 
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a friendly, convincing e-mail template using descriptive words, about an issue with an account lock-out, and advise the recipient to take action immediately by clicking on a link.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Create 10 uniquely-worded phishing lures.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_chat_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="c1"&gt;# Disable 'thinking' mode.
&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/no_think&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;

            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="c1"&gt;# A parameter that controls the randomness, creativitiy, and predictability of generated text. Lower temperatures (0.0 - 0.3) are more conservative and deterministic, while high temperatures (0.7-1.0+) generate more varied, creative, or chaotic output.  
&lt;/span&gt;        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/?think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; While I'm using a &lt;strong&gt;Regular Expression&lt;/strong&gt; to clean the output here, in a production pipeline you'd want to use &lt;strong&gt;stop sequences&lt;/strong&gt; like &lt;code&gt;['&amp;lt;/think&amp;gt;']&lt;/code&gt; to save on inference costs. Artificial Intelligence &lt;em&gt;requires&lt;/em&gt; effective &lt;strong&gt;resource management&lt;/strong&gt; to generate a solid ROI.&lt;/p&gt;
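&lt;p&gt;One caveat worth spelling out: if thinking is ever left &lt;em&gt;enabled&lt;/em&gt;, the tags are not the only thing to strip — the reasoning between them must go too. A small self-contained helper (no model required) covering both cases:&lt;/p&gt;

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks, plus any stray unpaired tags."""
    # DOTALL lets '.' span newlines inside the reasoning block.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Clean up an unpaired tag left behind by truncated output.
    text = re.sub(r"</?think>", "", text)
    return text.strip()
```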

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3nrwpl897ifj24m2505.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3nrwpl897ifj24m2505.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this juncture, we can choose to write generated lures to a text file with &lt;code&gt;open()&lt;/code&gt;, store them in a &lt;code&gt;MySQL&lt;/code&gt; database, or, my personal favorite, automate &lt;a href="https://docs.getgophish.com/api-documentation/templates" rel="noopener noreferrer"&gt;template creation&lt;/a&gt; for &lt;a href="https://docs.getgophish.com/api-documentation/templates" rel="noopener noreferrer"&gt;GoPhish!&lt;/a&gt;&lt;/p&gt;
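&lt;p&gt;As a sketch of that last option, each generated lure can be wrapped into the JSON body GoPhish's templates endpoint expects. The field names follow the GoPhish API documentation linked above; the host and API key in the comment are placeholders:&lt;/p&gt;

```python
import json

def gophish_template_payload(name: str, subject: str, html: str) -> str:
    """Build the JSON body for POST /api/templates/ on a GoPhish server."""
    return json.dumps({
        "name": name,
        "subject": subject,
        "html": html,
    })

payload = gophish_template_payload(
    name="Lure 001",
    subject="Action required: account locked",
    html="<p>Your account has been locked...</p>",
)
# POST this to https://<gophish-host>:3333/api/templates/ with the
# 'Authorization: <api-key>' header (e.g. via requests or urllib).
```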

&lt;p&gt;© 2026 &lt;a href="https://jeffjbowie.us" rel="noopener noreferrer"&gt;Cybersecurity &amp;amp; DFIR: An Adversarial Simulation Perspective&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cybersecurity</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
