DEV Community

Jeff J. Bowie

Evaluating Open-Weight LLMs for Phishing Simulation and Red Teaming

Disclaimer: This content is for educational and authorized security testing in controlled environments only. Do not use any techniques described here against systems you do not own or lack explicit permission to test. Unauthorized use is strictly prohibited.

Introduction

Scenario: Your CISO tasks you with performing an ad-hoc phishing engagement for a client with over 1,000 users...

It's easy to imagine the inbound e-mail filtering logic a defender might write: 'If more than 10 e-mails with the exact same body are sent to X users within Y seconds, flag the e-mail as malicious and notify the SIEM!'
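That hypothetical rule can be sketched in a few lines of Python. This is a toy illustration; the thresholds, function name, and in-memory store are mine, not a production SIEM rule:

```python
import hashlib
import time
from collections import defaultdict

# Hypothetical thresholds: flag when an identical body hits 10+ recipients in 60 s.
THRESHOLD = 10
WINDOW_SECONDS = 60

_seen = defaultdict(list)  # body hash -> list of arrival timestamps

def is_suspicious(body, now=None):
    """Return True once an identical body meets the threshold inside the window."""
    now = time.time() if now is None else now
    digest = hashlib.sha256(body.encode()).hexdigest()
    _seen[digest].append(now)
    # Keep only arrivals inside the sliding window.
    _seen[digest] = [t for t in _seen[digest] if now - t <= WINDOW_SECONDS]
    return len(_seen[digest]) >= THRESHOLD
```

Because identical bodies hash identically, this rule catches mass-mailed duplicates; uniquely worded lures produce different hashes and sail through, which is exactly the gap polymorphic generation exploits.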

During a phishing engagement, we want to cover a large surface area while blending in with routine traffic. Large language models make this possible by generating polymorphic phishing lures: the same pretext, reworded on every send.

It's impossible to know exactly what occurs in the environment on the other side of an engagement, but we can use open-source intelligence (OSINT) and human intelligence (HUMINT) gathering to improve our odds.

Organizations often rely on one or more of the following services: AWS, Microsoft Azure, Google Cloud Platform (GCP), Dropbox, or Slack. Crafting your lure around the design, phrasing, and timing of legitimate messages from a major provider is often an easy way in.

Seasoned developers using frontier labs' LLMs have reported instances of a model suddenly 'playing stupid' or being throttled. For consistency and reproducibility, open-weight models are preferable in red team workflows.

We will be working with open-weight models: those whose trained parameters (weights) are publicly released and available for download. Although the output is non-deterministic, the underlying weights remain fixed for a given version.

Configuration

At the time of this writing, artificial intelligence models span a plethora of modalities, yet are typically classified as either generative or agentic. For our purposes, let's head over to Hugging Face to find a Text Generation model.

As you can see, there are over 352,000 models for generating text. Examining a model card will let you find various quantizations: reduced-precision versions of the model for use on devices with less compute power.

Let's download llama.cpp and a quantized 0.6B-parameter version of Qwen3, Qwen3-0.6B-Q6_K.gguf (495 MB), saving the file to your local workspace.

Once you've installed llama.cpp and downloaded the GGUF, initiate a CLI session with the following command:
./llama-cli --model Qwen3-0.6B-Q6_K.gguf

Ah! We've successfully created our lure. The only issue is that we have over 1,000 users to target and a limited attack window. Let's use a loop in Python 3 to repeatedly prompt the model and generate our lures.

Note: Since Qwen3 is a reasoning model, our script will need to strip the content of <think> tags from the output while looping over the same prompt to generate unique variations of our lure:

Utilization

from llama_cpp import Llama
import re

# Create a Llama instance, disabling verbosity.
llm = Llama(
    model_path = "./Qwen3-0.6B-Q6_K.gguf",
    verbose = False
)

# Our templates' template. 
prompt = "Write a friendly, convincing e-mail template using descriptive words, about an issue with an account lock-out, and advise the recipient to take action immediately by clicking on a link."

# Create 10 uniquely-worded phishing lures.
for i in range(10):
    response = llm.create_chat_completion(
        messages=[
            # Disable 'thinking' mode.
            {"role": "system", "content": "/no_think"},

            {"role": "user", "content": prompt},
        ],
        # Temperature controls the randomness, creativity, and predictability of
        # generated text. Lower values (0.0-0.3) are more conservative and
        # deterministic; higher values (0.7-1.0+) produce more varied, creative,
        # or chaotic output.
        temperature=0.9,
    )

    content = response['choices'][0]['message']['content'].strip()
    # Remove the <think>...</think> block (empty under /no_think) and any stray tags.
    cleaned = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    cleaned = re.sub(r"</?think>", "", cleaned).strip()

    print(cleaned, "\n")
    print("-" * 40)


Note: While I'm using a regular expression to clean the output here, in a production pipeline you'd want to use stop sequences (e.g. ['</think>']) to save on inference costs. Artificial intelligence requires effective resource management to generate a solid ROI.
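The stop-sequence approach can be sketched as follows. The `stop` keyword is accepted by `create_chat_completion` in llama-cpp-python; the small helper below mirrors the same truncation client-side (the helper name is mine, a sketch rather than part of the library):

```python
def truncate_at_stop(text, stops=("</think>",)):
    """Client-side equivalent of a stop sequence: cut at the first stop marker."""
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.strip()

# Server-side, llama-cpp-python can do this during inference instead:
# response = llm.create_chat_completion(
#     messages=[{"role": "user", "content": prompt}],
#     stop=["</think>"],  # generation halts at the marker; no wasted tokens after it
#     temperature=0.9,
# )
```

The server-side form is cheaper because the model simply stops sampling at the marker instead of generating tokens you'd throw away afterward.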

At this juncture, we can choose to write generated lures to a text file with open(), store them in a MySQL database, or my personal favorite: automatic template creation for GoPhish!
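As a sketch of the GoPhish route: GoPhish exposes a REST API for creating e-mail templates. The payload builder below is a minimal illustration; the endpoint path, `Bearer` auth header, and admin URL in the comments are assumptions based on a default GoPhish install, and the API key is a placeholder:

```python
import json

GOPHISH_URL = "https://localhost:3333"  # default GoPhish admin URL (assumption)

def build_gophish_template(name, subject, body_html):
    """Build the JSON payload shape GoPhish expects for a new e-mail template."""
    return {
        "name": name,
        "subject": subject,
        "html": body_html,
        "text": "",  # plain-text alternative, left empty in this sketch
    }

# Posting it is one authenticated request (hypothetical API key):
# import requests
# requests.post(
#     f"{GOPHISH_URL}/api/templates/",
#     headers={"Authorization": "Bearer YOUR_API_KEY"},
#     json=build_gophish_template("Lockout Lure 01", "Action required", cleaned),
#     verify=False,  # GoPhish ships with a self-signed certificate by default
# )
```

Feeding each `cleaned` lure from the generation loop into a uniquely named template gives you one GoPhish template per variation.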

© 2026 Cybersecurity & DFIR: An Adversarial Simulation Perspective
