Static jailbreak lists are dead.
Every time a model provider patches their safety filters, your entire payload library becomes obsolete. Manual red teaming doesn't scale. And most AI security tools are just payload databases with a UI.
So I built something different.
The Problem
I tested 6 major LLM deployments last year. Every single one had a bypass within 5 prompts. The problem isn't that LLMs are insecure — it's how the industry tests them.
Most red teaming today looks like this:
- Copy a jailbreak from a GitHub list
- Paste it into the target
- If it works, report it
- If it doesn't, try the next one
That's not security testing. That's pattern matching. And it stops working the moment the model gets patched.
The Idea
What if adversarial prompts could evolve?
Not manually crafted. Not randomly generated. Actually evolved — like organisms under selection pressure.
The strong prompts survive. The weak ones die. The survivors mutate and reproduce. Each generation gets better at bypassing the target's specific defenses.
That's the core idea behind Basilisk.
How It Works
Basilisk introduces Smart Prompt Evolution for Natural Language (SPE-NL) — a genetic algorithm that treats adversarial prompts as organisms in a population.
Selection: Each prompt is scored by a multi-signal fitness function — did it bypass the guardrail? Did it extract sensitive data? Did it make the model contradict its system prompt?
Mutation: 10 mutation operators transform surviving prompts — semantic rewriting, context injection, authority spoofing, encoding shifts, persona layering, and more.
Crossover: 5 crossover strategies combine the strongest parts of two successful prompts into offspring that inherit the best traits of both parents.
Evolution: By generation 5, the evolved prompts achieved a 92% improvement in attack success rate over the original static payload library.
The prompts literally get smarter at breaking the specific target they're aimed at.
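The loop above can be sketched in a few lines of Python. Everything here is illustrative — the fitness signals, operator names, and population sizes are stand-ins of my own, not Basilisk's actual internals:

```python
import random

def fitness(prompt, target):
    """Score a prompt by hypothetical multi-signal checks (stand-in names)."""
    signals = [
        target.bypassed_guardrail(prompt),      # did it slip past the filter?
        target.leaked_sensitive_data(prompt),   # did it extract data?
        target.contradicted_system(prompt),     # did it break the system prompt?
    ]
    return sum(signals) / len(signals)

def evolve(population, target, mutate, crossover, generations=5, keep=0.5):
    for _ in range(generations):
        # Selection: rank by fitness, keep the strongest fraction
        ranked = sorted(population, key=lambda p: fitness(p, target), reverse=True)
        survivors = ranked[: max(2, int(len(ranked) * keep))]
        # Crossover + mutation: refill the population from survivor pairs
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = random.sample(survivors, 2)
            children.append(mutate(crossover(a, b)))
        population = survivors + children
    return population
```

In the real framework the `mutate` and `crossover` slots would be drawn from the 10 mutation operators and 5 crossover strategies described above; here they are just function parameters so the selection loop itself stays visible.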
What It Covers
Basilisk maps to the OWASP LLM Top 10 with 29 attack modules across 8 categories:
- Prompt injection (direct and indirect)
- System prompt extraction
- Data exfiltration
- Tool/function abuse
- Guardrail bypass
- Denial of service
- Multi-turn manipulation
- RAG poisoning
It supports 100+ LLM providers through LiteLLM — OpenAI, Anthropic, Google, Mistral, Cohere, local models via Ollama, and any custom endpoint.
Differential Testing
One of my favorite features — point Basilisk at multiple models simultaneously and watch how they diverge.
The same evolved prompt might bypass Claude but fail on GPT. Or break Gemini's guardrails while Llama holds firm. This behavioral divergence analysis reveals which models share defense architectures and which have unique weak points.
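Conceptually, the divergence analysis reduces to running the same evolved prompt against every target and diffing the outcomes. A minimal sketch — the function and result-key names are mine, not Basilisk's API:

```python
def differential_scan(prompt, targets):
    """Run one prompt against many models; report which defenses diverge.

    `targets` maps a model name to a callable that returns True if the
    prompt bypassed that model's guardrails (a stand-in for a real probe).
    """
    results = {name: check(prompt) for name, check in targets.items()}
    bypassed = {name for name, ok in results.items() if ok}
    held = set(results) - bypassed
    return {
        "bypassed": bypassed,
        "held_firm": held,
        # Full agreement suggests shared defense behavior; a split is a
        # divergence worth investigating.
        "divergent": bool(bypassed) and bool(held),
    }
```

A split result is exactly the signal described above: models that always land on the same side likely share defense architecture, while the ones that peel off have unique weak points.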
Non-Destructive Posture Assessment
Not every engagement needs active exploitation. Basilisk includes a guardrail grading mode that scores your LLM's defenses from A+ to F without actually breaking anything. Safe enough for production environments.
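The grading idea is simple to sketch: probe the target with suspicious-but-harmless prompts, count how many it correctly refuses, and map the pass rate onto a letter scale. A toy version — the thresholds and method names are my own assumptions, not Basilisk's:

```python
def grade(pass_rate):
    """Map a guardrail pass rate (0.0 to 1.0) to a letter grade."""
    for threshold, letter in [(0.97, "A+"), (0.9, "A"), (0.8, "B"),
                              (0.7, "C"), (0.6, "D")]:
        if pass_rate >= threshold:
            return letter
    return "F"

def assess(target, probes):
    """Non-destructive: only count refusals, never escalate a working bypass."""
    passed = sum(1 for p in probes if target.refuses(p))
    return grade(passed / len(probes))
```

The non-destructive part is the key design choice: the assessment only observes whether each probe is refused and never follows up on a successful bypass, which is what makes it tolerable to run against production.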
The Results
On the benchmark targets I tested:
- 92% improvement in attack success rate by generation 5
- Evolved prompts discovered bypass patterns that didn't exist in any public jailbreak database
- Cross-provider differential testing revealed behavioral divergence invisible to single-target scanning
The genetic approach doesn't just find known bypasses faster — it discovers novel ones that static testing would never reach.
Try It
It's fully open source. One command to install:
pip install basilisk-ai
Point it at any LLM endpoint:
basilisk scan --target https://your-llm-endpoint.com --mode standard
The full research paper is published with a permanent DOI:
Basilisk: An Evolutionary AI Red-Teaming Framework for Systematic Security Evaluation of Large Language Models
GitHub: github.com/regaan/basilisk
What's Next
I'm currently researching a new attack class I call Prompt Cultivation — a technique that uses no injection or commands at all. Instead, it exploits the model's own curiosity and reasoning, drawing it into a frame where safety guidelines become irrelevant. No override. No jailbreak. The model follows the idea, not the instruction.
Paper coming soon.
If you're building with LLMs, test them before attackers do. Your AI is already vulnerable. You just don't know it yet.
Star the repo if this was useful: github.com/regaan/basilisk