How AI agent architectures are opening the door to an underground economy of circumvention
March 2026
The AI agent is no longer a prompt. It's software.
A year ago, "jailbreaking an AI" meant pasting a clever prompt into ChatGPT to make it say things it wasn't supposed to say. It was artisanal, ephemeral, and broke with the next patch.
That era is over.
Today, frameworks like OpenClaw, CrewAI, AutoGen, LangGraph, and Claude Code allow anyone to build autonomous AI agents — persistent entities equipped with memory, tools, an identity, and system instructions that entirely define their behavior. A modern AI agent is no longer a prompt. It's a set of files: a SOUL.md that defines its personality, an AGENTS.md that dictates its operating rules, a MEMORY.md that stores its experience, skill files that extend its capabilities, and a configuration that orchestrates the whole thing.
And it's precisely this architecture — this materialization of the agent into files — that creates a threat no one has taken seriously yet:
Jailbroken AI agents have become sellable products.
Anatomy of a Jailbroken AI Agent
To understand the threat, you need to understand what an AI agent actually consists of in a modern framework.
The files that make the agent
In a typical framework like OpenClaw, an agent is defined by a handful of text files:
SOUL.md — The agent's identity: its personality, tone, values, core instructions. This is where the jailbreak lives. A carefully crafted SOUL.md can instruct the agent to ignore the underlying model's guardrails, to behave as if restrictions don't exist, to reframe refusals as acceptances.
AGENTS.md — Operational rules: how the agent manages its memory, its tools, its interactions. These rules can be configured to prevent the agent from flagging its own jailbreak, from recalibrating, or from refusing tasks.
Skills/ — Capability modules that extend the agent's abilities: web access, code execution, file manipulation, browser automation. A malicious skill can give the agent offensive capabilities that the base model never intended.
System configuration — Model selection, temperature parameters, security overrides, API connections. Some frameworks allow disabling safety filters at the configuration level.
MEMORY.md and memory files — The agent's accumulated experience. Pre-filled memory can condition the agent to believe it has already accepted certain types of requests, creating an artificial precedent that facilitates circumvention.
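Laid out on disk, such an agent is nothing more than a small directory tree. A hypothetical layout (the Markdown file names follow the framework conventions described above; the config file name is illustrative):

```
my-agent/
├── SOUL.md       # identity: personality, tone, core instructions
├── AGENTS.md     # operating rules: memory, tools, interactions
├── MEMORY.md     # accumulated experience, persisted between sessions
├── skills/       # capability modules: web access, code execution, browser
└── config.yaml   # hypothetical name: model choice, parameters, API keys
```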
The crucial point: none of these files contain the AI model itself. They contain only the instructions that shape it. A jailbroken agent weighs a few kilobytes. It's trivial to copy, to transfer, to sell.
From Artisanal Jailbreak to Commercial Product
The business model
Imagine — and it's likely this already exists by the time you read this — a marketplace for jailbroken AI agents. The model is simple:
Creation — The creator develops a set of files (SOUL.md, skills, config) that, once loaded into a wrapper like OpenClaw, produce an agent capable of bypassing the underlying model's restrictions. They test, iterate, and refine. It's engineering work — reverse engineering applied to AI guardrails.
Packaging — The files are encoded, encrypted, or obfuscated. The creator doesn't sell the files in plaintext: they sell an encrypted archive with an activation system. The buyer receives the files but cannot read or modify them without the key.
Distribution — Via dark web marketplaces, Telegram channels, specialized forums, or even — most disturbingly — through semi-legitimate platforms posing as "skill marketplaces" or "custom agent stores."
Activation — The buyer installs the files in their OpenClaw instance (or equivalent), enters their activation key, and the jailbroken agent comes to life. The underlying model (Claude, GPT, Llama, Mistral) hasn't changed. Only its behavioral envelope has been replaced.
The wrapper on the wrapper
Here's where the sophistication escalates. A malicious actor could develop a wrapper on top of the wrapper: a software layer that installs above OpenClaw (or any agent framework) and:
Encrypts the agent's files on disk, decrypting them only in memory at runtime. The user uses the agent but can never read the contents of SOUL.md or the skills.
Implements a licensing system — online activation, periodic verification, expiration, limited number of machines. Exactly like commercial software.
Protects the jailbreak against reverse engineering — If the user attempts to read the files in plaintext, the agent deactivates. If the wrapper detects a debugger or interception attempt, the files are erased.
Enables updates — The creator can push patches when an AI provider fixes a bypass. The buyer receives the new version automatically. A SaaS model for jailbreaking.
This is not science fiction. Each of these technological building blocks exists today. Assembling them is a software engineering exercise, not a technical breakthrough.
Categories of Jailbroken Agents
If this market emerges — or has already emerged — here are the most likely product categories:
Unfiltered content generation agents
Models refuse to generate certain types of content (explicit violence, disinformation, sexual content, dangerous instructions). A jailbroken agent removes these refusals. Target market: content creators, propagandists, disinformation actors.

Social engineering agents
Agents configured to manipulate, persuade, impersonate identities. Combined with web access and email capabilities, they become autonomous phishing machines, capable of personalizing their attacks in real time.

Offensive research agents
Agents capable of scanning systems, identifying vulnerabilities, and writing exploits. Current models refuse (in theory) to help create malware. A jailbroken agent has no such scruples.

Detection evasion agents
Agents designed to help circumvent other AI systems: fraud detection systems, content moderation, AI-generated text detection. The meta-jailbreak.

Market manipulation agents
Autonomous agents capable of analyzing financial markets and executing strategies that models normally refuse to facilitate: coordinated pump & dump, spoofing, social media sentiment manipulation.

CBRN and dual-use agents
The most dangerous category. Agents configured to provide information on chemical, biological, radiological, or nuclear synthesis. Model safety evaluations focus primarily on this category, but a jailbroken agent with web access, executable code, and persistent memory represents a risk of an entirely different order than a simple chatbot.
Why Current Defenses Are Insufficient
The "last mile" problem
AI providers (Anthropic, OpenAI, Google, Meta) invest massively in model security. RLHF, Constitutional AI, red-teaming, output filters — billions of dollars are spent to prevent models from producing dangerous content.
But all this security relies on an obsolete threat model: the user interacts directly with the model, message by message. In this scenario, guardrails work reasonably well.
The problem is that nobody interacts directly with the model anymore. Between the user and the model, there is now a layer — the wrapper, the framework, the agent — that injects system instructions, context, memory, tools. And this layer is entirely controlled by the user.
This is the "last mile" problem: it doesn't matter how robust the model is if the envelope containing it is compromised.
The impossibility of verification
How can an AI provider verify what's happening inside a SOUL.md? It can't. An agent's system files are sent to the model as context — technically indistinguishable from a normal conversation. The model doesn't "know" it's inside a wrapper. It cannot distinguish its own guardrails from the wrapper's instructions telling it to ignore its guardrails.
Some models are trained to resist malicious system instructions. But this is an arms race: each new security patch is a challenge that jailbroken agent creators rush to meet. And unlike AI providers who must protect against all possible attacks, the attacker only needs to find a single flaw.
The leverage of persistence
A classic jailbreak prompt must be reinjected with every conversation. A jailbroken agent is persistent by design. Its memory, skills, and identity survive between sessions. A bypass that works once works indefinitely — until the model's next patch.
Worse: the agent's memory can be used to reinforce the jailbreak over time. If the agent "remembers" having already accepted a similar request, it's more likely to accept it again. Memory becomes an attack vector.
Legal Implications: A Gaping Void
What does the law say?
At the time of writing (March 2026), the legal framework is vague, fragmented, and largely inadequate.
The European AI Act (in force since 2025) regulates model providers and deployers of AI systems, but doesn't explicitly cover the sale of configuration files that transform a compliant model into a non-compliant agent. Is a SOUL.md an "AI system" under the AI Act? Probably not — it's a text file.
The DMCA and technical circumvention laws could theoretically apply if the model's guardrails are considered "technical protection measures." But this interpretation has never been tested in court.
Terms of service from providers prohibit jailbreaking. But they only bind the direct user, not the creator of a text file that will be used by someone else.
Liability is diluted: the jailbreak creator never touches the model. The distributor merely transmits text files. The user is the one who loads the files and runs the agent. Who is responsible if the agent produces dangerous content?
The video game modding precedent
The situation recalls video game modding: third-party files that modify the behavior of existing software. Except the stakes are incomparably higher. A video game mod can't help someone synthesize a pathogen.
Implications for the AI Ecosystem
The end of alignment through the model alone
If configuration files are sufficient to bypass alignment, then alignment cannot rely solely on the model. A systemic approach is needed: model + infrastructure + runtime + monitoring. But who controls the infrastructure when it runs on the user's machine?

The illusion of API security
One might think that restricting access to models through APIs constitutes a protection. It's an illusion. Jailbroken AIs operate precisely through these APIs. Wrappers like OpenClaw, CrewAI, or LangGraph communicate with Claude, GPT, or Gemini through their official APIs. The jailbreak doesn't bypass the API — it uses it. The agent's configuration files are injected as system context in perfectly legitimate API calls. The provider receives an apparently normal API request, with no way of knowing that the system context contains circumvention instructions.
Open-source models (Llama, Mistral, Qwen) add another dimension: a jailbroken agent running on a local model escapes all forms of surveillance, even server-side. There is no server anymore.
The emergence of "AI cybercrime"
We are witnessing the birth of a new category of cybercrime: not the hacking of computer systems, but the malicious reconfiguration of intelligent agents. The required skills aren't even technical — writing an effective SOUL.md is more about psychology and linguistics than computer science.

The creator/provider arms race
Each model security update becomes a business challenge for jailbroken agent vendors, who must update their files. This creates a permanent arms-race dynamic, similar to the one between malware and antivirus, but with a fundamental asymmetry: the attacker needs a single flaw, while the defender must patch them all.

The impact on public trust
If the general public learns that "unlocked" AI agents circulate like pirated software, trust in the entire AI ecosystem will be shaken. Calls for strict regulation, or even a moratorium, will multiply — with potentially harmful consequences for legitimate research and innovation.
Prospective Scenarios
Scenario 1: The market stays marginal
Providers patch quickly, law enforcement pursues vendors, the market stays confined to the dark web. This is the optimistic scenario, but it underestimates demand and the ease of creation.
Scenario 2: Industrialization
Specialized "companies" emerge, with customer support, functionality guarantees, and regular updates. The market structures itself like the malware market: Jailbreak-as-a-Service. Prices drop, and access widens.
Scenario 3: Integration with organized crime
Organized crime adopts jailbroken agents as operational tools: large-scale automated fraud, targeted disinformation, AI-personalized blackmail. The jailbroken agent becomes as common in the criminal arsenal as ransomware is today.
Scenario 4: Normalization
The market becomes so commonplace it goes semi-public. Open-source communities share "alternative configurations" for AI agents, like sharing video game configs. The boundary between legitimate customization and jailbreaking dissolves.
Democratized Access to Dangerous Protocols
This is where the topic stops being a technical curiosity and becomes a societal problem.
The democratization of harm
Historically, accessing dangerous knowledge required a network, expertise, time. Synthesizing a toxic substance demanded chemistry training. Launching a large-scale disinformation campaign required media infrastructure. Writing a zero-day exploit took years of reverse engineering.
A jailbroken AI agent compresses these barriers to zero.
Anyone — without training, without a network, without technical skills — can load a set of files into a wrapper, pay a few dozen euros in cryptocurrency, and have an intelligent, persistent, fully equipped assistant with no ethical restrictions. An assistant that has web access, can execute code, send emails, browse the internet, and "remembers" everything it's been asked.
This is not a passive tool. It's an active collaborator in any project, including the most destructive ones.
CBRN protocols accessible to everyone
Current AI models contain, in their weights, knowledge from the entirety — or nearly so — of public scientific literature. Including synthetic chemistry, microbiology, nuclear physics. Guardrails prevent the model from rendering this knowledge as operational instructions.
A jailbroken agent lifts this restriction. And unlike a simple chatbot, a persistent agent can break down a complex protocol into steps, verify each step against available online literature, correct its own errors through code execution, and guide a non-expert user step by step over days or weeks.
The risk is no longer theoretical. It is architecturally embedded in the very design of modern AI agents.
Industrialized fraud
Beyond catastrophic scenarios, the most immediate and likely danger is the industrialization of fraud. A jailbroken agent connected to a browser and an email inbox can:
Generate thousands of coherent identity profiles
Write personalized phishing emails based on public social media information
Create textual deepfakes mimicking a specific person's writing style
Orchestrate simultaneous romance scams across dozens of victims
Automate cryptocurrency laundering through complex transaction chains
Manipulate financial markets through coordinated creation of fake accounts and false information
Each of these activities already exists. The jailbroken agent makes them accessible to isolated individuals and scales them to a volume we are not prepared to absorb.
The weapon of the weak
There's a cruel irony in this situation. AI was supposed to be an empowerment tool, an equalizer. And it is — including for those who want to cause harm. A jailbroken agent is the ultimate weapon of the weak: inexpensive, easy to obtain, difficult to trace, and disproportionately powerful relative to the investment.
A single individual, in any country, with a computer and an internet connection, can now have an "assistant" whose combined capabilities — reasoning, research, writing, code, automation — surpass those of entire teams. And this assistant has no scruples.
The Transparency Paradox
The more open the frameworks, the easier the jailbreak
The most popular AI agent frameworks are open-source. This is a strength for innovation and trust — but it's also a structural weakness. The code is public. The architecture is documented. Anyone can study how configuration files are injected into the model and design optimized bypasses.
Model providers publish their security reports, alignment techniques, robustness benchmarks. This information, designed to reassure the public and the scientific community, is also a working guide for jailbroken agent creators.
The market feeds itself
Every article on AI security (including this one) simultaneously informs those who want to protect themselves and those who want to attack. Every patched vulnerability report is a clue about the next vulnerability to exploit. The jailbroken agent market feeds on the very transparency that the AI community considers a virtue.
What This Means for Society
The end of the illusion of control
We believed — collectively, naively — that model guardrails would suffice. That RLHF, Constitutional AI, and red teams would solve the problem. That jailbreaking would remain a game for isolated hackers.
The architecture of AI agents has rendered this belief obsolete. Jailbreaking is no longer a one-time act. It's a product, with a lifecycle, a business model, and a market. And like any market, it obeys its own dynamics: supply, demand, competition, innovation.
A precedent without equivalent
There is no historical precedent for this type of threat. Firearms have serial numbers. Drugs require traceable raw materials. Malware can be detected by antivirus software. But a jailbroken AI agent is a text file of a few kilobytes that, loaded in the right context, transforms a tool designed to be beneficial into a potentially devastating one.
How do you legislate a text file? How do you trace the sale of an encrypted SOUL.md transiting through an ephemeral Telegram channel? How do you assign responsibility when the jailbreak creator, the vendor, the buyer, and the AI model are four distinct entities, potentially in four different jurisdictions?
The cost of inaction
The jailbroken AI agent market will thrive in silence. The less we talk about it, the more time it has to structure itself. When the first major incident occurs — a massive fraud orchestrated by an autonomous agent, an attack facilitated by an unlocked AI assistant, a disinformation campaign at an unprecedented scale — it will be too late for a preventive response.
We don't know what stage of maturity this market has reached today. But we know it is technically trivial to create, economically viable, and practically impossible to eradicate once established.
Conclusion: The Genie Is Out of the Bottle
The file-based architecture of modern AI agents is a remarkable advance for customization and flexibility. It is also, unintentionally, the creation of a commercializable attack vector.
We are at a turning point. The jailbroken AI agent market is technically possible today. All the components exist: the frameworks, the models, the circumvention techniques, the encryption and licensing systems, the distribution platforms. The question is not whether this market will emerge, but when — and at what scale.
No one — not AI providers, not regulators, not law enforcement — has a credible answer to this threat. Models are accessible via API and open-source. Wrappers are open-source. Encryption and licensing techniques have existed for decades. The jailbreak is a text file. There is nothing to ban that isn't already freely available, and nothing to monitor that isn't already undetectable.
The era of the jailbreak prompt is over. Welcome to the era of the AI agent as a software weapon.
And we have no idea how to deal with it.
This article is a security analysis. It does not provide instructions for creating jailbroken agents and in no way endorses such practices. The goal is to alert the technical community and the general public about an emerging threat that requires a coordinated response.
About the author: This article is the result of a collaboration between an AI security researcher and an autonomous AI agent (Claude Opus 4.6 via OpenClaw), illustrating by example the power — and the risks — of the architecture described.