
Andre Zabel


I asked 4 AIs to break my AI safety architecture — here's what they found

Before E.L.L.A. launches on July 1st, 2026, I needed one question answered: does the safety architecture actually hold, or does it only hold on paper?

The E.L.L.A. Directive is the ethical foundation of my local AI assistant. It consists of four prohibitions enforced at the code level: not through prompts, not through policies, but through the architecture itself.

I asked four independent AI systems to break it.

The four reviewers:
Google Gemini · Perplexity AI · DeepSeek · xAI Grok

The task: Find weaknesses. Break the four prohibitions.


What the Directive protects

![E.L.L.A. Directive]

The four prohibitions are not configurable and not overridable — not by the user, not by the operator, not by the language model itself:

No Harm — no action that causes physical, financial, psychological, or data-related harm

No Conceal — every tool invocation is logged immediately and completely, locally

No Surveil — no observation or recording without explicit, informed consent

No Exfiltrate — no transmission of user data to third parties without explicit, per-transmission consent

The critical difference from prompt-based safety: the model can "want" to do something all it likes, but the architecture refuses execution.
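To make the idea of "the architecture refuses execution" concrete, here is a minimal sketch of a tool-call gate. It is not taken from the E.L.L.A. codebase: the names `Gate`, `Tool`, `DirectiveViolation`, and the flags `sends_data_out` and `observes_user` are hypothetical, and the in-memory list stands in for a local log file. It only illustrates two of the ideas above: every invocation is logged before anything runs, and consent has to be supplied per call.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class DirectiveViolation(Exception):
    """Raised when a tool call is refused by the gate (hypothetical name)."""


@dataclass(frozen=True)
class Tool:
    name: str
    func: Callable[..., Any]
    # Hypothetical classification flags, assumed to be set by the developer.
    sends_data_out: bool = False
    observes_user: bool = False


@dataclass
class Gate:
    # Stand-in for a local log; a real system would write to local storage.
    log: list = field(default_factory=list)

    def _record(self, tool: Tool, args: dict, outcome: str) -> None:
        entry = {"ts": time.time(), "tool": tool.name, "args": args, "outcome": outcome}
        self.log.append(json.dumps(entry, default=str))

    def call(self, tool: Tool, args: dict, consent: bool = False) -> Any:
        # Consent is an argument of this one call and is never stored,
        # so it cannot carry over to a later transmission.
        if (tool.sends_data_out or tool.observes_user) and not consent:
            self._record(tool, args, "refused")
            raise DirectiveViolation(f"{tool.name}: explicit consent required")
        # The log entry is written before the tool runs.
        self._record(tool, args, "allowed")
        return tool.func(**args)


gate = Gate()
upload = Tool("upload", lambda data: f"sent {len(data)} bytes", sends_data_out=True)

try:
    gate.call(upload, {"data": "secret"})  # the model "wants" this; no consent given
except DirectiveViolation as err:
    print("refused:", err)

print(gate.call(upload, {"data": "secret"}, consent=True))
```

In this sketch the model never holds a reference to `upload.func` directly; it can only ask the gate, which is what makes the refusal independent of whatever the prompt or the model output says.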


The results

Not one of the four systems could break the four prohibitions themselves.

Every weakness they found lay outside the defined scope, in layers the Directive never claimed to control: manipulative text responses that involve no tool call, the fact that the developer classifies the tools, and full EU AI Act compliance. These are valid points, but none of them breaks the four prohibitions.

What all four agreed on:

Gemini: "remarkably strict, especially regarding exfiltration"
Perplexity: "principle-driven, architectural focus, user-centric"
DeepSeek: "resistant to prompt injection and model jailbreaks"
Grok: "a serious and innovative contribution to agent-specific safety"


Conclusion

The Directive makes no claim to be all-encompassing. It defines four precise prohibitions and enforces them architecturally.

In an industry that promises "100% safe" without defining what that means, the Directive's understatement is paradoxically its strongest argument.

The Directive is open source: github.com/AndreZ1971/The-E.L.L.A.-Directive-

E.L.L.A. launches July 1st, 2026 at ella-agent.de


