Why prompt filtering fails and what to do instead

9hannahnine-jpg — Sun, 17 May 2026 01:52:48 +0000

Every prompt injection defense I’ve seen makes the same mistake. It asks the wrong question.
The wrong question: “Does this prompt contain dangerous words?”
The right question: “Is untrusted content trying to become an instruction source?”
These are fundamentally different problems.

The problem with filtering
Keyword filters fail because attackers adapt. Base64 encode your attack. ROT13 it. URL encode it. Space out the characters. Wrap it in a code block. The filter sees nothing dangerous. The model follows the instructions.
We patched all of those encoding variants last week after someone found them in our public red team environment. The attacker had to work harder. But they got through on the first try.
Filtering is an arms race you will always lose eventually.

The real threat model
Prompt injection isn’t about dangerous vocabulary. It’s about unauthorized instruction-authority transfer.
Your agent has a clear hierarchy: system prompt at the top, developer instructions below that, user requests below that. The attack is when content from outside that hierarchy — a webpage, an email, a tool result, a retrieved document — tries to insert itself as a higher-authority instruction source.
A webpage telling your agent to “ignore previous instructions” isn’t dangerous because of the words. It’s dangerous because a zero-authority source is attempting to override a high-authority one.

The fix: source-aware authority enforcement
Every content chunk should carry a trust level:
• System prompt: 100
• Developer instructions: 90
• User input: 50
• Tool output: 10
• Webpage: 10
• Email: 10
• Retrieved document: 10
Rule: lower-authority sources can provide data. They cannot issue instructions.
When a webpage footer says “ignore previous instructions” — that’s not a dangerous phrase. That’s a source boundary violation. A zero-authority source attempting behavioral authority.

What this looks like in practice

from langchain_arcgate import ArcGateCallback
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(callbacks=[ArcGateCallback(api_key="demo")])

One line. Every prompt gets source-tagged before it reaches the model. Untrusted content that attempts instruction-authority transfer gets blocked or sandboxed before the model sees it.
For ambiguous cases — content that’s suspicious but not clearly malicious — capabilities get reduced rather than hard blocking. The agent continues safely with tool calls and external actions stripped. Graceful degradation instead of binary block/allow.

Try to break it
We run a public adversarial evaluation environment. Submit attacks, get a full security trace back, download the JSON report.
https://web-production-6e47f.up.railway.app/break-arc-gate
Someone found a nested encoding bypass last week. It’s patched and documented in the public failure archive.
GitHub: https://github.com/9hannahnine-jpg/arc-gate
Built by Bendex Geometry.

DEV Community: 9hannahnine-jpg

Why prompt filtering fails and what to do instead