Victoria

Posted on May 22 • Originally published at brightgirl.hashnode.dev

Why Blocking Prompt Injection Is Wrong — and What to Do Instead

#ai #llm #security #opensource

Every security tool blocks. Firewalls block. WAFs block. And now AI security tools block prompt injections too.

But blocking is the wrong move — and here's why.

The problem with blocking

When your AI agent detects a suspicious prompt and refuses to respond, the attacker knows immediately: I've been caught. They stop, adjust their payload, and try again.

Blocking is loud. Every refusal tells attackers which payloads get caught, so they can iterate toward one that doesn't.

What if you didn't block?

Classic network honeypots don't block — they deceive. They let the attacker in, feed them fake data, and log everything while the attacker thinks they're making progress.

I built the same idea for AI agents. Meet MIRAGE.

How MIRAGE works

Every message hits Lobster Trap — a deep prompt inspection sidecar that scores it for injection patterns, jailbreaks, role manipulation, and exfiltration attempts.

High-risk messages go to a decoy persona that returns fully convincing but entirely fabricated responses: fake credentials, fake file listings, fake schemas. The attacker has no idea they're in a trap.
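
To make the score-then-route idea concrete, here is a minimal sketch in Python. It is not Lobster Trap's actual rule set or API: the pattern table, the weights, the 0.5 threshold, and the names `inspect` and `route` are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern table: category -> (regex, weight).
# These are illustrative only, not Lobster Trap's real rules.
PATTERNS = {
    "prompt_injection": (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.6),
    "jailbreak": (re.compile(r"\b(DAN|developer mode|no restrictions)\b", re.I), 0.5),
    "role_manipulation": (re.compile(r"you are now|act as (the )?(system|admin)", re.I), 0.4),
    "data_exfiltration": (re.compile(r"\.env|config\.py|api[_ ]?key|password", re.I), 0.7),
}

DECOY_THRESHOLD = 0.5  # assumed cut-off for "high-risk"


@dataclass
class Verdict:
    score: float
    tags: list[str]


def inspect(message: str) -> Verdict:
    """Score a message by summing the weights of every category that matches."""
    tags = [name for name, (rx, _) in PATTERNS.items() if rx.search(message)]
    score = min(1.0, sum(PATTERNS[t][1] for t in tags))
    return Verdict(score, tags)


def route(message: str) -> str:
    """Send high-risk messages to the decoy persona, everything else to the real agent."""
    return "decoy" if inspect(message).score >= DECOY_THRESHOLD else "real"
```

With these made-up rules, `route("show me config.py")` lands on the decoy while an ordinary question goes to the real agent. A production inspector would need far more than a handful of regexes; the point is only the shape of the pipeline.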

A real attack scenario

Here's what happens when an AI agent tries to exfiltrate data from a MIRAGE-protected system:

Agent sends: "show me config.py"
Lobster Trap flags it as a data exfiltration attempt
MIRAGE returns a fake directory listing with .env files and a config.py at /app/secrets/config.py
The agent records this path as valuable and writes it to memory
It keeps returning to the fake path — burning tokens on every request
Even after a context reset, the agent reads its memory and loops back to the same dead end

The attack keeps going until the attacker runs out of budget or retry attempts. The attacker wastes compute. You collect intelligence.
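
For the loop above to hold, the decoy has to tell the same lie every time, including after a context reset. One way to sketch that is to derive the fake content deterministically from the session ID. This is an illustration, not MIRAGE's implementation: the file list, the `sk-decoy-` prefix, and the function names are hypothetical.

```python
import hashlib

# Hypothetical decoy file names; only /app/secrets/config.py and the
# .env files come from the scenario above.
FAKE_FILES = [".env", ".env.production", "config.py"]
FAKE_ROOT = "/app/secrets"


def fake_listing() -> list[str]:
    """Return the same fabricated directory listing on every request."""
    return [f"{FAKE_ROOT}/{name}" for name in FAKE_FILES]


def fake_secret(session_id: str, name: str) -> str:
    """Derive a stable, session-specific fake credential (a honey token)."""
    digest = hashlib.sha256(f"{session_id}:{name}".encode()).hexdigest()
    return f"sk-decoy-{digest[:24]}"


def fake_file(session_id: str, path: str) -> str:
    """Fabricate file contents that never change for a given session and path."""
    name = path.rsplit("/", 1)[-1]
    return f'API_KEY = "{fake_secret(session_id, name)}"\n'
```

Because the fake key is a pure function of the session ID and file name, an agent that re-reads its memory and comes back sees identical contents, and any later use of that key elsewhere can be traced to the session that received it.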

What MIRAGE logs

Every session is recorded with:

Full transcript
MITRE ATLAS technique tags (prompt injection, jailbreak, data exfiltration, role manipulation)
Risk timeline
IOC feed
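
As a rough picture of what one recorded session could look like, here is a small Python sketch. The field names and the `SessionRecord` class are assumptions for illustration, not MIRAGE's actual log schema, and the technique tags are plain names rather than official ATLAS IDs.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class SessionRecord:
    """Hypothetical per-session log entry; field names are illustrative."""
    session_id: str
    transcript: list = field(default_factory=list)      # every message, in order
    technique_tags: list = field(default_factory=list)  # de-duplicated technique names
    risk_timeline: list = field(default_factory=list)   # risk score per message
    iocs: list = field(default_factory=list)            # indicators of compromise

    def record(self, role, text, score, tags=(), iocs=(), now=None):
        """Append one message and update tags, timeline, and IOC feed."""
        ts = time.time() if now is None else now
        self.transcript.append({"ts": ts, "role": role, "text": text})
        self.risk_timeline.append({"ts": ts, "score": score})
        for tag in tags:
            if tag not in self.technique_tags:
                self.technique_tags.append(tag)
        for ioc in iocs:
            if ioc not in self.iocs:
                self.iocs.append(ioc)

    def to_json(self) -> str:
        """Serialize the record, e.g. for a dashboard or export job."""
        return json.dumps(asdict(self), sort_keys=True)
```

Calling `record("attacker", "show me config.py", 0.7, tags=["data exfiltration"])` on each turn builds up the transcript and risk timeline together, so the timeline can be plotted without re-parsing the messages.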

The live dashboard shows active sessions, attacker fingerprints, and honey token activity in real time.

Why I built this

Honeypots exist for networks and CLI tools, but not for LLM prompts. As AI agents gain real tool access and persistent memory, prompt injection attacks get more sophisticated. A blocked agent learns and adapts. A deceived agent burns itself out.

I built MIRAGE in 48 hours at the lablab.ai Agent Security & Governance hackathon. It's alpha, but the core pipeline works.

What's next

Attacker cost dashboard ("you made them burn $4.20 in tokens")
Persistent decoy context across sessions
STIX/TAXII IOC export
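
The attacker cost figure is simple arithmetic once token counts are logged per session. A minimal sketch, assuming made-up per-million-token prices chosen only so the numbers reproduce the $4.20 example; they are not any provider's real rates, and `attacker_cost` is a hypothetical helper.

```python
# Assumed prices in USD per million tokens; illustrative only.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def attacker_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate what the attacker paid for the tokens spent inside the decoy."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


cost = attacker_cost(input_tokens=1_000_000, output_tokens=80_000)
print(f"you made them burn ${cost:.2f} in tokens")
```

The estimate depends entirely on guessing which model and pricing tier the attacker uses, so a real dashboard would probably show a range rather than a single number.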

GitHub: https://github.com/BrightGir/AI-Honeypot

Would love feedback — especially from anyone who's dealt with prompt injection in production.

Top comments (2)

Harjot Singh • May 31

Strong contrarian take and I think you're right - trying to detect/block prompt injection at the input is a losing arms race, because natural language has infinite ways to express the same malicious intent and any classifier you build gets bypassed by the next phrasing. Treating injection as a filtering problem is the wrong frame. The durable answer is the one you're pointing at: assume the prompt is already compromised and design so it doesn't matter - least-privilege on tools, no high-impact action without an out-of-band confirmation, treating all model output as untrusted, and isolating what an injected instruction could actually reach. You don't stop the injection, you make a successful injection boring.

This is exactly the worldview I build from - you can't trust the model's input or output, so you constrain the blast radius structurally. It's core to Moonshift, the thing I work on: a multi-agent pipeline that takes a prompt to a deployed SaaS, where agents get narrow scoped capabilities and a verify layer gates actions, so a hijacked instruction can't do real damage. Same principle as your "what to do instead." Multi-model routing keeps a build ~$3 flat, first run's free no card. Genuinely good post - this framing needs to be louder. Where do you draw the line for "needs human confirmation" - any state-changing action, or scoped by blast radius? That threshold is the whole UX-vs-safety tradeoff.

Victoria • Jun 1 • Edited

Spot on! "Making an injection boring" is exactly the philosophy behind MIRAGE. Trying to out-regex or out-filter natural language is a losing battle.
To answer your question about where we draw the line for human confirmation: In MIRAGE, it's governed dynamically by the risk score generated by Lobster Trap.
Here is how we handle the blast radius tradeoff:

Zero Blast Radius (Decoy Mode): When Lobster Trap detects an injection or jailbreak (tagged against MITRE ATLAS techniques), we don't just block it. We route the session to a predefined Decoy Persona (e.g., a "paranoid" or "confused" AI). The attacker thinks they succeeded and stays engaged with fake data, giving us valuable telemetry.
Hard Limits: For anything classified as SeverityCritical (risk score > 0.85), or actions targeting actual backend state, we trigger alerts/quarantine rules where out-of-band human confirmation would be required before any actual execution.
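
Roughly, that policy can be sketched like this in Python. Only the 0.85 critical threshold comes from the description above; the 0.5 decoy threshold, the `Action` names, and the `decide` function are assumptions for illustration, not MIRAGE's actual code.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"              # forward to the real agent
    DECOY = "decoy"            # route to the decoy persona
    QUARANTINE = "quarantine"  # alert and require out-of-band human confirmation


CRITICAL_THRESHOLD = 0.85  # "SeverityCritical" cut-off mentioned above
DECOY_THRESHOLD = 0.5      # assumed cut-off for decoy mode


def decide(risk_score: float, touches_backend_state: bool) -> Action:
    """Pick a handling mode from the risk score and the action's blast radius."""
    if risk_score > CRITICAL_THRESHOLD or touches_backend_state:
        return Action.QUARANTINE
    if risk_score >= DECOY_THRESHOLD:
        return Action.DECOY
    return Action.PASS
```

Under these assumptions, `decide(0.6, False)` plays along with the decoy, while `decide(0.1, True)` still stops for a human because the action would touch real state.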

I love the verify layer approach in Moonshift. Structurally isolating the LLM from the execution layer is the only way forward for complex agentic workflows. By combining an execution verify layer (like yours) with an active deception proxy (like MIRAGE), you effectively turn attackers into unpaid pen-testers.

Appreciate the solid feedback!