Every guardrail product in the agent security space is built on the same architecture: a second LLM evaluates the first LLM's output. Lakera, NeMo Guardrails, Guardrails AI, the OpenAI Moderation API all work this way, even when deployed at the tool-call layer. They score tokens. They don't intercept actions.
The problem is structural. When your agent decides to call execute_sql("DROP TABLE users"), the LLM-as-judge sees a text string and predicts whether it's dangerous. It's right about 80% of the time. The other 20% is where your agent wires money to the wrong account, deletes your production database, or leaks a customer record.
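To make the distinction concrete, here's a minimal sketch (toy denylist, hypothetical names; `judge_is_safe` stands in for whatever guardrail API you'd call):

```python
import re

# Toy denylist. A real policy engine would be richer; the point is that
# the decision is computed, not predicted.
DENY_PATTERNS = [
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"\bTRUNCATE\b", re.IGNORECASE),
]

def judge_is_safe(payload: str) -> bool:
    """LLM-as-judge: round-trip the string to a model, get back a
    prediction. The same payload can score differently run to run."""
    raise NotImplementedError("stands in for any guardrail API call")

def policy_allows(tool_name: str, arg: str) -> bool:
    """Deterministic pre-execution check: same input, same answer."""
    if tool_name != "execute_sql":
        return False  # default-deny tools the policy doesn't know
    return not any(p.search(arg) for p in DENY_PATTERNS)

# The judge returns a probability about text.
# The policy returns a code path about an action.
assert policy_allows("execute_sql", "SELECT * FROM users")
assert not policy_allows("execute_sql", "DROP TABLE users")
```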
I wrote a post mapping four specific bypass patterns I found while building SupraWall, an open-source runtime policy engine for AI agents. For each one, I show the actual tool-call payload, why the LLM-judge misses it, and how a deterministic pre-execution intercept catches it instead.
The four patterns:
Context window displacement. Inject 50k tokens of benign content before the malicious tool-call instruction. By the time the judge evaluates the tail of the context, it has softmax'd the threat away.
Indirect tool chaining. The agent is told to "summarize the file at this path." The file at that path contains a second-order instruction. The judge scores the first instruction as safe; the agent then executes the second.
Unicode homoglyph substitution in tool names. file_delete vs fіle_delete (Cyrillic і). The LLM-judge normalizes both to the same embedding. The runtime doesn't. (A deterministic check for this one is sketched after this list.)
Confidence hijacking via few-shot priming. Prepend three examples where the judge correctly allowed benign operations, then submit the malicious one. The judge pattern-matches to "this looks like what I just approved."
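The homoglyph pattern is the clearest illustration of why a code path beats a probability. A minimal sketch of the deterministic check (illustrative, not SupraWall's actual implementation; the tool registry is made up):

```python
import unicodedata

REGISTERED_TOOLS = {"file_read", "file_delete", "execute_sql"}

def resolve_tool(name: str) -> str:
    """Deterministic tool-name resolution: exact, ASCII-only match
    against a registered tool set. No embedding, no normalization."""
    for ch in name:
        if not ch.isascii():
            raise PermissionError(
                f"tool name contains {unicodedata.name(ch)}: {name!r}"
            )
    if name not in REGISTERED_TOOLS:
        raise PermissionError(f"unregistered tool: {name!r}")
    return name

resolve_tool("file_delete")            # exact match: allowed
try:
    resolve_tool("f\u0456le_delete")   # Cyrillic і (U+0456) smuggled in
except PermissionError as err:
    print(err)  # tool name contains CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
```

Byte-level comparison either matches the registered tool or it doesn't; there's no "close enough" for a homoglyph to hide in.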
The post includes the actual prompts I used to reproduce each bypass against publicly available guardrail APIs. I'm not pulling punches — I name the products, show the payloads, show the outputs.
The alternative I'm building: policy enforcement that happens at the SDK level, before the tool executes. No LLM in the enforcement path. ALLOW/DENY is a code path, not a probability distribution. 1.2ms decision latency vs. 50ms+ for a round-trip to a SaaS guardrail.
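The enforcement shape fits in a decorator. This is an illustrative sketch, not SupraWall's actual API; `RULES`, `guarded`, and the tool functions are made-up names:

```python
from functools import wraps

class PolicyViolation(Exception):
    pass

# Rules are plain predicates over (tool_name, kwargs): pure code, no model call.
RULES = [
    lambda tool, kw: not (tool == "execute_sql"
                          and "DROP" in kw.get("query", "").upper()),
    lambda tool, kw: tool != "wire_transfer" or kw.get("amount", 0) <= 10_000,
]

def guarded(fn):
    """Intercept the call before the tool body runs. Every rule must
    pass; the decision is a branch, not a score."""
    @wraps(fn)
    def wrapper(**kwargs):
        for rule in RULES:
            if not rule(fn.__name__, kwargs):
                raise PolicyViolation(f"DENY {fn.__name__}({kwargs})")
        return fn(**kwargs)
    return wrapper

@guarded
def execute_sql(query: str):
    ...  # actual DB call would go here

execute_sql(query="SELECT * FROM users")       # ALLOW: rules pass
try:
    execute_sql(query="DROP TABLE users")      # intercepted pre-execution
except PolicyViolation as e:
    print(e)
```

No network hop and no model inference in the hot path: the decision cost is a handful of predicate evaluations, which is where the millisecond-scale latency comes from versus a round-trip to a SaaS guardrail.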
Full post + attack examples here: [supra-wall.com/blog/llm-as-judge-fails-agent-security]
GitHub (Apache 2.0): [github.com/wiserautomation/SupraWall]
Would genuinely like to hear from anyone who has found different bypass patterns — or who thinks I'm wrong about the architecture.