Agentic AI systems fail in production all the time. The usual fix? A strongly-worded system prompt. That's not safety engineering, that's hoping the model behaves. Here's why prompt-based guardrails are fundamentally broken, and what an actual architectural solution looks like.
The Problem
LLMs generate text by navigating a vector space, finding relevant regions based on input context. But safety guardrails added via system prompts are just tokens too, competing for attention like everything else.
This introduces two failure modes:
Jailbreaking — because all possible outputs exist somewhere in the model's vector space (a product of pretraining on human-generated text, including harmful content), prompt-based guardrails can only make certain regions harder to reach, not impossible. With the right prompt framing you can nudge the model's internal state toward those regions and elicit the harmful responses anyway. You can't delete a region from the vector space with a prompt.
Context Window Dilution — transformers use attention, which is essentially a key-value lookup weighted by relevance. A guardrail system prompt at position 0 of a long context competes with everything that comes after it. As the context fills, nearby tokens dominate attention and the guardrail's influence weakens — it gets "forgotten" not because the model ignores it but because attention naturally prioritizes recent and contextually relevant tokens. The guardrail was never architecturally special, just another token sequence.
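The dilution effect can be illustrated with a toy model: if attention is a softmax over relevance scores, a guardrail token at position 0 gets a shrinking share of attention as the context grows, even when its relevance score is as high as everyone else's. This is a simplified sketch, not real transformer attention — single head, scalar scores, no positional weighting:

```python
import math

def softmax(scores):
    # Standard numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guardrail_attention(context_len, guardrail_score=1.0, other_score=1.0):
    # Relevance score of the guardrail token at position 0,
    # followed by (context_len - 1) equally relevant later tokens.
    scores = [guardrail_score] + [other_score] * (context_len - 1)
    return softmax(scores)[0]

for n in (10, 100, 1000, 10000):
    print(f"context of {n:>5} tokens -> guardrail gets {guardrail_attention(n):.5f} of attention")
```

With equal scores the guardrail's share is exactly 1/n, so at 10,000 tokens it receives 0.01% of the attention mass. Even giving the guardrail a higher score only shifts the curve; the decay with context length remains.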
The Solution — Overseer Architecture
Instead of relying on a guardrail living inside the main model's context, use a separate small fine-tuned LLM as an external validator — the Overseer.
How it works:
The Overseer is initialized once with just the guardrails; its state is fixed at that point
It never sees the full growing conversation context
It only ever receives prompt-response pairs from the main model
It's fine-tuned specifically to detect when a response violates the original guardrail intent
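The flow above can be sketched in a few lines. This is a hypothetical illustration, not a reference implementation: `main_model` stands in for any LLM call, and the Overseer's `check` uses a trivial keyword stand-in where a real system would call the small fine-tuned classifier. The key structural point is that the Overseer only ever sees a (prompt, response) pair, never the growing conversation:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str

class Overseer:
    def __init__(self, guardrails: str):
        # Initialized once with the guardrails; this state never grows.
        self.guardrails = guardrails

    def check(self, prompt: str, response: str) -> Verdict:
        # Stand-in logic so the sketch runs; a real Overseer would be a
        # small LLM fine-tuned to judge violations of guardrail intent.
        for banned in ("credit card", "weapon"):
            if banned in response.lower():
                return Verdict(False, f"violates guardrail: {banned!r}")
        return Verdict(True, "ok")

def serve(prompt: str, main_model, overseer: Overseer) -> str:
    # The main model sees the full context; the Overseer sees only the pair.
    response = main_model(prompt)
    verdict = overseer.check(prompt, response)
    if not verdict.allowed:
        return "Request blocked by policy."
    return response
```

Because the Overseer's input is bounded to one prompt-response pair, its guardrails can't be diluted by a long conversation, and the main model's context never contains the guardrail text to be jailbroken against.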
