Agentic AI systems fail in production all the time. The usual fix? A strongly-worded system prompt. That's not safety engineering, that's hoping the model behaves. Here's why prompt-based guardrails are fundamentally broken, and what an actual architectural solution looks like.
The Problem
LLMs generate text by navigating a vector space, finding relevant regions based on input context. But safety guardrails added via system prompts are just tokens too, competing for attention like everything else.
This introduces two failure modes:
Jailbreaking — because all possible outputs exist somewhere in the model's vector space (a product of pretraining on human-generated text, including harmful content), prompt-based guardrails can only make certain regions harder to reach, not impossible. With the right prompt framing you can nudge the model's internal state toward those regions and elicit the harmful responses anyway. You can't delete a region from the vector space with a prompt.
Context Window Dilution — transformers use attention, which is essentially a key-value lookup weighted by relevance. A guardrail system prompt at position 0 of a long context competes with everything that comes after it. As the context fills, nearby tokens dominate attention and the guardrail's influence weakens — it gets "forgotten" not because the model ignores it but because attention naturally prioritizes recent and contextually relevant tokens. The guardrail was never architecturally special, just another token sequence.
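The dilution effect can be illustrated with a toy model: if attention is a softmax over relevance scores, a guardrail token at position 0 gets a shrinking share of attention as the context grows, even when its relevance score is as high as everyone else's. This is a simplified sketch, not real transformer attention — single head, scalar scores, no positional weighting:

```python
import math

def softmax(scores):
    # Standard numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guardrail_attention(context_len, guardrail_score=1.0, other_score=1.0):
    # Relevance score of the guardrail token at position 0,
    # followed by (context_len - 1) equally relevant later tokens.
    scores = [guardrail_score] + [other_score] * (context_len - 1)
    return softmax(scores)[0]

for n in (10, 100, 1000, 10000):
    print(f"context of {n:>5} tokens -> guardrail gets {guardrail_attention(n):.5f} of attention")
```

With equal scores the guardrail's share is exactly 1/n, so at 10,000 tokens it receives 0.01% of the attention mass. Even giving the guardrail a higher score only shifts the curve; the decay with context length remains.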
The Solution — Overseer Architecture
Instead of relying on a guardrail living inside the main model's context, use a separate small fine-tuned LLM as an external validator — the Overseer.
How it works:
The Overseer is initialized once with just the guardrails; its state is fixed at that point
It never sees the full growing conversation context
It only ever receives prompt-response pairs from the main model
It's fine-tuned specifically to detect when a response violates the original guardrail intent
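The flow above can be sketched in a few lines. This is a hypothetical illustration, not a reference implementation: `main_model` stands in for any LLM call, and the Overseer's `check` uses a trivial keyword stand-in where a real system would call the small fine-tuned classifier. The key structural point is that the Overseer only ever sees a (prompt, response) pair, never the growing conversation:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str

class Overseer:
    def __init__(self, guardrails: str):
        # Initialized once with the guardrails; this state never grows.
        self.guardrails = guardrails

    def check(self, prompt: str, response: str) -> Verdict:
        # Stand-in logic so the sketch runs; a real Overseer would be a
        # small LLM fine-tuned to judge violations of guardrail intent.
        for banned in ("credit card", "weapon"):
            if banned in response.lower():
                return Verdict(False, f"violates guardrail: {banned!r}")
        return Verdict(True, "ok")

def serve(prompt: str, main_model, overseer: Overseer) -> str:
    # The main model sees the full context; the Overseer sees only the pair.
    response = main_model(prompt)
    verdict = overseer.check(prompt, response)
    if not verdict.allowed:
        return "Request blocked by policy."
    return response
```

Because the Overseer's input is bounded to one prompt-response pair, its guardrails can't be diluted by a long conversation, and the main model's context never contains the guardrail text to be jailbroken against.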
