In context is not the same as in control.
By Beamlak Adane
A colleague asked a sharp question that many of us hit once we move from demos to production evaluators and agents:
In multi-turn evaluation or agent loops, models often begin to ignore the initial rules in the system prompt even though those tokens are still inside the context window. What is happening at the token level during decoding that causes early “anchor” tokens to lose influence over time in a streaming context, and how do attention sinks, KV-cache reuse, and prefix caching affect this? Beyond increasing context length, what can engineers do to preserve instruction fidelity and judgment consistency across long sessions?
This post is my answer after digging into the mechanism and mapping it to something concrete I ship: a sales-email evaluator loop.
The core idea: visible is not the same as influential
Keeping the system prompt inside the window is a storage guarantee, not an influence guarantee.
In a decoder-only transformer, each new token is produced by forming a fresh query and comparing it against keys from all prior positions (under the causal mask). A past token affects the next token only through how much attention mass the current query assigns to that token’s key, after softmax normalization and any positional biases.
So the system rules do not “disappear” in the sense of vanishing from the input. They can stop mattering because the active computation at the current step no longer routes enough probability mass through those early positions—or because later tokens have become fresher, more specific carriers of a different trajectory.
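To make "routes probability mass" concrete, here is a minimal numpy sketch of one decode step of single-head attention (no positional biases, no multi-head structure; the function name and shapes are mine, not from any library):

```python
# Minimal sketch (not a real model): one decode step of single-head attention.
# A past token's influence on the next token is exactly the softmax weight
# the current query assigns to its key -- nothing else keeps it "alive".
import numpy as np

def decode_step_attention(q, K, V):
    """q: (d,) query for the new token; K, V: (t, d) keys/values of prior tokens."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)          # one score per past position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax: mass must sum to 1 across positions
    context = weights @ V                # past tokens matter only via these weights
    return context, weights

rng = np.random.default_rng(0)
t, d = 8, 16
K, V = rng.normal(size=(t, d)), rng.normal(size=(t, d))
q = rng.normal(size=d)
_, w = decode_step_attention(q, K, V)
print(w.round(3))  # the "influence budget" over the 8 earlier positions
```

The system prompt's tokens sit inside `K` and `V` like everything else; if `w` stops assigning them meaningful mass, they stop steering the output.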
Why multi-turn loops drift harder than one long completion
Three forces show up again and again in evaluator and agent traces:
Recency competition. Each turn adds user text, tool output, and prior model generations. Those tokens are often semantically closer to the immediate subtask than a long rubric paragraph at the front. They compete for attention mass on every decode step (a toy demo follows this list).
Self-conditioning. Autoregressive models always condition on what they already generated. A slightly off-rubric line becomes part of the context and can steer the next line. Drift compounds because the model is partly “arguing with its own last move.”
Turn boundaries change the query geometry. Instruction-following research suggests attention to system tokens can look relatively stable within a single answer, then shift more abruptly across turns. That matches engineering reality: each loop iteration injects new observations that compete directly with the original rubric.
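As a toy illustration of the first force (the scores are arbitrary numbers, not real learned logits), watch the rubric token's softmax share shrink as recent, slightly better-matching tokens accumulate:

```python
# Toy illustration of recency competition: the rubric token keeps the SAME
# raw score, but its softmax share shrinks as each turn appends tokens that
# score slightly higher for the immediate subtask.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rubric_score = 2.0
for n_recent in (4, 16, 64, 256):
    scores = np.concatenate(([rubric_score], np.full(n_recent, 2.5)))
    share = softmax(scores)[0]
    print(f"{n_recent:4d} recent tokens -> rubric attention share {share:.4f}")
```

The rubric's score never changed; its share still collapses from roughly 0.13 to roughly 0.002, because softmax is a zero-sum budget.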
Attention sinks, KV cache, and prefix caching (what they actually do)
Attention sinks
Attention sinks are a subtle but important correction to naive "early tokens always win" stories. In long streaming settings, some early tokens can receive surprisingly persistent attention mass, even when they are not semantically "the rule." Softmax attention must allocate its probability mass somewhere; when the model does not need to attend strongly to many past tokens, mass can pool in early positions that function partly as normalization anchors rather than as evidence that the model is consulting the instructions.
Practical implication: seeing high attention on the beginning of the prompt does not automatically mean the model is faithfully executing the rubric text that follows.
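For intuition on how streaming systems exploit sinks, here is a sketch of a StreamingLLM-style eviction policy as I read the paper: keep a few initial sink positions plus a recent window and evict the middle. The function name and defaults are my own; a real implementation evicts the corresponding KV-cache blocks, not just indices.

```python
# Sketch of a StreamingLLM-style eviction policy: retain a handful of
# initial "sink" positions plus a sliding window of recent positions.
def streaming_keep_set(seq_len, n_sinks=4, window=1020):
    if seq_len <= n_sinks + window:
        return list(range(seq_len))
    return list(range(n_sinks)) + list(range(seq_len - window, seq_len))

print(streaming_keep_set(5, n_sinks=2, window=4))     # short: keep everything
print(streaming_keep_set(5000, n_sinks=4, window=8))  # long: 4 sinks + last 8
```

Notice what survives a long stream under this policy: the sinks and the recent tail. A rubric that sits just past the sink positions is exactly the kind of content that gets squeezed.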
KV cache reuse
KV caching is primarily an inference optimization: once a token’s key and value tensors are computed, they can be reused during autoregressive decoding instead of recomputing the full prefix every step.
Behaviorally, cached states are frozen snapshots of earlier positions. They remain available to future queries, but they are not continuously reinterpreted as the conversation evolves. Each new token still has to “re-win” attention against an expanding set of competitors—often including very recent, task-specific states.
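Here is the standard explicit-cache decode loop in Hugging Face Transformers (gpt2 is just a small stand-in model). Note that `past_key_values` is only ever appended to; the earlier entries are never recomputed or reinterpreted:

```python
# Minimal greedy decode loop with explicit KV-cache reuse.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("System rules: never overcommit.", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values            # frozen K/V for all prior positions
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```

The cache saves recomputation; it does nothing to guarantee that the frozen system-prompt states keep winning attention against the growing tail.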
Prefix caching
Prefix caching (serving systems that reuse KV blocks for a shared static prefix across requests) is different again. It does not magically fix “the model stopped obeying rules inside one endless chat thread.”
Its biggest win is operational: it makes it cheap to replay a canonical rubric and tool schema on every call, which enables an architecture where you stop relying on one ever-growing mutable conversation as the carrier of policy.
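A sketch of that architecture with vLLM (the model name is illustrative; `enable_prefix_caching` is per the vLLM docs linked below): every call replays the same immutable rubric prefix, and the engine reuses its KV blocks across requests.

```python
# Sketch: replay a canonical rubric prefix on every call and let the
# serving engine cache its KV blocks across requests.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
RUBRIC = "You are a sales-email judge. Rules: ... (immutable canonical rubric)"

def judge(item: str) -> str:
    # Same prefix every time -> shared KV blocks; only the item differs.
    prompt = f"{RUBRIC}\n\nItem to score:\n{item}\n\nVerdict:"
    out = llm.generate([prompt], SamplingParams(max_tokens=128, temperature=0))
    return out[0].outputs[0].text
```

The point is that per-call rubric replay becomes cheap enough to be the default, instead of leaning on one mutable thread to remember policy.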
A concrete example from my evaluator work
In my Week 11 bench, part of the scoring contract encodes “don’t overcommit under weak evidence” using simple lexical cues—downgrade language versus hard commitment language. That is a deliberate simplification, but it makes the drift mechanism obvious:
If the loop injects urgency language from tool output, the model may echo that tone.
Once it emits an overconfident phrase, that phrase becomes recent context on the next step.
The original rubric may still be present in the window, but the next-token distribution is increasingly steered by the trajectory, not only by the static policy block.
The point is not that lexical checks are enough for production. The point is that decode-time routing + self-conditioning can break instruction fidelity even when the rules are technically still “in context.”
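For concreteness, here is a hedged sketch of that kind of lexical cue check (the phrase lists are illustrative, not my actual Week 11 lists):

```python
# Illustrative lexical cue check: flag hard commitment language that has
# no downgrade/hedging language anywhere in the same text.
import re

COMMIT = [r"\bguarantee[sd]?\b", r"\bdefinitely\b", r"\bwill close\b"]
DOWNGRADE = [r"\bmight\b", r"\bcould\b", r"\bif\b", r"\bsubject to\b"]

def overcommits(text: str) -> bool:
    t = text.lower()
    has_commit = any(re.search(p, t) for p in COMMIT)
    has_hedge = any(re.search(p, t) for p in DOWNGRADE)
    return has_commit and not has_hedge

print(overcommits("We will close this quarter, guaranteed."))            # True
print(overcommits("We might close this quarter if legal signs off."))    # False
```

A check this crude still catches the compounding failure described above: one overconfident sentence passes into context, and the next turn inherits its tone.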
What to do in production (beyond “use a bigger window”)
These are the interventions I take seriously for evaluators, judges, and agent planners:
Prefer stateless or quasi-stateless calls. Rebuild each step from an immutable rubric plus compact task state plus the current item. Do not assume one long thread keeps policy alive just because it fits. (A sketch combining several of these patterns follows this list.)
Separate policy memory from episodic memory. Summarize observations and decisions; do not summarize the controlling rules into the same lossy blob unless you are okay with slow policy erosion.
Re-anchor rules near the decision. If a constraint matters for this verdict, repeat a compressed version of it immediately before the judgment step—not only at the front of a huge trace.
Add an explicit rule-recall pass. Make the model name the applicable rubric items before it commits to a final answer. This is a cheap guardrail against silent drift.
Structure the output. Schemas and checklists make violations easier to detect and correct than free-form prose alone.
Keep hard guarantees outside the model when possible. Deterministic validators, allowlists, and post-checks should enforce what you cannot afford to “mostly” follow.
Measure drift, not just accuracy. Track instruction adherence over turns and variance across seeds. Evaluators fail quietly; your metrics should be loud.
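Here is a minimal sketch tying several of these together (all names and the JSON schema are hypothetical, and the client call at the end is a placeholder): rebuild the messages statelessly, re-anchor the binding rule just before the verdict, and demand rule recall in structured output.

```python
# Hypothetical quasi-stateless evaluator step: immutable rubric replayed
# every call, compact episodic state, rule re-anchoring at decision time,
# and an explicit rule-recall field in the structured output.
RUBRIC = "Rules:\n1. Never overcommit under weak evidence.\n2. Cite the evidence used."

def build_messages(task_state: str, item: str) -> list[dict]:
    return [
        {"role": "system", "content": RUBRIC},                 # replayed every call
        {"role": "user", "content": (
            f"Compact state so far:\n{task_state}\n\n"
            f"Item to judge:\n{item}\n\n"
            "Reminder (re-anchor): Rule 1 -- never overcommit under weak evidence.\n"
            "Before your verdict, list the rubric rule numbers that apply.\n"
            'Answer as JSON: {"rules_applied": [...], "verdict": "...", "evidence": "..."}'
        )},
    ]

messages = build_messages("2 items judged, 0 violations", "Email promises a Q3 launch.")
# response = client.chat.completions.create(model=..., messages=messages)  # hypothetical client
```

Each call carries its own policy, so nothing depends on a thousand-turn thread keeping the rubric influential.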
What I am explicitly not claiming
This post focuses on inference-time mechanics and systems patterns. Training-time fixes (instruction hierarchy fine-tuning, verifiable RL-style constraint learning, etc.) matter too, but they are a different lever. Get the loop architecture right first; otherwise you are paying training costs to compensate for a bad control path.
Further reading
Vaswani et al., Attention Is All You Need — attention definition and decoding stack: https://arxiv.org/pdf/1706.03762
Xiao et al., StreamingLLM — attention sinks and long streaming behavior: https://arxiv.org/pdf/2309.17453
Hugging Face Transformers — KV caching / generation: https://huggingface.co/docs/transformers/cache_explanation
vLLM — prefix caching design: https://docs.vllm.ai/en/stable/design/prefix_caching/
Closing
If you take one line away, take this: instruction drift in long loops is often a control-path failure, not a storage failure. Tokens can remain visible while losing authority; sinks can make early attention misleading; caches change cost and architecture, not the fundamental competition between old rules and new trajectory. The strongest fixes replay policy intentionally, separate policy from trajectory, re-anchor at decision time, and enforce hard constraints outside the model’s narrative.