Frank Brsrk
Why your LLM agent drifts off-task by step 4 (and why prompts can't fix it)

Self-reflection is just another step in the chain.

If you've shipped a multi-step LLM agent to production, you've watched this happen. Step 1 starts on task. Step 2 still looks right. By step 4 the agent is confidently solving a different problem, the original goal is gone, and your prompt engineering didn't stop it.

This isn't a model-size problem. It's an architectural one. And it doesn't get fixed by a smarter prompt.

Why reasoning decays

Multi-step reasoning is sequential conditioning: step N+1 takes step N as input, so errors compound multiplicatively. A two-percent error per step is roughly eight percent cumulative drift by step four, and roughly fifteen percent by step eight.
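For concreteness, here's that arithmetic as a minimal Python sketch; the two-percent rate is just the illustrative figure above, not a measured value:

```python
# Illustrative arithmetic only: per-step errors that compound multiplicatively.
def cumulative_drift(steps: int, per_step_error: float = 0.02) -> float:
    """Probability the chain has gone off-track at least once after `steps` steps."""
    return 1 - (1 - per_step_error) ** steps

print(f"step 4: {cumulative_drift(4):.1%}")  # ~7.8%
print(f"step 8: {cumulative_drift(8):.1%}")  # ~14.9%
```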

The drift goes undetected because each step scores itself against its immediate predecessor, not against the original objective. Meanwhile, the original objective is decaying via attention. Transformer attention is a softmax over context; as the chain grows, every token (including your original instructions) loses relative weight. The system prompt that was a binding contract at step one is noise by step thirty.
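Here's a toy numerical sketch of that dilution, with random logits standing in for a real attention head; none of these numbers are measurements, only an illustration of a fixed instruction block losing share as the context grows:

```python
import numpy as np

# Toy illustration, not a real transformer: as the context grows, a fixed
# block of instruction tokens competes with more and more tokens for the
# same softmax attention budget, so its share of the mass shrinks.
rng = np.random.default_rng(0)
instruction_tokens = 50  # hypothetical size of the system prompt

for context_len in (200, 1_000, 5_000):
    logits = rng.normal(size=context_len)            # stand-in attention scores
    weights = np.exp(logits) / np.exp(logits).sum()  # softmax over the whole context
    share = weights[:instruction_tokens].sum()       # mass left for the instructions
    print(f"context {context_len:>5} tokens -> instruction share ~ {share:.1%}")
```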

So reasoning decay is two failures stacked: errors compounding forward, instructions decaying backward. The middle of the chain is a blind spot in both directions.

Why the current stack doesn't close it

Prompts are tokens in the same context window; they decay with everything else. Fine-tuning shifts the model's distribution but doesn't change the softmax attention that causes the dilution. RAG injects more tokens, which crowds the attention budget further. Agent loops (ReAct, planner-executor, Reflexion) are sequences of LLM calls, and each call is subject to the same decay, compounded by chain length.

The pattern is the same across all of them: each operates inside the same decaying chain that caused the failure. You cannot stabilize a chain with structure that lives inside the chain.

What actually fixes it

The missing layer is structure that gets reinjected at a cadence calibrated to its own empirical decay rate. Not a prompt at position one. A scaffold pulled into context for the relevant step, with three properties:

Reinjection at a measured half-life. In our benchmarks, scaffold persistence half-life is 24 turns. Reinjecting at or below that cadence keeps the signal above the decay threshold (a minimal scheduling sketch follows these three properties).

Suppression edges, not just instructions. Tell the model what NOT to do alongside the exact procedure step that would otherwise trigger it.

Meta-checkpoints between steps. The scaffold pauses mid-execution, audits whether the named failure patterns are actually being suppressed, and branches to a corrective path if not.
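Here's a hypothetical sketch of the reinjection cadence, assuming a plain chat-message list; the 24-turn half-life is the measured figure above, but the helper names and message format are illustrative, not the actual harness:

```python
# Hypothetical reinjection scheduler. SCAFFOLD_TEXT is a placeholder for a
# real scaffold; the message dict format is an assumption for illustration.
SCAFFOLD_TEXT = "N{...} S1: ... M{...} Suppress: ..."
HALF_LIFE_TURNS = 24              # measured scaffold persistence half-life
REINJECT_EVERY = HALF_LIFE_TURNS  # reinject at or below the half-life

def maybe_reinject(messages: list[dict], turn: int) -> list[dict]:
    """Re-append the scaffold as a system message once per half-life window."""
    if turn > 0 and turn % REINJECT_EVERY == 0:
        messages = messages + [{"role": "system", "content": SCAFFOLD_TEXT}]
    return messages
```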

Here's a fragment of one, applied to causal reasoning:

N{accept_any_causal_assertion_backed_only_by_cooccurrence}

S1: identify each causal assertion and isolate the claimed cause to effect link.
S2: demand the mechanistic evidence chain connecting cause to effect.
G1{mechanism provided?} --no--> HALT: claim rejected.

M{Am I genuinely probing for confounds, or performing a soft challenge the claim easily survives because I share its unverified assumptions?}
--working--> S3: check for confounds.
--failing--> ABANDON_GRAPH
to FREEFORM{name one specific confound I avoided and one reverse-causal scenario I refused to construct}
to RE-ENTER at S2.

Suppress: shared_assumptions, unverified_causal_claims.

N{} is the failure mode this scaffold exists to block. S1, S2, G1 are the executable procedure. M{} is the meta-checkpoint: mid-execution, the model audits whether it's actually probing for confounds or just performing the appearance of doing so. If it's faking, it abandons the prescribed path, reflects on the specific confound it avoided, and re-enters at S2.
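To make that control flow concrete, here's a hypothetical executor for the fragment; the step names mirror the scaffold, but `llm` and the prompt strings are illustrative stand-ins, not the actual harness:

```python
from typing import Callable

# Hypothetical executor: S1/S2/G1/M/FREEFORM mirror the scaffold above.
def run_causal_scaffold(claim: str, llm: Callable[[str], str]) -> str:
    link = llm(f"S1: isolate the claimed cause-to-effect link in: {claim}")
    while True:
        mechanism = llm(f"S2: give the mechanistic evidence chain for: {link}")
        if "no mechanism" in mechanism.lower():        # G1 gate: no mechanism, halt
            return "HALT: claim rejected"
        audit = llm("M: am I genuinely probing for confounds, or performing a "
                    "soft challenge the claim easily survives? Answer working or failing.")
        if "failing" in audit.lower():                 # ABANDON_GRAPH
            llm("FREEFORM: name one specific confound I avoided and one "
                "reverse-causal scenario I refused to construct")
            continue                                   # RE-ENTER at S2
        return llm(f"S3: check for confounds in: {mechanism}")
```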

The receipts

We ran this on LiveCodeBench Hard (the official Hard subset, 28 tasks). Baseline Claude Opus 4.6 with max-effort thinking: 24/28 pass. Same model with the harness wired in as a tool: 28/28. Zero regressions.

Full benchmark set, including the cross-model result on GPT-4o (ELEPHANT sycophancy benchmark, minus 5pp framing sycophancy) and the cross-lab blind eval with four judges from four different model families, is on GitHub under CC BY 4.0: github.com/ejentum/benchmarks

The full four-mechanism taxonomy (reasoning decay is one of four; the others are attention decay, sycophantic collapse, and hallucination drift) and the paper are at ejentum.com.

x.com/ejentum
github.com/ejentum/benchmarks
ejentum.com, no card required. Available as MCP, n8n node, PyPI package, or HTTP.
