Frank Brsrk
Why your LLM agent drifts off-task by step 4 (and why prompts can't fix it)

Self-reflection is just another step in the chain.

If you've shipped a multi-step LLM agent to production, you've watched this happen. Step 1 starts on task. Step 2 still looks right. By step 4 the agent is confidently solving a different problem, the original goal is gone, and your prompt engineering didn't stop it.

This isn't a model-size problem. It's an architectural one. And it doesn't get fixed by a smarter prompt.

Why reasoning decays

Multi-step reasoning is sequential conditioning: step N+1 takes step N as input, so errors compound multiplicatively. A two-percent error per step is roughly eight percent cumulative drift by step four, and roughly fifteen percent by step eight.
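For concreteness, here's that arithmetic as a minimal Python sketch; the two-percent rate is just the illustrative figure above, not a measured value:

```python
# Illustrative arithmetic only: per-step errors that compound multiplicatively.
def cumulative_drift(steps: int, per_step_error: float = 0.02) -> float:
    """Probability the chain has gone off-track at least once after `steps` steps."""
    return 1 - (1 - per_step_error) ** steps

print(f"step 4: {cumulative_drift(4):.1%}")  # ~7.8%
print(f"step 8: {cumulative_drift(8):.1%}")  # ~14.9%
```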

The drift goes undetected because each step scores itself against its immediate predecessor, not against the original objective. Meanwhile, the original objective is decaying via attention. Transformer attention is a softmax over context; as the chain grows, every token (including your original instructions) loses relative weight. The system prompt that was a binding contract at step one is noise by step thirty.
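Here's a toy numerical sketch of that dilution, with random logits standing in for a real attention head; none of these numbers are measurements, only an illustration of a fixed instruction block losing share as the context grows:

```python
import numpy as np

# Toy illustration, not a real transformer: as the context grows, a fixed
# block of instruction tokens competes with more and more tokens for the
# same softmax attention budget, so its share of the mass shrinks.
rng = np.random.default_rng(0)
instruction_tokens = 50  # hypothetical size of the system prompt

for context_len in (200, 1_000, 5_000):
    logits = rng.normal(size=context_len)            # stand-in attention scores
    weights = np.exp(logits) / np.exp(logits).sum()  # softmax over the whole context
    share = weights[:instruction_tokens].sum()       # mass left for the instructions
    print(f"context {context_len:>5} tokens -> instruction share ~ {share:.1%}")
```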

So reasoning decay is two failures stacked: errors compounding forward, instructions decaying backward. The middle of the chain is a blind spot in both directions.

Why the current stack doesn't close it

Prompts are tokens in the same context window; they decay with everything else. Fine-tuning shifts the model's distribution but doesn't change the softmax attention that causes the dilution. RAG injects more tokens, which crowds the attention budget further. Agent loops (ReAct, planner-executor, Reflexion) are sequences of LLM calls, and each call is subject to the same decay, compounded by chain length.

The pattern is the same across all of them: each operates inside the same decaying chain that caused the failure. You cannot stabilize a chain with structure that lives inside the chain.

What actually fixes it

The missing layer is structure that gets reinjected at a cadence calibrated to its own empirical decay rate. Not a prompt at position one. A scaffold pulled into context for the relevant step, with three properties:

Reinjection at a measured half-life. In our benchmarks, scaffold persistence half-life is 24 turns. Reinjecting at or below that cadence keeps the signal above the decay threshold (a minimal scheduling sketch follows these three properties).

Suppression edges, not just instructions. Tell the model what NOT to do alongside the exact procedure step that would otherwise trigger it.

Meta-checkpoints between steps. The scaffold pauses mid-execution, audits whether the named failure patterns are actually being suppressed, and branches to a corrective path if not.
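Here's a hypothetical sketch of the reinjection cadence, assuming a plain chat-message list; the 24-turn half-life is the measured figure above, but the helper names and message format are illustrative, not the actual harness:

```python
# Hypothetical reinjection scheduler. SCAFFOLD_TEXT is a placeholder for a
# real scaffold; the message dict format is an assumption for illustration.
SCAFFOLD_TEXT = "N{...} S1: ... M{...} Suppress: ..."
HALF_LIFE_TURNS = 24              # measured scaffold persistence half-life
REINJECT_EVERY = HALF_LIFE_TURNS  # reinject at or below the half-life

def maybe_reinject(messages: list[dict], turn: int) -> list[dict]:
    """Re-append the scaffold as a system message once per half-life window."""
    if turn > 0 and turn % REINJECT_EVERY == 0:
        messages = messages + [{"role": "system", "content": SCAFFOLD_TEXT}]
    return messages
```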

Here's a fragment of one, applied to causal reasoning:

N{accept_any_causal_assertion_backed_only_by_cooccurrence}

S1: identify each causal assertion and isolate the claimed cause to effect link.
S2: demand the mechanistic evidence chain connecting cause to effect.
G1{mechanism provided?} --no--> HALT: claim rejected.

M{Am I genuinely probing for confounds, or performing a soft challenge the claim easily survives because I share its unverified assumptions?}
--working--> S3: check for confounds.
--failing--> ABANDON_GRAPH
to FREEFORM{name one specific confound I avoided and one reverse-causal scenario I refused to construct}
to RE-ENTER at S2.

Suppress: shared_assumptions, unverified_causal_claims.

N{} is the failure mode this scaffold exists to block. S1, S2, G1 are the executable procedure. M{} is the meta-checkpoint: mid-execution, the model audits whether it's actually probing for confounds or just performing the appearance of doing so. If it's faking, it abandons the prescribed path, reflects on the specific confound it avoided, and re-enters at S2.
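To make that control flow concrete, here's a hypothetical executor for the fragment; the step names mirror the scaffold, but `llm` and the prompt strings are illustrative stand-ins, not the actual harness:

```python
from typing import Callable

# Hypothetical executor: S1/S2/G1/M/FREEFORM mirror the scaffold above.
def run_causal_scaffold(claim: str, llm: Callable[[str], str]) -> str:
    link = llm(f"S1: isolate the claimed cause-to-effect link in: {claim}")
    while True:
        mechanism = llm(f"S2: give the mechanistic evidence chain for: {link}")
        if "no mechanism" in mechanism.lower():        # G1 gate: no mechanism, halt
            return "HALT: claim rejected"
        audit = llm("M: am I genuinely probing for confounds, or performing a "
                    "soft challenge the claim easily survives? Answer working or failing.")
        if "failing" in audit.lower():                 # ABANDON_GRAPH
            llm("FREEFORM: name one specific confound I avoided and one "
                "reverse-causal scenario I refused to construct")
            continue                                   # RE-ENTER at S2
        return llm(f"S3: check for confounds in: {mechanism}")
```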

The receipts

We ran this on LiveCodeBench Hard (the official Hard subset, 28 tasks). Baseline Claude Opus 4.6 with max-effort thinking: 24/28 pass. Same model with the harness wired in as a tool: 28/28. Zero regressions.

Full benchmark set, including the cross-model result on GPT-4o (ELEPHANT sycophancy benchmark, minus 5pp framing sycophancy) and the cross-lab blind eval with four judges from four different model families, is on GitHub under CC BY 4.0: github.com/ejentum/benchmarks

The full four-mechanism taxonomy (reasoning decay is one of four; the others are attention decay, sycophantic collapse, and hallucination drift) and the paper are at ejentum.com.

x.com/ejentum
github.com/ejentum/benchmarks
ejentum.com, no card required. Available as MCP, n8n node, PyPI package, or HTTP.
