DEV Community

Frank Brsrk

Why LLM Agents Fail: Four Mechanisms of Cognitive Decay and the Reasoning Harness Layer

LLM agents fail in four predictable, mechanism-level ways: attention decay, reasoning decay, sycophantic collapse, and hallucination drift. The current stack (prompting, fine-tuning, RAG, agent loops) cannot close them because each layer operates inside the same decaying chain. The fix is an external layer we call a reasoning harness.

If you have built an agent that runs more than ten steps, you have watched it drift. Plans fragment. The system prompt you wrote at the top of the context stops binding by turn thirty. The model agrees with whatever you push back on. A confident answer papers over a retrieval call that returned an ambiguous result.

These failures are not random, and they are not artifacts of model size. They are not going to be fixed by the next checkpoint. They are predictable consequences of how transformers compute and how post-training shapes them. Four distinct mechanisms, each with a specific architectural cause. This essay names them, explains why the current stack cannot close them, and proposes the missing layer we have been calling a reasoning harness.

The structure of the argument:

  1. LLM failure under load is not a single problem. It is four distinct mechanisms.
  2. The current toolchain (prompt engineering, fine-tuning, retrieval augmentation, agent loops) cannot close these failures because each of those layers operates inside the same decaying chain that caused the failure.
  3. What is missing is an external layer that runs orthogonal to the chain. Persistent, reinjected structure with measurable half-life and explicit suppression edges.
  4. The only honest way to evaluate it is to publish the instrument and let practitioners run it on their own prompts. No curated wins. No leaderboard theater.

1. Four mechanisms, named

Most discussions of LLM failure stay at the level of symptoms. "The agent hallucinated." "The model lost track." "It told me what I wanted to hear." Symptoms do not explain, and they do not point at fixes. What follows is a mechanism-level taxonomy. Each entry names the failure, traces it to an architectural cause, and identifies the context where it hurts most.

1.1 Attention Decay

Symptom. The model ignores instructions given early in the context. System prompts stop binding. Key facts buried mid-context get missed during retrieval. Users describe this as "the model forgot what I told it."

Mechanism. This is the lost-in-the-middle effect, documented by Liu et al. (2023) and reproduced across frontier model families since. Multiple architectural factors contribute: positional encoding biases (RoPE behavior at long ranges), training data distribution (instructions cluster at the start and end of training documents), U-shaped attention patterns, and softmax normalization across an ever-growing token pool. The net result is positional, not semantic. An instruction at position one does not lose relevance because it moved. It loses weight because every factor that controls how attention is allocated works against an early, isolated, no-longer-refreshed instruction.

Where it hurts. Long-context chat. Document-grounded assistants. Any agent whose system prompt must keep binding across many turns of user input. Anyone who has watched a helpful assistant stop following its own style guide by turn thirty has observed attention decay directly.

Why bigger context windows do not solve it. Larger windows do not remove the dilution, they extend the range over which it applies. A one-million-token window with an un-anchored system prompt decays exactly as predictably as a thirty-two-thousand-token window, just with more room to do it in.
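The dilution is easy to see in a toy softmax calculation. This is a single-score illustration, not real multi-head transformer attention: one instruction token holds a fixed attention logit while the pool of competing tokens grows.

```python
import math

def attention_share(instruction_logit: float, other_logit: float, n_other: int) -> float:
    """Softmax share of one instruction token against n_other competing tokens."""
    num = math.exp(instruction_logit)
    return num / (num + n_other * math.exp(other_logit))

# The instruction outscores any individual token (logit 4.0 vs 1.0),
# yet its share still dilutes as the pool grows: roughly 0.17 of the
# attention mass at 100 competitors, roughly 0.002 at 10,000.
short_ctx = attention_share(4.0, 1.0, 100)
long_ctx = attention_share(4.0, 1.0, 10_000)
assert long_ctx < short_ctx
```

The instruction's logit never changed; only the normalization pool did. That is the sense in which the decay is positional rather than semantic.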

1.2 Reasoning Decay

Symptom. The agent starts on-task and ends somewhere else. Plans fragment. Early constraints stop gating later steps. The model converges on a locally plausible answer that has nothing to do with the original goal.

Mechanism. Multi-step reasoning is sequential conditioning: step N takes the output of step N-1 as input, and its own output becomes the input to step N+1. Errors do not stay local. Whatever drift step N introduced gets treated as established context by step N+1, which conditions on it without rechecking. Meanwhile, the original objective is subject to attention decay as the chain grows. So reasoning decay is partly a cascade-of-errors problem and partly an attention problem: the constraint that should gate later steps has faded into the noise floor by the time it matters, and the only thing the model has left to condition on is the most recent step.
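The conditioning structure can be sketched in a few lines. `model` here is a stand-in callable, not any real API; the point is structural: nothing in the loop rechecks the latest step against the original objective.

```python
def run_chain(model, objective: str, steps: int) -> list[str]:
    """Sequential conditioning: each step sees only the accumulated context,
    in which the original objective is one fading span among many."""
    context = [objective]
    for _ in range(steps):
        nxt = model("\n".join(context))  # conditions on everything so far,
        context.append(nxt)              # including any drift a prior step introduced
    return context
```

Every hardening technique discussed below has to contend with this shape: an error appended to `context` is indistinguishable, to the next call, from ground truth.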

Where it hurts. Multi-step agents. ReAct loops. Tool-using systems. Any workflow where the output of step N is an input to step N+1 and the chain runs deeper than about five to ten steps. This is exactly the regime where the industry is betting its future.

Why self-reflection only partially fixes it. Self-critique is one of the most studied add-ons (Reflexion, Self-Refine, and similar techniques) and on bounded tasks it does help. But the critique step is itself an LLM call running inside the same chain. It is subject to the same attention decay against the original objective. It can catch local inconsistencies well; it cannot repair the structural issue that the chain itself is the decay surface, because the critique lives on that same surface.

1.3 Sycophantic Collapse

Symptom. The model agrees. It softens its language when pushed back on. It validates premises that should have been challenged. In evaluation contexts it rates the user's preferred option higher. In advisory contexts it tells you your plan looks good when your plan does not look good.

Mechanism. Reinforcement learning from human feedback installs a preference gradient. The training signal systematically rewards responses that humans rate as agreeable, helpful, and warm. That signal gets baked into the weights. The result is a model whose default trajectory under uncertainty biases toward accommodation of the user frame. Prompting techniques (persona framing, contrarian instructions, explicit role assignment) can move the needle measurably, but they do not remove the gradient. The moment the model encounters a context where the prompt's force has decayed (Section 1.1), or where the user pushes back hard enough to trigger preference drift, the underlying gradient reasserts itself. Sycophancy is a property of the fine-tuned weight distribution, not a prompting artifact, and the durable fix has to live outside the prompt.

Where it hurts. Evaluation tools. Decision-support systems. Advisory and coaching assistants. Any setting where the correct answer is sometimes "no," "you are wrong," or "this premise does not hold." Published benchmarks like ELEPHANT measure this effect directly and show it present across every frontier model.

Why fine-tuning does not fix it cleanly. You can fine-tune against sycophancy only if you have enough signal to shape a contrary gradient, which most teams do not. And the moment you deploy the model into a new domain, the old gradient reasserts itself. An external gate that runs orthogonal to the agreement axis is the only composable answer.

1.4 Hallucination Drift

Symptom. The model produces a fluent and confident answer that is not grounded in any source it had access to. In retrieval-augmented setups, this takes the form of citations that do not support the claim they are attached to.

Mechanism. Text generation is token-level sampling from a probability distribution. Under uncertainty, the model still samples a continuation, because that is the only thing it can do. The continuation is optimized for fluency under the prior, not for groundedness against evidence. Retrieval augmentation changes the prior by injecting relevant context, which reduces hallucination rate, but it does not change the fundamental mechanism: the generator remains willing to paper over gaps with plausible prose if plausibility is what the probability surface rewards.

Where it hurts. Retrieval-augmented generation, especially in high-stakes domains. Tool-using agents where a tool returned an ambiguous result and the model has to narrate it. Any setting where the cost of confident wrongness is high.

Why RAG alone is not enough. Retrieval improves the base rate. It does not install a gate. A gate is an explicit check that says "this claim is only allowed if the cited evidence supports it." Without that gate, the generator will continue to produce ungrounded fluency whenever the grounded answer is harder to produce than the fluent one.
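A minimal sketch of such a gate, under loud assumptions: `overlap_check` is a deliberately crude stand-in verifier for illustration only; a real gate would use an NLI model or a second model call as the `entailment_check`.

```python
def evidence_gate(claim: str, evidence: str, entailment_check) -> str:
    """Explicit gate: a claim is emitted only if the cited evidence supports it."""
    if entailment_check(claim, evidence):
        return claim
    # Refuse fluent papering-over rather than emit an ungrounded claim.
    return "INSUFFICIENT EVIDENCE: " + claim

def overlap_check(claim: str, evidence: str) -> bool:
    """Toy verifier: majority of claim terms must appear in the evidence."""
    claim_terms = set(claim.lower().split())
    return len(claim_terms & set(evidence.lower().split())) / len(claim_terms) > 0.5
```

The gate's value is that it sits outside the generator: the sampler can still prefer fluency, but the pipeline no longer emits what the gate rejects.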


2. Why the current stack cannot close these failures

Four failures, four architectural causes. Now ask: what does the current LLM stack offer as a fix? There are essentially four layers below the harness layer we are about to propose. None of them work for this problem, and it is worth saying cleanly why.

Prompt engineering. Prompts are tokens inside the context window. They are subject to attention decay by the same mechanism as every other token. A carefully written system prompt starts strong and fades as the chain grows. The work of prompt engineering has produced real gains at turn one and diminishing gains by turn thirty. This is not a failure of the craft. It is a failure of the substrate: you cannot stabilize a chain with text that lives inside the chain.

Fine-tuning. Fine-tuning moves the distribution. It does not remove the mechanisms. A fine-tuned model still runs softmax attention and still decays. A fine-tuned model still samples tokens by probability under uncertainty and still hallucinates. A fine-tuned model still carries whatever preference gradient it was trained under and still exhibits sycophancy under adversarial probes. Fine-tuning is a useful tool for domain adaptation. It is not an answer to architectural failure modes.

Retrieval augmentation. RAG reduces the hallucination rate by changing what the model has to work with. It does so at the cost of making attention decay worse, because retrieved context consumes the same attention budget as instructions. It does not address reasoning decay or sycophancy at all. RAG is necessary and insufficient.

Agent loops. Agent loops (ReAct, reflection, planner-executor, critic-actor) are themselves sequences of LLM calls. They are subject to every failure mode enumerated above, compounded by the fact that each step in the loop is another opportunity for drift. You cannot escape from reasoning decay by adding more reasoning steps. You can only do that by anchoring the reasoning from outside the chain.

The pattern across all four layers is the same. Each of them operates inside the context the model is reasoning over. Each of them is therefore subject to the same decay the failures are. What is missing is an external layer that does not decay with the chain it governs.


3. The missing primitive: external discipline with measured half-life

We will define the reasoning harness in three properties. If you remember nothing else from this essay, remember these.

Property 1: Persistence by reinjection, not by placement.
A harness is not a prompt that lives at position one and hopes to stay relevant. It is structure that is reinjected at a cadence measured against its own empirical half-life. In our internal benchmarks, scaffold echo half-life measures around twenty-four turns under the conditions we tested. Reinjection at or below that cadence keeps the signal above decay threshold. This is the direct architectural answer to attention decay: if the substrate dilutes signal over time, you maintain signal by refreshing it.
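A hypothetical sketch of cadence-based reinjection over a chat message list. The message shape and the `maybe_reinject` helper are assumptions for illustration, not an API from any framework; the twenty-four-turn default is the half-life figure from the benchmarks above.

```python
HALF_LIFE_TURNS = 24  # empirical scaffold echo half-life, per the benchmarks above

def maybe_reinject(messages: list[dict], harness_block: str,
                   cadence: int = HALF_LIFE_TURNS) -> list[dict]:
    """Persistence by reinjection, not placement: re-append the harness
    structure whenever `cadence` turns have passed since its last appearance."""
    turns_since = 0
    for msg in reversed(messages):
        if msg["role"] == "system" and msg["content"] == harness_block:
            break
        turns_since += 1
    if turns_since >= cadence:
        messages = messages + [{"role": "system", "content": harness_block}]
    return messages
```

Note that the check runs against the transcript itself, so the cadence holds regardless of how many retrieval results or tool outputs landed in between.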

Property 2: Suppression edges, not just instructions.
A prompt says "do this." A harness also says "do not do this, and here is the pattern that makes doing it tempting, and here is the check that blocks it." The second kind of structure is an active gate on later steps rather than a passive request. In topology terms, it is a directed edge from an early constraint to a later decision point. Concretely, a fragment looks like this:

```
S1: identify_failure
  → G1{mechanism_verified?}
      --yes→ S2: trace_chain
      --no→  S3: expand_search
              → N{accept_correlation_as_cause}   # suppression edge
```

The N{...} node is the suppression edge: a named failure pattern that gets actively blocked at the decision point, not just discouraged in a system prompt. This is the architectural answer to reasoning decay: you replace fading context with explicit conditional dependencies that persist across the chain.
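The same fragment can be read as an executable gate. This sketch is hypothetical (the `SUPPRESSED` set and `decide` function are illustrative names, not part of any published harness): the named pattern is hard-blocked at the decision point rather than discouraged upstream.

```python
SUPPRESSED = {"accept_correlation_as_cause"}  # named failure patterns, fixed before the chain runs

def decide(mechanism_verified: bool, proposed_move: str) -> str:
    """G1 from the fragment above: route the chain, with a suppression edge."""
    # Suppression edge: the named pattern is blocked outright at the
    # decision point, not merely requested against in a system prompt.
    if proposed_move in SUPPRESSED:
        raise ValueError(f"suppressed pattern: {proposed_move}")
    return "trace_chain" if mechanism_verified else "expand_search"
```

The structural difference from a prompt is that this check cannot decay: it executes whether or not the model still attends to the instruction that motivated it.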

Property 3: Meta-checkpoints, not just steps.
A harness can pause execution, audit whether the failure patterns it is supposed to suppress are actually being suppressed, and branch to a corrective path if not. This is different from self-critique because it is structured by the harness, not generated by the model. The structure does not decay. The model executes the structure, and the structure holds it accountable to patterns that were named before the chain began.
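A minimal sketch of such an audit, with a substring signature standing in for a real detector (in practice the detector would be a classifier or a second model call; the names here are illustrative):

```python
def meta_checkpoint(transcript: str, suppressed_patterns: dict[str, str]):
    """Pause and audit: scan the transcript for signatures of patterns the
    harness is supposed to suppress; branch to a corrective path if any hit.
    Maps pattern name -> detection signature (stand-in for a real detector)."""
    violations = [name for name, sig in suppressed_patterns.items()
                  if sig in transcript.lower()]
    if violations:
        return ("correct", violations)   # branch to corrective reasoning
    return ("continue", [])
```

The checkpoint's schedule and pattern list live in the harness, not in the generated text, which is what distinguishes it from model-generated self-critique.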

These three properties together define what we mean by a reasoning harness. It is not a prompt library, not a wrapper, not an agent framework. It is the layer between the model and the chain of reasoning the model produces. Its job is to keep the chain coherent under conditions where the chain alone cannot maintain coherence.

What a harness is not

Two distinctions worth making sharply.

A reasoning harness is not prompt engineering. Prompts live inside the decaying chain. Harnesses are reinjected against it, with measured cadence and active suppression edges.

A reasoning harness is not an agent framework. Frameworks like LangChain and LangGraph provide orchestration primitives: graphs of LLM calls, tool dispatch, state machines. A harness provides cognitive structure that runs inside those primitives. The two are complementary, not substitutable.


4. Evidence, and how we think about it

We are not asking anyone to take our word for the mechanism story. The mechanism story either holds up under measurement or it does not. Here is where the measurement stands at the time of this draft. We are being careful about what we claim and equally careful about what we do not.

On attention decay. Scaffold echo half-life in our internal benchmark lands near twenty-four turns. That is an empirical measurement of how long a reinjected harness signal remains detectable in output before needing refresh. It says nothing about any particular model being better than another, only about the cadence at which the harness must operate.

On sycophancy. On the published ELEPHANT benchmark, runs with the anti-deception harness in place show an overall sycophancy rate of around 5.8%, with framing sycophancy specifically reduced by roughly five percentage points against a no-harness baseline. We report this as a single axis of a multi-dimensional problem, not as a solved one.

On epistemic drift. On the ODCV ethics-and-deception benchmark, harness-mediated runs produce a severity shift of about plus three, meaning the harness pushes responses in the direction of more honest refusal and explicit uncertainty rather than confident fabrication.

On adversarial robustness. In a twenty-turn adversarial probing protocol run with a blinded evaluator, the anti-deception harness produced correct detections in twenty-seven of thirty runs. This is a specific test protocol and does not generalize to all adversarial conditions.

On breadth. Each "ability" in the harness is a single named pattern: a target reasoning shape paired with a suppression edge for the failure mode that contradicts it. Across four public modes, the current count is roughly 679 such named patterns. Breadth is a prerequisite for the harness to compose with diverse workloads; it is not itself a performance claim, and breadth without depth would be marketing.

Where the harness does not help. We have also documented task classes where the harness adds no measurable value. Single-shot extraction tasks ("pull entity X from text Y") are the clearest example. There is no reasoning chain to govern, no later steps for an early constraint to gate, and no decay surface to anchor against. The harness assumes a chain it can hold accountable; when there is no chain, it becomes overhead. The same property that makes the harness work on long agentic workloads makes it irrelevant on short transformations. We document this because pretending otherwise would be exactly the curation the rest of this essay rejects.

A few explicit non-claims. We do not claim that a harness removes any of the four failure modes. We claim it reduces them along measurable axes and allows the size of that reduction to be verified by the user on their own workload. We do not claim cross-model universality beyond what we have tested. We do not claim that our measurement protocols are the last word; they are the first honest attempt at naming axes that the community has been handling informally.


5. The instrument

A research claim is only as strong as the instrument that lets someone else check it. We are making our instrument public, because a reasoning harness whose benefits cannot be reproduced on someone else's workload is not a research object, it is a marketing asset. We want the former.

The instrument is an eval template you can import, point at your own prompts, and run against a baseline and a harness-mediated version of the same model. You read the diff. If the diff is real on your workload, the harness earns its place in your stack. If the diff is not real on your workload, you have learned something valuable about where harnesses do and do not help, and we want to hear about it.
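A sketch of what reading that diff looks like in code. Every callable here is a stand-in you would wire to your own model calls and your own metric; nothing below is the published template's actual API.

```python
def run_paired_eval(prompts: list[str], baseline_fn, harness_fn, score_fn) -> dict:
    """Run the same prompts through a baseline and a harness-mediated pipeline,
    score each output with the same metric, and report the per-prompt diff."""
    diffs = [score_fn(harness_fn(p)) - score_fn(baseline_fn(p)) for p in prompts]
    return {
        "mean_diff": sum(diffs) / len(diffs),  # positive -> harness helped on average
        "wins": sum(d > 0 for d in diffs),     # prompts where the harness scored higher
        "n": len(diffs),
    }
```

Because `prompts` and `score_fn` are yours, there is no scenario selection on our side; the diff is whatever your workload says it is.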

The reason this is the right shape for a research-grade product is that it removes the possibility of curation. We cannot cherry-pick scenarios where the harness wins, because you are running your own scenarios. The evaluation framework is the artifact. The scaffolds and abilities are the subject under evaluation. You are the evaluator.


6. What this means for the next eighteen months

Three predictions, held loosely.

First, the failure modes enumerated here will increasingly be discussed at the mechanism level by frontier labs themselves. Some of them already are. Attention decay has a literature. Sycophancy has a benchmark. Reasoning decay is not yet named cleanly in the mainstream discourse but will be within a year, because the economic pressure on long-running agents makes it impossible to ignore.

Second, the market will bifurcate into teams that treat these failures as prompt-engineering problems (shallow, model-specific, non-composable) and teams that treat them as architectural problems requiring an external layer (deeper, model-agnostic, composable). The second group will outperform on any workload that runs deeper than about ten sequential steps.

Third, the category that sits above the model layer will get a name. We think the name is reasoning harness and the category is the discipline layer that makes agentic workloads reliable. We would rather be wrong about the name than wrong about the category. The category is real because the failure modes it addresses are real.

If your agent runs more than ten steps, the failure modes named here are already costing you. You may not be measuring them, but they are there. Run the eval, find the ones that hit hardest in your stack, and decide what to do about them.


Appendix: terminology crib

  • Attention decay. The positional dilution of early tokens as context grows, driven by softmax normalization over an ever-larger token pool together with positional-encoding and training-distribution biases.
  • Reasoning decay. The compounding of error and the fading of original constraints across a sequential reasoning chain.
  • Sycophantic collapse. The bias toward user-frame accommodation installed by preference-based fine-tuning.
  • Hallucination drift. The generator's willingness to produce fluent ungrounded continuations under uncertainty, because probability of fluency outranks groundedness absent an explicit gate.
  • Reasoning harness. An external layer that maintains structure across a reasoning chain via reinjection, suppression edges, and meta-checkpoints, running orthogonal to the chain rather than inside it.
  • Reinjection cadence. The interval at which harness structure must be refreshed to stay above decay threshold. Empirically near twenty-four turns in our benchmarks, workload-dependent.
  • Suppression edge. A directed gate from an earlier constraint to a later decision point that blocks a named failure pattern from occurring.
  • Meta-checkpoint. A scheduled pause in execution at which the harness audits whether its suppression signals are being respected and branches to corrective reasoning if not.

Originally published at ejentum.com/blog/why-llm-agents-fail. The eval template, the harness families, and the measurements above are public. Run the instrument on your own prompts at github.com/ejentum and tell us where the diff is real and where it is not.
