TL;DR
- Anthropic shows that the real challenge for AI agents is not starting a task, but staying coherent throughout long executions.
- Avoiding cognitive drift requires a runtime harness built on external memory, checkpoints, and continuous re-anchoring.
- The next frontier is not autonomy alone: it is cognitive continuity.
Anthropic and the Runtime Harness: the Real Problem with Agents Is Not Acting, but Not Getting Lost While They Act
If the OpenAI case showed how a repository can be rethought to become readable for agents, the contribution published by Anthropic in Harness design for long-running application development opens an even more delicate question: what happens when the challenge is no longer how to start a task well, but how to keep it alive for hours?
Because this is where many agentic systems truly begin to break.
Not at the first tool call, nor at the first planning step, but perhaps at the twentieth minute, when context starts to thin out, micro-errors begin to accumulate, and the agent keeps acting while preserving only the illusion of coherence.
In its article Harness design for long-running application development, Anthropic puts its finger exactly on this point: the frontier of agent engineering is not simply autonomy, but the persistence of autonomy over time.
The Most Underestimated Failure Mode: Cognitive Drift
Many agents appear to work well as long as we observe them on short tasks:
- generating a component;
- fixing a function;
- calling two or three tools.
But when the task stretches across dozens of files, multiple review phases, intermediate validations, and distributed dependencies, a phenomenon begins that is very familiar to those who use them in real settings: the agent continues to produce output, but progressively loses the center of its own intention.
Anthropic treats this as a structural problem rather than a simple "model limitation", and this is precisely where the runtime harness comes in.
From the Context Window to External Cognition
The starting point is almost brutal: the context window, by itself, is too fragile a memory to sustain long-running tasks.
Even with very large context windows, the model suffers from imperfect compression, unstable salience, priority loss, and partial retrieval of goals.
For this reason, Anthropic builds around Claude an external procedural memory composed of persistent scratchpads, task files, execution summaries, serialized checkpoints, and continuously updated state notes.
In practice, the model is no longer forced to "remember everything", because it can reread what it has already established.
This makes an enormous difference.
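To make the idea concrete, here is a minimal sketch of what such an external procedural memory could look like, as a simple file-based store. The TaskState fields and the checkpoint layout are illustrative assumptions of mine, not Anthropic's actual implementation.

```python
# Minimal sketch of a file-based external procedural memory.
# TaskState, its fields, and the file layout are illustrative assumptions.
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class TaskState:
    goal: str
    constraints: list[str] = field(default_factory=list)
    completed_steps: list[str] = field(default_factory=list)
    open_issues: list[str] = field(default_factory=list)
    last_summary: str = ""


def save_checkpoint(state: TaskState, path: Path) -> None:
    """Serialize the current state so a later turn can reread it."""
    path.write_text(json.dumps(asdict(state), indent=2))


def load_checkpoint(path: Path) -> TaskState:
    """Restore the state instead of relying on the context window alone."""
    return TaskState(**json.loads(path.read_text()))


if __name__ == "__main__":
    state = TaskState(goal="Migrate the billing module",
                      constraints=["keep the public API stable"])
    state.completed_steps.append("Inventoried all call sites")
    save_checkpoint(state, Path("task_state.json"))
    print(load_checkpoint(Path("task_state.json")).goal)
```

Nothing here is sophisticated, and that is the point: the value is not in the data structure, but in the fact that the state lives outside the model's context.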
The Harness as a System of Continuous Re-Anchoring
In the classical paradigm, we tell the agent: continue.
In the Anthropic paradigm, instead, we tell it: stop, reread where you are, summarize what you are doing, update your state, then continue.
This creates a re-anchoring cycle.
The agent is periodically brought back to the goal, to the progress already completed, to the constraints still open, and to the errors that have emerged.
It is a form of "artificial continuity".
Cognition is not allowed to flow in a monolithic way; it is broken apart, recorded, and reconsolidated.
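A rough sketch of one such cycle, assuming a file-based state like the one above and a generic `call_model` placeholder for whatever client drives the agent; the prompt layout and the fields returned by the model are illustrative assumptions:

```python
# Sketch of one re-anchoring cycle: stop, reread, summarize, act, persist.
# `call_model`, the prompt layout, and the result fields are assumptions.
import json
from pathlib import Path

STATE_PATH = Path("task_state.json")


def reanchor_prompt(state: dict) -> str:
    """Rebuild the working context from the externally stored state."""
    return (
        f"Goal: {state['goal']}\n"
        f"Completed so far: {state['completed_steps']}\n"
        f"Open constraints: {state['constraints']}\n"
        "Restate where you are in one short paragraph, then take exactly one next step."
    )


def run_step(call_model, state: dict) -> dict:
    """One cycle: reread the state, act, record the outcome, checkpoint to disk."""
    result = call_model(reanchor_prompt(state))          # act against the re-anchored context
    state["completed_steps"].append(result["step"])      # record what was actually done
    state["last_summary"] = result["summary"]            # reconsolidate the running narrative
    STATE_PATH.write_text(json.dumps(state, indent=2))   # persist before continuing
    return state
```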
Multi-Agent Evaluation: Thinking Is Not Enough, You Need to Be Critiqued
Another interesting aspect of Anthropic's work is the use of generator/evaluator structures: one agent produces, and a second agent evaluates quality, coherence, usability, and adherence to requirements.
The result is not simply "more review".
It is something subtler: verification stops being a final phase and becomes part of cognitive continuity itself.
In this way, each evaluation prevents the primary agent from drifting too far away from the correct trajectory.
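In code, the shape of that loop might look something like this, where `generate` and `evaluate` stand in for two separate model-backed agents and the acceptance logic is an illustrative assumption rather than Anthropic's published rubric:

```python
def generate_and_review(generate, evaluate, task: str, max_rounds: int = 3) -> str:
    """Generator/evaluator loop: the reviewer's critique pulls the generator back on course."""
    draft = generate(task)
    for _ in range(max_rounds):
        verdict = evaluate(task, draft)   # a second agent checks coherence and requirements
        if verdict["acceptable"]:
            return draft
        # Feed the critique back so the next draft stays anchored to the original intent.
        draft = generate(f"{task}\n\nReviewer feedback:\n{verdict['feedback']}")
    return draft
```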
The Runtime Harness Is Not Meant to Make the Agent Act Better: It Is Meant to Make It Think Longer
This is perhaps the most important point: where OpenAI's harness is above all structural, Anthropic's is above all temporal.
The problem it is solving is no longer "how do I get Claude to generate good code?", but "how do I prevent Claude from losing the thread while it continues generating it?".
It sounds like a nuance, but it completely changes the design, because here:
- memory is an external artifact;
- planning is serialized;
- review is recurrent;
- the task is continuously re-anchored.
So this is not only orchestration: it is assisted cognitive continuity.
Conclusion
If OpenAI's repository harness teaches us that an agent needs to live inside a readable codebase, Anthropic reminds us that this is not enough.
An agent may have perfect tools, perfect documentation, and perfect constraints, and still get lost if it is allowed to run for too long without an external memory that keeps it coherent.
And this is where the runtime harness changes the game: it does not merely build an environment in which the agent can act; it builds an environment in which the agent can continue to know why it is acting.
In agent engineering, this may be the difference between episodic automation and real autonomy.