pueding

Posted on Jun 3 • Originally published at learnaivisually.com

Harness-1: State-Externalizing Search Harness

#agents #ai #llm

What: The Harness-1 paper introduces a 20B RL-trained search agent that externalizes its working memory into a structured harness — candidate pools, evidence links, and verification records — instead of an ever-growing transcript.

Why: A deep search agent that replays its whole history every step eventually exhausts the context window. Harness-1 keeps context cost roughly flat as the search deepens: the harness-as-state idea that agent engineering advocates, made concrete and RL-trained.

vs prior: Earlier search agents train over a growing transcript, so every candidate, observation, and verification lands back in context. Harness-1 trains over an external workspace and renders only a budget-bounded slice — the policy decides what to search and verify; the harness owns the memory.

Think of it as

A detective's case-board on the wall, briefed by index card.

                  THE GROWING CASE
                         │
              ┌──────────┴──────────┐
              │                     │
      ┌───────▼────────┐    ┌───────▼────────┐
      │  HARNESS-1     │    │ GROWING        │
      │  case-board    │    │ TRANSCRIPT     │
      │  on the wall   │    │ lug whole file │
      └───────┬────────┘    └───────┬────────┘
              │                     │
      carry one index card  haul the entire box
      into each interview    into every interview
              │                     │
              ▼                     ▼
      ✓ desk stays clear    ✗ desk overflows
        context stays flat     window overruns

case-board on the wall = the durable harness workspace (every lead, evidence link, verified fact)
index-card briefing = the budget-bounded slice rendered into the model's context each step
lugging the whole case file into every interview = replaying the entire growing transcript
running out of desk space = overflowing the context window as the search deepens

Quick glossary

Harness — The scaffolding around the model that owns tools, state, and exactly what gets shown to the model each step. The model is the brain; the harness is the desk, filing cabinet, and notepad.

Context window — The fixed token budget the model can read on any single step. Anything outside it is invisible to the model — and tokens are not free, so a full window is both a cost and a hard ceiling.

Growing transcript — The naïve agent-memory design: concatenate the full action-and-observation history and feed it back every step. It grows without bound, so a long search eventually overruns the context window.

State externalization — Keeping durable working memory outside the model's context — in the harness — so accumulated evidence does not spend context budget. The model reads a rendered view, not the raw store.

Budget-bounded rendering — Each step, the harness selects only a token-budgeted slice of the workspace to render into context, so context size is constant regardless of search depth.

Curated set — The agent's running shortlist of importance-tagged, verified evidence — distinct from the raw candidate pool. Harness-1's headline metric is curated recall: how much of the gold evidence lands in this set.

Curated recall — The fraction of the gold (correct) evidence that ends up in the curated set, averaged across 8 retrieval benchmarks. Harness-1 reports 0.730, +11.4 points over the next-strongest open search sub-agent.
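
To make the metric concrete, here is a minimal sketch of how curated recall could be computed for one query and then averaged over benchmarks. The function names, the toy document IDs, and the unweighted mean are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
from statistics import mean


def curated_recall(curated: set[str], gold: set[str]) -> float:
    """Fraction of gold evidence IDs that appear in the curated set."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(curated & gold) / len(gold)


def average_recall(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark scores (an assumed aggregation)."""
    return mean(per_benchmark.values())


# Toy example with made-up document IDs.
gold = {"doc1", "doc2", "doc3", "doc4"}
curated = {"doc1", "doc3", "doc4", "doc9"}  # doc9 is not gold evidence
print(curated_recall(curated, gold))  # 3 of the 4 gold items were curated
```

Note that extra non-gold items in the curated set (doc9 here) do not lower recall; they would only show up in a precision-style metric.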

The news. On June 1, 2026, Harness-1 (arXiv:2606.02373) introduced a 20B-parameter search agent that separates semantic decision-making from state management. The policy decides what to search, inspect, curate, verify, and when to stop; a state-externalizing harness holds the working memory (candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations) and renders only a budget-bounded slice into the model's context each step. Rather than training over an ever-growing transcript, the agent is trained with reinforcement learning over a structured external workspace. It reports 0.730 average curated recall across 8 retrieval benchmarks (web, finance, patents, multi-hop QA), +11.4 points over the next-strongest open search sub-agent.
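
One way to picture the workspace components listed above is as a plain data structure. The sketch below is hypothetical: the class names, field names, and the integer importance scale are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CuratedItem:
    doc_id: str
    importance: int        # assumed scale: larger means more important
    verified: bool = False


@dataclass
class Workspace:
    # doc_id -> compressed observation text
    candidate_pool: dict[str, str] = field(default_factory=dict)
    # doc_id -> importance-tagged shortlist entry
    curated: dict[str, CuratedItem] = field(default_factory=dict)
    # (claim, doc_id) pairs tying a claim to its supporting document
    evidence_links: list[tuple[str, str]] = field(default_factory=list)
    # (claim, passed) outcomes of verification attempts
    verifications: list[tuple[str, bool]] = field(default_factory=list)

    def add_candidate(self, doc_id: str, summary: str) -> None:
        self.candidate_pool[doc_id] = summary

    def curate(self, doc_id: str, importance: int) -> None:
        if doc_id not in self.candidate_pool:
            raise KeyError(f"{doc_id} is not in the candidate pool")
        self.curated[doc_id] = CuratedItem(doc_id, importance)

    def verify(self, doc_id: str, claim: str, passed: bool) -> None:
        self.verifications.append((claim, passed))
        if passed:
            self.evidence_links.append((claim, doc_id))
            if doc_id in self.curated:
                self.curated[doc_id].verified = True


ws = Workspace()
ws.add_candidate("doc1", "Patent filing describing the device")
ws.curate("doc1", importance=3)
ws.verify("doc1", "The device was patented", passed=True)
print(ws.curated["doc1"])
```

Everything here lives outside the model's context; the model would only ever see whatever view the harness chooses to render from it.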

Picture a detective working a long case. Every lead, photo, and verified alibi gets pinned to the case-board on the wall and connected with red string — the board is the durable record, and it only ever grows. When the detective walks into an interview, they don't wheel the entire case file into the room; they carry a single index-card briefing with just what this conversation needs. The board stays on the wall; only a briefing walks in. A rookie who instead lugs the whole growing file box into every interview eventually runs out of desk space — that is exactly what happens when a search agent replays its entire transcript into a finite context window.

That is the move Harness-1 makes concrete. The naïve design treats the agent's memory as a growing transcript: every observation, every candidate document, every verification step is concatenated and fed back to the model on the next step. It works for a few steps, then the transcript balloons and the search has to stop, not because the agent ran out of leads, but because it ran out of room. Harness-1 instead keeps that durable state in the harness (the case-board) and leaves the policy to decide only what to search, inspect, curate, and verify, and when to stop. Each step, the harness performs budget-bounded rendering: it selects a token-bounded slice of the workspace (the briefing) and shows only that to the model. The board can grow to hundreds of items while the briefing stays the same size, so context cost stays roughly flat no matter how deep the search goes. Crucially, the agent is trained with reinforcement learning over this workspace, not over transcripts, so the policy learns the harness skills (curate, importance-tag, verify, compress, stop) as first-class actions.
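
The briefing step can be sketched in a few lines. This is a minimal illustration, assuming a greedy "highest importance first" selection and a crude word-count tokenizer; the paper does not necessarily select its slice this way, and `count_tokens` and `render` are hypothetical names.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def render(items: list[tuple[int, str]], budget: int) -> list[str]:
    """Greedily pick (importance, text) items, most important first,
    skipping any item that would push the total past the token budget."""
    chosen: list[str] = []
    used = 0
    for _, text in sorted(items, key=lambda it: it[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # does not fit; a smaller item later may still fit
        chosen.append(text)
        used += cost
    return chosen


# The board (workspace) can keep growing; the briefing is capped by the budget.
board = [
    (1, "low priority note about a dead end"),
    (3, "verified: alibi confirmed by two witnesses"),
    (2, "candidate: patent filing mentions the device"),
]
briefing = render(board, budget=12)
print(briefing)
print(sum(count_tokens(t) for t in briefing), "tokens rendered")
```

However many items land on the board, the rendered briefing never exceeds the budget, which is the property that keeps context cost flat.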

Growing transcript vs state-externalizing harness

Design	What lives in context	Context cost as search deepens	Failure mode
Growing transcript	The full action + observation history, replayed every step	Grows with every step	Overflows the window; the search stalls on length, not leads
State-externalizing harness	A budget-bounded slice rendered from the workspace	~Flat, set by a render budget	A poorly-chosen slice can omit a needed item (mitigated by importance tags + curated recall)

The two rows describe the contrast Harness-1 draws between transcript-style memory and its externalized workspace; the "budget-bounded slice" claim is from the paper. Token figures in the hero animation are illustrative.

Walk the budget with some round numbers (illustrative). Say each search step adds about 2,000 tokens of fresh observations. Under the growing-transcript design, those tokens never leave: after 8 steps the model is reading roughly 16,000 tokens of history, after 20 steps about 40,000, and a genuinely deep multi-hop search marches straight past a typical working window. Under the state-externalizing harness, those 2,000-token observations land in the workspace, but the model is only ever shown a fixed ~6,000-token render — step 8 and step 20 cost the same 6,000 tokens in context. The accumulated evidence still exists; it just lives on the case-board instead of in the briefing. That is why Harness-1 can keep curating to 0.730 recall across deep benchmarks where a transcript agent would have run out of room — and it's the same lever the agent-engineering track frames as durable state the harness owns, rather than state smeared across a prompt.
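
The same arithmetic as a short script, using the illustrative round numbers above. One added assumption: in the first few steps the workspace holds less than the render budget, so the rendered view is modeled as the smaller of the two.

```python
TOKENS_PER_STEP = 2_000   # illustrative fresh observations per search step
RENDER_BUDGET = 6_000     # illustrative fixed render size


def transcript_context(step: int) -> int:
    """Tokens in context when the full history is replayed every step."""
    return step * TOKENS_PER_STEP


def harness_context(step: int) -> int:
    """Tokens in context when only a budget-bounded slice is rendered.
    Assumption: the render cannot exceed what the workspace holds so far."""
    return min(step * TOKENS_PER_STEP, RENDER_BUDGET)


for step in (1, 3, 8, 20):
    print(f"step {step:>2}: transcript={transcript_context(step):>6,} "
          f"harness={harness_context(step):>6,}")
```

The transcript column grows linearly with the step count, while the harness column plateaus at the render budget from step 3 onward.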

It lands as a sharp companion to the recent push on how search agents act — GrepSeek learns a better action space (shell commands over a corpus), while Harness-1 learns a better state substrate (an externalized workspace). Same RL-trained-search-agent family, orthogonal levers. As the work frames it, the model should make the semantic calls and the harness should own the memory — a clean division that the standard fixes for an overflowing context have been circling, now learned end-to-end.
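
That division of labor can be sketched as a loop in which the policy only returns actions and the harness alone mutates state. This is a toy illustration with a hand-written policy standing in for the RL-trained model; `Harness`, `toy_policy`, `run`, and the action names used here are hypothetical, not the paper's interface.

```python
from typing import Callable

Action = tuple[str, str]  # (name, argument), e.g. ("search", "query text")


class Harness:
    """Owns all state; the policy never touches these lists directly."""

    def __init__(self, search_fn: Callable[[str], list[str]]):
        self.search_fn = search_fn
        self.candidates: list[str] = []
        self.curated: list[str] = []

    def view(self, max_items: int = 3) -> dict[str, list[str]]:
        # A bounded view: only the most recent few items are shown.
        return {
            "candidates": self.candidates[-max_items:],
            "curated": self.curated[-max_items:],
        }

    def apply(self, action: Action) -> None:
        name, arg = action
        if name == "search":
            self.candidates.extend(self.search_fn(arg))
        elif name == "curate":
            if arg in self.candidates and arg not in self.curated:
                self.curated.append(arg)
        else:
            raise ValueError(f"unknown action: {name}")


def toy_policy(view: dict[str, list[str]]) -> Action:
    """Hand-written stand-in for the trained policy."""
    if not view["candidates"]:
        return ("search", "device patent")
    uncurated = [c for c in view["candidates"] if c not in view["curated"]]
    if uncurated:
        return ("curate", uncurated[0])
    return ("stop", "")


def run(policy: Callable[[dict], Action], harness: Harness, max_steps: int = 10) -> list[str]:
    for _ in range(max_steps):
        action = policy(harness.view())
        if action[0] == "stop":
            break
        harness.apply(action)
    return harness.curated


harness = Harness(search_fn=lambda query: ["docA", "docB"])  # fake search backend
print(run(toy_policy, harness))
```

The policy function is stateless: everything it knows arrives through `view()`, so swapping in a different policy leaves the memory logic untouched.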


Related explainers

GrepSeek — training a shell-command search agent — the other lever: learning the search action space instead of the state substrate.
Is Grep All You Need? — grep vs vector retrieval — empirical evidence that harness design dominates the retrieval algorithm.
RecMem — subconscious + recurrence-triggered memory — another way to keep durable agent memory off the live context.

FAQ

What is Harness-1?

Harness-1 is a 20B-parameter, RL-trained search agent that separates the model's semantic decisions (what to search, inspect, curate, verify, and when to stop) from state management. A state-externalizing harness holds the durable working memory — candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations — and renders only a budget-bounded slice into the model's context each step. It reports 0.730 average curated recall across 8 retrieval benchmarks, +11.4 points over the next-strongest open search sub-agent.

Why does externalizing state matter?

A search agent that replays its full transcript into context each step grows that context with every observation, so a deep search eventually overruns the context window and stops on length rather than on evidence. Externalizing state keeps the accumulated evidence in the harness and renders only a fixed-size slice, so context cost stays flat regardless of search depth — letting the agent keep curating across deep, multi-hop benchmarks.

How is this different from just a growing transcript?

A growing transcript concatenates the entire action-and-observation history and feeds it back every step, so its size scales with the number of steps. Harness-1 instead stores that history in a structured external workspace and trains the policy with reinforcement learning over that workspace — so the model learns to curate, verify, and compress as explicit actions, and the context the model reads is a budget-bounded rendering of the workspace rather than the raw, unbounded log.

