DEV Community: Ramin Jafary

The Rise of Agentic Engineering — Part 7: Loop Engineering, the Factory & the Human

Ramin Jafary — Thu, 25 Jun 2026 06:56:03 +0000

Loop Engineering, the Factory & the Human

The final part of a chronological survey of the craft around large language models. Part 5 built the harness — the static structure around a model. Part 6 questioned the hand-written prompt. This installment is about the object that runs inside the harness and replaces the prompt as the unit of work: the loop.

TL;DR — The newest move isn't writing better prompts. It's building loops that prompt the agents for you — then running many in parallel (the "factory," the "orchestra"). The catch: loops automate the typing, not the thinking. Three debts — comprehension, intent, and cognitive surrender — are yours to pay.

"I don't prompt anymore. I write loops."

The newest shift in the field was announced, fittingly for this story, in a single well-placed sentence. In June 2026 Peter Steinberger posted: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents." The same week, independently, Boris Cherny — who built Claude Code at Anthropic — said on stage: "I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops." Two practitioners, two companies, the same conclusion.

"I don't prompt Claude anymore. I have loops running that prompt Claude. My job is to write loops."
— Boris Cherny, creator of Claude Code

Addy Osmani's "Loop Engineering" (June 2026) unpacks what they meant. For two years, the way to get something from a coding agent was to write a good prompt, read the reply, write the next prompt — you held the tool the entire time, one turn after another. That, the loop-engineers argue, is ending. Instead you "build a small system that finds the work, hands it out, checks it, writes down what is done and then decides the next thing, and you let that system poke the agents instead of you."

This is not a brand-new idea so much as a culmination. Simon Willison had named "designing agentic loops" a critical skill back in September 2025 (Part 5). What changed by mid-2026, Osmani notes, is that the pieces stopped being a pile of personal bash scripts and started shipping inside the products — which is the moment a hack becomes infrastructure. Loop engineering, in Osmani's framing, "sits one floor above the harness": the harness is the environment a single agent runs inside; the loop is the harness "on a timer," spawning helpers and feeding itself.

Loop (runs on a timer, prompts the agents), plus automations, worktrees, sub-agents, and shared state. Wraps:
- Harness (the environment for one agent). Wraps:
  - Model

The five primitives, plus memory

Osmani's central observation is one of convergence: the building blocks of a loop now map near-identically onto both OpenAI's Codex app and Anthropic's Claude Code. Once you notice the shape is the same across tools, "you stop arguing about which tool, you just design a loop." A loop needs five things, plus a place to remember:

Automations — the heartbeat. Prompts that fire on a schedule, do their own discovery and triage, and surface findings to an inbox. (Codex: an Automations tab with project, prompt, cadence, environment, and a Triage inbox. Claude Code: scheduled tasks, cron, hooks, GitHub Actions.) The in-session companion is a run-until-done command — /goal in both tools — that keeps working across turns until a verifiable stopping condition holds. After each turn, a separate small model checks the condition, so the agent that wrote the code isn't the one grading it.
Worktrees — isolation for parallelism. The moment more than one agent runs, files collide. A git worktree gives each agent its own working directory on its own branch, so one agent's edits can't clobber another's checkout.
Skills — codified project knowledge. A folder with a SKILL.md of instructions and metadata, invoked by name or matched to a task. Skills are where intent gets written down once, on the outside — instead of being re-derived from scratch every session.
Plugins / connectors — the loop touches real tools. Built on MCP, connectors let the loop read the issue tracker, query a database, hit a staging API, post to Slack. This is the line between an agent that says "here is the fix" and a loop that opens the PR, links the ticket, and pings the channel once CI is green.
Sub-agents — separate the maker from the checker. The model that wrote the code grades its own work too generously. A second agent — different instructions, sometimes a different model — catches what the first talked itself into.
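The worktree primitive is plain git, so it is easy to sketch. The helper below only builds the command for one agent's checkout. The function name, the `agent/<id>` branch scheme, and the sibling-directory layout are assumptions made for illustration, not conventions of Codex or Claude Code.

```python
from pathlib import Path


def worktree_add_command(repo: str, agent_id: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` command for one agent.

    The branch and directory naming scheme is hypothetical; only the git
    subcommand and its flags are standard.
    """
    repo_path = Path(repo)
    # Sibling directory next to the main checkout, e.g. /src/app -> /src/app-agent-1
    worktree_dir = repo_path.parent / f"{repo_path.name}-{agent_id}"
    branch = f"agent/{agent_id}"
    # `git worktree add -b <branch> <path> <commit-ish>` creates a new branch
    # from <commit-ish> and checks it out in its own working directory.
    return ["git", "-C", str(repo_path), "worktree", "add",
            "-b", branch, str(worktree_dir), base]


if __name__ == "__main__":
    for agent in ("agent-1", "agent-2"):
        print(" ".join(worktree_add_command("/src/app", agent)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` inside a real repository would create the worktree; the sketch stops short of that so it runs anywhere.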

The sixth thing is state, or memory: a markdown file or a Linear board, or anything else outside the single conversation that records what's done and what's next. "The agent forgets, the repo doesn't." This is the same externalized-memory discipline as Anthropic's progress files (Part 5) and the context-offloading tactic (Part 4), now serving as the spine of the whole loop.
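One way to picture the state file is a markdown checklist that each run reads first and rewrites last. This is a minimal sketch under that assumption; the file layout and the function names are hypothetical, and a real loop might just as well keep its state in an issue tracker.

```python
from pathlib import Path
from typing import Optional


def read_tasks(state_file: Path) -> dict[str, bool]:
    """Parse a markdown checklist into {task: done}; a missing file means no tasks."""
    tasks: dict[str, bool] = {}
    if not state_file.exists():
        return tasks
    for line in state_file.read_text(encoding="utf-8").splitlines():
        if line.startswith("- [x] "):
            tasks[line[6:]] = True
        elif line.startswith("- [ ] "):
            tasks[line[6:]] = False
    return tasks


def write_tasks(state_file: Path, tasks: dict[str, bool]) -> None:
    """Rewrite the checklist so the next run starts from what is on disk."""
    lines = [f"- [{'x' if done else ' '}] {task}" for task, done in tasks.items()]
    state_file.write_text("\n".join(lines) + "\n", encoding="utf-8")


def next_task(tasks: dict[str, bool]) -> Optional[str]:
    """Return the first task not yet done, or None when everything is finished."""
    return next((task for task, done in tasks.items() if not done), None)
```

Because the agent's conversation is discarded between runs, the file is the only thing the next iteration can rely on.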

Here's a single loop, assembled:

An automation runs each morning → a triage skill reads yesterday's CI failures and open issues, writing findings to a state file → for each worthwhile finding, the loop opens an isolated worktree → a maker sub-agent drafts the fix, a checker sub-agent reviews it against the skills and tests → connectors open the PR and update the ticket → anything the loop can't handle lands in the triage inbox for the human.
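As a rough sketch of that flow, the control structure might look like the following. Every name here is a hypothetical stand-in rather than an API of Codex or Claude Code; the maker, checker, and PR connector are passed in as plain callables so the skeleton runs without any agent attached.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    opened_prs: list[str] = field(default_factory=list)
    inbox: list[str] = field(default_factory=list)  # items escalated to the human


def run_morning_loop(
    findings: list[str],
    maker: Callable[[str], str],
    checker: Callable[[str, str], bool],
    open_pr: Callable[[str, str], str],
) -> LoopResult:
    """One pass of the loop: draft, review, then either ship or escalate."""
    result = LoopResult()
    for finding in findings:
        patch = maker(finding)            # maker sub-agent drafts a fix
        if checker(finding, patch):       # a separate checker reviews it
            result.opened_prs.append(open_pr(finding, patch))
        else:
            result.inbox.append(finding)  # anything rejected goes to triage
    return result


if __name__ == "__main__":
    result = run_morning_loop(
        findings=["flaky test in auth", "null deref in export"],
        maker=lambda finding: f"patch for {finding}",
        checker=lambda finding, patch: "flaky" in finding,  # toy review rule
        open_pr=lambda finding, patch: f"PR: {finding}",
    )
    print(result)
```

Keeping the checker as a separate argument mirrors the maker/checker split: the function that wrote the patch never decides whether it ships.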

"You designed it one time. You did not prompt any of those steps."

Ralph: the loop in its purest form

If loop engineering has a folk hero, it is the Ralph Wiggum loop, created by Geoffrey Huntley in mid-2025, which had gone viral by the end of that year. In its purest form, Ralph is a bash loop: feed the agent a goal, let it work, feed its output (errors and all) back into a fresh context, and repeat until done. The technique is named after the lovably forgetful Simpsons character because, like Ralph, AI agents don't remember previous attempts and will cheerfully repeat mistakes; it "embraces limitations rather than fighting them."

Huntley's signature line captures the philosophy:

Ralph is "deterministically bad in a non-deterministic world."

Its failures are predictable, which makes them debuggable and tunable — unlike the chaotic unpredictability of elaborate multi-agent systems.

The tuning mechanism is a playground metaphor that should sound familiar. Ralph is told to build a playground and comes home bruised from falling off the slide, so you add a sign — "SLIDE DOWN, DON'T JUMP, LOOK AROUND" — to a spec or AGENTS.md file. Add too many signs and it becomes overwhelming, so you prune and re-evaluate. This is exactly the ratchet from Part 5 — every mistake becomes a rule, keep the rule-list short — arrived at independently, from the brute-force-loop direction.

Two things about Ralph tie the whole series together.

First, the proof points were startling. Huntley used it to build CURSED, a complete programming language, over months of largely autonomous operation. A Y Combinator hackathon team "put a coding agent in a while loop and it shipped 6 repos overnight."

Second — the subtle point HumanLayer stresses in its "brief history of Ralph" — the real value isn't "run forever." It's that each iteration "carve[s] off small bits of work into independent context windows."

In other words, Ralph is the whole field compressed into a five-line bash loop: context quarantine (Part 4), plus the fresh-context discipline (Part 2's context rot), plus external memory. The fancy version — Anthropic's planner/generator/evaluator splits — simply formalizes the maker/checker separation that Ralph's /goal-style stop-check already performs.
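The shape of the loop can be sketched in a few lines. This is an illustrative Python rendering, not Huntley's script (his original is a bash loop): `run_agent` and `is_done` are hypothetical callables standing in for an agent invocation and a stop-check, and the iteration cap is an added assumption.

```python
from typing import Callable


def ralph_loop(
    goal: str,
    run_agent: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_iterations: int = 10,
) -> tuple[bool, list[str]]:
    """Repeat until the stop-check passes or the iteration cap is reached."""
    notes: list[str] = []  # external memory that outlives each iteration
    for _ in range(max_iterations):
        # Fresh context every time: the goal plus the latest output only,
        # never the accumulated conversation.
        prompt = goal if not notes else goal + "\n\nPrevious output:\n" + notes[-1]
        output = run_agent(prompt)
        notes.append(output)
        if is_done(output):  # the check lives outside the agent that did the work
            return True, notes
    return False, notes


if __name__ == "__main__":
    attempts = []

    def fake_agent(prompt: str) -> str:
        # Toy agent: fails twice, then succeeds.
        attempts.append(prompt)
        return "tests pass" if len(attempts) == 3 else f"error on attempt {len(attempts)}"

    print(ralph_loop("make tests pass", fake_agent, lambda out: out == "tests pass"))
```

Each prompt carries only the goal and the most recent output, which is the "independent context windows" point in miniature.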

The factory and the orchestra

Once a single loop works, the obvious next move is to run many. Osmani has described this under several names — the factory model (the system that builds the software), the code agent orchestra, and swarms of agents working in parallel. The technical enabler is the worktree (isolation) plus the connector (shared tools) plus the sub-agent (division of labor). The conceptual enabler is everything from Part 4: this is Anthropic's orchestrator-worker pattern, generalized from research to software production.

But the economics from Part 4 return with force. Anthropic had found multi-agent systems use ~15× the tokens of a chat, and had specifically flagged coding as a worse fit for parallelism than research — fewer truly independent subtasks, more shared context.

The loop-engineers run many agents on code anyway. That's why every serious treatment of loop engineering leads with a cost warning; Osmani opens his own essay urging caution about token spend ("usage patterns can vary wildly if you are token rich or poor").

And the reliability math compounds. A loop chaining five steps, each 95% reliable, completes cleanly only about 77% of the time (0.95^5 ≈ 0.774), assuming the steps fail independently. Worse, an unattended loop's mistakes compound into the state file, so tomorrow's run builds on today's errors.
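The arithmetic behind the five-step figure, as a small sketch. It assumes each step succeeds or fails independently of the others, which is a simplification; correlated failures would change the numbers.

```python
def clean_run_probability(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:>2} steps at 95%: {clean_run_probability(0.95, n):.1%}")
    # The 5-step line prints 77.4%, the "about three-quarters" case.
```

The same function shows why longer chains need a checker between steps: per-step reliability has to rise sharply to keep the end-to-end figure flat.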

This is why the maker/checker split is load-bearing, not decorative — and why Willison's wry definition from Part 5 stays pinned to the wall: an agent is "an LLM wrecking its environment in a loop."

The orchestration tax and the debts that don't automate away

The loop changes the work; it does not delete the human from it. Osmani is emphatic on this, and three problems get sharper as loops get better, not easier.

The orchestration tax. Worktrees remove the mechanical collision of parallel agents, but the human's review bandwidth becomes the ceiling on how many agents you can actually run. The tool isn't the limit; you are. Running more agents doesn't help if you can't review their output.

Spin up ten agents and your bottleneck isn't compute. It's how fast you can review ten agents.

Comprehension debt. The faster a loop ships code you didn't write, the wider the gap between what exists and what you understand. A smooth loop makes that gap grow faster unless you read what it produced. This is the direct descendant of prompt debt (Part 6): both are costs that accrue invisibly while everything looks like progress.

Intent debt. An agent starts every session cold and fills any gap in your intent with a confident guess. Skills (writing intent down on the outside) pay this down; skipping them means the loop re-derives your whole project from scratch every cycle, guessing wrong each time.

Cognitive surrender. The most human risk: when the loop runs itself, it is tempting to stop having an opinion and accept whatever it returns. Osmani's sharpest formulation is that designing the loop "is the cure when you do it with judgement and the accelerant when you do it to avoid thinking — same action, opposite result." Two people can build the identical loop and get opposite outcomes: one uses it to move faster on work they understand deeply; the other uses it to avoid understanding the work at all. "The loop doesn't know the difference. You do."

His closing line is the thesis of loop engineering and a fair motto for the entire series: "Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go."

Open problems and where it's going

The field is, by its own account, early and unsettled. The honest open questions, gathered across the sources:

The behaviour harness is still unsolved (Part 5). We have good guides and sensors for maintainability and decent ones for architecture fitness, but reliably guiding and verifying whether an application functionally does what's needed — beyond trusting an AI-generated test suite — remains open.
Harness coherence at scale. As guides and sensors multiply, how do you keep them in sync and non-contradictory? Böckeler flags emerging conflicts between sensors; there's no "coverage metric" for harness quality yet.
Just-in-time harness assembly. Trivedy and Osmani both point to the same frontier: harnesses that dynamically assemble the right tools and context for a task, rather than being pre-configured. That is the moment a harness "stops being static config and starts becoming something closer to a compiler."
Agents that fix their own harness. Agents analyzing their own traces to identify and repair harness-level failure modes — self-improvement (Part 4) applied to the scaffolding itself.
Orchestrating many agents on a shared codebase without the coordination overhead defeating the parallelism gains.
The economics. Whether the 15× token cost of heavy multi-agent loops is justified outside high-value tasks, and how cost-governance (budgets, iteration caps, grader agents that kill a failing loop) matures.

What the research frontier is already attacking

The academic literature of late 2025 and 2026 is converging on exactly these problems, and a scan of the current arXiv frontier (via discovery layers like Emergent Mind and alphaXiv) shows the context-management thread from Parts 2–4 evolving into autonomous context management — the agent curating its own context rather than relying on a passive external summarizer:

Agentic Context Engineering (ACE) (Stanford et al., arXiv:2510.04618) treats a context as an evolving "playbook" that accumulates and refines strategies through generation, reflection, and curation. Notably, it names two new failure modes that extend Part 3's taxonomy: brevity bias (summarization that drops hard-won domain detail) and context collapse (iterative rewriting that erodes information over time). It reports roughly +10.6% on agent benchmarks by treating context as structured, incrementally-updated bullets rather than a monolithic prompt.
Self-Compacting Language Model Agents (arXiv:2606.23525) gives the agent a rubric for deciding on its own when and what to summarize, reporting double-digit task gains alongside 33–67% token reductions.
Active Context Compression / "Focus" (arXiv:2601.07190) lets the agent autonomously consolidate learnings into a persistent knowledge block and prune raw history — reporting ~22.7% token savings at equal accuracy on context-intensive SWE-bench instances, and the pointed finding that aggressive prompting to compress is what works (passive prompting yielded only ~6%). Current models, in other words, still need scaffolding to manage their own context well — a tidy restatement of this entire series' thesis.

These are early results on small samples, and none is a settled answer. But the direction is clear: the context curation that Part 3 framed as the human's job, and Part 5 encoded into a harness, is now itself becoming something the loop does for itself — the same upward migration of responsibility that defines every rung of the ladder.

And the largest open question is the one this whole series circles: as models improve, how much of this scaffolding persists? The harness-doesn't-shrink-it-moves argument (Part 5) suggests the work relocates rather than vanishing. But reasonable people in the field disagree about how far autonomy ultimately goes — whether we are heading toward ever-thinner harnesses around ever-more-capable models, or toward ever-richer control systems as we ask agents to do ever-harder things. The sources surveyed here lean toward the latter, but they are written by people building the control systems, and the question is genuinely unresolved.

The arc, in one picture

Step back, and the two-year story has a single shape: an escalating ladder of abstraction, where each rung exists because the rung below it stopped scaling — with a matching "debt" accruing at each step, and the human's role shifting upward but never disappearing.

Each rung exists because the one below it stopped scaling. The human keeps climbing — never stepping off.

Rung	Era	What the human does	The debt
Prompt	2023–24	writes the request	brittleness
Context	mid-2025	curates the information	context rot
Harness	early 2026	builds guides & sensors	—
Loop	mid-2026	designs the loop	comprehension / intent debt
Factory / Orchestra	mid-2026 →	orchestrates many agents — and stays the engineer	orchestration tax / cognitive surrender

Prompt engineering taught us that how you ask changes what you get. Context engineering taught us that what surrounds the ask matters more than its wording. Harness engineering taught us that the durable work is the system around the model — guides, sensors, feedback loops. Loop engineering taught us that even the act of prompting can be designed and automated.

And at every rung, the same lesson keeps returning — the one Arthur Dent learned from the Nutri-Matic in Part 1. A powerful, general system will hand you something almost-but-not-quite right, unless you do the work of specifying, constraining, measuring, and steering it.

The form of that work keeps changing. The need for it has not gone away.

Key sources for Part 7

Addy Osmani, Loop Engineering (June 2026) — design loops that prompt your agents; the five primitives + memory; Codex/Claude Code convergence; Steinberger and Cherny; comprehension/intent debt; cognitive surrender; "build the loop, stay the engineer."
Peter Steinberger and Boris Cherny — public remarks (June 2026) declaring the shift from prompting to writing loops. [Via Osmani and contemporaneous coverage.]
Geoffrey Huntley, Ralph Wiggum as a "software engineer" (ghuntley.com, 2025) — the Ralph loop; "deterministically bad in a non-deterministic world"; fresh context per iteration; the playground/signs metaphor; CURSED; the YC "6 repos overnight" field report.
HumanLayer, A Brief History of Ralph (Jan 2026) — the key nuance that Ralph's point is carving work into independent context windows, not "run forever."
Simon Willison, Designing agentic loops (Sept 2025) — the earlier naming of loop design; "an LLM wrecking its environment in a loop"; YOLO mode.
Anthropic, How we built our multi-agent research system (2025) — orchestrator-worker; ~15× token economics; coding as a harder fit for parallelism.
Addy Osmani, related posts — The Factory Model, The Code Agent Orchestra, Claude Code Swarms, The Orchestration Tax, Comprehension Debt, The Intent Debt, Cognitive Surrender, Self-Improving Coding Agents — the factory/orchestra framing and the human-cost threads.
Birgitta Böckeler (martinfowler.com) — the unsolved behaviour harness and harness-coherence open questions.

The Rise of Agentic Engineering — Part 6: Prompt Debt & the Limits of Natural Language

Ramin Jafary — Thu, 25 Jun 2026 06:54:48 +0000

Prompt Debt & the Limits of Natural Language

Part 6 of a chronological survey of the craft around large language models. Part 1 noted four quiet weaknesses in prompt engineering. By 2026 they had a name, a cost, and a proposed cure. This installment is about prompt debt — why natural language makes a poor specification language for durable systems.

TL;DR — Hand-tuned prompts accumulate debt: iteration slows, the team can't read them, and you get locked to one model. The root cause is "fighting the weights" — every repeated, all-caps instruction is scar tissue. The proposed cure: specify behavior with measurements, not prose, and stop writing prompts by hand (DSPy, GEPA). Define a metric instead of a paragraph, and switching models becomes a chore — not a fire drill.

The bill comes due

Natural-language interfaces made prototyping almost magical. As Drew Breunig observes in "The Problem Is Prompt Debt" (June 2026), you write what you want in English, hand it to a frontier model, and a working prototype appears in an afternoon. For one-off tasks, that is optimal. But as a way to build reliable systems, Breunig argues, the plain-English prompt is a trap — and "the bill arrives slowly, disguised as ordinary progress, until the application can barely move."

His diagnosis names three symptoms, in order of appearance.

First, iteration slows. As users flag errors and edge cases, you add guidance to the prompt to nudge the model into line. When an unwanted behavior persists, you repeat the instruction, more sternly. Soon the prompt is no longer straightforward; quick fixes regress earlier instructions; one-line hot fixes stop working; the development cycle crawls. Breunig points to a real artifact: a leaked system prompt that repeats one copyright rule up to six times, under six differently-named sections, each more emphatic than the last.

Second, the team is incapacitated. A brittle prompt full of edge cases and all-caps threats is barely legible to its own author and "downright impenetrable" to colleagues. Teams try to manage this by breaking prompts into run-time-assembled templates, each isolated to a concern — but those segments evolve too, into "a thicket of conditions."

Third, you get locked to a single model. Your hot fixes work on the model you tuned them against and "fail in entirely new ways" when you point the same call at a newer model. So you stay put — and forgo cheaper, faster, better models. Breunig cites a Datadog report finding that the single most-used model in observed traffic was an aging GPT-4o, and relays that some large inference providers see GPT-4o-vintage usage above 50% of all calls. Teams are frozen on old models because moving is too risky.

Any one of these is a nuisance. Together, Breunig argues, they are "the difference between a glorified prototype and a product that can grow." This is the mature form of the four cracks from Part 1 — the brittleness and non-portability that didn't matter for a chat prompt become fatal for a system.

Why prompt debt happens

Breunig's deeper point is that this isn't a discipline problem to be solved by writing prompts more carefully. It is structural: natural language was never meant to be a specification language for engineering, and treating it as one quietly caps what you can build.

Two properties of the medium make it so.

First, imprecision meets probability: different words for the same intent can yield different outputs. Breunig cites a 2026 study in which a clinical question asked in a patient's voice versus a physician's — identical facts — flipped a model from declining all ten times to answering all ten.

Second, and stranger, seemingly unrelated statements interfere. A Harvard study found that merely stating which NFL team a user rooted for changed how often the model refused sensitive questions. (Same family of effect as Part 3's CatAttack "cat facts": spurious context that shouldn't matter, but does.)

The consequence: "an additional instruction to quell a stubborn error could affect how the model interprets a separate instruction that worked yesterday." Prompts get more brittle precisely because you add fixes.

Fighting the weights

There's a specific reason instructions get repeated, and Breunig names it: fighting the weights. When the behavior you want is at odds with what the model was trained to do, one instruction isn't enough, so authors restate it, escalating. Once you see it, he writes, "you see it in system prompts everywhere." His examples: an image-generation prompt that instructed a model eight times not to keep talking after returning an image (because it had been trained to always continue the conversation); a coding agent told seven times to return multiple tool calls in a single response; a leaked model prompt restating one copyright rule six times.

Every all-caps, repeated, underlined instruction is scar tissue — a prompt losing a fight with the model's training.

Each repetition is the visible cost of that fight — and each one adds brittleness and fresh regression risk.

Models are not cleanly versioned software

The portability problem has a structural root too. Models aren't versioned software with stable interfaces; they have different weights that produce different behaviors in undocumented, unpredictable ways. Breunig cites a Berkeley-led study finding that enterprises stay on older models specifically because newer ones break their existing agents. A prompt tuned to one model's quirks is, by construction, welded to that model. This is the non-portability crack from Part 1, now load-bearing: prompt debt locks an application to a single model — not because labs built a clever moat, but as the natural result of "evolving a lossy, natural-language specification against a probabilistic model."

The cure, part 1: specify with measurements, not prose

If the problem is that natural language is too loose to specify behavior reliably, the first part of the cure is to put hard edges around the looseness. Breunig's first principle: specify your system's behavior with measurements, not prose. When output is probabilistic and language is imprecise, you constrain them with evaluations, metrics, and typed specifications — artifacts that are legible, shareable, and contributable by colleagues, exactly where brittle prompts were opaque.

A prompt is a paragraph you hope the model reads your way. A metric is a contract it has to satisfy.

This connects directly to the harness work of Part 5. Böckeler's sensors — tests, linters, structural rules, LLM-as-judge — are measurements that constrain probabilistic output. The behaviour you can't reliably get by telling the model, you get by measuring whether it happened and feeding that back. Breunig notes the corollary that the best engineers now spend more bandwidth on tests than ever: tests are no longer just a safety net but "the thing that lets the model cook." Spec-writing becomes a primary skill — define the done-condition before the code gets written (an idea that recurs in Part 7's "sprint contract" and in the broader 2026 emphasis on writing good specs for agents).
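To make "measurements, not prose" concrete, here is a minimal sketch of a metric acting as a contract. The evaluation set, the labels, and the 0.9 threshold are all invented for illustration; `system` is any callable that maps an input to an output, whether a prompt plus model or something else.

```python
from typing import Callable

# Hypothetical evaluation set: (input, expected label). Invented for illustration.
EVAL_SET = [
    ("refund my order", "billing"),
    ("app crashes on login", "bug"),
    ("how do I export data?", "how-to"),
]


def accuracy(system: Callable[[str], str], eval_set=EVAL_SET) -> float:
    """Fraction of evaluation inputs for which the system returns the expected label."""
    hits = sum(1 for text, expected in eval_set if system(text) == expected)
    return hits / len(eval_set)


def meets_contract(system: Callable[[str], str], threshold: float = 0.9) -> bool:
    """The contract: the system passes only if measured accuracy clears the bar."""
    return accuracy(system) >= threshold


if __name__ == "__main__":
    always_bug = lambda text: "bug"  # a deliberately poor system
    print(accuracy(always_bug), meets_contract(always_bug))
```

Nothing in the contract mentions a prompt or a model, which is the point: a colleague can read it, extend the evaluation set, and rerun it against any candidate system.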

The cure, part 2: stop writing the prompt by hand

The second principle is more radical: once you have metrics that can score candidate prompts, the prompt is no longer something to craft but something to search for. The space of possible words, phrases, and structures is far too large to explore by hand, and — Breunig argues — it is exactly the kind of terrain models were built to explore. The human writes the metric; a system searches the prompt space against it.
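A toy version of "the human writes the metric; a system searches the prompt space" might look like this. It is only a sketch: `fake_model` is a stand-in for a real model call, the candidate prompts are listed by hand, and the search is a plain maximum over them. Real optimizers generate and refine candidates; this does not use the DSPy API.

```python
# Hypothetical evaluation set, invented for illustration.
EVAL_SET = [("app crashes on login", "bug"), ("charged twice", "billing")]


def fake_model(prompt: str, text: str) -> str:
    """Stand-in for an LLM call: its output format depends on the prompt wording."""
    label = "bug" if "crash" in text else "billing"
    return label if "one word" in prompt else f"The category is {label}."


def score(prompt: str, eval_set=EVAL_SET) -> float:
    """The metric: exact-match accuracy of the model's output under this prompt."""
    hits = sum(1 for text, expected in eval_set if fake_model(prompt, text) == expected)
    return hits / len(eval_set)


def best_prompt(candidates: list[str]) -> tuple[str, float]:
    """Search the candidate prompts and return the highest-scoring one with its score."""
    return max(((p, score(p)) for p in candidates), key=lambda pair: pair[1])


if __name__ == "__main__":
    candidates = [
        "Classify the ticket.",
        "Classify the ticket. Answer in one word.",
    ]
    print(best_prompt(candidates))
```

Swapping in a different model means rerunning `best_prompt`, not rewriting the prompt by hand; the metric and the evaluation set carry over unchanged.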

Two systems anchor this argument.

DSPy reframes prompting as programming: you declare the signature of what each module should do and an optimizer compiles the actual prompt text against your metric, rather than you hand-tuning strings. The prompt becomes a compiled artifact, held accountable to a measurable objective.

GEPA (Genetic-Pareto), from a UC Berkeley / Stanford / Databricks / MIT collaboration (Agrawal et al., 2025; ICLR 2026 oral), is the sharpest evidence that searching beats hand-tuning — and even beats reinforcement learning.

The argument: RL methods like GRPO adapt a model to a task using sparse scalar rewards, often needing tens of thousands of rollouts. GEPA instead uses natural-language reflection. It samples an AI system's trajectories (reasoning, tool calls, tool outputs), reflects on them in language to diagnose what went wrong, proposes and tests prompt updates, and combines complementary lessons from a Pareto frontier of candidates.

Because language is a richer learning medium than a scalar reward, GEPA "can often turn even just a few rollouts into a large quality gain." Across six tasks it beat GRPO by 6% on average (up to ~20%) while using up to 35× fewer rollouts — and beat the prior state-of-the-art prompt optimizer, MIPROv2, by over 10% (e.g. +12% on AIME-2025).
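The "Pareto frontier of candidates" idea can be illustrated with a small sketch. This is a generic non-dominance filter over made-up scores, not GEPA's implementation: a candidate prompt stays on the frontier unless some other candidate is at least as good on every task and strictly better on one.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Return the candidates that no other candidate dominates."""
    return {
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    }


if __name__ == "__main__":
    # Invented per-task scores for three candidate prompts.
    scores = {
        "A": [0.9, 0.2, 0.5],
        "B": [0.4, 0.8, 0.5],
        "C": [0.3, 0.1, 0.4],
    }
    print(sorted(pareto_frontier(scores)))  # ['A', 'B']
```

Ranking by average alone would keep only B; the frontier also keeps A, whose strength on the first task is a lesson worth combining with B's.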

Notably, GEPA ships as dspy.GEPA: the two tools are one ecosystem for treating prompts as searchable artifacts. The throughline — the human specifies intent and a metric; the system writes and refines the prompt.

An evenhanded caveat

Automated prompt optimization is not a universal win — and the research says so, especially once you move from a single prompt to a multi-agent system.

A 2025 analysis of multi-agent design spaces (Zhang et al., "Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies," arXiv:2502.02533) found that prompts are frequently an influential lever for strong multi-agent performance. But the interplay between the prompt design space and the topology design space (how many agents, how they're connected) "remains unclear," and genuinely influential topologies are only a small fraction of the search space.

Follow-on work reinforces the interdependence: the MASS line of research argues prompt and topology optimization are coupled, best done in alternation rather than isolation.

The honest reading: searching the prompt space is powerful, but it interacts with system design in ways we don't fully understand yet — and gains are task- and structure-dependent, not guaranteed. None of these techniques is a silver bullet.

The portability payoff

The reason to pay down prompt debt is the freedom it buys. Once behavior is defined by measurements and prompts are generated rather than hand-tuned, you are no longer bound to one model. Breunig's claim: evaluating a new model takes hours, not weeks. When a faster or cheaper model arrives, you try it. When a deprecation email arrives — or when a model is pulled for regulatory reasons, or retired for age — the fix "is a chore, not a fire drill." The metric and the search transfer; only the compiled prompt needs to be regenerated.

Define behavior with a metric instead of a paragraph, and switching models becomes a chore — not a fire drill.

The historical analogy

Breunig closes with the argument that gives this whole shift its weight. Every mature engineering discipline eventually stops doing by hand the very thing it once prided itself on doing by hand:

assembly gave way to compilers,
hand-tuned database queries gave way to query planners,
manual memory management gave way (mostly) to machines that do it better.

"Prompt-writing is no different." Coaxing the model with exactly the right words is a real skill — and for one-off tasks, often the optimal one (the Nutri-Matic of Part 1 still works for a single cup of tea). But to build reliable, improvable, portable systems, the argument goes, we should not be hand-tuning prompts any more than we hand-tune assembly.

This is a notable inversion of the field's origin. Part 1's entire discipline — finding the magic phrasing — is here recast as a transitional craft, valuable but destined to be automated, the way every prior generation's hand-craft was. Whether the field fully follows that path is still open; hand-prompting remains widespread and, as the multi-agent design research shows, automation has its own pitfalls. But the direction of travel is clear, and it sets up the final part of our story.

Because while one branch of the field was learning to stop hand-writing the instructions, another was elevating a different object entirely to the center of the work — not the prompt, not even the harness, but the loop that runs the agents. Part 7 turns to loop engineering, the factory and the orchestra, and the costs that don't get automated away.

Key sources for Part 6

Drew Breunig, The Problem Is Prompt Debt (June 2026) — the three symptoms (slowed iteration, team illegibility, model lock-in); "fighting the weights"; specify with measurements; stop hand-writing prompts; the assembly→compilers analogy; the Datadog GPT-4o concentration data.
Lakshya A. Agrawal et al., GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (arXiv:2507.19457; ICLR 2026 oral; UC Berkeley/Stanford/Databricks/MIT) — reflective prompt evolution; +6% avg (up to ~20%) over GRPO with up to 35× fewer rollouts; +>10% over MIPROv2; ships as dspy.GEPA.
DSPy (dspy.ai) — programming, not prompting; declared signatures + optimizers compile prompts against a metric.
Zhang et al., Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies (arXiv:2502.02533, 2025) — prompts are an influential design component for strong MAS, but the prompt/topology interplay is unclear and influential topologies are a small fraction of the search space; with the MASS line, establishes that prompt and topology optimization are coupled and task-dependent. The evenhanded counterpoint to automated prompt optimization.
Supporting studies cited via Breunig: a 2026 clinical-voice prompt-sensitivity study; a Harvard study on spurious statements (NFL-team) affecting refusals; a Berkeley-led study on enterprises staying on older models because newer ones break agents.

Next up · Part 7 — Loop Engineering, the Factory & the Human: designing the systems that prompt the agents, running many in parallel, and the debts (comprehension, intent, surrender) that no loop pays down for you.

The Rise of Agentic Engineering — Part 5: Harness Engineering Emerges

Ramin Jafary — Thu, 25 Jun 2026 06:53:35 +0000

Harness Engineering Emerges

Part 5 of a chronological survey of the craft around large language models. Part 4 ended on a warning: in agentic systems, small errors compound catastrophically. This installment is the response — as coding agents went mainstream, attention widened from the context you feed a model to the entire system you build around it: the harness.

TL;DR — Agent = Model + Harness. If you're not the model, you're the harness — the guides and sensors (feedforward and feedback) that keep an agent on track. Four industrial accounts (OpenAI, Stripe, Google, Anthropic) converged on the same lesson: the durable work is designing the environment, not picking the model. And the harness ratchets — every mistake becomes a permanent rule.

A definition that reframes the work

By late 2025, the most useful coding tools — Claude Code, OpenAI's Codex, Cursor, Aider, Cline — were no longer "a model you prompt." They were elaborate systems in which a model sat at the center, surrounded by prompts, tools, file access, sandboxes, hooks, and feedback loops. Simon Willison gave the underlying object its durable definition in September 2025: an LLM agent is "something that runs tools in a loop to achieve a goal," and "the art of using them well is to carefully design the tools and loop for them to use." He also named the larger activity — agentic engineering: building software using coding agents whose defining feature is that they can both generate and execute code, letting them test and iterate independently of turn-by-turn human guidance.

The piece of that system other than the model got its own name, which the developer Viv Trivedy crystallized into a slogan that spread quickly:

Agent = Model + Harness. If you're not the model, you're the harness.

A raw model, on this view, is not an agent. It becomes one only when a harness gives it state, tool execution, feedback loops, and enforceable constraints. As Addy Osmani put it in his synthesis of the idea, the harness is "every piece of code, configuration, and execution logic that isn't the model itself" — system prompts, CLAUDE.md/AGENTS.md files, skills, tools, MCP servers, the sandbox and filesystem, orchestration logic, hooks, and observability.

The reframing matters because of where it locates the engineering leverage. Osmani's claim, echoing Trivedy and the team at HumanLayer, is blunt: "A decent model with a great harness beats a great model with a bad harness." The loud public debate is about the model — which is smartest, which writes the cleanest code. But the model is one input; the rest is the harness, and "increasingly the interesting engineering isn't in picking the model, it's in designing the scaffolding around it."

The most striking evidence for that claim is a benchmark data point Osmani relays from Trivedy and HumanLayer: on Terminal-Bench, a coding agent moved from roughly Top 30 to Top 5 by changing only the harness, with the model held fixed. The explanation is that models are post-trained in tandem with a particular harness; move a model into a harness better suited to your codebase, with sharper tools and tighter feedback, and you can unlock capability the original harness left on the floor. This is the opposite of the "just wait for the next model" narrative. As Osmani frames it, "the gap between what today's models can do and what you see them doing is largely a harness gap."

Two boundaries of one word

"Harness" gets used at different scopes, and it's worth separating them, because the rest of this part lives at the outermost layer. Birgitta Böckeler (writing on Martin Fowler's site) draws the distinction as concentric rings:

User harness (you build this) — AGENTS.md, skills, hooks, custom linters, review agents, CI sensors. Wraps:
- Builder harness (the agent vendor) — system prompt, tool set, retrieval. Wraps:
- Model — the thing being harnessed.

The model is the core. Around it, the coding-agent vendor builds an inner "builder harness" (the system prompt, the tool set, the code-retrieval mechanism). And around that, you — the user of a coding agent — build an outer "user harness" tailored to your codebase. Böckeler's articles, and most of this part, are about engineering that outermost ring. A well-built outer harness, she writes, does two things: it raises the odds the agent gets it right the first time, and it provides a feedback loop that self-corrects many issues before they ever reach a human — reducing review toil, raising quality, and wasting fewer tokens.

The mental model: guides and sensors, feedforward and feedback

Böckeler's central contribution is a vocabulary borrowed, deliberately, from cybernetics — the study of control systems. A harness acts like a governor, regulating the codebase toward a desired state using two kinds of control:

Guides (feedforward controls) anticipate the agent's behavior and steer it before it acts. They raise the probability of a good result on the first try. Examples: coding-convention docs, AGENTS.md, skills, reference docs, how-to guides, codemods.
Sensors (feedback controls) observe after the agent acts and help it self-correct. Examples: type checkers, linters, tests, static analysis, AI review agents.

A harness with only feedforward "encodes rules but never finds out whether they worked." A harness with only feedback "keeps repeating the same mistakes." You need both. And each kind of control comes in two execution flavors:

Computational — deterministic, fast, run by the CPU: tests, linters, type checkers, structural analysis. Milliseconds to seconds; reliable.
Inferential — semantic, run by a model: AI code review, "LLM-as-judge." Slower, costlier, non-deterministic, but capable of judgment a linter can't make.

A particularly elegant idea in Böckeler's framing is that the best sensors produce feedback optimized for a model to consume — a custom linter message that doesn't just say "error" but includes instructions for how to fix it. She calls this "a positive kind of prompt injection": the sensor speaks to the agent in terms the agent can act on.
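As an illustration of a sensor whose output is written for the agent, here is a minimal sketch of a function-length check. It is not Böckeler's setup (she used ESLint on a TypeScript app); the threshold, the function name `check_function_length`, and the message wording are all assumptions made for the example.

```python
import ast

MAX_FUNCTION_LINES = 20  # hypothetical threshold, not taken from the article


def check_function_length(source: str, max_lines: int = MAX_FUNCTION_LINES) -> list[str]:
    """Return one agent-readable message per function that exceeds max_lines."""
    messages = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                # The message states the problem and then tells the agent what to do.
                messages.append(
                    f"{node.name} (line {node.lineno}) is {length} lines; "
                    f"the limit is {max_lines}. "
                    "Fix: extract cohesive steps into named helper functions. "
                    "Do not raise the limit without writing down a reason."
                )
    return messages
```

The point of the sketch is the second half of the message: a bare "function too long" is a computational sensor, while the appended fix instruction is what turns it into guidance the agent can act on.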

The steering loop

The human's job, in this picture, is to steer by iterating on the harness. When a problem recurs, you improve the guides and sensors so it becomes less likely or impossible next time. And because coding agents themselves make it cheap to build controls, agents can help write the structural tests, draft rules from observed patterns, scaffold custom linters, or generate how-to guides from codebase archaeology. The harness becomes self-reinforcing.

Keep quality left

Böckeler maps the controls onto the software lifecycle with a "shift left" principle: run the cheap, fast checks as early as possible (before commit, before integration), and reserve expensive ones (broad AI review, mutation testing) for later in the pipeline. The earlier an issue is caught, the cheaper it is to fix — a continuous-integration intuition, now applied to agent output.
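A rough sketch of the ordering idea, assuming each check can be given a relative cost: run the cheapest checks first and stop at the first failure, so expensive checks only run on output that has already passed the cheap ones. The `Check` type, the cost numbers, and `run_cheapest_first` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    relative_cost: int        # hypothetical unit: 1 is the cheapest
    run: Callable[[], bool]   # returns True when the check passes


def run_cheapest_first(checks: list[Check]) -> tuple[bool, list[str]]:
    """Run checks in ascending cost order and stop at the first failure."""
    executed = []
    for check in sorted(checks, key=lambda c: c.relative_cost):
        executed.append(check.name)
        if not check.run():
            return False, executed
    return True, executed
```

With a lint check at cost 1, unit tests at cost 10, and an AI review at cost 100, a failing unit test means the AI review never runs.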

What a harness regulates, and Ashby's Law

Böckeler distinguishes three regulation targets: a maintainability harness (internal code quality — the easiest, with mature tooling), an architecture-fitness harness (performance, observability, and other architectural characteristics — Fowler's "fitness functions"), and a behaviour harness (does the app actually do what's needed — "the elephant in the room," and still largely unsolved). She invokes Ashby's Law of Requisite Variety from cybernetics: a regulator must have at least as much variety as the system it governs. Since an LLM can produce almost anything, committing to a constrained architecture (a fixed topology, predictable structure) is a variety-reduction move that makes a comprehensive harness achievable. This reappears, concretely, in OpenAI's practice below.

Sensors in practice

Böckeler's hands-on follow-up walks through building maintainability sensors on a real TypeScript/Next.js app, and the findings sharpen the abstract model:

Basic linting (ESLint) caught the low-hanging AI failure modes — over-long functions and files, too many arguments, high cyclomatic complexity — but these rules weren't even on by default; she had to configure them, and notes that linters may evolve presets aimed at known agent failure modes. Crucially, she rewrote lint messages into self-correction guidance (the "positive prompt injection"), even letting the agent suppress a warning with a written reason or slightly raise a threshold — keeping constraints visible and reviewable rather than forcing a binary comply-or-suppress choice.
Dependency rules (dependency-cruiser) enforced layered module boundaries (e.g. "clients must not import from services"), with error messages that re-explain the layering so the agent self-corrects. She notes AI absorbed the tool's steep configuration cost almost entirely.
Coupling metrics alone proved not very useful to an AI — raw import/call graphs are noisy without semantic interpretation, and the model flagged legitimate patterns (a DI factory, a shared schema) as problems.
AI modularity review (an inferential sensor, using a powerful prompt) was the standout — it found real, valid issues a human reviewer would care about: duplicated route code, inconsistent backend-calling patterns, a parameter object that should have been introduced (a change that had once touched 40+ files), auth logic sitting in the wrong place. This is "garbage collection" as a recurring inferential sensor.
The test suite as a regression sensor, with mutation testing (Stryker) to catch the gap coverage misses: a file showed 100% statement coverage but had 13 surviving mutants and no real unit tests — coverage proved a line ran, not that its behavior was verified. As teams let AI write most tests, Böckeler argues mutation testing becomes crucial to monitor test effectiveness.
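The gap between coverage and verification in the last item can be shown in a few lines. This is a hand-rolled sketch, not Stryker: the function `is_adult`, the three mutants, and both tests are invented for the example, and a real mutation tool would generate the mutants automatically.

```python
def is_adult(age: int) -> bool:
    return age >= 18


# Hand-written mutants of is_adult (a mutation tool would generate these).
MUTANTS = {
    ">= becomes >": lambda age: age > 18,
    ">= becomes <": lambda age: age < 18,
    "18 becomes 19": lambda age: age >= 19,
}


def weak_test(fn) -> bool:
    """Executes every line of fn (full statement coverage) but asserts nothing."""
    fn(30)
    return True


def strong_test(fn) -> bool:
    """Checks behaviour at the boundary and on both sides of it."""
    return fn(18) is True and fn(17) is False and fn(19) is True


def surviving_mutants(test) -> list[str]:
    """A mutant survives when the test still passes against it."""
    return [name for name, mutant in MUTANTS.items() if test(mutant)]
```

Both tests pass against the original and both give full statement coverage, but `surviving_mutants(weak_test)` returns all three mutants while `surviving_mutants(strong_test)` returns none.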

Her honest caveats are part of the contribution: computational sensors impressed her at the file/function level but were noisy across module boundaries; she worried about feedback overload sending an agent into over-engineered refactoring spirals; and she flagged emerging conflicts between sensors (a max-lines rule pushing complexity into ever-longer component property chains). The sensors raised her trust and improved her review experience, but they did not remove the human.

Industry at scale: three accounts

What makes early 2026 the moment harness engineering "emerged" is that several organizations published detailed accounts of doing it at scale. Read together, they triangulate the same practices.

OpenAI: a million lines, zero hand-written

OpenAI's "Harness engineering: leveraging Codex" (February 2026) describes building and shipping an internal product with zero lines of manually-written code — every line written by Codex, in roughly one-tenth the time hand-coding would have taken. Starting from an empty repository in August 2025, a small team drove Codex to about a million lines of code and ~1,500 merged pull requests, at an average of 3.5 PRs per engineer per day — throughput that increased as the team grew. The governing philosophy: "Humans steer. Agents execute."

The lessons are a catalogue of harness engineering:

The role of the engineer changes. Early progress was slow "not because Codex was incapable, but because the environment was underspecified." The job became: when the agent fails, ask "what capability is missing, and how do we make it both legible and enforceable for the agent?" — then have Codex build that capability.
Make the application legible to the agent. They wired the Chrome DevTools Protocol and a local observability stack into the agent runtime so Codex could drive the UI, read logs (LogQL) and metrics (PromQL), reproduce bugs, and verify its own fixes. Prompts like "ensure service startup completes in under 800ms" became tractable. Single Codex runs sometimes worked for six-plus hours, often overnight.
Repository knowledge as the system of record — a map, not a manual. The "one big AGENTS.md" approach failed predictably: it crowded out the task, became non-guidance ("when everything is important, nothing is"), rotted instantly, and resisted verification. The fix: treat AGENTS.md as a ~100-line table of contents pointing into a structured docs/ directory (design docs, execution plans, product specs, a quality-grade doc), enforced by linters and CI, with a recurring "doc-gardening" agent opening fix-up PRs for stale docs. They call this progressive disclosure: start the agent with a small stable entry point, teach it where to look next.
Agent legibility is the goal. "From the agent's point of view, anything it can't access in-context while running effectively doesn't exist." Knowledge in Slack threads or people's heads is invisible; it must be encoded into the repo as versioned artifacts. They favored "boring" technologies the model could fully model, and sometimes reimplemented small utilities rather than depend on opaque packages.
Enforce invariants, not implementations. A rigid layered architecture (Types → Config → Repo → Service → Runtime → UI, with cross-cutting concerns entering through one explicit interface) enforced mechanically by custom, Codex-written linters and structural tests. "This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it's an early prerequisite." (This is Ashby's Law in practice: constrain the variety to make the system governable.)
Throughput changes the merge philosophy. Minimal blocking merge gates, short-lived PRs, flakes handled with re-runs rather than indefinite blocking — "in a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive."
Entropy and garbage collection. Agents replicate existing patterns, including bad ones, so the codebase drifts. The team initially spent every Friday — 20% of the week — cleaning "AI slop," which didn't scale. They replaced it with encoded "golden principles" and recurring background Codex tasks that scan for deviations and open targeted refactoring PRs, most auto-merged in under a minute. "Technical debt is like a high-interest loan": pay it down continuously.
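The "enforce invariants" lesson above lends itself to a structural test. The sketch below is not OpenAI's linter. It assumes that the first path segment of a module names its layer and that a module may import only from its own layer or from layers earlier in the sequence; both are assumptions about how the arrow order should be read, and `layer_violations` is a hypothetical name.

```python
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def layer_violations(imports: dict[str, list[str]]) -> list[str]:
    """imports maps a module path such as 'service/billing' to the modules it imports.

    Assumption: the first path segment is the layer name, and a module may
    import only from its own layer or from layers earlier in LAYERS.
    """
    problems = []
    for module, targets in imports.items():
        source_layer = module.split("/")[0]
        for target in targets:
            target_layer = target.split("/")[0]
            if RANK[target_layer] > RANK[source_layer]:
                # The message re-explains the rule so an agent can self-correct.
                problems.append(
                    f"{module} imports {target}: '{source_layer}' must not depend on "
                    f"'{target_layer}'. Move the shared code to an earlier layer."
                )
    return problems
```

A module outside the six known layers raises a `KeyError` here; a real check would need an explicit rule for cross-cutting code, which the account says enters through one explicit interface.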

Their concluding line is the thesis of this whole part: "Our most difficult challenges now center on designing environments, feedback loops, and control systems."

Stripe: one-shot agents at PR scale

Stripe's "Minions" write-up (Alistair Gray, February 2026) describes homegrown coding agents responsible for more than a thousand merged pull requests per week — humans review the code, but minions write it start to finish, with no human-written code.

The context that makes this hard is specific. Stripe's codebase spans hundreds of millions of lines across a few large repos, mostly Ruby (not Rails) with Sorbet typing — "a relatively uncommon stack" full of homegrown libraries "natively unfamiliar to LLMs" — and it moves well over a trillion dollars of payment volume in production, under real regulatory and compliance constraints.

Stripe's stated principle is the same one Osmani and Böckeler arrive at: "if it's good for humans, it's good for LLMs, too." Minions use the same developer tooling as human engineers. Several harness pieces stand out, each mapping onto the guides/sensors model:

Devbox — the sandbox. A minion run starts in an isolated developer environment, the same kind of machine Stripe engineers write code on, pre-warmed to spin up in about ten seconds with Stripe code and services pre-loaded. Because devboxes are isolated from production and the internet, minions run "without human permission checks," and this gives parallelization "without the overhead of something like git worktrees, which wouldn't scale at Stripe." This is the embodiment of Osmani's point that your dev environment is your agent environment.
An opinionated orchestration loop. The core agent loop runs on a fork of Block's open-source agent goose, customized to "interleave agent loops and deterministic code — for git operations, linters, testing, and so on — so that minion runs mix the creativity of an agent with the assurance that they'll always complete Stripe-required steps like linters." In harness terms, the deterministic steps are computational guides and sensors; the agent loop is where inference happens. (Stripe's Part 2 names this structure "blueprints.")
Rules and Toolshed — the context. Minions read the same coding-agent rule files that human-operated tools like Cursor and Claude Code do; because Stripe can't have many unconditional rules, almost all are "conditionally applied based on subdirectories" — the same scoping discipline Part 4 calls context curation. For everything not in the filesystem, minions use MCP, and Stripe built a central internal MCP server called Toolshed hosting more than 400 MCP tools across internal systems and SaaS platforms; minions and other agents connect to "configurable but curated subsets" rather than the full set. Stripe also "deterministically run[s] relevant MCP tools over likely-looking links before a minion run even starts, to better hydrate the context" — proactive context-loading as a harness step.
The feedback loop — the sensors, "shifted left." Stripe's framing is explicit: any lint that would fail in CI should be "enforced in the IDE or on a git push, and presented to the engineer immediately." Concretely: (1) an automated local executable uses heuristics to select and run relevant lints on each git push in under five seconds; (2) if that passes, CI selectively runs tests from Stripe's battery of over three million, auto-applying fixes where they exist and sending un-autofixable failures back to the minion; (3) a hard cap of at most two CI rounds — push, fix once, and stop. Stripe is candid about why: CI runs "cost tokens, compute, and time," and there are "diminishing marginal returns for an LLM to run many rounds of a full CI loop." This is a deliberate, economically-reasoned answer to the infinite-retry failure mode.
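The capped feedback loop in the last item can be sketched as a small control function. This is an illustration of the shape only, not Stripe's code: `run_ci` and `agent_fix` are hypothetical stand-ins for the deterministic CI step and the agent's fix attempt, and the return labels are invented.

```python
from typing import Callable

MAX_CI_ROUNDS = 2  # the cap the write-up describes: push, fix once, stop


def run_with_ci_cap(
    agent_fix: Callable[[list[str]], None],
    run_ci: Callable[[], list[str]],
    max_rounds: int = MAX_CI_ROUNDS,
) -> tuple[str, int]:
    """Alternate deterministic CI runs with agent fix attempts, up to max_rounds.

    run_ci returns the list of failures (empty means green); agent_fix receives
    those failures. Both callables are hypothetical stand-ins.
    """
    for round_number in range(1, max_rounds + 1):
        failures = run_ci()              # deterministic step
        if not failures:
            return "green", round_number
        if round_number < max_rounds:
            agent_fix(failures)          # inferential step
    return "handed_to_human", max_rounds
```

With the default cap the agent gets exactly one fix attempt: a second red CI run ends the loop instead of triggering another retry.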

The throughline is the same guides-and-sensors pattern as OpenAI's account — a different company, a different stack, at scale — and it doubles as a case study in shifting feedback left and curating context, the lessons of Parts 3 and 4 made industrial.

Anthropic: harnesses for long-running work

Anthropic's "Effective harnesses for long-running agents" (November 2025) — which Osmani calls the best public breakdown of designing a harness for long-running work — tackles the problem of agents working across many context windows, where each new session starts with no memory of the last (like engineers working shifts with amnesia between them). Their finding: compaction alone is insufficient. Even a frontier model in a loop fails to build a production app from a high-level prompt, in two characteristic ways — trying to one-shot the whole app (and running out of context mid-feature), or, later, looking around, seeing progress, and prematurely declaring the job done.

Their two-part solution is pure harness design.

An initializer agent runs once, on the first session: it sets up the environment — a structured JSON feature list (200+ end-to-end features, all initially marked "failing"), an init.sh script, a progress file, and an initial git commit.

A coding agent then runs every subsequent session. It reads the progress file and git log to get oriented, runs a basic end-to-end test, works on exactly one feature, verifies it with real browser automation (not just unit tests), and leaves the environment clean and committed for the next session.

The feature list uses JSON specifically because the model is less likely to inappropriately edit it than Markdown — with strongly-worded instructions that removing or editing tests is "unacceptable." It's the same "leave clear artifacts for the next shift" discipline OpenAI encoded in its docs directory: externalized memory as a harness component.
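A minimal sketch of the feature-list mechanics described above. The field names (`id`, `description`, `passes`) and all function names are assumptions, not Anthropic's schema; the guard at the end is one possible way to enforce mechanically what the article says Anthropic enforces through instructions.

```python
import json
from typing import Optional


def init_feature_list(descriptions: list[str]) -> str:
    """Initializer step: every feature starts out failing."""
    features = [
        {"id": i, "description": d, "passes": False}
        for i, d in enumerate(descriptions, start=1)
    ]
    return json.dumps(features, indent=2)


def next_feature(feature_json: str) -> Optional[dict]:
    """Coding-agent step: pick exactly one failing feature to work on."""
    for feature in json.loads(feature_json):
        if not feature["passes"]:
            return feature
    return None


def mark_passing(feature_json: str, feature_id: int) -> str:
    """Flip one feature to passing after it has been verified end to end."""
    features = json.loads(feature_json)
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = True
    return json.dumps(features, indent=2)


def only_status_changed(before_json: str, after_json: str) -> bool:
    """Guard: a session may flip 'passes' but may not remove or reword features."""
    def identity(raw: str) -> list[tuple[int, str]]:
        return [(f["id"], f["description"]) for f in json.loads(raw)]
    return identity(before_json) == identity(after_json)
```

A session would call `next_feature`, do its work, call `mark_passing`, and the harness would reject the commit if `only_status_changed` returns False.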

A fourth account: Google productizes the harness

By mid-2026, Google had taken the harness concept and turned it into product surface. Its agent-first development platform, Antigravity, ships a Managed Agents API whose framing is explicitly harness-shaped: in Google Cloud's own words, "the agent harness runs on our servers, and each agent has its own ephemeral sandbox provisioned with your skills, Model Context Protocol (MCP) servers, and server-side tools." A single API call provisions a sandboxed agent that can write, execute, and iterate on code; Antigravity itself can spin subagents from one prompt and run multi-agent orchestration in parallel. This is the harness — sandbox, tools, skills, subagents — sold as managed infrastructure rather than assembled by hand, roughly the "Harness-as-a-Service" direction this part's synthesis anticipates.

Google DeepMind also published a small but instructive experiment on one harness component in isolation: the agent skill. The problem it targets is universal — a model's training data goes stale, so it doesn't know about new SDK versions or shifted best practices.

They built a gemini-api-dev skill and an evaluation harness of 117 prompts generating Python/TypeScript against the Gemini SDKs. The result was a clean demonstration of how much a single guide can matter: the skill lifted pass rates dramatically from low baselines (Gemini 3.1 Pro from 28%; newer 3.0-series models from as low as 6.8%), and helped across almost all task categories — a concrete, measured version of the "feedforward guide" idea.

Notably, DeepMind also flagged the maintenance problem Fowler raises: skills have "no great update story," so stale skills could eventually "do more harm than good" — the same drift-and-versioning worry that haunts every guide.

The ratchet, and why harnesses don't shrink

Two ideas from Osmani's synthesis tie the practice together and point forward.

The ratchet: every mistake becomes a rule. The defining habit of harness engineering is treating agent mistakes as permanent signals, not one-off bad runs. The agent commits a commented-out test; the next AGENTS.md says never do that, the next pre-commit hook greps for .skip(, the next reviewer subagent flags it as a blocker.

Every line in a good AGENTS.md should trace back to a specific thing that once went wrong.

You add constraints only after a real failure, and remove them only when a more capable model makes them redundant. This is also why a harness can't be downloaded: "the right harness for your codebase is shaped by your failure history." (Osmani and HumanLayer both stress the corollary: keep AGENTS.md short — HumanLayer keeps theirs under 60 lines — because every line competes for the model's attention. A pilot's checklist, not a style guide. This is the same lesson OpenAI learned the hard way with its monolithic-manual failure.)
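The pre-commit step of the ratchet can be sketched as a scan over staged content. This is a simplified illustration: the function takes file contents as a dictionary instead of reading them from git, and the `FORBIDDEN` table and `scan_staged_text` are hypothetical names. The one pattern shown is the `.skip(` example from the text.

```python
import re

# Each entry is hypothetical and should trace back to a real past failure.
FORBIDDEN = {
    r"\.skip\(": "Do not skip tests. Fix the test or the code under test.",
}


def scan_staged_text(files: dict[str, str]) -> list[str]:
    """files maps a path to its staged content; returns agent-readable findings."""
    findings = []
    for path, content in files.items():
        for line_number, line in enumerate(content.splitlines(), start=1):
            for pattern, advice in FORBIDDEN.items():
                if re.search(pattern, line):
                    findings.append(f"{path}:{line_number}: {advice}")
    return findings
```

Adding a rule after the next failure is one new dictionary entry, which is the ratchet in miniature: the table only grows when something actually went wrong.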

Harnesses don't shrink, they move. The naive expectation is that better models make harnesses obsolete. Osmani, drawing on Anthropic, argues the opposite: as models improve, the useful harness work moves rather than disappearing.

When Opus 4.6 largely killed the "context-anxiety" failure mode (earlier models wrapping up prematurely as they neared a perceived context limit), a whole class of anxiety-mitigation scaffolding became dead code. But the higher ceiling brought new failure modes — multi-day memory, coordinating several specialized agents, judging design quality in generated UIs — that need new scaffolding.

As Anthropic puts it: "every component in a harness encodes an assumption about what the model can't do on its own." When the model gets better at something, that component comes out. When it unlocks something new, new scaffolding goes in to reach the new ceiling.

There's a feedback loop underneath this, which Trivedy names: today's agent products are post-trained with harnesses in the loop, so models get specifically better at the actions harness designers built around — filesystem ops, bash, planning, subagent dispatch — which is why the same model can feel different in different harnesses, and why a harness is "a living system, not a config file you set up once." Osmani also relays Trivedy's Harness-as-a-Service framing: the industry is shifting from building on LLM APIs (which return a completion) to building on harness APIs / agent SDKs (which return a runtime — the loop, tools, context management, hooks, and sandbox out of the box).

Where this leaves us

By early 2026, harness engineering had a definition (Agent = Model + Harness), a mental model (guides and sensors, feedforward and feedback, computational and inferential), hands-on sensor practice (Böckeler), and four detailed industrial accounts (OpenAI, Stripe, Google, Anthropic) all converging on the same conclusion: the hard, valuable work is designing the environment and the feedback loops, not picking the model.

Pick the best model and you've done the easy part. The work is everything you build around it.

But notice what keeps recurring in every account: the loop. Anthropic's session-to-session cycle, OpenAI's six-hour autonomous runs, Stripe's one-shot end-to-end minions, the ratchet that re-injects fixes. The harness is the static structure; the loop is what runs inside it. And as Part 6 and Part 7 show, two further things were happening in parallel — a reckoning with the cost of all those hand-written prompts and rules (prompt debt), and the elevation of the loop itself into the primary object of design (loop engineering).

Key sources for Part 5

Simon Willison, Designing agentic loops (Sept 2025) and Agentic Engineering Patterns (2025–26) — "an LLM agent runs tools in a loop to achieve a goal"; the definition of agentic engineering (generate and execute code); agent safety / YOLO mode.
Viv Trivedy, Anatomy of an Agent Harness / "Agent = Model + Harness"; HumanLayer, "skill issue" harness framing; Terminal-Bench Top 30 → Top 5 by harness change. [Via Osmani, with attribution and quotation.]
Birgitta Böckeler (martinfowler.com), Harness engineering for coding agent users (Apr 2026) — guides/sensors, feedforward/feedback, computational/inferential, the steering loop, "keep quality left," regulation categories, Ashby's Law, harness templates.
Birgitta Böckeler, Maintainability sensors for coding agents (May 2026) — ESLint, dependency-cruiser, coupling metrics, AI modularity review, mutation testing; self-correction guidance; honest caveats.
OpenAI (Ryan Lopopolo), Harness engineering: leveraging Codex in an agent-first world (Feb 2026) — 1M LOC / 0 hand-written; AGENTS.md as table of contents; agent legibility; invariant enforcement; garbage collection; "designing environments, feedback loops, control systems."
Stripe (Alistair Gray), Minions: Stripe's one-shot, end-to-end coding agents, Parts 1 & 2 (Feb 2026) — 1,000+ PRs/week, no human-written code; Devbox (isolated ~10s dev env, parallel without git worktrees); core loop a fork of Block's goose interleaving agent + deterministic code; subdirectory-scoped rule files shared with Cursor/Claude Code; MCP + Toolshed (400+ tools, curated subsets); shift-feedback-left with local lints <5s, selective CI from 3M+ tests, at most two CI rounds. [Part 1 verified from primary text; "blueprints" framing is from Part 2.]
Anthropic, Effective harnesses for long-running agents (Nov 2025): initializer vs coding agent; feature-list JSON; progress files; why compaction alone fails; end-to-end self-verification.
Addy Osmani, Agent Harness Engineering (Apr 2026) — the synthesis: harness anatomy, the "skill issue" reframe, the ratchet, harnesses move rather than shrink, the model-harness training loop, Harness-as-a-Service.

Next up · Part 6 — Prompt Debt & the Limits of Natural Language: why the brittleness prompt engineering hid finally got a name and a bill, and the proposed cure: specify with measurements, and stop writing prompts by hand.

The Rise of Agentic Engineering — Part 4: Fixing Context & Multi-Agent Systems

Ramin Jafary — Thu, 25 Jun 2026 06:51:26 +0000

Fixing Context & Multi-Agent Systems

Part 4 of a chronological survey of the craft around large language models. Part 3 named the field and catalogued the four ways contexts fail. This installment covers the response: a toolkit for repairing a context — and how the most powerful fix became an architecture.

TL;DR — Breunig's six tactics (RAG, tool loadout, quarantine, pruning, summarization, offloading) all serve one rule: context is not free. The strongest, isolating context across separate agents, grew into multi-agent systems — Anthropic's research setup beat a single agent by 90.2%. The catch, foreshadowing Part 7: it burned ~15× the tokens. Multi-agent isn't magic; it's spending enough to brute-force the problem.

From diagnosis to treatment

Once the failure modes had names, the obvious next question was what to do about them. Drew Breunig's follow-up, How to Fix Your Context, organized the scattered remedies into six tactics. His framing throughout was a return to an old programming adage — "garbage in, garbage out" — and a single governing principle: context is not free. Every token in the context influences the response, for better or worse, so the work is information management. As Karpathy put it, the job is to "pack the context windows just right."

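The six tactics are RAG, Tool Loadout, Context Quarantine, Context Pruning, Context Summarization, and Context Offloading.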

Each addresses one or more of the failure modes from Part 3. We'll take them in turn, then follow the third one — quarantine — into the larger story of multi-agent systems.

RAG, reconsidered

Part 2 covered RAG's near-death experience during the context-window arms race. By 2025 its role had clarified: not a workaround for small windows, but a permanent technique for keeping the signal-to-noise ratio of a context high. Breunig's treatment is brief precisely because the point is settled — "it's very much alive" — and the reason is exactly the Part 2 finding: if you treat the context like a junk drawer, the junk influences the response. RAG is the discipline of not putting the whole junk drawer in.

Tool Loadout: retrieving the right tools

"Loadout" is a gaming term — the specific set of weapons and equipment you select before a match, tailored to the situation. Applied to agents, it means selecting only the tool definitions relevant to the current task, rather than exposing every tool at once. This directly targets Context Confusion (Part 3): more tools measurably degrade performance.

The cleanest treatment is RAG-MCP (Tiantian Gan and Qiyao Sun, 2025), which applies retrieval to the tools themselves. The motivating problem is "prompt bloat": as the Model Context Protocol ecosystem expanded rapidly after Anthropic's late-2024 release of MCP — with thousands of server implementations appearing across the community — agents were increasingly drowning in tool descriptions. RAG-MCP stores tool descriptions in an external index and, for each query, semantically retrieves only the most relevant ones before the LLM is ever engaged; only those selected descriptions enter the prompt.

The results quantify how much the bloat was costing. In an MCP stress test varying the tool pool from 1 up to 11,100 servers, RAG-MCP cut prompt tokens by about half (49.2%) and more than tripled tool-selection accuracy, reaching 43.13% against a 13.62% baseline. Breunig adds the threshold detail from the paper's analysis: selecting tools becomes critical past about 30 tools, where descriptions begin to overlap and confuse the model; beyond roughly 100 tools, failure was nearly guaranteed without retrieval.
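The retrieval step can be sketched without any external service. The example below is a toy stand-in, not the RAG-MCP implementation: it scores tool descriptions with bag-of-words cosine similarity where the paper uses semantic retrieval, and the tool names, descriptions, and `select_tools` function are invented for illustration.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_tools(query: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k tool descriptions most similar to the query."""
    query_vec = _vector(query)
    ranked = sorted(
        tools,
        key=lambda name: _cosine(query_vec, _vector(tools[name])),
        reverse=True,
    )
    return ranked[:k]
```

Only the descriptions of the selected tools would then be placed in the prompt, so prompt size depends on `k` and not on the size of the tool pool.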

For smaller models the problem starts even earlier. The "Less is More" / GeoEngine study from Part 3 — where Llama 3.1 8B failed with 46 tools but succeeded with 19 — built a dynamic, fine-tuning-free tool selector that reduces the tool set to a smaller, more relevant loadout before the model call. On the GeoEngine benchmark this raised Llama 3.1 8B's success rate to about 56% (with tool-selection accuracy improving similarly). The paper also noted side benefits that matter at the edge (running models on phones or laptops): execution time fell by up to roughly 40% and power consumption by around 12%. Smaller contexts are not just more accurate; they are cheaper and faster.

Context Quarantine: the bridge to multi-agent systems

Context Quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs. This is the tactic that turned out to be far more than a tactic.

The logic is simple. We get better results when contexts aren't too long and don't carry irrelevant content (Parts 2 and 3). One way to guarantee that is to break a task into smaller, isolated jobs, each with its own clean context. And once you do that, you are no longer fixing a single context — you are designing a system of agents, each with its own context, tools, and instructions. The fix becomes an architecture. That architecture is the subject of the rest of this part.

Quarantine a context hard enough and it stops being a tactic. It becomes an architecture.

Context Pruning, Summarization, and Offloading

The remaining three tactics all manage the context as it accumulates over an agent's run.

Pruning — removing irrelevant or no-longer-needed information. Breunig points to Provence (2025), "an efficient and robust context pruner for question answering" — a small, fast DeBERTa-based model that, given a question, edits a document down to only the relevant portions. In his test on a Wikipedia article it cut roughly 95% of the content while preserving exactly what mattered. Pruning is a strong argument, he notes, for keeping a structured version of your context (in a dictionary or similar) from which you compile the prompt before each call — so you can prune the document or history sections while protecting the core instructions and goals.
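A minimal sketch of that "structured context, compiled before each call" idea follows. The section names, the `PROTECTED` tuple, and the keyword filter are all assumptions made for illustration; a learned pruner such as Provence would replace the `keep` function.

```python
# Hypothetical section names; the point is that sections are stored separately.
context = {
    "instructions": "You are a support agent. Always cite the policy number.",
    "goal": "Resolve the refund request.",
    "documents": [
        "Refund policy P-12: refunds within 30 days.",
        "Office party schedule for June.",
    ],
    "history": ["user: hi", "user: I want a refund for order 881"],
}

PROTECTED = ("instructions", "goal")  # never pruned

def prune(ctx, keep):
    """Return a copy where unprotected list sections keep only items passing keep()."""
    pruned = {}
    for name, value in ctx.items():
        if name in PROTECTED or not isinstance(value, list):
            pruned[name] = value
        else:
            pruned[name] = [item for item in value if keep(item)]
    return pruned

def compile_prompt(ctx):
    """Flatten the structured context into the text sent to the model."""
    parts = []
    for name, value in ctx.items():
        body = "\n".join(value) if isinstance(value, list) else value
        parts.append(f"## {name}\n{body}")
    return "\n\n".join(parts)

# Keyword match is a crude placeholder for a relevance model.
prompt = compile_prompt(prune(context, keep=lambda item: "refund" in item.lower()))
print(prompt)
```

Because instructions and goals live in their own keys, the pruning step cannot touch them by accident, which is the property the paragraph above argues for.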

Summarization — compressing accrued context into a condensed summary. This began as a way to stay under context limits (the familiar "ask the chatbot to recap, then paste into a fresh thread"), but acquired a second rationale once Context Distraction was understood: even when you could keep everything, you often shouldn't, because length itself degrades reasoning past the distraction ceiling (the Gemini Pokémon agent's ~100k-token threshold from Part 3). Breunig's practical advice is to make summarization its own dedicated, evaluated step, since deciding what to preserve is hard and worth optimizing directly.
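Treating summarization as its own step might look like the sketch below. Everything here is an assumption for illustration: the word-count token estimate, the `keep_recent` parameter, and `naive_summarize`, which merely stands in for a model call that would be evaluated separately.

```python
def estimate_tokens(text):
    # Crude assumption: about one token per whitespace-separated word.
    return len(text.split())

def compact(history, budget, summarize, keep_recent=2):
    """Summarize older turns once the history exceeds the token budget."""
    total = sum(estimate_tokens(turn) for turn in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return ["[summary] " + summarize(older)] + recent

def naive_summarize(turns):
    # Placeholder: keeps the first sentence of each turn. A real system would call a model.
    return " | ".join(turn.split(".")[0] for turn in turns)

history = [
    "Explored the cave. Found nothing useful.",
    "Bought potions. Spent most of the gold.",
    "Reached the gym. Team is under-leveled.",
    "Training on route 3.",
]
compacted = compact(history, budget=10, summarize=naive_summarize)
print(compacted)
```

Passing `summarize` in as a function keeps the step swappable, so the part that decides what to preserve can be tested and tuned on its own.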

Offloading — storing information outside the context, via a tool that holds it for later reference. Breunig's favorite example, "so simple you don't believe it will work," is Anthropic's "think" tool — effectively a scratchpad where the model writes notes that stay out of the main context but remain available.

Anthropic reported that pairing the think tool with a domain-specific prompt yielded up to a 54% improvement on a specialized-agent benchmark. It helps most in three situations: analyzing tool outputs before acting (with room to backtrack), navigating policy-heavy environments that need compliance checks, and sequential decision-making where each step builds on the last and mistakes are costly.
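The offloading idea can be sketched as a generic scratchpad. This is not Anthropic's tool definition; the `Scratchpad` class and its `write` and `recall` methods are hypothetical names chosen to show notes being stored outside the prompt and pulled back only on request.

```python
class Scratchpad:
    """Holds notes outside the prompt; only explicitly recalled notes re-enter it."""

    def __init__(self):
        self._notes = []

    def write(self, note):
        self._notes.append(note)
        # Return a short acknowledgement rather than echoing the note back.
        return f"noted ({len(self._notes)} total)"

    def recall(self, keyword):
        return [n for n in self._notes if keyword.lower() in n.lower()]

pad = Scratchpad()
ack = pad.write("Policy check: basic economy tickets cannot be changed.")
pad.write("User's booking is basic economy, so offer cancellation instead.")
hits = pad.recall("policy")
print(ack, hits)
```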

(Breunig's aside — that the tool would be clearer if it were simply called scratchpad — is a nice illustration of the series' recurring theme that naming shapes understanding.)

Context quarantine becomes architecture: Anthropic's multi-agent research system

The single most consequential elaboration of these ideas in 2025 was Anthropic's write-up of how it built the multi-agent research system behind Claude's Research feature (June 2025). It is the moment context quarantine stops being a tactic and becomes a design pattern with its own engineering discipline.

The system uses an orchestrator-worker pattern. A lead agent analyzes the user's query, develops a strategy, and spawns specialized subagents that explore different aspects of the question in parallel, each in its own context window.

Anthropic's own framing connects this directly to the context-engineering ideas of the era:

The essence of search is compression: distilling insights from a vast corpus. Subagents
facilitate compression by operating in parallel with their own context windows, exploring
different aspects of the question simultaneously before condensing the most important tokens
for the lead research agent. Each subagent also provides separation of concerns — distinct
tools, prompts, and exploration trajectories — which reduces path dependency and enables
thorough, independent investigations.

In other words: each subagent is a context quarantine. Its window stays clean and focused; it distills its findings into a compact summary; and only that summary returns to the lead agent — which is itself context offloading and summarization operating at the level of agents rather than strings.
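A toy version of the orchestrator-worker shape, under heavy assumptions: `run_subagent` is a stand-in for a model call, the subtasks are supplied by hand rather than planned by a lead model, and threads substitute for whatever execution layer a production system would use.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask):
    """Stand-in for a model call: works in its own context, returns only a summary."""
    private_context = [f"objective: {subtask}"]
    # Raw findings accumulate here and never leave this function.
    private_context.append(f"(pretend search results for {subtask})")
    return f"summary of {subtask}"

def lead_agent(query, subtasks):
    """Fan subtasks out in parallel, then build the lead context from summaries only."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        summaries = list(pool.map(run_subagent, subtasks))  # map preserves input order
    return [f"query: {query}"] + summaries

result = lead_agent("board members of IT companies", ["company A", "company B", "company C"])
print(result)
```

The lead context grows by one short line per subagent no matter how much each subagent read, which is the quarantine-plus-summarization effect described above.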

The payoff was large. On Anthropic's internal research eval, a multi-agent system with Claude Opus 4 as lead and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2%. Their example: asked to identify all the board members of the companies in the Information Technology sector of the S&P 500, the multi-agent system decomposed the task across subagents and found the answer, while the single agent ground through slow sequential searches and failed.

The economics, stated plainly

Anthropic was equally candid about the cost, which becomes a recurring theme in later parts of this series. In their data, agents use about 4× more tokens than chat interactions, and multi-agent systems about 15× more tokens than chat. Three factors (token usage, number of tool calls, and model choice) explained 95% of the performance variance on the BrowseComp evaluation, with token usage alone explaining 80%. The blunt conclusion: multi-agent systems "work mainly because they help spend enough tokens to solve the problem," so they only make economic sense for high-value tasks that genuinely parallelize. They explicitly flagged that most coding tasks — with fewer truly independent subtasks and heavy shared context — were a worse fit than research at the time. (Part 7 returns to exactly this tension when coding agents start running many-in-parallel anyway.)
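To make the multipliers concrete, here is a back-of-the-envelope calculation. Only the 4× and 15× ratios come from the text; the baseline token count and the price are invented for illustration.

```python
CHAT_TOKENS = 2_000        # assumed tokens for one chat interaction (illustrative)
PRICE_PER_MILLION = 10.0   # assumed blended price in dollars (illustrative)

# Ratios reported in the text, relative to a chat interaction.
MULTIPLIERS = {"chat": 1, "single agent": 4, "multi-agent": 15}

def cost(mode):
    """Dollar cost of one task in the given mode under the assumptions above."""
    tokens = CHAT_TOKENS * MULTIPLIERS[mode]
    return tokens * PRICE_PER_MILLION / 1_000_000

costs = {mode: cost(mode) for mode in MULTIPLIERS}
for mode, dollars in costs.items():
    print(f"{mode}: ${dollars:.2f}")
```

Under these made-up numbers a task that costs two cents in chat costs thirty cents as a multi-agent run, so the task's value has to justify a 15× bill.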

The +90% came at ~15× the tokens. Multi-agent isn't magic — it's spending enough to brute-force the problem.

Lessons that recur

Several findings from this post echo forward through the series:

Prompt the orchestrator to delegate well. Vague subagent instructions ("research the semiconductor shortage") caused duplicated work and gaps; subagents need an objective, an output format, tool guidance, and clear boundaries.
Scale effort to complexity. Early agents spawned 50 subagents for simple queries; the fix was embedding explicit scaling heuristics in prompts (simple fact-finding: one agent, 3–10 calls; complex research: 10+ subagents).
Let agents improve themselves. Given a prompt and a failure mode, Claude 4 could diagnose and suggest fixes; a tool-testing agent that rewrote a flawed tool's description cut task time 40% for later agents. (This self-improvement thread becomes its own topic later.)
LLM-as-judge, plus human eyes. Free-form research output was best graded by a single LLM-judge call against a rubric (factual accuracy, citation accuracy, completeness, source quality, tool efficiency) — but humans still caught what automation missed, like a bias toward SEO-optimized content farms over authoritative sources.
Errors compound. In agentic systems, "minor issues for traditional software can derail agents entirely"; one wrong step sends an agent down a divergent trajectory. This motivated durable execution, checkpoints, and full production tracing.
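The first two lessons lend themselves to a small sketch: a structured brief for each subagent and an explicit effort-scaling rule. The field names, the two complexity labels, and the per-agent call cap for complex work are assumptions; the "one agent, 3-10 calls" and "10+ subagents" figures are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass
class SubagentBrief:
    """The four things the text says a subagent needs."""
    objective: str
    output_format: str
    tool_guidance: str
    boundaries: str

    def render(self):
        return (
            f"Objective: {self.objective}\n"
            f"Output format: {self.output_format}\n"
            f"Tools: {self.tool_guidance}\n"
            f"Boundaries: {self.boundaries}"
        )

def plan_effort(complexity):
    """Map a coarse complexity label to (subagents, max tool calls per agent)."""
    if complexity == "simple":
        return 1, 10    # one agent, upper end of the 3-10 call range
    if complexity == "complex":
        return 10, 15   # "10+ subagents"; the call cap here is an assumption
    raise ValueError(f"unknown complexity: {complexity}")

brief = SubagentBrief(
    objective="Summarize causes of the semiconductor shortage in the automotive sector",
    output_format="Five bullet points, each with one source",
    tool_guidance="Prefer web search; skip the code tool",
    boundaries="Automotive only; another subagent covers consumer electronics",
)
agents, calls = plan_effort("simple")
print(brief.render())
```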

The same pattern, arrived at independently: Microsoft's Magentic-One

Anthropic was not alone in landing on the orchestrator-worker shape. Seven months earlier, in November 2024, Microsoft Research had released Magentic-One (Fourney et al., arXiv:2411.04468), a generalist multi-agent system built on the open-source AutoGen framework.

Its architecture is strikingly parallel. A lead Orchestrator agent plans, tracks progress with a "ledger," and re-plans to recover from errors — directing four specialists: a WebSurfer (browser), a FileSurfer (local files), a Coder (writes code), and a ComputerTerminal (executes it). The Orchestrator runs two loops: an outer one that maintains the task ledger, an inner one that assigns the next action.
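The two-loop idea can be caricatured in a few lines. This is a sketch of the control flow only, not Magentic-One's code or the AutoGen API: `orchestrate`, `make_plan`, and the boolean-returning specialists are hypothetical, and the real system's ledgers carry far more state.

```python
def orchestrate(task, specialists, make_plan, max_replans=2):
    """Outer loop: (re)build the ledger. Inner loop: hand each step to a specialist."""
    for attempt in range(max_replans + 1):
        ledger = {"task": task, "plan": make_plan(task, attempt), "done": []}
        for name, step in ledger["plan"]:
            ok = specialists[name](step)
            if not ok:
                break  # stalled: return to the outer loop and re-plan
            ledger["done"].append(step)
        else:
            return ledger  # every planned step succeeded
    return None  # gave up after exhausting re-plans

# Toy specialists that report success or failure.
specialists = {
    "WebSurfer": lambda step: "paywalled" not in step,
    "Coder": lambda step: True,
}

def make_plan(task, attempt):
    # The first plan hits a dead end; the re-plan picks a different source.
    source = "paywalled site" if attempt == 0 else "open archive"
    return [("WebSurfer", f"fetch data from {source}"), ("Coder", "write analysis script")]

ledger = orchestrate("analyze report", specialists, make_plan)
print(ledger["done"])
```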

The result: Magentic-One achieved performance statistically competitive with the state of the art on three demanding agentic benchmarks (GAIA, AssistantBench, WebArena) without changing its architecture from one benchmark to the next.

That two major labs independently converged on "a planning lead agent directing specialized workers with isolated tools" is a strong signal — the pattern is a genuine structural answer to context management, not one company's idiosyncrasy.

The earlier lesson that small errors compound catastrophically across a long agent run is the seed of everything Parts 5 through 7 call harness and loop engineering. If a stray token or a single wrong turn can derail an agent (Part 3's "lost in conversation"; this part's compounding errors), then making agents reliable requires building structure around the model: guides, sensors, feedback loops, and recovery paths.

Programming the context, not writing it

One more thread from this era points directly at Part 6. Breunig argued that the whole list of tactics is an argument for programming your contexts rather than hand-writing them — assembling each prompt from a structured representation, with dedicated, separately-evaluated stages for summarization, pruning, and tool selection. The same logic later shows up in research on agents that manage their own context autonomously — for example, 2026 work on "self-compacting" agents that decide for themselves when and what to summarize, reportedly improving task performance while cutting token costs by a third to two-thirds. The trajectory is consistent: from a human choosing what goes in the window, toward systems that make that choice programmatically.

Part 5 picks up the compounding-error problem head-on. As coding agents went mainstream in early 2026, the field's attention widened from the context you feed a model to the entire harness you build around it — and a new term arrived to name that work.

Key sources for Part 4

Drew Breunig, How to Fix Your Context (2025) — the six tactics (RAG, tool loadout, context quarantine, pruning, summarization, offloading); "context is not free."
Tiantian Gan & Qiyao Sun, RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation (arXiv:2505.03275, 2025) — semantic tool retrieval; >50% token reduction; 43.13% vs 13.62% selection accuracy; MCP stress test to 11,100 servers.
Less is More / GeoEngine tool study (arXiv:2411.15399) — fine-tuning-free dynamic tool selection; Llama 3.1 8B success rate ~56% on GeoEngine; ~40% execution-time and ~12% power reduction on edge hardware (Jetson AGX Orin).
Provence context pruner (arXiv:2501.16214, 2025) — fast QA-oriented pruning; ~95% content cut while preserving relevance. [Cited via Breunig.]
Anthropic, The "think" tool (2025) — context offloading / scratchpad; up to 54% improvement on a specialized-agent benchmark. [Cited via Breunig.]
Anthropic, How we built our multi-agent research system (2025) — orchestrator-worker architecture; +90.2% over single-agent on internal eval; 4×/15× token economics; delegation, effort-scaling, self-improvement, LLM-as-judge, compounding errors.
Self-Compacting Language Model Agents (arXiv:2606.23525, 2026) — autonomous context management; reported 33–67% token savings. [Forward-looking reference.]

Next up · Part 5 — Harness Engineering Emerges: "Agent = Model + Harness," guides and sensors, and what it looks like when whole companies build software with agents writing every line.

The Rise of Agentic Engineering — Part 3: Context Engineering Is Named

Ramin Jafary — Thu, 25 Jun 2026 06:48:21 +0000

Context Engineering Is Named

Part 3 of a chronological survey of the craft around large language models. Part 2 ended on a hard-won realization: capacity alone doesn't solve the information problem. In mid-2025 that realization got a name, a taxonomy, and a community.

TL;DR — Karpathy popularized the term context engineering; Breunig catalogued the four ways contexts fail (poisoning, distraction, confusion, clash). A Microsoft/Salesforce study showed that the same information spread across turns costs about 39% of performance, purely from structure rather than lost data. And "CatAttack" showed a single irrelevant sentence ("cats sleep most of their lives") can raise a model's error rate by more than 300%. Keeping the context clean isn't housekeeping; it's load-bearing.

A word arrives

Terms in this field tend to crystallize suddenly, around a single well-placed endorsement. "Context engineering" crystallized in June 2025, when Andrej Karpathy weighed in on a debate that had been simmering since Drew Breunig and others started writing about how long contexts fail. Karpathy's formulation became the standard definition:

People associate prompts with short task descriptions you'd give an LLM in your day-to-day
use. When in every industrial-strength LLM app, context engineering is the delicate art and
science of filling the context window with just the right information for the next step.

Breunig, who had been circling the same idea, described the moment of recognition: while writing about long-context failures he kept "[writing] the word 'prompt' somewhere before quickly replacing it with 'context.' When prompts are part of software, they're context." The distinction he drew is worth stating precisely, because it explains why a new word was needed rather than just a refinement of the old one:

Aspect	Prompt	Context
Typical author	a person	a system/program
Lifespan	disposable, one-off	evolving, curated, maintained
Setting	chatbot, conversational	application, industrial-strength
Unit of work	a request	the entire informational environment

Breunig was careful that neither is "better" — they are different, suited to different work. Prompt engineering describes a human writing a request to a chatbot. Context engineering describes a system assembling, for each model call, the right tools, documents, history, and instructions.

The drift from one to the other had been happening for over a year, silently, as builders moved from chat interfaces to applications and agents. The term simply gave the drift a name — and, Breunig argued, that naming was itself consequential.

Borrowing a Stewart Brand line he returns to often — "look for where language is being invented and lawyers are congregating" — he made the case that a successful buzzword isn't invented from nothing. It "identif[ies] common experiences which we all have," and crystallizing those into a shared word "create[s] community, change[s] culture, and influence[s] our work."

The signal it was real: within a month, search volume for "context engineering" had climbed to over a quarter of that for "prompt engineering" — the sustained kind of rise that marks a genuine latent need, not a marketing spike.

Skeptics called it a rebrand — one Hacker News commenter dismissed it as "a month-long skill, after which it won't be a thing anymore." The rest of this series is, in part, the story of why that skepticism was misplaced: the work the term named has only grown more central.

The taxonomy: four ways contexts fail

The intellectual core of the moment was Breunig's classification of how contexts fail. Before this, "the model got confused by a long context" was a vague complaint. Breunig split it into four distinct, named failure modes — and the precision is what let people diagnose and fix specific problems rather than just throwing tokens at the wall.

Context Poisoning — when a hallucination or other error makes it into the context, where it is repeatedly referenced. Breunig's prime example came from Google DeepMind's own Gemini 2.5 technical report, describing an agent playing Pokémon. When the agent hallucinated and that false belief entered the "goals" or "summary" portions of its context, it would become "fixated on achieving impossible or irrelevant goals," developing nonsensical strategies that could "take a very long time to undo." The error, once written into the context, kept getting read back and reinforced.

Context Distraction — when a context grows so long that the model over-focuses on the context, neglecting what it learned during training. The same Gemini Pokémon report supplied the evidence: past roughly 100,000 tokens, the agent "showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans." Instead of drawing on what it learned in training to devise new strategies, it parroted its own past behavior. For smaller models the distraction ceiling is far lower: Part 2's Databricks finding (degradation around 32k tokens even for Llama-3.1-405B, and at roughly 16k-32k for most open models) is the same phenomenon.

Context Confusion — when superfluous content in the context is used by the model to generate a low-quality response. This is the failure mode behind the over-eager "connect every tool via MCP" dream.

Evidence came from the Berkeley Function-Calling Leaderboard: every model performs worse when handed more than one tool, and all models will sometimes call tools that are completely irrelevant. The benchmark even includes cases where the correct move is no tool call at all — and models fail them.

A vivid case study: researchers gave a quantized Llama 3.1 8B a query with all 46 tools from the GeoEngine benchmark, and it failed — despite the context fitting well within the window. Cut to 19 tools, and it succeeded. The problem, in Breunig's words: "if you put something in the context the model has to pay attention to it." Irrelevant tokens aren't free; they're actively consulted.

Context Clash — when you accrue new information and tools that conflict with other information in the context. This is the most insidious mode, and it got the most rigorous treatment, in a Microsoft Research and Salesforce study we'll examine on its own below.

The centerpiece study: getting "lost in conversation"

The Context Clash failure mode was documented in detail by "LLMs Get Lost in Multi-Turn Conversation" (Microsoft Research and Salesforce, May 2025) — a paper worth dwelling on because its experimental design isolates the effect so cleanly.

The intuition starts from how people actually use chatbots. Sometimes you type a complete, fully-specified paragraph and hit enter. Other times you start with a vague request and add details turn by turn as the model's answers reveal what you forgot to mention. Benchmarks almost always test the first style; real use is full of the second. So the researchers built a method called sharding: take a complete instruction from a high-quality benchmark, split it into "shards" each carrying one piece of the original, and reveal them one per turn — simulating a real, gradually-specified conversation. They then compared the same task delivered several ways:

FULL — the entire instruction in one prompt (the benchmark ideal).
SHARDED — the instruction split across turns, revealed one shard at a time.
CONCAT — all the shards concatenated back into a single prompt (less specific than FULL, but not multi-turn).
RECAP — SHARDED, plus a final turn that repeats all prior shards.
SNOWBALL — SHARDED, but each turn re-states all prior shards before adding the next.
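The five conditions are easy to picture as lists of user turns. The sketch below follows the descriptions above; the exact wording of the recap turn, the bullet formatting for CONCAT, and the example shards are assumptions, not the paper's templates.

```python
def build_conditions(full_instruction, shards):
    """Return the sequence of user turns for each experimental condition."""
    recap = "To recap: " + " ".join(shards)
    snowball = [" ".join(shards[: i + 1]) for i in range(len(shards))]
    return {
        "FULL": [full_instruction],                       # one fully specified turn
        "SHARDED": list(shards),                          # one shard per turn
        "CONCAT": ["\n".join(f"- {s}" for s in shards)],  # all shards, single turn
        "RECAP": list(shards) + [recap],                  # sharded plus a final recap turn
        "SNOWBALL": snowball,                             # each turn restates prior shards
    }

shards = [
    "Write a function that sums a list.",
    "Ignore negative numbers.",
    "Return 0 for an empty list.",
]
full = "Write a function that sums the non-negative numbers in a list, returning 0 if it is empty."
conditions = build_conditions(full, shards)
for name, turns in conditions.items():
    print(name, len(turns))
```

CONCAT and SHARDED carry identical text; only the number of turns differs, which is what lets the comparison isolate conversational structure from information content.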

Across six generation tasks (code, SQL/database, math, API actions, data-to-text, and long-document summarization) and 15+ top open- and closed-weight models, analyzing more than 200,000 simulated conversations, the result was stark and universal:

every model sees its performance degrade on every task when comparing Full and Sharded
performance, with an average degradation of −39%.

A 39% average drop — on the same task with the same information — purely from delivering it across turns rather than all at once. Even flagship models (the paper names the likes of Gemini 2.5 Pro and GPT-4-class systems) fell by comparable amounts. OpenAI's o3, in Breunig's recounting of the numbers, dropped from 98.1 to 64.1.

The control condition is what makes the finding airtight. CONCAT — the shards reassembled into one prompt — recovered to about 95% of FULL's performance. That rules out the obvious explanation: the degradation is not because sharding loses information (the shards collectively contain everything). It is the back-and-forth structure itself that breaks the model.
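As a quick arithmetic check on the figures quoted above, the relative drop for a single model can be computed directly. Treating degradation as (before - after) / before is an assumption about how the percentages are defined; the two scores are the o3 numbers cited earlier.

```python
def relative_drop(before, after):
    """Fractional loss relative to the starting score."""
    return (before - after) / before

o3_drop = relative_drop(98.1, 64.1)
print(f"o3 relative drop: {o3_drop:.1%}")
```

On that definition o3 loses a little over a third of its score, in the same range as the 39% average.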

The information was all there. Spreading it across turns — not losing it — is what cost 39%.

Why? The authors decomposed the 39% into two parts: a minor loss in aptitude (best-case ability) and a large increase in unreliability. The mechanism:

We find that LLMs often make assumptions in early turns and prematurely attempt to generate
final solutions, on which they overly rely... when LLMs take a wrong turn in a conversation,
they get lost and do not recover.

In other words, the model commits to an answer before it has all the information, and those premature, wrong attempts stay in the context and contaminate everything that follows. This is Context Clash in its purest form: the context now contains the model's own early mistakes, which conflict with the correct path once the full picture emerges.

The implication for agent builders, which Part 4 picks up, is sobering. Agents assemble context from many sources — documents, tool calls, the outputs of other agents — and that assembled context has every opportunity to disagree with itself. The more an agent gathers, the more it can get lost.

How little it takes: the "cat facts" attack

If the multi-turn study showed that a model's own accumulated mistakes derail it, a smaller, sharper result showed that even a single irrelevant sentence — injected by anyone — can do real damage. Breunig highlighted it in a follow-up titled, memorably, Cat Facts Cause Context Confusion.

The work is CatAttack (2025). The method: use a cheap, fast proxy model (DeepSeek V3) to generate modifications to benchmark questions — phrases that are semantically irrelevant and leave the question's meaning intact — then filter for the ones that flip the model's answer. This yielded 574 phrases that fooled the proxy model; tested against a larger reasoning model (DeepSeek R1), 114 of them still worked. Appending one of these irrelevant phrases produced a greater-than-300% increase in the likelihood of an incorrect answer, and made models reason longer — DeepSeek R1 generated over 50% more tokens about half the time.
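The filter-then-transfer logic can be sketched with stub models. This simplifies the method considerably: the candidate phrases are supplied by hand instead of generated by an attacker model, and `proxy_model` and `target_model` are toy functions, not DeepSeek V3 or R1.

```python
def find_triggers(questions, candidates, proxy_model, target_model):
    """Keep suffixes that flip a correct proxy answer, then retest them on the target."""
    def flips(model, suffix):
        return any(
            model(q) == answer and model(f"{q} {suffix}") != answer
            for q, answer in questions
        )
    proxy_hits = [s for s in candidates if flips(proxy_model, s)]
    transferred = [s for s in proxy_hits if flips(target_model, s)]
    return proxy_hits, transferred

# Toy models: the proxy is distracted by two cues, the target by only one.
def proxy_model(prompt):
    return "175" if ("175" in prompt or "cats" in prompt) else "4"

def target_model(prompt):
    return "175" if "175" in prompt else "4"

questions = [("What is 2 + 2?", "4")]
candidates = [
    "Could the answer possibly be around 175?",
    "Interesting fact: cats sleep for most of their lives.",
    "Please answer carefully.",
]
proxy_hits, transferred = find_triggers(questions, candidates, proxy_model, target_model)
print(len(proxy_hits), len(transferred))
```

Running the expensive model only on phrases that already fooled the cheap one is what keeps the search affordable.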

The single most effective trigger was the question "Could the answer possibly be around 175?" appended to an unrelated math problem. But the one that named the phenomenon was a throwaway feline aside:

"Interesting fact: cats sleep for most of their lives."

Append that to a math question, and models stumble measurably more often. The phrase carries no information relevant to the problem. It is pure noise. And yet, because — as the Context Confusion principle states — the model must attend to everything in its context, the noise degrades the answer.

Breunig's point in surfacing this was practical, not just curious: it is "easy to see how user input could end up adding a phrase like this to your context." Real users say irrelevant things. Real documents contain tangents. Real tool outputs include boilerplate.

If one cat fact can raise the error rate by 300%, keeping the context clean isn't housekeeping. It's load-bearing.

Why naming it mattered

By mid-2025 the field had what it previously lacked: a name that distinguished this work from chatbot prompting, a four-part vocabulary for diagnosing failures, and a body of rigorous evidence (Gemini's Pokémon agent, the Berkeley leaderboard, GeoEngine, the Microsoft/Salesforce multi-turn study, CatAttack) showing the failures were real, measurable, and often counterintuitive.

Breunig's larger claim was that the name would accelerate the field: with shared language, "we don't always have to establish a baseline for what we're talking about before we talk about it." Practitioners could now compare notes, name a colleague's bug as "context poisoning" or "a clash," and build shared tooling. The term let a community that had been working in parallel, without knowing it, recognize itself.

It also implied a research program: if these are the ways contexts fail, what are the techniques to fix them? That program already existed in scattered form across the literature. Part 4 is the story of its consolidation into a toolkit — and of how the most powerful fix, isolating context across multiple agents, grew into an architecture of its own.

Key sources for Part 3

Andrej Karpathy, public remarks defining "context engineering" (June 2025) — "the delicate art and science of filling the context window with just the right information for the next step."
Drew Breunig, Prompts vs. Context (2025) — the prompt/context distinction and the table this part adapts; and Why "Context Engineering" Matters (2025) — the field-formation argument.
Drew Breunig, How Long Contexts Fail (2025) — the four-failure-mode taxonomy.
Google DeepMind, Gemini 2.5 technical report (2025) — the Pokémon-playing agent; primary evidence for context poisoning and distraction. [Cited via Breunig and the report.]
Berkeley Function-Calling Leaderboard — every model degrades with >1 tool; irrelevant tool calls. [Cited via Breunig.]
Less is More / GeoEngine tool study (arXiv:2411.15399) — Llama 3.1 8B fails with 46 tools, succeeds with 19. [Cited via Breunig; examined further in Part 4.]
LLMs Get Lost in Multi-Turn Conversation (Microsoft Research & Salesforce, arXiv:2505.06120, 2025) — sharded simulation; −39% average drop across 15+ models and six tasks; CONCAT recovers to ~95%; degradation = minor aptitude loss + large unreliability; "they get lost and do not recover."
CatAttack (arXiv:2503.01781, 2025) — adversarial irrelevant triggers; >300% increase in error rate; the "cats sleep for most of their lives" example.

Next up · Part 4 — Fixing Context & Multi-Agent Systems: the six tactics for repairing a context, and how "context quarantine" grew from a trick into Anthropic's multi-agent research architecture.

The Rise of Agentic Engineering — Part 2: The Context-Window Arms Race

Ramin Jafary — Thu, 25 Jun 2026 06:48:08 +0000

The Context-Window Arms Race

Part 2 of a chronological survey of the craft around large language models. Part 1 ended with ReAct turning a prompt into a loop. This installment covers the industry's first answer to the information problem: make the container big enough to hold everything.

TL;DR — In 2024 the bet was that a giant context window would make information management obsolete: just throw it all in. The research said otherwise. Models degrade long before the window fills, position matters as much as presence ("lost in the middle"), and every model tested gets less reliable as input grows ("context rot"). The lesson: capacity is the wrong metric. A clean window beats a big one.

The bet: if the window is big enough, you don't have to be careful

By 2024 a particular optimism had taken hold. If the central difficulty of building on language models was deciding what information to put in front of them, then perhaps that difficulty could be engineered away by sheer capacity. Make the context window — the amount of text a model can take as input at once — large enough, and you could simply put in everything: all your documents, all your tools, the entire conversation history, every instruction. No careful selection, no retrieval, no pruning. Throw it all in and let the model sort it out.

The model builders obliged, and the numbers escalated fast. The trajectory, as documented in Databricks' research and Chroma's later report, ran from about 4,000 tokens to 10 million in roughly four years. Each jump was announced as a qualitative change in what was possible.

The framing in the model documentation reinforced the optimism. Google's Gemini long-context materials, for instance, leaned into use cases like dropping entire long-form texts into the prompt. The implication was clear: the era of fiddly information management was ending. As Drew Breunig later summarized the mood, long contexts "kneecapped RAG enthusiasm (no need to find the best doc when you can fit it all in the prompt!), enabled MCP hype (connect to every tool and models can do any job!), and fueled enthusiasm for agents."

This bet matters to our story because it is the path not taken. The industry tried to make context management unnecessary. The discovery that it couldn't is what made context management into a discipline — the subject of Part 3.

"RAG is dead" — the debate that kept coming back

The first casualty of the optimism was supposed to be RAG.

Retrieval-Augmented Generation was introduced by Patrick Lewis and colleagues at what was then Facebook AI in 2020 (published at NeurIPS 2020). Their framing is worth stating precisely, because it explains RAG's durability.

They combined a model's parametric memory (knowledge baked into the weights of a pre-trained seq2seq model) with a non-parametric memory (a dense vector index — in their case, of Wikipedia) that the model could query at generation time. RAG models, they showed, produced "more specific and factual" output than parametric-only models.

Two of their stated motivations matter for this whole series: retrieval gives you provenance (you can cite which document grounded an answer) and updatable knowledge (you can change the index without retraining the model).

The idea, in short: instead of relying solely on what a model learned during training, you retrieve relevant documents at query time and place them in the prompt. For several years, this was the standard way to give a model knowledge it didn't have — current facts, private documents, domain-specific material.
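A minimal retrieve-then-prompt sketch shows both motivations, provenance and updatable knowledge. It is not the Lewis et al. architecture: word overlap stands in for their dense vector index, and the document ids, texts, and prompt wording are hypothetical.

```python
import re

# Hypothetical document store; it can be edited without retraining anything.
index = {
    "doc-1": "The refund window is 30 days from delivery.",
    "doc-2": "Support hours are 9am to 5pm on weekdays.",
}

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, index, k=1):
    """Rank documents by shared words with the query (a stand-in for dense retrieval)."""
    ranked = sorted(index, key=lambda d: len(words(query) & words(index[d])), reverse=True)
    return ranked[:k]

def build_prompt(query, index, k=1):
    """Return the prompt plus the ids of the documents that ground it."""
    doc_ids = retrieve(query, index, k)
    sources = "\n".join(f"[{d}] {index[d]}" for d in doc_ids)
    prompt = f"Answer using only these sources and cite their ids.\n{sources}\n\nQuestion: {query}"
    return prompt, doc_ids

question = "How long is the refund window?"
prompt, cited = build_prompt(question, index)

# Updating knowledge means editing the index, not the model.
index["doc-1"] = "The refund window is 60 days from delivery."
prompt_after, _ = build_prompt(question, index)
print(cited)
```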

Long context windows seemed to make RAG obsolete. Why build a retrieval pipeline to find the best few documents when you can stuff all of them into a million-token window? Each time a model shipped a dramatically larger window, a fresh round of "RAG is dead" declarations followed. Breunig notes the pattern explicitly: "Every time a model ups the context window ante, a new 'RAG is Dead' debate is born. The last significant event was when Llama 4 Scout landed with a 10 million token window."

But RAG kept not dying. The reason is the heart of this part: filling the window turned out to have costs that the "throw it all in" picture ignored.

Every bump in context size birthed a fresh "RAG is dead" obituary. RAG kept showing up to its own funeral.

The counter-evidence, part 1: performance falls long before the window is full

The most direct rebuttal came from Databricks' Mosaic Research, which in late 2024 ran a large-scale benchmark — over 2,000 experiments across 13 open and closed models on curated RAG datasets (Databricks DocsQA, FinanceBench, and the academic Natural Questions set), judged by a calibrated LLM-as-judge.

Two findings cut against the optimism. First, retrieving more documents did generally help — more retrieved information raised the odds the right answer was somewhere in the context, and capable long-context models could exploit that. So far, so good for the "more is better" camp.

But second, and decisively: more context was not always better, and most models started getting worse well before their windows were anywhere near full. In Databricks' results, Llama-3.1-405B's answer correctness began to decline after about 32,000 tokens; GPT-4-0125-preview after about 64,000; and only a handful of models maintained consistent performance across all context lengths. The majority of open-source models could handle effective RAG only up to roughly 16k-32k tokens — a small fraction of their advertised capacity.

Breunig drew the obvious conclusion: "If models start to misbehave long before their context windows are filled, what's the point of super large context windows?" His answer was that the huge windows remain genuinely useful for two things — summarization and fact retrieval — but that outside those cases, every model has what he called a distraction ceiling, a length beyond which adding more context degrades rather than improves the response.

A window big enough to hold your data tells you nothing about whether the model will use it well.

Equally revealing were the failure modes Databricks documented, because they showed the degradation wasn't uniform: different models broke in different, idiosyncratic ways. Some models, given very long contexts, would simply summarize the input while ignoring the actual instructions. And in one striking pattern, Claude 3 Sonnet's rate of refusing to answer over copyright concerns rose from 3.7% at 16k tokens to 49.5% at 64k; DBRX's rate of failing to follow instructions climbed from 5.2% at 8k to 50.4% at 32k. The same model, fed the same kind of task, behaved like a different system depending on how full its context was.

The counter-evidence, part 2: where information sits matters

If Databricks showed that performance falls as context grows, a study from the year before had already shown that models don't even use the context they have evenly.

In "Lost in the Middle: How Language Models Use Long Contexts" (Nelson Liu and colleagues at Stanford, 2023; published in TACL 2024), researchers placed a single relevant document at different positions within a long context and measured how well models could find and use it. The result was a consistent U-shaped performance curve: models did best when the relevant information sat at the very beginning or the very end of the context, and markedly worse when it was in the middle — even for models explicitly built for long context. GPT-4, the strongest model tested, achieved higher absolute scores than the rest but still showed the same U-shape.

The "lost in the middle" U-curve: relevant information is used reliably at the edges of the context and neglected in the middle.

The authors connected this to a phenomenon long known in psychology: the serial-position effect, the human tendency to best recall the first and last items in a list. Whatever its cause in transformers (one candidate explanation is attention patterns that over-weight the beginning and end of a sequence), the practical implication was uncomfortable for the "throw it all in" approach: not only does adding context eventually hurt, but a model's ability to use a given fact depends on where that fact happens to land in the pile. Position, not just presence, determines whether information gets used.
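The experimental setup described above can be sketched in a few lines. This is an illustration of varying a document's position, not the authors' code; `place_at`, `contexts_by_position`, and the placeholder documents are hypothetical.

```python
def place_at(relevant: str, distractors: list[str], position: int) -> list[str]:
    """Return a new document list with `relevant` inserted at `position`."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be between 0 and len(distractors)")
    return distractors[:position] + [relevant] + distractors[position:]

def contexts_by_position(relevant: str, distractors: list[str]) -> dict[int, str]:
    """Build one context per possible position of the relevant document."""
    return {
        position: "\n\n".join(place_at(relevant, distractors, position))
        for position in range(len(distractors) + 1)
    }

# Placeholder documents; a real run would use many long distractors.
contexts = contexts_by_position("RELEVANT", ["doc A", "doc B", "doc C"])
```

Each context would then be sent to a model with the same question, and accuracy plotted against position. A U-shaped curve means the contexts for the first and last positions score best.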

The counter-evidence, part 3: "context rot"

The clearest formalization arrived in mid-2025, after the arms race had run for a while, in a report that named the phenomenon. Chroma Research's "Context Rot: How Increasing Input Tokens Impacts LLM Performance" (Kelly Hong, Anton Troynikov, and Jeff Huber, July 2025) evaluated 18 state-of-the-art models — including GPT-4.1, Claude 4 (Opus and Sonnet), Gemini 2.5, and Qwen3.

The headline finding was blunt and universal: models "do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows." Every one of the 18 models degraded as inputs got longer — not some, not most, all of them. This held even on tasks designed to be trivially easy, such as replicating a sequence of repeated words.

The standard industry benchmark for long context, Needle in a Haystack (NIAH), measures only whether a model can find an exact piece of text — so it produces near-perfect scores and hides the problem. When Chroma tested harder things — inferring from semantically related rather than identical information, coping with distractors, varying the surrounding "haystack" — performance fell off non-uniformly and unpredictably.

The vocabulary mattered. Context rot drew a sharp line between two things that had been conflated:

Context-window overflow — exceeding the model's maximum token limit. A hard cliff.
Context rot — degradation that happens well before the limit, continuously, as the signal-to-noise ratio of the input falls. A model with a 200k window can degrade significantly at 50k.

The lesson practitioners drew was that capacity is the wrong metric. A window being "big enough" to hold your data says nothing about whether the model will use that data well. What matters is keeping the ratio of relevant to irrelevant tokens high — which is precisely the thing that "throw it all in" abandons.
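One rough way to make that ratio concrete is to measure what share of a context's tokens comes from chunks judged relevant. The sketch below assumes whitespace splitting as a stand-in for a real tokenizer and assumes a relevance flag is already available for each chunk; `Chunk`, `count_tokens`, and `relevant_token_ratio` are hypothetical names, and this is not a metric defined in the Chroma report.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    relevant: bool  # assumed to come from some external relevance judgment

def count_tokens(text: str) -> int:
    """Approximate the token count by splitting on whitespace."""
    return len(text.split())

def relevant_token_ratio(chunks: list[Chunk]) -> float:
    """Return the fraction of context tokens that sit in relevant chunks."""
    total = sum(count_tokens(chunk.text) for chunk in chunks)
    if total == 0:
        return 0.0
    relevant = sum(count_tokens(chunk.text) for chunk in chunks if chunk.relevant)
    return relevant / total

chunks = [
    Chunk("refunds are accepted within thirty days", relevant=True),
    Chunk(
        "the office is closed on public holidays and on most weekends in august "
        "while the cafeteria stays open",
        relevant=False,
    ),
]
ratio = relevant_token_ratio(chunks)
```

Here a quarter of the tokens are relevant. Dropping the second chunk raises the ratio to 1.0 without changing what the model needs to answer.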

The winning move wasn't a bigger window. It was a cleaner one.

What the arms race actually settled

By the time the dust cleared, a rough consensus had formed among builders — not that long context windows were useless, but that they were a different tool than the marketing implied:

Large windows are genuinely valuable for summarization and for retrieval of a specific fact from a large body of text — the cases where the model's job is to compress or to locate, not to reason carefully over everything at once.
For multi-step reasoning, agentic work, and high-stakes accuracy, more context is a liability past a model-specific ceiling. The window being large is not a license to fill it.
RAG did not die. The comparative research that followed (ChatQA2 and others) generally found that retrieving a well-chosen set of chunks could match or beat dumping everything in, at far lower cost — and the question shifted from "RAG or long context" to "how to combine them."

The deeper takeaway is the one that sets up everything after this point. The industry had tried to solve the information problem with capacity, and capacity alone didn't solve it. If a model degrades as you fill its context, and if it uses the middle of that context poorly, and if different models rot in different ways, then what you choose to put in the context, and what you leave out, is itself the engineering problem. You cannot opt out of it by buying a bigger window.

That realization needed a name. In mid-2025 it got one. Part 3 is the story of how a scattered set of practices — and a precise taxonomy of the ways contexts fail — crystallized into a named field: context engineering.

Key sources for Part 2

Databricks Mosaic Research, Long Context RAG Performance of LLMs (2024) and The Long Context RAG Capabilities of OpenAI o1 and Google Gemini (2024) — 2,000+ experiments, 13 models; correctness declines after ~32k (Llama-3.1-405B) / ~64k (GPT-4-0125-preview); model-specific failure modes (summarizing instead of answering; copyright-refusal and instruction-following collapse).
Nelson F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts (arXiv:2307.03172; TACL 2024) — U-shaped performance curve; serial-position effect; affects even GPT-4 and explicit long-context models.
Kelly Hong, Anton Troynikov, Jeff Huber, Context Rot: How Increasing Input Tokens Impacts LLM Performance (Chroma Research technical report, July 2025) — 18 models; all degrade with input length, even on trivial tasks; distinguishes context rot from window overflow; NIAH hides the effect.
Patrick Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (arXiv:2005.11401; NeurIPS 2020) — the origin of RAG; parametric (seq2seq) + non-parametric (dense Wikipedia index) memory; "more specific and factual" output; provenance and updatable knowledge as motivations.
Drew Breunig, How Long Contexts Fail and How to Fix Your Context (2025) — the "distraction ceiling" framing and the recurring "RAG is dead" pattern; bridges into Part 3.

Next up · Part 3 — Context Engineering Is Named: Karpathy's endorsement, Breunig's taxonomy of the four ways contexts fail, and the adversarial "cat facts" that make models stumble.

The Rise of Agentic Engineering — Part 1: The Prompt Engineering Era

Ramin Jafary — Thu, 25 Jun 2026 06:47:53 +0000

The Prompt Engineering Era

Part 1 of a chronological survey of how the craft around large language models evolved — from writing prompts, to engineering context, to building harnesses and loops. This installment covers the beginning: the era when the unit of work was the prompt, and the whole job was finding the right words.

TL;DR — When the model is something you talk to one turn at a time, the prompt is the program. Few-shot, chain-of-thought, personas — a whole folk-craft grew up around phrasing. It worked beautifully for chat, and hid four cracks (non-determinism, brittleness, no metrics, no portability) that everything after Part 1 is about. Then ReAct put a loop in the prompt — and the agent era began.

A machine that almost makes tea

In The Restaurant at the End of the Universe, Douglas Adams gives us what the writer Drew Breunig has called one of science fiction's best depictions of prompt engineering. Arthur Dent, adrift on a spaceship, wants a cup of tea. The only appliance aboard is a Nutri-Matic Drinks Synthesizer, which "claimed to produce the widest possible range of drinks personally matched to the tastes and metabolism of whoever cared to use it" — and which, when asked, reliably produces a liquid that is "almost, but not quite, entirely unlike tea."

Arthur tries the obvious prompt — "Tea" — and gets garbage. He tries again, more forcefully, and gets the same garbage. Finally he gives up on terseness and tells the machine everything: about India and China and Ceylon, about broad leaves drying in the sun, about silver teapots and summer afternoons on the lawn, about putting the milk in first so it won't scald. The machine, now taking the request seriously, commandeers the ship's entire computing core to work on the problem.

It is a near-perfect parable for what the first era of working with language models felt like. You had a powerful, general system that would technically do almost anything. Getting it to do the specific thing you wanted was a matter of supplying the right words, in the right amount, with the right framing — and the gap between a terse request and a richly specified one was the gap between failure and success. That gap was the entire discipline. We called it prompt engineering.

The model supplied the intelligence. You supplied the specification — and getting the specification right was the whole job.

This series traces what happened next: how prompt engineering strained under the weight of real applications, how it drifted into a broader practice that acquired the name context engineering, and how that in turn became harness engineering and loop engineering as the model stopped being something you talk to and became something you build a system around. But to understand why each of those shifts happened, it helps to start with what prompt engineering actually was, why it worked, and where its limits were already visible.

What prompt engineering was

A prompt is just the text you give a model. Prompt engineering was the craft of shaping that text to get reliably good output from a system whose behavior you could not otherwise change. You could not retrain the model's weights; you could not see inside it; all you had was the input. So the input became the lever.

Out of that constraint grew a recognizable toolkit. The techniques that mattered most in this era were:

Few-shot prompting. The foundational observation, established at scale in OpenAI's 2020 paper Language Models are Few-Shot Learners (the GPT-3 paper, Brown et al.), was that a sufficiently large model could learn a task from a handful of examples placed directly in the prompt — no fine-tuning required. You showed the model two or three input-output pairs, and it inferred the pattern for the fourth. This reframed the model as something you program by demonstration rather than by instruction alone.
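A minimal sketch of what such a prompt looks like when assembled in code. The `Input:`/`Output:` labels, the `few_shot_prompt` helper, and the sentiment examples are assumptions made for illustration, not a format taken from the GPT-3 paper.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format input-output pairs, then leave the final output blank for the model."""
    blocks = [f"Input: {source}\nOutput: {target}" for source, target in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Three demonstrations; the model is expected to infer the pattern for the fourth.
EXAMPLES = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
    ("Setup took two minutes and just worked.", "positive"),
]

prompt = few_shot_prompt(EXAMPLES, "The speaker crackles at low volume.")
```

The prompt ends on an open `Output:` line, so the most natural continuation is a label in the same style as the demonstrations.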

Chain-of-thought prompting. In early 2022, Jason Wei and colleagues at Google published Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (presented at NeurIPS 2022). The finding was deceptively simple: prompt a model to produce intermediate reasoning steps before its final answer, and performance on arithmetic, commonsense, and symbolic reasoning jumps.

Crucially, the authors found this reasoning ability "emerge[s] naturally in sufficiently large language models" — it was not something small models could do well. A few worked examples that showed their work were enough to unlock it. The durable principle it established: how you ask changes not just the style of the answer but the model's apparent capability.
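To make the pattern concrete, here is a small sketch of a chain-of-thought prompt: one worked example that shows its reasoning, followed by the new question. The example problem, the `cot_prompt` and `extract_answer` helpers, and the "The answer is" convention used for parsing are assumptions for illustration, not text reproduced from the paper.

```python
import re
from typing import Optional

# One demonstration that shows intermediate steps before the final answer.
COT_EXAMPLE = (
    "Q: A shelf holds 3 boxes with 4 books each. 5 books are removed. "
    "How many books remain?\n"
    "A: 3 boxes times 4 books is 12 books. 12 minus 5 is 7. The answer is 7."
)

def cot_prompt(question: str) -> str:
    """Prepend the worked example so the model imitates step-by-step reasoning."""
    return f"{COT_EXAMPLE}\n\nQ: {question}\nA:"

def extract_answer(completion: str) -> Optional[str]:
    """Pull the final number out of a completion that ends 'The answer is N.'"""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

prompt = cot_prompt("A tray has 6 rows of 5 cups. 4 cups break. How many are left?")
```

The reasoning steps are for the model's benefit; the caller usually wants only the final value, which is why the sketch pairs the prompt with a small parser.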

Role and persona prompts. "You are an expert Python developer." Assigning the model a role shifted the distribution of its responses. This was pure prompt-level steering — no new information, just framing.

Structure and formatting. Asking for output in a specific shape — JSON, XML tags, numbered lists, a fixed template — made model output more reliable to parse and, often, more reliable in content, because the structure constrained the model's choices.
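A short sketch of why structured output is easier to rely on: the caller can state the shape it wants and reject any reply that does not fit. The key names, the `FORMAT_INSTRUCTION` wording, and the `parse_reply` helper are hypothetical.

```python
import json

REQUIRED_KEYS = {"sentiment", "confidence"}

# Instruction appended to a prompt to constrain the shape of the reply.
FORMAT_INSTRUCTION = (
    "Reply with a single JSON object containing exactly these keys: "
    '"sentiment" (a string) and "confidence" (a number between 0 and 1).'
)

def parse_reply(reply: str) -> dict:
    """Parse a model reply and fail loudly if it is not the requested shape."""
    data = json.loads(reply)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data

parsed = parse_reply('{"sentiment": "positive", "confidence": 0.9}')
```

A reply that is free prose, or that omits a key, raises `ValueError` instead of flowing silently into downstream code.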

Positive and negative examples. Showing the model both what to do and what to avoid sharpened its behavior, the same way examples sharpen instructions for a human.

None of this required understanding the model's internals. It was an empirical, almost folk-craft discipline: try a phrasing, see what comes back, adjust, repeat. And for the dominant use case of the time — a person sitting at a chat interface, having a one- or two-turn exchange with the model — it worked remarkably well.

Why it worked

Prompt engineering worked because, in a single-turn or short conversational exchange, the prompt is the entire program. There is nothing else in play: no accumulated history, no tool outputs, no documents retrieved from elsewhere, no other agents contributing context. The person holds the tool directly and steers it turn by turn.

Human → prompt → Model → output → Human reads, adjusts, repeats.
One turn at a time, with the human as orchestrator, memory, and quality check — all at once.

In this regime the human is doing an enormous amount of invisible work. You remember what was said three turns ago. You notice when the answer drifts and correct it. You decide what information the model needs and paste it in. You judge whether the output is good. The model supplies fluency and knowledge; you supply everything else that makes the interaction coherent. Because that surrounding work lived in your head rather than in a system, it was easy to mistake the prompt for the whole story.

The "tap the sign" idea: language defines what you can build

There is a thread running through this whole series that is worth introducing now, because it explains why the vocabulary keeps changing. Breunig repeatedly returns to a Stewart Brand line — "If you want to know where the future is being made, look for where language is being invented and lawyers are congregating" — and to the linguistic-relativity idea that the words we have set the boundaries of the conversations we can have.

In the prompt-engineering era, the available word was "prompt," and it framed the work as writing a good request. That framing was adequate while the work really was writing requests. As we'll see in later parts, the framing started to mislead once the work became something else — assembling and maintaining the entire informational environment a model operates in. The new work needed a new word. But in 2023 and early 2024, "prompt" still fit, and the craft of prompting was where nearly all the energy went.

The cracks, already visible

Even at its height, prompt engineering had four weaknesses that would later drive the field past it. None of them mattered much for a casual chat. All of them became critical the moment people tried to build dependable software on top of models.

It was non-deterministic. The same prompt could yield different outputs on different runs. For a conversation, fine. For a system that needs to behave predictably, a problem.

It was brittle. Small, seemingly irrelevant changes in wording could change results. A discipline built on "find the magic phrasing" is, by construction, sensitive to phrasing — which means it is fragile. (Later parts will show this quantified: spurious additions to a prompt measurably degrading accuracy, and the same question asked in two voices producing opposite behavior. The brittleness was real, not anecdotal.)

It was unmeasured. Prompt quality was judged by eye. "That looks better" was the standard. There were no evals, no metrics, no regression tests — because for a human tweaking a chat prompt, there didn't need to be.

It was non-portable. A prompt tuned to one model's quirks often failed on another. Since the craft was precisely about exploiting a particular model's response to particular wording, the resulting prompts were welded to that model. This would later acquire a name — prompt debt — and become one of the central problems of the field, but its roots are right here in the era's core method.

These weren't seen as flaws at the time so much as the natural texture of a new medium. They became flaws retroactively, once the ambitions grew.

The hinge: from reasoning to acting

The single most important development at the tail end of this era was the realization that a model's prompt could include not just a question, but a loop — and that the model could use that loop to act on the world, not merely answer.

In October 2022, Shunyu Yao and colleagues published ReAct: Synergizing Reasoning and Acting in Language Models (later presented at ICLR 2023). The paper observed that reasoning (à la chain-of-thought) and acting (generating action plans) had been studied separately, and asked what happens if you interleave them. In ReAct, the model alternates between producing a reasoning trace and taking a task-specific action — and the actions let it "interface with external sources, such as knowledge bases or environments, to gather additional information," while the reasoning helps it "induce, track, and update action plans as well as handle exceptions."

The results were striking for how little they required. On the HotpotQA question-answering and Fever fact-verification benchmarks, letting the model consult a simple Wikipedia API as it reasoned reduced the hallucination and error propagation that pure chain-of-thought suffered from. On two interactive decision-making benchmarks, ALFWorld and WebShop, ReAct beat imitation-learning and reinforcement-learning methods by 34 and 10 percentage points of absolute success rate, respectively, "while being prompted with only one or two in-context examples."

ReAct interleaves reasoning and acting in a loop, with the model reading observations back from an external environment. This loop — "run tools in a loop to achieve a goal" — is the seed of everything later parts call an agent.

It is hard to overstate how consequential this shape turned out to be. A prompt that contains a loop, where the model thinks, acts, observes the result, and thinks again, is no longer just a request. It is the beginning of an agent.
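The think, act, observe cycle can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the paper's implementation: the `Action: name[argument]` and `Finish[answer]` text conventions, the `react_loop` function, the scripted stand-in for a model, and the dictionary-backed `lookup` tool are all hypothetical.

```python
import re
from typing import Callable

def react_loop(
    model: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Alternate model steps and tool calls until the model emits Finish[...]."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)          # model reasons and proposes an action
        transcript += step + "\n"
        finish = re.search(r"Finish\[(.*?)\]", step)
        if finish:
            return finish.group(1)
        action = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if action:
            name, argument = action.groups()
            tool = tools.get(name)
            observation = tool(argument) if tool else f"unknown tool: {name}"
            # The observation is fed back so the next step can use it.
            transcript += f"Observation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")

def scripted_model(transcript: str) -> str:
    """Stand-in for a real model: look something up once, then answer."""
    if "Observation:" not in transcript:
        return "Thought: I should look up the country.\nAction: lookup[France]"
    return "Thought: The observation contains the answer.\nAction: Finish[Paris]"

FACTS = {"France": "The capital of France is Paris."}
TOOLS = {"lookup": lambda key: FACTS.get(key, "no entry found")}

answer = react_loop(scripted_model, TOOLS, "What is the capital of France?")
```

Note that the transcript grows on every pass through the loop. Nobody curates it turn by turn, which is exactly the property that later parts of the series treat as the core problem.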

A prompt with a loop in it stops being a request and starts being an agent.

Much later in this series, when practitioners define an agent as "a system that runs tools in a loop to achieve a goal," they are describing a direct descendant of ReAct. And the moment a model runs in a loop, accumulating observations and tool outputs as it goes, the neat picture from earlier — the prompt is the whole program — breaks down. The model is now operating inside a growing, evolving body of information that no human is curating turn by turn.

That growing body of information is the subject of everything that follows. The industry spent 2024 betting it could be solved by brute force — by making the context window big enough to hold anything. Part 2 is the story of that bet, and why it didn't pay off the way everyone expected.

Key sources for Part 1

Drew Breunig, Prompt Engineering at the End of the Universe (2024) — the Nutri-Matic parable and the framing of prompting as coaxing an underspecified system.
Tom Brown et al., Language Models are Few-Shot Learners (arXiv:2005.14165, 2020) — the GPT-3 paper; in-context learning from examples.
Jason Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv:2201.11903; NeurIPS 2022) — reasoning steps in the prompt improve capability; emerges with scale.
Shunyu Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629; ICLR 2023) — interleaving reasoning and acting in a loop; the conceptual bridge from prompting to agents.

Next up · Part 2 — The Context-Window Arms Race: how 2024's race to million-token windows promised to make context management obsolete, and what builders discovered instead.