Ramin Jafary

Posted on Jun 25

The Rise of Agentic Engineering — Part 7: Loop Engineering, the Factory & the Human

#ai #agents #softwareengineering #futureofwork


The final part of a chronological survey of the craft around large language models. Part 5 built the harness — the static structure around a model. Part 6 questioned the hand-written prompt. This installment is about the object that runs inside the harness and replaces the prompt as the unit of work: the loop.

TL;DR — The newest move isn't writing better prompts. It's building loops that prompt the agents for you — then running many in parallel (the "factory," the "orchestra"). The catch: loops automate the typing, not the thinking. Three debts — comprehension, intent, and cognitive surrender — are yours to pay.

"I don't prompt anymore. I write loops."

The newest shift in the field was announced, fittingly for this story, in a single well-placed sentence. In June 2026 Peter Steinberger posted: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents." The same week, independently, Boris Cherny — who built Claude Code at Anthropic — said on stage: "I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops." Two practitioners, two companies, the same conclusion.

"I don't prompt Claude anymore. I have loops running that prompt Claude. My job is to write loops."
— Boris Cherny, creator of Claude Code

Addy Osmani's "Loop Engineering" (June 2026) unpacks what they meant. For two years, the way to get something from a coding agent was to write a good prompt, read the reply, write the next prompt — you held the tool the entire time, one turn after another. That, the loop-engineers argue, is ending. Instead you "build a small system that finds the work, hands it out, checks it, writes down what is done and then decides the next thing, and you let that system poke the agents instead of you."

This is not a brand-new idea so much as a culmination. Simon Willison had named "designing agentic loops" a critical skill back in September 2025 (Part 5). What changed by mid-2026, Osmani notes, is that the pieces stopped being a pile of personal bash scripts and started shipping inside the products — which is the moment a hack becomes infrastructure. Loop engineering, in Osmani's framing, "sits one floor above the harness": the harness is the environment a single agent runs inside; the loop is the harness "on a timer," spawning helpers and feeding itself.

Loop (runs on a timer, prompts the agents), which wraps:
- Harness (the environment for one agent), which wraps:
  - Model
- …plus automations, worktrees, sub-agents, and shared state.

The five primitives, plus memory

Osmani's central observation is one of convergence: the building blocks of a loop now map near-identically onto both OpenAI's Codex app and Anthropic's Claude Code. Once you notice the shape is the same across tools, "you stop arguing about which tool, you just design a loop." A loop needs five things, plus a place to remember:

Automations — the heartbeat. Prompts that fire on a schedule, do their own discovery and triage, and surface findings to an inbox. (Codex: an Automations tab with project, prompt, cadence, environment, and a Triage inbox. Claude Code: scheduled tasks, cron, hooks, GitHub Actions.) The in-session companion is a run-until-done command — /goal in both tools — that keeps working across turns until a verifiable stopping condition holds. After each turn, a separate small model checks the condition, so the agent that wrote the code isn't the one grading it.
Worktrees — isolation for parallelism. The moment more than one agent runs, files collide. A git worktree gives each agent its own working directory on its own branch, so one agent's edits can't clobber another's checkout.
Skills — codified project knowledge. A folder with a SKILL.md of instructions and metadata, invoked by name or matched to a task. Skills are where intent gets written down once, on the outside — instead of being re-derived from scratch every session.
Plugins / connectors — the loop touches real tools. Built on MCP, connectors let the loop read the issue tracker, query a database, hit a staging API, post to Slack. This is the line between an agent that says "here is the fix" and a loop that opens the PR, links the ticket, and pings the channel once CI is green.
Sub-agents — separate the maker from the checker. The model that wrote the code grades its own work too generously. A second agent — different instructions, sometimes a different model — catches what the first talked itself into.
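The worktree primitive can be illustrated with plain git. What follows is a minimal sketch, not how either product does it: it only builds the `git worktree add` command for each agent and leaves running it to the caller. The agent names, the `agent/` branch prefix, and the sibling `worktrees` directory are all assumptions made for illustration.

```python
from pathlib import Path


def worktree_command(repo: Path, agent: str, base: str = "main") -> list[str]:
    """Build the git command that gives one agent its own checkout and branch.

    The `agent/` branch prefix and the sibling `worktrees` directory are
    illustrative choices, not a convention taken from any tool.
    """
    path = repo.parent / "worktrees" / agent
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{agent}",  # new branch used by this agent only
        str(path),               # separate working directory
        base,                    # commit-ish the branch starts from
    ]


if __name__ == "__main__":
    for name in ("maker-1", "maker-2"):
        print(" ".join(worktree_command(Path("/srv/app"), name)))
```

Each agent then edits files only under its own directory, so parallel runs share history but never a checkout.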

And the sixth thing, the state/memory: a markdown file or a Linear board — anything outside the single conversation that records what's done and what's next. "The agent forgets, the repo doesn't." This is the same externalized-memory discipline as Anthropic's progress files (Part 5) and the context-offloading tactic (Part 4), now serving as the spine of the whole loop.
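A state file can be as small as a markdown checklist. The sketch below assumes a hypothetical format (`- [x]` for done, `- [ ]` for next) and hypothetical helper names; the source only says that the state lives outside the conversation, not how it is laid out.

```python
from pathlib import Path


def load_state(path: Path) -> tuple[list[str], list[str]]:
    """Return (done, todo) items parsed from a markdown checklist."""
    done: list[str] = []
    todo: list[str] = []
    if not path.exists():
        return done, todo
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("- [x] "):
            done.append(line[6:])
        elif line.startswith("- [ ] "):
            todo.append(line[6:])
    return done, todo


def save_state(path: Path, done: list[str], todo: list[str]) -> None:
    """Rewrite the checklist so the next run starts from what is on disk."""
    lines = [f"- [x] {item}" for item in done]
    lines += [f"- [ ] {item}" for item in todo]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def mark_done(path: Path, item: str) -> None:
    """Move one item from todo to done and persist the change."""
    done, todo = load_state(path)
    if item in todo:
        todo.remove(item)
        done.append(item)
    save_state(path, done, todo)
```

Because every run reads the file first and writes it last, a fresh agent session picks up where the previous one stopped without needing any conversation history.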

Here's a single loop, assembled:

An automation runs each morning → a triage skill reads yesterday's CI failures and open issues, writing findings to a state file → for each worthwhile finding, the loop opens an isolated worktree → a maker sub-agent drafts the fix, a checker sub-agent reviews it against the skills and tests → connectors open the PR and update the ticket → anything the loop can't handle lands in the triage inbox for the human.
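The assembled loop above can be sketched as one function that wires the stages together. This is an illustration under stated assumptions, not Osmani's implementation: `triage`, `maker`, `checker`, and `open_pr` are hypothetical callables standing in for the skill, the two sub-agents, and the connector, and the scheduler and worktree setup are left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    title: str
    worthwhile: bool


@dataclass
class RunReport:
    opened_prs: list[str] = field(default_factory=list)
    inbox: list[str] = field(default_factory=list)  # left for the human


def run_once(
    triage: Callable[[], list[Finding]],
    maker: Callable[[Finding], str],
    checker: Callable[[Finding, str], bool],
    open_pr: Callable[[Finding, str], str],
) -> RunReport:
    """One scheduled pass: triage, make, check, then open a PR or escalate."""
    report = RunReport()
    for finding in triage():
        if not finding.worthwhile:
            continue
        patch = maker(finding)        # in practice, inside its own worktree
        if checker(finding, patch):   # a different agent grades the work
            report.opened_prs.append(open_pr(finding, patch))
        else:
            report.inbox.append(finding.title)
    return report
```

Passing the stages in as arguments keeps the maker and the checker separate by construction, and makes the loop testable with stubs before any real agent is attached.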

"You designed it one time. You did not prompt any of those steps."

Ralph: the loop in its purest form

If loop engineering has a folk hero, it is the Ralph Wiggum loop, created by Geoffrey Huntley in mid-2025, which had gone viral by the end of that year. In its purest form, Ralph is a bash loop: feed the agent a goal, let it work, feed its output (errors and all) back into a fresh context, and repeat until done. The technique is named after the lovably forgetful Simpsons character because, like Ralph, AI agents don't remember previous attempts and will cheerfully repeat mistakes; it "embraces limitations rather than fighting them."

Huntley's signature line captures the philosophy:

Ralph is "deterministically bad in a non-deterministic world."

Its failures are predictable, which makes them debuggable and tunable — unlike the chaotic unpredictability of elaborate multi-agent systems.

The tuning mechanism is a playground metaphor that should sound familiar. Ralph is told to build a playground and comes home bruised from falling off the slide, so you add a sign — "SLIDE DOWN, DON'T JUMP, LOOK AROUND" — to a spec or AGENTS.md file. Add too many signs and it becomes overwhelming, so you prune and re-evaluate. This is exactly the ratchet from Part 5 — every mistake becomes a rule, keep the rule-list short — arrived at independently, from the brute-force-loop direction.

Two things about Ralph tie the whole series together.

First, the proof points were startling. Huntley used it to build CURSED, a complete programming language, over months of largely autonomous operation. A Y Combinator hackathon team "put a coding agent in a while loop and it shipped 6 repos overnight."

Second — the subtle point HumanLayer stresses in its "brief history of Ralph" — the real value isn't "run forever." It's that each iteration "carve[s] off small bits of work into independent context windows."

In other words, Ralph is the whole field compressed into a five-line bash loop: context quarantine (Part 4), plus the fresh-context discipline (Part 2's context rot), plus external memory. The fancy version — Anthropic's planner/generator/evaluator splits — simply formalizes the maker/checker separation that Ralph's /goal-style stop-check already performs.
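The shape of that loop is small enough to write out. The sketch below is in Python rather than bash and is not Huntley's script: `run_agent` and `is_done` are hypothetical callables for the agent invocation and the stop-check, and the `max_iterations` cap is an added assumption so that the example always terminates.

```python
from typing import Callable


def ralph(
    goal: str,
    run_agent: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_iterations: int = 10,
) -> tuple[bool, int]:
    """Re-run an agent with a fresh prompt each pass until a check passes.

    Nothing is carried between passes except `notes`, the external memory.
    Returns (finished, iterations_used).
    """
    notes = ""
    for i in range(1, max_iterations + 1):
        # Fresh context: the prompt is rebuilt from the goal and the notes,
        # never from the previous conversation.
        prompt = f"GOAL:\n{goal}\n\nNOTES FROM LAST RUN:\n{notes}"
        output = run_agent(prompt)
        if is_done(output):
            return True, i
        notes = output  # errors and all go back in
    return False, max_iterations
```

Keeping `is_done` separate from `run_agent` is the same maker/checker split in miniature: the code that produces the output is not the code that decides whether to stop.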

The factory and the orchestra

Once a single loop works, the obvious next move is to run many. Osmani has described this under several names — the factory model (the system that builds the software), the code agent orchestra, and swarms of agents working in parallel. The technical enabler is the worktree (isolation) plus the connector (shared tools) plus the sub-agent (division of labor). The conceptual enabler is everything from Part 4: this is Anthropic's orchestrator-worker pattern, generalized from research to software production.

But the economics from Part 4 return with force. Anthropic had found multi-agent systems use ~15× the tokens of a chat, and had specifically flagged coding as a worse fit for parallelism than research — fewer truly independent subtasks, more shared context.

The loop-engineers run many agents on code anyway. That's why every serious treatment of loop engineering leads with a cost warning; Osmani opens his own essay urging caution about token spend ("usage patterns can vary wildly if you are token rich or poor").

And the reliability math compounds. A loop chaining five steps, each 95% reliable, completes cleanly only about 77% of the time if the steps fail independently (0.95^5 ≈ 0.774). Worse, an unattended loop's mistakes compound into the state file, so tomorrow's run builds on today's errors.
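The arithmetic is easy to check. The sketch below assumes the steps fail independently, which is a simplification; the function name is hypothetical.

```python
def clean_run_probability(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(clean_run_probability(0.95, n), 3))
```

For five steps at 95% this gives roughly 0.774, and the figure keeps falling as steps are added, which is why long unattended chains need a checker between stages.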

This is why the maker/checker split is load-bearing, not decorative — and why Willison's wry definition from Part 5 stays pinned to the wall: an agent is "an LLM wrecking its environment in a loop."

The orchestration tax and the debts that don't automate away

The loop changes the work; it does not delete the human from it. Osmani is emphatic on this, and four problems (one tax and three debts) get sharper, not easier, as loops get better.

The orchestration tax. Worktrees remove the mechanical collision of parallel agents, but the human's review bandwidth becomes the ceiling on how many agents you can actually run. The tool isn't the limit; you are. Running more agents doesn't help if you can't review their output.

Spin up ten agents and your bottleneck isn't compute. It's how fast you can review ten agents.

Comprehension debt. The faster a loop ships code you didn't write, the wider the gap between what exists and what you understand. A smooth loop makes that gap grow faster unless you read what it produced. This is the direct descendant of prompt debt (Part 6): both are costs that accrue invisibly while everything looks like progress.

Intent debt. An agent starts every session cold and fills any gap in your intent with a confident guess. Skills (writing intent down on the outside) pay this down; skipping them means the loop re-derives your whole project from scratch every cycle and can guess wrong anew each time.

Cognitive surrender. The most human risk: when the loop runs itself, it is tempting to stop having an opinion and accept whatever it returns. Osmani's sharpest formulation is that designing the loop "is the cure when you do it with judgement and the accelerant when you do it to avoid thinking — same action, opposite result." Two people can build the identical loop and get opposite outcomes: one uses it to move faster on work they understand deeply; the other uses it to avoid understanding the work at all. "The loop doesn't know the difference. You do."

His closing line is the thesis of loop engineering and a fair motto for the entire series: "Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go."

Open problems and where it's going

The field is, by its own account, early and unsettled. The honest open questions, gathered across the sources:

The behaviour harness is still unsolved (Part 5). We have good guides and sensors for maintainability and decent ones for architecture fitness, but reliably guiding and verifying whether an application functionally does what's needed — beyond trusting an AI-generated test suite — remains open.
Harness coherence at scale. As guides and sensors multiply, how do you keep them in sync and non-contradictory? Böckeler flags emerging conflicts between sensors; there's no "coverage metric" for harness quality yet.
Just-in-time harness assembly. Trivedy and Osmani both point to the same frontier: harnesses that dynamically assemble the right tools and context for a task, rather than being pre-configured. That is the moment a harness "stops being static config and starts becoming something closer to a compiler."
Agents that fix their own harness. Agents analyzing their own traces to identify and repair harness-level failure modes — self-improvement (Part 4) applied to the scaffolding itself.
Orchestrating many agents on a shared codebase without the coordination overhead defeating the parallelism gains.
The economics. Whether the 15× token cost of heavy multi-agent loops is justified outside high-value tasks, and how cost-governance (budgets, iteration caps, grader agents that kill a failing loop) matures.
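The cost-governance item in the list above (budgets, iteration caps, a grader that kills a failing loop) can be sketched as a small guard object. Every name and threshold here is an assumption for illustration; none of the sources prescribes this design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    max_tokens: int
    max_iterations: int
    max_consecutive_failures: int = 3
    tokens_used: int = 0
    iterations: int = 0
    consecutive_failures: int = 0

    def record(self, tokens: int, passed: bool) -> None:
        """Account for one finished iteration and its grader verdict."""
        self.tokens_used += tokens
        self.iterations += 1
        self.consecutive_failures = 0 if passed else self.consecutive_failures + 1

    def stop_reason(self) -> Optional[str]:
        """Return why the loop must stop, or None if it may continue."""
        if self.tokens_used >= self.max_tokens:
            return "token budget exhausted"
        if self.iterations >= self.max_iterations:
            return "iteration cap reached"
        if self.consecutive_failures >= self.max_consecutive_failures:
            return "grader failed the loop repeatedly"
        return None
```

A loop would call `record` after each iteration and exit as soon as `stop_reason` returns a string, handing the reason to the triage inbox instead of spending more tokens.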

What the research frontier is already attacking

The academic literature of late 2025 and 2026 is converging on exactly these problems, and a scan of the current arXiv frontier (via discovery layers like Emergent Mind and alphaXiv) shows the context-management thread from Parts 2–4 evolving into autonomous context management — the agent curating its own context rather than relying on a passive external summarizer:

Agentic Context Engineering (ACE) (Stanford et al., arXiv:2510.04618) treats a context as an evolving "playbook" that accumulates and refines strategies through generation, reflection, and curation. Notably, it names two new failure modes that extend Part 3's taxonomy: brevity bias (summarization that drops hard-won domain detail) and context collapse (iterative rewriting that erodes information over time). It reports roughly +10.6% on agent benchmarks by treating context as structured, incrementally-updated bullets rather than a monolithic prompt.
Self-Compacting Language Model Agents (arXiv:2606.23525) gives the agent a rubric to decide when and what to summarize itself, reporting double-digit task gains alongside 33–67% token reductions.
Active Context Compression / "Focus" (arXiv:2601.07190) lets the agent autonomously consolidate learnings into a persistent knowledge block and prune raw history — reporting ~22.7% token savings at equal accuracy on context-intensive SWE-bench instances, and the pointed finding that aggressive prompting to compress is what works (passive prompting yielded only ~6%). Current models, in other words, still need scaffolding to manage their own context well — a tidy restatement of this entire series' thesis.
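The contrast between "incrementally-updated bullets" and a monolithic prompt can be made concrete with a toy data structure. This sketch is only loosely inspired by the description above and is not the ACE paper's implementation: the class names, the helpful/harmful vote counters, and the pruning threshold are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    text: str
    helpful: int = 0
    harmful: int = 0


@dataclass
class Playbook:
    bullets: dict[str, Bullet] = field(default_factory=dict)

    def apply_delta(self, delta: dict[str, str]) -> None:
        """Add or update only the listed bullets; the rest stay verbatim."""
        for bullet_id, text in delta.items():
            if bullet_id in self.bullets:
                self.bullets[bullet_id].text = text
            else:
                self.bullets[bullet_id] = Bullet(text)

    def vote(self, bullet_id: str, helped: bool) -> None:
        """Record whether a bullet helped or hurt on the latest task."""
        bullet = self.bullets[bullet_id]
        if helped:
            bullet.helpful += 1
        else:
            bullet.harmful += 1

    def prune(self, min_score: int = -2) -> None:
        """Drop bullets whose helpful-minus-harmful score is below min_score."""
        self.bullets = {
            k: b for k, b in self.bullets.items()
            if b.helpful - b.harmful >= min_score
        }

    def render(self) -> str:
        """Serialize the playbook as the bullet list placed in the context."""
        return "\n".join(f"- {b.text}" for b in self.bullets.values())
```

Because updates touch individual entries, a detail written once is never paraphrased away by a later rewrite, which is the failure the paper calls context collapse.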

These are early results on small samples, and none is a settled answer. But the direction is clear: the context curation that Part 3 framed as the human's job, and Part 5 encoded into a harness, is now itself becoming something the loop does for itself — the same upward migration of responsibility that defines every rung of the ladder.

And the largest open question is the one this whole series circles: as models improve, how much of this scaffolding persists? The harness-doesn't-shrink-it-moves argument (Part 5) suggests the work relocates rather than vanishing. But reasonable people in the field disagree about how far autonomy ultimately goes — whether we are heading toward ever-thinner harnesses around ever-more-capable models, or toward ever-richer control systems as we ask agents to do ever-harder things. The sources surveyed here lean toward the latter, but they are written by people building the control systems, and the question is genuinely unresolved.

The arc, in one picture

Step back, and the two-year story has a single shape: an escalating ladder of abstraction, where each rung exists because the rung below it stopped scaling — with a matching "debt" accruing at each step, and the human's role shifting upward but never disappearing.

Each rung exists because the one below it stopped scaling. The human keeps climbing — never stepping off.

Rung	Era	What the human does	The debt
Prompt	2023–24	writes the request	brittleness
Context	mid-2025	curates the information	context rot
Harness	early 2026	builds guides & sensors	—
Loop	mid-2026	designs the loop	comprehension / intent debt
Factory / Orchestra	mid-2026 →	orchestrates many agents — and stays the engineer	orchestration tax / cognitive surrender

Prompt engineering taught us that how you ask changes what you get. Context engineering taught us that what surrounds the ask matters more than its wording. Harness engineering taught us that the durable work is the system around the model — guides, sensors, feedback loops. Loop engineering taught us that even the act of prompting can be designed and automated.

And at every rung, the same lesson keeps returning — the one Arthur Dent learned from the Nutri-Matic in Part 1. A powerful, general system will hand you something almost-but-not-quite right, unless you do the work of specifying, constraining, measuring, and steering it.

The form of that work keeps changing. The need for it has not gone away.

Key sources for Part 7

Addy Osmani, Loop Engineering (June 2026) — design loops that prompt your agents; the five primitives + memory; Codex/Claude Code convergence; Steinberger and Cherny; comprehension/intent debt; cognitive surrender; "build the loop, stay the engineer."
Peter Steinberger and Boris Cherny — public remarks (June 2026) declaring the shift from prompting to writing loops. [Via Osmani and contemporaneous coverage.]
Geoffrey Huntley, Ralph Wiggum as a "software engineer" (ghuntley.com, 2025) — the Ralph loop; "deterministically bad in a non-deterministic world"; fresh context per iteration; the playground/signs metaphor; CURSED; the YC "6 repos overnight" field report.
HumanLayer, A Brief History of Ralph (Jan 2026) — the key nuance that Ralph's point is carving work into independent context windows, not "run forever."
Simon Willison, Designing agentic loops (Sept 2025) — the earlier naming of loop design; "an LLM wrecking its environment in a loop"; YOLO mode.
Anthropic, How we built our multi-agent research system (2025) — orchestrator-worker; ~15× token economics; coding as a harder fit for parallelism.
Addy Osmani, related posts — The Factory Model, The Code Agent Orchestra, Claude Code Swarms, The Orchestration Tax, Comprehension Debt, The Intent Debt, Cognitive Surrender, Self-Improving Coding Agents — the factory/orchestra framing and the human-cost threads.
Birgitta Böckeler (martinfowler.com) — the unsolved behaviour harness and harness-coherence open questions.
