We Built the Loops Both Anthropic and OpenAI Are Now Telling Engineers to Write. Here's the Architecture.

#ai #agents #loopengineering #llm

Previously: Kaizen Harness: patterns for making AI agents reliable (June 9) and We Cut Our AI Agent Costs by 60%. Here's What Worked. (June 10)

Two things happened last week.

First, we published our cost reduction piece and Alex Shev left a comment that summed up what we'd been building toward: "cache what is stable, shrink context, prove progress before expensive turns." He was right. That one sentence described a system we had in pieces but hadn't yet wired together.

Second, the loop engineering convergence hit mainstream. Boris Cherny, head of Claude Code at Anthropic, said on the Acquired podcast: "I don't prompt Claude anymore. I have loops that are running. They're the ones that are prompting Claude and figuring out what to do. My job is to write loops." Peter Steinberger at OpenAI posted the same idea to 6.5 million views: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

This is the signal. Prompt engineering was about getting one good response. Loop engineering is about designing systems where the model iterates until the work is verifiably correct. Both Anthropic and OpenAI are shipping primitives for this. The loop shape is converging across tools.

We built ours. Here's the architecture.

The Three Loops

We have three loops running inside Kaizen Harness. Each has a different job, but they share the same skeleton: define the goal, run the agent, evaluate the output, decide whether to iterate or stop.
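The shared skeleton can be sketched in a few lines. This is a minimal illustration, not the Kaizen Harness code: `run_loop`, `Verdict`, and the `max_rounds` budget are hypothetical names, and the agent and evaluator are plain callables you would back with model calls.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Evaluator result: did the output pass, and if not, why."""
    passed: bool
    feedback: str = ""


def run_loop(goal, run_agent, evaluate, max_rounds=5):
    """Run agent -> evaluate -> iterate until the evaluator passes or the budget runs out.

    run_agent(goal, feedback) and evaluate(goal, output) are caller-supplied
    callables (hypothetical signatures). max_rounds is an assumed safety budget.
    Returns (output or None, history of every round).
    """
    feedback = None
    history = []
    for round_no in range(1, max_rounds + 1):
        output = run_agent(goal, feedback)
        verdict = evaluate(goal, output)
        history.append((round_no, output, verdict))
        if verdict.passed:
            return output, history
        # Feed the evaluator's critique into the next attempt.
        feedback = verdict.feedback
    return None, history
```

Returning the full history alongside the result keeps every round available for logging or for an outside diagnoser to read.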

Council debate iteration loop. When we need an architectural decision or a design tradeoff evaluated, three council seats fire simultaneously, each using a different model. A fourth seat, the critic, reads all three outputs and identifies flaws, contradictions, or blind spots. If the critic finds anything, the seats re-debate with the criticism appended as context. The loop stops when the critic returns no further objections.

The key design choice: the critic is a different model than the generators. A model grading its own output is too generous. Separation of evaluator from generator is what makes the loop converge instead of self-congratulating.
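A rough sketch of the council loop, under stated assumptions: each seat and the critic are callables wrapping different models, the critic returns a list of objections (empty means it is satisfied), and `max_rounds` is a hypothetical guard so a stubborn critic cannot run forever. None of these names come from the repo.

```python
from concurrent.futures import ThreadPoolExecutor


def council_debate(question, seats, critic, max_rounds=4):
    """Fire all seats in parallel, let a separate critic review, repeat on objections.

    seats:  callables seat(question, objections) -> answer, one per generator model.
    critic: callable critic(question, answers) -> list of objection strings.
    Returns (answers, rounds_used), or (None, max_rounds) if the critic never
    runs out of objections within the assumed budget.
    """
    objections = []
    for round_no in range(1, max_rounds + 1):
        current = list(objections)  # criticism appended as context for this round
        with ThreadPoolExecutor(max_workers=max(1, len(seats))) as pool:
            answers = list(pool.map(lambda seat: seat(question, current), seats))
        objections = critic(question, answers)
        if not objections:
            return answers, round_no
    return None, max_rounds
```

Keeping the critic as its own callable is what lets you point it at a different model from the seats.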

PRD review loop. Before we build, a reviewer model scores the PRD against three criteria: clarity, completeness, testability. If any score falls below its threshold, the loop feeds the scored critique back to a writer model, which revises. The loop iterates until all three scores pass. On average, PRDs take 2.3 revision rounds. What used to be a 45-minute human review cycle is now a 90-second automated one.
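As an illustration of the gating logic, here is a small sketch. The 0-10 score scale, the threshold of 8, the revision budget, and the function names are all assumptions made for the example; the post does not state the real values.

```python
CRITERIA = ("clarity", "completeness", "testability")


def failing_dimensions(scores, threshold):
    """Return the criteria whose score is below the threshold."""
    return [name for name in CRITERIA if scores[name] < threshold]


def review_prd(draft, reviewer, writer, threshold=8, max_rounds=5):
    """Score the PRD, revise on any failing dimension, stop when all three pass.

    reviewer(draft) -> (scores dict, critique text)
    writer(draft, critique, failing) -> revised draft
    Returns (final draft, number of revisions). Raises if the assumed
    revision budget is exhausted without a passing draft.
    """
    for revision in range(max_rounds + 1):
        scores, critique = reviewer(draft)
        failing = failing_dimensions(scores, threshold)
        if not failing:
            return draft, revision
        if revision == max_rounds:
            break  # the last draft was reviewed and still fails
        draft = writer(draft, critique, failing)
    raise RuntimeError(f"PRD still failing on {failing} after {max_rounds} revisions")
```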

Code verify-and-heal loop. When a self-healing patch fails verification, the system escalates through three tiers. Tier 1: an AI fixer re-attempts the patch with the failure context appended. Tier 2: if Tier 1 exhausts its retry budget, the system does a clean git revert and attempts a different approach. Tier 3: if both tiers fail, the loop escalates to a human with a structured failure report, including the patch diff, test output, and a diagnosis of what went wrong.

The tiers are not arbitrary. Tier 1 catches 78% of failures. Tier 2 catches another 14%. Only 8% reach a human, and those arrive with enough context that the human doesn't have to reconstruct the problem.
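The three tiers map onto a short control flow. The sketch below is a simplified stand-in: every callable (`fixer`, `alternative`, `verify`, `revert`, `diagnose`) and the `retry_budget` of 3 are hypothetical, and a real `revert` would shell out to git rather than being passed in.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """What a human receives at Tier 3."""
    patch_diff: str
    test_output: str
    diagnosis: str


def verify_and_heal(failure_context, fixer, alternative, verify, revert,
                    diagnose, retry_budget=3):
    """Escalate Tier 1 (retry with context) -> Tier 2 (revert, new approach) -> human.

    verify(patch) -> (passed, test_output). Returns (tier, patch) on success
    or ("human", FailureReport) when both automated tiers fail.
    """
    # Tier 1: re-attempt, appending each failure's test output to the context.
    context = failure_context
    for _ in range(retry_budget):
        patch = fixer(context)
        passed, test_output = verify(patch)
        if passed:
            return "tier1", patch
        context = f"{context}\n{test_output}"

    # Tier 2: return to a clean tree, then try a different approach once.
    revert()
    patch = alternative(failure_context)
    passed, test_output = verify(patch)
    if passed:
        return "tier2", patch

    # Tier 3: hand over a structured report instead of a bare failure.
    return "human", FailureReport(patch, test_output, diagnose(patch, test_output))
```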

Swarms for Speed

Serial iteration wastes wall-clock time. Inside each loop round, where the work is embarrassingly parallel, we swarm.

Council seats fire simultaneously, not sequentially. The code heal loop launches three parallel fixers at Tier 1, each attempting a different strategy, and the verification step picks the first one that passes. The PRD reviewer and writer run on separate threads, so revision starts as soon as the score is produced.
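One way to express "launch parallel fixers, take the first that verifies" with the standard library is shown below. It is a sketch under assumptions: strategies are plain functions taking a context string, `verify` returns a boolean, and a strategy that raises is simply treated as a failed attempt.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_passing_patch(strategies, verify, context):
    """Run every fixer strategy in parallel and return the first verified patch.

    Returns (strategy_name, patch), or None if no strategy produces a patch
    that passes verify(patch).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(strategies))) as pool:
        futures = {pool.submit(s, context): s.__name__ for s in strategies}
        # as_completed yields futures in finish order, so fast fixers are checked first.
        for future in as_completed(futures):
            try:
                patch = future.result()
            except Exception:
                continue  # a crashed fixer counts as a failed attempt
            if verify(patch):
                # Leaving the with-block still waits for fixers already running.
                return futures[future], patch
    return None
```

Note that thread pools cannot interrupt a fixer that is already running, so the losing strategies still run to completion; with paid models you may want a cancellation mechanism at the API-call level.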

The swarm has one safety feature worth noting: auto-capped concurrency. Local models (Ollama, MLX) get a higher cap because they're free and the bottleneck is compute, not API rate limits. Cloud models get a lower cap to stay within OpenRouter rate limits and cost thresholds. The cap adjusts dynamically based on which models are in play.
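A minimal sketch of the capping idea follows. The cap values (8 local, 3 cloud) and the rule "any cloud provider in play means the lower cap" are assumptions for illustration; the post does not give the real numbers or policy.

```python
from concurrent.futures import ThreadPoolExecutor

LOCAL_PROVIDERS = {"ollama", "mlx"}
LOCAL_CAP = 8  # assumed: local models are bounded by compute, not rate limits
CLOUD_CAP = 3  # assumed: stay under API rate limits and cost thresholds


def concurrency_cap(providers):
    """Pick the swarm size from the providers in play for this round."""
    providers = list(providers)
    if providers and all(p in LOCAL_PROVIDERS for p in providers):
        return LOCAL_CAP
    # Any cloud provider (or an unknown mix) gets the conservative cap.
    return CLOUD_CAP


def run_swarm(tasks, providers):
    """Run zero-argument tasks in parallel, never exceeding the cap.

    Results come back in the same order as the tasks were given.
    """
    with ThreadPoolExecutor(max_workers=concurrency_cap(providers)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```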

Semantic Cache

This was Alex Shev's push. We had SHA-256 exact-match caching. It helped. But exact-match caching only hits when the input is byte-identical, which almost never happens in natural language.

We added a second layer: cosine similarity over embeddings. Inputs above a 0.92 similarity threshold reuse the cached result. Below that, the model runs and the result is cached. The similarity threshold is intentionally conservative. We'd rather re-run than serve a near-miss.
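The two layers can be sketched as follows. This is an in-memory illustration only: `embed` is whatever embedding function you supply (an assumption, since the post does not name one), the linear scan stands in for a real vector index, and only the 0.92 threshold comes from the text above.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # caller-supplied: prompt -> vector
        self.threshold = threshold
        self.exact = {}             # SHA-256 hex digest -> result
        self.entries = []           # (embedding, result) pairs

    def get_or_compute(self, prompt, compute):
        """Return (result, source) where source is 'exact', 'semantic', or 'miss'."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.exact:       # layer 1: byte-identical input
            return self.exact[key], "exact"

        vector = self.embed(prompt)  # layer 2: nearest cached embedding
        best_score, best_result = 0.0, None
        for cached_vector, cached_result in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_result = score, cached_result
        if best_score > self.threshold:
            return best_result, "semantic"

        result = compute(prompt)    # below threshold: run the model, then cache
        self.exact[key] = result
        self.entries.append((vector, result))
        return result, "miss"
```

Returning the source tag alongside the result makes it easy to measure how much of the hit rate comes from each layer.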

The TTL surprised us. Free-tier models (Gemini Flash, Qwen via Ollama) get a 12-hour TTL because their outputs drift more across model updates. Paid models (Claude, DeepSeek via OpenRouter) get 48 hours. The intuition was that paid models were the expensive ones, so caching them mattered most. The data disagreed. Free-tier models are where waste compounds because developers call them liberally, assuming they're free. They aren't, at least not in aggregate. The 12-hour TTL on free models saved more than the 48-hour TTL on paid ones.
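The per-tier expiry check is small enough to show in full. The tier labels and the `is_fresh` helper are hypothetical names; the 12-hour and 48-hour values are the ones stated above.

```python
import time

# TTLs from the text above, in seconds, keyed by an assumed tier label.
TTL_SECONDS = {"free": 12 * 3600, "paid": 48 * 3600}


def is_fresh(stored_at, tier, now=None):
    """True if a cache entry written at stored_at (epoch seconds) is still within its TTL."""
    now = time.time() if now is None else now
    return (now - stored_at) < TTL_SECONDS[tier]
```

Passing `now` explicitly keeps the check testable without patching the clock.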

Cache hit rate sits at 41% across the board. Not world-changing, but on a system running continuous loops, skipping 41% of model calls adds up.

Escalation Helper

Loops fail. A council debate might stall because the critic is hallucinating objections. A code heal might loop on a test that's wrong, not the code.

We added a two-tier escalation helper that sits outside the loops and diagnoses stalls.

Tier 1: DeepSeek V4 Flash (free). It reads the last N turns of the stuck loop and returns a diagnosis and a suggested fix. 97% of stalls resolve here. The diagnosis is usually simple: the critic is grading against the wrong rubric, the test is flaky, the prompt is ambiguous.

Tier 2: DeepSeek V4 Pro (paid). Reserved for the 3% of cases that Tier 1 can't untangle. These are genuine edge cases, not configuration mistakes.

The helper doesn't run inside the loop. It runs above it, reading logs, not participating in the conversation. That separation prevents the helper from getting pulled into the same failure mode as the loop it's diagnosing.
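The two-tier helper reduces to a small dispatcher that only ever sees the log. In this sketch the `Diagnosis` shape, the `resolved` flag (how Tier 1 signals it could not untangle the stall), and the default window of 10 turns are all assumptions; the two diagnosers are callables you would back with the free and paid models.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    cause: str
    suggested_fix: str
    resolved: bool  # assumed signal: did this tier produce a usable diagnosis?


def diagnose_stall(log, tier1, tier2, last_n=10):
    """Diagnose a stuck loop from its log, escalating only when Tier 1 gives up.

    log is a list of past turns. The helper reads a copy of the last N turns
    and never writes into the loop's own conversation.
    Returns (tier_name, Diagnosis).
    """
    window = list(log[-last_n:])
    diagnosis = tier1(window)       # cheap model first
    if diagnosis.resolved:
        return "tier1", diagnosis
    return "tier2", tier2(window)   # paid model only for the leftovers
```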

Cost-Free Toggle

Everything described above runs with paid models by default. But we ship a COST_FREE_ONLY=1 toggle that switches the entire system to free models only.

When the toggle is on, the loops don't get dumber. They get more iterations. Quality is maintained through volume: the council runs more debate rounds, the PRD reviewer does more passes, the code heal tries more fixer strategies. The swarm cap adjusts to run higher parallelism on free models. It's slower, but it converges.
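To make the trade concrete, here is a sketch of how the toggle might translate into loop settings. Only the `COST_FREE_ONLY=1` variable and the default of three fixer strategies come from this post; the setting names and every other number are illustrative assumptions.

```python
import os


def loop_settings(env=None):
    """Derive loop budgets from the COST_FREE_ONLY toggle.

    env defaults to os.environ; pass a dict to test without touching the
    real environment. All numeric budgets here are assumed values.
    """
    env = os.environ if env is None else env
    if env.get("COST_FREE_ONLY") == "1":
        # Free models: spend iterations and parallelism instead of money.
        return {"model_tier": "free", "max_rounds": 8,
                "fixer_strategies": 5, "swarm_cap": 8}
    return {"model_tier": "paid", "max_rounds": 4,
            "fixer_strategies": 3, "swarm_cap": 3}
```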

This isn't a demo mode. It's the actual fallback for developers who don't have API credits. The system works without a credit card.

What This Means

Loop engineering is not a bigger prompt. It's a different layer of abstraction. The human job moves from typing each turn to designing the system that discovers work, assigns it, verifies it, and knows when to escalate.

Cherny and Steinberger are describing the same shift from different angles. The tools are converging. The loop shape is becoming tool-agnostic. What matters is the design: separation of generation and evaluation, parallel swarming where the work allows it, caching what doesn't change, and clear escalation paths when the loop can't close itself.

Alex's comment was the push that connected the pieces. We had the caching, the context engineering, the tiered routing. What we were missing was the loop around them: prove progress before expensive turns.

Repo: github.com/sarichan777/kaizen-harness. The patterns directory includes council-debate/, self-healing/, verification/, and trajectory-logger/. The full loop implementations are in the open-source project.

What patterns are you finding that improve iteration quality?