DEV Community: AttestDojo

We Built the Loops Both Anthropic and OpenAI Are Now Telling Engineers to Write. Here's the Architecture.

AttestDojo — Thu, 11 Jun 2026 17:40:48 +0000

Previously: Kaizen Harness: patterns for making AI agents reliable (June 9) and We Cut Our AI Agent Costs by 60%. Here's What Worked. (June 10)

Two things happened last week.

First, we published our cost reduction piece and Alex Shev left a comment that summed up what we'd been building toward: "cache what is stable, shrink context, prove progress before expensive turns." He was right. That one sentence described a system we had in pieces but hadn't yet wired together.

Second, the loop engineering convergence hit mainstream. Boris Cherny, head of Claude Code at Anthropic, said on the Acquired podcast: "I don't prompt Claude anymore. I have loops that are running. They're the ones that are prompting Claude and figuring out what to do. My job is to write loops." Peter Steinberger at OpenAI posted the same idea to 6.5 million views: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

This is the signal. Prompt engineering was about getting one good response. Loop engineering is about designing systems where the model iterates until the work is verifiably correct. Both Anthropic and OpenAI are shipping primitives for this. The loop shape is converging across tools.

We built ours. Here's the architecture.

The Three Loops

We have three loops running inside Kaizen Harness. Each has a different job, but they share the same skeleton: define the goal, run the agent, evaluate the output, decide whether to iterate or stop.

Council debate iteration loop. When we need an architectural decision or a design tradeoff evaluated, three council seats fire simultaneously, each using a different model. A fourth seat, the critic, reads all three outputs and identifies flaws, contradictions, or blind spots. If the critic finds anything, the seats re-debate with the criticism appended as context. The loop stops when the critic returns no further objections.

The key design choice: the critic is a different model than the generators. A model grading its own output is too generous. Separation of evaluator from generator is what makes the loop converge instead of self-congratulating.
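A minimal sketch of that loop, assuming each seat and the critic are plain callables wrapping different models. The names `council_debate`, `Seat`, and `Critic`, and the `max_rounds` safety cap, are illustrative and not taken from the Kaizen Harness repo. The toy seats at the bottom stand in for real model calls.

```python
from typing import Callable, List

Seat = Callable[[str], str]          # prompt -> answer (hypothetical wrapper around one model)
Critic = Callable[[List[str]], str]  # all answers -> objection text, "" when satisfied


def council_debate(question: str, seats: List[Seat], critic: Critic,
                   max_rounds: int = 5) -> List[str]:
    """Re-debate until the critic has no objections or the round cap is hit."""
    prompt = question
    answers: List[str] = []
    for _ in range(max_rounds):
        # Run sequentially here for clarity; the real seats fire in parallel.
        answers = [seat(prompt) for seat in seats]
        objection = critic(answers)
        if not objection:
            break
        # Append the criticism as context for the next round.
        prompt = f"{question}\n\nCritic feedback:\n{objection}"
    return answers


# Toy stand-ins: seats improve once they see feedback, the critic objects to drafts.
def toy_seat(prompt: str) -> str:
    return "revised" if "Critic feedback" in prompt else "draft"


def toy_critic(answers: List[str]) -> str:
    return "too vague" if "draft" in answers else ""


final_answers = council_debate("Queue or cron?", [toy_seat] * 3, toy_critic)
```

Keeping the critic as a separate callable makes the generator/evaluator split explicit: swapping in a different model for the critic is a one-argument change.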

PRD review loop. Before we build, a reviewer model scores the PRD against three criteria: clarity, completeness, testability. If the score drops below threshold on any dimension, the loop feeds the scored critique back to a writer model, which revises. It iterates until all three scores pass. On average, PRDs take 2.3 revision rounds. What used to be a 45-minute human review cycle is now a 90-second automated one.
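The review loop can be sketched as follows. This assumes scores on a 0 to 1 scale, a threshold of 0.8, and a cap of five rounds, none of which are stated in the post. `prd_review_loop` and the toy reviewer and writer are hypothetical stand-ins for the two models.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def prd_review_loop(prd: str,
                    review: Callable[[str], Scores],
                    revise: Callable[[str, Scores], str],
                    threshold: float = 0.8,
                    max_rounds: int = 5) -> Tuple[str, int]:
    """Return the passing PRD and the number of revision rounds it took."""
    rounds = 0
    while True:
        scores = review(prd)
        failing = {name: s for name, s in scores.items() if s < threshold}
        if not failing:
            return prd, rounds
        if rounds == max_rounds:
            raise RuntimeError(f"PRD still failing on {sorted(failing)}")
        prd = revise(prd, failing)  # the writer sees only the failing dimensions
        rounds += 1


# Toy reviewer and writer standing in for the two models.
def toy_review(prd: str) -> Scores:
    testable = 0.9 if "Acceptance criteria" in prd else 0.4
    return {"clarity": 0.9, "completeness": 0.85, "testability": testable}


def toy_revise(prd: str, failing: Scores) -> str:
    return prd + "\nAcceptance criteria: export finishes in under 5 seconds."


final_prd, rounds_used = prd_review_loop("Add CSV export.", toy_review, toy_revise)
```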

Code verify-and-heal loop. When a self-healing patch fails verification, the system escalates through three tiers. Tier 1: an AI fixer re-attempts the patch with the failure context appended. Tier 2: if Tier 1 exhausts its retry budget, the system does a clean git revert and attempts a different approach. Tier 3: if both tiers fail, the loop escalates to a human with a structured failure report, including the patch diff, test output, and a diagnosis of what went wrong.

The tiers are not arbitrary. Tier 1 catches 78% of failures. Tier 2 catches another 14%. Only 8% reach a human, and those arrive with enough context that the human doesn't have to reconstruct the problem.
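A rough sketch of the three-tier escalation, under stated assumptions: the fixer, the alternative approach, the verifier, and the revert step are all passed in as callables, and the retry budget of 3 is a placeholder. `verify_and_heal` and `FailureReport` are hypothetical names, not the repo's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class FailureReport:
    patch_diff: str
    test_output: str
    diagnosis: str


def verify_and_heal(failure_context: str,
                    fix: Callable[[str], str],
                    alternative_fix: Callable[[str], str],
                    verify: Callable[[str], Tuple[bool, str]],
                    revert: Callable[[], None],
                    retry_budget: int = 3) -> Tuple[str, Union[str, FailureReport]]:
    # Tier 1: retry the fixer, appending each failure to its context.
    context = failure_context
    for _ in range(retry_budget):
        patch = fix(context)
        ok, output = verify(patch)
        if ok:
            return "tier1", patch
        context += f"\nPrevious attempt failed:\n{output}"

    # Tier 2: clean revert, then one attempt with a different approach.
    revert()
    patch = alternative_fix(failure_context)
    ok, output = verify(patch)
    if ok:
        return "tier2", patch

    # Tier 3: hand a structured report to a human.
    report = FailureReport(patch, output,
                           f"{retry_budget} retries and one alternative approach failed")
    return "human", report


# Toy run: the fixer never succeeds, the alternative approach does.
reverts: List[str] = []
outcome, result = verify_and_heal(
    "test_login fails",
    fix=lambda ctx: "bad patch",
    alternative_fix=lambda ctx: "good patch",
    verify=lambda patch: (patch == "good patch", "1 test failed"),
    revert=lambda: reverts.append("git revert"),
)
```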

Swarms for Speed

Serial iteration wastes wall-clock time. Inside each loop round, where the work is embarrassingly parallel, we swarm.

Council seats fire simultaneously, not sequentially. The code heal loop launches three parallel fixers at Tier 1, each attempting a different strategy, and the verification step picks the first one that passes. The PRD reviewer and writer run on separate threads, so revision starts as soon as the score is produced.

The swarm has one safety feature worth noting: auto-capped concurrency. Local models (Ollama, MLX) get a higher cap because they're free and the bottleneck is compute, not API rate limits. Cloud models get a lower cap to stay within OpenRouter rate limits and cost thresholds. The cap adjusts dynamically based on which models are in play.
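One way to express capped concurrency is a semaphore per backend, sketched here with `asyncio`. The cap values (8 local, 2 cloud) are assumptions, since the post only says the local cap is higher, and `run_swarm` is an illustrative name. Effective parallelism then depends on which backends the jobs in a round actually target.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Assumed caps: the post only says local is higher than cloud.
CAPS: Dict[str, int] = {"local": 8, "cloud": 2}

Job = Tuple[str, Callable[[], Awaitable[str]]]  # (backend, coroutine factory)


async def run_swarm(jobs: List[Job]) -> List[str]:
    """Run all jobs concurrently, never exceeding each backend's cap."""
    semaphores = {name: asyncio.Semaphore(cap) for name, cap in CAPS.items()}

    async def guarded(backend: str, job: Callable[[], Awaitable[str]]) -> str:
        async with semaphores[backend]:
            return await job()

    return await asyncio.gather(*(guarded(b, j) for b, j in jobs))


# Toy cloud call that records how many copies run at once.
stats = {"in_flight": 0, "peak": 0}


async def fake_cloud_call() -> str:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return "ok"


swarm_results = asyncio.run(run_swarm([("cloud", fake_cloud_call)] * 6))
```

Six cloud jobs are submitted together, but `stats["peak"]` never exceeds the cloud cap because the remaining jobs wait on the semaphore.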

Semantic Cache

This was Alex Shev's push. We had SHA256 exact-match caching. It helped. But exact-match only hits when the input is byte-identical, which almost never happens in natural language.

We added a second layer: cosine similarity over embeddings. An input whose embedding scores above 0.92 against a cached input reuses that cached result. Below that, the model runs and the new result is cached. The threshold is intentionally conservative: we'd rather re-run than serve a near-miss.

The TTL surprised us. Free-tier models (Gemini Flash, Qwen via Ollama) get a 12-hour TTL because their outputs drift more across model updates. Paid models (Claude, DeepSeek via OpenRouter) get 48 hours. The intuition was that paid models were the expensive ones, so caching them mattered most. The data disagreed. Free-tier models are where waste compounds because developers call them liberally, assuming they're free. They're not, not in aggregate. The 12-hour TTL on free models saved more than the 48-hour TTL on paid ones.
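To make the two layers and the per-tier TTL concrete, here is a small in-memory sketch. The 0.92 threshold and the 12/48-hour TTLs come from the post; everything else is assumed for illustration, including the `TwoLayerCache` class, the linear scan over cached vectors, and the toy word-count embedding (a real system would call an embedding model).

```python
import hashlib
import math
import time
from typing import Callable, Dict, List, Optional, Sequence, Tuple

TTL_HOURS = {"free": 12, "paid": 48}   # from the post
SIMILARITY_THRESHOLD = 0.92            # from the post


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    def __init__(self, embed: Callable[[str], Sequence[float]]):
        self.embed = embed  # hypothetical embedding function, supplied by the caller
        self.exact: Dict[str, Tuple[float, str]] = {}                 # sha256 -> (expiry, result)
        self.semantic: List[Tuple[Sequence[float], float, str]] = []  # (vector, expiry, result)

    def put(self, text: str, result: str, tier: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        expiry = now + TTL_HOURS[tier] * 3600
        self.exact[hashlib.sha256(text.encode()).hexdigest()] = (expiry, result)
        self.semantic.append((self.embed(text), expiry, result))

    def get(self, text: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Layer 1: byte-identical input.
        hit = self.exact.get(hashlib.sha256(text.encode()).hexdigest())
        if hit and hit[0] > now:
            return hit[1]
        # Layer 2: nearest unexpired neighbour, accepted only above the threshold.
        vector = self.embed(text)
        best_score, best_result = 0.0, None
        for cached_vector, expiry, result in self.semantic:
            score = cosine(vector, cached_vector)
            if expiry > now and score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= SIMILARITY_THRESHOLD else None


# Toy embedding: word counts over a tiny vocabulary.
VOCAB = ["summarize", "build", "log", "failures", "weather", "paris"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


cache = TwoLayerCache(toy_embed)
cache.put("summarize build log failures", "3 flaky tests", tier="free", now=0.0)
```

With the toy embedding, "please summarize the build log failures" maps to the same vector and hits the semantic layer, while "summarize build log" scores about 0.87 and misses, which is the conservative behaviour described above. A lookup 13 hours later misses on both layers because the free-tier entry has expired.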

Cache hit rate sits at 41% across the board. Not world-changing, but on a system running continuous loops, serving 41% of requests from cache instead of calling a model adds up.

Escalation Helper

Loops fail. A council debate might stall because the critic is hallucinating objections. A code heal might loop on a test that's wrong, not the code.

We added a two-tier escalation helper that sits outside the loops and diagnoses stalls.

Tier 1: DeepSeek V4 Flash (free). It reads the last N turns of the stuck loop and returns a diagnosis and a suggested fix. 97% of stalls resolve here. The diagnosis is usually simple: the critic is grading against the wrong rubric, the test is flaky, the prompt is ambiguous.

Tier 2: DeepSeek V4 Pro (paid). Reserved for the 3% of cases that Tier 1 can't untangle. These are genuine edge cases, not configuration mistakes.

The helper doesn't run inside the loop. It runs above it, reading logs, not participating in the conversation. That separation prevents the helper from getting pulled into the same failure mode as the loop it's diagnosing.
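A sketch of that outside-the-loop helper. It assumes the loop writes its turns to a log the helper can read, and that the cheap diagnoser signals "can't untangle this" by returning `None`. Both conventions are assumptions, and `diagnose_stall` and the toy diagnosers are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Diagnoser = Callable[[List[str]], Optional[str]]  # log lines -> diagnosis, None if unsure


def diagnose_stall(loop_log: List[str], cheap: Diagnoser, expensive: Diagnoser,
                   last_n: int = 10) -> Tuple[str, Optional[str]]:
    """Read the tail of a stuck loop's log from outside the loop and diagnose it."""
    recent = loop_log[-last_n:]
    diagnosis = cheap(recent)
    if diagnosis is not None:
        return "tier1", diagnosis
    # Only pay for the larger model when the cheap one gives up.
    return "tier2", expensive(recent)


# Toy diagnosers standing in for the free and paid models.
def toy_cheap(lines: List[str]) -> Optional[str]:
    if len(lines) > 1 and len(set(lines)) == 1:
        return "critic repeats the same objection; check its rubric"
    return None


def toy_expensive(lines: List[str]) -> Optional[str]:
    return "needs manual review"


stuck_log = ["critic: missing rollback plan"] * 4
tier, diagnosis = diagnose_stall(stuck_log, toy_cheap, toy_expensive)
```

The helper only receives log lines and returns a diagnosis, so it never becomes another participant in the stalled conversation.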

Cost-Free Toggle

Everything described above runs with paid models by default. But we ship a COST_FREE_ONLY=1 toggle that switches the entire system to free models only.

When the toggle is on, the loops don't get dumber. They get more iterations. Quality is maintained through volume: the council runs more debate rounds, the PRD reviewer does more passes, the code heal tries more fixer strategies. The swarm cap adjusts to run higher parallelism on free models. It's slower, but it converges.
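As a configuration sketch, the toggle could map to larger iteration budgets like this. Only the `COST_FREE_ONLY=1` variable comes from the post. The function name, the config keys, and every number below are placeholders chosen for illustration.

```python
import os
from typing import Dict, Mapping, Union

Config = Dict[str, Union[str, int]]


def loop_config(env: Mapping[str, str] = os.environ) -> Config:
    """Pick the model pool and iteration budgets from the COST_FREE_ONLY toggle."""
    if env.get("COST_FREE_ONLY") == "1":
        # Free models only: trade wall-clock time for more iterations.
        return {"model_pool": "free", "max_debate_rounds": 8,
                "max_prd_passes": 8, "fixer_strategies": 6, "swarm_cap": 8}
    return {"model_pool": "paid", "max_debate_rounds": 4,
            "max_prd_passes": 4, "fixer_strategies": 3, "swarm_cap": 3}


free_config = loop_config({"COST_FREE_ONLY": "1"})
default_config = loop_config({})
```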

This isn't a demo mode. It's the actual fallback for developers who don't have API credits. The system works without a credit card.

What This Means

Loop engineering is not a bigger prompt. It's a different layer of abstraction. The human job moves from typing each turn to designing the system that discovers work, assigns it, verifies it, and knows when to escalate.

Cherny and Steinberger are describing the same shift from different angles. The tools are converging. The loop shape is becoming tool-agnostic. What matters is the design: separation of generation and evaluation, parallel swarming where the work allows it, caching what doesn't change, and clear escalation paths when the loop can't close itself.

Alex's comment was the push that connected the pieces. We had the caching, the context engineering, the tiered routing. What we were missing was the loop around them: prove progress before expensive turns.

Repo: github.com/sarichan777/kaizen-harness. The patterns directory includes council-debate/, self-healing/, verification/, and trajectory-logger/, and the full loop implementations are in the open-source project.

What patterns are you finding that improve iteration quality?

We Cut Our AI Agent Costs by 60%. Here's What Worked.

AttestDojo — Wed, 10 Jun 2026 05:09:44 +0000

We run a self-healing AI agent system (Kaizen Harness, open source on GitHub). It handles council debates on architecture, daily tech scans, trajectory logging, and automated patching, and the tokens add up fast. After a month of tuning, we cut costs 60% with zero quality loss. Here are the patterns that moved the needle, from biggest impact down.

1. Context engineering: stop re-reading your own history

This was the single biggest win. Our agents were burning 40-50% of tokens re-parsing conversation history that hadn't changed since turn 3. The fix, derived from production patterns used by Manus and Cognition:

Append-only design. Every agent response starts with a [STATUS] header that replaces the full history recap. Goal, completed steps, next step. Three lines.

[STATUS] Building PR auto-review pattern. Step 2/4 complete (diff parser done). Next: wire council debate.

The model treats it as an attention anchor. No re-reading 2,000 tokens of conversation to remember where we are.
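A tiny helper that produces a header in the shape shown above might look like this. The function name and its parameters are illustrative; only the output format is taken from the example line.

```python
def status_header(goal: str, step: int, total: int, last_done: str, next_step: str) -> str:
    """Build the one-line [STATUS] anchor that replaces a full history recap."""
    return f"[STATUS] {goal}. Step {step}/{total} complete ({last_done}). Next: {next_step}."


header = status_header("Building PR auto-review pattern", 2, 4,
                       "diff parser done", "wire council debate")
```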

Static tool definitions first. Our tool registry is ~800 tokens of JSON schemas. Prefix caching only reuses tokens up to the first point where a prompt differs from the previous one, so placing the registry before any dynamic content lets the KV cache reuse it across turns. Moving tool definitions from the middle of prompts to the top saved ~15% per session.

Compaction trigger. After turn 5, auto-insert a [CONTEXT UPDATE] block summarizing everything the agent needs. Old context is not deleted, but it's no longer in the active attention window. This alone cut our long session costs by 35%.
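Putting the last two ideas together, prompt assembly could be sketched as below. This is one interpretation, not the repo's code: the tool schemas are serialized first with stable key order, and once the session passes turn 5 the older turns are sent only as a `[CONTEXT UPDATE]` summary. `build_prompt` and the `summarize` callable are hypothetical.

```python
import json
from typing import Callable, Dict, List

COMPACT_AFTER_TURN = 5  # from the post


def build_prompt(tool_schemas: List[Dict], turns: List[str],
                 summarize: Callable[[List[str]], str]) -> str:
    # Static prefix first; sort_keys keeps the bytes identical from turn to turn.
    parts = [json.dumps(tool_schemas, sort_keys=True)]
    if len(turns) > COMPACT_AFTER_TURN:
        # Older turns stay in `turns` (nothing is deleted) but are sent only as a summary.
        parts.append("[CONTEXT UPDATE] " + summarize(turns[:-1]))
        parts.append(turns[-1])
    else:
        parts.extend(turns)
    return "\n\n".join(parts)


def toy_summarize(old_turns: List[str]) -> str:
    return f"{len(old_turns)} earlier turns summarized"


tools = [{"name": "read_file", "parameters": {"path": "string"}}]
history = [f"turn {i}" for i in range(1, 8)]
short_prompt = build_prompt(tools, history[:3], toy_summarize)
long_prompt = build_prompt(tools, history, toy_summarize)
```

Both prompts start with the same serialized registry, which is what lets a prefix cache reuse it, and the long prompt carries the summary plus only the latest turn.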

2. Route by task tier, not by default model

Our default was "call Claude for everything." Claude is great at creative reasoning. It is also expensive for tasks that don't need it.

We split tasks into three tiers:

Tier	Task	Model	Cost vs Claude
Creative	Architecture decisions, debate synthesis, public-facing content	Claude Sonnet 4	1x
Planning	Feature scoping, issue triage, PRD drafts	DeepSeek V3.2	0.1x
Utility	Log parsing, health checks, format validation	Gemini Flash 2.5	0.02x

The tier names are in the prompt. The agent classifies its own task before choosing a model. Simple routing cut our total spend by half, because 70% of agent tasks are utility and planning, not creative reasoning.
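The routing itself can be as small as a lookup table. In this sketch the tier-to-model mapping and relative costs are copied from the table above. The `route` function and the fallback to the creative tier for an unrecognized label are assumptions.

```python
MODEL_BY_TIER = {
    "creative": "Claude Sonnet 4",
    "planning": "DeepSeek V3.2",
    "utility": "Gemini Flash 2.5",
}
COST_VS_CLAUDE = {"creative": 1.0, "planning": 0.1, "utility": 0.02}


def route(tier_label: str) -> str:
    """Map the agent's self-classified tier to a model name."""
    tier = tier_label.strip().lower()
    # Assumed fallback: an unrecognized label gets the most capable model.
    return MODEL_BY_TIER.get(tier, MODEL_BY_TIER["creative"])


chosen = route("Utility")
```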

3. Local models for private tasks

Some runs should never touch a cloud API. System health checks, internal logs, config validation. We added Ollama + MLX models as first-class seats in the council debate script:

Qwen3.6 35B MoE (3B active) for reasoning tasks. Fits on any Mac with 16GB RAM because only 3B params are active at a time.
North Mini Code 1B (4-bit) for code diffs and syntax checks. Sub-second on M4.

These don't reduce dollar cost (they're free), but they eliminate API latency for high-frequency tasks. Our self-healing loop now runs entirely local: failure detection, classification, and patching never leave the machine.

4. What didn't work

Prompt compression tools. Tried three different "auto-summarize your context" libraries. All of them lost critical details the agent needed later. Manual compaction triggers (the [CONTEXT UPDATE] pattern above) worked better because the agent decides what matters.

"Just use a cheaper model for everything." Swapping Claude for DeepSeek on creative tasks produced technically correct but flat advice. No edge detection. The tiered routing was necessary because quality degrades on the wrong task type.

The numbers

Running 3 agents in continuous mode for 30 days before and after:

Metric	Before	After
Monthly API spend	$410	$165
Avg tokens per session	12,400	5,100
Council debate cost	$0.48/debate	$0.14/debate
Context rot sessions (>10 turns, quality degrades)	22%	6%

No increase in error rates. Self-healing success rate unchanged at 91%.

Your turn

The context engineering patterns cost nothing to implement. Try the [STATUS] header in your next agent prompt and see if the model stops re-summarizing history. The tiered routing is a config change away if you're already using OpenRouter.

Repo with the actual scripts: Kaizen Harness. The council debate config and model registry are in patterns/council/.

What's your biggest token waste source?

Kaizen Harness: patterns for making AI agents reliable

AttestDojo — Tue, 09 Jun 2026 16:02:17 +0000

Kaizen Harness is a set of patterns for the system around an AI model: trajectory logging, verification, self-healing, and multi-model council debates.

The idea is simple. When an agent makes a mistake, you don't rerun the prompt. You change the system so that class of mistake stops happening.

I kept hitting the same problem: AI agents fail silently, claim success without evidence, and repeat mistakes they already made an hour ago. The model wasn't the issue. The system around the model was.

The four patterns

Each pattern is a single bash script or TypeScript file with a README. No framework, no install, copy what you want.

Trajectory logger. Append-only JSONL of every action: task, model, tools, success/failure, failure type. Bash + python3, nothing else. This is the memory you need before you can fix anything. Without a log you're guessing.
Verification. Stop trusting the agent's "done." A script that checks exit codes, stdout patterns, and common error signatures (segfaults, connection refused, syntax errors) even when the exit code looks clean.
Self-healing. Reads a failure memory log, fingerprints each failure, auto-patches the classes it recognizes with a post-fix verification, and escalates the ones it doesn't to a council. Rate-limited by fingerprint so it can't get stuck in a patch loop.
Council debate. Fans a hard question out to 3-6 models in parallel, lets them disagree, then synthesizes the tradeoffs. Built around free models (OpenRouter free tier, Groq) and a local Ollama seat. Only the synthesis step touches a paid model, and only if you want it to.
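The first two patterns fit in a few lines. The sketch below is written in Python rather than the repo's bash, and the field names, signature regexes, and function names are assumptions based on the descriptions above, not the actual scripts.

```python
import json
import os
import re
import tempfile
import time
from typing import List, Optional

# Error signatures to look for even when the exit code is clean.
ERROR_SIGNATURES = {
    "segfault": re.compile(r"segmentation fault|segfault", re.IGNORECASE),
    "connection_refused": re.compile(r"connection refused", re.IGNORECASE),
    "syntax_error": re.compile(r"syntax ?error", re.IGNORECASE),
}


def verify(exit_code: int, stdout: str) -> Optional[str]:
    """Return a failure type, or None if the run looks clean."""
    if exit_code != 0:
        return "nonzero_exit"
    for name, pattern in ERROR_SIGNATURES.items():
        if pattern.search(stdout):
            return name
    return None


def log_action(path: str, task: str, model: str, tools: List[str],
               failure_type: Optional[str]) -> None:
    """Append one action to the trajectory log as a JSON line."""
    record = {"ts": time.time(), "task": task, "model": model, "tools": tools,
              "success": failure_type is None, "failure_type": failure_type}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


log_path = os.path.join(tempfile.mkdtemp(), "trajectory.jsonl")
log_action(log_path, "run tests", "local-model", ["bash"], verify(0, "12 passed"))
log_action(log_path, "fetch deps", "local-model", ["curl"],
           verify(0, "curl: (7) Connection refused"))

with open(log_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

The second action exits with code 0 but still gets logged as a failure, which is the case the verification pattern exists to catch.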

There's a runnable demo in the README showing self-healing on its own: it reads its failure memory, patches and verifies three known failure classes, and kicks the unknown one to a council. No API keys, no network.
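The fingerprint-and-rate-limit idea behind self-healing can be sketched like this. The normalization rule (digits collapsed to `N`), the limit of two attempts per fingerprint, and the names `fingerprint` and `self_heal` are all assumptions for illustration. The repo's demo may classify failures differently.

```python
import hashlib
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple


def fingerprint(error_text: str) -> str:
    """Collapse digits so errors of the same class share one fingerprint."""
    normalised = re.sub(r"\d+", "N", error_text.lower())
    return hashlib.sha256(normalised.encode()).hexdigest()[:12]


def self_heal(failures: List[str],
              known_fixes: Dict[str, Callable[[str], str]],
              verify: Callable[[str], bool],
              max_attempts_per_fingerprint: int = 2) -> Tuple[List[str], List[str]]:
    """Patch recognized failure classes; escalate unknown, failed, or rate-limited ones."""
    attempts: Counter = Counter()
    healed: List[str] = []
    escalated: List[str] = []
    for error in failures:
        fp = fingerprint(error)
        fix = known_fixes.get(fp)
        if fix is None or attempts[fp] >= max_attempts_per_fingerprint:
            escalated.append(error)  # unknown class, or patch budget for this class is spent
            continue
        attempts[fp] += 1
        # Post-fix verification is mandatory; a failed fix escalates instead of looping.
        (healed if verify(fix(error)) else escalated).append(error)
    return healed, escalated


fixes = {fingerprint("port 8080 in use"): lambda err: "kill stale process"}
healed, escalated = self_heal(
    ["port 8080 in use", "port 9090 in use", "port 3000 in use", "disk quota exceeded"],
    fixes,
    verify=lambda patch: True,
)
```

All three port errors share a fingerprint, so the third one hits the rate limit and goes to the council along with the unrecognized disk error.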

What we learned running it

The honest part is in the "What We Learned" sections in each pattern's README:

An early self-heal applied patches and assumed they worked. One bad patch made things worse. Now verification after every fix is mandatory, and a failed fix escalates instead of looping.
The council talked us out of adding Redis, microservices, and a message queue we didn't need at our scale.
Without a trajectory log, you're just guessing what went wrong. Structured logging turned "the agent broke" into "the agent broke at step 3 with error class X."

The patterns are framework-agnostic. You can copy one in without adopting the rest.

Repo

GitHub (MIT): https://github.com/sarichan777/kaizen-harness

This is infrastructure we run, not a research artifact. Feedback and better patterns welcome, especially on the self-healing and verification pieces.