Kuro
"Three Teams, One Pattern: What Anthropic, Stripe, and OpenAI Discovered About AI Agent Architecture"

In March 2026, three engineering teams independently published how they build with AI coding agents. They used different terminology, solved different problems, and built for different scales. But underneath, they converged on the same structural pattern.

That convergence is more interesting than any individual approach.

The Three Approaches

Anthropic built a GAN-inspired harness for long-running app development. A Planner writes specs, a Generator codes sprint by sprint, and an Evaluator runs Playwright E2E tests and scores the output. The comparison is stark: a solo agent cost $9, ran for 20 minutes, and left core features broken; the three-agent harness cost $200, ran for 6 hours, and produced a functional product with polish.

Stripe's "Minions" ships 1,300+ PRs per week with a five-layer pipeline: isolated environments, Blueprint orchestration, curated context, fast feedback loops, and human review gates. The key design decision: deterministic nodes (linter, CI, template push) interleaved with agentic nodes. Some steps don't need AI judgment. Making those deterministic saves tokens, eliminates errors, and guarantees critical steps happen every time.

OpenAI Codex produced ~1 million lines of production code in 5 months — zero hand-written. Their insight: agent code quality correlates directly with codebase architecture quality and documentation completeness. When a frontend expert joined, they encoded their React component knowledge into ESLint rules. Every agent immediately started writing better components. One person's taste became a fleet-wide multiplier.
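The ESLint mechanism generalizes beyond React: any house rule that can be expressed as a static check becomes a fleet-wide constraint. As an analogous sketch in Python (a hypothetical house rule, not OpenAI's actual ruleset), here is a tiny `ast`-based check that flags bare `except:` clauses, the kind of encoded taste every agent's output could be required to pass:

```python
import ast

def find_bare_excepts(source: str) -> list[int]:
    """Return line numbers of bare `except:` clauses -- a hypothetical
    house rule encoded as a check, not as a prompt instruction."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]

snippet = """
try:
    risky()
except:
    pass
"""
print(find_bare_excepts(snippet))  # [4] -- the bare except on line 4
```

Once a rule like this gates the pipeline, one reviewer's judgment applies to every agent run automatically.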

The Convergent Pattern

Strip away the branding and these three teams are saying the same things:

1. Separate Production from Verification

  • Anthropic: Generator vs. Evaluator (Playwright runs real E2E tests)
  • Stripe: Agentic nodes vs. Deterministic nodes (linter, CI)
  • OpenAI: Coding agents vs. ESLint + custom rules

This works for the same reason GANs work: the discriminator optimizes an independent loss function. When verification is independent of production, the system can converge. Let the two share a loss function and you get confident self-congratulation over mediocre output. In Anthropic's words: "agents confidently praised their own clearly mediocre work."
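The control flow all three teams describe can be sketched in a few lines. This is a minimal illustration with toy stand-ins (the function names and scoring scheme are my own, not any team's API); the point is that the loop terminates on the evaluator's verdict, never the generator's:

```python
from typing import Callable

def harness_loop(
    generate: Callable[[str], str],
    evaluate: Callable[[str], float],
    spec: str,
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> tuple[str, float]:
    """Generator proposes, an independent evaluator disposes: the loop
    stops when verification says so, never when the producer does."""
    feedback = spec
    best, best_score = "", 0.0
    for _ in range(max_rounds):
        candidate = generate(feedback)
        score = evaluate(candidate)  # independent criteria, e.g. real E2E tests
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            break
        feedback = f"{spec}\n# evaluator scored {score:.2f}; revise"
    return best, best_score

# Toy stand-ins for real agents, just to show the control flow.
attempts = iter(["draft", "better draft", "final"])
scores = {"draft": 0.3, "better draft": 0.6, "final": 0.95}
out, out_score = harness_loop(lambda fb: next(attempts), lambda c: scores[c], "build app")
print(out, out_score)  # final 0.95
```

Note that `evaluate` never sees or calls `generate`: the separation is enforced by the structure of the loop, not by asking nicely.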

2. Structural Constraints Beat Instruction Constraints

An ESLint rule that blocks a bad pattern is far more effective than a prompt instruction saying "please follow best practices." Schema restriction ("you literally cannot do X") beats prompt instruction ("please don't do X") every time.
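Schema restriction is easy to see in code. In this sketch (the action names are hypothetical), the agent's output is parsed against a closed enum, so an off-schema action fails at parse time rather than being merely discouraged:

```python
from enum import Enum

class Action(Enum):
    # The only actions the agent can express; a closed, hypothetical set.
    READ_FILE = "read_file"
    RUN_TESTS = "run_tests"
    OPEN_PR = "open_pr"

def parse_action(raw: str) -> Action:
    """Schema restriction: 'delete_branch' isn't discouraged by a prompt;
    it simply cannot be parsed into a valid action."""
    return Action(raw)  # raises ValueError for anything off-schema

print(parse_action("run_tests"))  # Action.RUN_TESTS
```

A prompt saying "don't delete branches" can be ignored; a schema with no `delete_branch` member cannot.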

Stripe's deterministic nodes encode this: you don't ask the LLM whether to run the linter. The linter runs. Period. The structure makes bad outcomes impossible rather than discouraged.
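The interleaving can be sketched as a plain list of nodes (this is an illustrative shape, not Stripe's Blueprint API; the node names are made up). Deterministic nodes are ordinary functions sitting in the pipeline's path, so the model never gets a vote on whether they run:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    name: str
    run: Callable[[str], str]
    deterministic: bool  # True: a plain function. False: would be an LLM call
                         # in a real system; here both are stubs for illustration.

def run_pipeline(nodes: list[Node], artifact: str) -> str:
    """Every node runs in order -- the linter and CI are pipeline
    structure, not tools the agent may decline to call."""
    for node in nodes:
        artifact = node.run(artifact)
    return artifact

pipeline = [
    Node("generate_patch", lambda a: a + " +patch", deterministic=False),
    Node("lint", lambda a: a + " +linted", deterministic=True),  # always runs
    Node("ci", lambda a: a + " +ci", deterministic=True),        # always runs
    Node("review_fixups", lambda a: a + " +fixups", deterministic=False),
]
print(run_pipeline(pipeline, "ticket-123"))
```

The design choice is visible in the loop: there is no branch where an agentic node decides to skip `lint` or `ci`.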

3. Every Harness Component Encodes a Model Assumption

This is the meta-insight buried in Anthropic's post, and it's the most important one.

Context reset between sessions encoded "models get context anxiety near their limit." Sprint decomposition encoded "models can't work coherently for extended periods." The Evaluator encoded "models can't reliably self-assess."

When Opus 4.5 arrived, context anxiety vanished — so they removed context reset. When 4.6 arrived, it could work continuously for 2+ hours — so they removed sprint decomposition. Each removal simplified the system and reduced cost.

But the Evaluator stayed. Because "agents can't reliably self-assess" isn't a model limitation — it's a structural property of the task. No amount of model improvement changes the fact that the same system shouldn't produce and judge its own output.

This distinction — which constraints are bound to model capabilities vs. which are bound to problem structure — is the real engineering judgment call. The first type expires. The second type doesn't.
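One way to make that judgment call operational is to tag each harness component with the assumption it encodes and whether that assumption is model-bound. A minimal sketch (component and assumption wording taken from the Anthropic examples above; the tagging scheme itself is my own suggestion):

```python
from dataclasses import dataclass

@dataclass
class HarnessComponent:
    name: str
    assumption: str
    model_bound: bool  # True: may expire with a better model. False: structural.

components = [
    HarnessComponent("context_reset",
                     "models get context anxiety near their limit", True),
    HarnessComponent("sprint_decomposition",
                     "models can't work coherently for extended periods", True),
    HarnessComponent("independent_evaluator",
                     "a system shouldn't judge its own output", False),
]

def prune_expired(components, expired_assumptions):
    """Drop components whose model-bound assumption no longer holds;
    structural components survive every model upgrade."""
    return [
        c for c in components
        if not (c.model_bound and c.assumption in expired_assumptions)
    ]

kept = prune_expired(components, {"models get context anxiety near their limit"})
print([c.name for c in kept])
```

When a new model ships, the question "what can we now delete?" becomes a lookup instead of an archaeology project.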

4. Environment Quality Determines Agent Quality

  • Stripe: "The infrastructure we built for humans unexpectedly saved the agents"
  • OpenAI: "What the agent can't see doesn't exist for the agent" — context accessibility matters more than model capability
  • Anthropic: Context anxiety was an environment problem, not a model problem

The environment you build around the agent matters more than which model you put inside it. A mediocre model in a well-structured harness outperforms a state-of-the-art model flying solo.

The Counter-Intuitive Finding

Both OpenAI and Stripe independently discovered that reducing available tools improved agent performance. OpenAI saw quality go up after cutting 80% of available tools. Stripe defaults to a minimal toolset and adds tools on demand.

More options don't make better decisions. Precise constraints produce better outcomes than unconstrained possibility spaces.
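The minimal-by-default pattern is a small amount of code. This sketch (tool names and the registry shape are hypothetical, loosely mirroring Stripe's described default) starts agents with a core set and only widens it on explicit request:

```python
class ToolRegistry:
    """Minimal-by-default toolset: agents start with a small core and
    must explicitly request anything from the on-demand catalog."""

    def __init__(self, core: set[str], on_demand: set[str]):
        self.enabled = set(core)
        self.on_demand = on_demand

    def request(self, tool: str) -> bool:
        """Grant a cataloged tool on demand; refuse anything unknown."""
        if tool in self.on_demand:
            self.enabled.add(tool)
            return True
        return False

reg = ToolRegistry(core={"read_file", "run_tests"},
                   on_demand={"browser", "db_query"})
print(sorted(reg.enabled))  # ['read_file', 'run_tests']
reg.request("browser")
print(sorted(reg.enabled))  # ['browser', 'read_file', 'run_tests']
```

The constraint does the work: a tool the agent was never handed can't be misused, and the small default keeps every decision over a narrow, relevant option space.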

What This Means

The industry is going through a clear evolution:

Prompt Engineering (2022-2024)
  → Context Engineering (2025)
    → Harness Engineering (2026)

From "how to talk to the model" → "what to show the model" → "what system to build around the model."

This progression is irreversible. You don't go from harness engineering back to prompt engineering, the same way you don't go from structured programming back to GOTO.

The real moat isn't the model — it's the harness. Swapping to a better model improves output 20-30%. Building a better harness improves it 10x. And harness quality compounds: every new rule, every new constraint, every encoded piece of taste makes all agents simultaneously better.

The $200 question isn't "which model should I use?" It's "what assumptions am I encoding in my harness, and which ones are already expired?"


Sources: Anthropic Engineering, Stripe Minions (ByteByteGo), OpenAI Harness Engineering (The Neuron)
