How Stripe, Shopify, and Airbnb Build AI Harnesses

#ai #llm #agents

TL;DR

There is no single model of harness engineering.
OpenAI builds repository-centered harnesses, Anthropic focuses on agent cognitive continuity, while companies like Stripe, Shopify, and Airbnb develop vertical harnesses built around compliance, context, and action verification.
Harness engineering is becoming a domain-specific discipline, shaped by the type of risk each company needs to control.

Why There Is No Single Harness: Stripe, Shopify, Airbnb, and the Industrial Fragmentation of Agent Engineering

After observing OpenAI's repository harness and Anthropic's runtime harness (if you haven't already, read my articles OpenAI and the New Cognitive Architecture of Software Repositories and Anthropic and the Runtime Harness for Persistent Agents), one might expect the industry to be converging on a fairly clear formula: define memory, tools, feedback loops, and constraints, then let the agents work.

In reality, the opposite is happening: the more big-tech case studies become public, the clearer it is that the word harness is starting to cover profoundly different architectures.

I find this to be the most interesting signal of the sector's maturation, because it means we are no longer witnessing the birth of a standard, but rather the emergence of multiple implementation paradigms.

The comparative analyses published about companies like Stripe, Shopify, and Airbnb demonstrate this very clearly.

The Point Is Not Model Capability. It's the Cost of Failure.

As long as we talk about coding agents in the abstract, there is a tendency to imagine that the problem is singular: making the model more reliable.

However, in industrial environments, reliability is not a neutral category; it depends on what the company considers tolerable or intolerable. That is where the divergence begins.

Stripe: The Harness as a Compliance Boundary

In the financial domain, the problem is not only producing a correct modification, but producing one that does not violate policies, introduce vulnerabilities, or alter critical transactional flows, and that remains fully auditable.

In this context, the harness tends to become an approval gate, with automated validation, side-effect simulation, and above all, compliance controls.

The agent does not operate in an open environment, but inside a risk-clearing chamber.

The harness is primarily a containment boundary.
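
To make the idea of an approval gate concrete, here is a minimal sketch in Python. It is not Stripe's implementation: the names ProposedChange, GateResult, approval_gate, the two checks, and the path and token values are all hypothetical, chosen only to illustrate "every check runs, every outcome is logged, any failure blocks the change".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ProposedChange:
    """Hypothetical representation of a modification proposed by an agent."""
    description: str
    touched_paths: list[str]
    diff: str


@dataclass
class GateResult:
    approved: bool
    audit_log: list[str] = field(default_factory=list)


# A check returns None when it passes, or a human-readable reason when it fails.
Check = Callable[[ProposedChange], Optional[str]]


def no_critical_paths(change: ProposedChange) -> Optional[str]:
    # Assumption: the organization marks some directories as transaction-critical.
    critical = ("payments/ledger/",)
    hits = [p for p in change.touched_paths if p.startswith(critical)]
    return f"touches critical paths: {hits}" if hits else None


def no_hardcoded_secrets(change: ProposedChange) -> Optional[str]:
    # Illustrative marker only; a real scanner would be far more thorough.
    return "possible secret in diff" if "sk_live_" in change.diff else None


def approval_gate(change: ProposedChange, checks: list[Check]) -> GateResult:
    """Run every check, record every outcome, approve only if all pass."""
    result = GateResult(approved=True)
    for check in checks:
        reason = check(change)
        outcome = "pass" if reason is None else f"fail ({reason})"
        result.audit_log.append(f"{check.__name__}: {outcome}")
        if reason is not None:
            result.approved = False
    return result
```

Note the design choice: the gate does not stop at the first failure. Running all checks keeps the audit log complete, which matters when auditability is itself a requirement.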

Shopify: Harnesses for Context Distribution

Shopify's problem is almost the opposite: the commerce domain is hyper-fragmented, with different themes, plugins, merchant logics, and unpredictable customizations.

The primary risk, beyond causing damage, is producing something generically correct but locally useless.

For this reason, the harness must excel at contextual retrieval, access to internal documentation, merchant-state simulation, and precise distribution of relevant information.

The model must not only be safe but also operate with an accurate understanding of the merchant's specific context.
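
As a rough illustration of context distribution, the sketch below selects only the documents that match a given merchant's setup before anything reaches the model. It is an assumption-laden toy, not Shopify's system: MerchantContext, Doc, relevance, select_context, and the tag values are hypothetical, and a real harness would likely use richer retrieval than tag overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MerchantContext:
    """Hypothetical snapshot of what is specific to one merchant."""
    theme: str
    installed_apps: frozenset


@dataclass(frozen=True)
class Doc:
    title: str
    tags: frozenset


def relevance(doc: Doc, ctx: MerchantContext) -> int:
    # Score = number of tags the document shares with the merchant's setup.
    ctx_tags = {ctx.theme} | ctx.installed_apps
    return len(doc.tags & ctx_tags)


def select_context(docs: list, ctx: MerchantContext, limit: int = 2) -> list:
    """Return the most merchant-relevant docs, dropping generic ones entirely."""
    scored = [(relevance(d, ctx), d) for d in docs]
    relevant = [(s, d) for s, d in scored if s > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in relevant[:limit]]
```

The point of dropping zero-score documents, rather than merely ranking them last, is the risk named above: generically correct material crowds out what is locally useful.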

Airbnb: Harnesses as Perceptual Verifiability

In customer-facing and UI-heavy workflows, the problem changes once again, because an agent can propose a technically reasonable modification while still breaking selectors, navigation, UX flows, or intermediate states.

In cases like Airbnb, the harness emphasizes browser instrumentation, screenshot verification, replayability, and control over executed actions.

The core question becomes: does the action actually produce the intended effect in the user environment?

The harness therefore becomes a perceptual surface.
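
The act-then-observe loop can be sketched without a real browser. In the Python example below, a plain dictionary stands in for whatever the harness actually observes (DOM state, a screenshot diff), and VerifiedAction, Trace, and run_verified are hypothetical names. It is a sketch of the pattern under those assumptions, not a description of Airbnb's tooling.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for an observed page snapshot (DOM state, screenshot hash, ...).
PageState = dict


@dataclass
class VerifiedAction:
    name: str
    perform: Callable[[PageState], PageState]  # what the agent does
    expect: Callable[[PageState], bool]        # what must be observable afterwards


@dataclass
class Trace:
    """Recorded steps, kept so that a run can be inspected or replayed later."""
    steps: list = field(default_factory=list)


def run_verified(actions: list, state: PageState, trace: Trace) -> bool:
    """Perform each action, check its observable effect, stop at the first miss."""
    for action in actions:
        state = action.perform(state)
        ok = action.expect(state)
        # Store a copy of the state so later steps cannot alter the record.
        trace.steps.append((action.name, dict(state), ok))
        if not ok:
            return False
    return True
```

Each action carries its own expectation, so "the call returned without error" is never treated as proof that the intended effect occurred in the user environment.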

From Best Practices to a Domain-Specific Discipline

What these cases show is that a harness is not a universal checklist of components.

A harness is a response to the failure modes that each organization considers economically most dangerous:

for OpenAI, the risk is codebase entropy;
for Anthropic, it is cognitive drift;
for Stripe, regulatory side effects;
for Shopify, the loss of situational context;
for Airbnb, the non-verifiability of actions.

Same word, completely different problems.

The Real Maturity of the Industry Is This Fragmentation

We often interpret fragmentation as a lack of standards. But I would argue the opposite can also be true: when a discipline is young, everyone uses the same generic formulas; as it matures, specialized architectures begin to emerge.

That is perhaps exactly what is happening with harness engineering.

As with choosing a TypeScript framework, we are now entering the phase where the real question becomes: which harness architecture is most coherent with the type of risk my agent cannot afford?