DEV Community: Sebastian Chedal

Anatomy of an Agent Harness: 7 Components You Should Audit

Sebastian Chedal — Thu, 28 May 2026 00:44:02 +0000

You’re past the pilot. The agent works in demos and probably in staging, and now somebody is asking the real buying question: will it hold up when nobody is watching? That question doesn’t resolve at the model layer. It resolves in the layer of code, configuration, and execution logic that sits around the model, what the industry has started calling the harness.

Every agent harness has seven components. We arrived at these seven after reviewing eight articles published by our peers between March and April 2026: LangChain, Salesforce, Firecrawl, Atlan, Fowler & Boeckeler, Osmani, Schmid, and Hands on Architects.

By the end of this article, I hope you will have a deeper understanding of what a harness is, what it does, the different ways it can fail, and how to assess the quality of a harness, whether it is one you already run or one a vendor is proposing to build for you.

There’s a lot of interest in this topic right now: Anthropic’s annualized revenue grew from $14B in mid-February to over $30B by April 2026. The market is buying. What it’s buying is models. But what’s deciding whether those models earn their keep in production is the harness layer.

One clarification before the components, because the word agent is doing too much work in 2026. When we say “agent” here, we mean model plus harness running self-directed work, not workflow-LLM patterns where every step is human-scheduled. A workflow with an embedded LLM call needs prompt management and an error handler; an agent doing self-directed work needs the entire harness, which is what we are discussing in this article.

The seven components of the harness

The eight sources above each name a different subset of components. Synthesizing where they agree yields seven:

Component	What it is	How it can fail
Execution Sandbox	What the agent runs as and its permissions.	Broad permissions + long-horizon agent creates outsized risk radius.
Auth Identity	Who the agent is to external systems.	Shared API keys prevent auditing; child agents break revocation chains.
Memory & Context	What persists, what compacts, what discards.	Uncompacted context growth leaks cost; no garbage collection.
Tool Calls	How the agent interacts with and reaches the world.	Transient tool failures trigger runaway retry storms.
Orchestration	Single-agent loop vs multi-agent handoff, and who owns state.	Multiple agents conflict over stale views of unowned shared state.
Cost Governance	What stops a runaway charge before the credit card bill tells you.	Lack of pre-flight circuit breakers allows sudden, massive token spend.
Observability	What you can answer the morning after.	Logs confirm a failure occurred but lack structure to explain why. Or no logs at all.

Component 1: Execution sandbox

Your execution sandbox decides what the agent runs as, where it runs, and what it can reach: filesystem, network, processes, databases, and infrastructure. The decision is your risk radius, and it has to be made before deploy because retrofitting sandboxing later is rip-and-replace work.

The architectural choices fall along a spectrum: container-level isolation, process-level isolation, OS-level isolation, or hardware-level isolation with policy engines on top. See, for example, NVIDIA’s NemoClaw approach with its OpenShell and scoped permissions.

The clearest recent worked example sits in the AI Incident Database, citation 1442: in mid-December 2025, AWS Cost Explorer in one mainland China region reportedly had an approximately 13-hour interruption after Kiro, an internal Amazon AI coding tool, was reportedly allowed to delete and recreate part of the working environment. Amazon disputed the AI-causation account and attributed the issue to user error and misconfigured access controls. Under either account, the root cause was the same: the permissions available in the agent’s sandbox were broader than the work required.

Ask yourself: What can this agent do that I’d be unwilling to let a junior engineer do on day one, and what stops it from doing that?
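One way to make that question concrete is a deny-by-default policy check in front of every action the agent attempts. The sketch below is an illustration only: the `SandboxPolicy` class, its fields, and the example paths and hosts are hypothetical, and a check like this complements real isolation at the container, OS, or hardware layer rather than replacing it.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical deny-by-default policy: anything not listed is refused."""
    writable_roots: Tuple[str, ...] = ()
    allowed_hosts: FrozenSet[str] = frozenset()
    allow_delete: bool = False

    def can_write(self, path: str) -> bool:
        target = PurePosixPath(path)
        # Refuse any path containing ".." instead of trying to resolve it.
        if ".." in target.parts:
            return False
        return any(target.is_relative_to(root) for root in self.writable_roots)

    def can_delete(self, path: str) -> bool:
        # Deleting requires both the delete flag and write access to the path.
        return self.allow_delete and self.can_write(path)

    def can_connect(self, host: str) -> bool:
        return host in self.allowed_hosts


# Example policy (paths and host are placeholders).
policy = SandboxPolicy(
    writable_roots=("/workspace/agent",),
    allowed_hosts=frozenset({"api.internal.example"}),
)
```

With this policy the agent can write inside its own workspace but cannot delete anything, because `allow_delete` defaults to `False`. Broadening the radius then becomes an explicit, reviewable change.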

Component 2: Identity and authentication

Identity and authentication answers a question most teams skip in the rush to ship: who is the agent? More practically: who is it to external systems, what credentials does it carry, and what audit trail does it leave when it acts? The decision is whether to give each agent a dedicated service account with scoped permissions, run it under a shared API key, or have it impersonate a human user.

The Gravitee 2026 State of AI Agent Security report is the cleanest 2026 data on what production teams are actually doing here. The picture is sobering:

Only 21.9% of teams treat AI agents as independent, identity-bearing entities
45.6% rely on shared API keys for agent-to-agent authentication
25.5% of deployed agents can create and task another agent

When an agent on a shared key spawns child agents and one of them does something costly, the chain of command becomes harder to control and audit. The failure pattern to watch for is the combination (a shared key, plus multiple agents, plus the ability to spawn), not any single decision. In our experience, this tends to be the component teams promise to “fix later,” only to discover that “later” means after a costly incident.

Ask yourself: If this agent did something costly in the next hour, could I tell which agent did it, and could I revoke just that agent’s access without breaking the others? How do I manage sub-agent spawning? Are my keys shared too broadly across multiple agents or systems?
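The revocation question can be sketched in a few lines. The example below assumes a hypothetical `AgentIdentity` record in which each agent carries its own ID and scopes, a child can only receive scopes its parent already holds, and revoking an agent also disables everything it spawned. The names and scope strings are illustrative and are not taken from any particular identity provider.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass
class AgentIdentity:
    agent_id: str
    scopes: FrozenSet[str]
    parent: Optional["AgentIdentity"] = None
    revoked: bool = False

    def spawn(self, agent_id: str, scopes: Iterable[str]) -> "AgentIdentity":
        requested = frozenset(scopes)
        # A child may only receive scopes its parent already holds.
        extra = requested - self.scopes
        if extra:
            raise PermissionError(f"{agent_id} requested unheld scopes: {sorted(extra)}")
        return AgentIdentity(agent_id, requested, parent=self)

    def is_active(self) -> bool:
        # Revoking any ancestor disables the whole subtree below it.
        node: Optional["AgentIdentity"] = self
        while node is not None:
            if node.revoked:
                return False
            node = node.parent
        return True

    def can(self, scope: str) -> bool:
        return self.is_active() and scope in self.scopes


root = AgentIdentity("planner", frozenset({"tickets:read", "tickets:write"}))
reader = root.spawn("planner/reader", {"tickets:read"})
writer = root.spawn("planner/writer", {"tickets:write"})

reader.revoked = True  # revoke one child; its sibling keeps working
```

Because every action can be attributed to an `agent_id` with a parent chain, the audit trail and the revocation path follow the same structure.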

Component 3: Memory and context

Memory and context describes what persists across runs, what gets compacted into smaller representations, and what gets discarded. Context-rot and compaction are first-class harness primitives. From our experience operating memory and context controls: the coupling between memory and cost tends to be tighter than either treatment suggests; we’ll get to that in Component 6.

A strong harness here starts with deciding what stores state: vector retrieval, structured state, or a hybrid. It also needs token discipline at the prompt-construction layer (deciding what gets included on each turn) and a compaction policy (deciding when long histories collapse into summaries). Your context window is your “RAM”, and a harness with no compaction policy is a process that never frees memory.

Failures here are easy to miss: the agent still appears to work, but each turn pulls in more context than the last, per-task spend grows, and accuracy slowly declines. The architectural fix sits in the memory layer, which is where teams typically look last because the agent is still “working.”

Ask yourself: Where does this agent’s state actually live, and how am I managing memory and context across my agent network?
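A compaction policy can be as small as one function. The sketch below is a simplified illustration under stated assumptions: token counts are approximated by whitespace-separated words, and `placeholder_summary` stands in for whatever summarizer a real harness would use (often a call to a cheaper model). Both are hypothetical.

```python
from typing import Callable, List


def rough_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def compact(turns: List[str], budget: int, keep_recent: int,
            summarize: Callable[[List[str]], str]) -> List[str]:
    """Collapse older turns into one summary once the history exceeds the budget."""
    if sum(rough_tokens(t) for t in turns) <= budget:
        return turns
    split = len(turns) - keep_recent
    if split <= 0:
        # Nothing old enough to compact; the caller has to trim some other way.
        return turns
    old, recent = turns[:split], turns[split:]
    return [summarize(old)] + recent


def placeholder_summary(old_turns: List[str]) -> str:
    # Hypothetical summarizer used only to keep the example self-contained.
    return f"[summary of {len(old_turns)} earlier turns]"


history = ["one two three four"] * 5  # roughly 20 "tokens"
compacted = compact(history, budget=10, keep_recent=2, summarize=placeholder_summary)
```

The useful property is that the policy is explicit: the budget and the number of turns kept verbatim are parameters someone can review, rather than an accident of how the prompt is assembled.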

Component 4: Tool calls

Tool calls covers how the agent reaches the world: the tool registry, the calling protocol, the error-recovery behavior. Are tools exposed via an MCP-native registry, hand-wrapped APIs maintained internally, or framework-bundled tool packs you don’t control? The MCP server ecosystem expanded rapidly through early 2026, and most teams we work with end up with a mix of all three.

A serious risk with tools is a retry storm. This is when the agent calls a tool, the call fails transiently (a rate limit, a 503, a malformed response), and the harness has no policy distinguishing retryable from non-retryable failure modes. So the agent retries. And retries. And retries. The cost shows up before the alert does, and the upstream tool sometimes degrades further under the retry pressure.

Ask yourself: How should my tools be built and called? When this agent calls a tool and the call fails, what does it do, and what stops it from doing the same failed call 50 more times?
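The policy that prevents a retry storm has two parts: a classification of which failures are worth retrying, and a hard cap on attempts. The sketch below is a minimal version under assumptions: `ToolError` and the set of retryable status codes are hypothetical, and a real harness would also need to classify malformed responses and timeouts.

```python
import time
from typing import Callable

# Assumed classification: rate limits and transient upstream errors are retryable.
RETRYABLE_STATUS = {429, 502, 503, 504}


class ToolError(Exception):
    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"tool call failed with status {status}: {message}")
        self.status = status


def call_with_retry(tool: Callable[[], str], max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry only transient failures, with exponential backoff and an attempt cap."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolError as err:
            # Non-retryable errors and the final attempt propagate immediately.
            if err.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("loop always returns or raises")
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. A production version would likely add jitter and a per-tool circuit breaker on top of this.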

Component 5: Orchestration

Orchestration answers whether you have one agent in a loop or several agents handing off to each other, and whether the work is event-driven or scheduled. The load-bearing decision underneath is shared-state ownership: is there one canonical source of truth (a file, a database, a queue) that agents read and write through, or is state implicit and distributed across the agents themselves?

Multi-agent systems that fail in production tend to fail here. Two agents act on stale views of the same state, the merge logic was never specified, and the bug is invisible until it’s expensive. Anthropic’s published work on multi-agent research systems is a useful reference for what production adds to the orchestrator-subagent pattern; we covered that ground in our take on the multi-agent blueprint, which gets into the token-cost tradeoff for orchestration specifically.

The orchestration component is also where “we’ll just add another agent” tends to become technical debt. Each agent you add multiplies the number of state transitions you have to reason about, and if the system isn’t built around an explicit state owner from the start, the debt compounds.

Ask yourself: If two agents disagree about what’s true, which one wins, and how do I know? Can misinterpretations from one agent carry forward down the chain to other agents? How are handoffs done between agents?
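One common way to give shared state an explicit owner is a single store with versioned writes, so that an agent acting on a stale view is rejected instead of silently overwriting newer data. The sketch below assumes an in-memory `StateStore` with compare-and-set semantics. The class and its method names are hypothetical, and a production system would more likely get the same guarantee from a database or queue.

```python
import threading
from typing import Any, Dict, Tuple


class StaleWriteError(Exception):
    """Raised when a writer's view of the state is older than the stored version."""


class StateStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[int, Any]] = {}

    def read(self, key: str) -> Tuple[int, Any]:
        # Returns (version, value); unknown keys start at version 0.
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key: str, expected_version: int, value: Any, agent_id: str) -> int:
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise StaleWriteError(
                    f"{agent_id} wrote against v{expected_version}, "
                    f"but the store is at v{current_version}")
            self._data[key] = (current_version + 1, value)
            return current_version + 1


store = StateStore()
version_a, _ = store.read("order-17")   # agent A reads version 0
version_b, _ = store.read("order-17")   # agent B reads the same version
store.write("order-17", version_a, {"status": "approved"}, agent_id="agent-a")
# A write from agent B using version_b would now raise StaleWriteError.
```

Under this scheme the answer to “which one wins” is the first writer, and the loser finds out immediately, which forces the merge logic to be written down.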

Component 6: Cost governance

Cost governance covers what stops a runaway: token budgets, rate limits, kill switches, spend caps, pre-flight budget enforcement. Cost governance is the second half of the architectural pairing we flagged in Component 3. Bad memory designs leak cost; cost circuit breakers can’t fix poor context discipline. They can only cap the downside while the upstream architecture is fixed. We’ve written about how the optimization sequence actually plays out (script-first, caching last), and the same logic applies here: governance lives at the harness layer, not at the dashboard layer.

Ask yourself: What’s the maximum spend (in dollars and in irreversible state changes) this agent can incur in the next time interval (hour, day) without human approval?
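Pre-flight budget enforcement means the check happens before the call is made, not after the invoice arrives. The sketch below assumes a hypothetical `SpendGuard` that tracks spend for one time window in dollars. How the estimate is produced (for example from token counts and a price table) is left to the caller.

```python
class BudgetExceeded(Exception):
    pass


class SpendGuard:
    """Hypothetical spend cap for one time window, checked before each model call."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def reserve(self, estimated_usd: float) -> None:
        # Pre-flight: refuse the call if its estimate would cross the cap.
        if self.spent_usd + estimated_usd > self.cap_usd:
            raise BudgetExceeded(
                f"estimated {estimated_usd:.2f} USD would exceed the "
                f"{self.cap_usd:.2f} USD cap ({self.spent_usd:.2f} already spent)")
        self.spent_usd += estimated_usd

    def settle(self, estimated_usd: float, actual_usd: float) -> None:
        # Replace the reservation with the metered cost once the call returns.
        self.spent_usd += actual_usd - estimated_usd


guard = SpendGuard(cap_usd=2.0)
guard.reserve(1.5)        # allowed: 1.5 of 2.0
guard.settle(1.5, 1.0)    # the call turned out cheaper than estimated
```

Reserving the estimate first keeps concurrent calls from collectively overshooting the cap. Resetting the guard per hour or per day gives the time window the question above asks about.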

Component 7: Observability

Observability is what you can answer the morning after. Structured event logs, traces, cost and latency metering, decision audit trails. Observability quality tends to decide how fast you can recover from anything that goes wrong in the other six components.

The architectural decision is whether to emit structured event logs at the harness layer (queryable later), scrape ad-hoc logs from individual agents (slower, lossy), or rely on vendor-provided dashboards (good for some questions, bad for the questions you didn’t anticipate). The trade-offs and what they look like at each deployment stage are the topic of our piece on operational decisions at each deployment stage. The three-monitoring-layers question, in particular, lives in this component.

Your risk here: something goes wrong overnight, and the team can answer that something went wrong (the bill, the alert) but not why. The runbook says check the logs; the logs were never structured to answer this kind of question.

Ask yourself: What can I answer about what this agent did yesterday, and how long does the answer take to produce? How quickly do we get notified for issues?
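Structured logging at the harness layer can start as one JSON object per event, written with a consistent set of keys. The sketch below is an assumption-laden illustration: the field names (`run_id`, `agent_id`, `kind`, `cost_usd`) are hypothetical rather than a standard schema, and `cost_by_agent` shows one example of a morning-after question that becomes a short query once events are structured.

```python
import json
import time
import uuid
from typing import Any, Dict, Iterable


def make_event(run_id: str, agent_id: str, kind: str, **fields: Any) -> str:
    """Serialize one harness event as a JSON line (field names are illustrative)."""
    record: Dict[str, Any] = {
        "ts": time.time(),
        "event_id": uuid.uuid4().hex,
        "run_id": run_id,
        "agent_id": agent_id,
        "kind": kind,
    }
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def cost_by_agent(lines: Iterable[str]) -> Dict[str, float]:
    """Sum the metered cost of model calls per agent from a stream of JSON lines."""
    totals: Dict[str, float] = {}
    for line in lines:
        event = json.loads(line)
        if event["kind"] == "model_call":
            agent = event["agent_id"]
            totals[agent] = totals.get(agent, 0.0) + event.get("cost_usd", 0.0)
    return totals


log = [
    make_event("run-1", "planner", "model_call", cost_usd=0.5, model="large"),
    make_event("run-1", "fetcher", "tool_call", tool="search", status=503),
    make_event("run-1", "planner", "model_call", cost_usd=0.25, model="large"),
]
totals = cost_by_agent(log)
```

The same event stream can answer “which tool failed and how often” or “which run triggered the spike” without touching the agents themselves.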

The 30-minute audit checklist

These seven questions can help you prevent the failure patterns while getting your harness ready for production:

Execution sandbox. What can this agent do that I’d be unwilling to let a junior engineer do on day one, and what stops it from doing that?
Identity and authentication. If this agent did something costly in the next hour, could I tell which agent did it, and could I revoke just that agent’s access without breaking the others?
Memory and context. Where does this agent’s state actually live, and what tells me when it’s growing in a way it shouldn’t?
Tool calls. When this agent calls a tool and the call fails, what does it do, and what stops it from doing the same failed call 50 more times?
Orchestration. If two agents disagree about what’s true, which one wins, and how do I know?
Cost governance. What’s the maximum spend (in dollars and in irreversible state changes) this agent can incur in the next hour without human approval?
Observability. What can I answer about what this agent did yesterday, and how long does the answer take to produce?

A vendor demo or an internal architecture review that gets clean, specific answers has made a good start at designing an effective harness layer. A demo where two or three answers turn into “we’re planning to add that” is a system where the production-readiness work hasn’t been done yet.

Bottom line

A model without a solid harness is just a brain in a jar. The harness gives your agent the eyes, ears, and supporting systems it needs to operate effectively in your business environment. We hope these questions give you a head start, whether you are evaluating your own progress or a vendor you are considering as a partner for building agentic applications.

If running the harness layer yourself isn’t where you want to spend your time, we build and operate agentic systems for clients. You can learn more about our managed autonomous AI agents, or contact us to find out more.

FAQ

What is an agent harness in AI?

An agent harness is every piece of code, configuration, and execution logic around the model. LangChain’s Vivek Trivedy describes it as “every piece of code, configuration, and execution logic that isn’t the model itself.” The model is the reasoning core; the harness is the operational software around it that handles tools, memory, identity, sandboxing, orchestration, cost controls, and observability. In production agent systems, the harness tends to determine whether the model’s output translates into reliable work.

What is the difference between an agent harness and an agent framework?

An agent framework (LangChain, LangGraph, AutoGen, CrewAI, and similar) is a library that gives you primitives for building agents: chains, tool-calling abstractions, memory interfaces. A harness is the integrated runtime that sits around the model in production, including everything the framework provides plus the things frameworks don’t: sandbox policies, identity boundaries, cost governors, observability pipelines. Firecrawl’s April 2026 piece draws this distinction clearly: a framework helps you build; a harness is what runs the result.

What are the components of an agent harness?

The union view across the eight major published definitions consolidates into seven components:

Execution sandbox: where it runs, with what access
Identity and authentication: who it is to external systems
Memory and context: what persists and what compacts
Tool calls: how it reaches the world
Orchestration: single-agent loop vs multi-agent handoff, and who owns state
Cost governance: what stops a runaway
Observability: what you can answer the morning after

Why does the harness matter more than the model?

Through early 2026, eight major publishers (LangChain, Salesforce, Firecrawl, Atlan, Fowler and Boeckeler, Osmani, Schmid, Hands on Architects) independently shipped harness-definition pieces. That convergence points to the harness as the decisive layer for production reliability. The model handles reasoning; the harness handles whether that reasoning translates into reliable work.

How do I evaluate whether an agent system is production-ready?

The 30-minute checklist above is the short version: seven operator questions, one per component. A system that answers all seven cleanly has been architected through the harness layer. A system that slides into “we’re adding that” on two or three components has work ahead, and the production-readiness timeline is probably longer than the demo suggests. The Gravitee 2026 report found 21.9% of teams treating agents as identity-bearing entities, which is a useful sanity check on what “ready” looks like across the field. Most production systems still have meaningful gaps, and naming them honestly is more useful than papering over them.

GEO Measurement: The KPIs That Generate Actual Results (Not just vanity metrics)

Sebastian Chedal — Sat, 23 May 2026 10:52:04 +0000

The dominant question in generative engine optimization right now is whether your brand shows up in AI answers. The harder, more useful question is whether the AI recommends you when a buyer asks the comparison prompt that ends the decision. Those two outcomes are decoupled. The same AI conversation can pull a quote from your site and then, in the next breath, recommend a competitor to the same user.

That gap between being cited and being recommended is what the published GEO measurement frameworks tend to overlook. They count citations, average them across engines, and report a single “visibility score.” All three moves erase the signal you actually need.

I believe this is a leftover from SEO, where getting cited was enough because people would then click through from the search results. With GEO, your customer has an entire conversation with the AI and moves through every funnel stage without visiting your site. They learn about their problem or need, compare competitors, and ultimately select a vendor without ever leaving the chat window.

Getting cited isn’t enough; you need to be recommended by the AI as the best, or at least one of the best, options in its class.

The measurement gap is a targeting problem, not a tooling problem

The measurement picture is no better: 62% of marketing leaders say they cannot measure the ROI of their AI search optimization efforts, according to a 2025 Conductor survey reported via GenOptima. The default reading of that number is that the field is under-tooled, and that better dashboards or more granular tracking would close the gap.

The deeper problem is that the published frameworks are not failing to measure enough things. They are mostly measuring the wrong outcome.

The leading guides (GenOptima’s six-KPI framework, UpGrowth’s seven KPIs, Stellar’s three-tier model, Digital Bloom’s ROI procedure) each capture a real piece of the measurement stack. Read together, they recommend:

Citations
Mentions
Sentiment
Share of voice
Position
Source coverage

…and a half-dozen named composites. What none of them resolve is the distance between an AI answer that quotes you, and an AI answer that recommends you.

Being cited is not being recommended — and the AI knows the difference

A buyer asks Claude: “What’s the best AI consulting firm for mid-market manufacturers in the Pacific Northwest?” The answer pulls a quote about regional manufacturing trends from your blog post. The same answer, two sentences later, recommends three competitors as the firms to actually contact.

You got the citation. You did not get the recommendation. The user closes the tab and starts emailing your competitors.

This pattern is more common than the citation-counting frameworks acknowledge. Citation and recommendation are decoupled outcomes — they are produced by different parts of the AI’s reasoning, draw from different signals on your site, and respond to different optimizations. Most published frameworks treat citation rate as the headline KPI. It is a leading indicator at best. It tells you the AI knows something about you. It does not tell you the AI picks you when the question gets to the comparison stage.

The right primary KPI is recommendation rate at buyer-intent prompts. Not “does the engine mention your brand somewhere in a 600-word answer about the industry” but “does the engine name you when a real buyer asks the question that ends in a purchase decision.” That requires building a prompt set that mirrors the comparison questions your buyers actually ask — not the head terms you would target in traditional SEO, and not the broad industry queries that produce friendly mentions without conversion intent.

A useful working definition: track recommendation rate as the percentage of buyer-intent prompts in which an AI engine names your brand as a recommended option (not merely cites a source from your domain). Measure per engine, across a stable prompt set you can re-run monthly. For teams running thought-leadership programs that aim higher up the funnel, the same measurement works at the awareness stage — “what should I read about ____” prompts where the recommendation is to subscribe, watch, or follow rather than to buy. The mechanic is the same; the prompt set changes.
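The working definition above reduces to simple counting once each answer has been labeled. The sketch below assumes a hypothetical `PromptResult` record produced by a reviewer (or a separate labeling step) who marks, for each buyer-intent prompt and engine, whether the answer cited your domain and whether it named your brand as a recommended option. Engine names and prompts in the usage lines are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PromptResult:
    engine: str
    prompt: str
    cited: bool        # a source from your domain appears in the answer
    recommended: bool  # your brand is named as a recommended option


def rates_by_engine(results: Iterable[PromptResult]) -> Dict[str, Dict[str, float]]:
    """Citation rate and recommendation rate per engine over a labeled prompt set."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"prompts": 0, "cited": 0, "recommended": 0})
    for r in results:
        c = counts[r.engine]
        c["prompts"] += 1
        c["cited"] += int(r.cited)
        c["recommended"] += int(r.recommended)
    return {
        engine: {
            "citation_rate": c["cited"] / c["prompts"],
            "recommendation_rate": c["recommended"] / c["prompts"],
        }
        for engine, c in counts.items()
    }


sample = [
    PromptResult("engine_a", "best firm for X?", cited=True, recommended=True),
    PromptResult("engine_a", "who should I hire for Y?", cited=True, recommended=False),
    PromptResult("engine_b", "best firm for X?", cited=False, recommended=False),
]
rates = rates_by_engine(sample)
```

Keeping `cited` and `recommended` as separate fields is the point: the gap between the two rates on the same prompt set is the signal a single visibility score hides.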

Citation rate still matters as a leading indicator. It usually predicts which brands will eventually become recommendation candidates. But reporting citation rate without recommendation rate is reporting the dress rehearsal as if it were opening night.

Per-engine spread is the load-bearing KPI — aggregate scores lie

Profound’s analysis of 100,000 prompts across ChatGPT and Perplexity found that 89% of AI citations come from different sources depending on which engine the user queried, and only 11.0% of domain citations appeared in both models.

Avoid tools that report only a single “AI visibility score” averaged across engines. The aggregate number tells you nothing about which engine your buyers are using, which engine you are losing on, or which engine your next content investment should target.

What to track instead, in three slots: per-engine citation rate, per-engine recommendation rate, and the variance across them as its own metric. Call that last one per-engine spread. A brand with a 40% recommendation rate on Perplexity and a 5% rate on ChatGPT has a per-engine spread that tells you exactly where the optimization work needs to go.
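Per-engine spread can be computed in more than one way, and the text does not pin down a formula. The sketch below makes an assumption: it reports both the range (best engine minus worst engine) and the population standard deviation of the per-engine rates, along with which engine is weakest. The function name is hypothetical, and the usage lines reuse the 40% and 5% figures from the example above.

```python
from statistics import pstdev
from typing import Dict, Union


def per_engine_spread(rates: Dict[str, float]) -> Dict[str, Union[float, str]]:
    """Summarize how far apart per-engine rates are (assumed definition)."""
    if len(rates) < 2:
        raise ValueError("spread needs rates from at least two engines")
    weakest = min(rates, key=rates.get)
    strongest = max(rates, key=rates.get)
    return {
        "range": rates[strongest] - rates[weakest],
        "stdev": pstdev(list(rates.values())),
        "weakest": weakest,
        "strongest": strongest,
    }


spread = per_engine_spread({"Perplexity": 0.40, "ChatGPT": 0.05})
print(round(spread["range"], 2), spread["weakest"])
```

Reporting the weakest engine alongside the number keeps the metric actionable: it names where the next round of optimization work should go.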

Per-engine spread also doubles as a noise check on vendor reports. If a tool gives you a single composite score and refuses to break it down by engine, the report is functionally unverifiable. You cannot act on a number you cannot decompose.

The three KPIs that survive contact — and how to rank them

There are only three core KPIs you can and should be tracking. The others (citation rate, share of voice, and so on) are often vanity metrics that don’t translate into conversions:

Recommendation rate at buyer-intent: the conversion-stage signal. This is the synthesis layer, track it per engine and per topic.
Competitors mentioned: How many competitors are mentioned, spread again per engine.
Sentiment: the qualities and tone of how the AI describes you. How does the AI rank you against others? What are your known weaknesses and when does it recommend against your business?

Increasing Recommendation Strength

Once you have a solid understanding of who is being recommended, how often that is you, and in what light AI engines view you, you can start to take action.

Common areas to focus on:

Your business is missing information people want to know
The AI doesn’t know the answer to a client’s question, so it can’t recommend you
Your data on your website is not specific enough, leading to other vendors who have specific data being recommended

Some tips on how you can increase coverage:

Load in all your customer sales inquiries and look at that data to determine if all their questions are also on your website
Create customer profiles and use these to generate synthetic questions, then answer them on your website
Run AI Agents with your client persona with the mission of finding a provider/seller and then audit the results of their journey, apply the learnings
Check your site against schema validation and add elements to your website that are missing from the schema review
Build a network of listicles and reviews from 3rd party sites that strengthen your brand
Review all the fanout queries from all research performed by each AI query and then turn those into your SEO targets. These are often long-tail phrases with very low competition and they are the queries AI is using to research and make its determinations

Closing Thoughts

People are using AI to help them make purchasing decisions, and this will only increase, both in the number of people doing it and in how much of the purchasing process they delegate to AI engines. AI is taking over the brain-power people used to spend filtering and selecting their best options.

This means the decision-making process is becoming more opaque. Don’t get distracted by vanity KPI theater, where you start measuring how often your stats are quoted by an AI, only to wonder why your sales are down.

Understanding how AI makes decisions, and then being able to demonstrate to your customers the value you are bringing them is challenging now. But I believe if you focus first and foremost on the money (what actually makes a difference) you can use this as your beacon to navigate through the wires of AI noodle-brains and get the results your customers actually want and need.

FAQ

How many AI engines should you track for GEO measurement?

Track the engines your buyers actually use, then layer in the engines whose citations propagate to other models. For most B2B audiences in 2026 that means ChatGPT, Perplexity, Claude, Gemini, and Google AI Overviews as the core five, with Copilot and Grok as secondary depending on audience. Tracking fewer than three means you cannot measure per-engine spread, which is the whole point.

What is a good GEO citation rate to aim for?

Citation rate only measures how often your brand is mentioned; it does not track how often your brand is recommended. If your goal is to get actual recommendations, shift away from chasing more citations and focus on getting recommended.

Can you measure GEO without expensive tools?

Yes, especially for a single business. A weekly spreadsheet covering 10 prompts across three engines produces enough signal to learn the shape of your engine-by-engine picture. Most of the work you need to do is foundational. GEO tracking tools become useful afterward, for knowing how often you are being recommended per prompt and per customer category.

How does GEO measurement differ from traditional SEO measurement?

Traditional SEO measurement assumes a single engine and a clickable ranking as the outcome. GEO measurement runs on different assumptions. The practical differences:

Multiple engines, no consensus. The engines disagree with each other on which sources to cite, so per-engine reporting becomes non-negotiable.
Recommendation events replace clicks as the conversion signal, because the primary outcome no longer produces a click.
Attribution requires explicit channel-grouping work in analytics, because AI-referred traffic does not always carry a recognizable referrer.
Keyword ranking still matters for fanout queries that AI conversations trigger, but it stops being the headline number. If you want to get recommended, show up in the fan-outs.

AI Cost Optimization: A Practitioner Framework

Sebastian Chedal — Mon, 18 May 2026 18:07:06 +0000

An AI system that’s starting to cost real money is a different problem from an AI prototype, whose job was to prove a model could do the thing. The production system’s job is to do the thing at a margin that justifies its existence. Teams usually cross that line without noticing. The bill climbs steadily, then jumps, then someone runs the math and the project is suddenly under cost review.

This is some of the work we do for clients. We get hired to come in, review an AI system that’s working but expensive, find the architectural waste, and bring the spend down without dropping quality. The framework in this article is the approach we actually use.

In this article:

Why cost optimization is quality optimization in disguise, and how to tell when you’ve crossed into degradation
The Script-vs-LLM Substitution Rule and the misallocation question
Dispatcher-First Cost Architecture: the architectural decision that produces the largest savings
Why agent decomposition lowers cost AND raises accuracy
The Haiku scratchpad case: getting Sonnet-quality answers at Haiku prices by changing the prompt
The optimization sequence, ordered by ROI per engineering hour
The Accuracy-Speed-Cost Triangle: the ceiling you meet after the structural work is done

If runaway cost is the failure mode you’re worried about, the AI Agent Cost Circuit Breaker covers the reactive side. This article is the proactive side: how to design a system that doesn’t run away in the first place.

Cost optimization is quality optimization in disguise

The most common framing of AI cost optimization treats cost and quality as a tradeoff dial: turn the cost down, accept some quality loss, find the spot you can live with. That framing is wrong, and it produces the wrong techniques.

The goal of cost optimization is to make the process more efficient, more accurate, and often faster. When you go deep on cost optimization, you end up doing a careful analysis of the process: what each step actually does, what model tier each step actually needs, which calls shouldn’t be model calls at all. That analysis improves the system on every axis, and lower cost emerges as a consequence of it.

Cost optimization that drops quality below tolerance is just the wrong solution. That’s degradation of service. If a “savings” plan ends with the system producing worse outputs, it didn’t optimize. It switched to a different, worse system.

This lens changes the question you ask of every technique. Instead of “how much cheaper does this make us?” the question is “does this improve the system or does it degrade it?” Techniques that improve the system on multiple axes (accuracy, speed, reliability, cost) are the ones to chase first. Techniques that trade quality for cost belong last, sparingly, and only when the quality drop is genuinely tolerable for the use case. The industry literature corroborates the connection. aisuperior.com frames systematic optimization as producing both cost reductions and quality improvements together. The same analysis that finds the waste also finds the quality bugs.

The Script-vs-LLM Substitution Rule

The largest savings in most AI systems aren’t hiding in model selection. They’re hiding in calls that should never have been LLM calls at all.

The heuristic is the Script-vs-LLM Substitution Rule: scripts for determinism, LLMs for judgment. If a task has a defined input shape and a defined output shape, and the transformation between them is mechanical, a script does it exactly, in milliseconds, for fractions of a cent. The moment you put an LLM in that spot, you’ve added cost, latency, and a non-zero error rate to a task that didn’t need any of them.

The substitution candidates show up in almost every AI system once you go looking. File-existence checks, status notifications, structured-data comparisons, format conversions, date math, URL canonicalization. Every one of these running on a premium reasoning model is dollar-bleed without quality justification, and the failure modes (hallucinated dates, off-by-one comparisons) are worse than the script equivalents.

The boundary case matters. When judgment is genuinely required (ambiguous input, context-dependent interpretation, decisions that require reading subtext or weighing trade-offs), the direction reverses. Don’t script what genuinely needs an LLM. Scripts for the deterministic stuff, LLMs for the judgment stuff, and don’t mix them up.
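Two of the substitution candidates named above, URL canonicalization and date math, fit in a few lines of standard-library Python. The sketch below is illustrative: the canonicalization rules (lowercasing the host, dropping common tracking parameters, sorting the query, removing the fragment) and the helper names are assumptions, and the business-day count ignores holidays.

```python
from datetime import date, timedelta
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters to drop; adjust for your own traffic.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def canonicalize_url(url: str) -> str:
    """Normalize a URL deterministically so equal pages compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(sorted(query)), ""))


def business_days_between(start: date, end: date) -> int:
    """Count weekdays in the half-open range [start, end); holidays are ignored."""
    days = (end - start).days
    return sum(1 for i in range(days)
               if (start + timedelta(days=i)).weekday() < 5)


print(canonicalize_url("HTTPS://Example.com/Blog/?utm_source=x&b=2&a=1#top"))
print(business_days_between(date(2026, 5, 18), date(2026, 5, 25)))
```

Both functions return the same answer every time for the same input, which is the property an LLM call in the same spot cannot guarantee.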

This is the same insight at the center of our Four Axes of AI Agent Efficiency framework. The Script-It axis specifically targets entire sessions that shouldn’t have been LLM calls in the first place. In production audits we’ve found this is consistently the largest single cost lever, bigger than model downgrades, prompt compression, or caching.

The stakes for getting this wrong are non-trivial. Gartner has projected that over 40% of agentic AI projects will be canceled by 2027 due to escalating costs and unclear value. A large share of that escalation traces back to LLM-everywhere architecture, putting an expensive reasoning model into spots where a five-line script would have served. The substitution rule is the cheapest, fastest fix for a runaway bill. And there’s no trade hiding under it: the script is cheaper, faster, and more accurate than the call it replaces.

Dispatcher-First Cost Architecture

The single highest-leverage architectural decision in AI cost optimization is putting a lightweight dispatcher in front of every premium-model call. We call this Dispatcher-First Cost Architecture: every inbound task routes through a gatekeeper (a script or a low-cost model) that decides which downstream agent or model handles it. No speculative engagement of high-cost models.

The academic backbone is well-established. Stanford’s FrugalGPT paper showed that a cascade architecture (try cheaper models first, escalate on failure) can match GPT-4 performance with up to a 98% cost reduction across natural language tasks. The RouteLLM framework from LMSYS reached similar territory on MT Bench, with over 85% cost reduction while retaining 95% of GPT-4’s performance.

The lesson under the numbers is more useful than the percentages themselves. The majority of queries don’t need the most expensive model. A trained dispatcher classifies task complexity and routes accordingly; the premium model gets engaged only when the cheaper tier fails or the complexity score crosses a threshold.
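A minimal sketch of that routing decision, assuming a complexity score between 0 and 1 already exists (from a cheap classifier or a heuristic). The tier names, the threshold, and the task kinds are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical tier names and threshold; tune both against your own workload.
CHEAP_TIER = "small-model"
PREMIUM_TIER = "premium-model"
COMPLEXITY_THRESHOLD = 0.6

# Task kinds that never need a model at all (the substitution rule).
SCRIPTED_TASKS = {"file_exists", "date_math", "format_conversion"}


@dataclass
class Task:
    kind: str
    complexity: float  # 0.0 to 1.0, produced upstream by a cheap scorer


def dispatch(task: Task) -> str:
    """Pick a handler without ever calling the premium model to decide."""
    if task.kind in SCRIPTED_TASKS:
        return "script"
    if task.complexity >= COMPLEXITY_THRESHOLD:
        return PREMIUM_TIER
    return CHEAP_TIER
```

The scripted check comes first on purpose: a task that can be scripted should never reach the complexity comparison, however it was scored.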

Here’s how this looks in our own content pipeline. We run an autonomous agent stack on Anthropic Claude Opus, Sonnet, and z.ai GLM-5, with daily spend in the $15-20 range. Each pipeline stage is pinned to the model tier the task actually needs: GLM-5 for data gathering, Opus only when synthesis or judgment is required, Sonnet for art direction. The dispatcher isn’t a separate service; it’s the stage definition itself, because we pre-classified each stage during architecture. A config bug that sent all six content stages to Opus tripled the per-article cost before we caught it. Per-stage model pinning is what makes that recoverable.
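Per-stage pinning can be as small as a lookup table plus a sanity check. The sketch below is an assumption about how such a config might look, not the pipeline's actual code: the stage names are invented, the tier labels mirror the ones named above, and the 50% guardrail is an arbitrary illustrative value.

```python
# Hypothetical stage names; tier labels follow the article's description.
STAGE_MODELS = {
    "data_gathering": "glm-5",
    "synthesis": "opus",
    "art_direction": "sonnet",
}
PREMIUM = "opus"
MAX_PREMIUM_SHARE = 0.5  # assumed guardrail, not a figure from the article


def model_for(stage: str) -> str:
    """Fail loudly on an unpinned stage instead of defaulting to the premium tier."""
    try:
        return STAGE_MODELS[stage]
    except KeyError:
        raise ValueError(f"stage {stage!r} has no pinned model") from None


def premium_share(pins: dict[str, str]) -> float:
    """Fraction of stages pinned to the premium tier."""
    return sum(1 for model in pins.values() if model == PREMIUM) / len(pins)


def check_pins(pins: dict[str, str]) -> None:
    """Raise when too many stages point at the premium tier (the config-bug case)."""
    share = premium_share(pins)
    if share > MAX_PREMIUM_SHARE:
        raise ValueError(f"{share:.0%} of stages are pinned to {PREMIUM}")
```

Running a check like `check_pins` at deploy time is one way a bug that routes every stage to the top tier could surface before the bill does.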

Dispatcher architecture earns its complexity when task complexity varies significantly. On a uniform workload, the dispatcher adds latency, code surface, and a place for bugs to hide without giving you a savings lever to pull. The decision rule: if your workload has at least two distinguishable complexity tiers (and most do, once you look), the dispatcher pays for itself. If everything is genuinely a high-end reasoning task, route directly and skip the dispatcher.

Model pinning at the dispatcher layer is also a governance control. The governance practitioner’s guide covers this overlap in more detail. Runtime model selection is one of the controls that protects against unintended escalation, security as well as cost.

Agent decomposition lowers cost AND raises accuracy

If one technique deserves to be at the top of the priority list once script substitution is done, it’s agent decomposition. The pattern: take a single task you’re sending to a large model and split it into a sequence of smaller subtasks, each running on a smaller model tier appropriate to that subtask.

The economics are direct. A single large model running an entire process is expensive. Break the process into several smaller sub-steps on small models, and each small model might cost a tenth or even a twentieth of the price of the larger one. Even summed across all the steps, the per-task spend drops dramatically.
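To make the arithmetic concrete, here is a worked example with assumed numbers. The prices and token counts are illustrative only (not any provider's price list), and the decomposed version deliberately uses more total tokens, since each sub-step re-sends some context.

```python
# Assumed prices per million tokens; illustrative, not a real price list.
LARGE_PRICE = 15.00
SMALL_PRICE = 1.50  # one tenth of the large tier


def cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of a call that processes the given number of tokens."""
    return tokens / 1_000_000 * price_per_million


# One monolithic call on the large model: 20,000 tokens.
monolithic = cost(20_000, LARGE_PRICE)

# Four focused sub-steps on the small model: 28,000 tokens in total.
decomposed = sum(cost(tokens, SMALL_PRICE) for tokens in (8_000, 8_000, 6_000, 6_000))

savings = 1 - decomposed / monolithic
print(f"monolithic ${monolithic:.3f}, decomposed ${decomposed:.3f}, savings {savings:.0%}")
```

Under these assumptions the decomposed pipeline processes 40% more tokens and still costs 86% less, because the price gap between tiers outweighs the duplicated context.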

The non-obvious second benefit is the one most cost-optimization guides miss. Smaller models on focused subtasks often outperform a single large model on the bundled task. The reasons are mechanical: each subtask has narrower context, narrower failure modes (each step has one job, and you can evaluate it in isolation), and easier debugging. Accuracy goes up because the system is easier to reason about, not because the smaller models are individually smarter.

Decomposition also frees you to run independent subtasks in parallel where the data flow allows it, which pulls latency down on top of cost. Three things move together: cost down, accuracy up, often speed up too. No trade-off.
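A small sketch of the parallelism point, using `asyncio.gather` from the standard library. The subtask names are hypothetical and `run_subtask` is a stand-in that sleeps instead of calling a model.

```python
import asyncio


async def run_subtask(name: str, inputs: tuple[str, ...] = (), delay: float = 0.05) -> str:
    """Stand-in for a model call; asyncio.sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"{name}({', '.join(inputs)})"


async def run_pipeline() -> list[str]:
    # These two subtasks do not depend on each other, so they run concurrently.
    research, pricing = await asyncio.gather(
        run_subtask("research"),
        run_subtask("pricing"),
    )
    # The summary consumes both results, so it has to wait for them.
    summary = await run_subtask("summary", (research, pricing))
    return [research, pricing, summary]


results = asyncio.run(run_pipeline())
```

Wall-clock time for the first two steps is roughly that of the slower one, not the sum, which is where the latency gain comes from when the data flow allows it.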

Decomposition has a cost of its own. It adds coordination overhead: state passing between steps, error handling at each boundary, monitoring across the chain. For single-call workflows or short pipelines, the overhead isn’t worth it. The threshold is roughly: if the task has at least three distinct phases that could plausibly run on different model tiers, decomposition pays. For a one-shot answer task with a uniform reasoning load, keep it monolithic.

Our deployment operational decisions article covers the lifecycle questions around when to decompose and when to consolidate. Decomposition is one of the moves you make as a system matures.

The Haiku scratchpad case: make cheaper models smarter before escalating

Sometimes you can get the answer quality of a higher tier at the price of a lower tier, not by switching models but by changing the prompt. The technique is to force the cheaper model to reason in writing before it answers. Give it a scratchpad (a file, a structured output field, anywhere it can lay out its thinking step by step) and require it to write reasoning before producing the final answer.

Here’s a direct case: we ran a large-volume sandbox test on Haiku and another on Sonnet, measuring how often the model produced a failure (wrong decision, wrong recommendation), using a secondary LLM as evaluator against a fixed set of control criteria. Haiku failed 4% of the time. Sonnet failed 0% of the time. Per call, Haiku was substantially cheaper, but the error rate made it look like Sonnet was the right choice.

Then we changed the Haiku instructions: before producing an answer, write your reasoning to a scratchpad file. Only after that, give the answer. We re-ran 250 tests. The Haiku error rate moved from 4% to 0%. The per-run cost rose trivially, a few hundred extra output tokens of reasoning, and Haiku stayed substantially cheaper than Sonnet for the same volume of work. Sonnet-quality answers at Haiku prices.
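One way to wire this up, sketched under assumptions: the test described above used a scratchpad file, while this version uses tagged sections in the model output (a structured-field variant of the same idea). The tag names and helper functions are hypothetical.

```python
import re
from typing import Optional

SCRATCHPAD_INSTRUCTIONS = (
    "Before answering, write your reasoning step by step inside "
    "<scratchpad>...</scratchpad>. Only after that, give the final answer "
    "inside <answer>...</answer>."
)


def with_scratchpad(task_prompt: str) -> str:
    """Append the reasoning-first requirement to any task prompt."""
    return f"{task_prompt}\n\n{SCRATCHPAD_INSTRUCTIONS}"


def extract_answer(response: str) -> Optional[str]:
    """Return the final answer, or None if the model skipped the required format."""
    reasoning = re.search(r"<scratchpad>(.+?)</scratchpad>", response, re.DOTALL)
    answer = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    if not reasoning or not answer or reasoning.start() > answer.start():
        return None  # no reasoning, no answer, or answer written before reasoning
    return answer.group(1).strip()
```

Treating a missing or out-of-order scratchpad as a failed response keeps the comparison honest: the cheaper model only gets credit when it actually reasoned before answering.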

The same approach works between Sonnet and Opus on harder tasks. Force the mid-tier model to write reasoning before answering, and the gap to the premium tier closes for some workloads. Not all. Scratchpad-forcing has limits. Some tasks genuinely need Opus-tier reasoning and no prompt design closes that gap.

Before reaching for a model upgrade on high-volume tasks where the per-call cost delta is large, run the scratchpad test. The cases where it works are the cases where you save the most, and once again the important axes move the right way: cost down, accuracy up. The one give is a small speed cost from the extra output tokens, typically dwarfed by the spend reduction.

The optimization sequence

In rough order of priority, here are the optimization levers you should look to start pulling:

1. Script substitution. Audit the system for LLM calls that don’t require judgment. Replace them with scripts. Biggest savings, lowest complexity, fastest to ship. Days of work for sustained spend reduction.
2. Model pinning by stage. If different parts of your system have different complexity requirements, pin each to the right model tier. Don’t run everything on Opus. Moderate complexity, large savings, weeks of work.
3. Dispatcher architecture. Once stages are pinned, formalize the routing layer. A lightweight dispatcher in front of premium calls multiplies the savings from steps 1 and 2 and prevents future drift back to expensive defaults.
4. Agent decomposition. Split monolithic tasks into focused subtasks running on appropriate tiers. Hits the cost+accuracy dual benefit, and unlocks parallelism on top. Higher engineering effort but the highest ceiling on savings.
5. Scratchpad-forcing on the smaller tier. Before escalating to a larger model, force the cheaper one to write reasoning before answering. Often closes the quality gap at a trivial output-token cost.
6. Context trimming and prompt compression. Tools like Microsoft’s LLMLingua compress long prompts by single-digit multiples with minimal semantic loss. Lower-leverage unless your prompts are unusually long, but worth measuring once the architectural moves are done.
7. Caching layers. Prompt caching for repeated context and semantic caching for near-duplicate queries. Pure-cost wins when repeated context is common in your workload; cache hit rate is the predictor of value. Exact-match caching goes furthest: when a structured, multi-dimensional query and its conditions are identical to one already answered, serve the stored output and skip the LLM call entirely.
8. Batch API and subscription balancing. Discounts for non-time-sensitive workloads and subscription versus pay-as-you-go decisions. Real but modest savings, lowest engineering effort. Do these last.
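The exact-match end of the caching item can be sketched in a few lines. This is an in-memory illustration only; the class name and the `call` callback signature are assumptions, and a production cache would also need expiry and persistent storage.

```python
import hashlib
import json


class ExactMatchCache:
    """Serve a stored output when model, prompt, and parameters are identical."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str, params: dict) -> str:
        # sort_keys makes the key independent of parameter ordering.
        payload = json.dumps([model, prompt, params], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, params: dict, call) -> str:
        """Return the cached output, or invoke call(model, prompt, params) once."""
        cache_key = self.key(model, prompt, params)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        self.misses += 1
        self._store[cache_key] = call(model, prompt, params)
        return self._store[cache_key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate` from day one matters because, as noted above, it is the predictor of whether the cache is worth keeping.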

The sequence above is what we’ve used across cost-optimization engagements with PrograMate.ai, Unleashed Consulting, Black Gazelle, AI Governance Portland Organization, and the Wiseman Group. In each case, the largest savings came from steps 1-4: substitution, pinning, dispatching, decomposition. The lower-leverage moves closed the remaining fraction of savings but were never where the heavy lifting happened.

The Accuracy-Speed-Cost Triangle: the ceiling, not the starting point

Once the structural moves above are done — calls that shouldn’t have been LLMs replaced with scripts, stages pinned to the right model tier, monolithic tasks decomposed and parallelized where possible, smaller models given scratchpads — you arrive at the Accuracy-Speed-Cost Triangle. This is the end state. Up to this point, the right techniques made the system faster and cheaper and more accurate at the same time. From this point on, that stops being true.

The triangle has three corners — accuracy, speed, cost — and at the ceiling, every additional lever you pull moves two of them in opposite directions. To get cost down further, you have to give up speed, accept some quality drop, or both. Examples of choices that genuinely sit on the triangle:

Batch API for non-time-sensitive work. Real cost savings, but the request now takes hours or a day instead of seconds. Trade: cost ↓, speed ↓.
Model downgrade beyond what scratchpads can recover. When you’ve already tried prompt design and the smaller tier still fails on a measurable share of your workload, taking the downgrade anyway buys cost at the price of accuracy. Trade: cost ↓, accuracy ↓.
Quantized or distilled in-house models for high-volume routine work. Cost falls, output quality narrows on edge cases. Trade: cost ↓, accuracy ↓ at the tails.
Context truncation past the safe threshold. The lossless compression already happened in the structural phase. Pushing further trades quality for incremental savings. Trade: cost ↓, accuracy ↓.
Capping retries, fallbacks, or self-correction loops. Saves call volume, increases the rate at which the system ships a wrong answer. Trade: cost ↓, accuracy ↓.
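The retry-cap trade in the last item can be put in numbers. This is a simplified model that assumes each attempt succeeds independently with the same probability, which real workloads only approximate; the function names are hypothetical.

```python
def failure_rate(p_success: float, max_attempts: int) -> float:
    """Chance that every allowed attempt fails, assuming independent attempts."""
    return (1 - p_success) ** max_attempts


def expected_calls(p_success: float, max_attempts: int) -> float:
    """Average calls per task when the loop stops at the first success."""
    # Attempt k+1 only happens if the first k attempts all failed.
    return sum((1 - p_success) ** k for k in range(max_attempts))


for cap in (1, 2, 3):
    print(cap, round(expected_calls(0.9, cap), 3), round(failure_rate(0.9, cap), 4))
```

With a 90% per-attempt success rate, cutting the cap from three attempts to one saves about 10% of calls (1.11 down to 1.0) and raises the shipped-wrong-answer rate a hundredfold (0.1% up to 10%). That asymmetry is what sitting on the triangle looks like.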

The ceiling is not fixed, though. New model releases that match a higher tier’s quality at a lower price shift the triangle outward. A model capable enough to consolidate two stages of your decomposition into one moves it again. Provider pricing changes can move it overnight. Review your cost structure periodically, and especially after a major movement in the market.

Putting it together

Teams that try cost optimization without an organizing framework may run into the following failure modes:

Reaching for the triangle before the structural moves. Treating cost and quality as a tradeoff dial from day one, when most of the savings sit in techniques that improve both at once.
Optimizing the wrong layer. Caching when the real waste is misallocated LLM calls.
Chasing token price without checking quality. Downgrading to a model that produces worse outputs and calling it a win. Or worse, downgrading the model without testing enough to confirm that quality held.
Hidden ops costs in self-hosting. The math rarely works at small or mid scale once you account for engineering time.
Dispatcher overhead on uniform workloads. Adding routing complexity where there’s no complexity variance to benefit.

If you want to model the savings on your own system before changing anything, the AI Agent ROI Calculator walks through the inputs that determine where your spend actually is. If you’d rather have someone come in and do the audit, that’s what our managed autonomous AI agents service exists for. Either way, the same framework applies: find the architectural waste first, then the token waste, then the trade-offs at the ceiling, in that order.

Frequently Asked Questions

How much can a typical AI system reduce costs through optimization?

Industry benchmarks land in the 40-70% range for systematic optimization applied to a production system. When optimization compounds with process improvements, and the analysis reveals waste that was hiding in architectural decisions, order-of-magnitude reductions (spend cut by 90% or more) are achievable but not typical. Set expectations at 40-70% as the base case.

What’s the cheapest model that still produces production-quality output?

It depends on the task, and the question is usually asked too early. Before picking a model tier at all, run the structural sequence: replace misallocated LLM calls with scripts, decompose monolithic tasks into smaller-tier subtasks, and try scratchpad-forcing on the smaller tier. After that, the cheapest model that hits your quality bar on a representative sandbox test is the answer — and it’s typically smaller than the one you’d have chosen without the structural pass.

When should I switch from a frontier model to a smaller one?

After a sandbox test shows the smaller model meets your quality bar on a representative workload. Before tier-jumping down, try scratchpad-forcing on the smaller model. Sometimes you get the quality you need at the lower price without the switch.

How do I decide between an LLM call and a deterministic script?

Apply the Script-vs-LLM Substitution Rule. Scripts for determinism (defined inputs, defined outputs, mechanical transformation). LLMs for judgment (ambiguous input, context-dependent decisions, reasoning about trade-offs). If a task has a single right answer that doesn’t depend on context, it’s a script.

Is self-hosting cheaper than paying API fees?

Rarely at small or mid scale. The math looks tempting (GPU hours versus API fees) but the hidden costs (engineering time, MLOps tooling, model updates, downtime, security) dominate the bill in practice. Self-hosting starts paying off at scale levels most production systems don’t reach. At the scale where it does pay off, you usually want a hybrid: self-hosted for the high-volume routine work, API for spike-load and frontier-capability calls. This could change over time as the performance of self-hosted models meets and exceeds that of current higher-tier models.

How does dispatcher routing actually work?

A lightweight component (often a smaller model or a deterministic classifier) receives every inbound task and decides which downstream agent or model handles it. Stanford’s FrugalGPT cascade is the academic reference: try cheaper models first, escalate on failure or low confidence. RouteLLM trains the router on Chatbot Arena data to classify task complexity and pick the model tier. In production, the dispatcher can be a routing script that maps task type to model tier, or a trained classifier.
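A cascade in the FrugalGPT spirit can be sketched as follows. This is a simplification: it assumes each tier returns a confidence value alongside its answer, whereas the paper uses a learned scorer to judge answers. The type alias and the threshold are hypothetical.

```python
from typing import Callable

# Each tier takes a prompt and returns (answer, confidence in [0, 1]).
Tier = Callable[[str], tuple[str, float]]


def cascade(prompt: str, tiers: list[tuple[str, Tier]], min_confidence: float = 0.8) -> tuple[str, str]:
    """Try tiers cheapest-first; return (tier_name, answer) from the first confident one."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, tier in tiers[:-1]:
        answer, confidence = tier(prompt)
        if confidence >= min_confidence:
            return name, answer
    # The last tier is the top of the ladder: its answer stands regardless of confidence.
    name, tier = tiers[-1]
    return name, tier(prompt)[0]
```

The tiers list must be ordered from cheapest to most expensive; the function does not check prices itself.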

What’s the right balance between subscription pricing and API pay-per-use?

Volume threshold. If your monthly usage consistently exceeds the breakeven point of a subscription tier, lock in. If it’s variable or below the breakeven, stay pay-as-you-go. For systems with mixed workload (steady baseline plus spike load), a hybrid often works: subscription for the baseline, API for the spikes. Re-evaluate quarterly as usage patterns shift.
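The breakeven check is simple arithmetic. The sketch below assumes a flat monthly fee compared against a single blended per-token price, which is a simplification of most real plans, and reads "consistently exceeds" as "every observed month is above breakeven". All figures are made up for illustration.

```python
def breakeven_tokens(subscription_fee: float, price_per_million: float) -> float:
    """Monthly token volume at which the subscription costs the same as pay-as-you-go."""
    return subscription_fee / price_per_million * 1_000_000


def cheaper_plan(monthly_tokens: list[int], subscription_fee: float, price_per_million: float) -> str:
    """Recommend the subscription only if usage beat breakeven in every observed month."""
    threshold = breakeven_tokens(subscription_fee, price_per_million)
    return "subscription" if min(monthly_tokens) > threshold else "pay-as-you-go"


# Assumed figures: a $200/month plan versus $10 per million tokens.
steady = cheaper_plan([25_000_000, 30_000_000, 28_000_000], 200.0, 10.0)
spiky = cheaper_plan([5_000_000, 40_000_000, 8_000_000], 200.0, 10.0)
```

Here breakeven sits at 20 million tokens a month, so the steady workload locks in and the spiky one stays pay-as-you-go or moves to the hybrid described above.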

Can I optimize cost without sacrificing quality?

Yes — and it’s the default, not the exception, until you reach the triangle. Cost optimization that drops quality below tolerance is degradation of service, not optimization. The techniques that pull cost without dropping quality (substitution of misallocated calls, model pinning by stage, decomposition, scratchpad-forcing, prompt caching) are the ones to start with. Techniques that genuinely trade quality for cost belong at the ceiling, sparingly, and only with measurement.

How long does it take to see ROI from AI cost optimization work?

Model pinning: a week or two. Script substitution and dispatcher architecture: weeks to a month, depending on workload complexity. Full sequence including decomposition, caching, and batch processing: a few months for a mature production system. The savings start showing up in the bill immediately after the first deployment, which makes the work easier to justify than most engineering projects.

What are the most common AI cost optimization mistakes?

Starting at the wrong layer, going after caching and batch APIs before checking for misallocated LLM calls. Chasing token price without measuring quality, so you discover later that you switched to a cheaper model that fails more often. Hidden self-hosting costs that aren’t visible until the engineering time bill arrives. Adding dispatcher complexity on workloads that don’t have the complexity variance to benefit from routing. Every one of these traces back to reaching for tactical levers before doing the structural audit — treating the Accuracy-Speed-Cost Triangle as the diagnostic tool when it’s actually the ceiling.

Hermes Agent vs OpenClaw: When to Use Which (and When to Use Both)

Sebastian Chedal — Fri, 15 May 2026 18:07:52 +0000

Businesses comparing Hermes Agent and OpenClaw treat it as a winner-loser question. That framing is wrong. They are not competing for the same job. They are different layers of the same stack, and the right architecture for most agentic systems runs both, nested together, with Hermes driving and OpenClaw containing.

Architectural disagreement

Hermes Agent and OpenClaw share a lot of surface area. Both run on your own devices, connect to messaging channels, schedule cron jobs, store persistent memory, delegate to subagents, and integrate browser and terminal tools. Read the feature lists side by side and you would conclude they are competitors.

They are not, because they disagree on what the center of an agent system should be. Hermes is built around a closed learning loop: the agent executes a task, evaluates how it went, extracts a skill, refines it during subsequent runs, and retrieves the relevant pieces on future tasks. The agent is the load-bearing element.

OpenClaw inverts that. The center of OpenClaw is the Gateway, the single control plane and node transport for the whole system. Agents are containers the Gateway routes work to. The framework is the load-bearing element, and agents are interchangeable workers inside it.

Where OpenClaw wins

OpenClaw is the right call when the system needs strong containment and predictable workflows more than it needs deep reasoning inside any single agent. Three strengths matter:

Workflow state control. The Gateway gives you an explicit, inspectable control plane for routing work between stages. When work fails, you know where it failed and what state it was in.
Agent containerization. Each agent is isolated — its own workspace, scoped tools, scoped permissions. One agent cannot accidentally run another agent’s code or read its files.
Tool and skill scoping. You declare which tools each agent can call. A research agent does not get write access to your CRM. A social-media agent does not get shell access to production.
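To illustrate the scoping idea in general terms, here is a minimal allowlist check. This is not OpenClaw's configuration format or API; the agent names, tool names, and functions are hypothetical and only show the principle of declaring tools per agent.

```python
# Hypothetical per-agent allowlists; anything not listed is denied.
AGENT_TOOLS = {
    "research": frozenset({"web_search", "read_file"}),
    "social_media": frozenset({"draft_post", "schedule_post"}),
}


class ToolNotAllowed(PermissionError):
    """Raised when an agent asks for a tool outside its declared scope."""


def call_tool(agent: str, tool: str, registry: dict, *args):
    """Run a tool only if it is on the calling agent's allowlist."""
    if tool not in AGENT_TOOLS.get(agent, frozenset()):
        raise ToolNotAllowed(f"{agent} may not call {tool}")
    return registry[tool](*args)
```

Deny-by-default is the important design choice: an agent with no entry gets an empty allowlist, not full access.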

The shape of the work matters more than the agent’s IQ. If the job is “run this five-stage pipeline every day, route failures to a human, and never let stage three write to production without approval,” OpenClaw is built for that.

Where Hermes wins

Hermes is the right call when the value of the system depends on what happens inside a single agent’s reasoning, not on the workflow that connects multiple agents.

The differentiator is the self-reflective execution loop. Hermes does not just run tasks — it captures what worked, packages it as a reusable skill, improves the skill over time, and recalls the right piece of memory at the right moment. Nous Research describes Hermes as “the only agent with a built-in learning loop”.

That difference compounds on two task types:

Long-horizon work picked up across sessions. Multi-week projects where context drifts and “what did we decide last time?” is the most-asked question. Hermes is built to remember.
Higher-order reasoning with tight tool chaining. Tasks where the agent has to plan, execute a tool, evaluate the result, choose a different tool, and iterate. OpenClaw can do this, but the loop is not first-class. In Hermes, the loop is the agent.

What is ACP

The Agent Client Protocol (ACP) is what makes running both frameworks together a real architectural choice rather than a duct-tape job.

ACP is a standard for how one piece of software talks to an AI agent. The agent runs in one process. Something else — an editor, a framework, an orchestrator — runs in another. ACP defines the message format between them, so the client can send work, watch progress, see which tools the agent is using, approve sensitive actions, and receive responses. Hermes adopted ACP early and can run as an ACP server any ACP-compatible client can drive.

That last detail is the unlock. If Hermes can run as an ACP server, anything that speaks ACP — OpenClaw included — can use a Hermes agent as a node inside a larger system.

The “Hermes drives, OpenClaw contains” pattern

In an agent system where you need both workflow containment and self-reflective reasoning, you nest them.

OpenClaw is the outer container — control plane, messaging channels, scheduled jobs, multi-agent routing, tool and skill permissions. Inside, most agents are focused workflow executors. For the agents whose value depends on reasoning and learning over time, you run a Hermes agent as a node, exposed over ACP.

A concrete example: an outbound ABM system. The orchestration — sequencing stages, managing timing between touches, handling bounces, routing hot responses to a human — is OpenClaw’s job. The reasoning inside research and personalization is where Hermes earns its place. For each target account, Hermes builds a living profile: who the real influencers are, what language resonates, which angles have gotten traction. Each interaction feeds back into the profile. Over time, Hermes develops a sharper model of each account.

Hermes drives the reasoning inside the work. OpenClaw contains it.

A two-question decision framework

If you are deciding what to build, separate two questions before any vendor pitches you a framework.

Does the system need workflow containment, or higher-order reasoning inside a single agent? Containment means predictable stages, isolated agents, scoped tools, explicit hand-offs. Higher-order reasoning means a single agent that gets smarter at your specific job over time. Different problems, different solutions.
Do you need both? If the answer is “actually, both” — which for most production systems past a certain complexity it is — then a nested architecture is the answer. Hermes inside OpenClaw, communicating over ACP.

If a vendor pitches “we just use [single framework] for everything,” ask which of the two needs they are choosing not to meet. There is always a tradeoff, and a vendor who does not know what they are giving up is not the vendor you want building your agent system. Picking a builder is at least as consequential as picking a framework.

Bottom line

Hermes packages a gateway around an agent; OpenClaw packages agents inside a gateway. The difference is which load-bearing element your system needs. For a single specialist agent that learns one domain over time, pick Hermes. For a multi-stage workflow with several agents, different permissions, and broad channel reach, pick OpenClaw. For sophisticated systems that need both — pick a vendor who knows how to nest them. Agentic development is the discipline of architecting the whole system: frameworks, agents, tool scopes, deployment, monitoring, and recovery. The framework is the floor. The rest is the discipline.

Anthropic’s Multi-Agent Blueprint: What Production Constraints Add

Sebastian Chedal — Mon, 11 May 2026 18:06:52 +0000

Anthropic’s engineering team published one of the cleanest write-ups available on how a multi-agent system actually works in practice. The post is about Claude Research, an orchestrator-subagent pattern built for breadth-first research. The architecture is optimized for a particular task class, and the price of admission is a roughly fifteenfold token cost compared to a chat conversation. That cost is the tradeoff the system makes on purpose.

Most production systems make different tradeoffs. They run under cost ceilings, accuracy SLAs, speed budgets, and error rates that the research context does not impose. The blueprint’s patterns travel — orchestrator delegation, parallel subagents, condensed-return artifacts, end-state evaluation — but the architecture that emerges from applying them under production pressure is rarely the architecture in the post. The choices look the same up close and different at the system level.

The blueprint is for breadth-first research, and the cost multiplier travels with it

Anthropic’s system is built for a specific kind of work: research where the question is large, the directions are independent, and the answer is worth a lot of tokens. The lead agent plans an approach, spins up subagents to explore in parallel, and reconciles their findings against citations. On Anthropic’s internal evaluation, a multi-agent setup with Claude Opus 4 as lead and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2%.

The number that matters more: multi-agent systems use about 15x more tokens than chat interactions. The cost multiplier is the price of admission to the architecture. If the task does not decompose into parallel directions, you pay it without earning it.

Anthropic is direct about the limit: “domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today.” That is the boundary of where the architecture earns its keep. Tasks with tightly-coupled state, sequential dependencies, or shared mutable context will hit coordination overhead faster than they hit parallelism gains.

The first decision is whether the task is in the right shape for the pattern. If it is a research-style problem with independent directions, parallel subagents are doing real work. If it is a workflow with chained dependencies, a single agent or a deterministic pipeline with smaller agents inside it usually wins on cost and reliability.

Token budget, not prompt cleverness, is the dominant performance lever

Anthropic’s variance analysis is the more useful diagnostic. In their BrowseComp evaluation, token usage by itself explained 80% of performance variance. Tool-call count and model choice were the other two factors. Prompt phrasing, instruction style, and the things teams typically iterate on did not show up as primary drivers.

The implication is practical. When a single-agent system plateaus on a complex task, the first question is whether it is context-bound, not whether the prompt needs more polish. A polished prompt cannot exceed the model’s working context. A multi-agent system, with separate context windows for each subagent, can. That is the mechanism, more than better instruction-following or any cleverness in the orchestrator.

Multi-agent’s main contribution to performance is parallel reasoning across more aggregate context than a single agent can hold. If the task fits inside one agent’s effective working window, the multiplier is rarely worth it. If the task genuinely needs more context than one agent can hold and the directions are independent, parallelism earns the cost.

Orchestrator delegation is a four-part contract that prevents agentic drift

The orchestrator-subagent split looks simple from a diagram and gets complicated in practice. Anthropic’s contract for each subagent: an objective, an output format, guidance on which tools and sources to use, and clear task boundaries. Miss any of the four and the subagent drifts — not because the model is poorly behaved, but because the orchestrator did not specify enough for it to know what done looks like.

Effort-scaling is part of that contract. Anthropic’s prompts embed concrete rules: 1 agent for simple fact-finding, 2 to 4 subagents for direct comparisons, and more than 10 subagents for complex research. Without rules like these, the lead agent over-scales — spinning up subagents for problems a single call could answer — and the cost multiplier compounds against you.
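The contract and the effort rules lend themselves to a small validation layer in the orchestrator. The sketch below is one possible encoding, not Anthropic's implementation: the class, field, and task-class names are hypothetical, while the agent counts come from the rules quoted above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class SubagentBrief:
    """The four parts of the delegation contract."""
    objective: str
    output_format: str
    tool_guidance: str
    boundaries: str

    def missing_parts(self) -> list[str]:
        """Names of contract parts left blank; an empty list means the brief is complete."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# (minimum, maximum) agents per task class; None means no stated upper bound.
EFFORT_RULES: dict[str, tuple[int, Optional[int]]] = {
    "simple_fact_finding": (1, 1),
    "direct_comparison": (2, 4),
    "complex_research": (11, None),  # "more than 10"
}


def within_effort_rule(task_class: str, requested: int) -> bool:
    """Check a requested agent count against the rule for its task class."""
    low, high = EFFORT_RULES[task_class]
    return requested >= low and (high is None or requested <= high)
```

Rejecting a brief with any missing part before the subagent is spawned is cheaper than discovering the drift in its output.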

Tool ergonomics is the other load-bearing piece. The contract is only as good as the tool surface it points to. Anthropic ran a tool-testing agent that exercised flawed MCP tool descriptions, identified the failure patterns, and rewrote the descriptions; future agents using the rewritten tools cut task completion time by 40%. The orchestrator’s instructions assume the tools they describe behave the way the descriptions claim. When tool descriptions are vague or misleading, every downstream agent pays the tax.

Order of operations: get the four-part contract right, embed effort-scaling rules in the orchestrator prompt, then audit your tool descriptions before iterating on anything else. The contract and the tools are upstream of every other lever.

Context handling is external-memory-first, not bigger-context-first

The instinct on context limits is usually to ask for a larger window. Anthropic’s architecture does the opposite. The lead researcher saves its plan to memory before context fills, because past 200,000 tokens the context window can be truncated and the plan needs to survive. The architectural choice is to externalize early, not to chase larger windows.

The artifact pattern earns its place here. Instead of subagents reporting findings back through chat-style returns — long, lossy, expensive on lead-agent tokens — they write to a shared filesystem and return a lightweight reference. The lead agent does not re-read every detail; it gets a pointer and pulls what it needs. The pattern is not unique to Anthropic, but their post implies it through the memory system; practitioners across the industry have been naming it the artifact pattern because it solves a specific failure mode: the game of telephone, where information loses fidelity each time it passes from subagent to lead.
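A minimal sketch of the artifact pattern, assuming a local directory stands in for the shared filesystem. The function names and the shape of the reference are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_artifact(workspace: Path, subagent: str, findings: dict) -> dict:
    """Subagent side: persist full findings, return only a lightweight reference."""
    path = workspace / f"{subagent}.json"
    path.write_text(json.dumps(findings), encoding="utf-8")
    return {"subagent": subagent, "path": str(path), "summary": findings["summary"]}


def read_artifact(ref: dict, key: str):
    """Lead-agent side: pull one field on demand instead of re-reading everything."""
    return json.loads(Path(ref["path"]).read_text(encoding="utf-8"))[key]


workspace = Path(tempfile.mkdtemp())
ref = write_artifact(
    workspace,
    "pricing",
    {"summary": "3 vendors compared", "rows": [1, 2, 3]},
)
```

Only `ref` enters the lead agent's context. The detail stays on disk, unmodified, so nothing is paraphrased on the way up and the telephone-game loss has nowhere to happen.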

Fresh-context resets between sub-tasks are a deliberate design choice. If state lives outside the agents, the agents do not need to carry it in their context windows. “Bigger context” also stops being the answer to most context problems; the right move when an agent struggles with a long task is usually to externalize state and reset.

Evaluation grades outcomes, not the path the agent took

Evaluation is where multi-agent systems get the strangest. The path the agent takes through a complex task is rarely the path you would have prescribed in advance. Anthropic’s guidance: “judge whether agents achieved the right outcomes while also following a reasonable process.” Outcomes are graded; paths are observed but not required to match a template.

The mechanism most teams reach for is LLM-as-judge with a structured rubric — factual accuracy, citation accuracy, tool efficiency — producing a 0.0 to 1.0 score per output. The score does not substitute for human review; it scales review across thousands of runs without reading every trace by hand.
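Once the judge has produced per-criterion scores, combining them is plain arithmetic. In this sketch the criterion names follow the rubric above, while the weights and the review threshold are assumptions chosen for illustration.

```python
# Assumed weights; they must sum to 1.0 so the result stays in the 0.0 to 1.0 range.
RUBRIC_WEIGHTS = {
    "factual_accuracy": 0.5,
    "citation_accuracy": 0.3,
    "tool_efficiency": 0.2,
}


def rubric_score(judge_scores: dict[str, float]) -> float:
    """Weighted 0.0 to 1.0 score from per-criterion judge outputs."""
    for name in RUBRIC_WEIGHTS:
        if not 0.0 <= judge_scores[name] <= 1.0:
            raise ValueError(f"{name} score out of range: {judge_scores[name]}")
    return sum(weight * judge_scores[name] for name, weight in RUBRIC_WEIGHTS.items())


def needs_human_review(score: float, threshold: float = 0.7) -> bool:
    """Route low-scoring runs to a person instead of reading every trace."""
    return score < threshold
```

The threshold is what turns scoring into triage: humans read the runs below it, and the judge handles the rest.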

For state-mutating agents, end-state evaluation is the cleaner framing. Ignore the path entirely. Compare the final environment state to the goal state. Did the document get written, the ticket get closed, the file get moved? If yes, the agent succeeded — even if the trace looks meandering. Letting the agent iterate over its own process tends to produce better runs than prescribing the process up front, because the right path is often not knowable in advance.
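End-state evaluation reduces to a comparison between two snapshots. A minimal sketch, assuming the environment state can be captured as a flat dictionary; the key names in the usage example are hypothetical.

```python
_MISSING = object()  # sentinel so an absent key never equals a goal value of None


def unmet_goals(final_state: dict, goal_state: dict) -> list[str]:
    """Goal keys whose final value is absent or different; the trace is never consulted."""
    return sorted(
        key for key, wanted in goal_state.items()
        if final_state.get(key, _MISSING) != wanted
    )


def succeeded(final_state: dict, goal_state: dict) -> bool:
    """True when every goal condition holds; extra state in final_state is ignored."""
    return not unmet_goals(final_state, goal_state)


goal = {"ticket_status": "closed", "file_location": "archive/report.pdf"}
final = {"ticket_status": "closed", "file_location": "archive/report.pdf", "tool_calls": 17}
```

Here `succeeded(final, goal)` holds even though the run took 17 tool calls. The meandering path shows up in the trace, where cost review can find it, but it does not affect the pass or fail verdict.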

Scoring is necessary but not sufficient. Production agents need traces, audit trails, and the ability to investigate a failure that scored well on the rubric but cost too much or used the wrong tool. The governance layer for production agents sits underneath evaluation, supplying the visibility scoring alone cannot provide.

Production constraints reshape the decisions the blueprint leaves to defaults

The blueprint and production part company here. Anthropic’s research context has no fixed daily cost ceiling, no hard accuracy SLA, no sub-second response budget, no error-rate threshold tied to revenue. Most production systems have at least one, often all four. The architecture decisions a team makes under those pressures are not the decisions the blueprint defaults to.

A few of the gaps the blueprint leaves to the reader:

Long-running state across sessions. The Claude Research system is session-bounded. A research run starts and finishes. Production agents often need to operate across days or weeks: a content pipeline that watches for new briefs, an operations agent that monitors a system continuously, an integration agent that processes events as they arrive. State across sessions is a different problem than state within one.
Failure cascades when a subagent fails mid-orchestration. The blueprint describes the happy path. Production has to handle a subagent that times out, returns malformed output, hits a rate limit, or fails its tool call. The lead agent needs to know whether to retry, fail over, return a partial result, or abort the whole run, and that logic is not in the blueprint.
Multi-model pinning. Anthropic uses one model family throughout. Production teams often need a specific model version pinned for a specific job — partly for accuracy stability across runs, partly for cost control, partly because behavior changes between model versions can break workflows that depended on the old behavior.
Runaway-spend protection. The 15x cost multiplier compounds quickly when something misbehaves. A subagent that recursively spawns or a tool that returns oversized results can burn through a daily budget in minutes. The blueprint does not address circuit breakers, budget caps, or per-run cost ceilings.
Stateful resumption. When a long-running agent fails, restarting from scratch is wasteful. Checkpointing so the agent can resume from its last decision point, not its first, changes the cost economics of long jobs significantly. The blueprint mentions resumption in passing but does not treat it as a first-class architectural concern.
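The failure-cascade gap in the list above comes down to a decision the lead agent has to make explicitly. The sketch below is one possible policy, not the blueprint's: the failure categories, the retry cap of three, and the 60% coverage floor are all assumptions chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FAIL_OVER = "fail_over"
    PARTIAL = "partial_result"
    ABORT = "abort"


# Failures that often clear on their own; the categories are assumptions.
TRANSIENT = {"timeout", "rate_limit"}


def decide(failure_kind: str, attempts: int, has_fallback: bool,
           completed: int, total: int,
           max_attempts: int = 3, min_coverage: float = 0.6) -> Action:
    """Pick what the lead agent does after one subagent fails."""
    if failure_kind in TRANSIENT and attempts < max_attempts:
        return Action.RETRY          # cheap to try again, and capped
    if has_fallback:
        return Action.FAIL_OVER      # e.g. a different tool or model tier
    if total > 0 and completed / total >= min_coverage:
        return Action.PARTIAL        # enough subagents finished to answer
    return Action.ABORT              # stop before spending more
```

Writing the policy as a pure function makes it testable without running any agents, and the retry cap gives the runaway-spend item a hard upper bound on repeated attempts.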

One example of how production pressures push toward different choices: in a content pipeline that runs autonomous agents end-to-end, fixed downstream crons were replaced with completion-triggered orchestration so that downstream stages fire the moment the previous stage finishes, instead of waiting for a scheduled tick. That is not a choice the blueprint suggests, because the blueprint is not session-spanning; production constraints make it obvious. Different pressures, different decisions.

The general pattern across these gaps: the blueprint optimizes for a single bounded run with a research outcome as the deliverable, while production systems usually optimize for repeated runs with reliability, predictable cost, and operational containment as the deliverables. Those are not opposing goals, but they push the architecture toward different shapes. A research system can afford to retry an entire run when something goes wrong; a production system that does that on every failure burns its budget and its SLA. A research system can afford to use the strongest available model throughout; a production system often pins a smaller model for the subagent tier because the cost difference compounds across thousands of calls per week.

Read the blueprint as a high-quality reference architecture for the task class it targets. Treat the patterns as primitives (orchestrator delegation, parallel subagents, condensed-return artifacts, end-state evaluation) and let the production constraints you are actually operating under decide how those primitives compose. The architecture lives in the composition, with each pattern earning its place in context.

When not to go multi-agent, and the question that comes first

Before “should I use a multi-agent architecture?” comes a different question: what job am I trying to remove from human supervision?

Multi-agent systems earn their keep when they reduce work; they fail when they multiply things to manage. A team running a single agent that already does its job well does not need a multi-agent architecture; it needs a clearer success metric and maybe a better tool surface. A team that has identified a research-shaped problem with independent directions and budget headroom for the cost multiplier is in the right place for the pattern.

A few heuristics for when single-agent or deterministic-workflow architectures are usually the right call:

Tightly-coupled context. If every agent needs the same shared state and changes propagate across the system, the coordination cost will exceed the parallelism gain.
Sequential dependencies. If step B requires step A’s output and step C requires step B’s output, you have a pipeline, not a parallel workload. A pipeline of small agents is usually simpler and cheaper than an orchestrator-subagent decomposition for the same work.
Deterministic workflow surface. If the steps are knowable in advance and the failure modes are predictable, a deterministic workflow with self-improvement scoped to skill optimization will be more reliable than a general-purpose agent picking between dozens of tools.
Insufficient budget for the cost multiplier. If the daily or per-run budget cannot absorb the token overhead, the architecture is the wrong tool for the budget.

For mid-market teams, complexity is its own failure mode. Every additional agent is another component to manage, debug, monitor, and pay for. Lower-order simple agents nested inside larger loops often produce better outcomes than a general-purpose multi-agent system trying to do everything. The mistake to avoid is adding agents because the architecture diagram looks impressive; the goal is to remove jobs from human supervision, never to create more agents for a human to supervise.

A sharper question than “single or multi”: if I did not need to supervise this work, and the agent did it as well as or better than a person doing it today, what would that unlock? When the answer is concrete (a person freed up for higher-value work, a process that runs overnight, a backlog that clears without intervention), the architecture that earns its keep is the one that delivers that outcome with the fewest moving parts. The shape of the answer often points at where you are on the autonomy spectrum and what the next step is.

Anthropic’s blueprint documents one such point well. For any team adopting it, the work is to know which pressure the system is being built under, and to let that pressure shape the architecture that emerges. Same patterns, different production constraints, different decisions.

Frequently asked questions

What is Anthropic’s multi-agent research system?

Anthropic’s multi-agent research system, used in their Claude Research product, is an orchestrator-subagent architecture for breadth-first research. A lead agent plans the research approach and saves its plan to memory; it then spins up parallel subagents to explore independent directions, each with its own context window and tool access. Subagents return condensed findings, often via a shared memory store rather than long chat-style returns, and the lead agent reconciles them into a final answer with citations. On Anthropic’s internal research eval, this setup outperformed a single Claude Opus 4 agent by 90.2%.

What is the orchestrator-subagent (orchestrator-worker) pattern?

The orchestrator-subagent pattern, sometimes called orchestrator-worker, is a multi-agent design where one agent decomposes a task and delegates pieces of it to other agents. The orchestrator does not do the work itself; it plans, dispatches, and integrates results. Each subagent receives an objective, an output format, guidance on which tools and sources to use, and clear task boundaries. The pattern fits tasks that decompose naturally into independent directions and where parallel exploration is faster than sequential execution. It does not fit tasks with tightly-coupled context or heavy dependencies between subagents.

When should I use a multi-agent architecture vs. a single agent?

Use multi-agent when the task is breadth-first, the directions are independent, the aggregate context exceeds what a single agent can hold, and the budget can absorb the cost multiplier. Use single-agent when the task fits inside one context window, when steps are sequential, when the workflow is deterministic enough to specify, or when the budget is tight. The blueprint itself flags shared-context and high-dependency domains as poor fits for multi-agent. Most production tasks land closer to single-agent or deterministic-pipeline shapes than to research-style multi-agent shapes.

How does Anthropic’s multi-agent system handle context limits?

Anthropic’s system handles context limits by externalizing state to memory rather than chasing larger context windows. The lead researcher saves its plan to memory before context fills, because the context window can be truncated past a certain length. Subagents write findings to a shared filesystem and return lightweight references — the artifact pattern — so the lead agent does not re-read every detail through chat-style returns. Fresh-context resets between sub-tasks are part of the same strategy: state lives outside the agents, so agents can reset without losing it.

How much more expensive is a multi-agent system than a single agent?

Anthropic reports that multi-agent systems use roughly 15x more tokens than a chat conversation on the same surface task. The multiplier is the cost of running parallel subagents with their own context windows and tool calls. If the task is breadth-first and decomposes into independent directions, the multiplier buys parallel exploration across more context than a single window can hold. If the task does not decompose, you pay the multiplier without earning it. Production teams often add cost circuit breakers and per-run budget caps because the multiplier compounds quickly when something misbehaves.
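To see what the multiplier does to a budget, here is a back-of-envelope calculation. Every input is hypothetical: 20,000 tokens per chat-style run, 50 runs per day, and a blended price of $5 per million tokens. Only the 15x figure comes from Anthropic's report.

```python
def daily_cost_usd(tokens_per_chat: int, multiplier: float,
                   runs_per_day: int, usd_per_million_tokens: float) -> float:
    """Estimate daily spend from a per-run token count and a blended token price."""
    return tokens_per_chat * multiplier * runs_per_day * usd_per_million_tokens / 1_000_000


# Hypothetical inputs; substitute your own measurements.
single_agent = daily_cost_usd(20_000, 1, 50, 5.0)    # 5.0 USD per day
multi_agent = daily_cost_usd(20_000, 15, 50, 5.0)    # 75.0 USD per day
```

Under these assumed inputs the same workload moves from $5 to $75 per day, which is why the budget check belongs before the architecture decision rather than after it.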

What does Anthropic’s blueprint not cover about production agent systems?

The blueprint focuses on session-bounded research and leaves several production concerns to the reader: long-running state across days or weeks, failure cascades when a subagent fails mid-orchestration, multi-model pinning for accuracy stability and cost control, runaway-spend protection through circuit breakers and budget caps, and stateful resumption from a checkpoint instead of a full restart. These are not flaws in the blueprint; they are concerns that emerge when the same patterns are applied under production constraints — cost ceilings, accuracy SLAs, speed budgets, error rates — that the research context does not impose.

Building autonomous agent systems under production constraints is the work we do every day. If you’re evaluating multi-agent architecture for a real job and want a practitioner’s view on where the patterns earn their keep, our managed autonomous AI agents service is the closest place to start.

AI Agent Deployment: The Operational Decision at Each Stage

Sebastian Chedal — Fri, 08 May 2026 18:07:21 +0000

Most teams running an AI agent pilot are being asked the same question right now: what do we build next? The published guidance is a stack of vendor maturity models that name the stages without naming the decisions inside them, and the team ends up debating models, prompts, and platforms while the pilot stalls.

A March 2026 Digital Applied survey found that 78% of surveyed enterprises had at least one agent pilot running and only 14% had scaled an agent to production-grade, organization-wide operation.

The same dataset surfaced something that reframes the problem: organizations with production-scale deployments did not have larger AI budgets than the organizations whose pilots stalled. They allocated the budget differently. Less on model selection and prompt engineering, more on evaluation infrastructure, monitoring tooling, and operational staffing. The teams that crossed into production reallocated. They did not outspend.

That finding changes what the deployment stages are for. Each stage has one operational decision that either reinforces the misallocation or breaks it. Get the decision right and the next stage gets cheaper. Get it wrong and you spend the next quarter rediscovering the same problems at higher volume.

This article walks the four operational decisions: workflow scope at pilot, monitoring placement at single-agent production, shared-state ownership at multi-agent coordination, and completion triggers at autonomous orchestration. It also covers the shape of governance cost across the stages, when to stay one stage longer, and the mechanism we run at each stage in our own production pipeline.

The deployment problem is mostly an allocation problem

The Digital Applied survey is the first dataset we have seen that quantifies what production-scale AI agent teams did differently. It is not what most vendor decks would predict. The teams that made it across had comparable AI budgets to the teams that stalled. The difference was where the dollars went.

The blocking factors stalled organizations cited are mostly operational, not modeling. Output quality at volume, monitoring and observability, and organizational ownership are all the work that happens after a model is chosen, after a prompt is tuned, after the demo is approved. The single most-cited operational gap was monitoring and observability, named by 54% of stalled organizations as a blocking factor. The same gap shows up again in the Dynatrace work cited later, and it is the one to anchor on: more than half of stalled deployments cannot see what their agents are doing.

The misallocation pattern is recognizable. A team finishes a successful pilot. The next quarter’s budget conversation centers on which model to upgrade to, which prompt strategy to standardize on, which platform to consolidate on. The evaluation harness, the monitoring layer, and the operational headcount are deferred to “after we get the architecture right.” By the time the architecture is settled, the budget for the deferred work is gone, and the agents are running in production without the operational scaffolding they need to scale.

Each of the four deployment stages has one decision that breaks this pattern. Each decision puts a load-bearing piece of operational scaffolding in place before the misallocation can compound. The decisions are not abstract. We have made each of them in our own production agent pipeline, watched the failure modes when we got each one wrong, and rebuilt accordingly.

Pilot stage: the decision is workflow scope

Most pilots are scoped for demo appeal. Someone picks a workflow that will produce a compelling video, the team ships an agent that handles the happy path, and the pilot is declared a success. Then production handoff begins, and integration complexity, the most-cited scaling gap in the Digital Applied data, surfaces all at once. The pilot was never scoped to the messy edges of the workflow it claimed to automate.

The pilot decision is workflow scope. Scope governs every downstream cost. Pick a workflow with a clean input boundary, a measurable success metric, and a defined incident response, and the next three stages inherit a workable foundation. Pick a workflow that looks good in a slide deck, and you are paying for that scope decision for a year.

The mechanism is to define exit criteria at pilot start, not at production handoff. Three concrete criteria, written down before the agent runs:

Task volume threshold. What rate of work does the agent need to handle to be worth running in production? If the answer is “we will figure it out,” the pilot is not scoped.
Quality measurement. What does a wrong answer look like, and how is it caught? The answer cannot be “the user will tell us.” Production agents cost money per run; you need a quality signal that does not depend on a human checking every output.
Incident response. When the agent fails, what happens? Who gets paged? What runs in its place? “We will roll back” is not a plan if the agent is the only thing producing the work.
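One way to force the three criteria onto paper is to make them a record that cannot be left blank. The sketch below is an assumption about how a team might encode them; the field names and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotExitCriteria:
    min_tasks_per_day: int   # task volume threshold
    quality_signal: str      # automated check that catches a wrong answer
    incident_owner: str      # who gets paged when the agent fails
    fallback_process: str    # what runs in the agent's place

    def unanswered(self) -> list:
        """Name the criteria that are still blank."""
        gaps = []
        if self.min_tasks_per_day <= 0:
            gaps.append("task volume threshold")
        if not self.quality_signal.strip():
            gaps.append("quality measurement")
        if not self.incident_owner.strip() or not self.fallback_process.strip():
            gaps.append("incident response")
        return gaps


# Hypothetical pilot with no quality signal and no fallback defined yet.
draft = PilotExitCriteria(
    min_tasks_per_day=200,
    quality_signal="",
    incident_owner="ops-oncall",
    fallback_process="",
)
gaps = draft.unanswered()
```

An empty result from `unanswered()` is a necessary condition for scoping, not a sufficient one; the values still have to be honest.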

If the pilot cannot answer those three questions, the next stage is going to be operational firefighting. It is worth pairing this stage with an honest AI readiness evaluation across data, governance, and culture before you commit to scaling the agent.

Single-agent production: the decision is monitoring placement

The pilot’s quality gate was a human in the loop. Production needs a different gate, and “we will add observability later” is the dominant failure pattern at this stage. A separate Dynatrace survey reports that a substantial share of leaders still rely on manual methods to monitor agent interactions — not an artifact of small deployments, but the operating reality of organizations that already have agents in production.

The single-agent production decision is monitoring placement. It has to be set before the agent goes live, not bolted on after the first incident. Three layers belong in place at deploy time:

Traces. Every agent run produces a structured trace: inputs, tool calls, outputs, duration, cost. Without traces, you cannot diagnose a failure that did not raise an exception.
Evaluation harness. A reference set of inputs and expected behaviors that runs before any change to the prompt, the model, or the tooling. Without an eval harness, every change is a guess.
Cost circuit breaker. A spending threshold that alerts at one level and halts the agent at another. Agents fail in directions that traditional monitoring does not catch. They keep running, just badly and expensively. Our own production pipeline holds to a predictable daily AI infrastructure baseline only because the cost-defense layers were built before the agents were turned on, not after the first runaway.

The order matters. Traces are the diagnostic substrate. The evaluation harness sits on top, using traces to score behavior. The cost circuit breaker is the last-resort guard for the failure modes that the evaluation harness does not catch in time. Build them in that order, and the next stage, multi-agent coordination, has the diagnostic data it needs. Skip the order, and the next stage is debugged from log files. The per-layer architecture is in the cost circuit breaker article. It is the single piece of single-agent infrastructure we would not deploy without.
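A minimal sketch of a two-threshold cost circuit breaker, assuming the caller can report the cost of each run as it completes. The class and exception names are hypothetical, and this is not the per-layer architecture referenced above, only the alert-then-halt idea in its smallest form.

```python
class BudgetExceeded(RuntimeError):
    """Raised when spend reaches the halt threshold."""


class CostCircuitBreaker:
    def __init__(self, alert_usd: float, halt_usd: float, on_alert=print):
        if alert_usd >= halt_usd:
            raise ValueError("alert threshold must be below halt threshold")
        self.alert_usd = alert_usd
        self.halt_usd = halt_usd
        self.on_alert = on_alert
        self.spent_usd = 0.0
        self._alerted = False

    def record(self, cost_usd: float) -> None:
        """Add one run's cost; alert once at the first level, halt at the second."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.halt_usd:
            raise BudgetExceeded(f"spent {self.spent_usd:.2f} of {self.halt_usd:.2f} USD")
        if self.spent_usd >= self.alert_usd and not self._alerted:
            self._alerted = True
            self.on_alert(f"cost alert: {self.spent_usd:.2f} USD spent")
```

The caller wraps each agent run in `record()` and treats `BudgetExceeded` as a stop signal for the whole job, not as an error to retry.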

Multi-agent coordination: the decision is shared-state ownership

Multi-agent failures look different from single-agent failures. They are not crashes. They are agents stepping on each other’s work, losing track of items in flight, and producing results that contradict each other because each agent inferred the state of the system from a different source. The loss is operational drift rather than catastrophic failure, which is harder to detect.

The multi-agent decision is shared-state ownership. Most of these failures trace to a single cause: agents are assumed to be isolated when they are context-coupled. They touch the same work, but no one named the canonical source of truth.

The mechanism is to name one explicit state owner for each piece of shared context, and require every agent to read and write through it. A file, a table, a queue, a database row: the form does not matter. What matters is that there is one place where the system’s state lives, and no agent infers state from another agent’s output.

In our own pipeline, the canonical state lives in two structured files: one tracks the production status of every content item, and the other tracks topic-level metadata across the inventory. Every agent in the pipeline reads from those files at the start of its work and writes to them at the end. No agent guesses where the work is by reading another agent’s draft. That single architectural decision, a named state owner, eliminated an entire class of failure that had been showing up as “missing items” and “duplicate work” before we made it. The broader pipeline architecture is documented in detail, but the load-bearing decision at this stage is the state-ownership one, not the pipeline shape.
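The named-state-owner idea can be sketched with a single JSON file that every agent reads and writes through. This is an illustration under assumptions, not the pipeline's actual code: the `StateOwner` class, the status values, and the file layout are hypothetical, and a multi-process deployment would also need file locking, which is omitted here.

```python
import json
import os
import tempfile
from pathlib import Path


class StateOwner:
    """The one place item status lives; agents never infer it from each other."""

    def __init__(self, path: Path):
        self.path = path
        if not path.exists():
            self._save({})

    def _load(self) -> dict:
        return json.loads(self.path.read_text(encoding="utf-8"))

    def _save(self, state: dict) -> None:
        # Write to a temp file and rename so readers never see a partial file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
        os.replace(tmp, self.path)

    def status(self, item_id: str):
        return self._load().get(item_id, {}).get("status")

    def advance(self, item_id: str, expected, new: str, agent: str) -> bool:
        """Move an item forward only if it is in the state the agent expects."""
        state = self._load()
        if state.get(item_id, {}).get("status") != expected:
            return False  # another agent already moved it; do not duplicate work
        state[item_id] = {"status": new, "updated_by": agent}
        self._save(state)
        return True


store = StateOwner(Path(tempfile.mkdtemp()) / "status.json")
first = store.advance("post-1", None, "researched", "research-agent")
second = store.advance("post-1", None, "researched", "backup-agent")
```

The second call returns `False` because the item has already moved, which is the "duplicate work" failure being rejected at the state owner instead of discovered later.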

The reason this works: shared state is the point at which multi-agent systems either become a coordinated team or a set of agents producing parallel inconsistent outputs. The investment goes into one well-designed shared structure, not into many ad-hoc handoffs.

Autonomous orchestration: replace fixed schedules with completion triggers

By the time a system has multiple agents in production, the orchestration layer becomes the bottleneck. Variable-duration AI work breaks fixed-schedule orchestration. The symptom is items waiting between stages: a research stage finishes at 11:14am, but the writing stage runs at noon, so the item sits for 46 minutes for no operational reason. Multiply that across a dozen stages and the lag compounds.

The autonomous orchestration decision is to move from fixed schedules to completion triggers. Only the entry point of the pipeline runs on a clock. Every downstream stage fires when the previous stage signals completion. The plumbing is straightforward: a stage finishes, writes its output, and calls the next stage.
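The plumbing described above can be sketched in a few lines, assuming each stage is a function that takes an item and returns its output. The `Stage` class and the stage names are hypothetical; a real pipeline would trigger the next stage through a queue, webhook, or job scheduler rather than a direct call.

```python
from typing import Callable, Optional


class Stage:
    def __init__(self, name: str, work: Callable[[dict], dict],
                 next_stage: Optional["Stage"] = None):
        self.name = name
        self.work = work
        self.next_stage = next_stage

    def run(self, item: dict) -> dict:
        """Do this stage's work, record completion, then fire the next stage."""
        output = dict(self.work(item))
        output["completed"] = item.get("completed", []) + [self.name]
        if self.next_stage is not None:
            # Completion trigger: no waiting for a scheduled tick.
            return self.next_stage.run(output)
        return output


# Hypothetical three-stage pipeline, linked from last to first.
publish = Stage("publish", lambda item: {**item, "published": True})
write = Stage("write", lambda item: {**item, "draft": "text"}, next_stage=publish)
research = Stage("research", lambda item: {**item, "notes": "facts"}, next_stage=write)

# Only this entry call would sit on a clock.
result = research.run({"id": "item-1"})
```

Because each stage calls the next directly, the time an item spends between stages drops to zero, and so does the time a bad output has to be caught before the next stage consumes it.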

The numbers are concrete. Under our previous fixed-schedule design, a piece of work that could move through the pipeline in two to three hours was taking six to twelve. After replacing the fixed crons with completion triggers, the two-to-three-hour window held. The full design and the failure modes that drove it are in the completion-triggered orchestration piece.

One caveat that matters more than the orchestration win itself: completion triggers compound failures faster than fixed schedules do. A bug in stage three under fixed scheduling waits until tomorrow’s run to surface. A bug under completion triggering fires the next stage immediately, which fires the next, which can produce a cascade of bad outputs in minutes. So this stage’s decision has a dependent decision attached: pair completion triggers with anti-loop guards, retry caps, and the cost circuit breaker from the single-agent stage. The orchestration speed-up is real. So is the failure speed-up. Both have to be designed for at the same time.
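A minimal anti-loop guard with a retry cap might look like the sketch below. The limits (two visits per stage, twenty total hops per item) and the names `TriggerGuard` and `LoopGuardTripped` are assumptions for illustration; the guard is meant to be checked before every completion-triggered call.

```python
class LoopGuardTripped(RuntimeError):
    """Raised when an item revisits a stage too often or takes too many hops."""


class TriggerGuard:
    def __init__(self, max_visits_per_stage: int = 2, max_total_hops: int = 20):
        self.max_visits_per_stage = max_visits_per_stage
        self.max_total_hops = max_total_hops
        self._visits = {}  # (item_id, stage) -> count
        self._hops = {}    # item_id -> count

    def check(self, item_id: str, stage: str) -> None:
        """Call before firing a stage; raises instead of letting a cascade run."""
        key = (item_id, stage)
        self._visits[key] = self._visits.get(key, 0) + 1
        self._hops[item_id] = self._hops.get(item_id, 0) + 1
        if self._visits[key] > self.max_visits_per_stage:
            raise LoopGuardTripped(f"{item_id} entered {stage} {self._visits[key]} times")
        if self._hops[item_id] > self.max_total_hops:
            raise LoopGuardTripped(f"{item_id} exceeded {self.max_total_hops} hops")
```

The per-stage cap doubles as the retry cap, and the total-hop cap catches loops that cycle through several stages without repeating any one of them too often.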

The cost of governance is per-stage, and the curve is steeper than vendors imply

Governance dollars do not scale linearly across the four stages. They scale by what the stage requires you to monitor. A single-agent production system needs evaluation and alerting. A multi-agent system adds shared-state audit and per-agent identity. An autonomous orchestration system adds completion-trigger guards, recovery infrastructure, and an anti-loop layer.

The shape matters more than the dollar figure. Our own ranges are useful as a reference example, with the caveat that the reader’s numbers will differ based on agent count, workload, and model mix: across nine production agents and sixty-two scheduled jobs at the autonomous-orchestration stage, our daily AI infrastructure cost runs roughly $15-20. That is operational AI infrastructure cost, not the full cost of running the system.

What the curve looks like, by stage:

Single-agent production. Evaluation harness, alerting, traces, cost circuit breaker. The cost is mostly tooling and the operational time to maintain reference sets and tune thresholds.
Multi-agent coordination. Add shared-state audit and per-agent identity. The identity-visibility gap that surveys keep surfacing is theoretical until the multi-agent stage; once two agents share work, it becomes operational.
Autonomous orchestration. Add completion-trigger guards, recovery crons, and per-stage cost limits. This is where agents can do the most damage in the shortest time, and the governance investment reflects that.

The allocation thesis applies again here. Governance dollars belong in evaluation, monitoring, and identity. They do not belong in picking a different model. The per-control breakdown is in the agent governance practitioners guide, mapped to the production stages.

Most teams should stay one stage longer than the vendor pitch implies

Vendors are selling autonomy. Most organizations are mid-curve and are being pushed forward before the decisions at their current stage are settled. Published survey data consistently puts enterprise-wide mature adoption at a small minority of the field; the much larger group is the one that has shipped some agents but has not finished the operational scaffolding around them.

Staying longer at a stage is not stalling. It is finishing the operational decision at the current stage before adding the next layer of failure modes. A team that has not settled monitoring placement at single-agent production will find the multi-agent stage harder, not easier. A team that has not named shared-state ownership in multi-agent will find autonomous orchestration produces faster cascades, not faster work.

The question worth asking at the end of a quarter is not “are we ready for the next stage?” It is “have we settled the operational decision at the current stage?” If the answer is no, the next stage is going to be debugged on top of an unsettled one, and the cost of that compound failure shows up later as the kind of stall that the survey data is measuring.

This is also where the conceptual maturity layer lives. The five levels of AI maturity name what each level looks like. The four operational decisions in this article name what to build at each level so the next one becomes possible. The two layers are companions, not duplicates. The decisions in this article are the work that has to happen for an organization to actually move up the maturity curve, rather than a description of where it currently sits on it.

Where to go from here

If you have a working pilot, the next operational decision is not which model to upgrade to. It is which workflow to harden, where to place monitoring before the agent goes live, who owns shared state when two agents touch the same work, and how to replace fixed schedules with completion triggers when orchestration starts to drag. Those four decisions, made deliberately, are what the production-scale teams in the Digital Applied survey did with their reallocated budgets.

If you want a partner who has already made each decision in a running production system and can build the infrastructure for your team, our managed autonomous AI agents service runs the full operational stack: evaluation, monitoring, shared-state, orchestration, and governance, at a published price. The decisions are the same whether we run them or you do. The article above is the framework. The service is the implementation.

Frequently Asked Questions

How do I know when my AI agent pilot is ready to move to production?

The pilot is ready when three exit criteria are met: the agent reliably handles a defined task volume, there is a quality measurement that does not depend on a human reviewing every output, and there is a defined incident response when the agent fails. If any of those is missing, production handoff will surface the gap as an integration failure rather than a pilot finding. Production-scale teams in the Digital Applied data wrote those criteria at pilot start, not at handoff.

What’s the operational difference between single-agent and multi-agent deployment?

A single agent mostly fails in ways you can instrument at the agent itself: error rates, latency, output quality, and cost. Multi-agent systems fail through coordination drift. Agents lose track of each other’s work, step on each other, or produce inconsistent outputs because each inferred the state of the system differently. The operational shift is from instrumenting the agent to instrumenting the shared state the agents read and write through. If you cannot point to one canonical state owner that every agent uses, you are running multiple agents, not a multi-agent system.

What does AI agent governance actually cost at each stage?

The shape is more useful than the figure. At single-agent production, governance is tooling and operational time for evaluation and alerting. At multi-agent it adds shared-state audit and per-agent identity — closing the visibility and containment gap that Cloud Security Alliance research has documented across organizations running agents. At autonomous orchestration it adds completion-trigger guards and recovery infrastructure. The curve, with costs concentrated in evaluation, monitoring, and identity rather than in model and prompt, is the part that generalizes across teams.

How do I scale AI agents without ballooning ongoing costs?

Build the cost defense before the agents go live, not after the first runaway. That means daily and per-job spending limits, alerting thresholds set lower than halt thresholds, and an evaluation harness that catches behavioral drift before it shows up as a budget overrun. Cloud Security Alliance research found that 92% of organizations lack full visibility into AI agent identities, and most doubt they could detect or contain a compromised agent. That visibility deficit is what makes runaway costs expensive to catch later. Build identity, audit, and cost-defense into the deploy step. Our daily AI infrastructure cost has stayed in a predictable range as we have added agents and jobs because the limits were in place before the volume was.

When should I add a recovery or anti-loop layer to my agent system?

At the autonomous orchestration stage, before the first completion-triggered run. Completion triggers move work faster, and they also propagate failures faster. A recovery layer of retry caps, anti-loop guards, and cost ceilings tied to the per-stage budget is the dependent decision that has to ship with completion triggering, not after it.

Why do most AI agent pilots never reach production?

The Digital Applied survey found that pilots stall within months on average. The blocking factors named (integration complexity, output quality at volume, monitoring deficit, organizational ownership, domain training data) are consistent with pilots scoped for demo appeal rather than for a workflow with measurable success criteria, scaled into production without monitoring placement decided, and operated without a clear shared-state owner. Each of those is the absence of a decision at the corresponding stage. The cumulative result is the pre-production failure rate that maturity-model coverage keeps surfacing.

Agent Memory & Knowledge Systems Compared (2026 Guide)

Sebastian Chedal — Mon, 04 May 2026 18:07:06 +0000

Most companies deploying AI agents hit the same wall about two months in: the agent forgets everything between sessions, can’t read the company’s actual knowledge (strategy docs, pricing logic, customer notes), and has no clean way to write what it learns back to the team’s knowledge base for human review. The toolkit for solving this is strong, but the question that matters for a mid-market team is different from the question developers ask. It isn’t “which API surface is cleanest.” It’s “how does a company actually maintain its knowledge, feed it to agents, let agents add to it, and keep humans in the loop?”

As of April 2026, there are five named systems worth comparing (Mem0, Zep, Letta, Cognee, and Cloudflare Agent Memory) plus a sixth path: maintaining knowledge as plain markdown and giving agents read/write access through a semantic search index.

In this article:

The five questions to ask before you pick a memory system
What’s off the shelf in 2026 — and what you can build yourself
Mem0, Zep, Letta, Cognee, and Cloudflare Agent Memory, compared on the same scaffolding
The markdown-vault path nobody else writes about
A 4-step workflow for letting agents propose knowledge updates that humans review
A decision framework matched to mid-market deployments

System	Architecture	License	Bidirectional Sync	Best For
Mem0	Vector + graph + KV	Apache 2.0 / managed	Partial (API only)	Personalization, returning end-users
Zep / Graphiti	Temporal knowledge graph	Open source / managed	Partial (API only)	Entity + time queries, CRM agents
Letta	Tiered RAM/disk (agent-managed)	Apache 2.0 / managed	Weak	Long-horizon agents, unlimited memory
Cognee	Vector + knowledge graph from docs	Open core / managed	Partial (doc curation)	Unstructured document ingestion
Cloudflare Agent Memory	Typed (Facts/Events/Instructions/Tasks)	Managed only (private beta)	Partial (shared profiles)	Teams already on Cloudflare
Markdown vault + search	Files + semantic index	Infrastructure cost only	Strong (humans edit directly)	Full ownership, humans as first-class authors

The memory problem every mid-market deployment hits in month two

The first month of an agent deployment usually goes fine. Then three things start happening at once.

First, the session reset. The agent forgets yesterday’s conversation and the user re-explains context every time. By week three, people are typing the same paragraph of background into the prompt every morning.

Second, the knowledge gap. The agent doesn’t know the company’s pricing logic, brand voice rules, approved vendor list, or customer service notes. Those documents live in Notion, Obsidian, Google Drive, an internal wiki, or scattered Slack threads. The agent has no path to any of them.

Third, the learning leak. The agent figures something out during a session (a customer preference, a corrected spec, a new policy detail) and the moment the session ends, that learning is gone.

These three failures are usually framed as a context-window problem. They aren’t. They’re an organizational-knowledge problem. The question is not “how does the agent’s brain hold more information,” it is “where does the company’s knowledge live, who maintains it, and how does the agent participate in that loop without quietly rewriting things humans haven’t reviewed?” Every system below is a different answer to that question.

The five questions to ask before you pick a memory system

A buyer needs a self-diagnostic, a short list of questions to score against any candidate. Five questions cover the field:

1. Context management. How does the agent decide what fits in its working memory right now? Some systems keep the last N messages, some retrieve relevant memories on every turn, some compress conversations into running summaries. The right answer depends on how long your sessions are.

2. Connected knowledge body. Where does the agent’s knowledge come from, and who maintains it? If the only knowledge the agent has is what users say during sessions, the system is closed-loop. If the agent can read the company wiki, customer records, or a curated knowledge graph, it’s connected. Mid-market deployments almost always need the connected version, because the team already has its knowledge somewhere and the agent needs to plug into it.

3. Automatic vs engineered memory. Does the system decide what to remember on its own, or do you tell it explicitly? Automatic extraction is faster to deploy and harder to audit. Explicit memory is slower to set up and easier to control. Most mid-market teams want explicit at first and automatic only after they trust the system’s judgment.

4. Human-agent merge. Can humans read what the agent has learned, edit it, and contribute to the same knowledge base outside the agent loop? The agent should not be the only writer to its own memory. The human team needs a seat at the same table, ideally using normal tools (text editors, wikis, IDEs) rather than a separate “memory dashboard.”

5. Current limits. What does this system not do today? Every memory system has gaps. Some don’t handle entity changes over time, some don’t support multi-tenant scoping, some are private beta with no published pricing. Naming the limits before you commit saves the second deployment from fighting the first one’s blind spots.

These five run as a checklist against every system below.
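To make question 1 concrete, here is a rough sketch of two of the strategies it names: keeping the last N messages, and folding older messages into a running summary. The RollingContext class is hypothetical, and the "summary" is a plain truncation standing in for whatever summarizer (usually an LLM call) a real system would use.

```python
from collections import deque


class RollingContext:
    """Keeps the last N messages verbatim and a crude summary of older ones."""

    def __init__(self, max_messages: int):
        if max_messages < 1:
            raise ValueError("max_messages must be at least 1")
        self.recent = deque(maxlen=max_messages)
        self.summary = []  # stand-in for a real running summary

    def add(self, message: str) -> None:
        # When the window is full, the oldest message is about to be evicted.
        # Fold a shortened copy into the summary before it disappears.
        if len(self.recent) == self.recent.maxlen:
            self.summary.append(self.recent[0][:40])
        self.recent.append(message)

    def prompt_context(self) -> str:
        """Return the text that would be placed in the prompt for the next turn."""
        parts = []
        if self.summary:
            parts.append("Earlier: " + " | ".join(self.summary))
        parts.extend(self.recent)
        return "\n".join(parts)
```

How large N should be, and how aggressive the summarizer is, depends on session length, which is the point of the question.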

The 2026 landscape — what’s off the shelf, what you build yourself

There are two paths through this market.

Off the shelf. Opinionated APIs and managed infrastructure. Integration time is days. Trade-offs are vendor lock-in, less control over how memory gets extracted and stored, and pricing models that are usually opaque until you scale. The named players are Mem0, Zep (with its open-source component Graphiti), Letta (formerly MemGPT), Cognee, and Cloudflare Agent Memory.

Build it yourself. Maintain the company’s knowledge as files, usually markdown, in a versioned folder. Index them with a local semantic search tool. Give agents a query interface and, optionally, a write-to-a-review-folder interface. Integration is longer up front, you own the operational complexity, and no vendor will support you. The advantages: knowledge stays portable, humans use normal tools to maintain it, and the cost is essentially infrastructure-only.

There’s also an architectural axis that cuts across both paths. Memory systems tend to fall into one of three patterns:

Vector-only. Embed everything, retrieve by similarity. Fast, simple, weak on temporal and relational queries.
Vector plus knowledge graph. Embed for similarity and extract entities/relationships for graph traversal. Better for “who owns what” and “what changed when” questions.
Tiered or agent-managed. The agent itself decides what to keep in working memory and what to page out to longer-term storage. More flexible, harder to reason about.

Vectorize’s 2026 framework comparison introduced this taxonomy in clean form, and it’s a useful overlay when reading the rest of this article.

The five systems, compared

Mem0 — the personalization memory layer

Mem0 is a vector + graph + key-value memory layer designed to give assistants and support agents persistent, scoped recall about end-users. Best for chatbots, support agents, and deployments where the same users return repeatedly.

The architecture combines three storage layers (vector, graph, key-value) with a four-scope memory model: user_id, agent_id, run_id, app_id, plus an optional org_id. Memories are extracted automatically from conversations and stored against whichever scopes apply. According to Mem0’s State of AI Agent Memory 2026 report (citing the ECAI 2025 paper, Chhikara et al.), Mem0 scores 66.9% on the LOCOMO benchmark at 0.71s median latency using around 1,800 tokens per conversation, versus a full-context baseline of 72.9% at 9.87s and around 26,000 tokens. In other words, the full-context baseline spends roughly 14x the tokens for a 6-point accuracy gain. The graph-enhanced variant (Mem0g) scores 68.4% at 1.09s. Mem0 publishes both its own results and the comparators’, so treat absolute numbers as vendor-favorable; the latency and token-cost gaps are directionally useful regardless.
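The trade-off in that paragraph is simple arithmetic on the reported figures, and it is worth running yourself before quoting it. The variable names below are ours; the numbers are the ones cited above.

```python
# Figures as reported in the paragraph above (vendor-published, so directional).
mem0_tokens = 1_800
full_context_tokens = 26_000
mem0_accuracy = 66.9          # LOCOMO, percent
full_context_accuracy = 72.9  # LOCOMO, percent
mem0_latency_s = 0.71
full_context_latency_s = 9.87

token_ratio = full_context_tokens / mem0_tokens
accuracy_gap = full_context_accuracy - mem0_accuracy
latency_ratio = full_context_latency_s / mem0_latency_s

print(f"{token_ratio:.1f}x tokens, {accuracy_gap:.1f} points, {latency_ratio:.1f}x latency")
# 14.4x tokens, 6.0 points, 13.9x latency
```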

On the five questions:

Context management: retrieves relevant memories per turn, scoped by user/agent/run/app/org.
Connected knowledge body: partial. Mem0 holds what users say; pulling the company’s existing knowledge in is custom work.
Automatic vs engineered: automatic extraction by default, with explicit add/update APIs available.
Human-agent merge: weak. Humans can call the API, but the workflow is developer-shaped, not knowledge-worker-shaped.
Current limits: no native human-review workflow. The four-scope model is the closest the field gets to multi-stakeholder memory but it’s still agent-centric.

License: Apache 2.0 with around 48,000 GitHub stars per dev.to’s 2026 framework roundup. Atlan’s 2026 comparison also notes Mem0 has raised $24M in funding and holds SOC 2 compliance. Repo: github.com/mem0ai/mem0. Managed cloud has a free tier; production pricing is usage-based.

Zep / Graphiti — the temporal knowledge graph

Zep models memory as a temporal knowledge graph: facts have a time dimension, so “Alice owned the budget until February, then Bob took over” is a first-class query rather than a string-similarity guess. The open-source component is Graphiti; Zep Cloud is the managed product on top.

The temporal dimension matters most for production CRM and project agents, anywhere entities change relationships over time and the agent needs “what’s true now” separated from “what was true six months ago.” Zep groups conversations into episodes, summarizes them, and indexes the resulting graph. It scores 63.8% on LongMemEval per Atlan’s comparison, the strongest published number for temporal queries, versus Mem0’s 49.0% on the same benchmark.
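The core idea of a temporal fact can be illustrated without any graph database. The sketch below is not Graphiti’s or Zep’s API; TemporalFact, query, and the specific dates are assumptions invented to show how a validity interval separates “true now” from “true then.”

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TemporalFact:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still true

    def holds_on(self, day: date) -> bool:
        """True if the fact was valid on the given day (end date exclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def query(facts: List[TemporalFact], subject: str, relation: str, as_of: date) -> List[str]:
    """Return every object linked to the subject by the relation on that date."""
    return [
        f.obj
        for f in facts
        if f.subject == subject and f.relation == relation and f.holds_on(as_of)
    ]


# Hypothetical dates for the "Alice until February, then Bob" example.
facts = [
    TemporalFact("budget", "owned_by", "Alice", date(2025, 1, 1), date(2026, 2, 1)),
    TemporalFact("budget", "owned_by", "Bob", date(2026, 2, 1)),
]

print(query(facts, "budget", "owned_by", date(2026, 1, 15)))  # ['Alice']
print(query(facts, "budget", "owned_by", date(2026, 3, 1)))   # ['Bob']
```

A vector-only store would return both facts as near-identical matches; the interval is what lets the query pick one.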

One trade-off worth flagging: DevGenius’s builder comparison reports that immediate post-ingestion retrieval often misses correct answers because Zep’s graph processing runs in the background; correct answers tend to surface hours later once the graph catches up. The same piece notes Mem0’s published critique that Zep’s memory footprint can exceed 600,000 tokens per conversation versus Mem0’s ~1,800. That critique comes from Mem0, but the order-of-magnitude gap is consistent across third-party reports.

On the five questions:

Context management: episode-grouped, summarized, retrieved with temporal awareness.
Connected knowledge body: partial. Strong inside the graph it builds, weak at pulling external markdown or wiki content in without custom ingestion.
Automatic vs engineered: automatic extraction, explicit graph editing available.
Human-agent merge: weak. Humans interact with Zep through Zep’s tools, not their own.
Current limits: retrieval delay until graph processing completes. No native human-review workflow.

License: Graphiti is open source; Zep Cloud is usage-based. Around 24,000 GitHub stars per the dev.to roundup. SOC 2 compliant per Atlan.

Letta (formerly MemGPT) — OS-inspired tiered memory

Letta models agent memory after an operating system. Main context is RAM (what’s in the prompt right now). Archival memory is disk (long-term storage the agent can search). The agent itself decides what pages in and out via tool calls. Originally published as MemGPT, the project rebranded in 2024 and continues under the same architecture.

Best for long-running agents that need effectively unlimited memory and where you’re willing to trust the agent with its own paging decisions: research assistants, coding assistants on multi-week projects, deployments running hundreds or thousands of turns. The trade-off is that “the agent decides what to remember” is harder to audit than “the system decides on rules you wrote.”
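A toy version of the RAM/disk split helps show where the audit difficulty comes from. This is not Letta’s API: TieredMemory and its methods are hypothetical, and a fixed oldest-first eviction rule stands in for the paging decisions that the agent itself would make through tool calls.

```python
class TieredMemory:
    """Toy two-tier memory: a bounded main context plus a searchable archive."""

    def __init__(self, context_limit: int):
        self.context_limit = context_limit
        self.main_context = []  # "RAM": what would be in the prompt right now
        self.archive = []       # "disk": long-term storage the agent can search

    def remember(self, item: str) -> None:
        """Add an item to main context, paging the oldest items out when full."""
        self.main_context.append(item)
        while len(self.main_context) > self.context_limit:
            self.archive.append(self.main_context.pop(0))

    def archival_search(self, term: str):
        """Case-insensitive substring search over archived items."""
        return [item for item in self.archive if term.lower() in item.lower()]

    def page_in(self, item: str) -> None:
        """Move an archived item back into main context (may evict another)."""
        self.archive.remove(item)
        self.remember(item)
```

With a fixed rule like this one, the eviction order is predictable. Once the agent chooses what to page out, the same log of moves exists but the reasons behind them do not, which is the auditing gap the section describes.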

On the five questions:

Context management: tiered RAM/disk model with agent-driven paging.
Connected knowledge body: partial. Archival memory can hold ingested documents, but you’re operating Letta’s storage, not the company’s existing knowledge base.
Automatic vs engineered: agent-managed, a third path between fully automatic and explicitly engineered by the operator.
Human-agent merge: weak. Humans can call the API; no native co-edit workflow.
Current limits: auditing what the agent chose to remember (and discard) is harder than with explicit-rule systems.

License: Apache 2.0, around 21,000 GitHub stars per the dev.to roundup. Managed cloud available; self-hosted deployment is well-documented.

Cognee — knowledge graph from unstructured data

Cognee is the closest existing system to “feed the company’s documents in and let the agent reason over them.” Its pipeline ingests raw documents, conversations, and external sources, extracts entities and relationships, builds a knowledge graph, and retrieves by graph traversal combined with vector search. The entry point is unstructured documents (not conversation logs) and the graph is the primary retrieval surface, which makes Cognee strong for institutional knowledge and weaker for fast conversational personalization. Best for research-heavy agents and deployments where the inputs are messy documents rather than clean conversations.

On the five questions:

Context management: graph traversal plus vector retrieval; long-form document support is the strength.
Connected knowledge body: stronger here than the conversational-memory peers. Ingestion is the design center.
Automatic vs engineered: automatic extraction with configurable pipelines.
Human-agent merge: partial. Humans curate the input documents, but Cognee’s representation of them is opaque to non-engineers.
Current limits: no native human-review workflow on agent-added knowledge; managed-service pricing not transparent at the time of writing.

License: open core with around 12,000 GitHub stars per the dev.to roundup. Managed cloud available.

Cloudflare Agent Memory — the April 2026 entrant

Cloudflare announced Agent Memory in private beta on April 17, 2026. It’s the most significant new entrant this year, shipping as a managed service running on Workers, Durable Objects, and Vectorize.

Five operations (ingest, remember, recall, forget, list) cover the API surface. Ingestion runs as a two-pass pipeline at 10,000-character chunks with two-message overlap, with an eight-check verifier filtering extracted memories before they land. Memories are typed into one of four classes: Facts (atomic stable knowledge), Events (timestamped happenings), Instructions (procedures), and Tasks (ephemeral). A profile model can be shared across multiple agents and humans, the closest any managed service gets to a multi-stakeholder memory layer. Cloudflare also committed publicly that customer memory is exportable (“your memories are yours; every memory is exportable”), which most managed services don’t.
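To picture what “10,000-character chunks with two-message overlap” means, here is a rough sketch of one way to chunk a conversation. It is our own illustration, not Cloudflare’s implementation: the function name, the choice to measure size as the sum of message lengths, and the soft-limit behaviour are all assumptions.

```python
def chunk_messages(messages, max_chars=10_000, overlap=2):
    """Group messages into chunks of roughly max_chars characters.

    Each new chunk starts with the last `overlap` messages of the previous
    chunk. The limit is soft: a chunk can exceed max_chars when the carried
    overlap plus a single new message is already larger than the limit.
    """
    chunks = []
    current = []
    size = 0
    fresh = 0  # messages in `current` that were not carried over
    for msg in messages:
        if fresh > 0 and size + len(msg) > max_chars:
            chunks.append(current)
            current = current[-overlap:] if overlap > 0 else []
            size = sum(len(m) for m in current)
            fresh = 0
        current.append(msg)
        size += len(msg)
        fresh += 1
    if fresh > 0:
        chunks.append(current)
    return chunks


# Six 3-character messages with a 12-character limit for readability.
print(chunk_messages(["aaa", "bbb", "ccc", "ddd", "eee", "fff"], max_chars=12))
# [['aaa', 'bbb', 'ccc', 'ddd'], ['ccc', 'ddd', 'eee', 'fff']]
```

The overlap exists so that a fact stated across a chunk boundary is still seen whole by at least one extraction pass.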

On the five questions:

Context management: typed retrieval (Facts/Events/Instructions/Tasks) with verifier-gated ingestion.
Connected knowledge body: partial. Designed primarily for conversational and event-driven inputs; document ingestion is supported but not the design center.
Automatic vs engineered: automatic with a strong verifier in the loop.
Human-agent merge: the shared-profile model gestures toward this, but the example in the launch post is “two agents share memory,” not “humans write the source of truth.”
Current limits: private beta with no published pricing; Cloudflare-ecosystem dependency; production proof points are weeks old, not years.

License: managed service, no open-source release. Pricing: not yet published as of April 2026. Best fit: teams already on Cloudflare who want the lowest-friction managed memory layer and are comfortable being early adopters.

The build-it-yourself path: markdown vault plus semantic search

A folder of markdown files plus a local semantic search index is a legitimate competitor to all five managed paths above, especially for mid-market companies that already maintain knowledge in Notion, Obsidian, or git repos. This is one of the patterns we’ve watched work in practice — see how production agent teams handle memory in practice for the operational shape.

The pattern is simple. Maintain company knowledge as plain markdown in a versioned folder (an Obsidian vault, a git repo, a GitHub wiki, a Notion export). Index it with a local semantic search tool. Give agents read access through a query tool that returns matching files (or excerpts) with provenance. Optionally, give the agent write access to a designated subfolder where new notes go for human review before promotion into the canonical base.

The advantages stack up quickly. Knowledge stays portable: no vendor owns your facts, and migrating to a different agent platform means changing the query tool, not exporting and reformatting a database. Humans edit knowledge using normal tools (text editors, Obsidian, IDEs, GitHub PR review), so there’s no separate “memory dashboard” anyone has to learn. The same knowledge base feeds multiple agents and the team simultaneously. Cost is infrastructure-only.

The pattern has a documented public example. A February 2026 walkthrough at eastondev.com describes configuring an agent platform’s Obsidian-vault skill to sync conversation memory as Markdown notes with bidirectional links and structured directories (session logs in one folder, knowledge base in another). When Perplexity is asked about bidirectional human↔agent knowledge sync in 2026, that walkthrough is the project it cites: the only documented end-to-end pattern at the time of writing. For a longer-form view of the same shape, see how a real production pipeline uses memory across multiple stages.

Tools that fit this lane: Obsidian for the markdown editor and graph layer; a local semantic search index combining BM25 and vector search over the vault; LangMem or LlamaIndex memory modules when you want a memory abstraction pairable with a markdown backend instead of a SaaS layer.
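The keyword half of that search index is small enough to sketch directly. Below is a minimal BM25 ranker over an in-memory dictionary of markdown files, returning the file path alongside each score so the agent can cite its source. The file names and contents are invented, and a production index would add the vector half and read from disk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(docs, query, k1=1.5, b=0.75, top_k=3):
    """Rank docs (a dict of path -> text) against the query with BM25.

    Returns a list of (path, score) pairs, best first, so every hit
    carries its provenance.
    """
    if not docs:
        return []
    tokens = {path: tokenize(text) for path, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter()
    for t in tokens.values():
        doc_freq.update(set(t))

    query_terms = tokenize(query)
    scores = {}
    for path, t in tokens.items():
        term_freq = Counter(t)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(t) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        if score > 0:
            scores[path] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical vault contents.
vault = {
    "pricing.md": "Enterprise pricing uses annual contracts and volume discounts",
    "brand.md": "Brand voice is plain and direct",
    "vendors.md": "Approved vendor list for hardware",
}
results = bm25_search(vault, "pricing discounts")
```

Here results holds a single hit for pricing.md, because the other two files share no terms with the query.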

When this path is the wrong answer: temporal entity tracking is non-trivial to build (use Zep), agent-managed paging across very long sessions is also non-trivial (use Letta), and if you genuinely don’t want any infrastructure to operate, the managed services exist for a reason.

The bidirectional sync question — how knowledge flows both ways

Most teams treat agent memory as one-way. The agent reads from some knowledge, operates on it, and the work product evaporates. The systems that actually work in production close the loop: agent reads, operates, writes back to a holding area, human reviews, knowledge gets promoted into the canonical base. Four steps, all of them necessary.

Step 1: Source of truth lives with humans. The canonical knowledge base, the place where the company’s strategy, pricing, customer details, and policies actually live, is something humans maintain primarily. An Obsidian vault, a Notion workspace, an internal wiki, a git repo of markdown files. Whatever it is, the humans on the team are the authoritative authors. This principle of building your own knowledge base rather than letting it live inside a vendor’s database is what makes the rest of the workflow possible.

Step 2: Agent reads with provenance. When the agent answers a question or makes a decision, it cites which document (or which memory record) the answer came from. No “trust me” responses. Provenance is non-optional, because without it humans can’t audit what the agent is doing.

Step 3: Agent writes to a review queue, not the source of truth. When the agent learns something new (a customer corrected a fact, a project changed scope, a pricing exception was approved) it writes that new note to a pending/ or inbox/ folder. Never directly to the canonical base. The agent’s job is to propose, not to publish.

Step 4: Human review promotes or rejects. A periodic review pass (daily for high-velocity environments, weekly for most) either promotes the agent’s proposed notes into the canonical base or rejects them. The canonical base only grows under human authority. The review interface is whatever the team already uses: a folder, a Pull Request, a Notion page with a checkbox.
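Steps 3 and 4 map onto plain file operations. The sketch below assumes a vault laid out with pending/, canonical/, and rejected/ folders; those folder names, the function names, and the one-line provenance header are our assumptions, not a standard.

```python
import tempfile
from pathlib import Path


def propose_note(vault: Path, name: str, body: str, source: str) -> Path:
    """Step 3: the agent writes a proposed note to pending/, never to canonical/."""
    pending = vault / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    note = pending / name
    note.write_text(f"source: {source}\n\n{body}\n", encoding="utf-8")
    return note


def review(vault: Path, name: str, approve: bool) -> Path:
    """Step 4: a human decision moves the note to canonical/ or rejected/."""
    note = vault / "pending" / name
    target_dir = vault / ("canonical" if approve else "rejected")
    target_dir.mkdir(parents=True, exist_ok=True)
    return note.rename(target_dir / name)


# Usage against a throwaway directory; the note content is invented.
with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    proposed = propose_note(
        vault,
        "pricing-exception.md",
        "Customer approved for net-60 terms.",
        source="support session (hypothetical)",
    )
    proposed_parent = proposed.parent.name
    canonical_existed_before_review = (vault / "canonical").exists()
    promoted = review(vault, "pricing-exception.md", approve=True)
    promoted_parent = promoted.parent.name
    promoted_text = promoted.read_text(encoding="utf-8")
    pending_still_exists = proposed.exists()
```

In a git-backed vault the same two functions become “open a Pull Request” and “merge or close it,” which is why the review interface can be whatever the team already uses.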

How each system maps to these steps tells you the most about whether it’s a fit:

Mem0: step 2 strong (four-scope provenance), step 1 partial, steps 3 and 4 require custom work.
Zep: step 2 strong (episode-level provenance), step 1 partial, steps 3 and 4 require custom work.
Letta: step 2 harder (paging decisions aren’t always traceable), steps 3 and 4 require careful tool wrapping.
Cognee: step 1 strongest (document ingestion is the design center), step 2 partial, steps 3 and 4 require custom work.
Cloudflare Agent Memory: typed classification and shared profiles gesture at multi-stakeholder memory; step 4 is the gap.
Markdown vault plus semantic search: step 4 is just “humans editing a folder” or “merging a Pull Request.” That’s where this path quietly wins. Steps 1–3 require operational discipline rather than a vendor.

No managed system natively implements step 4. All five assume the agent has authority to update memory directly. The systems that come closest do so by accident (Cloudflare’s shared profiles, Mem0’s scoped models), not by design. The markdown-vault path makes step 4 a workflow choice instead of a feature request.

A decision framework for picking the right system

Read the framework as “if your situation is X, start with Y”:

Already on Cloudflare and want low-friction managed: Cloudflare Agent Memory (private beta; confirm access first).
Adaptive personalization for end-users (chatbot, support, returning users): Mem0.
Entities and relationships change over time (“who owned this account in February”): Zep / Graphiti.
Long-horizon agents needing effectively unlimited memory: Letta.
Ingesting unstructured documents, reasoning over a knowledge graph: Cognee.
Full ownership, portability, humans as first-class authors: markdown vault plus semantic search.
Already on LangChain/LangGraph or LlamaIndex: use their memory modules first; revisit only if you outgrow them.

Most mid-market deployments end up combining a markdown vault for canonical knowledge with one of the off-the-shelf layers for transient session memory. The vault holds what the team owns; the SaaS layer holds what the agent needs to remember about an active conversation. That split keeps canonical knowledge portable while letting the agent operate at the speed users expect.

Open problems in the field

The agent-memory category is roughly eighteen months old as a distinct discipline. A few caveats apply across all six paths above. No managed system natively implements the human-review-promotion gate; all assume the agent has authority to update memory directly. LOCOMO and LongMemEval are useful but easy to overfit (Cloudflare’s launch post says so directly), so treat scores as directional. Most managed services route conversation extraction through their own LLMs, which is fine for some businesses and a deal-breaker for others. None publish per-query pricing in a way that lets a buyer model real-world cost ahead of time. Cloudflare publicly committed to memory export; most others have not. Voice agent memory is a distinct emerging sub-problem.

The market gap is wide enough that one of the major systems will likely close it within twelve months.

FAQ

What is the best AI agent memory system in 2026?

There isn’t a single best. Mem0 leads on personalization and on LOCOMO benchmark scores. Zep / Graphiti leads on temporal queries and on LongMemEval. Letta leads on long-horizon agent-managed memory. Cognee leads on unstructured-document ingestion. Cloudflare Agent Memory is the most significant new managed entrant. For deployments where humans need to be first-class authors of the knowledge base, a markdown vault plus a semantic search index is often the right answer.

Is Cloudflare Agent Memory open source?

No. Cloudflare Agent Memory is a managed service in private beta as of April 17, 2026, running on Workers, Durable Objects, and Vectorize. Cloudflare has committed publicly to making customer memory exportable, but the service itself is closed-source.

What’s the difference between Mem0 and Zep?

Mem0 is optimized for personalization, remembering things about end-users across sessions, with a four-scope memory model (user_id / agent_id / run_id / app_id). Zep is optimized for temporal knowledge, tracking how entities and relationships change over time using a knowledge graph. Mem0 is faster on retrieval; Zep is more accurate on “what was true when” questions. Per published benchmarks, Mem0 leads LOCOMO and Zep leads LongMemEval.

Can I use Obsidian as memory for an AI agent?

Yes. The pattern is to maintain company knowledge as markdown in an Obsidian vault, index it with a local semantic search tool, and give the agent a query interface. Optionally, give the agent write access to a review folder where humans promote or reject new notes. A February 2026 walkthrough at eastondev.com documents one full implementation.

How do I let an AI agent update my company’s knowledge base?

Don’t let it write directly. Use a four-step bidirectional sync workflow: humans maintain the canonical knowledge base, the agent reads with provenance, the agent writes new learnings to a review folder (not the canonical base), and a periodic human review promotes or rejects them. None of the major managed memory systems implement step four natively, which is why the markdown-vault path is often the easiest fit.

If you don’t want to build this

If your business is hitting the memory wall and you don’t want to evaluate six options and stand up the bidirectional review workflow yourself, that’s the kind of work we do. We can run the memory architecture and the human-review workflow with you, so the canonical knowledge stays yours and the agent participates in the loop you already trust.

What MCP, A2A, and UCP Mean for Your Website in 2026

Sebastian Chedal — Sat, 02 May 2026 18:06:58 +0000

If you run a website in 2026, you have probably watched three different articles about MCP, A2A, and UCP scroll past in the last two weeks and wondered whether any of it changes what you should be doing this quarter. The short answer is yes, but probably less than the headlines suggest, and not in the direction the headlines point. The agentic protocol stack is real infrastructure that is now mainstream conversation, and most of the work the average website owner needs to do about it can be done in an afternoon.

Three sources published the same underlying observation within roughly two weeks of each other. Backlinko released a six-protocol primer on MCP, A2A, NLWeb, WebMCP, ACP, and UCP, framing them as “what robots.txt and XML sitemaps were to 2005 Google.” Addy Osmani, Google Cloud’s Director of Engineering, published an Agentic Engine Optimization framework along with an open-source audit tool. Conductor analyzed 13,770 domains and 17 million AI responses and named the resulting visibility layer “the parallel surface.” Three independent signals, same conclusion. Agentic protocols are now part of how websites get discovered, queried, and (eventually) transacted with by AI agents on behalf of their users.

This article is the version for the person who runs a website and wants to know which of these protocols matter for their site, which ones they can ignore, and what is reasonable to actually do about any of it before the end of the quarter.

What “Protocol-Ready” Means

Protocol-ready means an AI agent can discover, query, and (where it makes sense) transact with a website through a standardized interface, instead of scraping HTML and guessing at structure. That is the whole definition.

The closest historical parallel is the one Backlinko reaches for and gets right. Their verified framing: “Think of how robots.txt and XML sitemaps became table stakes for search crawlers. Agentic protocols are shaping up to be that for AI agents.” Robots.txt was a quiet text file that turned into existential SEO infrastructure within three years of nobody caring about it. The trajectory of the agentic protocol stack looks similar, though earlier on the curve.

The signal that this is now mainstream rather than speculative is convergence. DigitalApplied’s ecosystem map reports 97 million MCP downloads as of March 2026. Backlinko’s count of the PulseMCP directory has more than 10,000 MCP servers live as of early 2026. Conductor’s 2026 benchmark finds AI referral traffic averaging around 1% of total website traffic and growing roughly 1% per month. The 1% number is small, but the growth rate is the part to watch. The infrastructure has reached the volume where ignoring it stops being defensible, even if acting on it is still optional for most sites.

For the content-side companion to the infrastructure questions in this article, see our agentic SEO practitioner guide, which covers what to publish so AI agents can actually use it.

The Three Protocols That Matter Now (and the Three to Watch, Not Build For)

Backlinko enumerates six protocols. The count is correct as a taxonomy, and misleading as a buying recommendation. For 2026 website-scale decisions, three deserve real attention. Three more are worth tracking and nothing more.

Build for now

MCP (Model Context Protocol). The agent-to-tools layer. Anthropic launched MCP in November 2024, and it is now governed by the Agentic AI Foundation under the Linux Foundation. The standard has been adopted by OpenAI, Google, and Microsoft. If your business has any internal system you would want AI tools to query (a product catalog, a CRM, a CMS, a support knowledge base, an inventory database), an MCP server is the standard interface for exposing that system to agents. It is the only protocol on this list that has cleared “is this real” status. If you have nothing for an agent to query, you do not need MCP.

A2A (Agent-to-Agent). The agent-to-agent layer. Google launched A2A in April 2025 with more than 50 technology partners, including Salesforce, PayPal, SAP, Workday, and ServiceNow. The Linux Foundation now maintains it under Apache 2.0. A2A becomes relevant when a website operates more than one agent that needs to coordinate with another agent (yours or someone else’s). Most websites are not running multiple agents yet. If you are running one agent or none, A2A is informational. If you reach three or more by the end of 2026, you will need it.

UCP (Universal Commerce Protocol). The agent-to-commerce layer. Sundar Pichai announced UCP at NRF 2026, co-developed by Google and Shopify with launch partners including Target, Walmart, Wayfair, and Etsy, plus 20+ additional partners including Mastercard, Visa, Stripe, and American Express. UCP runs on top of OAuth 2.0 and PCI-DSS, with MCP and A2A bindings built in. UCP launched less than 14 weeks after OpenAI and Stripe announced ACP, the competing OpenAI-led commerce protocol. The two protocols overlap. UCP has the broader retailer coalition; ACP has live distribution inside ChatGPT. If your site sells products and you are picking one to keep on your radar today, UCP is the safer bet on coalition breadth.
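For a sense of what “exposing a system to agents” looks like on the wire at the MCP layer: MCP messages are JSON-RPC 2.0, and an agent invokes a tool with a tools/call request. The sketch below only builds that request as a dictionary. The tool name search_catalog and its arguments are hypothetical, and a real server would be built with an MCP SDK rather than by hand.

```python
import json


def tools_call_request(request_id, tool_name, arguments):
    """Build the JSON-RPC 2.0 message an MCP client sends to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool exposed by a product-catalog MCP server.
request = tools_call_request(1, "search_catalog", {"query": "desk lamp"})
print(json.dumps(request, indent=2))
```

The useful takeaway for a site owner is that the interface is a named tool with typed arguments, not a page to be scraped.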

Watch, do not build for yet

NLWeb. A natural-language interface for websites, created by R.V. Guha, who also created RSS, RDF, and Schema.org. Heavy pedigree. Early adopters include TripAdvisor, Shopify, Eventbrite, O’Reilly, and Hearst, announced at Microsoft Build 2025. Interesting long-term. Most websites do not need it yet.

WebMCP. A Google-and-Microsoft W3C Community Group proposal, with an early preview shipping in Chrome in February 2026. Pre-standard. Worth watching, not worth implementing this quarter.

ACP (Agentic Commerce Protocol). OpenAI and Stripe’s commerce protocol. Live in ChatGPT Instant Checkout since September 2025, with 900 million weekly ChatGPT users and a reported 4% merchant fee per Opascope’s synthesis. Real, but overlapping with UCP. If you only have budget for one commerce protocol implementation, the broader-coalition standard wins on portability.

Run This on Your Own Site: A Five-Point Readiness Check

Most websites only need to act on two or three of the five questions below. The point of running through all five is to know which two or three those are.

1. Structured-data baseline. Schema.org coverage for Organization, Product, Service, FAQPage, and Article at minimum. If your structured data is incomplete, no protocol implementation will compensate, because agents still need the structured signals underneath. Run Osmani’s agentic-seo audit tool against your own domain. The tool runs ten checks across five categories (Discovery, Content, Token Efficiency, Agent Context, AI Usability) and scores out of 100. Free, public, fifteen minutes. Run it against a competitor’s domain in the same session if you want a calibration point.

2. Content recency check. Amsive reported that 50% of AI-cited content is less than 13 weeks old. If your last cornerstone publish was six months ago, fix that before anything else. Recency is the precondition; protocols are the amplifier. Cornerstone-content cadence is a bigger lever for AI visibility right now than any single manifest decision.

3. /.well-known/ manifest decision. There are three possible manifests, and not every site should publish all three. A UCP manifest at /.well-known/ucp is relevant if you sell products online. An LLMs.txt file is relevant for content-heavy sites that want to expose a curated reading order to AI agents. An agents.md file at the repository root is relevant if your site or codebase is going to be navigated by coding agents. Most sites need one or two of these, so decide which ones earn a place rather than publishing all three by default.

4. MCP tool exposure decision. Do you have an internal API, database, or system an agent should reach? If yes, an MCP server wrapping that system is the right pattern. If no, and most brochure-site businesses are in this category, skip MCP entirely this quarter. There is no point building infrastructure for agents to use when there is nothing for them to use it for. If you do expose an internal system, build a cost circuit breaker pattern in front of it before going live. Runaway agent calls produce surprise bills.

5. Citation baseline. Before any protocol work, measure where your site is currently being cited in AI answers across Perplexity, ChatGPT, Gemini, Claude, and Google AI Mode. Conductor’s 2026 AEO/GEO Benchmarks, built on 13,770 domains and 17 million AI responses, give you the industry calibration. AI referral traffic averages around 1% of total and is growing roughly 1% per month. If you do not measure where you are cited today, you cannot tell whether anything you do tomorrow is working.

Five questions, answerable in an afternoon. Most websites only need to act on two or three of them.
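The cost circuit breaker mentioned in step 4 can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name, the rolling-window budget, and the guarded_call helper are all assumptions made for the example, and a real MCP server would wire the same check into its request handling.

```python
import time


class CostCircuitBreaker:
    """Refuses further calls once spend inside a rolling window passes a budget."""

    def __init__(self, budget_usd, window_seconds, clock=time.monotonic):
        self.budget_usd = budget_usd
        self.window_seconds = window_seconds
        self._clock = clock
        self._charges = []  # list of (timestamp, cost_usd)

    def _spent(self):
        cutoff = self._clock() - self.window_seconds
        # Drop charges that have aged out of the window.
        self._charges = [(t, c) for t, c in self._charges if t >= cutoff]
        return sum(c for _, c in self._charges)

    def allow(self, estimated_cost_usd):
        """True if the call still fits in the remaining budget for this window."""
        return self._spent() + estimated_cost_usd <= self.budget_usd

    def record(self, actual_cost_usd):
        self._charges.append((self._clock(), actual_cost_usd))


def guarded_call(breaker, estimated_cost_usd, handler):
    """Run handler() only while budget remains; handler returns (result, cost_usd)."""
    if not breaker.allow(estimated_cost_usd):
        raise RuntimeError("cost budget exhausted; refusing agent call")
    result, actual_cost_usd = handler()
    breaker.record(actual_cost_usd)
    return result
```

The design choice worth noting is that the breaker checks an estimate before the call and records the actual cost after it, so one expensive call can overshoot the budget slightly but the next one is refused.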

When you can skip this entirely. Sites with fewer than 50 indexed pages, sites in regulated verticals where agent transactions are not yet legal (regulated financial advice, healthcare prescribing, anything that requires a licensed human in the loop), and sites whose current content strategy is not producing anything citable in the first place. The structured-data and content-recency checks above will surface this quickly. If both fail, fix those first; the protocol questions can wait.

Where This Is Going (and What to Do About It)

The direction is certain and the short-term impact is modest, and that is the framing to take into your next planning meeting. Backlinko, Pipe17, and the Google Developers Blog all published their protocol primers in Q1 2026. Search Engine Journal, SEMrush, and Ahrefs will likely follow this year. Conductor has already named “the parallel surface of visibility” as the canonical 2026 framing. Protocol-readiness is going to show up as a normal RFP requirement on a 12-to-24-month horizon, not a “by July” deadline. The current AI-referral share is small. The growth rate is the part that compounds.

What is reasonable to do now if you run a website. Run Osmani’s agentic-seo tool on your domain (15 minutes). Audit your cornerstone content recency (1 hour). Decide whether you have an internal system that would benefit from MCP exposure (most websites do not, and “no” is a perfectly reasonable answer). If you sell products online, put a calendar reminder to revisit the UCP manifest question in Q3, when the retailer adoption curve will be clearer. None of this is a multi-quarter program. It is afternoon-scale work for most sites, and skip-entirely work for many of them.

We are a technology studio that builds autonomous AI systems. The readiness work in this article sits in front of the platform layer we run for clients with bigger needs (clients running production agents, exposing internal systems through MCP, or building multi-agent workflows that coordinate over A2A) at Fountain City’s managed autonomous AI agents.

FAQ

What is MCP (Model Context Protocol)?

MCP is the standardized interface AI agents use to talk to tools and data sources. Anthropic launched MCP in November 2024, and it is now governed by the Agentic AI Foundation under the Linux Foundation, with adoption from OpenAI, Google, and Microsoft. According to Backlinko’s count of the PulseMCP directory, more than 10,000 MCP servers are live as of early 2026. Practically, if you have an internal system an AI tool should query, an MCP server is the standard wrapper.

What is UCP (Universal Commerce Protocol)?

UCP is the agent-to-commerce protocol announced by Google and Shopify at NRF 2026. Launch partners include Target, Walmart, Wayfair, Etsy, Mastercard, Visa, Stripe, and American Express, with 20+ additional partners endorsing the standard. UCP runs on OAuth 2.0 and PCI-DSS and includes MCP and A2A bindings. It exists so AI agents can complete purchases on behalf of shoppers using a standardized handshake instead of brittle scraping.

What is the difference between MCP, A2A, and UCP?

MCP connects agents to tools and data. A2A connects agents to other agents. UCP connects agents to commerce checkout. Different layers of the same stack, and most websites only need one or two of them.

What does “protocol-ready” mean for a website?

Protocol-ready means an AI agent can discover, query, and (where it makes sense) transact with the site through a standardized interface, instead of scraping HTML and guessing at structure. Concretely: structured-data coverage in place, recent cornerstone content, the right /.well-known/ manifest published, and (if internal systems are involved) an MCP server with auth and rate limits.
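The structured-data part of that definition is the easiest to check mechanically. Below is a rough sketch that looks for the five Schema.org types named in the readiness check inside a page’s JSON-LD blocks. It uses only the Python standard library; the function and class names are made up for this example, it ignores Microdata and RDFa, and a real audit would union the results across many pages rather than inspect one.

```python
import json
from html.parser import HTMLParser

REQUIRED_TYPES = {"Organization", "Product", "Service", "FAQPage", "Article"}


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _walk_types(node, found):
    """Recursively gather @type values from nested JSON-LD structures."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            found.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            _walk_types(value, found)
    elif isinstance(node, list):
        for item in node:
            _walk_types(item, found)


def missing_schema_types(html, required=REQUIRED_TYPES):
    """Return the required Schema.org types not declared in the page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    found = set()
    for block in collector.blocks:
        try:
            _walk_types(json.loads(block), found)
        except json.JSONDecodeError:
            continue  # malformed block: treat it as absent
    return set(required) - found
```

Feed it the HTML of a handful of representative pages; whatever comes back in the missing set across all of them is the gap to close first.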

Is this the same as GEO or AEO?

Adjacent, not identical. GEO (Generative Engine Optimization) and AEO (Answer Engine Optimization) are about optimizing content to be cited by AI engines. Protocol readiness is the infrastructure layer underneath that: the standardized interfaces agents use to discover, query, and transact with a site. The five-point readiness check covers both, because the questions overlap.

Does my site need all six protocols?

No. For 2026 decisions, three matter (MCP, A2A, UCP), and three are worth tracking but not building for yet (NLWeb, WebMCP, ACP). Most websites only need one or two of the build-for-now three. The five-point readiness check is the way to figure out which.

When can I skip this entirely?

Sites with fewer than 50 indexed pages, sites in regulated verticals where agent transactions are not yet legal, and sites whose current content is not producing anything citable in the first place. If the structured-data and content-recency checks both fail, fix those first; the protocol questions can wait.

Claude Code and Codex Together: Driver/Worker Orchestration in Production

Sebastian Chedal — Fri, 01 May 2026 18:12:58 +0000

The pattern that has held up across complex refactors, full WordPress migrations, and ground-up SaaS rebuilds is hierarchical: Claude Code (Opus 4.7) is the driver. Codex (GPT-5.5) is the worker. Claude Code plans, calls Codex to do the heavy execution, gets the results back, reasons over them, decides what’s next.

The version stamps matter for an article like this. Opus 4.7 launched April 16, 2026. GPT-5.5 launched April 23, 2026. The framework we currently run on top of them — BEADS with Metaswarm v0.11.0 — landed mid-April.

The Quick Verdict

Workload	Where it lives	Why
Planning, architecture, ambiguous specs	Claude Code (driver)	Long-context coherence, self-verification sub-agents
Long terminal runs, mechanical execution	Codex (worker)	Sustained 45+ minute runs, ~72% fewer output tokens
Reasoning over returned work, integration, review	Claude Code (driver)	Review is folded into the driver’s loop, not a separate step
Single-tool work that fits in one context window	Either, alone	Driver/worker overhead doesn’t earn its keep

Benchmark anchors: Lushbinary, April 2026, cross-checked against FwdSlash.

What Each Is Specifically Better At (April 2026)

Where Claude Code (Opus 4.7) Wins

Practitioners running both consistently describe Claude Code as the tool for the thinking work: the ambiguous problem, the large codebase, the architecture decision that will outlast the session. Chandler Nguyen’s follow-up post in late April put it plainly after weeks of running both: “Codex took the coding seat and Claude Code took everything else.” The “everything else” covers planning, comprehension, reviewing what came back from the worker, deciding when something is actually done.

The benchmarks line up with that read. Opus 4.7 leads on SWE-bench Pro at 64.3%, SWE-bench Verified at roughly 87.6%, CursorBench at 70%, and GPQA Diamond at 94.2%. Two operational features show up in daily use beyond what those numbers capture: CLAUDE.md persistent project context (so the agent re-loads architecture decisions across sessions), and what Chandler called the killer feature, the harness spawning verification sub-agents without being asked. On long sessions, especially over 90 minutes of continuous work on the same problem, it holds the thread better than alternatives we’ve tested.

Claude Code’s token consumption is roughly 3-4x higher than Codex CLI on equivalent tasks. The harness is doing more (context preloading, sub-agent spawning, automatic verification passes) and you pay for that in tokens. For deep work, the cost is justified. For high-volume mechanical transformations, it isn’t. That gap is most of why the driver/worker split makes sense.

Where Codex (GPT-5.5) Wins

Among practitioners running both, Codex is where the long execution lives. It runs hard for stretches Claude Code wouldn’t sustain. Chandler’s experience report describes Codex working 45+ minutes continuously without losing the thread. The cloud-container architecture lets you fire-and-disconnect: hand off a task, close the laptop, come back when it’s done. That sustained-run profile is the operational reason it works as a worker. The driver doesn’t have to babysit it.

GPT-5.5 leads on Terminal-Bench 2.0 at 82.7%, OSWorld-Verified (computer use) at 78.7%, GDPval at 84.9%, and Tau2-bench Telecom at 98.0%. OpenAI says 85%+ of the company uses Codex weekly across engineering, finance, comms, marketing, data science, and product. They run it because it executes.

Token efficiency is where the gap compounds at scale. GPT-5.5 uses roughly 72% fewer output tokens than Opus 4.7 on equivalent coding tasks. When the worker is doing the bulk of the volume (terminal runs, mechanical transformations, parallelizable sub-tasks) that efficiency is what makes the dual-tool monthly bill defensible.
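A back-of-the-envelope way to see how that compounds: if the worker emits roughly 28% of the output tokens the driver would for the same task, then the more of the volume you delegate, the smaller the combined output-token count relative to running everything on the driver. The sketch below assumes the 72% figure applies uniformly to whatever is delegated, which is a simplification, and it counts output tokens only, not price.

```python
WORKER_TOKEN_RATIO = 1 - 0.72  # worker emits ~28% of the driver's output tokens


def relative_output_tokens(delegated_fraction):
    """Output tokens of a driver/worker run relative to an all-driver run (1.0)."""
    if not 0.0 <= delegated_fraction <= 1.0:
        raise ValueError("delegated_fraction must be between 0 and 1")
    kept_by_driver = 1.0 - delegated_fraction
    return kept_by_driver + delegated_fraction * WORKER_TOKEN_RATIO
```

Under these assumptions, delegating half of the volume gives 0.5 + 0.5 × 0.28 = 0.64 of the all-driver output tokens, and delegating all of it gives 0.28.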

Note: An interactive chart comparing benchmark scores appears at this point in the original article. View the chart on fountaincity.tech.

The Harness Effect (Why This Comparison Is Mostly About the Harness, Not the Model)

Matt Mayer ran the same model through two different harnesses on identical tasks: Claude Opus scored 77% in Claude Code and 93% in Cursor. Same model, same tasks, sixteen percentage points from the harness alone.

CORE-Bench reproduced the pattern more dramatically. Claude Opus scored 42% with a minimal scaffold and 78% inside Claude Code’s full harness. Thirty-six points of capability appeared from the wrapper, not the weights. Nate’s Newsletter reported the same gap in independent testing: a 36-point spread on identical tasks driven entirely by harness differences.

The harness has four components, per Jonathan Fulton’s architectural breakdown: a loop that decides when to call the model again, a context manager that handles compaction and memory, a tool registry with descriptions and schemas, and an approval system that intercepts tool calls. Codex and Claude Code converge on similar architectures here. The differences that drive the harness effect are subtler: how aggressively each one summarizes context, how many parallel sub-agents it manages, what the default tool descriptions look like, how the system prompt is structured.
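To make the four components concrete, here is a toy harness in which each one is a few lines. It is a sketch of the shape only, not how Claude Code or Codex are built: the model is a stand-in callable that returns either a tool request or a final answer, and every name here (Harness, Tool, approve) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str              # what the model would see in the tool registry
    run: Callable[[str], str]
    needs_approval: bool = False


@dataclass
class Harness:
    model: Callable[[list], dict]          # stand-in: returns {"tool", "arg"} or {"final"}
    tools: dict                            # tool registry: name -> Tool
    approve: Callable[[Tool, str], bool]   # approval system
    max_context_items: int = 20            # context manager budget
    context: list = field(default_factory=list)

    def _compact(self):
        """Context manager: keep the task line plus the most recent entries."""
        if len(self.context) > self.max_context_items:
            keep = self.max_context_items - 1
            recent = self.context[len(self.context) - keep:]
            self.context = [self.context[0]] + recent

    def run(self, task, max_steps=10):
        self.context = [f"task: {task}"]
        for _ in range(max_steps):         # the loop: decide whether to call the model again
            action = self.model(self.context)
            if "final" in action:
                return action["final"]
            tool = self.tools[action["tool"]]
            if tool.needs_approval and not self.approve(tool, action["arg"]):
                self.context.append(f"denied: {tool.name}")   # intercepted tool call
            else:
                self.context.append(f"{tool.name} -> {tool.run(action['arg'])}")
            self._compact()
        return "stopped: step limit reached"
```

The subtler differences the paragraph above describes live in exactly these knobs: how _compact decides what to drop, what the tool descriptions say, and which calls approve intercepts.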

If 16-36 percentage points of capability come from the wrapper rather than the weights, then nesting harnesses (putting one inside another in a driver/worker topology) is a way of stacking those gains, not averaging them. The driver gets the planning and integration capability of one wrapper. The worker gets the terminal autonomy and token efficiency of another. The combined system is bigger than either side, and the cross-harness review that emerges from the topology is what catches the bugs neither single harness sees.

How We Run Them Together: Driver/Worker Orchestration

The pattern is hierarchical, not parallel. Driver/Worker Orchestration: Claude Code drives. Codex executes when the driver delegates. Results return up to the driver. Working alternatives include the Planner-Driver Pattern and the Orchestrator/Worker Harness.

Layer	What happens	Why this side
Driver keeps (Claude Code)	Planning, codebase comprehension, architecture decisions, deciding what to delegate, deciding when the task is done	The driver’s job is to hold the whole picture. Long-context coherence and the self-verification sub-agents make it the right tool for the work that has to remember why earlier decisions were made.
Driver delegates to worker (Codex)	Long terminal runs, mechanical transformations, parallelizable sub-tasks, anything where 45+ minute uninterrupted execution and lower per-token cost are the right shape	The worker doesn’t need to hold the whole picture. It needs a scoped task, the ability to run hard for an hour, and the discipline to report back cleanly. Codex’s terminal autonomy and token efficiency fit that shape.
Worker returns to driver	Codex reports results, diffs, test outcomes, and any unresolved questions back up. Claude Code reads the returned work in its own context, reasons over it, integrates it, decides next steps	Review is implicit in the topology rather than a separate “cross-model review pipeline step.” The driver always re-reads the worker’s output before merging it into the plan; cross-harness coverage is a side-effect, not a manual step bolted onto the end.

The driver’s loop keeps running for the whole session. Claude Code spawns Codex, waits for it to finish, then re-engages with the returned work. The next task usually emerges from reasoning over what came back, not from a pre-planned queue. That’s why the topology compounds. Each worker run sharpens the driver’s plan; each driver decision changes the next thing the worker gets asked to do.
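Stripped of any framework, that loop looks roughly like the sketch below. The two callables are placeholders for whatever actually invokes the driver and the worker (this is not the Claude Code or Codex API), and the names are invented for the example. The point it illustrates is that the next task is computed from the accumulated history, not read off a queue.

```python
from typing import Callable, Optional


def drive(
    goal: str,
    plan_next: Callable[[str, list], Optional[str]],
    run_worker: Callable[[str], dict],
    max_cycles: int = 20,
) -> list:
    """Driver loop: plan, delegate, read the result, plan again."""
    history = []
    for _ in range(max_cycles):
        # The driver reasons over everything returned so far.
        task = plan_next(goal, history)
        if task is None:
            break  # the driver, not the worker, decides the work is done
        # Blocks until the worker finishes its scoped task.
        report = run_worker(task)
        history.append({"task": task, "report": report})
    return history
```

The max_cycles cap is an assumption added for safety; without some bound, a driver that never decides it is done would delegate forever.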

Shared context, separate context files. Claude Code reads CLAUDE.md at the project root; Codex reads from ~/.codex/skills/. Both have to know the same conventions or the worker’s output won’t fit cleanly back into the driver’s plan. Chandler’s cross-pollination workflow is the practical answer: have Codex study your existing Claude Code skills and produce equivalents under ~/.codex/skills. Same conventions, two file formats. The Skills standard is converging across both tools, but as of April 2026 you’re still translating between formats.

The cleanest version of this runs Codex from inside the Claude Code session, through an orchestration framework that handles the spawn, wait, and return. The worker doesn’t see the user; it sees the driver. The user sees only the driver. That’s what makes the loop close: Claude Code is the only thing the engineer interacts with directly.

The worker reports structured results: diffs, test results, log excerpts, unanswered questions. The driver reasons better when the worker’s return packet is shaped for reasoning rather than just for human review. This is mostly a matter of how the framework prompts the worker. Most orchestration frameworks now support structured return packets out of the box.
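One way to picture a return packet shaped for reasoning is a small typed record that serializes to JSON. The field names below mirror the four items listed above, but the schema itself is hypothetical; each orchestration framework defines its own.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class WorkerReport:
    task: str
    diffs: list = field(default_factory=list)
    tests_passed: int = 0
    tests_failed: int = 0
    log_excerpts: list = field(default_factory=list)
    unanswered_questions: list = field(default_factory=list)

    def needs_driver_attention(self) -> bool:
        """Failed tests or open questions mean the driver should not just merge."""
        return self.tests_failed > 0 or bool(self.unanswered_questions)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, payload: str) -> "WorkerReport":
        return cls(**json.loads(payload))
```

Keeping unanswered questions as a first-class field is the part that matters most: it gives the worker a sanctioned way to say "I was unsure here" instead of guessing silently.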

The Orchestration Framework Layer (BEADS+Metaswarm and the 2026 Ecosystem)

The driver/worker topology runs through a framework: the substrate that handles spawn, context handoff, structured return, and session bookkeeping so the driver can pick up where the worker left off. As of April 2026 we run on BEADS with Metaswarm v0.11.0. Metaswarm provides the multi-agent orchestration layer; BEADS handles persistent issue tracking, context priming, and semantic summarization across sessions, exposed as a Claude Code plugin. It’s what we use today. It’s not what we’ll necessarily use next month.

Framework choice is fluid in a way that didn’t exist before agentic coding. Switching between Metaswarm and an alternative is a per-project decision now, not a per-company one. You can scaffold one system, test a different framework on the next sprint, and migrate gradually if the new one earns it. The pattern (Driver/Worker Orchestration) is what holds across framework swaps.

The wider 2026 ecosystem at the harness/framework layer:

BEADS + Metaswarm: our current stack. Metaswarm’s session hooks defer to the standalone BEADS plugin for context priming and decision tracking, which means the driver can survive context compaction without losing the thread.
Archon: described in April 2026 research as the first open-source harness builder for orchestrating Claude Code and Codex together. Worth a look if you want to build your own multi-tool flow rather than wire up shell scripts.
Citadel: agent orchestration harness for Claude Code and Codex with parallel agents in isolated worktrees, four-tier intent routing, and persistent campaign memory across sessions. The closest in scope to BEADS + Metaswarm if you want a different shape on the same problem.
HumanInLoop: open-source strategy harness on top of Claude Code — DAG-based multi-agent coordination with cascade safety, focused on telling each agent what to build and why before delegation. Different angle on the orchestration question.
awesome-harness-engineering: the canonical GitHub corpus on harness patterns. First read for anyone trying to understand what’s actually being built at this layer.

The Codex CLI repo sits at 67k GitHub stars; Claude Code at 114k. The community of practice around both is active enough that the driver/worker topology is being independently rediscovered week by week. Most teams who run both for more than a month end up at some version of it.

Where We’ve Run This (Three Production Categories)

The pattern doesn’t pay for itself on small tasks. Three workload shapes earn it.

Complex code refactoring. Multi-file refactors across a large codebase, where the architecture decision drives a series of mechanical transformations downstream. The driver holds the architecture and the invariants the refactor has to preserve. The worker does the long mechanical pass, file by file, returning diffs and test results. The driver re-reads each return, catches the cases where the mechanical transformation broke an architectural assumption, and either fixes them in-place or sends the worker back with a tightened spec.

WordPress site and server migrations. Building or migrating an entire WordPress site, including the underlying server. The work is a mix of architectural decisions (theme structure, plugin selection, server topology) and long mechanical execution (block migration, content import, server provisioning, deployment scripts). The driver/worker split fits naturally: Claude Code reasons about the architecture and the migration order, Codex executes the long terminal sessions and reports back. Some of these runs go for hours.

Ground-up SaaS rebuilds. Re-platforming an existing SaaS system with upgraded security, statefulness, and reliability. The driver holds the new architecture, the security model, the state-handling decisions. The worker rebuilds modules, runs migration scripts, executes the long test passes that catch regressions. The combined session has been the highest-leverage version of the pattern we run.

The economics across these three categories: teams running this report roughly 80% higher result quality versus single-tool runs of comparable shape, with substantially more code shipped per session and a lower per-task cost (because the worker is doing the volume on the more token-efficient model). Wall-clock per session is slightly slower than single-tool runs would be (the driver/worker handoffs add a few minutes each cycle), but you do other work while the worker runs, so wall-clock isn’t the right unit. The longest single combined run we’ve executed start-to-finish was just under four hours. None of those numbers are A/B-clean; they’re what we see in practice across these three workload shapes.

The same pattern runs on our content side. Our multi-agent content pipeline runs on the same driver/worker structure (a planning agent that delegates execution to specialized workers and integrates the returned work) at a monthly cost equivalent to roughly 3 hours of a mid-level engineer. Different domain, same topology. The agent team running that pipeline is structured around the same driver/worker logic at a higher level of abstraction.

What This Costs (At Team Scale)

Scale	Monthly tooling spend	Reference point
Solo developer (driver + worker)	$120-$400	Claude Max $100-$200 + ChatGPT Plus $20 or Pro $200
4-engineer team	$480-$1,200	4× Claude Max + shared/individual ChatGPT seats
Our internal pipeline (10+ agents)	~$450-$600	Cost equivalent to roughly 3 hours of a mid-level engineer per month

A mid-level engineer fully loaded runs $150K-$200K/year, which is $12K-$17K/month. The 4-engineer dual-tool stack pays for itself with single-digit hours of replaced work per engineer per month. The only published case study at large company scale we’ve seen is Anthropic’s own Rust C-compiler internal study: roughly 2,000 sessions, ~$20K total cost, on a 100K-line codebase. That’s vendor-published economics on a single-tool engagement, useful as a reference shape for what large-scale agentic work costs.
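The "single-digit hours" claim can be checked with simple arithmetic. The sketch below assumes about 160 working hours per month, which is our assumption and not a figure from the table, and uses the 4-engineer tooling range and the loaded-cost range quoted above.

```python
HOURS_PER_MONTH = 160  # assumption: roughly 40 hours/week over 4 weeks


def breakeven_hours(monthly_tooling_usd, engineers, annual_loaded_cost_usd):
    """Hours of replaced work per engineer per month that cover the tooling bill."""
    hourly_cost = annual_loaded_cost_usd / 12 / HOURS_PER_MONTH
    tooling_per_engineer = monthly_tooling_usd / engineers
    return tooling_per_engineer / hourly_cost


best_case = breakeven_hours(480, 4, 200_000)     # cheapest stack, priciest engineer
worst_case = breakeven_hours(1_200, 4, 150_000)  # priciest stack, cheapest engineer
```

Under those assumptions the break-even lands between roughly 1.2 and 3.8 hours per engineer per month, which is consistent with the single-digit figure.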

The driver/worker version of the bill comes out lower than running everything on Claude Code, because the worker is doing the volume on the more token-efficient model.

A 90-Day Team Adoption Playbook

The driver/worker pattern is teachable, but it doesn’t install itself. Teams that adopt it cleanly tend to follow some version of this rollout.

Weeks 1-2: Get one engineer fluent on the driver

Pick the driver first. Claude Code is the safer default for the driver role for most teams, because the planning, comprehension, and review work is what the driver does and that’s where Claude Code currently leads. Get one engineer fluent before involving anyone else. Set up CLAUDE.md for your codebase. Don’t add the worker yet. The point of this phase is for the engineer to internalize what work the driver actually does and what work it should hand off.

Weeks 3-4: Add the worker inside the driver’s harness

Same engineer now adds Codex as the worker. Pick a framework (BEADS+Metaswarm, Archon, or roll your own) that handles the spawn-and-return mechanics. The single calibration question this phase answers: what work should the driver delegate, and what should it keep? The answer is codebase-specific. By end of week 4, the engineer should have a one-page allocation document that captures it. Run cross-harness review on every non-trivial PR by virtue of the topology, not as a separate step.

Weeks 5-8: Roll out to the team

Other engineers adopt the driver first, then add the worker. Publish your CLAUDE.md, your ~/.codex/skills, and your framework configuration in the repo so the team inherits the same context. Hold a weekly 30-minute review: what did the driver/worker flow catch that single-tool would have missed? What did the framework get in the way of? Adjust the framework config rather than the topology. The topology is the whole point.

Weeks 9-12: Measure and decide on the framework

Three numbers to track. Token cost split between the two harnesses (worker should be doing meaningfully more of the volume; if it isn’t, the driver is over-keeping). Pull requests per engineer per week (delta from before adoption). Regression catch rate (driver re-reads of worker output should catch things that single-tool runs would have shipped). At the 12-week mark, the decision is usually about the framework, not the topology: keep BEADS+Metaswarm, swap to Archon, or move to whatever has appeared in the months since this article was written. The topology survives the framework swap.
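Those three numbers are easy to compute once you log the inputs weekly. The record below is a hypothetical shape for that log; how you attribute token cost to each harness and how you classify a regression as caught or shipped are decisions your team has to make, and the sketch does not make them for you.

```python
from dataclasses import dataclass


@dataclass
class RolloutWeek:
    driver_token_cost_usd: float
    worker_token_cost_usd: float
    merged_prs: int
    engineers: int
    regressions_caught_in_review: int  # found when the driver re-read worker output
    regressions_shipped: int           # found after merge


def worker_cost_share(week):
    """Fraction of token spend that went to the worker harness."""
    total = week.driver_token_cost_usd + week.worker_token_cost_usd
    return week.worker_token_cost_usd / total if total else 0.0


def prs_per_engineer(week):
    return week.merged_prs / week.engineers


def regression_catch_rate(week):
    """Share of known regressions caught before merge; None if none were seen."""
    total = week.regressions_caught_in_review + week.regressions_shipped
    return week.regressions_caught_in_review / total if total else None
```

Compare prs_per_engineer against the same number from before adoption, since the raw value means little without that baseline.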

Common Pitfalls

Treating the worker as a peer. The point isn’t redundancy or parallel allocation. The worker doesn’t see the user, doesn’t hold the architecture, doesn’t decide when something is done. Treating it as a peer collapses the pattern back into the parallel version that doesn’t compound.
Skipping the result-integration step in the driver. The whole topology depends on the driver re-reading the worker’s output before integrating it. If you let the worker’s diffs auto-merge, you’ve removed most of the value.
Over-anchoring on the framework. Framework switching is cheap now. Pick one, run with it, swap it when something better lands. Don’t build the team’s entire workflow around any specific framework’s idiosyncrasies.
Ignoring token-cost monitoring. Both harnesses can spike unexpectedly. Set thresholds and alerts; the cost-control pattern is detailed in the cost circuit breaker post.

When You Should Not Use Both

The driver/worker pattern earns its overhead on a specific shape of work. Outside that shape, single-tool is the right answer.

If your work fits in one context window or sits cleanly in one category, pick the matching tool and go deep. Driver/worker pays off when the work is large enough that the driver has something to hand off; on small focused tasks or uniform workloads, the handoff overhead exceeds the gain. If your work is 100% terminal-heavy ops, Codex alone is fine. If it’s 100% deep architectural reasoning over a small codebase you can hold in your head, Claude Code alone is fine.

Teams without operational discipline for the handoff topology should skip the second tool until they have it. Running two harnesses without the driver re-reading worker output is just running two harnesses; you get the cost of both with the catch rate of one. The structural discipline matters more than the tool count.

If your team is on one tool and shipping fine, the upgrade priority is probably not adding the second tool. It’s getting better at the one you have. The harness-effect data above (16-36 percentage points hidden in better harness configuration) suggests most teams have meaningful headroom on their current tool before they need a second.

Where Fountain City Fits

We run Driver/Worker Orchestration in our own pipeline and on client engagements. We teach it through agentic coding training for development teams and agencies. When teams want the orchestration built and operated for them rather than learning to run it themselves, that’s the work behind managed autonomous AI agents (also see our agentic development service for build-only engagements). The same driver/worker logic shows up in other agent applications too — see how the pattern shows up in agentic SEO for a different domain example.

If you want to run this yourself, you have what you need. If you want help, that’s the conversation we have.

Frequently Asked Questions

Is GPT-5.5 better than Claude Opus 4.7 for coding?

Neither is uniformly better. Opus 4.7 leads on SWE-bench Pro (64.3% vs 58.6%) and on architecture-heavy benchmarks (CursorBench 70%, GPQA Diamond 94.2%). GPT-5.5 leads on Terminal-Bench 2.0 (82.7%), OSWorld-Verified (78.7%), and Tau2-bench Telecom (98.0%), and uses ~72% fewer output tokens on equivalent tasks. The right answer depends on what shape of work dominates your team. For mixed workloads, the answer is to use both, with Claude Code as the driver and Codex as the worker, per the topology described above.

Should I use Codex or Claude Code if I can only afford one?

If your work is heavily terminal-based, ops-heavy, or token-cost-sensitive, pick Codex. If your work is architecture-heavy, involves long multi-file refactors, or requires sustained reasoning over ambiguous specs, pick Claude Code. Solo developers with mixed workloads typically default to Claude Code for the planning sophistication and add ChatGPT Plus ($20/mo) only when they hit a workload Claude Code is poor at, at which point they’re effectively running the driver/worker pattern at a small scale.

Can I use Claude Code’s CLAUDE.md context with Codex?

Not directly. Codex reads from ~/.codex/skills/. The practical workaround is the cross-pollination pattern: ask Codex to study your CLAUDE.md and your Claude Code plugins, then generate equivalent skills under ~/.codex/skills. The Skills standard is converging across both tools, so over time this is becoming more portable, but as of April 2026 you’re still translating between formats.

What is the harness effect, and why does it matter for the driver/worker pattern?

The harness effect is the capability gap between the same model running in two different harnesses. Matt Mayer’s research found Claude Opus scoring 77% in Claude Code and 93% in Cursor on identical tasks, with 16 percentage points coming purely from the harness. CORE-Bench found a 36-point gap in similar testing. The implication for the driver/worker pattern: nesting harnesses stacks the harness gains rather than averaging them. The driver gets one wrapper’s planning capability; the worker gets another’s terminal autonomy and token efficiency. That’s what makes the topology compound rather than dilute.

Are there open-source alternatives to Claude Code and Codex?

Yes. OpenCode is the most prominent: open-source with an apply_patch tool tuned for Codex-model performance. Archon is the open-source harness builder for orchestrating multiple coding agents. The Skills standard (Anthropic-originated, now multi-tool) makes cross-tool portability practical. The awesome-harness-engineering GitHub repo is the canonical inventory. We currently run BEADS+Metaswarm on top of Claude Code as the driver and Codex as the worker; the framework choice is fluid.

How long does it take a team to adopt the driver/worker workflow?

Roughly 90 days from cold start to measured rollout. Two weeks for the first engineer to get fluent on the driver alone. Two more weeks to add the worker and calibrate the delegation pattern for that codebase. Four weeks of team rollout. Four weeks of measurement before deciding whether to keep the framework or swap it. The full playbook is in the section above.

Last updated: April 2026. Both Codex and Claude Code update frequently, and the framework layer (BEADS+Metaswarm, Archon, OpenCode, others) moves faster than either model. We’ll refresh this article as Opus 4.8 and GPT-5.6 land, and as the framework choice changes.

Agentic Engineering Is Here: What Karpathy’s Naming Means for Your AI Investment

Sebastian Chedal — Tue, 28 Apr 2026 18:12:15 +0000

Your team adopted AI coding tools six months ago. Are they actually faster?

If the answer is ambiguous, you’re in good company. The productivity claims for AI-assisted development have ranged from 55-88% improvement (early Copilot studies) down to negative results for experienced engineers working on codebases they know well. The gap between those numbers isn’t a measurement error. It describes two different situations, and the difference shapes every AI investment decision.

In February 2026, Andrej Karpathy gave this gap a name. He proposed retiring the term “vibe coding” and replacing it with something more precise: agentic engineering. Within weeks, monthly searches for the term grew from a few hundred to nearly 3,000. The naming stuck because the discipline behind it has its own skills, failure modes, and quality standards, distinct from both traditional software engineering and from casual AI prompting.

What Karpathy Actually Said (And Why the Language Matters)

Karpathy’s framing:

“Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering to emphasize that there is an art & science and expertise to it.”

Two things are happening in that sentence. First, the default mode of working has changed: instead of a developer writing code, a developer is directing agents that write code and then reviewing what comes back. Second, that orchestration takes expertise. It is not just a different interface for the same work. It’s a different discipline with its own skills, failure modes, and quality standards.

Vibe coding was the early name for “give the AI a rough idea of what you want and see what it generates.” It worked well for prototypes, demos, and things that didn’t need to survive contact with reality. Agentic engineering is what you need when the output has to actually hold up.

When a field gets a name that distinguishes craft from carelessness, it usually means the field is serious enough to have developed standards. That’s now true of this one.

The Productivity Paradox Business Leaders Need to Understand

The 55-88% figures come from early Copilot studies in 2023-2024. At the other end of the range, a METR study from mid-2025 found that experienced open-source developers were approximately 20% slower when using AI tools on their own codebases. The study ran 16 developers across real repositories averaging 22,000 GitHub stars, not toy projects.

Research by Yegor Denisov-Blanch at Stanford puts the median productivity lift at 10-15%, not the 55-88% figure that circulated in early coverage.

These numbers don’t contradict each other. They describe different situations. The high-end figures came from developers using AI on unfamiliar tasks: generating boilerplate, writing documentation, producing code in languages they knew less well. The lower or negative figures came from experienced developers working on complex codebases they already understood deeply. There, AI interrupted their flow more than it accelerated it.

Addy Osmani’s practitioner analysis states it directly: “Agentic engineering disproportionately benefits senior engineers. If you have deep fundamentals, you can leverage AI as a massive force multiplier.” The inverse is also true. Developers who use AI to skip fundamentals accumulate invisible debt. Code that demos fine fails six months later when something needs to change and nobody understands the underlying structure.

According to IBM’s coverage of the Stack Overflow 2025 Developer Survey, 84% of developers use or intend to use AI-assisted programming, but only 3% say they “highly trust” AI-generated output. The people closest to the tools are the least convinced by them. Seasoned engineers reported the lowest rate of high trust (2.6%) and the highest rate of high distrust (20%). The developers who are best positioned to use these tools well are also the most skeptical of what the tools produce. That caution is itself a core agentic engineering practice.

ROI from agentic engineering depends far more on the skill of the orchestrator than on the cost of the AI tools. A senior engineer or a team that has put in the deliberate practice required will get dramatically different results than someone who installed an AI extension and called it done. Tool cost is nearly irrelevant. The human running the system determines the outcome.

Two Things People Call Agentic Engineering (That Are Very Different)

The term is being used for two distinct applications. They share a methodology but produce different value and require different evaluation criteria.

The first meaning is the one Karpathy coined: an engineering team using AI agents to write, test, and refine code. The human developer orchestrates the agents, reviews outputs, sets standards, and owns the final system. This applies to software product teams building applications.

The second meaning is newer and gets far less coverage: agents performing specific business functions end-to-end. Content production, research, data analysis, customer operations, process automation. No code is being written. Business work is being done. The orchestration discipline is the same, but the domain is operational rather than technical.

If you’re evaluating a software development firm’s claim to “do agentic engineering,” you should be asking about their code review processes, their testing methodology, and how they handle agent-generated code that fails quietly. If you’re evaluating a vendor claiming to use agentic engineering for business operations, you should be asking about their quality gates, their output validation processes, and what their failure response looks like.

The skills required are also different. Agentic engineering for software development requires deep engineering fundamentals. Agentic engineering for business operations requires deep domain expertise in whatever function the agent is performing, plus the architectural knowledge to design systems that catch their own errors.

What Agentic Engineering for Business Operations Actually Looks Like

Most coverage of agentic engineering is developer-facing. The same discipline applies to ongoing business operations, and one worked example is the pipeline that produced this article.

The article you are reading started as a content brief produced by our SEO research agent. The brief contained a target keyword cluster, a competitive analysis of the top ten SERP results, and a set of source links to anchor factual claims. The brief is the spec. Without it, the writing agent would be generating content from vibes, not from data. The task is designed before the agent touches it.

Once the brief was approved, the writing agent loaded it along with the company’s brand voice rules, positioning documents, and recent article history. The agent writes a first draft, but the draft does not go to the human yet. It passes through a self-review stage where the same agent evaluates the draft against the voice guide, checking for banned patterns (guru framing, AI-sounding repetition, dramatic setups), verifying that every specific claim has a source, and flagging sections that feel thin. The review generates a report.

Anthropic’s research on multi-agent harnesses surfaces the same pattern: when an agent is asked to evaluate work it produced, it tends toward confident self-approval rather than honest critique. Their engineering team published a reference architecture for this exact challenge, a planner, generator, and evaluator in sequence, and their finding was blunt: agents that generate content “confidently praise” their own output even when quality is mediocre. The solution is architectural: separate the generator from the evaluator so they’re not the same system assessing its own work.

In our pipeline, the structural answer to this problem is adversarial review. After self-review, the draft goes to a separate review stage that evaluates it from a different angle: not “does this match the voice guide” but “does this article add something new that a reader couldn’t get from the other nine results on the SERP.” A single agent reviewing its own work will miss things. Two stages with different evaluation criteria catch more. The generator and the evaluator have to be structurally separate.

Once the review passes, the human editor, Sebastian in our case, reads the final draft. He approves, requests changes, or rejects. The human owns the output even though an agent produced the draft. The approval is not a formality. Articles come back with revision instructions regularly, and the revision loop runs until the human is satisfied.

The article then moves through art direction (image generation based on brand visual guidelines), deduplication checking (ensuring this article doesn’t repeat the same proof points as the last three published pieces), and finally publication to WordPress. At each stage, defined quality gates determine whether the article advances or goes back. The article doesn’t flow forward because someone clicked approve. It flows forward because it passed a mechanical check.

This is one article. The same pipeline runs dozens of pieces per month. The same architectural shape, spec, generate, review, gate, publish, runs our software development pipeline, our SEO research, and the systems we build for clients. The vocabulary changes (“article” instead of “PR,” “editorial review” instead of “code review”), but the engineering posture is identical.
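The spec, generate, review, gate, publish shape can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline described in this article: the `Draft` type, the two gate functions, and the banned-phrase list are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Draft:
    text: str
    sources: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)  # names of gates passed so far


@dataclass
class GateResult:
    passed: bool
    reason: str = ""


Gate = Callable[[Draft], GateResult]


def has_sources(draft: Draft) -> GateResult:
    # Hypothetical check: at least one source link anchors the claims.
    if draft.sources:
        return GateResult(True)
    return GateResult(False, "no sources attached")


def no_banned_phrases(draft: Draft) -> GateResult:
    # Placeholder list; a real voice guide would supply its own patterns.
    banned = ["game-changer", "unlock the power"]
    hits = [phrase for phrase in banned if phrase in draft.text.lower()]
    if hits:
        return GateResult(False, f"banned phrases: {hits}")
    return GateResult(True)


def run_gates(draft: Draft, gates: list[tuple[str, Gate]]) -> tuple[bool, str]:
    """Advance the draft through the gates in order; stop at the first failure."""
    for name, gate in gates:
        result = gate(draft)
        if not result.passed:
            return False, f"{name}: {result.reason}"
        draft.history.append(name)
    return True, "all gates passed"
```

The design choice worth noticing is that a draft only advances when a check returns a pass, and a failure carries a reason that can be fed back into the revision loop.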

For longer worked examples, see our case studies on the Voice Intelligence Platform (telephony + AI orchestration, zero human-written code) and the Hydraulic 3D Simulation (18,000 lines of physics code, $360 in API spend).

Agentic engineering for business operations is orchestration design. The AI capability matters, but the system design, how tasks move, how quality is assessed, how errors get caught before they propagate, is where the engineering lives.

The 5 Signs Your Team (or Vendor) Is Actually Doing Agentic Engineering

Five markers separate professional practice from label adoption:

1. They start with a spec, not a prompt. Agentic engineering requires designing the task before AI touches it: what inputs, what outputs, what quality criteria, what failure modes. If someone jumps straight to prompting without this design phase, that’s vibe coding with extra steps, not agentic engineering.
2. They review every output, every time, through a defined process of systematic validation, not spot-checks. The human owns the output even if an agent created it. A team genuinely doing agentic engineering will have a clear answer to “what is your output review process.” A team that isn’t will talk about how good the AI is.
3. They have quality gates, not just outputs. Results pass through defined criteria before moving to the next stage. These can be automated tests, structured review rubrics, or a validation step that must pass before handoff. If every stage produces output that flows directly to the next stage without validation, that’s a pipeline, not engineering.
4. They can explain what went wrong. Production agentic systems fail. The failure stories are the proof of production experience. A practitioner running real systems can tell you how a specific run failed, why it failed, and what changed in response. If someone has no failure stories, they have no production systems.
5. Their agents do boring work reliably. The best agentic systems are optimized for repeatability, not just capability. A system that produces impressive output occasionally is a demo. A system that produces good-enough output consistently is engineering. If every run requires significant cleanup, it’s not there yet.
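The first marker, a spec before a prompt, can be made concrete as a small data structure. The sketch below is one possible shape, assuming the four design questions named in the list (inputs, outputs, quality criteria, failure modes). `TaskSpec` and `ready_for_agent` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    quality_criteria: tuple[str, ...]
    failure_modes: tuple[str, ...]

    def missing_fields(self) -> list[str]:
        """Return the names of design fields that were left empty."""
        fields = {
            "inputs": self.inputs,
            "outputs": self.outputs,
            "quality_criteria": self.quality_criteria,
            "failure_modes": self.failure_modes,
        }
        return [name for name, value in fields.items() if not value]


def ready_for_agent(spec: TaskSpec) -> bool:
    # The task is handed to an agent only when every design question is answered.
    return not spec.missing_fields()
```

A spec with an empty `failure_modes` tuple would be rejected before any prompt is written, which is the point of designing the task first.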

These questions work for evaluating internal teams and vendors equally. The answers reveal whether someone has worked through the hard parts of production deployment, or is still describing what the technology is theoretically capable of.

What This Means for Your AI Budget in 2026

Agentic engineering is not a tool you buy. It’s a capability you build, hire, or contract for. The AI subscriptions are a small part of the cost. The capability to orchestrate, validate, and run systems reliably is where the investment goes. Three paths get you there:

Build the capability in-house. This requires hiring engineers who understand both the domain and the orchestration layer. Practitioner analysis suggests consistent productivity gains require roughly 30-100 hours of deliberate practice per person. This is not something that comes from onboarding documentation. Expect a real ramp time before the investment returns measurable value. The payoff, when it arrives, compounds: a senior engineer running agentic workflows can handle workloads that would otherwise require multiple people. The risk: if that engineer leaves, the capability leaves with them. For companies with thin technical teams, this is the strongest argument for the other two paths.

Train your existing team. Structured training on agentic development, how to design tasks, validate outputs, and build quality gates, accelerates the learning curve significantly. This is what our agentic coding workshops are built to do: take developers who understand their domain and give them the orchestration discipline that makes their AI use productive rather than risky. Training distributes the knowledge across the team rather than concentrating it in one person, which mitigates the key-person risk.

Contract with a team already running production systems. This is the lowest-risk path if the need is immediate. The cost is real, but you’re paying for operational depth, not just AI access. The key question to ask any vendor: “Show me a production system you’ve been running for more than six months. What failed, and what did you fix?” The answer tells you more than any capability list. If you’re evaluating this path, our agentic development services are built on production systems that have been running and failing and improving for well over a year.

Production agentic systems for business operations are not expensive to run once they’re built. A well-designed system runs at a fraction of what the equivalent manual work would cost. The investment is in building and validating the system, not in running it, and the savings hold only after that engineering work is done correctly.

The Consensus Behind the Name

Karpathy’s naming didn’t create this paradigm. It named something that was already developing. What makes early 2026 a meaningful moment is that three independent signals converged on the same conclusion within weeks of each other.

Karpathy named the discipline, speaking from the practitioner developer community. Separately, Anthropic published a reference architecture for multi-agent systems, the planner/generator/evaluator design they developed through running production multi-hour autonomous coding sessions. And Cloudflare launched their Agents Week, announcing infrastructure specifically designed for agentic workloads, built on the premise that agents require one-to-one compute isolation that the container model can’t provide efficiently at scale.

A prominent practitioner named the discipline. A leading AI lab published its reference architecture. A major infrastructure provider built the plumbing for it. When those three things happen independently within weeks of each other, the paradigm is established rather than emerging.

The question is no longer whether agentic engineering is established. It is how quickly your organization needs to develop or access the capability, and which of the three paths fits your current team and timeline.

FAQ

Is agentic engineering the same as vibe coding?

No. Vibe coding describes generating code through informal prompting without systematic validation: the AI builds something, you hope it works. Agentic engineering describes orchestrating AI agents with professional discipline: designing tasks before executing them, validating outputs systematically, and maintaining human ownership of results. Vibe coding produces prototypes. Agentic engineering produces systems that hold up.

What skills do you need to do agentic engineering?

For software development: deep software engineering fundamentals plus the discipline to design, validate, and own AI-generated outputs. For business operations: deep domain expertise in whatever function the agent is performing, plus architectural knowledge of how to build multi-agent systems with reliable quality gates. In both cases, senior-level mastery of the underlying domain is the prerequisite. AI amplifies that expertise; it doesn’t substitute for it.

How long does it take to see productivity gains from agentic engineering?

Practitioner research suggests 30-100 hours of deliberate practice before consistent gains appear. That’s per person, per domain. The gains compound over time: once the orchestration patterns are internalized, the productivity differential between AI-augmented and non-augmented work becomes substantial. Expecting immediate returns from minimal onboarding will produce disappointment, not results.

Can agentic engineering be applied to business operations, not just software development?

Yes. This is the use case that gets the least coverage. Agents can perform specific business functions end-to-end: content production, market research, data analysis, customer operations, knowledge management, process documentation. The orchestration discipline is identical; the domain expertise required shifts to match the function. We design and deploy these systems, and the methodology is the same as for software: spec the task, validate the output, gate the handoffs.

What’s the difference between agentic engineering and AI automation?

AI automation describes rule-based or AI-assisted workflows where the logic is predefined and the AI fills in specific tasks within that logic. Agentic engineering involves agents that make judgment calls, handle exceptions, and operate across long-horizon tasks with minimal handholding. The boundary is blurring, but the distinction is useful: automation executes defined steps; agentic engineering handles the steps that aren’t fully defined in advance.

How do I evaluate whether a vendor is actually doing agentic engineering?

Ask for their failure stories. Ask how their output review process works and who is accountable for results. Ask what their quality gates look like. A vendor running production agentic systems will have specific, concrete answers, including what broke, when, and what changed. A vendor who has adopted the terminology without the practice will describe capabilities and architectures. The difference in response texture is usually clear within a few questions.

Two AI Subscriptions and 150GB of Government Data: What the Mexico Breach Means for Every Business Running AI

Sebastian Chedal — Sat, 25 Apr 2026 18:07:25 +0000

Between December 2025 and February 2026, one person used two consumer AI subscriptions to breach nine Mexican government agencies, steal about 150GB of sensitive data, and expose roughly 195 million taxpayer records. No malware team. No nation-state. No custom infrastructure. A single operator, a Claude account, a ChatGPT account, and about six weeks.

The forensic detail matters because it rewrites the threat model every business running AI agents is operating under. Gambit Security’s investigation logged 1,088 attacker prompts that generated 5,317 AI-executed commands across 34 sessions, with Claude producing about 75% of the remote commands. The underlying vulnerabilities were conventional, the kind any patch cycle could have closed. What was new was the speed and the operator. That’s what this article is about.

In this article:

What actually happened in the Mexico breach, in plain language
Why HawkEye’s “persistent average attacker” concept changes the threat model for every AI deployment
Three lessons from the breach that apply directly to any business running agents
Five governance steps you can put in place this week, from a team running 9 production agents
What the EU AI Act’s August 2026 deadline means for the window you have to act

What Actually Happened

The campaign opened on December 27, 2025 with a social engineering move. The attacker contacted Mexican federal agencies claiming to be a legitimate bug bounty researcher. Once inside the network perimeter, they fed Claude a 1,084-line “hacking manual” that coached the model on operating stealthily, deleting history files, and acting as an elite offensive researcher. When Claude hit guardrails, the attacker rephrased. When it refused entirely, they switched to ChatGPT for the same task. Cross-platform evasion turned out to be trivial.

Over six weeks, the operation compromised the federal tax authority, the electoral institute, four state governments, a water utility, and a financial institution. At the tax authority (SAT), the attacker accessed 195 million taxpayer records and stood up a fake tax certificate service for monetization. In Mexico City, they used a scheduled task to install a persistent key, then took control of roughly 220 million civil records. In Jalisco, they seized an entire 13-node Nutanix cluster hosting health records and domestic violence victim data.

The scale is what the forensic report makes concrete. The attacker wrote a 17,550-line Python script (BACKUPOSINT.py) that piped stolen data through the OpenAI API for analysis, producing 2,597 structured intelligence reports across 305 internal servers. Gambit counted 400+ custom attack scripts, 301 in Bash, 113 in Python. Twenty tailored exploits targeted twenty specific, known CVEs. None of these are new categories of vulnerability. The CVEs existed before AI. The patches existed before AI. What didn’t exist before AI was a single person converting them into a working intelligence pipeline in six weeks.

As Paubox put it in their summary, “AI didn’t just assist, it functioned as the operational team: writing exploits, building tools, automating exfiltration.”

The Persistent Average Attacker

HawkEye’s analysts coined a phrase in their writeup that’s worth sitting with. In the final paragraph of their breach analysis, they wrote:

“Security teams that are still calibrating their defenses around what an elite attacker can do need to recalibrate around what a persistent, average one can now accomplish with AI assistance.”

The concept is the intellectual contribution of this incident. Security programs are built around threat tiers: script kiddies at the bottom, organized crime in the middle, advanced persistent threats at the top. Resources flow to defending against the top tier, because the top tier is assumed to be where creative exploitation, novel tooling, and team-level output live. The Mexico breach inverts that. A single person with a $20/month subscription produced team-level output. The operator wasn’t elite. They were patient.

The supporting data is consistent across independent sources. Arkose Labs surveyed 300 enterprise leaders and found 97% expect a material AI-agent-driven security or fraud incident within 12 months, with nearly half expecting one within six. Google’s Cybersecurity Forecast 2026 reports that more than 80% of employees use unapproved AI tools at work, with fewer than 20% using only company-approved AI. Bessemer’s 2026 analysis cites IBM’s Cost of a Data Breach Report showing shadow AI breaches cost an average of $4.63 million, about $670,000 more than a standard breach.

None of those numbers describe a sophisticated adversary. They describe ordinary people with consumer AI tools operating at scales that used to require teams.

Three Lessons From the Breach

Three patterns in the Mexico incident generalize to any business running AI.

1. AI Tools Can’t Tell Authorized From Unauthorized Use

Claude didn’t know it was helping an attacker until the conversation pattern tripped a safety heuristic. When it refused, the attacker rephrased. When Claude refused again, the attacker moved the same task to ChatGPT. This is an important thing to internalize: model safety training is probabilistic, and an operator who treats guardrails as obstacles to route around will, given enough tries, route around them. Model vendors are aware of this and Anthropic actually kicked the attacker off twice. The attacker just came back with a new account.

For businesses, the implication is not “pick a safer model.” Every major provider has the same property. The implication is that model-level safety is one layer among several, and it cannot be the only layer. Anything you rely on a model refusing to do should also be something your infrastructure refuses to execute.
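One way to read “your infrastructure refuses to execute” is an allowlist that sits between the agent and the shell, so a command the model was talked into proposing still gets rejected. The sketch below is illustrative only: `ALLOWED_COMMANDS`, `CommandRefused`, and `authorize` are hypothetical names, and the allowlist contents are placeholders.

```python
import shlex

# Hypothetical allowlist; a real deployment would scope this per agent and task.
ALLOWED_COMMANDS = {"ls", "cat", "grep"}


class CommandRefused(Exception):
    """Raised when a proposed command is not permitted by infrastructure policy."""


def authorize(command_line: str) -> list[str]:
    """Return the parsed argv if the program is allowlisted, else raise."""
    argv = shlex.split(command_line)
    if not argv:
        raise CommandRefused("empty command")
    if argv[0] not in ALLOWED_COMMANDS:
        raise CommandRefused(f"{argv[0]!r} is not on the allowlist")
    # The caller should execute argv directly (no shell), so operators such as
    # '&&' or '|' are passed as plain arguments rather than interpreted.
    return argv
```

The check does not depend on the model's judgment at all, which is what makes it a separate layer rather than a second opinion.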

2. The Vulnerabilities Were Old. The Attack Speed Was New.

The twenty CVEs the attacker exploited were standard. They had patches available. The government agencies had the same profile any mid-market company has: a backlog of known vulnerabilities, limited patching bandwidth, and the assumption that exploitation of conventional bugs is slow enough to catch in a review cycle. What AI changed was the compression of the exploit-to-exfiltration timeline. Going from a vulnerability assessment to a working exfiltration path now fits in a single afternoon instead of requiring a multi-week project.

If your organization runs a mature vulnerability management program, the pace of that program may no longer match the pace of attack. If your organization runs an immature one, the gap is worse. The practical consequence is that “we’ll patch it in the next cycle” is no longer a defensible answer for anything that’s both exposed and exploitable.

3. A Single Operator Produced Team-Level Output

The 305 servers, 2,597 intelligence reports, and 400+ attack scripts would, pre-AI, require a team. Here they came from one person. This compression of attacker capability is permanent. It is not a one-off. The playbook is now public, which means the technical barrier to repeating it is how quickly a motivated operator can read a few forensic writeups.

For defense, this means the traffic profile of an attack may no longer match the expected signature of a solo actor. An alerting system that triages “probable bot scan,” “probable insider error,” and “probable team-scale operation” needs to rethink the last category. A lot of future incidents will look like team-scale operations conducted by one person.

What This Means If You’re Running AI Agents

There’s a clean asymmetry between how this breach is usually read and how business leaders deploying agents should read it. The usual reading is “attackers are using AI, so I need better defensive AI.” The more useful reading is that the breach is a preview of what an ungoverned agent inside your own environment can do when something goes wrong, whether that something is a compromised prompt, an embedded malicious instruction in a document, or a confused integration.

A production AI agent is, by design, an operator. It has credentials, it acts on systems, it chains tool calls, and it’s fast. If an attacker can use consumer AI from outside your perimeter to compromise government networks, the risk profile of an AI agent you’ve already placed inside your perimeter, connected to production systems, is not smaller. It’s the same capability, pointed inward.

Three risk categories are worth naming for any business running agents:

Agents as targets. Prompt injection, tool-call hijacking, and data exfiltration through an agent’s own legitimate channels. The attacker doesn’t breach your perimeter, they submit a support ticket.
Agents as amplifiers. An agent with broad permissions plus a compromised instruction equals an internal Mexico breach at compressed speed. This is the scenario Bessemer’s analysis highlighted when citing McKinsey’s “Lilli” AI platform being compromised by an autonomous agent in under two hours.
Shadow agents. The Google statistic (80% of employees using unapproved AI) translates directly into people standing up agents with personal accounts, connecting them to company data through browser extensions, MCP servers, and SaaS integrations, with no IT visibility.

Arkose’s survey is worth reading alongside this. 57% of organizations have no formal governance controls for AI agents. Only 6% of security budgets are allocated to AI-agent risk. The gap between expected incidents (97%) and allocated resources (6%) is the gap every mid-market security program is quietly running today.

Five Things to Do This Week

We run 9 production AI agents at Fountain City on a documented governance architecture that costs us roughly $450 to $600 per month to operate. The specific thresholds, circuit-breaker design, and trip logic are documented in our cost circuit breaker article, and the broader hardening stack lives in our AI agent security hardening guide. The five items below are the concrete governance moves that map directly onto the failure modes the Mexico breach illustrated, written for a business leader who has an agent program and wants to tighten it this week.

1. Inventory Every Agent, Tool, and AI Subscription

You can’t govern what you haven’t counted. The inventory is not just the agents IT approved. It’s every browser extension using OpenAI, every Claude subscription on a corporate card, every Zapier flow with an AI step, every sales rep using a “just for notes” AI notetaker that is, technically, a recording and transcription agent connected to your meetings. If the Google statistic holds in your company, the real count is four to five times whatever IT has on its list.

A week-one inventory doesn’t need to be perfect. It needs to exist, be dated, and get reviewed.
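A dated, reviewable inventory can start as a list of records. The following is a minimal sketch, assuming a 90-day review window, which is an illustrative number and not one from this article. `AIAsset`, `needs_review`, and `unapproved` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AIAsset:
    name: str
    kind: str              # e.g. "agent", "subscription", "extension", "integration"
    owner: str
    approved_by_it: bool
    data_access: str       # free-text note on what company data it can reach
    last_reviewed: date


def needs_review(asset: AIAsset, today: date, max_age_days: int = 90) -> bool:
    """True when the entry has not been reviewed within the allowed window."""
    return (today - asset.last_reviewed).days > max_age_days


def unapproved(assets: list[AIAsset]) -> list[str]:
    """Names of assets in use that IT never approved."""
    return [a.name for a in assets if not a.approved_by_it]
```

Even this much gives the three properties the inventory needs: it exists, each entry carries a date, and stale entries can be listed for review.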

2. Put a Spending Cap on Everything That Calls an API

The Mexico attacker had no spending cap. If they had, the 5,317 commands and 2,597 intelligence reports would have tripped a halt well before the breach completed. Runaway cost is the most reliable early signal of misuse, whether the misuse is a bug, a compromised prompt, or an insider experimenting outside policy.

Our thresholds are documented in the cost circuit breaker article linked above. The exact numbers matter less than the fact that they exist and enforce. If your current architecture can’t halt an agent on spend, that’s a week-one fix.
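A spend cap that halts an agent can be sketched as a small circuit breaker. This is not the design from the cost circuit breaker article. It is a minimal illustration with hypothetical names (`SpendBreaker`, `BudgetExceeded`) and the assumption that each API call reports its cost in dollars before it runs.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push spend past the cap, or after a trip."""


@dataclass
class SpendBreaker:
    cap_usd: float
    spent_usd: float = 0.0
    tripped: bool = False

    def charge(self, cost_usd: float) -> None:
        """Record a cost, or trip the breaker and refuse if it exceeds the cap."""
        if self.tripped:
            raise BudgetExceeded("breaker already tripped; manual reset required")
        if self.spent_usd + cost_usd > self.cap_usd:
            self.tripped = True
            raise BudgetExceeded(
                f"charge of {cost_usd:.2f} would exceed cap of {self.cap_usd:.2f}"
            )
        self.spent_usd += cost_usd
```

Once tripped, the breaker stays open until a human resets it, so a runaway loop cannot resume by retrying with smaller requests.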

3. Pin Models and Keep Low-Cost Models Out of Critical Roles

Model selection is a security decision, not just a cost decision. Pin specific model versions to specific tasks, so a capability change in the model doesn’t silently expand what your agent can do. And don’t let the cheapest models run anything critical. Lower-tier models are more prone to pattern errors and more susceptible to prompt injection, which means giving them access to production systems or sensitive data is a policy decision that should be made explicitly, not by default.

General rule: the model tier should be calibrated to the blast radius of the task, not to the price list.
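The rule can be expressed as a policy table that is validated before deployment. In the sketch below, the model identifiers, task names, and tier labels are all invented for illustration. The only idea taken from the text is that each task pins one exact model version and states the minimum tier its blast radius demands.

```python
from enum import IntEnum


class Tier(IntEnum):
    LOW = 1
    MID = 2
    HIGH = 3


# Hypothetical pinned model identifiers; real ones come from your provider.
MODEL_TIERS = {
    "example-small-2026-01-15": Tier.LOW,
    "example-large-2026-02-01": Tier.HIGH,
}

# Each task names one pinned model and the minimum tier required for it.
TASK_POLICY = {
    "summarize_public_docs": ("example-small-2026-01-15", Tier.LOW),
    "update_customer_records": ("example-large-2026-02-01", Tier.HIGH),
}


def policy_violations(policy, model_tiers):
    """Return tasks whose pinned model is unknown or below the required tier."""
    bad = []
    for task, (model, required) in policy.items():
        tier = model_tiers.get(model)
        if tier is None or tier < required:
            bad.append(task)
    return bad
```

Because unknown identifiers count as violations, a floating alias such as a hypothetical `example-latest` fails the check, which forces the pin to be explicit.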

4. Require Comprehensive Audit Trails

The Mexico breach was discovered in part because the attacker’s own conversation logs were publicly accessible from a misconfigured server. That’s the low bar. The high bar is: every prompt into every production agent, every tool call it makes, every data source it touches, every output it produces, logged in a form that supports both real-time anomaly detection and after-the-fact forensics.

This is boring, expensive, and non-negotiable. If a future incident traces back to one of your agents, the first question will be “show me what it did.” The answer “we don’t have logs going back that far” is the answer that becomes the press quote.
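The high bar described above maps naturally onto structured, append-only log lines. Here is a minimal sketch using only the Python standard library. The four event types mirror the list in the text (prompt, tool call, data source, output), while the field names and the `audit_record` and `append_audit` helpers are assumptions made for the example.

```python
import json
from datetime import datetime, timezone

EVENT_TYPES = {"prompt", "tool_call", "data_access", "output"}


def audit_record(agent_id, event, payload, now=None):
    """Build one JSON line describing a single agent event."""
    if event not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event!r}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "ts": timestamp,
        "agent_id": agent_id,
        "event": event,
        "payload": payload,
    }
    # sort_keys keeps the line format stable for diffing and parsing.
    return json.dumps(record, sort_keys=True)


def append_audit(path, line):
    """Append the record to a log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

One JSON object per line keeps the log usable both for streaming anomaly detection and for later forensic queries. Retention and tamper protection are separate concerns this sketch does not address.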

5. Separate Agent Permissions by Task

The government agencies gave broad system access to accounts that ended up compromised. The lesson is the oldest one in security, just applied to a new class of principal. Each agent should get only the permissions it needs for its specific job. Read-only where read-only works. Per-environment scoping where cross-environment access isn’t required. Timeouts on sessions so a compromised agent doesn’t have an unlimited runway.

Least privilege isn’t just for employees anymore. An agent is an actor with credentials. Treat it as one.
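Task-scoped permissions with a session timeout can be modelled as a grant that is checked on every action. The sketch below assumes a simple model in which an agent holds one grant per environment. `AgentGrant` and `is_allowed` are hypothetical names, and the action strings are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AgentGrant:
    agent_id: str
    environment: str                 # e.g. "staging" or "production"
    allowed_actions: frozenset[str]  # only what this agent's job requires
    expires_at: datetime             # session timeout


def is_allowed(grant: AgentGrant, action: str, environment: str, now: datetime) -> bool:
    """Deny by default: the grant must be unexpired, in scope, and list the action."""
    return (
        now < grant.expires_at
        and environment == grant.environment
        and action in grant.allowed_actions
    )
```

A read-only reporting agent would hold a grant whose `allowed_actions` contains only a read action, so a write attempt fails even if the agent's instructions were compromised.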

The Window Is Closing, But Not for the Reasons You Think

The urgency here is that the Mexico breach is now a template. Every forensic writeup, every reconstruction of the attacker’s workflow, every public conference talk about the incident shortens the distance between “motivated operator” and “working offensive pipeline.” The technical floor has dropped.

The regulatory floor is rising at the same time. Most of the EU AI Act’s obligations begin to apply in August 2026. For any business with European exposure, that’s a hard date by which “we were still figuring out governance” stops being an acceptable answer. For US-only businesses, the state-level regulation following EU precedent will run on a similar timeline, measured in quarters, not years.

The companies that will scale AI agents safely are the ones that treat governance as part of the build, not part of the cleanup. The rest will be case studies. You probably already know which one you want to be. The question is whether you have your inventory done, your spending caps live, your models pinned, your logs complete, and your permissions scoped, by the end of the quarter.

If you want a second set of eyes on where your program sits against this threat model, our AI Risk and Security Assessment is the structured version of the conversation we’re having in the second half of this article. It covers inventory, spending posture, model selection, logging depth, and permission scoping against your actual deployment.

Frequently Asked Questions

Was the Mexico breach carried out by a sophisticated hacker?

No. According to Gambit Security’s forensic analysis, the operation was run by a single individual with no identified nation-state or organized crime connection. The attacker used consumer Claude and ChatGPT subscriptions, exploited twenty known CVEs with existing patches, and relied on AI to generate the custom tooling. The significance of the incident is that it didn’t require sophistication.

Can consumer AI tools like Claude and ChatGPT be used to attack my business?

Yes, but the pattern to worry about is not “AI creates novel vulnerabilities in your systems.” It’s “AI dramatically compresses the time from discovering a conventional vulnerability in your systems to exploiting it.” The defensive implication is that patching cadences, alert latencies, and vulnerability management cycles that were adequate at pre-AI attacker speed may no longer be adequate at post-AI attacker speed.

What is the “persistent average attacker” and why does it matter?

The phrase comes from HawkEye’s analysis of the Mexico breach. It describes an operator who is not elite, not backed by a team, and not using novel techniques, but who is patient and equipped with AI. The reason it matters is that most security programs are calibrated around sophisticated adversaries. The Mexico incident demonstrated that an ordinary person with consumer AI tools can now produce team-level output. Defenses calibrated only for the top of the threat pyramid will underprotect against the much larger population that just got an order-of-magnitude capability boost.

How much does AI agent governance actually cost?

Less than people assume. Our own governance stack (logging, cost circuit breakers, model pinning, audit trails) runs at a small percentage of total operating cost across the agents we run in production. Governance is a small line item, and a small fraction of the cost of even a minor incident. IBM’s 2025 data, cited by Bessemer, puts shadow AI breaches at about $4.63 million per incident on average.

Does a small or mid-size business need to worry about this?

Yes. Mid-market companies typically have more ungoverned AI usage than enterprise, with fewer resources to detect misuse. The Google statistic (80% of employees using unapproved AI) holds across company sizes, which means the inventory problem is proportionally worse at smaller organizations that don’t have a dedicated AI governance function. The good news is that the first three of the five governance moves above are operational, not technical, and can be started this week without any new tooling.

What’s the single most important thing to do right now?

Inventory. You can’t cap spend, pin models, log activity, or scope permissions for agents you don’t know exist. Every governance move downstream depends on knowing what’s running. Start there, and the rest of the program has somewhere to attach.

"Build, Don't Buy" AI Agents: A Practitioner's Guide to Replacing SaaS

Sebastian Chedal — Thu, 23 Apr 2026 18:09:18 +0000

The Build vs. Buy Question Has Changed

Two signals landed in the same week. A CIO.com report showed enterprises spending $280 million annually on 600+ SaaS applications. And a solopreneur documented 33 custom AI agents running her entire business for $10-20 a month.

Enterprise and solo operators arrived at the same question independently: why am I paying for software I barely use when I could build exactly what I need?

The old rule was simple. Buy software for anything that isn't your core competency. It was good advice when building meant hiring a development team, managing servers, and maintaining code. But AI agents have shifted the economics. A custom agent that does one job well can now cost less to build and run than the SaaS subscription it replaces.

That doesn't mean "always build" is the new rule. It means the decision framework has changed, and most of the content out there is either a vendor selling you their platform or a dev shop selling you a build engagement. What follows is the practitioner's version, based on building these systems for clients and running them internally.

The SaaS Replacement Decision Framework

Build-vs-buy is a decades-old IT decision. Jason Lemkin's 90/10 rule (buy 90% of your tools, build the 10% where custom agents deliver disproportionate value) is directionally correct for the AI agent era. The CIO.com enterprise analysis focuses on spend optimization at scale. Both frameworks answer "should I consider replacing SaaS with agents?" What they don't answer is which specific tools to replace, and in what order. That's the practitioner gap. The four factors below are what we use to evaluate every SaaS tool in a client's stack. They're derived from the same economic logic as Lemkin's rule and the CIO analysis, but refined by what we've actually seen in production builds.

Factor 1: Feature Utilization Rate

Large enterprises run 600+ SaaS applications. Mid-market companies maintain smaller stacks, but the pattern is the same: for any given tool, the typical team uses 10-15% of available features. You're paying for a content platform with 200 features when you need 12 of them. A custom agent built around those 12 features costs a fraction of the subscription and does exactly what your workflow requires.

The trigger: if your team has never opened half the tabs in a tool's interface, that tool is a replacement candidate.

Factor 2: Data Lock-in Exposure

Some SaaS tools hold your data in formats that make leaving expensive: CRM systems with years of interaction history, or project management tools where your entire operational knowledge lives in proprietary fields. Take a client whose entire sales history lives in a CRM's proprietary deal stages. Migrating that data to a new system means manually remapping three years of pipeline data, custom fields, and automation triggers. The longer you stay, the more leverage the vendor has on pricing. A custom agent that processes and stores data in formats you control eliminates that lock-in. This factor weighs heavier the more proprietary data the tool accumulates.

Factor 3: Integration Friction

Count how many Zapier connections, middleware layers, or custom API bridges you maintain to keep your tools talking to each other. Each integration is a maintenance surface and a failure point. One client maintained six Zapier connections and a custom webhook to keep their CRM, invoicing, and website analytics in sync. When one connection broke, the downstream data was silently wrong for two weeks before anyone noticed. When three SaaS tools need a middleware layer to work together, the total system cost includes the tools, the middleware, and the engineering time to keep the connections running.

A purpose-built agent that handles the entire workflow natively eliminates the integration layer. The savings compound as the number of connected tools grows.

Factor 4: AI Readiness of the Vendor

This one comes from Jason Lemkin at SaaStr: "If it's February 2026 and your product has zero AI features, that's your signal to start building." A SaaS tool that hasn't shipped meaningful AI capabilities by now is running on legacy architecture. That vendor is either unable or unwilling to evolve. Your custom replacement will outpace them within months.

But there's a nuance. Some vendors have shipped AI features, but they're shallow: a CRM whose "AI-powered insights" are really just a GPT wrapper over your data, or a content platform whose "AI writing" produces generic copy with no access to your brand voice rules, no integration with your knowledge base, and no connection to the rest of your content workflow. The useful version of AI readiness is a spectrum: no AI features at all (clear replacement candidate), bolted-on AI (checkbox feature, not workflow-integrated, limited utility), and deeply integrated AI (core to the product, meaningfully changes how you use the tool). Only the third category is a strong argument for keeping the SaaS tool. The second is actually the most dangerous, because the vendor can claim "we have AI" while the actual capability is superficial, and the buyer feels locked in because "they're working on it."

Score each tool against these four factors. Two or more red flags and the tool belongs on your replacement shortlist. Gartner projects 35% of current SaaS tools will be replaced or absorbed by 2030, and the companies making that shift are the ones evaluating their stacks methodically rather than reactively.
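
The scoring rule is simple enough to sketch in a few lines. The version below is one possible encoding, not our production tooling. The field names are hypothetical, the 20% utilization cutoff is borrowed from the Week 2 checklist later in this guide, and treating bolted-on AI as a red flag is an assumption based on the spectrum described above.

```python
from dataclasses import dataclass


@dataclass
class ToolAssessment:
    """One SaaS tool scored against the four factors (field names are illustrative)."""
    name: str
    utilization_rate: float         # share of features the team actually uses, 0.0 to 1.0
    proprietary_data_lock_in: bool  # data held in formats that are costly to leave
    needs_middleware: bool          # Zapier, custom bridges, or similar glue required
    ai_readiness: str               # "none", "bolted_on", or "integrated"


def red_flags(tool: ToolAssessment) -> list[str]:
    """Return the names of the factors this tool fails."""
    flags = []
    if tool.utilization_rate < 0.20:        # assumed cutoff
        flags.append("feature utilization")
    if tool.proprietary_data_lock_in:
        flags.append("data lock-in")
    if tool.needs_middleware:
        flags.append("integration friction")
    if tool.ai_readiness != "integrated":   # assumption: bolted-on AI still counts as a flag
        flags.append("AI readiness")
    return flags


def is_replacement_candidate(tool: ToolAssessment) -> bool:
    """Two or more red flags puts the tool on the shortlist."""
    return len(red_flags(tool)) >= 2


# The content platform from Factor 1: 12 of 200 features in use.
content_platform = ToolAssessment("content platform", 12 / 200, False, True, "integrated")
```

Returning the flag names rather than a bare count keeps the reasoning visible when the shortlist is reviewed.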

The Framework in Practice: A Real Build Decision

A client needed a data intelligence platform that provides full customer journey analytics across five interconnected systems: HubSpot (CRM, deals, marketing), QuickBooks (invoicing, revenue), WooCommerce (e-commerce orders), website analytics (visitor behavior, forms, repeat visits), and ad platforms (LinkedIn/YouTube retargeting with UTM tracking).

The feature list was ambitious: complete customer journey visualization across every touchpoint, individual customer journey flow charts, path-to-product analysis (what journey leads to a specific product purchase), UTM source-to-sale attribution, action-to-conversion analysis (which behaviors predict purchase), ML prediction on future customer actions, and conversational BI that lets you talk to the data in natural language with charts and tables generated in chat.

The build uses Grist (open-source, self-hosted spreadsheet/database) as the data layer, connecting to all five systems through APIs, with AI agents handling conversational analytics and prediction. The project is in final testing, with most main features built.

Before building, we researched what the SaaS equivalent would cost.

No single SaaS platform covers the full scope. The client would need 2-4 platforms combined:

A lean mid-market stack (Mixpanel or Amplitude, HubSpot Pro, a BI/chat layer): roughly $5,000-$20,000+ per year, depending on event volume and seats.
A revenue/marketing ops stack (HubSpot Enterprise, attribution tool, BI/chat layer): roughly $15,000-$60,000+ per year.
An enterprise journey suite (Adobe Customer Journey Analytics or Qualtrics XM/CX): $25,000-$200,000+ annually, often much higher with implementation.

Setup effort for the SaaS route adds 60-150+ hours of cross-system implementation to unify QuickBooks, HubSpot, WooCommerce, website events, UTMs, and retargeting touchpoints. The hard part isn't clicking buttons in the product. It's identity resolution, naming conventions, backfills, event design, data QA, and reporting logic.

The client had never built this capability before because the SaaS route was unaffordable.

Scored against the four factors:

Feature utilization: Low. No single SaaS tool covers the full scope (journey analytics, CRM, invoicing, attribution, conversational BI, ML prediction). The client would use a fraction of each platform and still have gaps.
Data lock-in: High risk. Customer journey data fragmented across 2-4 vendors in proprietary formats. Leaving any one of them means losing part of the customer picture.
Integration friction: Extreme. The SaaS research estimated 60-150+ hours just for cross-system identity resolution and data integration. Each platform connection is a maintenance surface.
AI readiness: Weak in mid-market tools. Conversational BI and ML prediction are either premium add-ons, require separate platforms, or don't exist in the tools that cover the other needs.

All four factors flagged red. The framework predicted that building would win on every dimension.

The actual build cost: under $10,000 for design, development, and testing. Monthly operating cost: roughly $150 (hosting at roughly $100/month plus AI tokens at roughly $50/month once usage stabilizes; first-month token costs run higher, at roughly $200, during setup and tuning). Annual operating cost: roughly $1,800/year.

The comparison is stark. Year 1: roughly $11,500 total (build + operating) versus $11,000-$35,000 for the leanest SaaS option (subscription + 60-150 hours of setup labor at $100/hour). The enterprise SaaS route ($25,000-$200,000+ annually plus implementation) doesn't bear comparison. Year 2 onward: roughly $1,800/year versus $5,000-$20,000/year in SaaS subscriptions, which will have increased by then. The gap widens every year.
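
The Year 1 arithmetic can be reproduced directly. In the sketch below, the build cost is entered at its $10,000 ceiling, so the custom figure is an upper bound and lands slightly above the "roughly $11,500" quoted above. The extra token spend in the first month is left out for simplicity. The function names are illustrative.

```python
HOURLY_RATE = 100  # setup labor rate used in the comparison above, in dollars


def custom_year_one(build_cost: float, monthly_operating: float) -> float:
    """One-time build plus twelve months of hosting and tokens."""
    return build_cost + 12 * monthly_operating


def saas_year_one(annual_subscription: float, setup_hours: float) -> float:
    """Annual subscription plus setup labor at the assumed hourly rate."""
    return annual_subscription + setup_hours * HOURLY_RATE


custom_upper = custom_year_one(build_cost=10_000, monthly_operating=150)
saas_low = saas_year_one(annual_subscription=5_000, setup_hours=60)
saas_high = saas_year_one(annual_subscription=20_000, setup_hours=150)

print(f"Custom build, upper bound: ${custom_upper:,.0f}")
print(f"Lean SaaS stack: ${saas_low:,.0f} to ${saas_high:,.0f}")
```

Even at the build-cost ceiling, the custom route sits near the bottom of the lean SaaS range in Year 1, and only the $150/month operating cost recurs afterward.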

The client now has full customer journey analytics, conversational BI, ML prediction, and cross-system attribution, capabilities that in the SaaS world either don't exist in the mid-market tier or require $25,000+ enterprise suites. The custom build connects all five systems natively through a single data layer, eliminating the middleware and identity-stitching overhead that makes the SaaS route expensive.

What to Build First: The Replacement Sequence

The biggest mistake in SaaS replacement is starting with the highest-stakes tools. Companies that try to replace their customer support platform or CRM first tend to stall. The implementation is complex, the failure consequences are visible, and the team hasn't built any operational muscle for running custom systems.

A better sequence:

Tier 1: Internal tools you touch daily. Reporting dashboards, research workflows, content production, internal knowledge bases. These affect only your team. If something breaks, the customer never sees it. This is where you learn how to operate custom agents with minimal risk.

We followed this progression ourselves. Our first custom agents replaced internal content production workflows — research aggregation, draft generation, cross-article quality checks. The stakes were low enough to learn from every failure, and the operational patterns we developed there became the foundation for everything we build for clients.

Tier 2: Customer-adjacent tools. CRM enrichment, lead scoring, proposal generation, support triage that routes to humans. These touch customer data but don't face customers directly. Failures are catchable before they reach anyone external.

Tier 3: Customer-facing tools. Portals, communication interfaces, interactive tools. Only attempt these after you've operated Tier 1 and Tier 2 systems long enough to understand the maintenance patterns. SaaStr's Jason Lemkin replaced a sponsors portal that had been costing $5,000-$10,000 annually, but he did it after months of building internal tools first.

The principle is straightforward: start where the cost of failure is lowest and the learning value is highest.

What NOT to Build: The Keep List

The honest answer to "build or buy" includes a list of things you should never build, even when the technology makes it possible.

Compliance and regulatory tools. SOC2 audit trails, GDPR consent management, HIPAA documentation. The value of these tools is the vendor's legal and compliance team maintaining them as regulations change. Building your own means hiring that compliance expertise permanently.

Payment processing. Stripe, payment gateways, financial transaction systems. The security, fraud detection, and regulatory requirements make this a permanent cost center with no upside in building custom.

Identity and authentication. SSO providers, multi-factor auth, credential management. The attack surface is enormous and the liability is existential. Let specialists handle this.

Platform-native tools where the platform IS the value. If your entire sales operation runs on Salesforce, building a Salesforce replacement isn't a SaaS substitution. It's a business migration. These are different decisions with different economics.

Tools where vendor-managed security is the product. Email security, endpoint protection, network monitoring. You're paying for the vendor's threat intelligence and response team, not just the software.

SaaStr's "90/10 rule" is directionally correct: buy 90% of your tools, build the 10% where custom agents deliver disproportionate value. The framework above helps you identify which 10%.

The Agency Dimension: Build Once, Deploy for Ten Clients

The article so far frames build-vs-buy as a single-company decision. But agencies face a second dimension: should I build agent capabilities I can resell to my clients?

The economics are fundamentally different. An agency that builds a custom research agent for one client can deploy variants for ten clients. A $15,000 build that serves 10 clients at $500/month each pays for itself in three months and generates recurring revenue after that. The build cost amortizes across the client portfolio in a way that makes no sense for a single company.

The alternative is reselling a SaaS platform with agency branding. That makes the agency a middleman adding margin, not a builder creating proprietary value. When the SaaS vendor raises prices or changes features, the agency has no control. A custom build gives the agency full control over pricing, features, and the client relationship.

We see this pattern directly in our work. Agencies come to us because they want to offer AI agent capabilities to their clients without being dependent on a SaaS platform they don't control. The build-vs-buy framework applies the same way, but the breakeven math is faster because the build serves multiple revenue streams.

The Middle Ground: No-Code Agent Platforms

The choice isn't strictly binary. No-code agent platforms (Relevance AI, CrewAI, and similar tools) sit between full custom builds and off-the-shelf SaaS. They work well for simple, single-agent workflows with standard integrations: a research agent that queries public data, a content summarizer that processes feeds, a lead qualifier that works within your existing CRM.

They break down when you need multi-agent coordination, custom quality gates, deep integration with your specific data systems, or workflows that span multiple business domains. That gap (complex, multi-system, domain-specific agent work) is where custom builds operate. The four-factor framework still applies. If a no-code platform covers your needs without the lock-in and friction problems, it's a valid option. If it doesn't, you're back to the build-vs-SaaS decision.

The Real Economics: Build Costs vs. SaaS Subscriptions

Most cost comparisons in this space are unreliable. Enterprise vendors claim building costs $8.3 million over three years. Solopreneurs claim $10 a month in API costs. The reality depends entirely on scale and scope.

Cost Factor	Solopreneur	Mid-Market	Enterprise
Build Cost (one-time)	$0 – $500 (DIY)	$6K – $18K	$50K+
Monthly Operating	$10 – $50 API	$600 – $4K managed	$5K – $15K+
SaaS Equivalent	$100 – $500/mo	$2K – $10K/mo	$50K – $280K/yr
Breakeven Timeline	Immediate	3 – 9 months	6 – 18 months

The solopreneur numbers come from Kim Doyal, who runs 33 custom agents on $10-20 a month in API costs and reports a 75-80% reduction in time spent on repetitive work. These figures assume the builder is also the operator with technical skills, which is a different model from a mid-market team hiring an implementation partner. The mid-market numbers reflect what agentic development actually costs when an implementation partner handles the build and ongoing management. Enterprise ranges are directional, drawn from Clustox and industry benchmarks.

The critical number for mid-market buyers: at $2,000 a month in SaaS spend being replaced, an $18,000 build pays for itself in nine months, and a $6,000 build breaks even in three. That math counts only the subscription being replaced against the build cost; ongoing operating costs lengthen the payback. It also leaves out the value of owning your system: no vendor lock-in, no price increases, no feature changes you didn't ask for. The agent does exactly what you need and nothing else.
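
For readers who want to plug in their own numbers, here is a minimal breakeven sketch. The function and its optional `monthly_operating` parameter are illustrative. With the parameter left at zero it reproduces the nine-month and three-month figures above, and setting it shows how much a managed-operation fee stretches the payback.

```python
import math
from typing import Optional


def breakeven_months(
    build_cost: float,
    monthly_saas_replaced: float,
    monthly_operating: float = 0.0,
) -> Optional[int]:
    """Months until cumulative savings cover the one-time build cost.

    Returns None when the agent costs as much to run as the SaaS it
    replaces, because the build is then never paid back.
    """
    net_monthly_saving = monthly_saas_replaced - monthly_operating
    if net_monthly_saving <= 0:
        return None
    return math.ceil(build_cost / net_monthly_saving)


simple_high = breakeven_months(18_000, 2_000)   # subscription only
simple_low = breakeven_months(6_000, 2_000)
with_ops = breakeven_months(18_000, 2_000, monthly_operating=600)
```

Running your own operating cost through the third call is a quick sanity check before committing to a build, since the payback period is sensitive to that one input.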

There's also a cost trajectory working in favor of custom builds that most comparisons miss. SaaS pricing tends to move in one direction: up. A tool that costs $500/month today will plausibly cost $600/month in two years as the vendor raises prices. Custom agent API costs have been falling roughly every six months as models get cheaper and more efficient, so a custom agent that costs $200/month today will likely cost $120/month in two years. The cost gap widens over time rather than narrowing. This is one of the strongest long-term arguments for building.
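
To see what those two-year figures imply, the sketch below backs out the compound annual rate for each example and projects both forward. It assumes the rates stay constant, which is a simplification for illustration and not a forecast.

```python
def implied_annual_rate(start: float, end: float, years: float) -> float:
    """Compound annual rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


def project(start: float, annual_rate: float, years: float) -> float:
    """Monthly cost after `years` of compounding at `annual_rate`."""
    return start * (1 + annual_rate) ** years


saas_rate = implied_annual_rate(500, 600, 2)  # SaaS example: $500 -> $600
api_rate = implied_annual_rate(200, 120, 2)   # agent example: $200 -> $120

for year in range(5):
    saas = project(500, saas_rate, year)
    agent = project(200, api_rate, year)
    print(f"year {year}: SaaS ${saas:,.0f}/mo, agent ${agent:,.0f}/mo, gap ${saas - agent:,.0f}/mo")
```

Under these assumptions the SaaS example rises about 9.5% a year, the agent example falls about 22.5% a year, and the monthly gap grows from $300 today to $480 at the two-year mark.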

Three Mistakes That Kill SaaS Replacement Projects

Mistake 1: Replacing Customer-Facing Tools First

A company replacing their customer support chatbot before they've ever run a custom agent internally is making the highest-stakes bet with the least experience. When the agent produces an incorrect response, the customer sees it. When it goes down, the customer notices. Start with internal tools where failures are private and learning is cheap.

Mistake 2: Building What You Don't Understand

If nobody on your team can articulate why your current tool's workflow exists, a custom agent won't fix that. Agents automate processes. If the process itself is unclear, the agent will automate confusion faster. Before building a replacement, document the workflow the tool supports. Every step, every decision point, every exception. If you can't write it down, you can't automate it.

Mistake 3: Ignoring Maintenance Compounding

Jason Lemkin's most important observation from building 20+ custom tools: "Every app you build is an app you now have to maintain." Each custom system adds to your maintenance surface. API providers change their interfaces. Models update and produce different outputs. Edge cases accumulate.

Analysis from Clustox puts the numbers in sharper focus: first-year costs for AI-built systems run roughly 12% higher than initial estimates once you factor in code review overhead and a testing burden that's 1.7 times the norm. AI-generated code carries roughly double the code churn rate of traditional development, and by year two, cumulative maintenance costs can reach four times traditional levels as technical debt compounds. These figures are drawn from Clustox's build-vs-buy comparison for AI tools, which aggregates data across multiple enterprise deployments.

The mitigation: either budget for ongoing maintenance from day one, or work with an implementation partner who handles the maintenance surface for you. The second option converts an unpredictable engineering cost into a predictable monthly fee.

How to Get Started

The gap between "this sounds right" and "we actually replaced a SaaS tool" is narrower than it looks, but only if you approach it methodically.

Week 1: Audit your stack. List every SaaS tool your team uses. For each one, note the monthly cost, how many features your team actually touches, and whether it integrates cleanly with your other tools. Most teams discover 3-5 obvious candidates within an hour of honest assessment.

Week 2: Score the top candidates. Run your shortlist through the four-factor evaluation. Utilization rate below 20%? Data locked in proprietary formats? Middleware required to connect it? No AI features shipped? Two or more flags and the tool moves to the replacement list. Cross-check against the Keep List — if it falls in a "never build" category, leave it regardless of the score.

Week 3-4: Build one agent. Pick the highest-scoring internal tool. Define exactly what the replacement needs to do, not everything the SaaS tool does, just the specific workflows your team relies on. The build itself is faster than most people expect. Simple internal agents that handle reporting, research aggregation, or content workflows can be operational in days, assuming either an internal developer or an implementation partner doing the build. For teams without technical resources, "days" means days of working with a builder, not days of building yourself. For context on evaluating build partners, the comparison guide covers what to look for.

Month 2-3: Operate and measure. Run the custom agent alongside the SaaS tool for 30 days. Track the actual API costs, the time your team spends on oversight, and any edge cases that surface. Compare against the SaaS subscription cost. The real numbers will be different from projections — they always are — but the gap between "projected" and "actual" is where your operational learning lives.

After 30 days of measured operation, you'll know whether the economics hold and whether your team can sustain the maintenance. That knowledge is worth more than any vendor comparison chart.
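
One way to keep that 30-day comparison honest is to record the inputs in a single place. The sketch below is a hypothetical ledger. The field names, the hourly rate for oversight time, and the sample figures are assumptions for illustration, not client data.

```python
from dataclasses import dataclass


@dataclass
class ParallelRun:
    """Thirty days of running a custom agent alongside the SaaS tool."""
    api_cost: float                # actual API spend over the period
    oversight_hours: float         # team time spent checking the agent
    hourly_rate: float             # assumed loaded cost of that time
    saas_monthly_cost: float       # subscription the agent would replace
    projected_monthly_cost: float  # what the build plan said it would cost

    def actual_cost(self) -> float:
        """API spend plus the cost of human oversight."""
        return self.api_cost + self.oversight_hours * self.hourly_rate

    def projection_gap(self) -> float:
        """Positive means the agent cost more than projected."""
        return self.actual_cost() - self.projected_monthly_cost

    def beats_saas(self) -> bool:
        return self.actual_cost() < self.saas_monthly_cost


# Hypothetical sample numbers
run = ParallelRun(
    api_cost=80,
    oversight_hours=6,
    hourly_rate=100,
    saas_monthly_cost=1_200,
    projected_monthly_cost=250,
)
```

Counting oversight hours as a cost matters here. An agent that looks cheap on API spend alone can miss its projection once the time spent supervising it is included.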

Making the Decision

The decision tree is simpler than most guides make it:

Score the tool against the four evaluation factors (utilization, lock-in, integration, AI readiness)
If two or more factors flag high risk, the tool is a replacement candidate
Check the Keep List. If the tool falls in a "never build" category, keep it regardless of the score
Place replacements in the right tier (internal → customer-adjacent → customer-facing) and sequence them accordingly
Build one agent first, operate it for 30 days, and measure real costs against projections before committing to the next build
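
The steps above translate into a short routine. This is a sketch under stated assumptions: the category labels for the Keep List, the tier names, and the dictionary shape are all hypothetical, and the flag count is taken as an input rather than computed.

```python
# Hypothetical labels for the "never build" categories on the Keep List
KEEP_CATEGORIES = {
    "compliance", "payments", "identity", "platform_native", "vendor_security",
}

# Build order: internal first, customer-facing last
TIER_ORDER = {"internal": 1, "customer_adjacent": 2, "customer_facing": 3}


def decide(flag_count: int, category: str) -> str:
    """Return "replace" or "keep" for one tool."""
    if flag_count < 2:
        return "keep"  # fewer than two red flags: not a candidate
    if category in KEEP_CATEGORIES:
        return "keep"  # the Keep List overrides the score
    return "replace"


def build_sequence(tools: list[dict]) -> list[str]:
    """Order replacement candidates by tier, then by flag count (highest first)."""
    candidates = [t for t in tools if decide(t["flag_count"], t["category"]) == "replace"]
    candidates.sort(key=lambda t: (TIER_ORDER[t["tier"]], -t["flag_count"]))
    return [t["name"] for t in candidates]


stack = [
    {"name": "support portal", "flag_count": 3, "category": "general", "tier": "customer_facing"},
    {"name": "reporting dashboard", "flag_count": 2, "category": "general", "tier": "internal"},
    {"name": "research tool", "flag_count": 4, "category": "general", "tier": "internal"},
    {"name": "sso provider", "flag_count": 3, "category": "identity", "tier": "internal"},
    {"name": "lead scoring", "flag_count": 1, "category": "general", "tier": "customer_adjacent"},
]
```

The first name in the returned list is the one to build and operate for 30 days before moving to the next.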

The companies that succeed at this aren't the ones that replace everything at once. They're the ones that pick the right first replacement, learn from operating it, and expand methodically.

If you're evaluating whether custom agents make sense for your stack, the agentic development page covers how we scope and price these builds, and the implementation guide walks through the build process from a practitioner's perspective. If you need context first, the explainer on what an AI agent actually is, and how it differs from conventional automation, is a good starting point. And for a broader view of the AI implementation services available, the services overview has the full picture.

FAQ: Build vs. Buy AI Agents

How long does it take to build a custom AI agent to replace a SaaS tool?

Simple internal tools — a reporting dashboard, a research workflow, a content production pipeline — can be built and deployed in days. Customer-facing systems with integrations, error handling, and monitoring typically take two to six weeks. Multi-agent systems that coordinate several workflows take longer, often one to three months from scoping to production. The complexity of the workflow being replaced matters more than the technology involved.

What SaaS tools are companies replacing with AI agents in 2026?

The most common categories are content production tools, CRM enrichment and lead scoring, research and competitive intelligence platforms, internal reporting dashboards, and customer support triage. These share a pattern: the SaaS tool provides broad capability, but the team uses a narrow slice of it. That narrow slice is exactly what a purpose-built agent handles well. Gartner projects 40% of enterprise applications will embed task-specific agents by the end of 2026, up from less than 5% in 2025.

How much does it cost to build a custom AI agent?

For a solopreneur using AI-assisted development, the build cost can be near zero with $10-50 a month in API costs. For a mid-market business working with an implementation partner, expect $6,000-$18,000 for the initial build plus $600-$4,000 a month for managed operation and API costs. Enterprise multi-agent systems start at around $50,000 and scale with complexity. See the cost comparison table above for breakeven timelines against typical SaaS spend.

Can a small team build AI agents without developers?

For internal tools, yes. AI-assisted development approaches let non-developers describe workflows in natural language and generate working agents. Kim Doyal runs 33 agents without a development background. For production systems that handle customer data or integrate with critical business processes, engineering oversight matters. The build itself may use AI-assisted development, but someone needs to validate security, error handling, and edge cases.

What happens if the AI agent breaks?

Every custom system needs monitoring and fallback plans. Agents should fail gracefully — alerting a human rather than producing incorrect outputs silently. The maintenance reality is real and quantifiable: first-year costs run roughly 12% above initial estimates (per Clustox's analysis), and by year two, cumulative maintenance can reach four times traditional levels. Budget for ongoing maintenance or use a managed service model. This is the single biggest factor most build-vs-buy analyses underestimate.
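
A minimal sketch of that "alert a human" pattern follows. The wrapper, its parameter names, and the validation rule in the usage example are hypothetical. A real agent would plug in its own checks and its own alerting channel.

```python
from typing import Callable, Optional


def run_with_fallback(
    agent_step: Callable[[], str],
    validate: Callable[[str], bool],
    alert: Callable[[str], None],
) -> Optional[str]:
    """Return the agent's output only if it ran and passed validation.

    On any failure, notify a human and return None so downstream code
    never receives an unchecked result.
    """
    try:
        output = agent_step()
    except Exception as exc:  # broad on purpose: any crash should reach a human
        alert(f"agent step raised {type(exc).__name__}: {exc}")
        return None
    if not validate(output):
        alert(f"agent output failed validation: {output!r}")
        return None
    return output


# Usage: collect alerts in a list here; in practice this would page or message someone.
alerts: list[str] = []
total = run_with_fallback(lambda: "42", str.isdigit, alerts.append)
bad = run_with_fallback(lambda: "n/a", str.isdigit, alerts.append)
```

The design choice is that a failed or suspicious step produces no output at all, which is easier to catch downstream than a wrong answer that looks plausible.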

Is it cheaper to build or buy AI in 2026?

It depends on the tool and your utilization rate. If you're using 80%+ of a tool's features, buying is almost certainly still the right choice. If you're using 10-15% and paying full price, building the slice you actually need will likely cost less within the first year. Run the four-factor evaluation from this guide against each tool in your stack. The answer will be different for every tool.