<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sebastian Chedal</title>
    <description>The latest articles on DEV Community by Sebastian Chedal (@sebastian_chedal).</description>
    <link>https://dev.to/sebastian_chedal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846478%2F5d345e20-5611-4756-9633-253eef7d12a5.jpg</url>
      <title>DEV Community: Sebastian Chedal</title>
      <link>https://dev.to/sebastian_chedal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sebastian_chedal"/>
    <language>en</language>
    <item>
      <title>Is AI as bad for the environment as people say it is?</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Fri, 12 Jun 2026 18:07:15 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/is-ai-as-bad-for-the-environment-as-people-say-it-is-4hd3</link>
      <guid>https://dev.to/sebastian_chedal/is-ai-as-bad-for-the-environment-as-people-say-it-is-4hd3</guid>
      <description>&lt;p&gt;A lot of the AI-environment writing on LinkedIn and in mainstream press, while correct when it was written, has been overtaken by new data. The per-query energy and water numbers that anchored the 2024 panic narrative have come down by an order of magnitude as first-party disclosures from &lt;a href="https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference" rel="noopener noreferrer"&gt;Google&lt;/a&gt;, OpenAI, and Mistral replaced 2023 best-guesses. I was curious what the current picture is, and I think it is important to get a clearer view of the real environmental impact AI is having and is projected to have.&lt;/p&gt;

&lt;p&gt;First of all, it is worth noting that previous AI environmental discussions rarely consider whether AI adoption is net-positive for an operator’s overall footprint (you, me, anyone using AI). To really understand the impact, we need to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The class of AI model being used&lt;/li&gt;
&lt;li&gt;What workflow AI is replacing&lt;/li&gt;
&lt;li&gt;Whether a user can even avoid AI inference (now that AI Overviews run on roughly half of Google searches)&lt;/li&gt;
&lt;li&gt;What kind of infrastructure backs the compute, and&lt;/li&gt;
&lt;li&gt;Whether efficiency gains get spent on more output or more work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article I will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the per-query energy numbers actually are in 2026, and why they fell so fast&lt;/li&gt;
&lt;li&gt;Why reasoning models break the math, and what that means for which model you pick&lt;/li&gt;
&lt;li&gt;The missing counterfactual: the workflow AI replaces is &lt;em&gt;not&lt;/em&gt; zero&lt;/li&gt;
&lt;li&gt;Why opting out by going back to Google search is no longer an option&lt;/li&gt;
&lt;li&gt;Four pre-commercial infrastructure bets where the math gets rewritten&lt;/li&gt;
&lt;li&gt;What responsible AI use in 2026 looks like&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The AI-environment narrative in 2026
&lt;/h2&gt;

&lt;p&gt;The loudest version of the AI-environment story runs roughly like this: a single ChatGPT query uses ten times the energy of a Google search; a viral statistic says it drinks 500 mL of water per prompt; data center buildout is straining grids in Virginia and Arizona; the individual user, by reaching for ChatGPT instead of search, is participating in a climate harm. Some of that read is correct. Data center growth &lt;em&gt;is&lt;/em&gt; a real systemic concern, reasoning models &lt;em&gt;do&lt;/em&gt; change the math in ways most coverage hasn’t caught up to, and the deployment timeline for compute infrastructure &lt;em&gt;is&lt;/em&gt; running ahead of the clean-energy timeline meant to power it.&lt;/p&gt;

&lt;p&gt;The part that’s harder to hold in mind is that the numbers underneath the narrative moved fast and downward in late 2025 and into 2026. &lt;a href="https://hannahritchie.substack.com/p/ai-footprint-august-2025" rel="noopener noreferrer"&gt;Hannah Ritchie’s August 2025 update at Sustainability by Numbers&lt;/a&gt; is the cleanest single read on what changed; &lt;a href="https://blog.andymasley.com/p/individual-ai-use-is-not-bad-for" rel="noopener noreferrer"&gt;Andy Masley’s individual-use essay&lt;/a&gt; built the counterfactual frame that the panic narrative tends to skip. Both writers were working in good faith with the numbers they had when they wrote. The numbers are only getting better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The per-query numbers came down by an order of magnitude
&lt;/h2&gt;

&lt;p&gt;Four independent estimates published in 2025 and into 2026 converge in a tight band. Google’s August 2025 paper on Gemini inference puts a median text query at &lt;strong&gt;0.24 Wh&lt;/strong&gt; of electricity. &lt;a href="https://blog.samaltman.com/the-gentle-singularity" rel="noopener noreferrer"&gt;Sam Altman casually mentioned&lt;/a&gt; in mid-2025 that a standard ChatGPT text query uses about 0.34 Wh. &lt;a href="https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use" rel="noopener noreferrer"&gt;Epoch AI’s independent estimate&lt;/a&gt; for GPT-4o landed at roughly 0.3 Wh, which they explicitly framed as “ten times less than the older estimate” of around 3 Wh that was circulating in 2023. The &lt;a href="https://arxiv.org/html/2505.09598v2" rel="noopener noreferrer"&gt;May 2025 arxiv benchmark “How Hungry is AI?”&lt;/a&gt; measured a GPT-4o short query at 0.43 Wh. The most rigorous figure published since is also the most recent: an &lt;a href="https://www.cell.com/joule/fulltext/S2542-4351(26)00114-5" rel="noopener noreferrer"&gt;April 2026 peer-reviewed study in &lt;em&gt;Joule&lt;/em&gt;&lt;/a&gt; from Microsoft Research measured a median query at &lt;strong&gt;0.31 Wh&lt;/strong&gt; (interquartile range 0.16 to 0.60) across optimized frontier-scale inference, and found that the widely circulated public estimates overstate real-world energy use by four to twenty times. It lands squarely in the same band, from a stronger source.&lt;/p&gt;

&lt;p&gt;The numbers are not identical, but they sit in a band of roughly 0.24 to 0.43 Wh, and the most recent first-party disclosure is at the low end. The estimates are dropping and also converging: only a year and a half ago, they didn’t agree to within a factor of ten.&lt;/p&gt;
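
&lt;p&gt;To make the size of the revision concrete, here is a rough sketch in Python that divides the older ~3 Wh estimate by each of the figures quoted above. The labels and the helper name are my own for illustration, and comparing different models against a single old ChatGPT-era figure is a simplification, not a like-for-like measurement.&lt;/p&gt;

```python
# The ~3 Wh figure that circulated in 2023, per Epoch AI's framing.
OLD_ESTIMATE_WH = 3.0

# Per-query figures quoted in this article (Wh). Labels are shorthand.
ESTIMATES_WH = {
    "Google Gemini (median)": 0.24,
    "Altman, ChatGPT": 0.34,
    "Epoch AI, GPT-4o": 0.30,
    "How Hungry is AI?, GPT-4o short": 0.43,
    "Joule study (median)": 0.31,
}


def reduction_factor(new_wh: float, old_wh: float = OLD_ESTIMATE_WH) -> float:
    """How many times smaller the newer estimate is than the old one."""
    return old_wh / new_wh


for source, wh in ESTIMATES_WH.items():
    factor = reduction_factor(wh)
    print(f"{source}: {wh} Wh, about {factor:.1f}x below the old figure")
```

&lt;p&gt;On these figures the drop ranges from about 7x to 12.5x, which is where “roughly an order of magnitude” comes from.&lt;/p&gt;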

&lt;p&gt;&lt;strong&gt;Studies lag.&lt;/strong&gt; Rigorous per-query measurement runs, on average, six to twelve months behind a model’s release. The figures above are the most recent &lt;em&gt;measured&lt;/em&gt; generation, the models the benchmarks could actually access, and the trend line points down from there. OpenAI, for example, &lt;a href="https://www.datacenterdynamics.com/en/news/gpt-5-could-require-significantly-more-energy-per-chatgpt-response-compared-to-prior-versions-report/" rel="noopener noreferrer"&gt;declined to publish a first-party energy figure for GPT-5&lt;/a&gt;, so some of the newest flagship numbers are third-party estimates rather than direct vendor disclosures.&lt;/p&gt;

&lt;p&gt;So why did things improve? More efficient hardware (the H100-class accelerators are an enormous step over the A100-class that anchored 2023 estimates), more efficient model architectures, and an honest correction to overly pessimistic token-count assumptions in the original estimates. Epoch AI names that last factor explicitly. And while Altman shared no methodology for his figure, the independent estimates from Google, Epoch, and the arxiv benchmark corroborate it.&lt;/p&gt;

&lt;p&gt;Image generation lands in roughly the same range as text per query, by MIT Technology Review’s May 2025 reporting that Ritchie summarizes, which surprised most people who followed the early coverage. Video is the genuine outlier: a five-second clip runs two to three orders of magnitude above a text query, with measured figures landing anywhere from 30 to nearly 1,000 Wh depending on resolution and length, and that gap looks structural rather than something efficiency gains will close quickly. The cleanest early measurement came from OpenAI’s Sora, which OpenAI &lt;a href="https://help.openai.com/en/articles/20001152-what-to-know-about-the-sora-discontinuation" rel="noopener noreferrer"&gt;retired in 2026&lt;/a&gt;. The current video models (Google’s Veo 3, Kling, Runway) have not published comparable first-party numbers, but nothing about them changes the underlying physics: generating video is a far larger compute event than generating text. &lt;a href="https://mistral.ai/news/our-contribution-to-a-global-environmental-standard-for-ai" rel="noopener noreferrer"&gt;Mistral’s lifecycle assessment&lt;/a&gt; contributes a useful upper-bound figure for thinking about scale, with Ritchie’s caveat that Mistral’s methodology disclosure was light.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about water?
&lt;/h3&gt;

&lt;p&gt;Water follows the same pattern. The viral 500 mL-per-prompt figure was actually a misreading of the original study, and the honest range works out to something like 10 to 30 mL per query depending on cooling architecture and data-center location. Some air-cooled facilities even run a water usage effectiveness of &lt;em&gt;zero&lt;/em&gt;. The water question hasn’t gone away (it concentrates regionally, which matters in drought-stressed siting decisions), but the per-query framing the panic narrative used was off by &lt;em&gt;more than an order of magnitude&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Things are also trending in the right direction: first-party disclosures have replaced third-party best-guesses, efficiency is improving generation over generation, and the direction looks durable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasoning models break the math
&lt;/h2&gt;

&lt;p&gt;It’s easy to overlook that short-query numbers don’t apply to reasoning models. The April 2026 &lt;em&gt;Joule&lt;/em&gt; study finds that long reasoning and agentic queries raise energy consumption by more than an order of magnitude, driven by the extra tokens generated and the reduced serving concurrency those workloads force.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Study data.&lt;/strong&gt; The &lt;a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/chatgpt-5-power-consumption-could-be-as-much-as-eight-times-higher-than-gpt-4-research-institute-estimates-medium-sized-gpt-5-response-can-consume-up-to-40-watt-hours-of-electricity" rel="noopener noreferrer"&gt;University of Rhode Island’s AI lab&lt;/a&gt; measured a medium GPT-5 response at roughly 18 Wh on average and up to 40 Wh under extended thinking, against a fraction of a watt-hour for a plain query. That is one model swinging by more than an order of magnitude depending on how hard it is asked to think, running about 8.6 times more power-hungry than GPT-4 on a medium response. Earlier cross-model work found the same pattern: the May 2025 “How Hungry is AI?” benchmark measured o3 and DeepSeek-R1 at &lt;strong&gt;over 33 Wh per long prompt, more than 70 times a lightweight model&lt;/strong&gt; on equivalent work. Running a reasoning model on a one-line question is not the same energy event as running a small model on it. The gap is large, and it scales with how much reasoning the model is asked to do.&lt;/p&gt;

&lt;p&gt;That same 2025 benchmark also found that, among the models it could test at the time, Claude 3.7 Sonnet ranked highest in eco-efficiency, a useful counterweight to the simple “bigger model, worse number” framing. The specific model is a generation old now (we are on the 4-series), but the point that survives is the general one: energy efficiency is not strictly a function of parameter count or reasoning depth; architecture and training choices matter at least as much.&lt;/p&gt;

&lt;p&gt;So in other words: matching model class to task complexity is not just a cost optimization, it is a real energy decision. Don’t run a frontier reasoning model (GPT-5 Pro, Gemini 3 Deep Think, Claude Opus in extended thinking) on a one-line lookup. Don’t reach for the most capable model when a smaller one will produce the same output. The discipline we walked through in our &lt;a href="https://fountaincity.tech/resources/blog/ai-cost-optimization-practitioner-framework/" rel="noopener noreferrer"&gt;AI cost optimization framework&lt;/a&gt; (dispatcher-first architecture, model-tier matching, agent decomposition) is the same discipline that reduces energy consumption at scale. Thankfully, cost discipline &lt;em&gt;is&lt;/em&gt; energy discipline.&lt;/p&gt;
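
&lt;p&gt;A minimal sketch of what model-tier matching could look like inside a dispatcher, in Python. Everything here is an illustrative assumption: the tier names, the keyword heuristic, and the per-response energy placeholders (which loosely follow the ranges cited above, a fraction of a watt-hour for a plain query and roughly 18 Wh for a medium reasoning response). It is not a real vendor API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """A hypothetical model tier with a placeholder energy estimate."""
    name: str
    est_wh_per_response: float


# Placeholder figures based on ranges quoted in this article,
# not measurements of any specific product.
SMALL = Tier("small", 0.3)
REASONING = Tier("reasoning", 18.0)

# Hypothetical keyword heuristic. A real dispatcher would more likely
# use a classifier or a cheap model call to score task complexity.
REASONING_HINTS = ("prove", "plan", "debug", "multi-step", "trade-off")


def pick_tier(prompt: str) -> Tier:
    """Route to the reasoning tier only when the prompt looks like it needs it."""
    text = prompt.lower()
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return REASONING if needs_reasoning else SMALL


def estimated_wh(prompts: list[str]) -> float:
    """Sum the placeholder energy estimates for a batch of routed prompts."""
    return sum(pick_tier(p).est_wh_per_response for p in prompts)


batch = ["What is the capital of France?", "Plan a database migration"]
print([pick_tier(p).name for p in batch])
print(f"Estimated energy for the batch: {estimated_wh(batch):.1f} Wh")
```

&lt;p&gt;The design point is the default: everything goes to the small tier unless something explicitly signals that reasoning is needed, rather than the other way around.&lt;/p&gt;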

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy6z8wxk3k8isbrgkfv7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy6z8wxk3k8isbrgkfv7.jpg" alt="Glowing artificial seed and a massive mechanical root system" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Is it better on the environment to do the same work without AI?
&lt;/h2&gt;

&lt;p&gt;Most published coverage of AI’s energy use compares an AI query against itself: Wh per query, mL per query, grams of CO2 per response. That framing leaves out the comparison that actually matters: what workflow does the AI query replace, and what did that workflow cost (in terms of energy, effort, water etc.)?&lt;/p&gt;

&lt;p&gt;Andy Masley and Hannah Ritchie built a counterfactual model. Masley’s water comparisons show that a ChatGPT query is materially less water-intensive than streaming music, an hour of social media browsing, or an hour on Zoom (the specifics live in the FAQ below). Ritchie’s data shows that AI’s share of global electricity sits &lt;em&gt;well under 1%&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So would skipping AI be better for the environment? Call the alternative the slow-path workflow: doing it the old way (searching Google, opening ten pages, reading them, synthesizing the findings, writing it up) burns display energy, network round-trips, and server calls, plus a lot of human time. Those all add up. A single AI call compresses them into one event, and that compression is real. The old way also draws power and water, arguably more of both, and it takes more of your time, which is the resource people most often forget to count.&lt;/p&gt;

&lt;p&gt;That doesn’t make AI free. It does mean the honest question is always “compared to what?”&lt;/p&gt;

&lt;h3&gt;
  
  
  So how much of our total water does AI actually use?
&lt;/h3&gt;

&lt;p&gt;Per query the water cost is tiny, as we saw above. Zoom out to national water use and AI barely registers. You get clicks talking about AI draining our water, but here is the actual breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;41.3% Thermoelectric Power&lt;/li&gt;
&lt;li&gt;36.6% Agriculture&lt;/li&gt;
&lt;li&gt;12.1% Public use&lt;/li&gt;
&lt;li&gt;4.6% Industry&lt;/li&gt;
&lt;li&gt;2.3% Aquaculture&lt;/li&gt;
&lt;li&gt;1.0% Mining&lt;/li&gt;
&lt;li&gt;less than 1% Data centers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So of all the water use being measured, data centers currently account for less than 1%. &lt;a href="https://www.fwpcoa.org/content.aspx?page_id=5&amp;amp;club_id=859275&amp;amp;item_id=130961" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we really wanted to reduce water use, we should be looking at agricultural practices of growing food crops that are excessively water intensive and rethinking how we generate power (which, by the way, is advancing).&lt;/p&gt;

&lt;h3&gt;
  
  
  What about power use?
&lt;/h3&gt;

&lt;p&gt;As of June 2026 the best estimate I can find is that AI data centers draw 2 to 4% of US electricity, and all data centers together (not just AI ones) draw 4 to 6%.&lt;/p&gt;

&lt;p&gt;One scope note while we are counting: these are operational numbers, the cost of running the models. The embodied carbon of building the hardware (chip fabrication, data-center construction, the minerals in a GPU) is a real and separate part of the footprint that this piece does not try to tackle.&lt;/p&gt;

&lt;p&gt;I won’t pretend AI’s environmental number is small and therefore fine. But the displacement direction matters more than the absolute number. AI is actually net-positive for footprint when it substitutes for a higher-energy workflow, neutral when it’s additive at the margin, but net-negative when it scales total output proportionally to whatever efficiency gain it produced.&lt;/p&gt;

&lt;p&gt;So if we did the same amount of work, and did it all with AI, we should be using less energy and less water, and gaining free time. (But let me get back to this at the end of the article.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfijuf50pm0abdgxdvr6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfijuf50pm0abdgxdvr6.jpg" alt="Professional at a desk reviewing environmental data on a holographic interface" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You can’t opt out of AI…
&lt;/h2&gt;

&lt;p&gt;The activist read in 2024 sometimes ended with a clean recommendation: just use Google search instead. In 2024 that recommendation was correct, or at least available. In 2026 it has been functionally retired by the search product itself. &lt;a href="https://www.brightedge.com/resources/weekly-ai-search-insights/ai-overviews-one-year-presence-size-citing" rel="noopener noreferrer"&gt;BrightEdge data from February 2026&lt;/a&gt; finds that AI Overviews now trigger on roughly &lt;strong&gt;48% of tracked search queries&lt;/strong&gt;, up sharply year-over-year. Other trackers put the US-specific figure higher.&lt;/p&gt;

&lt;p&gt;There is no permanent product-wide opt-out. Search Labs toggles are feature-specific. Browser extensions can hide the AI Overview surface but do not reduce the underlying inference. Whether you click into the AI Overview or not, the inference ran. That part of the activist advice has been overtaken by product reality, not refuted by argument.&lt;/p&gt;

&lt;p&gt;The way I’d put it: “you can’t avoid it” is a statement of fact, not a defense. The alternative to using AI for an information task in 2026 is the slow-path workflow described above, and that workflow has a real resource cost too. The question shifted from “should I use AI” to “which AI workflow uses less energy for the work I’m doing.” That’s a different conversation, and a more useful one. It maps to the same shift we covered in our piece on &lt;a href="https://fountaincity.tech/resources/blog/making-your-business-visible-to-ai-a-strategic-guide-to-appearing-in-ai-recommendations/" rel="noopener noreferrer"&gt;making your business visible to AI search&lt;/a&gt;: the restructuring of discovery is restructuring the energy question alongside it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where we are going
&lt;/h2&gt;

&lt;p&gt;Everything above is the demand side: what a query costs to run today. The environmentally-friendly supply side is moving too, and it is where the long-term picture gets decided. Below are a few early bets on where the energy to power AI comes from next:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Panthalassa.&lt;/strong&gt; The Oregon-based startup &lt;a href="https://www.latitudemedia.com/news/are-thiel-funded-floating-data-centers-enough-to-make-wave-energy-pencil/" rel="noopener noreferrer"&gt;raised a $140 million Series B in May 2026&lt;/a&gt;, led by Peter Thiel with participation from John Doerr, Marc Benioff’s Time Ventures, and Mike Schroepfer’s Gigascale Capital. The technology: floating autonomous nodes that convert wave energy directly to compute, cooled by surrounding seawater, transmitting via LEO satellite, with a 2027 commercial target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Musk’s 100 GW claim.&lt;/strong&gt; At &lt;a href="https://pv-magazine-usa.com/2026/01/26/elon-musk-at-wef-spacex-and-tesla-to-produce-100-gw-each-of-pv-per-year-in-the-u-s-this-decade" rel="noopener noreferrer"&gt;WEF Davos on January 22, 2026&lt;/a&gt;, Musk said Tesla and SpaceX were each separately working to build 100 GW per year of US solar manufacturing capacity this decade, framing energy as “the bottleneck of the AI revolution.” For context, China’s annual PV production capacity is in the same order of magnitude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aikido Technologies.&lt;/strong&gt; The middle path between speculative and tactical. &lt;a href="https://eandt.theiet.org/2026/03/04/ai-data-centre-onboard-floating-offshore-wind-platform-targeted-uk-waters-2028" rel="noopener noreferrer"&gt;Aikido’s AO60DC&lt;/a&gt; integrates an offshore wind turbine in the 15 to 18 MW range with roughly 10 to 12 MW of IT capacity on a single floating platform, designed for farms scaling toward 1 GW-plus of IT load. A small proof-of-concept is operating in Norway; first commercial project targets UK waters by 2028.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Space-based solar.&lt;/strong&gt; The long-horizon bet. Feasibility analyses (per pv-magazine’s coverage) are real engineering rather than slideware, but the production timeline is probably decades out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj30i905joocw2er2vil9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj30i905joocw2er2vil9.jpg" alt="Futuristic offshore wind turbine and glowing data servers" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What responsible AI use in 2026 looks like
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Match model class to task.&lt;/strong&gt; Don’t run a frontier reasoning model (GPT-5 Pro, Gemini 3 Deep Think, Opus in extended thinking) on a one-line question. The reasoning-model energy gap, an order of magnitude on average and up to 70 times at the extreme, is the data point that makes this not just a cost recommendation but an energy one. Pick the smallest model that does the task at quality. Reserve reasoning models for tasks that actually reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize the dispatcher layer.&lt;/strong&gt; Cost discipline is energy discipline. Reductions in API spend tend to flow through to roughly proportional energy reductions on the same workload. Spend the engineering time on prompt compression, output budget tuning, and tier routing. The savings compound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI to replace higher-energy workflows, not layer on top of them.&lt;/strong&gt; AI that substitutes for an hour of Zoom-and-document-review is net-positive on the math. AI that gets added to existing workflows without removing anything is additive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefer hyperscalers with published WUE and PUE.&lt;/strong&gt; Hyperscaler data-center efficiency tends to run meaningfully better than typical on-prem deployments (specifics in the FAQ below). For workloads without regulatory or data-residency reasons to stay local, the embodied-carbon argument also runs in favor of cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are we just working harder?
&lt;/h3&gt;

&lt;p&gt;If the same work gets done in 20% of the time but the working hours don’t change, output goes up roughly 5x, and total resource use ends up roughly 5x higher than if the saved time had simply been banked. Efficiency gains that get spent on more output multiply total resource use rather than reducing it. That’s not a uniquely AI problem (Jevons named the pattern in coal in 1865), but it is the question every operator has to answer now. AI adoption is net-positive for footprint when it’s substitutional, neutral when it’s additive at the margin, and net-negative when it scales total output proportionally to the efficiency gain. The decision is not the technology. The decision is: &lt;em&gt;what do we do with the time the technology gives back?&lt;/em&gt;&lt;/p&gt;
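
&lt;p&gt;The rebound arithmetic can be sketched as a small calculation. The per-task energy values below are hypothetical placeholders chosen only to show the shape of the comparison; the function names are mine.&lt;/p&gt;

```python
def total_energy(tasks: float, wh_per_task: float) -> float:
    """Total energy in Wh for a number of tasks at a given per-task cost."""
    return tasks * wh_per_task


def rebound_scenarios(old_wh_per_task: float, ai_wh_per_task: float,
                      speedup: float, baseline_tasks: float = 1.0) -> dict:
    """Compare the old workflow, AI as a substitute, and AI with hours unchanged.

    speedup is how many times faster a task gets done with AI
    (5.0 means the task takes 20% of the original time).
    """
    old = total_energy(baseline_tasks, old_wh_per_task)
    substitute = total_energy(baseline_tasks, ai_wh_per_task)
    # Same working hours, so the number of tasks completed scales with speedup.
    same_hours = total_energy(baseline_tasks * speedup, ai_wh_per_task)
    return {"old": old, "substitute": substitute, "same_hours": same_hours}


# Hypothetical numbers: the old workflow costs 10 Wh per task, AI costs 4 Wh.
result = rebound_scenarios(old_wh_per_task=10.0, ai_wh_per_task=4.0, speedup=5.0)
for scenario, wh in result.items():
    print(f"{scenario}: {wh:.1f} Wh")
```

&lt;p&gt;With these placeholder numbers, keeping the hours unchanged uses 20 Wh: five times the 4 Wh substitution case, and double the 10 Wh old workflow. The per-task saving only survives if the per-task cost falls by more than the speedup.&lt;/p&gt;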

&lt;p&gt;Where the energy picture goes from here looks more open than the panic narrative allows. Fusion is always 20 years away. But geothermal at scale, solar manufacturing at the scale Musk is gesturing at (even discounted appropriately), and offshore wave and wind compute integration all have shipping timelines in 2027 to 2030, with space-based solar as the longer-horizon bet. The future shape looks energy-rich rather than energy-poor, if compute discipline can keep pace with capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn0pdcljga04kex29007.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn0pdcljga04kex29007.jpg" alt="Professionals discussing cost discipline at a whiteboard" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How much energy does one ChatGPT query actually use in 2026?
&lt;/h3&gt;

&lt;p&gt;The per-query band is covered in detail in the body above. The short version: a standard text query lands roughly an order of magnitude lower than the figure widely cited in 2023, and the revision comes from more efficient hardware and models plus a correction to overly pessimistic token-count assumptions in the original estimates. Reasoning models (GPT-5 in extended thinking, DeepSeek-R1, and similar) are a different category and use materially more energy per long prompt, an order of magnitude or more in the measurements cited above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is AI worse for the environment than a Google search?
&lt;/h3&gt;

&lt;p&gt;Per query, the gap is much smaller than the “10x” claim that circulated in 2024. A Google search and a Gemini text query land in the same per-query range on Google’s own August 2025 disclosure. The question is mostly moot now, because AI Overviews already trigger on roughly half of tracked Google searches, which means most search queries include AI inference whether the user clicked into it or not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I opt out of AI in Google search?
&lt;/h3&gt;

&lt;p&gt;Not in any complete way. There is no permanent product-wide opt-out. Search Labs toggles are feature-specific. Browser extensions can hide the AI Overview surface, but the inference still ran on Google’s infrastructure when the search happened. The framing has shifted from “should I use AI” to “which AI workflow uses less energy for the work I’m doing.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Does generating an AI image use the same energy as charging my phone?
&lt;/h3&gt;

&lt;p&gt;No — image generation runs roughly two orders of magnitude lower than charging a phone, per MIT Technology Review’s May 2025 reporting. Video generation is the genuine outlier: a short AI-generated clip lands closer in energy cost to charging a phone or running a microwave for a minute. The image-versus-video gap is the most important distinction in the consumer AI energy conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much water does a ChatGPT query use?
&lt;/h3&gt;

&lt;p&gt;The honest range works out to something like 10 to 30 mL per query depending on cooling architecture and data-center location, with some air-cooled facilities running effectively zero on-site water. For comparison, &lt;a href="https://blog.andymasley.com/p/individual-ai-use-is-not-bad-for" rel="noopener noreferrer"&gt;Andy Masley’s counterfactual&lt;/a&gt; finds streaming a song uses roughly 250 mL, an hour of social media browsing 430 mL, and an hour on Zoom 1,720 mL.&lt;/p&gt;
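
&lt;p&gt;Put as arithmetic, using only the figures quoted in this answer, the comparison looks like this. The helper is an illustrative sketch, and the results inherit all the uncertainty of the underlying estimates.&lt;/p&gt;

```python
# Figures quoted above: 10 to 30 mL per query, plus Masley's
# per-activity comparisons in mL.
QUERY_ML_RANGE = (10.0, 30.0)
ACTIVITY_ML = {
    "streaming a song": 250.0,
    "an hour of social media": 430.0,
    "an hour on Zoom": 1720.0,
}


def equivalent_queries(activity_ml: float) -> tuple[float, float]:
    """Return (fewest, most) queries that match an activity's water use."""
    low_ml, high_ml = QUERY_ML_RANGE
    return activity_ml / high_ml, activity_ml / low_ml


for name, ml in ACTIVITY_ML.items():
    fewest, most = equivalent_queries(ml)
    print(f"{name}: roughly {fewest:.0f} to {most:.0f} queries")
```

&lt;p&gt;On these numbers an hour on Zoom is comparable to somewhere between about 57 and 172 text queries.&lt;/p&gt;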

&lt;h3&gt;
  
  
  Should I stop using ChatGPT to help the environment?
&lt;/h3&gt;

&lt;p&gt;For meaningful climate impact, individual AI use is not the leverage point. Data centers account for roughly 1.5% of global electricity by IEA figures, and AI specifically is under 0.2%. Systemic decisions — grid mix, hyperscaler siting, model architecture choices, workflow substitution — are where the math moves. Where individual choice does matter: don’t reach for reasoning models on tasks that don’t need reasoning, and use AI to replace higher-energy workflows rather than as a layer added on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  What percentage of US electricity goes to data centers?
&lt;/h3&gt;

&lt;p&gt;Approximately 4 to 6% of US electricity, by Hannah Ritchie’s summary of IEA data. AI specifically is a fraction of that. Data center electricity is concentrated regionally — Virginia, Texas, the Pacific Northwest, parts of the Southeast — which is where the grid-pressure conversations are most active. The national figure understates the local-cluster strain in those regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is cloud AI more efficient than running AI locally on my laptop?
&lt;/h3&gt;

&lt;p&gt;Generally yes, by a meaningful margin. Hyperscaler data centers run a power usage effectiveness (PUE) between roughly 1.08 and 1.25 (Google’s published range), versus typical on-prem deployments that run 1.4 to 2x worse. The embodied carbon of laptop and consumer-GPU manufacturing also tends to push the comparison further in cloud’s favor, particularly for workloads that don’t run continuously. Where local makes sense is regulated data-residency workloads, intermittent inference on hardware that already exists for other reasons, and edge applications where round-trip latency matters more than efficiency.&lt;/p&gt;
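
&lt;p&gt;PUE is total facility energy divided by IT equipment energy, so the overhead comparison reduces to a multiplication. In the sketch below the hyperscaler values are the range quoted above, while the on-prem PUE of 1.8 and the 100 Wh workload are hypothetical placeholders for illustration.&lt;/p&gt;

```python
def facility_energy_wh(it_energy_wh: float, pue: float) -> float:
    """Total facility energy, given IT energy and PUE (facility / IT)."""
    return it_energy_wh * pue


def overhead_wh(it_energy_wh: float, pue: float) -> float:
    """Energy spent on cooling, power conversion, and other non-IT load."""
    return facility_energy_wh(it_energy_wh, pue) - it_energy_wh


IT_LOAD_WH = 100.0  # arbitrary workload size for illustration
SCENARIOS = {
    "hyperscaler, low end": 1.08,
    "hyperscaler, high end": 1.25,
    "on-prem (hypothetical)": 1.8,
}

for label, pue in SCENARIOS.items():
    extra = overhead_wh(IT_LOAD_WH, pue)
    print(f"{label}: {extra:.0f} Wh overhead per {IT_LOAD_WH:.0f} Wh of compute")
```

&lt;p&gt;For the same 100 Wh of compute, the overhead comes to about 8 and 25 Wh at the two hyperscaler values and 80 Wh at the hypothetical on-prem value. PUE says nothing about how efficient the chips themselves are, so it is only one part of the comparison.&lt;/p&gt;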

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbirtl4silci8liw8vo0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbirtl4silci8liw8vo0.jpg" alt="Ornate fountain with water turning into glowing butterflies" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>business</category>
      <category>llm</category>
    </item>
    <item>
      <title>Evaluation-Led Agent Development: Five Disciplines That Separate Production from Pilot</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Fri, 05 Jun 2026 18:07:48 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/evaluation-led-agent-development-five-disciplines-that-separate-production-from-pilot-2cch</link>
      <guid>https://dev.to/sebastian_chedal/evaluation-led-agent-development-five-disciplines-that-separate-production-from-pilot-2cch</guid>
      <description>&lt;p&gt;The gap between an agent that runs in a demo and an agent that runs in production isn’t a tooling gap or a model-capability gap. It’s a gap in discipline. The discipline that closes that gap is evaluation, not as a QA afterthought, but as the operating practice that determines whether the rest of the work ever gets used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why evaluation became the production distinction
&lt;/h2&gt;

&lt;p&gt;Databricks’ recent &lt;em&gt;State of AI Agents 2026&lt;/em&gt; report (via &lt;a href="https://lovelytics.com/post/state-of-ai-agents-2026-lessons-on-governance-evaluation-and-scale" rel="noopener noreferrer"&gt;Lovelytics’ practitioner summary&lt;/a&gt;) found that organizations using systematic evaluation frameworks achieve nearly 6× higher production success rates. At the practitioner tier, &lt;a href="https://nav43.com/blog/agentic-ai-workflows-for-seo/" rel="noopener noreferrer"&gt;NAV43’s Frase/Graphed data&lt;/a&gt; shows 90.3% of marketing organizations have AI agents somewhere in their stack, but only about 13% have those agents integrated into production workflows. The root cause both sources point to isn’t a poor model, framework, or orchestrator; it’s a lack of discipline in building, testing, and re-testing systems on the way from pilot to production.&lt;/p&gt;

&lt;p&gt;The academic name for this is evaluation-driven development and operations, or EDDOps in the literature. We’ll use the more practitioner-readable phrase here: &lt;strong&gt;evaluation-led agent development&lt;/strong&gt;. The disciplines themselves are converging across publications. The &lt;a href="https://www.infoq.com/articles/evaluating-ai-agents-lessons-learned/" rel="noopener noreferrer"&gt;InfoQ five-pillar framework&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/" rel="noopener noreferrer"&gt;AWS Strands’ Cases/Experiments/Evaluators pattern&lt;/a&gt;, Microsoft’s online-and-offline split, and Arthur AI’s supervised-versus-unsupervised distinction all describe the same shape from different angles. Below we go deeper into what these sources have in common and add our own perspective and experience.&lt;/p&gt;

&lt;p&gt;P.S. For more on why the pilot stage is the failure point in the first place, our piece on &lt;a href="https://fountaincity.tech/resources/blog/why-ai-pilots-fail/" rel="noopener noreferrer"&gt;why AI pilots fail&lt;/a&gt; covers the broader operational pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accuracy tier is the upfront decision that cascades into everything
&lt;/h2&gt;

&lt;p&gt;Before a single test gets written, the team needs to declare the acceptable failure rate for the system. This decision cascades into every downstream choice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What counts as a passing test?&lt;/li&gt;
&lt;li&gt;What level of judge-human agreement is acceptable?&lt;/li&gt;
&lt;li&gt;Where can spot-test sampling replace full-coverage testing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some example tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6-sigma&lt;/strong&gt; for finance, scientific, regulated, and medical-adjacent domains. The cost of a misstep is large or irreversible; the system needs full-coverage validation against ground truth, often including cross-checks against external sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4-sigma&lt;/strong&gt; for HR, general knowledge-work, and most internal productivity agents. The cost of a misstep is real but recoverable; the volume is high enough that exhaustive coverage isn’t economical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80% tier&lt;/strong&gt; for systems designed to augment human productivity rather than replace it. Examples include a system where the AI sets up the initial custom engineering solution for a prospective client’s project, or an automated RFP response system. Getting the human 80% of the way there at the start saves significant, measurable hours without needing to be 100% accurate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An illustration from a recent client build: we worked on a hydraulic 3D simulation system that produced engineering models without human-written code in the loop. That system needed 4-sigma accuracy, with no error producing anything beyond a small imperfection at very small scales. So the validation method wasn’t a small gold set. It used Gemini 3.1 Pro to cross-check the output of Anthropic’s Opus against published peer-reviewed literature, followed by a substantial number of generation runs to confirm the model stayed consistent from one generation to the next. The tier dictated the validation method, not the other way around.&lt;/p&gt;

&lt;p&gt;Tier choice also sets what “passing” your LLM judge means. &lt;a href="https://arize.com/llm-as-a-judge/" rel="noopener noreferrer"&gt;Arize’s published target of 75–90% judge-human agreement&lt;/a&gt; reads as a fixed number until you notice it’s tier-dependent — a system designed to augment human productivity can live with 80% right, 20% left to the human to finalize; a 6-sigma financial system likely can’t. Name the tier first, and every other decision in the chain gets cheaper to make.&lt;/p&gt;
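
&lt;p&gt;To make the tiers concrete, it can help to translate a sigma level into an expected failure rate. The sketch below assumes the conventional Six Sigma reading (a one-sided normal tail with the customary 1.5-sigma long-term shift); the function names are ours, and your own tier definitions may use a different convention.&lt;/p&gt;

```python
from statistics import NormalDist


def defect_rate(sigma_level: float, shift: float = 1.5) -> float:
    """Probability of a failure per interaction at a given sigma level.

    Assumption: the conventional Six Sigma definition, i.e. the one-sided
    normal tail beyond (sigma_level - shift) standard deviations.
    """
    return 1.0 - NormalDist().cdf(sigma_level - shift)


def expected_failures(sigma_level: float, interactions: int) -> float:
    """Expected number of failed interactions at that tier."""
    return defect_rate(sigma_level) * interactions


if __name__ == "__main__":
    for tier in (4.0, 6.0):
        per_million = defect_rate(tier) * 1_000_000
        print(f"{tier:.0f}-sigma: about {per_million:,.1f} failures per million")
```

&lt;p&gt;Under that convention, 6-sigma works out to roughly 3.4 failures per million interactions and 4-sigma to roughly 6,200 per million, while the 80% tier simply tolerates 1 in 5. Seeing the three numbers side by side makes it clearer why the tiers call for different validation methods.&lt;/p&gt;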

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpgs50fc2ecet0r8dwsv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpgs50fc2ecet0r8dwsv.jpg" alt="Two professionals in a modern office looking at a shared monitor reviewing system metrics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation belongs at the harness layer, not just at the output
&lt;/h2&gt;

&lt;p&gt;An agent doesn’t fail at the output. It fails at memory, at a tool call, at a feedback loop that doesn’t terminate, at an API budget that doesn’t trigger a cutoff. An output-only eval tells you something broke. A layer-targeted eval tells you what — and that’s the difference between an alert and a fix.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://fountaincity.tech/resources/blog/anatomy-of-an-agent-harness/" rel="noopener noreferrer"&gt;anatomy of an agent harness&lt;/a&gt; breaks the harness into seven components: execution sandbox, auth and identity, memory and context, tool calls, orchestration, cost governance, and observability. Each has its own failure modes, and each gets its own test surface.&lt;/p&gt;

&lt;p&gt;What we’ve found running this in practice is that the test surface gets concrete quickly once the layers are named. The following list is the set of tests we think of first when planning our work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory recall under context churn.&lt;/strong&gt; Does the agent retrieve the right prior context when the window has been rewritten several times? Synthesize churn by injecting unrelated turns between question and answer, then measure retrieval accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-call schema adherence.&lt;/strong&gt; Does the agent produce tool arguments that match the declared schema, including under prompt variation? Does it always call the tools you expect? A tool-call linter at the gateway catches drift before it reaches the tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API overspend cutoffs.&lt;/strong&gt; Does the cost-governance layer actually halt the run when the per-task budget is hit? Test by setting a deliberately low cap and confirming the cutoff fires; many systems alert without halting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback-loop termination.&lt;/strong&gt; Does the agent escape a stuck state? Inject a recoverable failure (a tool that fails on the first call, succeeds on the second) and confirm the agent retries and proceeds, rather than looping or stalling without a logged failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination control gates.&lt;/strong&gt; Where are the gates that catch fabricated outputs, and do they fire on known failure cases? Run a held-out set of prompts that are known to induce hallucination in similar systems and confirm the gates catch them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission and policy boundaries.&lt;/strong&gt; Does the agent attempt actions outside its authorization scope, and does the sandbox refuse correctly? When a permission is denied, does the agent recover, or does it go into a death spiral? Test by running prompts that try to escalate, and confirm the refusal is logged and surfaced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability completeness.&lt;/strong&gt; Can a trace be reconstructed for any production interaction? If a failure can’t be debugged after the fact, the observability layer itself has a failure mode the evaluation needs to catch.&lt;/li&gt;
&lt;/ul&gt;
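
&lt;p&gt;Two of the tests above, the overspend cutoff and feedback-loop termination, are small enough to sketch. This is a minimal illustration rather than a real harness: BudgetGuard, FlakyTool, and the stand-in agent loop are hypothetical names, and costs are tracked in whole cents to keep the arithmetic exact.&lt;/p&gt;

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a task past its cap."""


class BudgetGuard:
    """Hypothetical cost-governance layer that halts instead of only alerting."""

    def __init__(self, cap_cents: int) -> None:
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def charge(self, cost_cents: int) -> None:
        # Refuse the call before the spend is recorded.
        if self.spent_cents + cost_cents > self.cap_cents:
            raise BudgetExceeded(f"cap of {self.cap_cents} cents reached")
        self.spent_cents += cost_cents


def run_agent(guard: BudgetGuard, step_cost_cents: int, max_steps: int) -> int:
    """Stand-in agent loop: every step is one paid model call."""
    steps = 0
    for _ in range(max_steps):
        guard.charge(step_cost_cents)
        steps += 1
    return steps


def cutoff_fires(cap_cents: int, step_cost_cents: int, max_steps: int) -> bool:
    """True if the run was halted by the budget guard."""
    try:
        run_agent(BudgetGuard(cap_cents), step_cost_cents, max_steps)
    except BudgetExceeded:
        return True
    return False


class FlakyTool:
    """Recoverable failure: errors on the first call, succeeds on the second."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("simulated transient failure")
        return "ok"


def call_with_retries(tool, max_attempts: int = 3) -> tuple[str, list[str]]:
    """Retry a bounded number of times and keep a log of every failure."""
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(), errors
        except TimeoutError as exc:
            errors.append(f"attempt {attempt}: {exc}")
    raise RuntimeError("; ".join(errors))
```

&lt;p&gt;The cutoff test sets a deliberately low cap and asserts that the run is halted; the termination test asserts that the agent proceeds after one retry and that the failure was logged, not swallowed.&lt;/p&gt;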

&lt;p&gt;None of these tests live at the output. They live at the layer where the failure originates. Output-level evals stay useful as the canary; layer-level evals are how the team fixes what the canary surfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where your evaluation signal comes from determines the cadence
&lt;/h2&gt;

&lt;p&gt;In our direct experience, the question “how often should evaluation crons run” is usually the wrong question. The right one, we find, is “where does your evaluation signal come from?”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production observability.&lt;/strong&gt; Real usage is the strongest evaluation signal. If the system is being used at any meaningful volume, the production traffic itself becomes the eval dataset. &lt;a href="https://microsoft.github.io/ai-agents-for-beginners/10-ai-agents-production/" rel="noopener noreferrer"&gt;Microsoft’s continuous improvement loop&lt;/a&gt; describes the mechanic: observability data from production informs offline experimentation and refinement; the loop runs continuously, not as a one-time gate. &lt;a href="https://www.arthur.ai/blog/best-practices-for-building-agents-part-3-continuous-evaluations" rel="noopener noreferrer"&gt;Arthur AI’s distinction&lt;/a&gt; between supervised evaluations (which require a known correct answer) and unsupervised evaluations (which assess behavior from the agent’s own context alone) is the operational mechanism. Unsupervised evals can run against every production interaction without needing a labeled set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger-based, in-process evaluation.&lt;/strong&gt; One agent judges the prior agent’s output as part of the workflow. This is not on the clock; it’s driven by execution. For high-volume, lower-criticality operations, sampling is fine. Here the judge samples a random percentage of runs, or uses a risk model to route higher-stakes outputs to the strict judge gate. We tend to think of this the way a factory tests bolts: you don’t have to inspect every bolt to know the batch is good, but you do have to inspect enough that the inference is defensible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron-based evaluation.&lt;/strong&gt; For the system that doesn’t get used enough to accumulate production observations, but has to perform when called, cron is the fallback. Low-traffic internal agents, regulated systems with sparse usage, and pre-launch pilots where production data doesn’t exist yet: these are the cases where a scheduled benchmark run earns its place. Pilot-phase batch testing, where we synthesize thousands to hundreds of thousands of test interactions through the system to surface failure modes before users see them, is another good example, though it’s batch-on-demand rather than truly a “cron”.&lt;/li&gt;
&lt;/ul&gt;
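
&lt;p&gt;The sampling idea in the trigger-based case can be sketched as a small router. Everything here is an illustrative assumption: the risk score is taken as given (how it is produced is out of scope), and the 0.8 threshold and 5% sample rate are placeholders, not recommendations.&lt;/p&gt;

```python
import random


def route_to_judge(
    risk_score: float,
    rng: random.Random,
    risk_threshold: float = 0.8,
    sample_rate: float = 0.05,
) -> str:
    """Decide which evaluation path a single run takes.

    "strict"  : high-stakes output, always sent to the strict judge gate
    "sampled" : low-stakes output picked by random sampling
    "skip"    : low-stakes output that is not judged this time
    """
    if risk_score >= risk_threshold:
        return "strict"
    # rng.random() is in [0, 1), so a rate of 0.0 never samples
    # and a rate of 1.0 always does.
    if sample_rate > rng.random():
        return "sampled"
    return "skip"


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so a run is reproducible
    decisions = [route_to_judge(0.1, rng) for _ in range(10_000)]
    print(decisions.count("sampled"), "of 10,000 low-risk runs were judged")
```

&lt;p&gt;Passing the random generator in explicitly keeps the router testable: a seeded generator makes the sampling decision reproducible in a regression test.&lt;/p&gt;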

&lt;p&gt;A system with strong production traffic shouldn’t be running synthetic crons it doesn’t need, unless there are critical scenarios that production traffic isn’t hitting. Meanwhile, a system with little or no production usage shouldn’t pretend trigger-based evals will catch what only batch testing finds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d0xgps7yk2a46d3nx7i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d0xgps7yk2a46d3nx7i.jpg" alt="Close up of glowing translucent data streams and metric panels hovering above a dark walnut desk" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When LLM-as-judge fails, it fails at the rubric
&lt;/h2&gt;

&lt;p&gt;It is good practice to use an LLM to test and evaluate the quality of your AI system. We call this LLM-as-judge because the model is testing your system, or specific agents within it, to determine whether they are making mistakes.&lt;/p&gt;

&lt;p&gt;The judge is only as good as the rubric it judges against. Teams iterate the judge prompt and the judge model without iterating the pass/fail definition, and the wrong things keep passing while the right things keep failing. The dominant failure mode of LLM-as-judge in practice isn’t bias in the model; it’s a pass/fail definition that was never sharpened against actual failure cases. Refining the criteria (what specifically counts as a pass for a given test case, broken down by what the system needs to demonstrate) often yields far greater improvements than refining the prompt or swapping the judge model.&lt;/p&gt;

&lt;p&gt;Practically, the discipline has three moves. First, score binary pass/fail rather than on a range. Arthur AI’s observation is that the same interaction can score a 4 on one run and a 6 on another from the same judge; binary judgments are more consistent and force the rubric to be sharp. Second, validate the judge against a small golden dataset: hitting your tier-appropriate judge-human agreement target on the gold set is a reasonable bar for most 4-sigma systems and a starting point to tighten upward for higher-tier work. Third, refine the rubric on every failure case before refining anything else. If the judge passed something that shouldn’t have passed, review the rubric carefully to make sure your criteria are not the problem. The model is mostly innocent.&lt;/p&gt;

&lt;p&gt;None of this means swapping and comparing models isn’t worthwhile. It can lead to very measurable changes in price or performance. But models, and your systems, will always optimize toward the thing we evaluate them against, not toward how smart they are in general.&lt;/p&gt;
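
&lt;p&gt;Validating a binary judge against a gold set comes down to a few lines of arithmetic. The sketch below is a minimal version under our own assumptions: labels are plain booleans (True for pass), the function names are hypothetical, and the ten-case sample is made up for illustration.&lt;/p&gt;

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Share of gold-set cases where the judge matches the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def false_passes(case_ids: list[str], judge: list[bool], human: list[bool]) -> list[str]:
    """Cases the judge passed that a human failed: review the rubric for these first."""
    return [cid for cid, j, h in zip(case_ids, judge, human) if j and not h]


def false_fails(case_ids: list[str], judge: list[bool], human: list[bool]) -> list[str]:
    """Cases the judge failed that a human passed."""
    return [cid for cid, j, h in zip(case_ids, judge, human) if h and not j]


if __name__ == "__main__":
    ids = [f"case-{n}" for n in range(1, 11)]
    human = [True, True, False, True, False, True, True, False, True, True]
    judge = [True, True, True, True, False, True, False, False, True, True]
    print("agreement:", agreement_rate(judge, human))
    print("false passes:", false_passes(ids, judge, human))
    print("false fails:", false_fails(ids, judge, human))
```

&lt;p&gt;Splitting disagreements into false passes and false fails matters because the two usually point at different rubric gaps: one criterion that is too loose, another that is too strict.&lt;/p&gt;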

&lt;p&gt;DSPy fits here as the structured-optimization layer for cases where the rubric is well-defined enough to optimize against. In plain English, DSPy is a way to declare your task as composable modules and let a compiler optimize the prompts against a measurable downstream metric: instead of hand-tuning prompts, you tune the metric and let the compiler find the prompt. It pays off most clearly for people-facing systems where input prompt quality varies widely (the input you can’t control), and less for closed-domain backend tasks where prompt quality is already stable. DSPy doesn’t replace LLM-as-judge; it operates on top of a judge metric that’s already calibrated. Sequence matters: calibrate the judge first, then optimize against it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A minimum-viable evaluation setup for agent testing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trace and logging layer.&lt;/strong&gt; You need to be able to review exactly what fails, when, under what conditions, how often, and after how many tries. It’s hard to over-log: logging is cheap, especially in pilot and development stages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A small gold set of 10–50 examples.&lt;/strong&gt; The most important cases the system has to handle correctly, written down explicitly, with expected outputs or expected trajectories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One deterministic grader.&lt;/strong&gt; Schema validity, latency, cost per task, token usage. Things that don’t need an LLM to judge. Run on every interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One LLM judge with a calibrated rubric.&lt;/strong&gt; Calibrated to your accuracy-tier-appropriate judge-human agreement target on the gold set before scaling to production traffic. Binary pass/fail. Rubric updates on every failure case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-feedback loop.&lt;/strong&gt; Failures from production get added back to the gold set. The judge gets re-validated against the expanded set periodically. The system learns from being used, not just from being built.&lt;/li&gt;
&lt;/ul&gt;
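
&lt;p&gt;Of the five components, the deterministic grader is the quickest to stand up. Here is a minimal sketch, assuming the agent returns a JSON object; the required fields and the latency, cost, and token limits are hypothetical placeholders to be replaced with your own.&lt;/p&gt;

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical per-task limits; tune these to your accuracy tier."""

    max_latency_s: float = 5.0
    max_cost_usd: float = 0.05
    max_tokens: int = 4000


# Hypothetical output schema: field name mapped to its required type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def grade(
    raw_output: str,
    latency_s: float,
    cost_usd: float,
    tokens: int,
    limits: Limits = Limits(),
) -> dict[str, bool]:
    """Run every check that does not need an LLM and report each result."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        payload = None

    checks = {
        "schema_valid": isinstance(payload, dict)
        and all(
            isinstance(payload.get(name), kind)
            for name, kind in REQUIRED_FIELDS.items()
        ),
        "latency_ok": limits.max_latency_s >= latency_s,
        "cost_ok": limits.max_cost_usd >= cost_usd,
        "tokens_ok": limits.max_tokens >= tokens,
    }
    # Overall verdict is binary: every individual check has to pass.
    checks["passed"] = all(checks.values())
    return checks
```

&lt;p&gt;Returning each check separately, not just the overall verdict, keeps the result useful for the trace layer: a failure says which limit was breached, not only that something was.&lt;/p&gt;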

&lt;p&gt;In our experience, sequencing matters: accuracy needs are defined upfront; infrastructure (including cost) is designed before the PoC; the gold set comes before the judge platform; the judge gets calibrated before any DSPy optimization gets layered on; and the cost of evaluation is baked into the infrastructure design from the start, then optimized with testing.&lt;/p&gt;

&lt;p&gt;Evaluation crons, judge calls, and continuous test runs all show up on the API invoice like any other model call. Our work on &lt;a href="https://fountaincity.tech/resources/blog/ai-cost-optimization-practitioner-framework/" rel="noopener noreferrer"&gt;cost-optimization in AI systems&lt;/a&gt; covers the dispatcher-first architecture that catches needless model calls in the agent workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F06%2F2026-06-03-J-evaluation-led-agent-development-04.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F06%2F2026-06-03-J-evaluation-led-agent-development-04.svg" alt="Flowchart of the minimum viable evaluation loop: define success criteria, build gold set, run hybrid eval, identify regressions, feed failures back" width="100" height="50.0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The five disciplines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Discipline&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;What it costs to do badly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy tier declaration&lt;/td&gt;
&lt;td&gt;The acceptable failure rate, named before any test gets written&lt;/td&gt;
&lt;td&gt;Wasted budget on over-engineered evals for low-stakes systems; shipping high-stakes systems without defensible accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Harness-layer testing&lt;/td&gt;
&lt;td&gt;Memory, tool calls, cost cutoffs, feedback loops, hallucination gates, permissions, observability — each with its own test surface&lt;/td&gt;
&lt;td&gt;Failures that surface at the output with no signal about which layer broke; alerts you can’t act on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signal-source matching&lt;/td&gt;
&lt;td&gt;Whether evaluation runs against production traffic, in-process triggers, or scheduled batches — based on usage volume&lt;/td&gt;
&lt;td&gt;Synthetic crons that miss what real users do; production systems with no offline regression coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge + rubric calibration&lt;/td&gt;
&lt;td&gt;Pass/fail definition, judge prompt effectiveness, model validation&lt;/td&gt;
&lt;td&gt;Confident wrong answers passing through unnoticed; correct answers flagged as failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost-of-evaluation budgeting&lt;/td&gt;
&lt;td&gt;Per-task judge cost, weekly benchmark cost, cost per failure caught&lt;/td&gt;
&lt;td&gt;Evaluation infrastructure costing more than the agents it evaluates; evaluation rollbacks under cost pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The ten-question audit
&lt;/h2&gt;

&lt;p&gt;Here are some questions a technical lead, agency owner, or program owner can ask their team to quickly learn where the production gaps lie:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have we defined the level of accuracy this system has to meet — and does the team agree on it?&lt;/li&gt;
&lt;li&gt;Do we have a gold set of test cases (10–50 examples) that the system has to pass before any change ships?&lt;/li&gt;
&lt;li&gt;When a test fails, can we tell which harness layer broke: memory, tool call, cost cutoff, hallucination gate, or only that the output was wrong?&lt;/li&gt;
&lt;li&gt;Where does our evaluation signal come from: production traffic, in-process triggers, or scheduled batches? Have we made that choice deliberately?&lt;/li&gt;
&lt;li&gt;If we run an LLM as a judge, do we know what percentage of the time it agrees with a human on the gold set? Is that percentage acceptable for our accuracy tier?&lt;/li&gt;
&lt;li&gt;When the judge passes something that shouldn’t have passed, do we update the rubric, or only the data and the prompt?&lt;/li&gt;
&lt;li&gt;What does evaluation cost us per week, and is that cost line tracked alongside the agents’ own cost line?&lt;/li&gt;
&lt;li&gt;When a report of a real failure lands, does that failure end up in the gold set automatically, or does it get lost?&lt;/li&gt;
&lt;li&gt;If a high-volume operation can’t run full evaluation on every call, what is our sampling strategy — and is the risk model behind it defensible?&lt;/li&gt;
&lt;li&gt;Could we hand this evaluation setup to a new engineer joining the team next month and have them know what each component does and why?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmcaqqd23u2yoqmw5865.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmcaqqd23u2yoqmw5865.jpg" alt="Beautiful fountain in a sunset-lit plaza with holographic data fragments floating in the mist" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is evaluation-led agent development?
&lt;/h3&gt;

&lt;p&gt;A development practice in which evaluation is the primary discipline shaping how an agent is built, tested, and operated — not a quality-assurance step at the end. The academic name is evaluation-driven development and operations (EDDOps). In practice, it means defining accuracy tiers before writing tests, testing at the harness layer rather than only at the output, matching evaluation signal to production usage patterns, calibrating LLM-as-judge against human-labeled gold sets, and budgeting evaluation as infrastructure from the start.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I evaluate an AI agent in production?
&lt;/h3&gt;

&lt;p&gt;Three signal sources to choose from based on usage volume: production observability with unsupervised evals running against every interaction, trigger-based in-process evals where one agent judges the prior agent’s output, and scheduled batch or cron evaluation for systems without enough production traffic to self-validate. Most production systems with real usage volume run unsupervised evals on production data continuously, with offline regression tests against a gold set on every change.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the difference between offline and online evaluation for AI agents?
&lt;/h3&gt;

&lt;p&gt;Offline evaluation runs against fixed datasets (held-out test cases, historical traces, synthesized usage) and is the default for pre-production regression testing and CI/CD gates. Online evaluation runs against live production traffic, often using unsupervised evals that don’t require a known correct answer. Both belong in a production system: offline catches regressions before deployment, online catches drift after deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use LLM-as-judge for evaluating AI agents?
&lt;/h3&gt;

&lt;p&gt;Whenever the evaluation requires semantic judgment — helpfulness, groundedness, tone, reasoning quality — that deterministic checks can’t capture. Reserve deterministic graders (schema, latency, cost) for what they’re good at, and use LLM-as-judge for the rest. Always calibrate against a gold set first; aim for &lt;a href="https://arize.com/llm-as-a-judge/" rel="noopener noreferrer"&gt;75–90% agreement with human labels&lt;/a&gt; before scaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the limitations of LLM-as-judge?
&lt;/h3&gt;

&lt;p&gt;The published limitations (position bias, verbosity bias, self-enhancement bias, prompt sensitivity) are real and worth knowing. The more common failure in practice is that the test criteria themselves were undertheorized: the rubric the judge judges against was never sharpened against actual failure cases. &lt;a href="https://arxiv.org/html/2512.04123v1" rel="noopener noreferrer"&gt;Recent measurement work&lt;/a&gt; finds 74% of production agents still rely primarily on human-in-the-loop evaluation rather than standardized benchmarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is DSPy and when should I use it?
&lt;/h3&gt;

&lt;p&gt;DSPy is a framework for declaring tasks as composable modules and letting a compiler optimize prompts against a measurable downstream metric. Use it when the metric is well-defined (a calibrated judge counts) and the input prompt quality is variable — typically people-facing systems where you can’t control what users type. Skip it when the metric is squishy or when prompt quality is already stable; hand-tuning still wins there.&lt;/p&gt;

&lt;h3&gt;
  
  
  How big should my held-out test set be for an AI agent?
&lt;/h3&gt;

&lt;p&gt;Start with 10–50 examples — small enough to write by hand, large enough to catch the failure modes you already know about. The set grows as production failures get added back into it. Most small-team systems plateau usefully around 100–300 examples, though the right size is whatever covers the failure modes the accuracy tier requires.&lt;/p&gt;

&lt;h3&gt;
  
  
  How often should I run evaluation crons against my agents?
&lt;/h3&gt;

&lt;p&gt;Probably not on a clock. For systems with meaningful production traffic, run unsupervised evals against production interactions and offline regression tests on every deployment. Cron-based evaluation is the right cadence for systems with sparse usage — internal agents called rarely but expected to perform when called — where production data isn’t accumulating fast enough to provide its own signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can a small team (1-5 agents) actually do evaluation-led development?
&lt;/h3&gt;

&lt;p&gt;Yes, and the discipline matters more for small teams than for large ones because there’s less margin for a failure mode to surface twice. The five-component minimum stack (trace layer, gold set of 10–50 examples, deterministic grader, calibrated LLM judge, production-feedback loop) is buildable in one sprint. The constraint is sequencing discipline; headcount isn’t the gating factor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is evaluation-led development the same as MLOps?
&lt;/h3&gt;

&lt;p&gt;Overlapping, not identical. MLOps covers the full lifecycle of ML systems (training, deployment, monitoring, retraining) and predates agentic systems. Evaluation-led development focuses on the testing and judgment discipline specifically, and applies to agent systems that often don’t involve model training at all. EDDOps is closer to TDD for agents than to MLOps for models.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>AI Meta + Google Ad Monitoring Platform</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Wed, 03 Jun 2026 18:12:27 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/ai-meta-google-ad-monitoring-platform-52cg</link>
      <guid>https://dev.to/sebastian_chedal/ai-meta-google-ad-monitoring-platform-52cg</guid>
      <description>&lt;p&gt;Unleashed Consulting runs paid advertising for roughly 80 local pet-services businesses across Google Ads, Meta, and Local Services Ads. Their media buyers are good at what they do. The challenge isn’t skill: it’s math. Each client runs on multiple platforms. Each platform has its own dashboard. A buyer doing deep optimization on one account is, by definition, not watching the other seventy-nine at that moment.&lt;/p&gt;

&lt;p&gt;They wanted to change the ratio. Instead of deep-diving a handful of accounts per day and sampling the rest, they wanted every client getting expert-level attention on every cycle, and they wanted that coverage to scale as the client book grows, without scaling headcount to match.&lt;/p&gt;

&lt;p&gt;We built &lt;strong&gt;Pepper Ad Coach&lt;/strong&gt; to give them just that. The system monitors all connected accounts on a two-hour cycle, scores campaign health against each client’s own targets and baselines, and pushes two kinds of output: alerts when a metric crosses a threshold worth acting on, and coaching — what a strong media buyer would look at next, prioritized by urgency. Critical alerts fire instantly to Slack and email. Everything else batches into a morning digest. A web dashboard rolls up the full portfolio so the team sees where today’s highest-value work is at a glance.&lt;/p&gt;

&lt;p&gt;New clients onboard through a self-service portal that connects their ad accounts. No engineering ticket, no setup delay. The agency’s upfront build cost was low, and ongoing costs scale per customer: their investment grows only as the client book grows, not before. That pricing predictability comes from a deliberate architecture choice: Pepper uses AI heavily at build time to author the system’s expert judgment, but makes zero LLM calls at runtime. The intelligence is compiled into lookup tables once, then the production system runs on pure arithmetic. No per-query AI costs, no output variability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Pepper actually does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumstkyecjwab7f1gafb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiumstkyecjwab7f1gafb.png" alt="Pepper Ad Coach. -Dashboard" width="800" height="471"&gt;&lt;/a&gt;&lt;em&gt;Dummy data testing alerts on the dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every two hours during business hours, for every client, on every connected platform, Pepper pulls the latest campaign numbers, evaluates them against that client’s own targets and recent baseline, and produces two kinds of output.&lt;/p&gt;

&lt;p&gt;The first is &lt;strong&gt;alerts&lt;/strong&gt;: a campaign crossed a danger threshold. There are seven defined alert types: cost-per-lead spikes, zero-lead-after-spend stretches, budget overpacing, click-through-rate collapse, and so on. Critical alerts go out instantly. Lower-severity warnings batch into a single 8 a.m. daily digest so the team is not drowning in pings.&lt;/p&gt;

&lt;p&gt;The second is &lt;strong&gt;coaching&lt;/strong&gt;: nothing has broken yet, but here is what a good media buyer would do next. Coaching items are prioritized as &lt;em&gt;act today&lt;/em&gt;, &lt;em&gt;act this week&lt;/em&gt;, or &lt;em&gt;act this month&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Delivery sits where the team already works: Slack and email for push notifications, plus a web dashboard with a rolled-up view across all clients (critical rows auto-expand so fires surface to the top). New clients onboard through a self-service portal that connects their ad accounts. No engineering ticket required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero LLM calls at runtime
&lt;/h2&gt;

&lt;p&gt;Most products marketed as “AI-powered” call a large language model every time they do something. At agency scale, that approach hits three chronic problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost scales with volume.&lt;/strong&gt; Multiplying clients by evaluation cycles by campaigns produces a large, variable bill that’s hard to price a flat monthly fee against.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outputs are opaque.&lt;/strong&gt; The model says “lower your bids” and you often cannot fully explain why. It might say something slightly different on the next run with identical data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system is hard to test.&lt;/strong&gt; The same input can produce different output, which makes regression testing fragile. How often are you okay with it saying the wrong thing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pepper inverts the pattern. It uses LLMs heavily, &lt;em&gt;but only once&lt;/em&gt;, at build time, to author the system’s judgment. At runtime, the hot path is &lt;em&gt;pure deterministic calculation&lt;/em&gt;. The mental model is “compile the expertise &lt;em&gt;once&lt;/em&gt;, run it cheaply &lt;em&gt;forever&lt;/em&gt;.” It’s the difference between hiring a consultant to write you a decision playbook and paying the consultant every time you have a question.&lt;/p&gt;

&lt;p&gt;The pattern has a name in the research literature. A &lt;a href="https://arxiv.org/html/2604.05150v1" rel="noopener noreferrer"&gt;recent paper on compiled AI&lt;/a&gt; defines the paradigm as one where LLMs generate executable artifacts during a compilation phase, after which workflows execute deterministically without further model invocation. The trade is runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inside the Decision Matrix
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-30-J-pepper-deterministic-execution-03.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-30-J-pepper-deterministic-execution-03.svg" alt="Flowchart diagram showing the zero-LLM runtime execution path" width="100" height="16.842105263157894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The expert judgment lives in a set of lookup tables we call the &lt;strong&gt;Decision Matrix&lt;/strong&gt;. In Release 1 there are 15 matrices containing a total of 99 unique cells.&lt;/p&gt;

&lt;p&gt;At runtime, the input properties are checked against two- to four-dimensional tables to find the corresponding alert or recommendation. Each combination of factors is considered once, cached in the table, and then returned whenever the same conditions are met again. &lt;a href="https://data443.com/blog/deterministic-policy-vs-llm-filters/" rel="noopener noreferrer"&gt;Deterministic systems&lt;/a&gt; are advantageous because identical inputs always produce identical outputs.&lt;/p&gt;

&lt;p&gt;What the user sees is rendered from a template in the format {customer.name}'s CPL on {platform.name} is {metric.cpl_value} ({metric.cpl_delta_pct}% above 7-day avg), which renders as something like &lt;em&gt;“Spot Doggie Care’s CPL on Google Ads is well above the 7-day average.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A coaching example might be something like:&lt;/p&gt;

&lt;p&gt;{Customer} Google Ads quality score is low (6/10) but rank-loss is modest.&lt;/p&gt;

&lt;p&gt;The bidding side is okay; the relevance side needs work.&lt;/p&gt;

&lt;p&gt;→&amp;nbsp;Refresh ad copy and landing-page relevance to lift quality score before raising bids.&lt;/p&gt;
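
&lt;p&gt;The bucket-lookup-render path can be sketched in a few lines of Python. This is a minimal illustration, not Pepper’s actual code: the bucket edges, dimension names, matrix cells, second template, and sample numbers are all hypothetical, and only the first template string is taken from the format shown earlier.&lt;/p&gt;

```python
from bisect import bisect_right
from types import SimpleNamespace

# Hypothetical bucket edges; the real thresholds are not published here.
CPL_DELTA_EDGES = [10.0, 50.0]            # percent above the 7-day average
CPL_DELTA_LABELS = ["normal", "elevated", "spike"]
SPEND_EDGES = [100.0]                     # spend in the evaluation window
SPEND_LABELS = ["low_spend", "material_spend"]

# A two-dimensional matrix: (CPL bucket, spend bucket) -> (severity, template).
# Combinations that are not listed mean "nothing to report".
CPL_MATRIX = {
    ("spike", "material_spend"): (
        "critical",
        "{customer.name}'s CPL on {platform.name} is {metric.cpl_value} "
        "({metric.cpl_delta_pct}% above 7-day avg)",
    ),
    ("elevated", "material_spend"): (
        "warning",
        "{customer.name}'s CPL on {platform.name} is drifting up "
        "({metric.cpl_delta_pct}% above 7-day avg)",
    ),
}


def bucket(value, edges, labels):
    """Sort a live number into a named bucket using sorted edges."""
    return labels[bisect_right(edges, value)]


def evaluate(customer, platform, metric, spend):
    """Pure function: the same inputs always return the same alert (or None)."""
    key = (
        bucket(metric.cpl_delta_pct, CPL_DELTA_EDGES, CPL_DELTA_LABELS),
        bucket(spend, SPEND_EDGES, SPEND_LABELS),
    )
    cell = CPL_MATRIX.get(key)
    if cell is None:
        return None
    severity, template = cell
    # str.format resolves attribute access such as {customer.name}.
    return severity, template.format(
        customer=customer, platform=platform, metric=metric
    )


alert = evaluate(
    SimpleNamespace(name="Spot Doggie Care"),
    SimpleNamespace(name="Google Ads"),
    SimpleNamespace(cpl_value="$42.10", cpl_delta_pct=63),
    spend=250.0,
)
```

&lt;p&gt;With these made-up inputs, &lt;em&gt;alert&lt;/em&gt; is a critical item whose message reads “Spot Doggie Care's CPL on Google Ads is $42.10 (63% above 7-day avg)”. Nothing in the path calls a model, so the same numbers produce the same message on every run.&lt;/p&gt;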

&lt;p&gt;For agency owners, the payoff of this approach is a much lower monthly fee, because runtime costs do not increase linearly with usage. If we want to change how Pepper reasons, we regenerate the lookup tables by running a new round of test data through the LLMs. Each regeneration is a one-time run, and its results are baked into the tables for maximum cost efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honesty discipline: what we dropped
&lt;/h2&gt;

&lt;p&gt;The team was disciplined about not faking sophistication. Where a “smart” signal needed data the system does not have, we either proxied it transparently or dropped the dimension entirely.&lt;/p&gt;

&lt;p&gt;Lead &lt;em&gt;quality&lt;/em&gt; is the clearest example of a transparent proxy. It needs CRM data Pepper does not have in Release 1, so it is approximated from the cost-per-lead ratio and labeled as an honest simplification. A “search-terms relevance” dimension was dropped because it required LLM judgment at runtime, violating the zero-runtime-LLM principle. An “audience size” dimension was dropped because it required hardcoded vertical guesses that turned out to be a poor signal.&lt;/p&gt;

&lt;p&gt;A product that admits what it does not know tends to read as more trustworthy than one that pretends to know everything. Every dropped dimension is a small surface where the system could have been wrong; cutting them raises the average quality of what is left.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering decisions that shaped the system
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One OAuth connection per client, not a master connection.&lt;/strong&gt; Pepper authorizes each client’s ad accounts independently. It costs a bit more onboarding effort and buys fault isolation: if one client’s token expires, only that one client goes dark. For a system whose entire value is &lt;em&gt;aggregate&lt;/em&gt; coverage, blast-radius containment is essential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotent reconnect over clever reconnect.&lt;/strong&gt; An early production bug taught this. A “Reconnect” button could collide with a stale database record and throw an error, forcing a support cycle. The fix made the operation idempotent: it always succeeds, whether or not an earlier record exists. For client-facing surfaces, reliability beats elegance.&lt;/p&gt;
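
&lt;p&gt;One common way to get that property is an upsert keyed on the client and platform. The sketch below uses SQLite purely for illustration; the table name, columns, and status values are assumptions, not Pepper’s actual schema, and a real system would store the token encrypted rather than in plain text.&lt;/p&gt;

```python
import sqlite3

# In-memory database with a hypothetical schema, for illustration only.
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE connections (
        client_id TEXT NOT NULL,
        platform  TEXT NOT NULL,
        token     TEXT NOT NULL,
        status    TEXT NOT NULL,
        PRIMARY KEY (client_id, platform)
    )
    """
)


def reconnect(client_id, platform, new_token):
    """Create or overwrite the connection row; a stale record cannot block it."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO connections "
            "(client_id, platform, token, status) VALUES (?, ?, ?, 'active')",
            (client_id, platform, new_token),
        )


# Simulate a stale row left behind by an earlier, expired connection.
with db:
    db.execute(
        "INSERT INTO connections VALUES (?, ?, ?, 'expired')",
        ("client-1", "google_ads", "old-token"),
    )

reconnect("client-1", "google_ads", "new-token")
reconnect("client-1", "google_ads", "new-token")  # repeating is harmless

rows = db.execute(
    "SELECT client_id, platform, token, status FROM connections"
).fetchall()
```

&lt;p&gt;After both calls the table holds exactly one active row for the client, which is the behavior a “Reconnect” button needs: pressing it once, twice, or after a failed earlier attempt ends in the same state.&lt;/p&gt;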

&lt;p&gt;&lt;strong&gt;Noise control as a design goal.&lt;/strong&gt; Critical alerts fire instantly; everything else batches into the 8 a.m. digest, with de-duplication so the same problem does not re-alert every two hours unless it materially worsened. A monitoring tool that cries wolf gets muted; restraint is the feature.&lt;/p&gt;
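
&lt;p&gt;A rough sketch of that routing and de-duplication logic, under stated assumptions: the 20% “materially worsened” rule, the class name, and the single higher-is-worse score per alert are hypothetical stand-ins, since the article does not publish the real rule. Clearing a resolved problem from the de-duplication memory is left out for brevity.&lt;/p&gt;

```python
# Hypothetical rule for "materially worsened": the score is at least 20%
# higher (worse) than it was the last time this alert was sent.
WORSEN_FACTOR = 1.2


class AlertRouter:
    def __init__(self):
        self.last_sent = {}   # (client, campaign, alert_type) -> last score sent
        self.instant = []     # critical alerts, delivered right away
        self.digest = []      # everything else, held for the daily digest

    def submit(self, client, campaign, alert_type, severity, score):
        """Route one alert; return 'instant', 'digest', or 'suppressed'."""
        key = (client, campaign, alert_type)
        previous = self.last_sent.get(key)
        if previous is not None and previous * WORSEN_FACTOR > score:
            return "suppressed"   # same problem, not materially worse
        self.last_sent[key] = score
        item = (client, campaign, alert_type, score)
        if severity == "critical":
            self.instant.append(item)
            return "instant"
        self.digest.append(item)
        return "digest"

    def flush_digest(self):
        """Called by a scheduler at digest time; returns and clears the batch."""
        batch, self.digest = self.digest, []
        return batch


router = AlertRouter()
first = router.submit("acme", "brand-search", "cpl_spike", "critical", 40.0)
repeat = router.submit("acme", "brand-search", "cpl_spike", "critical", 41.0)
worse = router.submit("acme", "brand-search", "cpl_spike", "critical", 50.0)
minor = router.submit("acme", "brand-search", "budget_overpacing", "warning", 12.0)
morning_batch = router.flush_digest()
```

&lt;p&gt;In this run the first critical alert goes out instantly, the near-identical repeat two hours later is suppressed, the materially worse reading fires again, and the warning waits for the digest.&lt;/p&gt;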

&lt;p&gt;&lt;strong&gt;Security by elimination, not addition.&lt;/strong&gt; Rather than bolting on widgets, Pepper removes whole categories of risk. There are zero inbound network ports on the server and database; public traffic reaches the dashboard only through a Cloudflare Tunnel, with no open web port and no SSH to attack. Administrative access goes through AWS Session Manager. Client OAuth tokens are encrypted at rest with managed keys, and the database lives in a private network with no internet access. Pepper is also broadcast-only: it sends and never listens, so there is no inbound-message parser to harden against injection. The harness components that make a setup like this work apply to any production agent system; we cover them in &lt;a href="https://fountaincity.tech/resources/blog/anatomy-of-an-agent-harness/" rel="noopener noreferrer"&gt;Anatomy of an Agent Harness&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it was built: 11 business days, AI-driven, human in review
&lt;/h2&gt;

&lt;p&gt;Pepper was built by an agent-driven development pipeline with a human in a review-and-decide role rather than hand-coding every line. An AI coding system drove implementation with cross-model adversarial review (a second model independently critiques the first’s work before it’s accepted), and work was tracked as discrete units the human approved at defined quality gates.&lt;/p&gt;

&lt;p&gt;The numbers from the project-coordinator side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-build planning ran across a handful of working sessions to lock the spec.&lt;/li&gt;
&lt;li&gt;Spec-done to UAT-ready took about &lt;strong&gt;11 business days&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The AI worked unsupervised in multi-hour stretches across the build.&lt;/li&gt;
&lt;li&gt;Spec conformity on the second pass through all the requirements was very high, with only a couple of minor issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfjz81cbo8lg6js7pbhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfjz81cbo8lg6js7pbhg.png" width="800" height="747"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Coding steps were quick to supervise. The longest human-time investment was &lt;em&gt;application authorization with Meta and Google&lt;/em&gt;: OAuth app review, scope justification, and platform paperwork. The model-on-model implementation loop has gotten fast enough that the human bottleneck has moved upstream of code, into platform-integration paperwork and scope decisions. A similar pattern shows up in our &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-case-study-voice-intelligence-platform/" rel="noopener noreferrer"&gt;Voice Intelligence Platform case study&lt;/a&gt;, where AI drove implementation while a human directed architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The business payoff
&lt;/h2&gt;

&lt;p&gt;From the agency owner’s side, the system changes four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proactive instead of reactive.&lt;/strong&gt; Problems get caught within roughly two hours of starting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whole-portfolio coverage that scales.&lt;/strong&gt; One team can effectively oversee 80+ clients because the system does the continuous watching and the consistent first-pass diagnosis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent quality.&lt;/strong&gt; Every client gets the same expert-grade first look regardless of which buyer is assigned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defensible advice.&lt;/strong&gt; Every recommendation traces to an explicit rule and real numbers, which is what you actually need in front of a client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deterministic architecture is what lets the pricing be predictable on both the agency side and the reseller side. Cost per end-customer scales with the size of the client book, and the math works for all three parties: the end client gets proactive monitoring that did not exist before, the agency gets a margin on a product they can stand behind, and we get a predictable recurring revenue line on infrastructure that does not get more expensive when usage goes up. Read more in our &lt;a href="https://fountaincity.tech/agency-reseller-solutions/" rel="noopener noreferrer"&gt;Agency &amp;amp; Reseller Solutions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdhebje2ugwxmeuiw7m0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdhebje2ugwxmeuiw7m0.jpg" alt="Lattice wireframe fountain structure with glowing nodes" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The build-vs-buy trade is covered in &lt;a href="https://fountaincity.tech/resources/blog/build-dont-buy-ai-agents-practitioners-guide/" rel="noopener noreferrer"&gt;Build, Don’t Buy AI Agents&lt;/a&gt;, and the agency economics in &lt;a href="https://fountaincity.tech/resources/blog/white-label-ai-agents-agency-economics/" rel="noopener noreferrer"&gt;White-Label AI Agents for Agencies&lt;/a&gt;. Custom systems make sense when off-the-shelf SaaS covers the wrong 80% of the workflow. Ad monitoring at agency scale, with this aggregate view and pricing predictability, turned out to be one of those cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What does “zero AI calls at runtime” actually mean?
&lt;/h3&gt;

&lt;p&gt;It means that when Pepper does its work (pulling campaign data, evaluating it, producing alerts and coaching), it does not call a large language model at any point. The expert judgment was authored ahead of time, with LLM assistance, into lookup tables. The runtime sorts live numbers into buckets, looks up the matching cell, and renders a template. No model invocation, no token cost per request, no variability from one run to the next.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this different from tools like Optmyzr or Revealbot?
&lt;/h3&gt;

&lt;p&gt;The category overlaps, but in our experience the core function differs. Optmyzr and Revealbot are PPC management platforms; they help you change bids, build campaigns, manage budgets, and report on activity. Pepper is a monitoring and coaching layer that sits above whatever campaign management you are already doing. It does not run your campaigns; it watches them on a two-hour cycle across the whole client book and tells you where to focus. There is an architectural split too: most modern “AI-powered” ad tools call an LLM on every analysis, with the cost and variability that brings. Pepper is built the opposite way, with the AI work done once at build time and the production system deterministic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Could this “compile the expertise” pattern work outside advertising?
&lt;/h3&gt;

&lt;p&gt;Yes, in domains where the judgment can be reasonably captured as a finite set of conditions and recommended actions. Compliance monitoring, support triage, policy routing, and quality assurance for repeatable processes are all candidates. The honest constraint: the more open-ended and context-dependent the judgment, the worse the lookup-table approach fits. The pattern works where a human expert could write a decision playbook; it does not work where every situation is genuinely novel and needs fresh reasoning. Ad metrics relative to a baseline are a bounded reasoning problem, which is why this architecture is the right call for it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>marketing</category>
      <category>automation</category>
    </item>
    <item>
      <title>The Future of Content Writing: Stages, Motivations, and Where the Writer Lands</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Mon, 01 Jun 2026 18:11:12 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/the-future-of-content-writing-stages-motivations-and-where-the-writer-lands-3nm3</link>
      <guid>https://dev.to/sebastian_chedal/the-future-of-content-writing-stages-motivations-and-where-the-writer-lands-3nm3</guid>
      <description>&lt;p&gt;Ask a content professional what worries them about AI and the answer is rarely about the technology. It’s about whether the craft itself still has a place. The craft is splitting in two, and the split has very little to do with which model you use. It has to do with why you were writing in the first place.&lt;/p&gt;

&lt;p&gt;Most arguments about AI and content stall at the tool layer. Which model is best, which prompt template wins, how to defeat “AI voice.” Those arguments tend to miss the more useful question: what is the writing for? The work, the process, and the tools all fall out of that one answer. Start there and the rest gets easier to read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why your motivation for writing, not the tool, sets your AI ceiling&lt;/li&gt;
&lt;li&gt;The quality spectrum from commodity content to novel insight, and what gets commoditized first&lt;/li&gt;
&lt;li&gt;An eight-stage evolution path writers walk as their systems mature&lt;/li&gt;
&lt;li&gt;The oil-painting-to-photography shift, and why it’s a useful parallel rather than a threat&lt;/li&gt;
&lt;li&gt;What writers become when the production work moves into a system&lt;/li&gt;
&lt;li&gt;FAQ on AI slop, SEO risk, knowing your stage, and what a real pipeline looks like&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hldsv1f3dqeo2qa57eb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hldsv1f3dqeo2qa57eb.jpg" alt="A writer at a desk with a warm, grounded holographic AI interface visible on a screen, working together creatively, glowing seeds of ideas." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Write Determines How You Use AI
&lt;/h2&gt;

&lt;p&gt;Sit with a room of content people and the motivations come out fast.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some are there to share knowledge.&lt;/li&gt;
&lt;li&gt;Some are mastering a topic by writing about it.&lt;/li&gt;
&lt;li&gt;Some are building authority, for themselves, their team, their company.&lt;/li&gt;
&lt;li&gt;Some are chasing search traffic, or the newer cousin of that, AI recommendation visibility.&lt;/li&gt;
&lt;li&gt;Some are writing because the act of writing is the work, the way a painter is in the painting.&lt;/li&gt;
&lt;li&gt;Some want to entertain.&lt;/li&gt;
&lt;li&gt;Some want to differentiate, to sound like themselves and no one else.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those aren’t soft distinctions. They run the whole stack underneath. If your motivation is authority and traffic, if you want the company to be cited and you want the search engine and the AI engine to point at you, you will eventually converge on a system that takes as little of your personal time as possible. You’re not writing to write. You’re writing to publish at a cadence and quality that wins the SERP and the recommendation engine. Delegating the production is the rational endpoint.&lt;/p&gt;

&lt;p&gt;If your motivation is mastery, or the creative process itself, you’ll move the other way. You’ll keep your hands on the keys even when you could automate it. The point isn’t the artifact. The point is what writing the artifact does to your thinking. Automation here doesn’t save you time; it skips the thing you came for.&lt;/p&gt;

&lt;p&gt;Writers carry several of these motivations at once, which is part of why the AI conversation gets muddled. A technical writer documenting a product and also building their own reputation has two motivations pulling in different directions. The first is a candidate for full system production. The second isn’t, and probably shouldn’t be. The shortest version: motivations set the direction; process is downstream of motivations; tools are downstream of process. Picking a tool first is the wrong order. It’s also the most common one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quality Spectrum: Commodity to Novel Insight
&lt;/h2&gt;

&lt;p&gt;Once you know what you’re writing for, the next useful question is what quality tier the work sits in. Content lives on a spectrum that runs roughly from commodity at one end to novel insight at the other. The further toward novel you go, the harder it is to commoditize. The further toward commodity, the faster AI systems catch up.&lt;/p&gt;

&lt;p&gt;Four rough bands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Commodity content.&lt;/strong&gt; Material that, if it vanished from the web, would be filled in by other pages from other sites without the world losing anything. “Top 10 CRMs for small business.” Definitional posts. Restated industry stats. Anyone with a research workflow can produce it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Novel framing.&lt;/strong&gt; The underlying research isn’t new, but the way you arrange it, contrast it, or name what you see is. A new mental model on top of public data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Novel research.&lt;/strong&gt; Work no one else has done. A study, a teardown, an experiment, a data set. The research exists because you ran it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Novel insight.&lt;/strong&gt; A reading of the world that a reader can’t get from anywhere else, because it comes out of your specific position, history, and access. The hardest to fake.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That spectrum isn’t a quality ranking in the moral sense. Commodity content has uses, including teaching the basics and serving direct-answer search. It’s a ranking of how exposed each tier is to commoditization by AI systems. &lt;a href="https://www.cnet.com/tech/services-and-software/slop-is-merriam-websters-2025-word-of-the-year-as-ai-content-floods-the-internet/" rel="noopener noreferrer"&gt;Merriam-Webster’s 2025 word of the year was “slop,”&lt;/a&gt; defined as digital content of low quality produced in quantity by means of artificial intelligence. The slop debate, in our reading, is really a debate about commodity content. AI systems are now producing it faster and cheaper than the writers who used to. That’s where the displacement is. Higher up the spectrum, the picture changes.&lt;/p&gt;

&lt;p&gt;For commodity work, AI doesn’t replace the writer because the writer was special. AI replaces the writer because the output was replaceable. The economic floor moves down. Anyone trying to compete with a system on commodity content using hand production is in the wrong race.&lt;/p&gt;

&lt;p&gt;For novel framing, novel research, and novel insight, AI tends to extend the writer rather than replace them. The system can run the research net wider, draft the structural scaffolding, surface counter-arguments, and free the writer’s attention for the parts that are genuinely theirs: the angle, the connection, the read of the situation. That work doesn’t get cheaper. It often gets more valuable, because the surrounding commodity layer is filling up with synthetic substitutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-29-J-future-of-writing-spectrum.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-29-J-future-of-writing-spectrum.svg" alt="Quality spectrum diagram showing Commodity, Novel Framing, Novel Research, Novel Insight." width="100" height="18.6046511627907"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eight Stages of Writing with AI
&lt;/h2&gt;

&lt;p&gt;Writers who are honest about it find themselves somewhere on a longer evolution path than the four-stage maturity models you’ll see floating around. &lt;a href="https://amplience.com/blog/the-four-stages-of-ai-maturity-which-level-is-your-business/" rel="noopener noreferrer"&gt;Amplience’s 4 As model&lt;/a&gt; (assistant to augmentation to automation to agentic) is a clean public framing of the same arc. It’s useful, particularly for commerce content teams. The path below is finer-grained, because the failures and breakthroughs that move a writer up the path usually happen in steps small enough to live inside a single “stage” of the four-stage view.&lt;/p&gt;

&lt;p&gt;The path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You write everything by hand.&lt;/strong&gt; AI is somewhere else, not in your workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You use AI as a thinking partner.&lt;/strong&gt; Brainstorming, challenging your draft, asking you questions, helping you re-organize. The writing is still yours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You use AI to write, and the voice is off.&lt;/strong&gt; This is the stage that hooks people. The drafts come fast and the drafts read like a chatbot wrote them. You either iterate prompts forever or get frustrated and quit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You learn to get your voice aligned, and now the sources are hallucinated.&lt;/strong&gt; The output sounds like you. It also contains stats, quotes, and citations that don’t exist. Stage four is the first time you have to think about systems, not just prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You learn to ground sources, and you start using AI for novel research.&lt;/strong&gt; Retrieval, validation, citation checking. The system now produces work you can publish without rewriting from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research drives the topic selection, not just the support.&lt;/strong&gt; Your input pipelines (search trends, your own analytics, customer signals) feed the system upstream of the writing. The system suggests what to write before you ask.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You train a model on your voice itself.&lt;/strong&gt; Fine-tuning or a small model that has read enough of your prior work to draft in a register no general-purpose model produces by default. The voice question stops being a per-draft fight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system interviews you, or interviews whoever it needs to.&lt;/strong&gt; The bottleneck stops being “get the AI to write what you’d write.” It becomes “get the source (you, an expert, or a customer) to surface what they actually know,” which the system captures and produces around.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most published content right now is being produced somewhere between stages two and five, with the more ambitious operations stretching into six. The pattern we tend to see is that most writers stuck at stage three try to fix stage-three problems forever, with better prompts, better personas, better instructions. The real move is usually up rather than sideways. If voice keeps coming out wrong, the answer isn’t a smarter prompt. It’s a structured pipeline that handles voice as a separate concern from drafting. That’s stage-five thinking applied to a stage-three problem. The problems at stage three don’t get solved at stage three; they dissolve when you move.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqn6le8uvbz10kht7fdn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqn6le8uvbz10kht7fdn.jpg" alt="A professional content strategist thoughtfully reviewing architectural workflow diagrams on a modern display." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Oil Painting to Photography Shift
&lt;/h2&gt;

&lt;p&gt;The historical parallel people reach for is that photography killed portrait painting. The actual record is more interesting.&lt;/p&gt;

&lt;p&gt;Historian Hans Rooseboom, looking at nineteenth-century Dutch painters, &lt;a href="https://daily.jstor.org/did-photography-really-kill-portrait-painting/" rel="noopener noreferrer"&gt;found only one report of a painter being displaced&lt;/a&gt; by the camera. He also found reports of an artistic revival, a resurgence of portrait work, and painters who used photography as a side gig, as a reference aid, and as a way to reproduce their own work for sale. The frame of “photography killed painting” doesn’t survive contact with the data. What actually happened: photography did what photography is good at, capturing a likeness fast and cheap, and painting kept what painting was good at, which had never really been likeness-capture in the first place.&lt;/p&gt;

&lt;p&gt;The same pattern is plausible here. Commodity content, the writing equivalent of a passable likeness, is moving into systems. That’s where the photography analogy lands. The work that was always more than likeness, the novel framing, the original research, the insight that comes from being in a specific seat at a specific moment, stays with the writer. Very often it gets sharper because the writer’s attention is no longer eaten by &lt;em&gt;commodification&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Photography also produced an entirely new craft that didn’t exist before: the photographer. The future of content writing has the same shape. The person who builds and runs the system that produces content is a new role, somewhere between a writer, an information architect, and an operator. We don’t have a clean job title for it yet. The closest analogue is the difference between cooking dinner and running a kitchen. Both involve food. Only one scales.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzggdxlexkesnc9h2bcwf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzggdxlexkesnc9h2bcwf.jpg" alt="An artist's traditional oil painting studio subtly blending and transforming into a modern digital creator's workspace." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Writers Become When the System Takes the Writing
&lt;/h2&gt;

&lt;p&gt;If a serious portion of commodity content production moves into systems, the obvious question is what the writers do. The less obvious and more useful answer is that the craft doesn’t disappear. It elevates.&lt;/p&gt;

&lt;p&gt;Three pillars stay with the human: &lt;strong&gt;expertise, relationships, and ownership.&lt;/strong&gt; Expertise is the thing the system needs as input: your read of the topic, your synthesis of public information, your judgment about what matters. Relationships are how you stay connected to the people who consume your work and the people who source it. Ownership is the call on the work and the decisions about what gets shipped and what doesn’t. Those three are not going into a system anytime soon, because they’re not production tasks; they’re judgment tasks.&lt;/p&gt;

&lt;p&gt;What moves into the system is the production: research gathering, citation validation, drafting, voice alignment, editing for length, formatting for the channel. Each of those is commodifiable on its own. Together, they’re most of what a content team’s time goes to today. Pull them into a system and the writer’s job changes shape, closer to a senior editor with a research assistant who never sleeps, or to a system designer with strong opinions about how the work should read.&lt;/p&gt;

&lt;p&gt;The writers we’ve watched move into system-running roles describe it more like compression than loss. The work they liked, the angle, the synthesis, the call on what to publish, gets a larger share of their week. The production grind gets handled. Whether the broader shift is good for the profession depends on who can make the move and on what terms, which is a real conversation worth having and which most “AI will or won’t replace writers” coverage skips.&lt;/p&gt;

&lt;p&gt;Build the system, do not compete with it. The hand-crafted-belt-maker who switches to designing the belt-making machine doesn’t lose the craft. They get more of it, applied at a different level. The window for making the move is before the production layer fully commoditizes, not after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Content Operations Are Going
&lt;/h2&gt;

&lt;p&gt;The end state of the eight-stage path isn’t a single tool. It’s a content operation that runs as a system, with the human handling guidance, source material, and taste. &lt;a href="https://www.siegemedia.com/strategy/ai-writing-statistics" rel="noopener noreferrer"&gt;According to Siege Media’s 2026 survey&lt;/a&gt;, 97% of content marketers plan to use AI to support content marketing in 2026, up from 90% in 2025 and 64.7% in 2023. The direction is settled. The question is how far each operation moves up the path, and how fast.&lt;/p&gt;

&lt;p&gt;What “moves up” looks like, concretely: research happens through retrieval against a curated source set rather than freeform model output. Voice gets handled through fine-tuning or a structured pipeline of voice and style passes rather than per-draft prompting. Editorial decisions come out of analytics signals that flow back into topic selection. Quality control runs as a series of validation gates rather than a single human reviewer reading every word.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9qh0ulebzpauoum33x1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9qh0ulebzpauoum33x1.jpg" alt="An ornate water fountain with cascading jets in a bright modern courtyard, water transforming into glowing digital butterflies." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future-writers
&lt;/h2&gt;

&lt;p&gt;The future of content writing isn’t a single future. It’s at least two. For commodity content, the volume layer that powers search visibility, AI recommendation, and basic enablement, the path is toward &lt;em&gt;systems&lt;/em&gt;. The writers who built careers on producing that work at hand-craft speed are looking at the steepest adjustment, and the most useful move is upstream into system design, editorial direction, or further up the quality spectrum.&lt;/p&gt;

&lt;p&gt;For content that lives further up the spectrum, novel framing, novel research, novel insight, AI &lt;em&gt;extends&lt;/em&gt; the writer rather than replacing them. The production grind gets handled. The angle, the read, the call on what to publish stays with the person. How that person gets from idea to finalized output will change, and writers will adapt, but the AI systems created will be in service of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Will AI replace content writers entirely?
&lt;/h3&gt;

&lt;p&gt;No, but it will replace a large share of commodity content production, the layer where the work is replaceable by any equivalent source. Writers producing novel framing, original research, or insight tied to their specific position will see AI systems extend their reach and the quality of their work and research rather than displace them. The honest version: the role is splitting, not vanishing. The writers most exposed are those whose output was always commoditizable; the ones least exposed are those whose value comes from judgment, relationships, strong differentiation, methodology, and ownership of a specific point of view.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is “AI slop” and how do you avoid producing it?
&lt;/h3&gt;

&lt;p&gt;Merriam-Webster defines slop as “digital content of low quality that is produced usually in quantity by means of artificial intelligence.” The term collapses several distinct problems: bad voice, hallucinated sources, no original framing, generic structure, weak audience fit. Avoiding it isn’t a prompting problem; it’s a pipeline-system problem. Single-prompt generation will keep producing slop forever. A structured workflow that separates research, voice alignment, source validation, and editorial review handles most of what people mean when they say “slop.” The fix is focusing on engineering your systems (the harness) rather than getting better at one prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I know which stage I’m at?
&lt;/h3&gt;

&lt;p&gt;Look at where your most consistent failures show up. If you’re producing all your content by hand and AI isn’t in the workflow, you’re at stage one. If you use AI for thinking and editing but write the drafts yourself, stage two. If you’re drafting with AI and fighting voice, stage three. If voice is solved but sources keep needing manual verification, stage four. If both are handled and you’re starting to drive topic selection from research signals, stages five to six. Past that, you’re in territory that requires either fine-tuning or a system that captures source material from interviews, which most operations aren’t running yet. The stages aren’t a status ranking. They’re a diagnostic for where to put the next month of work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is AI-generated content bad for SEO?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://developers.google.com/search/docs/fundamentals/using-gen-ai-content" rel="noopener noreferrer"&gt;Google’s guidance&lt;/a&gt; is that generative AI can be useful for research and structure, but that producing many pages without adding user value may violate scaled content abuse policy. In practice, content quality and originality matter more than authorship method. AI-assisted content that’s well-sourced, original in framing, and useful to readers tends to perform; AI-flooded content thin on substance tends to get filtered out by the search engine, by AI recommendation systems, and by readers. The risk isn’t AI use. The risk is shipping commodity volume with no editorial layer on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does a content production system actually look like in practice?
&lt;/h3&gt;

&lt;p&gt;Concretely: a series of stages with hand-offs, not a single prompt. Research runs first and gathers material against a curated source set. A story spine or outline gets generated and approved before drafting. The draft is written against the spine. A self-review pass checks voice and structure against a style guide. A deduplication pass checks the work against your existing library so you’re not repeating yourself across articles. An art-direction pass plans images. A final pass handles polish and validation. Each stage has a defined input and output. The system isn’t a model; it’s the steps around the model. &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-teams-business-operations/" rel="noopener noreferrer"&gt;Our walkthrough of running this in production&lt;/a&gt; covers the costs and the team shape.&lt;/p&gt;
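
&lt;p&gt;As a rough illustration of “stages with hand-offs,” here is a minimal Python sketch. The &lt;code&gt;Artifact&lt;/code&gt; fields, the stage names, and the stubbed stage bodies are all assumptions made for the example, not a description of any production pipeline; a real stage would call a model or a retrieval service where the placeholders are.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Artifact:
    """The hand-off object passed from one stage to the next (hypothetical shape)."""
    topic: str
    notes: list[str] = field(default_factory=list)
    spine: list[str] = field(default_factory=list)
    draft: str = ""
    issues: list[str] = field(default_factory=list)


Stage = Callable[[Artifact], Artifact]


def research(a: Artifact) -> Artifact:
    # Placeholder: a real stage would retrieve from a curated source set.
    a.notes = [f"source note about {a.topic}"]
    return a


def outline(a: Artifact) -> Artifact:
    # Placeholder spine; in practice this is generated and approved.
    a.spine = ["hook", "argument", "close"]
    return a


def write_draft(a: Artifact) -> Artifact:
    # Placeholder for the model call; the draft follows the spine.
    a.draft = " / ".join(f"{part}: {a.topic}" for part in a.spine)
    return a


def self_review(a: Artifact) -> Artifact:
    # Validation gate: record issues instead of silently passing work on.
    if not a.notes:
        a.issues.append("draft has no research behind it")
    if not a.draft:
        a.issues.append("empty draft")
    return a


def run_pipeline(topic: str, stages: list[Stage]) -> Artifact:
    artifact = Artifact(topic=topic)
    for stage in stages:
        artifact = stage(artifact)
        if artifact.issues:
            # Stop at the first failed gate so a human can step in.
            break
    return artifact


result = run_pipeline("agent harnesses", [research, outline, write_draft, self_review])
print(result.draft)
```

&lt;p&gt;The point of the shape is that each stage can be tested, swapped, or gated on its own, which a single prompt does not allow.&lt;/p&gt;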

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fountaincity.tech/resources/blog/ai-agent-teams-business-operations/" rel="noopener noreferrer"&gt;AI Agent Teams for Business Operations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fountaincity.tech/resources/blog/inside-autonomous-ai-content-pipeline/" rel="noopener noreferrer"&gt;Inside Our Autonomous AI Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fountaincity.tech/resources/blog/future-digital-agencies/" rel="noopener noreferrer"&gt;Future of Digital Agencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fountaincity.tech/resources/blog/ai-readiness-evaluation/" rel="noopener noreferrer"&gt;AI Readiness Evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fountaincity.tech/resources/blog/making-your-business-visible-to-ai-a-strategic-guide-to-appearing-in-ai-recommendations/" rel="noopener noreferrer"&gt;Making Your Business Visible to AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>contentmarketing</category>
      <category>automation</category>
      <category>agents</category>
    </item>
    <item>
      <title>Anatomy of an Agent Harness: 7 Components You Should Audit</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Thu, 28 May 2026 00:44:02 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/anatomy-of-an-agent-harness-7-components-you-should-audit-4nfk</link>
      <guid>https://dev.to/sebastian_chedal/anatomy-of-an-agent-harness-7-components-you-should-audit-4nfk</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-26-J-anatomy-of-an-agent-harness-hero.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-26-J-anatomy-of-an-agent-harness-hero.svg" alt="7-component anatomy ring diagram — model at the center, execution sandbox, identity, memory/context, tool calls, orchestration, cost governance, observability" width="100" height="52.333333333333336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You’re past the pilot. The agent works in demos and probably in staging, and now somebody is asking the real buying question: will it hold up when nobody is watching? That question doesn’t resolve at the model layer. It resolves in the layer of code, configuration, and execution logic that sits around the model, what the industry has started calling the &lt;em&gt;harness&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Every agent harness has seven components. We arrived at these seven after reviewing eight articles our peers published between March and April 2026: &lt;a href="https://www.langchain.com/blog/the-anatomy-of-an-agent-harness" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;, &lt;a href="https://www.salesforce.com/agentforce/ai-agents/agent-harness/" rel="noopener noreferrer"&gt;Salesforce&lt;/a&gt;, &lt;a href="https://www.firecrawl.dev/blog/what-is-an-agent-harness" rel="noopener noreferrer"&gt;Firecrawl&lt;/a&gt;, &lt;a href="https://atlan.com/know/what-is-harness-engineering/" rel="noopener noreferrer"&gt;Atlan&lt;/a&gt;, &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;Fowler &amp;amp; Boeckeler&lt;/a&gt;, &lt;a href="https://addyosmani.com/blog/agent-harness-engineering/" rel="noopener noreferrer"&gt;Osmani&lt;/a&gt;, &lt;a href="https://www.philschmid.de/agent-harness-2026" rel="noopener noreferrer"&gt;Schmid&lt;/a&gt;, and &lt;a href="https://handsonarchitects.com/blog/2026/the-harness-model-ai-engineering-maturity-matrix/" rel="noopener noreferrer"&gt;Hands on Architects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By the end of this article I hope you will have a deeper understanding of what a harness is, what it does, the different ways it can fail, and how to assess the quality of a harness, whether it is one you already run or one you are shopping for from a vendor.&lt;/p&gt;

&lt;p&gt;There’s a lot of interest in this topic right now: &lt;a href="https://nathanbenaich.substack.com/p/state-of-ai-april-2026-newsletter" rel="noopener noreferrer"&gt;Anthropic’s annualized revenue&lt;/a&gt; grew from $14B in mid-February to over $30B by April 2026. The market is &lt;em&gt;buying&lt;/em&gt;. What it’s buying is &lt;strong&gt;models&lt;/strong&gt;. But what’s deciding whether those models earn their keep in production is the &lt;strong&gt;harness layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One clarification before the components, because the word &lt;em&gt;agent&lt;/em&gt; is doing too much work in 2026. When we say “agent” here, we mean model plus harness running self-directed work, not workflow-LLM patterns where every step is human-scheduled. A workflow with an embedded LLM call needs prompt management and an error handler; an agent doing self-directed work needs the entire harness, which is what we are discussing in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  The seven components of the harness
&lt;/h2&gt;

&lt;p&gt;The eight sources above name different subsets of components. Synthesizing where they agree comes down to these seven:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;How it can fail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Execution Sandbox&lt;/td&gt;
&lt;td&gt;What the agent runs as and its permissions.&lt;/td&gt;
&lt;td&gt;Broad permissions + long-horizon agent creates outsized risk radius.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth Identity&lt;/td&gt;
&lt;td&gt;Who the agent is to external systems.&lt;/td&gt;
&lt;td&gt;Shared API keys prevent auditing; child agents break revocation chains.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory &amp;amp; Context&lt;/td&gt;
&lt;td&gt;What persists, what compacts, what discards.&lt;/td&gt;
&lt;td&gt;Uncompacted context growth leaks cost; no garbage collection.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Calls&lt;/td&gt;
&lt;td&gt;How the agent interacts with and reaches the world.&lt;/td&gt;
&lt;td&gt;Transient tool failures trigger runaway retry storms.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Single-agent loop vs multi-agent handoff, and who owns state.&lt;/td&gt;
&lt;td&gt;Multiple agents conflict over stale views of unowned shared state.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Governance&lt;/td&gt;
&lt;td&gt;What stops a runaway charge before the credit card bill tells you.&lt;/td&gt;
&lt;td&gt;Lack of pre-flight circuit breakers allows sudden, massive token spend.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;What you can answer the morning after.&lt;/td&gt;
&lt;td&gt;Logs confirm a failure occurred but lack structure to explain why. Or no logs at all.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Component 1: Execution sandbox
&lt;/h2&gt;

&lt;p&gt;Your execution sandbox decides what the agent runs as, where it runs, and what it can reach: filesystem, network, processes, databases, and infrastructure. The decision is your risk radius, and it has to be made before deploy because retrofitting sandboxing later is rip-and-replace work.&lt;/p&gt;

&lt;p&gt;The architectural choices fall along a spectrum: container-level isolation, process-level isolation, OS-level isolation, or hardware-level isolation with policy engines on top. See, for example, &lt;a href="https://fountaincity.tech/resources/blog/nemoclaw-enterprise-autonomous-agents/" rel="noopener noreferrer"&gt;NVIDIA’s NemoClaw approach&lt;/a&gt; with its OpenShell and scoped permissions.&lt;/p&gt;

&lt;p&gt;The clearest recent worked example sits in the &lt;a href="https://incidentdatabase.ai/cite/1442/" rel="noopener noreferrer"&gt;AI Incident Database, citation 1442&lt;/a&gt;: in mid-December 2025, AWS Cost Explorer in one mainland China region reportedly had an approximately 13-hour interruption after Kiro, an internal Amazon AI coding tool, was reportedly allowed to delete and recreate part of the working environment. Amazon disputed the AI-causation account and attributed the issue to user error and misconfigured access controls. Under either account, the root of the issue is the same: the permissions available inside the agent’s environment were too broad for the work it was doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; What can this agent do that I’d be unwilling to let a junior engineer do on day one, and what stops it from doing that?&lt;/p&gt;
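
&lt;p&gt;One way to make the risk radius explicit is a deny-by-default policy check in front of every filesystem action. The sketch below is only that: the &lt;code&gt;POLICY&lt;/code&gt; table, the action names, and the directory roots are hypothetical, and an application-level check like this complements container or OS isolation rather than replacing it.&lt;/p&gt;

```python
from pathlib import PurePosixPath

# Hypothetical policy: each action maps to the directory roots it may touch.
POLICY = {
    "read": ("/workspace", "/reference"),
    "write": ("/workspace/output",),
    # "delete" is deliberately absent, so it is denied everywhere.
}


def is_permitted(action: str, path: str, policy: dict = POLICY) -> bool:
    """Deny by default: allow only listed actions under listed roots."""
    target = PurePosixPath(path)
    if not target.is_absolute() or ".." in target.parts:
        # Reject relative paths and parent-directory traversal outright.
        return False
    roots = policy.get(action, ())
    return any(target.is_relative_to(root) for root in roots)


print(is_permitted("write", "/workspace/output/report.md"))
print(is_permitted("delete", "/workspace/output/report.md"))
```

&lt;p&gt;Listing what is allowed, rather than what is forbidden, means a new capability has to be granted on purpose instead of being discovered after an incident.&lt;/p&gt;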

&lt;h2&gt;
  
  
  Component 2: Identity and authentication
&lt;/h2&gt;

&lt;p&gt;Identity and authentication answers a question most teams skip in the rush to ship: who is the agent? More practically, who is it to external systems, what credentials does it carry, and what’s its audit trail when it acts? The decision is whether to give each agent a dedicated service account with scoped permissions, run it under a shared API key, or impersonate a human user.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.gravitee.io/blog/state-of-ai-agent-security-2026-report-when-adoption-outpaces-control" rel="noopener noreferrer"&gt;Gravitee 2026 State of AI Agent Security report&lt;/a&gt; is the cleanest 2026 data on what production teams are actually doing here. The picture is sobering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only &lt;strong&gt;21.9%&lt;/strong&gt; of teams treat AI agents as independent, identity-bearing entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;45.6%&lt;/strong&gt; rely on shared API keys for agent-to-agent authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;25.5%&lt;/strong&gt; of deployed agents can create and task another agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When an agent on a shared key spawns child agents and one of them does something costly, the chain of command becomes harder to control and audit. The failure pattern to watch for here is the combination (shared key, plus multi-agent, plus the ability to spawn), not any single decision. In our experience, this tends to be the component teams promise to “fix later,” only to discover that “later” means after a costly incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; If this agent did something costly in the next hour, could I tell which agent did it, and could I revoke just that agent’s access without breaking the others? How do I manage sub-agent spawning? Are my keys shared too broadly across multiple agents or systems?&lt;/p&gt;
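
&lt;p&gt;To show what “revoke just that agent” can mean in code, here is a small in-memory sketch. The &lt;code&gt;IdentityRegistry&lt;/code&gt; class, its method names, and the scope strings are assumptions for illustration; a real deployment would lean on an identity provider or secrets manager rather than a Python dictionary.&lt;/p&gt;

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentIdentity:
    agent_id: str
    scopes: frozenset
    parent_id: Optional[str] = None
    token: str = field(default_factory=lambda: secrets.token_hex(16))
    revoked: bool = False


class IdentityRegistry:
    """Hypothetical registry: one credential per agent, never a shared key."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentIdentity] = {}

    def register(self, agent_id, scopes, parent_id=None) -> AgentIdentity:
        scopes = frozenset(scopes)
        if parent_id is not None:
            # A child may never hold scopes its parent lacks.
            scopes = scopes.intersection(self._agents[parent_id].scopes)
        identity = AgentIdentity(agent_id, scopes, parent_id)
        self._agents[agent_id] = identity
        return identity

    def revoke(self, agent_id: str) -> None:
        """Revoke one agent and everything it spawned; others are untouched."""
        self._agents[agent_id].revoked = True
        for other in list(self._agents.values()):
            if other.parent_id == agent_id and not other.revoked:
                self.revoke(other.agent_id)

    def is_allowed(self, token: str, scope: str) -> bool:
        for identity in self._agents.values():
            if secrets.compare_digest(identity.token, token):
                return not identity.revoked and scope in identity.scopes
        return False


registry = IdentityRegistry()
researcher = registry.register("researcher", {"search", "read"})
helper = registry.register("helper", {"read", "write"}, parent_id="researcher")
publisher = registry.register("publisher", {"write"})

registry.revoke("researcher")  # also revokes "helper", not "publisher"
```

&lt;p&gt;Because every token maps to exactly one agent and every child records its parent, the audit question and the revocation question both have a direct answer.&lt;/p&gt;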

&lt;h2&gt;
  
  
  Component 3: Memory and context
&lt;/h2&gt;

&lt;p&gt;Memory and context describes what persists across runs, what gets compacted into smaller representations, and what gets discarded. Context-rot and compaction are first-class harness primitives. From our experience operating memory and context controls: the coupling between memory and cost tends to be tighter than either treatment suggests; we’ll get to that in Component 6.&lt;/p&gt;

&lt;p&gt;A strong harness here requires that you answer the question of what stores state (vector retrieval, structured state, a hybrid). It also needs token discipline at the prompt-construction layer, deciding what gets included on each turn, and a compaction policy, deciding when long histories collapse into summaries. Your context window is your “RAM”, and a harness with no compaction policy is a process that never frees memory.&lt;/p&gt;

&lt;p&gt;A failure here can be hard to spot: the agent still gives mostly correct answers, but each turn pulls more context than the last, per-task spend grows, and accuracy slowly declines. The architectural fix sits in the memory layer, which is where teams typically look last because the agent is still “working.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; Where does this agent’s state actually live, and how am I managing memory and context across my agent network?&lt;/p&gt;
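
&lt;p&gt;A compaction policy can be as simple as “summarize everything but the last few turns once the history passes a token budget.” The sketch below assumes a crude four-characters-per-token estimate and a stubbed &lt;code&gt;summarize&lt;/code&gt; function; both are stand-ins, and the function names are hypothetical.&lt;/p&gt;

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about four characters per token.
    return max(1, len(text) // 4)


def summarize(messages: list[str]) -> str:
    # Placeholder: a real harness would call a model or rule-based summarizer.
    return f"[summary of {len(messages)} earlier messages]"


def compact(history: list[str], budget: int, keep_recent: int = 4) -> list[str]:
    """Collapse older messages into one summary once history exceeds the budget."""
    total = sum(estimate_tokens(message) for message in history)
    if budget >= total or keep_recent >= len(history):
        # Under budget, or nothing old enough to collapse: leave history alone.
        return history
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


history = ["x" * 400 for _ in range(10)]  # ten messages, about 100 tokens each
compacted = compact(history, budget=500)
print(len(history), len(compacted))
```

&lt;p&gt;Running the policy on every turn, rather than when someone notices the bill, is what keeps context growth from becoming a cost leak.&lt;/p&gt;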

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbwlo53gl74txfju990c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbwlo53gl74txfju990c.jpg" alt="Abstract plexus visualization of agent memory and context" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Component 4: Tool calls
&lt;/h2&gt;

&lt;p&gt;Tool calls covers how the agent reaches the world: the tool registry, the calling protocol, the error-recovery behavior. The decision is whether tools are exposed via an MCP-native registry, hand-wrapped APIs maintained internally, or framework-bundled tool packs you don’t control. The MCP server ecosystem expanded rapidly through early 2026, and most teams we work with end up with a mix of all three.&lt;/p&gt;

&lt;p&gt;A serious risk with tools is a retry storm. This is when the agent calls a tool, the call fails transiently (a rate limit, a 503, a malformed response), and the harness has no policy distinguishing retryable from non-retryable failure modes. So the agent retries. And retries. And retries. The cost shows up before the alert does, and the upstream tool sometimes degrades further under the retry pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; How should my tools be built and called? When this agent calls a tool and the call fails, what does it do, and what stops it from doing the same failed call 50 more times?&lt;/p&gt;
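
&lt;p&gt;A retry policy that prevents the storm needs two things: a distinction between retryable and non-retryable failures, and a hard cap on attempts with backoff in between. Here is a minimal sketch using exponential backoff with jitter. The exception classes, the function name, and the default limits are assumptions chosen for the example.&lt;/p&gt;

```python
import random
import time


class RetryableError(Exception):
    """Transient failure, e.g. a rate limit or a 503."""


class FatalError(Exception):
    """Failure a retry cannot fix, e.g. invalid arguments or bad credentials."""


def call_with_retries(tool, *, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call tool(); retry only transient failures, and only a bounded number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except FatalError:
            # Retrying cannot help, so surface the error immediately.
            raise
        except RetryableError:
            if attempt == max_attempts:
                # Budget exhausted: give up instead of hammering the upstream.
                raise
            # Exponential backoff with jitter, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


calls = {"count": 0}


def flaky_tool():
    # Hypothetical tool that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] in (1, 2):
        raise RetryableError("503 from upstream")
    return "ok"


result = call_with_retries(flaky_tool, sleep=lambda seconds: None)
print(result, calls["count"])
```

&lt;p&gt;The &lt;code&gt;sleep&lt;/code&gt; parameter is injectable so the policy can be tested without waiting; in production the default &lt;code&gt;time.sleep&lt;/code&gt; applies.&lt;/p&gt;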

&lt;h2&gt;
  
  
  Component 5: Orchestration
&lt;/h2&gt;

&lt;p&gt;Orchestration answers whether you have one agent in a loop or several agents handing off to each other, and whether the work is event-driven or scheduled. The load-bearing decision underneath is shared-state ownership: is there one canonical source of truth (a file, a database, a queue) that agents read and write through, or is state implicit and distributed across the agents themselves?&lt;/p&gt;

&lt;p&gt;Multi-agent systems that fail in production tend to fail here. Two agents act on stale views of the same state, the merge logic was never specified, and the bug is invisible until it’s expensive. Anthropic’s published work on multi-agent research systems is a useful reference for what production adds to the orchestrator-subagent pattern; we covered that ground in &lt;a href="https://fountaincity.tech/resources/blog/anthropic-multi-agent-blueprint-production/" rel="noopener noreferrer"&gt;our take on the multi-agent blueprint&lt;/a&gt;, which gets into the token-cost tradeoff for orchestration specifically.&lt;/p&gt;

&lt;p&gt;The orchestration component is also where “we’ll just add another agent” tends to become technical debt. Each agent you add multiplies the number of state transitions you have to reason about, and if the system isn’t built around an explicit state owner from the start, the debt compounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; If two agents disagree about what’s true, which one wins, and how do I know? Can misinterpretations from one agent carry forward down the chain to other agents? How are handoffs done between agents?&lt;/p&gt;
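
&lt;p&gt;One common way to give shared state an explicit owner is a versioned store with compare-and-set writes: an agent must say which version it read, and a write based on a stale version is rejected rather than silently merged. The sketch below is an in-memory illustration; &lt;code&gt;StateStore&lt;/code&gt; and &lt;code&gt;StaleWriteError&lt;/code&gt; are hypothetical names, and a production system would typically get the same guarantee from a database or queue.&lt;/p&gt;

```python
import threading


class StaleWriteError(Exception):
    """Raised when a writer acted on an outdated view of the state."""


class StateStore:
    """Single canonical owner of shared state, with versioned writes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: dict = {}
        self._version = 0

    def read(self) -> tuple:
        with self._lock:
            # Return a copy so callers cannot mutate state behind the store's back.
            return self._version, dict(self._value)

    def write(self, expected_version: int, updates: dict) -> int:
        with self._lock:
            if expected_version != self._version:
                raise StaleWriteError(
                    f"read version {expected_version}, current is {self._version}"
                )
            self._value.update(updates)
            self._version += 1
            return self._version


store = StateStore()
version_a, _ = store.read()  # agent A reads
version_b, _ = store.read()  # agent B reads the same version
store.write(version_a, {"status": "drafted"})  # A wins
try:
    store.write(version_b, {"status": "rejected"})  # B is now stale
    conflict = False
except StaleWriteError:
    conflict = True
print(conflict, store.read())
```

&lt;p&gt;The rejected writer has to re-read and decide again, which turns an invisible stale-view bug into a visible, loggable event.&lt;/p&gt;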

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck6hpyqktvwn43xji62y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck6hpyqktvwn43xji62y.jpg" alt="Professional team reviewing AI architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Component 6: Cost governance
&lt;/h2&gt;

&lt;p&gt;Cost governance covers what stops a runaway: token budgets, rate limits, kill switches, spend caps, pre-flight budget enforcement. Cost governance is the second half of the architectural pairing we flagged in Component 3. Bad memory designs leak cost; cost circuit breakers can’t fix poor context discipline. They can only cap the downside while the upstream architecture is fixed. We’ve written about how the optimization sequence actually plays out (&lt;a href="https://fountaincity.tech/resources/blog/ai-cost-optimization-practitioner-framework/" rel="noopener noreferrer"&gt;script-first, caching last&lt;/a&gt;), and the same logic applies here: governance lives at the harness layer, not at the dashboard layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; What’s the maximum spend (in dollars and in irreversible state changes) this agent can incur in the next time interval (hour, day) without human approval?&lt;/p&gt;
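
&lt;p&gt;Pre-flight enforcement means the check runs before the model or tool call, not after the invoice. A minimal sketch, assuming the harness can estimate a call’s cost up front; the &lt;code&gt;BudgetGuard&lt;/code&gt; class and its method names are hypothetical, and amounts are kept in integer cents to avoid floating-point drift.&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    """Raised before a call that would push spend past the limit."""


class BudgetGuard:
    """Hypothetical per-run spend cap, tracked in integer cents."""

    def __init__(self, limit_cents: int) -> None:
        self.limit_cents = limit_cents
        self.spent_cents = 0

    def authorize(self, estimated_cents: int) -> None:
        # Pre-flight check: refuse the call before any money is spent.
        if self.spent_cents + estimated_cents > self.limit_cents:
            raise BudgetExceeded(
                f"{self.spent_cents} spent + {estimated_cents} estimated "
                f"exceeds limit of {self.limit_cents} cents"
            )

    def record(self, actual_cents: int) -> None:
        # Record what the call really cost, which may differ from the estimate.
        self.spent_cents += actual_cents


guard = BudgetGuard(limit_cents=500)
guard.authorize(300)   # allowed: 0 + 300 is within 500
guard.record(320)      # the call cost slightly more than estimated
try:
    guard.authorize(200)   # 320 + 200 would exceed 500
    blocked = False
except BudgetExceeded:
    blocked = True
print(blocked, guard.spent_cents)
```

&lt;p&gt;A real guard would also reset on a time window (per hour or per day) and route a blocked call to human approval; both are left out here to keep the core check visible.&lt;/p&gt;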

&lt;h2&gt;
  
  
  Component 7: Observability
&lt;/h2&gt;

&lt;p&gt;Observability is what you can answer the morning after. Structured event logs, traces, cost and latency metering, decision audit trails. Observability quality tends to decide how fast you can recover from anything that goes wrong in the other six components.&lt;/p&gt;

&lt;p&gt;The architectural decision is whether to emit structured event logs at the harness layer (queryable later), scrape ad-hoc logs from individual agents (slower, lossy), or rely on vendor-provided dashboards (good for some questions, bad for the questions you didn’t anticipate). The trade-offs and what they look like at each deployment stage are the topic of &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-deployment-operational-decisions/" rel="noopener noreferrer"&gt;our piece on operational decisions at each deployment stage&lt;/a&gt;. The three-monitoring-layers question, in particular, lives in this component.&lt;/p&gt;

&lt;p&gt;Your risk here: something goes wrong overnight, and the team can answer &lt;em&gt;that&lt;/em&gt; something went wrong (the bill, the alert) but not &lt;em&gt;why&lt;/em&gt;. The runbook says check the logs; the logs were never structured to answer this kind of question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; What can I answer about what this agent did yesterday, and how long does the answer take to produce? How quickly do we get notified for issues?&lt;/p&gt;
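
&lt;p&gt;“Structured” in practice usually means one machine-readable record per event, with the same identifying fields every time. The sketch below emits JSON lines and then answers one morning-after question from them. The field names (&lt;code&gt;run_id&lt;/code&gt;, &lt;code&gt;agent_id&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;, &lt;code&gt;cost_cents&lt;/code&gt;) and both helper functions are assumptions for the example, not a standard schema.&lt;/p&gt;

```python
import json
import time
import uuid


def make_event(run_id: str, agent_id: str, kind: str, **fields) -> str:
    """Serialize one harness event as a single JSON line."""
    event = {"ts": time.time(), "run_id": run_id, "agent_id": agent_id,
             "kind": kind, **fields}
    return json.dumps(event, sort_keys=True)


def total_cost(lines, agent_id: str) -> int:
    """Example morning-after query: spend for one agent, in cents."""
    events = (json.loads(line) for line in lines)
    return sum(e.get("cost_cents", 0) for e in events if e["agent_id"] == agent_id)


def failures(lines) -> list:
    """Second example query: every event that recorded a failure."""
    events = (json.loads(line) for line in lines)
    return [e for e in events if e.get("ok") is False]


run_id = str(uuid.uuid4())
log = [
    make_event(run_id, "researcher", "tool_call", tool="search", cost_cents=3, ok=True),
    make_event(run_id, "researcher", "tool_call", tool="search", cost_cents=4,
               ok=False, error="503"),
    make_event(run_id, "writer", "model_call", cost_cents=12, ok=True),
]
print(total_cost(log, "researcher"), len(failures(log)))
```

&lt;p&gt;Because every line carries the run and agent identifiers, questions nobody anticipated can still be answered with a filter instead of a forensic reconstruction.&lt;/p&gt;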

&lt;h2&gt;
  
  
  The 30-minute audit checklist
&lt;/h2&gt;

&lt;p&gt;These seven questions can help you prevent the failure patterns while getting your harness ready for production:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-26-J-anatomy-of-an-agent-harness-checklist.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-26-J-anatomy-of-an-agent-harness-checklist.svg" alt="7-question vertical checklist for AI agent harness components" width="100" height="105.0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution sandbox.&lt;/strong&gt; What can this agent do that I’d be unwilling to let a junior engineer do on day one, and what stops it from doing that?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity and authentication.&lt;/strong&gt; If this agent did something costly in the next hour, could I tell which agent did it, and could I revoke just that agent’s access without breaking the others?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory and context.&lt;/strong&gt; Where does this agent’s state actually live, and what tells me when it’s growing in a way it shouldn’t?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls.&lt;/strong&gt; When this agent calls a tool and the call fails, what does it do, and what stops it from doing the same failed call 50 more times?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration.&lt;/strong&gt; If two agents disagree about what’s true, which one wins, and how do I know?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost governance.&lt;/strong&gt; What’s the maximum spend (in dollars and in irreversible state changes) this agent can incur in the next hour without human approval?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability.&lt;/strong&gt; What can I answer about what this agent did yesterday, and how long does the answer take to produce?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A vendor demo or an internal architecture review that gets clean, specific answers has made a good start at designing an effective harness layer. A demo where two or three answers turn into “we’re planning to add that” is a system where the production-readiness work hasn’t been done yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;A model with a bad harness is just a brain in a jar. You need a solid harness to give your agent the eyes, ears, and supporting systems it needs to operate effectively in your business environment. We hope these questions give you a head start, whether you are evaluating your own progress or a vendor’s when selecting your next partner to help you build your agentic applications.&lt;/p&gt;

&lt;p&gt;If running the harness layer yourself isn’t where you want to spend your time, we build and operate agentic systems for clients. You can learn more about our &lt;a href="https://fountaincity.tech/services/managed-autonomous-ai-agents/" rel="noopener noreferrer"&gt;managed autonomous AI agents&lt;/a&gt;, or &lt;a href="https://dev.to/contact/"&gt;contact us&lt;/a&gt; to find out more.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is an agent harness in AI?
&lt;/h3&gt;

&lt;p&gt;An agent harness is the layer of software that surrounds the model. LangChain’s Vivek Trivedy describes it as “every piece of code, configuration, and execution logic that isn’t the model itself.” The model is the reasoning core; the harness is the operational software around it that handles tools, memory, identity, sandboxing, orchestration, cost controls, and observability. In production agent systems, the harness tends to determine whether the model’s output translates into reliable work.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between an agent harness and an agent framework?
&lt;/h3&gt;

&lt;p&gt;An agent framework (LangChain, LangGraph, AutoGen, CrewAI, and similar) is a library that gives you primitives for building agents: chains, tool-calling abstractions, memory interfaces. A harness is the integrated runtime that sits around the model in production, including everything the framework provides plus the things frameworks don’t: sandbox policies, identity boundaries, cost governors, observability pipelines. Firecrawl’s April 2026 piece draws this distinction clearly: a framework helps you build; a harness is what runs the result.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the components of an agent harness?
&lt;/h3&gt;

&lt;p&gt;The union view across the eight major published definitions consolidates into seven components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution sandbox:&lt;/strong&gt; where it runs, with what access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity and authentication:&lt;/strong&gt; who it is to external systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory and context:&lt;/strong&gt; what persists and what compacts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls:&lt;/strong&gt; how it reaches the world&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; single-agent loop vs multi-agent handoff, and who owns state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost governance:&lt;/strong&gt; what stops a runaway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; what you can answer the morning after&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why does the harness matter more than the model?
&lt;/h3&gt;

&lt;p&gt;Through early 2026, eight major publishers (LangChain, Salesforce, Firecrawl, Atlan, Fowler and Boeckeler, Osmani, Schmid, Hands on Architects) independently shipped harness-definition pieces, a sign of convergence on the harness as the decisive layer for production reliability. The model handles reasoning; the harness handles whether that reasoning translates into reliable work.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I evaluate whether an agent system is production-ready?
&lt;/h3&gt;

&lt;p&gt;The 30-minute checklist above is the short version: seven operator questions, one per component. A system that answers all seven cleanly has been architected through the harness layer. A system that slides into “we’re adding that” on two or three components has work ahead, and the production-readiness timeline is probably longer than the demo suggests. The Gravitee 2026 report found 21.9% of teams treating agents as identity-bearing entities, which is a useful sanity check on what “ready” looks like across the field. Most production systems still have meaningful gaps, and naming them honestly is more useful than papering over them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>GEO Measurement: The KPIs That Generate Actual Results (Not just vanity metrics)</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Sat, 23 May 2026 10:52:04 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/geo-measurement-the-kpis-that-generate-actual-results-not-just-vanity-metrics-3hk5</link>
      <guid>https://dev.to/sebastian_chedal/geo-measurement-the-kpis-that-generate-actual-results-not-just-vanity-metrics-3hk5</guid>
      <description>&lt;p&gt;The dominant question in generative engine optimization right now is whether your brand shows up in AI answers. The harder, more useful question is whether the AI &lt;em&gt;recommends&lt;/em&gt; you when a buyer asks the comparison prompt that ends the decision. Those two outcomes are decoupled. The same AI conversation can pull a quote from your site and then, in the next breath, recommend a competitor to the same user.&lt;/p&gt;

&lt;p&gt;That gap between being cited and being recommended is what the published GEO measurement frameworks tend to overlook. They count citations, average them across engines, and report a single “visibility score.” All three moves erase the signal you actually need.&lt;/p&gt;

&lt;p&gt;I believe this is a leftover from SEO, where getting cited was enough because people would then click through from the search results. With GEO, your customer has the entire conversation with the AI and goes through every funnel stage off-site. They learn about their problem or need, compare competitors, and ultimately select a vendor without ever leaving the chat window.&lt;/p&gt;

&lt;p&gt;Getting cited isn’t enough. You need to be &lt;em&gt;recommended&lt;/em&gt; by the AI as the best, or at least one of the best, options in class.&lt;/p&gt;

&lt;h2&gt;
  
  
  The measurement gap is a targeting problem, not a tooling problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cpbwwpm6jv9u0woyby6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cpbwwpm6jv9u0woyby6.jpg" alt="Marketing executive analyzing a digital dashboard for GEO recommendations" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As if the gap between being cited and being recommended weren’t enough, 62% of marketing leaders say they cannot measure the ROI of their AI search optimization efforts, according to a &lt;a href="https://www.gen-optima.com/geo/how-to-measure-geo-roi-kpi-framework-2026/" rel="noopener noreferrer"&gt;2025 Conductor survey, reported via GenOptima&lt;/a&gt;. The default reading of that number is that the field is under-tooled, and that better dashboards or more granular tracking would close the gap.&lt;/p&gt;

&lt;p&gt;The problem runs deeper than tooling. The published frameworks are not failing to measure enough things; they are mostly measuring the wrong outcome.&lt;/p&gt;

&lt;p&gt;The leading guides (GenOptima’s six-KPI framework, UpGrowth’s seven KPIs, Stellar’s three-tier model, Digital Bloom’s ROI procedure) each capture a real piece of the measurement stack. Read together, they recommend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Citations&lt;/li&gt;
&lt;li&gt;Mentions&lt;/li&gt;
&lt;li&gt;Sentiment&lt;/li&gt;
&lt;li&gt;Share of voice&lt;/li&gt;
&lt;li&gt;Position&lt;/li&gt;
&lt;li&gt;Source coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and a half-dozen named composites. What none of them resolve is the distance between an AI answer that quotes you, and an AI answer that recommends you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Being cited is not being recommended — and the AI knows the difference
&lt;/h2&gt;

&lt;p&gt;A buyer asks Claude: “What’s the best AI consulting firm for mid-market manufacturers in the Pacific Northwest?” The answer pulls a quote about regional manufacturing trends from your blog post. The same answer, two sentences later, recommends three competitors as the firms to actually contact.&lt;/p&gt;

&lt;p&gt;You got the citation. You did &lt;em&gt;not&lt;/em&gt; get the recommendation. The user closes the tab and starts emailing your competitors.&lt;/p&gt;

&lt;p&gt;This pattern is more common than the citation-counting frameworks acknowledge. Citation and recommendation are decoupled outcomes — they are produced by different parts of the AI’s reasoning, draw from different signals on your site, and respond to different optimizations. Most published frameworks treat citation rate as the headline KPI. It is a leading indicator at best. It tells you the AI &lt;em&gt;knows&lt;/em&gt; something about you. It does not tell you the AI &lt;em&gt;picks&lt;/em&gt; you when the question gets to the comparison stage.&lt;/p&gt;

&lt;p&gt;The right primary KPI is recommendation rate at buyer-intent prompts. Not “does the engine mention your brand somewhere in a 600-word answer about the industry” but “does the engine name you when a real buyer asks the question that ends in a purchase decision.” That requires building a prompt set that mirrors the comparison questions your buyers actually ask — not the head terms you would target in traditional SEO, and not the broad industry queries that produce friendly mentions without conversion intent.&lt;/p&gt;

&lt;p&gt;A useful working definition: track recommendation rate as the percentage of buyer-intent prompts in which an AI engine names your brand as a recommended option (not merely cites a source from your domain). Measure per engine, across a stable prompt set you can re-run monthly. For teams running thought-leadership programs that aim higher up the funnel, the same measurement works at the awareness stage — “what should I read about ____” prompts where the recommendation is to subscribe, watch, or follow rather than to buy. The mechanic is the same; the prompt set changes.&lt;/p&gt;
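
&lt;p&gt;As a minimal sketch of that definition, assuming you hand-label each answer as “recommended” or “not recommended”, the calculation is a per-engine ratio. The function name, the tuple layout, and the sample rows below are illustrative assumptions, not part of any GEO tool.&lt;/p&gt;

```python
from collections import defaultdict


def recommendation_rates(results):
    """Return {engine: share of buyer-intent prompts where the brand was recommended}.

    `results` is a list of (engine, prompt_id, recommended) tuples, where
    `recommended` is True only if the answer named the brand as an option
    to contact or buy from. A bare citation counts as False.
    """
    asked = defaultdict(int)
    recommended = defaultdict(int)
    for engine, _prompt_id, was_recommended in results:
        asked[engine] += 1
        if was_recommended:
            recommended[engine] += 1
    return {engine: recommended[engine] / asked[engine] for engine in asked}


# Hypothetical hand-labeled run: two engines, four buyer-intent prompts each.
sample = [
    ("perplexity", "p1", True), ("perplexity", "p2", True),
    ("perplexity", "p3", False), ("perplexity", "p4", False),
    ("chatgpt", "p1", False), ("chatgpt", "p2", False),
    ("chatgpt", "p3", False), ("chatgpt", "p4", True),
]
rates = recommendation_rates(sample)
print(rates)
```

&lt;p&gt;Keeping the prompt IDs stable between runs is what makes the monthly comparison meaningful; the labeling step is the part that needs human judgment.&lt;/p&gt;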

&lt;p&gt;Citation rate still matters as a leading indicator. It usually predicts which brands will eventually become recommendation candidates. But reporting citation rate without recommendation rate is reporting the dress rehearsal as if it were opening night.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-engine spread is the load-bearing KPI — aggregate scores lie
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.tryprofound.com/blog/citation-overlap-strategy" rel="noopener noreferrer"&gt;Profound’s analysis of 100,000 prompts&lt;/a&gt; across ChatGPT and Perplexity found that 89% of AI citations come from different sources depending on which engine the user queried, and only 11.0% of domain citations appeared in both models.&lt;/p&gt;

&lt;p&gt;Avoid tools that report only a single “AI visibility score” averaged across engines. The aggregate number tells you nothing about which engine your buyers are using, which engine you are losing on, or which engine your next content investment should target.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-19-J-geo-measurement-framework-03.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-19-J-geo-measurement-framework-03.svg" alt="Per-engine citation divergence matrix showing how AI citations vary across engines" width="100" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What to track instead, in three slots: per-engine citation rate, per-engine recommendation rate, and the variance across them as its own metric. Call that last one per-engine spread. A brand with a 40% recommendation rate on Perplexity and a 5% rate on ChatGPT has a per-engine spread that tells you exactly where the optimization work needs to go.&lt;/p&gt;
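
&lt;p&gt;One simple way to put a number on per-engine spread is the gap between your best and worst engine. That choice is an assumption made for this sketch; a standard deviation across engines would be another reasonable reading of “variance”. The function name is hypothetical.&lt;/p&gt;

```python
def per_engine_spread(rates):
    """Return (spread, best_engine, worst_engine) for a {engine: rate} mapping.

    Spread here is the gap between the best and worst engine. That is an
    assumption; a standard deviation would be another reasonable choice.
    """
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return rates[best] - rates[worst], best, worst


# The example from the text: 40% on Perplexity, 5% on ChatGPT.
rates = {"perplexity": 0.40, "chatgpt": 0.05}
spread, best, worst = per_engine_spread(rates)
average = sum(rates.values()) / len(rates)
print(f"average={average:.3f} spread={spread:.2f} best={best} worst={worst}")
```

&lt;p&gt;The averaged score for this brand hides the fact that almost all of the gap sits on one engine, which is the information the spread and the worst-engine name give back.&lt;/p&gt;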

&lt;p&gt;Per-engine spread also doubles as a noise check on vendor reports. If a tool gives you a single composite score and refuses to break it down by engine, the report is functionally unverifiable. You cannot act on a number you cannot decompose.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three KPIs that survive contact — and how to rank them
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32l1canukfhnk70s8quu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32l1canukfhnk70s8quu.jpg" alt="Six architectural geometric nodes representing the six core GEO KPIs" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are only three core KPIs you can and should be tracking. The others (citation rate, share of voice, and so on) are often just vanity metrics that don’t result in actual conversions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation rate at buyer-intent:&lt;/strong&gt; the conversion-stage signal. This is the synthesis layer; track it per engine and per topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitors mentioned:&lt;/strong&gt; how many competitors are mentioned, again broken down per engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment:&lt;/strong&gt; the qualities and tone of how the AI describes you. How does the AI rank you against others? What are your known weaknesses and when does it recommend against your business?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Increasing Recommendation Strength
&lt;/h2&gt;

&lt;p&gt;Once you have a solid understanding of who is being recommended, how often that is you, and in what light AI engines view you, you can start to take action.&lt;/p&gt;

&lt;p&gt;Common areas to focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your business is missing information people want to know&lt;/li&gt;
&lt;li&gt;AI doesn’t know the answer to a client’s question, so it can’t recommend you&lt;/li&gt;
&lt;li&gt;Your data on your website is not specific enough, leading to other vendors who have specific data being recommended&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some tips on how you can increase coverage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load in all your customer sales inquiries and look at that data to determine if all their questions are also on your website&lt;/li&gt;
&lt;li&gt;Create customer profiles and use these to generate synthetic questions, then answer them on your website&lt;/li&gt;
&lt;li&gt;Run AI agents with your client persona on a mission to find a provider or seller, then audit their journey and apply what you learn&lt;/li&gt;
&lt;li&gt;Check your site against schema validation and add elements to your website that are missing from the schema review&lt;/li&gt;
&lt;li&gt;Build a network of listicles and reviews from 3rd party sites that strengthen your brand&lt;/li&gt;
&lt;li&gt;Review all the fanout queries from all research performed by each AI query and then turn those into your SEO targets. These are often long-tail phrases with very low competition and they are the queries AI is using to research and make its determinations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;People are using AI to help them make purchasing decisions. That will only increase, both in the number of people doing it and in how much of the purchasing process they delegate to AI engines. AI is taking over the brain-power people used to spend filtering and selecting their best options.&lt;/p&gt;

&lt;p&gt;This means the decision-making process is becoming more opaque. Don’t get distracted by vanity KPI theater, where you measure how often an AI quotes your stats only to wonder why your sales are down.&lt;/p&gt;

&lt;p&gt;Understanding how AI makes decisions, and then being able to demonstrate to your customers the value you are bringing them, is challenging right now. But I believe that if you focus first and foremost on the money (what actually makes a difference), you can use it as your beacon to navigate through the wires of AI noodle-brains and get the results your customers actually want and need.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How many AI engines should you track for GEO measurement?
&lt;/h3&gt;

&lt;p&gt;Track the engines your buyers actually use, then layer in the engines whose citations propagate to other models. For most B2B audiences in 2026 that means ChatGPT, Perplexity, Claude, Gemini, and Google AI Overviews as the core five, with Copilot and Grok as secondary depending on audience. Tracking fewer than three means you cannot measure per-engine spread, which is the whole point.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a good GEO citation rate to aim for?
&lt;/h3&gt;

&lt;p&gt;Citation rate only measures how often your brand is mentioned; it does not track how often your brand is recommended. If your goal is actual recommendations, shift your effort away from earning more citations and toward getting recommended.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you measure GEO without expensive tools?
&lt;/h3&gt;

&lt;p&gt;Yes, especially for a single business. A weekly spreadsheet covering 10 prompts across three engines produces enough signal to learn the shape of your engine-by-engine picture. Most of the work you need to do is foundational. GEO tracking tools become useful afterwards, for seeing how often you are being recommended per prompt and per customer category.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does GEO measurement differ from traditional SEO measurement?
&lt;/h3&gt;

&lt;p&gt;Traditional SEO measurement assumes a single engine and a clickable ranking as the outcome. GEO measurement runs on different assumptions. The practical differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple engines, no consensus.&lt;/strong&gt; The engines disagree with each other on which sources to cite, so per-engine reporting becomes non-negotiable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation events replace clicks&lt;/strong&gt; as the conversion signal, because the primary outcome no longer produces a click.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribution requires explicit channel-grouping&lt;/strong&gt; work in analytics, because AI-referred traffic does not always carry a recognizable referrer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword ranking still matters&lt;/strong&gt; for fanout queries that AI conversations trigger, but it stops being the headline number. If you want to get recommended, show up in the fan-outs.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>seo</category>
      <category>marketing</category>
      <category>business</category>
    </item>
    <item>
      <title>AI Cost Optimization: A Practitioner Framework</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Mon, 18 May 2026 18:07:06 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/ai-cost-optimization-a-practitioner-framework-2i11</link>
      <guid>https://dev.to/sebastian_chedal/ai-cost-optimization-a-practitioner-framework-2i11</guid>
      <description>&lt;p&gt;An AI system that’s starting to cost real money is a different problem from an AI prototype, whose job was to prove a model could do the thing. The production system’s job is to do the thing at a margin that justifies its existence. Teams usually cross that line without noticing. The bill climbs steadily, then jumps, then someone runs the math and the project is suddenly under cost review.&lt;/p&gt;

&lt;p&gt;This is some of the work we do for clients. We get hired to come in, review an AI system that’s working but expensive, find the architectural waste, and bring the spend down without dropping quality. The framework in this article is the approach we actually use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why cost optimization is quality optimization in disguise, and how to tell when you’ve crossed into degradation&lt;/li&gt;
&lt;li&gt;The Script-vs-LLM Substitution Rule and the misallocation question&lt;/li&gt;
&lt;li&gt;Dispatcher-First Cost Architecture: the architectural decision that produces the largest savings&lt;/li&gt;
&lt;li&gt;Why agent decomposition lowers cost AND raises accuracy&lt;/li&gt;
&lt;li&gt;The Haiku scratchpad case: getting Sonnet-quality answers at Haiku prices by changing the prompt&lt;/li&gt;
&lt;li&gt;The optimization sequence, ordered by ROI per engineering hour&lt;/li&gt;
&lt;li&gt;The Accuracy-Speed-Cost Triangle: the ceiling you meet after the structural work is done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If runaway cost is the failure mode you’re worried about, the &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-cost-circuit-breaker/" rel="noopener noreferrer"&gt;AI Agent Cost Circuit Breaker&lt;/a&gt; covers the reactive side. This article is the proactive side: how to design a system that doesn’t run away in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost optimization is quality optimization in disguise
&lt;/h2&gt;

&lt;p&gt;The most common framing of AI cost optimization treats cost and quality as a tradeoff dial: turn the cost down, accept some quality loss, find the spot you can live with. That framing is wrong, and it produces the wrong techniques.&lt;/p&gt;

&lt;p&gt;The goal of cost optimization is to make the process more efficient, more accurate, and often faster. When you go deep on cost optimization, you end up doing a careful analysis of the process: what each step actually does, what model tier each step actually needs, which calls shouldn’t be model calls at all. That analysis improves the system on every axis, and lower cost emerges as a consequence of it.&lt;/p&gt;

&lt;p&gt;Cost optimization that drops quality below tolerance is just the wrong solution. That’s degradation of service. If a “savings” plan ends with the system producing worse outputs, it didn’t optimize. It switched to a different, worse system.&lt;/p&gt;

&lt;p&gt;This lens changes the question you ask of every technique. Instead of “how much cheaper does this make us?” the question is “does this improve the system or does it degrade it?” Techniques that improve the system on multiple axes (accuracy, speed, reliability, cost) are the ones to chase first. Techniques that trade quality for cost belong last, sparingly, and only when the quality drop is genuinely tolerable for the use case. The industry literature corroborates the connection. &lt;a href="https://aisuperior.com/llm-cost-optimization-strategies-2026/" rel="noopener noreferrer"&gt;aisuperior.com&lt;/a&gt; frames systematic optimization as producing both cost reductions and quality improvements together. The same analysis that finds the waste also finds the quality bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Script-vs-LLM Substitution Rule
&lt;/h2&gt;

&lt;p&gt;The largest savings in most AI systems aren’t hiding in model selection. They’re hiding in calls that should never have been LLM calls at all.&lt;/p&gt;

&lt;p&gt;The heuristic is the &lt;strong&gt;Script-vs-LLM Substitution Rule&lt;/strong&gt;: scripts for determinism, LLMs for judgment. If a task has a defined input shape and a defined output shape, and the transformation between them is mechanical, a script does it exactly, in milliseconds, for fractions of a cent. The moment you put an LLM in that spot, you’ve added cost, latency, and a non-zero error rate to a task that didn’t need any of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qf6aex1bujo9co7cvqh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qf6aex1bujo9co7cvqh.jpg" alt="Technical lead looking at dual screens with code in a modern office" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The substitution candidates show up in almost every AI system once you go looking. File-existence checks, status notifications, structured-data comparisons, format conversions, date math, URL canonicalization. Every one of these running on a premium reasoning model is dollar-bleed without quality justification, and the failure modes (hallucinated dates, off-by-one comparisons) are worse than the script equivalents.&lt;/p&gt;
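
&lt;p&gt;To make one of those candidates concrete, here is a rough sketch of URL canonicalization as a script, using only Python’s standard library. The specific rules (lowercase the scheme and host, drop the fragment, strip a trailing slash) are assumptions chosen for illustration; your own canonical form may differ.&lt;/p&gt;

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url):
    """Lowercase scheme and host, drop the fragment, strip a trailing slash.

    Note: lowercasing the whole netloc also lowercases any user:password
    part, which is acceptable for this sketch but not for every use case.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


print(canonicalize_url("HTTPS://Example.COM/Blog/Post/#comments"))
```

&lt;p&gt;The same input always produces the same output, which is the property an LLM call in this spot cannot guarantee.&lt;/p&gt;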

&lt;p&gt;The boundary case matters. When judgment is genuinely required (ambiguous input, context-dependent interpretation, decisions that require reading subtext or weighing trade-offs), the direction reverses. Don’t script what genuinely needs an LLM. Scripts for the deterministic stuff, LLMs for the judgment stuff, and don’t mix them up.&lt;/p&gt;

&lt;p&gt;This is the same insight at the center of our &lt;a href="https://fountaincity.tech/resources/blog/four-axes-ai-agent-efficiency/" rel="noopener noreferrer"&gt;Four Axes of AI Agent Efficiency&lt;/a&gt; framework. The Script-It axis specifically targets entire sessions that shouldn’t have been LLM calls in the first place. In production audits we’ve found this is consistently the largest single cost lever, bigger than model downgrades, prompt compression, or caching.&lt;/p&gt;

&lt;p&gt;The stakes for getting this wrong are non-trivial. &lt;a href="https://fountaincity.tech/resources/blog/four-axes-ai-agent-efficiency/" rel="noopener noreferrer"&gt;Gartner has projected&lt;/a&gt; that over 40% of agentic AI projects will be canceled by 2027 due to escalating costs and unclear value. A large share of that escalation traces back to LLM-everywhere architecture, putting an expensive reasoning model into spots where a five-line script would have served. The substitution rule is the cheapest, fastest fix for a runaway bill. And there’s no trade hiding under it: the script is cheaper, faster, and more accurate than the call it replaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dispatcher-First Cost Architecture
&lt;/h2&gt;

&lt;p&gt;The single highest-leverage architectural decision in AI cost optimization is putting a lightweight dispatcher in front of every premium-model call. We call this &lt;strong&gt;Dispatcher-First Cost Architecture&lt;/strong&gt;: every inbound task routes through a gatekeeper (a script or a low-cost model) that decides which downstream agent or model handles it. No speculative engagement of high-cost models.&lt;/p&gt;

&lt;p&gt;The academic backbone is well-established. Stanford’s &lt;a href="https://openreview.net/forum?id=cSimKw5p6R" rel="noopener noreferrer"&gt;FrugalGPT&lt;/a&gt; paper showed that a cascade architecture (try cheaper models first, escalate on failure) can match GPT-4 performance with up to a 98% cost reduction across natural language tasks. The &lt;a href="https://lmsys.org/blog/2024-07-01-routellm/" rel="noopener noreferrer"&gt;RouteLLM&lt;/a&gt; framework from LMSYS reached similar territory on MT Bench, reporting cost reductions of over 85% while retaining 95% of GPT-4’s performance.&lt;/p&gt;

&lt;p&gt;The lesson under the numbers is more useful than the percentages themselves. The majority of queries don’t need the most expensive model. A trained dispatcher classifies task complexity and routes accordingly; the premium model gets engaged only when the cheaper tier fails or the complexity score crosses a threshold.&lt;/p&gt;
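
&lt;p&gt;The routing logic described above can be sketched in a few lines. Everything here is a stand-in: the scoring function, the acceptance check, the two model callables, and the 0.7 threshold are hypothetical placeholders, and a real dispatcher would be a trained classifier or a set of rules tuned to your workload.&lt;/p&gt;

```python
def dispatch(task, complexity_score, cheap_model, premium_model, accept, threshold=0.7):
    """Route a task: cheap tier first, premium only on high complexity or failure.

    All arguments are hypothetical stand-ins: `cheap_model` and `premium_model`
    are callables that take the task and return an answer, `complexity_score`
    returns a float between 0 and 1, and `accept` judges the cheap answer.
    """
    if complexity_score(task) >= threshold:
        return "premium", premium_model(task)
    answer = cheap_model(task)
    if accept(answer):
        return "cheap", answer
    # The cheap tier failed the check, so escalate.
    return "premium", premium_model(task)


# Toy stubs so the sketch runs without any API calls.
def score(task):
    return 0.9 if "contract" in task else 0.1


def cheap(task):
    return "" if "edge" in task else "cheap:" + task


def premium(task):
    return "premium:" + task


def non_empty(answer):
    return answer != ""


for task in ["summarize ticket", "review contract", "edge case"]:
    print(task, dispatch(task, score, cheap, premium, non_empty))
```

&lt;p&gt;Returning the tier alongside the answer makes it cheap to log how often the premium model is actually engaged, which is the number to watch over time.&lt;/p&gt;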

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2oevkxqgox900nbm41j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2oevkxqgox900nbm41j.jpg" alt="Holographic flow network with a central gatekeeper node routing data" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how this looks in our own content pipeline. We run an autonomous agent stack on Anthropic Claude Opus, Sonnet, and z.ai GLM-5, with daily spend in the $15-20 range. Each pipeline stage is pinned to the model tier the task actually needs: GLM-5 for data gathering, Opus only when synthesis or judgment is required, Sonnet for art direction. The dispatcher isn’t a separate service; it’s the stage definition itself, because we pre-classified each stage during architecture. A config bug that sent all six content stages to Opus tripled the per-article cost before we caught it. Per-stage model pinning is what makes that recoverable.&lt;/p&gt;
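
&lt;p&gt;A rough illustration of per-stage pinning, not our actual configuration: the stage names mirror the three examples above, and the tier labels are placeholders and not real API model identifiers. The drift check at the end is an added assumption about how a misrouting bug like the one described could be surfaced early.&lt;/p&gt;

```python
# Placeholder tier labels, not real API model identifiers.
STAGE_MODELS = {
    "data_gathering": "glm-5",
    "synthesis": "opus",
    "art_direction": "sonnet",
}


def model_for(stage):
    """Return the pinned model for a stage, failing loudly on unknown stages."""
    try:
        return STAGE_MODELS[stage]
    except KeyError:
        raise ValueError(f"no model pinned for stage {stage!r}") from None


def premium_share(stage_models, premium="opus"):
    """Fraction of stages pinned to the premium tier, a cheap drift check."""
    pinned = sum(1 for model in stage_models.values() if model == premium)
    return pinned / len(stage_models)


print(model_for("synthesis"), premium_share(STAGE_MODELS))
```

&lt;p&gt;If every stage ended up on the premium tier, &lt;code&gt;premium_share&lt;/code&gt; would return 1.0, which is easy to alert on before the bill shows it.&lt;/p&gt;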

&lt;p&gt;Dispatcher architecture earns its complexity when task complexity varies significantly. On a uniform workload, the dispatcher adds latency, code surface, and a place for bugs to hide without giving you a savings lever to pull. The decision rule: if your workload has at least two distinguishable complexity tiers (and most do, once you look), the dispatcher pays for itself. If everything is genuinely a high-end reasoning task, route directly and skip the dispatcher.&lt;/p&gt;

&lt;p&gt;Model pinning at the dispatcher layer is also a governance control. The &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-governance-practitioners-guide/" rel="noopener noreferrer"&gt;governance practitioner’s guide&lt;/a&gt; covers this overlap in more detail. Runtime model selection is one of the controls that protects against unintended escalation, in security as well as in cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent decomposition lowers cost AND raises accuracy
&lt;/h2&gt;

&lt;p&gt;If one technique deserves to be at the top of the priority list once script substitution is done, it’s agent decomposition. The pattern: take a single task you’re sending to a large model and split it into a sequence of smaller subtasks, each running on a smaller model tier appropriate to that subtask.&lt;/p&gt;

&lt;p&gt;The economics are direct: if one large model is doing a process, that can be very expensive. Break it down into several smaller sub-steps with small models, and each one of those small models might cost a tenth or even a twentieth of the price of the larger model. Multiply that across the steps and the per-task spend drops dramatically.&lt;/p&gt;
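
&lt;p&gt;A back-of-the-envelope version of that arithmetic, with made-up numbers: the step count, the price ratio, and the share of tokens each step processes are all assumptions you would replace with your own measurements.&lt;/p&gt;

```python
def decomposed_cost(large_cost, steps, price_ratio, token_share=1.0):
    """Cost of `steps` small-model calls relative to one large-model call.

    price_ratio: small-model price as a fraction of the large model's
                 (0.1 means a tenth of the price).
    token_share: tokens each step processes relative to the original call
                 (1.0 is the pessimistic case where every step sees everything).
    """
    return large_cost * steps * price_ratio * token_share


# One large call costing 1.00 unit, split into four steps at a tenth of the price.
pessimistic = decomposed_cost(1.00, steps=4, price_ratio=0.1)
# Same split at a twentieth of the price, each step seeing half the tokens.
optimistic = decomposed_cost(1.00, steps=4, price_ratio=0.05, token_share=0.5)
print(f"pessimistic={pessimistic:.2f} optimistic={optimistic:.2f}")
```

&lt;p&gt;Under these assumptions the decomposed pipeline stays cheaper as long as steps × price ratio × token share is below 1. At a tenth of the price, that means up to ten full-size steps before the savings disappear.&lt;/p&gt;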

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cn2jeac98ki1kkq328e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cn2jeac98ki1kkq328e.jpg" alt="Large glowing block dissolving into several smaller glowing blocks" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The non-obvious second benefit is the one most cost-optimization guides miss. Smaller models on focused subtasks often outperform a single large model on the bundled task. The reasons are mechanical: each subtask has narrower context, narrower failure modes (each step has one job, and you can evaluate it in isolation), and easier debugging. Accuracy goes up because the system is easier to reason about, not because the smaller models are individually smarter.&lt;/p&gt;

&lt;p&gt;Decomposition also frees you to run independent subtasks in parallel where the data flow allows it, which pulls latency down on top of cost. Three things move together: cost down, accuracy up, often speed up too. No trade-off.&lt;/p&gt;
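
&lt;p&gt;Where subtasks do not depend on each other, running them concurrently is straightforward with Python’s &lt;code&gt;asyncio&lt;/code&gt;. In this sketch the subtask names are hypothetical and &lt;code&gt;asyncio.sleep&lt;/code&gt; stands in for a small-model API call.&lt;/p&gt;

```python
import asyncio


async def run_subtask(name, seconds):
    """Stand-in for a small-model call; sleeps instead of calling an API."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def run_independent(subtasks):
    """Run subtasks that do not depend on each other concurrently.

    asyncio.gather returns results in the order the subtasks were given,
    regardless of which one finishes first.
    """
    return await asyncio.gather(*(run_subtask(n, s) for n, s in subtasks))


results = asyncio.run(
    run_independent([("extract", 0.02), ("classify", 0.02), ("summarize", 0.02)])
)
print(results)
```

&lt;p&gt;Wall-clock time is then roughly that of the slowest subtask instead of the sum of all of them. Steps that consume another step’s output still have to run in sequence.&lt;/p&gt;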

&lt;p&gt;Decomposition has a cost of its own. It adds coordination overhead: state passing between steps, error handling at each boundary, monitoring across the chain. For single-call workflows or short pipelines, the overhead isn’t worth it. The threshold is roughly: if the task has at least three distinct phases that could plausibly run on different model tiers, decomposition pays. For a one-shot answer task with a uniform reasoning load, keep it monolithic.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-deployment-operational-decisions/" rel="noopener noreferrer"&gt;deployment operational decisions&lt;/a&gt; article covers the lifecycle questions around when to decompose and when to consolidate. Decomposition is one of the moves you make as a system matures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Haiku scratchpad case: make cheaper models smarter before escalating
&lt;/h2&gt;

&lt;p&gt;Sometimes you can get the answer quality of a higher tier at the price of a lower tier, not by switching models but by changing the prompt. The technique is to force the cheaper model to reason in writing before it answers. Give it a scratchpad (a file, a structured output field, anywhere it can lay out its thinking step by step) and require it to write reasoning before producing the final answer.&lt;/p&gt;

&lt;p&gt;Here’s a direct case. We ran a large-volume sandbox test on Haiku and another on Sonnet, measuring how often the model produced a failure (wrong decision, wrong recommendation), using a secondary LLM as evaluator against a fixed set of control criteria. Haiku failed 4% of the time. Sonnet failed 0% of the time. Per-call, Haiku was substantially cheaper, but the error rate made it look like Sonnet was the right choice.&lt;/p&gt;

&lt;p&gt;Then we changed the Haiku instructions: before producing an answer, write your reasoning to a scratchpad file. Only after that, give the answer. We re-ran 250 tests. The Haiku error rate moved from 4% to 0%. The per-run cost rose trivially, a few hundred extra output tokens of reasoning, and Haiku stayed substantially cheaper than Sonnet for the same volume of work. Sonnet-quality answers at Haiku prices.&lt;/p&gt;
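
&lt;p&gt;A minimal sketch of the idea, using a reasoning section inside the model output as the scratchpad instead of a file. The marker strings, the instruction wording, and the helper names are assumptions for illustration, not the exact prompt from our test.&lt;/p&gt;

```python
SCRATCHPAD_INSTRUCTIONS = (
    "Before answering, write your step-by-step reasoning under a line "
    "that reads REASONING:. Only after that, give your final answer on a "
    "line that starts with ANSWER:."
)


def with_scratchpad(task_prompt):
    """Append the reasoning-first instructions to any task prompt."""
    return f"{task_prompt}\n\n{SCRATCHPAD_INSTRUCTIONS}"


def extract_answer(model_output):
    """Return the text after the last ANSWER: marker, or None if it is missing."""
    _before, found, after = model_output.rpartition("ANSWER:")
    return after.strip() if found else None


# Hypothetical model output, hard-coded so the sketch runs without an API call.
sample_output = "REASONING:\nThe refund window has passed.\nANSWER: decline"
print(extract_answer(sample_output))
```

&lt;p&gt;Treating a missing &lt;code&gt;ANSWER:&lt;/code&gt; marker as a failure (the &lt;code&gt;None&lt;/code&gt; case) gives the evaluator a clean signal when the cheaper model ignores the format.&lt;/p&gt;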

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakv572vrmxa45gr0sjgg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakv572vrmxa45gr0sjgg.jpg" alt="Two professionals looking at analytical data on a shared screen" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same approach works between Sonnet and Opus on harder tasks. Force the mid-tier model to write reasoning before answering, and the gap to the premium tier closes for some workloads. Not all. Scratchpad-forcing has limits. Some tasks genuinely need Opus-tier reasoning and no prompt design closes that gap.&lt;/p&gt;

&lt;p&gt;Before reaching for a model upgrade on high-volume tasks where the per-call cost delta is large, run the scratchpad test. The cases where it works are the cases where you save the most. Cost goes down and accuracy goes up, and the small speed cost from the extra output tokens is typically dwarfed by the spend reduction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The optimization sequence
&lt;/h2&gt;

&lt;p&gt;In rough order of priority, here are the optimization levers to pull:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Script substitution.&lt;/strong&gt; Audit the system for LLM calls that don’t require judgment. Replace them with scripts. Biggest savings, lowest complexity, fastest to ship. Days of work for sustained spend reduction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model pinning by stage.&lt;/strong&gt; If different parts of your system have different complexity requirements, pin each to the right model tier. Don’t run everything on Opus. Moderate complexity, large savings, weeks of work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatcher architecture.&lt;/strong&gt; Once stages are pinned, formalize the routing layer. A lightweight dispatcher in front of premium calls multiplies the savings from steps 1 and 2 and prevents future drift back to expensive defaults.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent decomposition.&lt;/strong&gt; Split monolithic tasks into focused subtasks running on appropriate tiers. Hits the cost+accuracy dual benefit, and unlocks parallelism on top. Higher engineering effort but the highest ceiling on savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scratchpad-forcing on the smaller tier.&lt;/strong&gt; Before escalating to a larger model, force the cheaper one to write reasoning before answering. Often closes the quality gap at a trivial output-token cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context trimming and prompt compression.&lt;/strong&gt; Tools like Microsoft’s &lt;a href="https://www.microsoft.com/en-us/research/blog/llmlingua-innovating-llm-efficiency-with-prompt-compression/" rel="noopener noreferrer"&gt;LLMLingua&lt;/a&gt; compress long prompts by single-digit multiples with minimal semantic loss. Lower-leverage unless your prompts are unusually long, but worth measuring once the architectural moves are done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching layers.&lt;/strong&gt; Prompt caching for repeated context and semantic caching for near-duplicate queries. Pure-cost wins when repeated context is common in your workload; cache hit rate is the predictor of value. You can also cache the outputs of multi-dimensional structured queries, keyed on every input dimension (a “hypercube” of answers). When the inputs and conditions are identical, serve the stored output and skip the LLM call, and its cost, entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch API and subscription balancing.&lt;/strong&gt; Discounts for non-time-sensitive workloads and subscription versus pay-as-you-go decisions. Real but modest savings, lowest engineering effort. Do these last.&lt;/li&gt;
&lt;/ul&gt;
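
&lt;p&gt;The exact-match caching idea in the list above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class name &lt;code&gt;ExactMatchCache&lt;/code&gt; and the dimension names in the usage note are hypothetical, and a production version would need persistence and invalidation.&lt;/p&gt;

```python
import hashlib
import json


class ExactMatchCache:
    """Serve stored outputs when every input dimension is identical."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(dimensions):
        # Sort keys so {"a": 1, "b": 2} and {"b": 2, "a": 1} hash the same.
        canonical = json.dumps(dimensions, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_or_compute(self, dimensions, compute):
        key = self.make_key(dimensions)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(dimensions)  # the only place a paid call happens
        self._store[key] = result
        return result

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

&lt;p&gt;Calling &lt;code&gt;get_or_compute&lt;/code&gt; twice with the same dimensions, for example region, segment, and quarter, invokes the model once; &lt;code&gt;hit_rate()&lt;/code&gt; gives the number that predicts whether the cache is worth keeping.&lt;/p&gt;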

&lt;p&gt;The sequence above is what we’ve used across cost-optimization engagements with PrograMate.ai, Unleashed Consulting, Black Gazelle, AI Governance Portland Organization, and the Wiseman Group. In each case, the largest savings came from steps 1-4: substitution, pinning, dispatching, decomposition. The lower-leverage moves closed the remaining fraction of savings but were never where the heavy lifting happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Accuracy-Speed-Cost Triangle: the ceiling, not the starting point
&lt;/h2&gt;

&lt;p&gt;Once the structural moves above are done — calls that shouldn’t have been LLMs replaced with scripts, stages pinned to the right model tier, monolithic tasks decomposed and parallelized where possible, smaller models given scratchpads — you arrive at the &lt;strong&gt;Accuracy-Speed-Cost Triangle&lt;/strong&gt;. This is the end state. Up to this point, the right techniques made the system faster &lt;em&gt;and&lt;/em&gt; cheaper &lt;em&gt;and&lt;/em&gt; more accurate at the same time. From this point on, that stops being true.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-15-J-ai-cost-optimization-practitioner-framework-02-v2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-15-J-ai-cost-optimization-practitioner-framework-02-v2.svg" alt="The Accuracy-Speed-Cost Triangle: trade-offs at the optimization ceiling (model downgrade, quantized models, context truncation, capping retries on the cost-accuracy edge; batch API on the cost-speed edge)" width="100" height="69.11764705882352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The triangle has three corners — accuracy, speed, cost — and at the ceiling, every additional lever you pull moves two of them in opposite directions. To get cost down further, you have to give up speed, accept some quality drop, or both. Examples of choices that genuinely sit on the triangle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch API for non-time-sensitive work.&lt;/strong&gt; Real cost savings, but the request now takes hours or a day instead of seconds. Trade: cost ↓, speed ↓.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model downgrade beyond what scratchpads can recover.&lt;/strong&gt; When you’ve already tried prompt design and the smaller tier still fails on a measurable share of your workload, taking the downgrade anyway buys cost at the price of accuracy. Trade: cost ↓, accuracy ↓.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantized or distilled in-house models for high-volume routine work.&lt;/strong&gt; Cost falls, output quality narrows on edge cases. Trade: cost ↓, accuracy ↓ at the tails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context truncation past the safe threshold.&lt;/strong&gt; The lossless compression already happened in the structural phase. Pushing further trades quality for incremental savings. Trade: cost ↓, accuracy ↓.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capping retries, fallbacks, or self-correction loops.&lt;/strong&gt; Saves call volume, increases the rate at which the system ships a wrong answer. Trade: cost ↓, accuracy ↓.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reaching the ceiling is not the end of the road, because the ceiling itself moves. New model releases that match a higher tier’s quality at a lower price shift the triangle outward. A model capable enough to consolidate two stages of your decomposition into one moves it again. Provider pricing changes can move it overnight. Review your cost structure periodically, especially after a major movement in the market.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;Teams that try cost optimization without an organizing framework may run into the following failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reaching for the triangle before the structural moves.&lt;/strong&gt; Treating cost and quality as a tradeoff dial from day one, when most of the savings sit in techniques that improve both at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizing the wrong layer.&lt;/strong&gt; Caching when the real waste is misallocated LLM calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chasing token price without checking quality.&lt;/strong&gt; Downgrading to a model that produces worse outputs and calling it a win. Or worse, downgrading the model without testing enough to confirm that quality held.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hidden ops costs in self-hosting.&lt;/strong&gt; The math rarely works at small or mid scale once you account for engineering time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatcher overhead on uniform workloads.&lt;/strong&gt; Adding routing complexity where there’s no complexity variance to benefit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir2073dm3ydy3r497rct.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir2073dm3ydy3r497rct.jpg" alt="Ornate fountain in a sunset-lit plaza with holographic data mist" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to model the savings on your own system before changing anything, the &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-roi-calculator/" rel="noopener noreferrer"&gt;AI Agent ROI Calculator&lt;/a&gt; walks through the inputs that determine where your spend actually is. If you’d rather have someone come in and do the audit, that’s what our &lt;a href="https://fountaincity.tech/services/managed-autonomous-ai-agents/" rel="noopener noreferrer"&gt;managed autonomous AI agents&lt;/a&gt; service exists for. Either way, the same framework applies: find the architectural waste first, then the token waste, then the trade-offs at the ceiling, in that order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How much can a typical AI system reduce costs through optimization?
&lt;/h3&gt;

&lt;p&gt;Industry benchmarks land in the 40-70% range for systematic optimization applied to a production system. When optimization compounds with process improvements, or when the analysis reveals waste that was hiding in architectural decisions, much larger reductions, up to an order of magnitude (roughly 10x lower spend), are achievable but not typical. Set expectations at 40-70% as the base case.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the cheapest model that still produces production-quality output?
&lt;/h3&gt;

&lt;p&gt;It depends on the task, and the question is usually asked too early. Before picking a model tier at all, run the structural sequence: replace misallocated LLM calls with scripts, decompose monolithic tasks into smaller-tier subtasks, and try scratchpad-forcing on the smaller tier. After that, the cheapest model that hits your quality bar on a representative sandbox test is the answer — and it’s typically smaller than the one you’d have chosen without the structural pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I switch from a frontier model to a smaller one?
&lt;/h3&gt;

&lt;p&gt;After a sandbox test shows the smaller model meets your quality bar on a representative workload. Before tier-jumping down, try scratchpad-forcing on the smaller model. Sometimes you get the quality you need at the lower price without the switch.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I decide between an LLM call and a deterministic script?
&lt;/h3&gt;

&lt;p&gt;Apply the Script-vs-LLM Substitution Rule. Scripts for determinism (defined inputs, defined outputs, mechanical transformation). LLMs for judgment (ambiguous input, context-dependent decisions, reasoning about trade-offs). If a task has a single right answer that doesn’t depend on context, it’s a script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is self-hosting cheaper than paying API fees?
&lt;/h3&gt;

&lt;p&gt;Rarely at small or mid scale. The math looks tempting (GPU hours versus API fees) but the hidden costs (engineering time, MLOps tooling, model updates, downtime, security) dominate the bill in practice. Self-hosting starts paying off at scale levels most production systems don’t reach. At the scale where it does pay off, you usually want a hybrid: self-hosted for the high-volume routine work, API for spike-load and frontier-capability calls. This could change over time as self-hosted models match and then exceed the performance of current higher-tier models.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does dispatcher routing actually work?
&lt;/h3&gt;

&lt;p&gt;A lightweight component (often a smaller model or a deterministic classifier) receives every inbound task and decides which downstream agent or model handles it. Stanford’s FrugalGPT cascade is the academic reference: try cheaper models first, escalate on failure or low confidence. RouteLLM trains the router on Chatbot Arena data to classify task complexity and pick the model tier. In production, the dispatcher can be a routing script that maps task type to model tier, or a trained classifier.&lt;/p&gt;
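
&lt;p&gt;Both dispatcher styles described above can be sketched in a few lines. This is an illustration only, not the FrugalGPT or RouteLLM implementation: the tier names, task types, and the convention that a handler returns &lt;code&gt;None&lt;/code&gt; on failure or low confidence are all assumptions.&lt;/p&gt;

```python
# Hypothetical tier names and task types, for illustration only.
TIERS = ["small", "mid", "premium"]

ROUTES = {
    "classification": "small",
    "extraction": "small",
    "summarization": "mid",
    "multi_step_planning": "premium",
}


def route(task_type):
    """Static routing: map a known task type to its pinned tier.

    Unknown task types go to the most capable tier, so a routing gap
    costs money rather than quality.
    """
    return ROUTES.get(task_type, TIERS[-1])


def cascade(task, handlers, start_tier="small"):
    """Cascade routing: try tiers from cheapest upward.

    `handlers` maps tier name to a callable that returns an answer, or
    None when the tier fails or is not confident enough.
    """
    for tier in TIERS[TIERS.index(start_tier):]:
        answer = handlers[tier](task)
        if answer is not None:
            return tier, answer
    raise RuntimeError("every tier declined the task")
```

&lt;p&gt;The static map is the cheapest dispatcher to operate; the cascade spends an extra cheap call on hard tasks in exchange for not needing to classify them up front.&lt;/p&gt;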

&lt;h3&gt;
  
  
  What’s the right balance between subscription pricing and API pay-per-use?
&lt;/h3&gt;

&lt;p&gt;Volume threshold. If your monthly usage consistently exceeds the breakeven point of a subscription tier, lock in. If it’s variable or below the breakeven, stay pay-as-you-go. For systems with mixed workload (steady baseline plus spike load), a hybrid often works: subscription for the baseline, API for the spikes. Re-evaluate quarterly as usage patterns shift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I optimize cost without sacrificing quality?
&lt;/h3&gt;

&lt;p&gt;Yes — and it’s the default, not the exception, until you reach the triangle. Cost optimization that drops quality below tolerance is degradation of service, not optimization. The techniques that pull cost without dropping quality (substitution of misallocated calls, model pinning by stage, decomposition, scratchpad-forcing, prompt caching) are the ones to start with. Techniques that genuinely trade quality for cost belong at the ceiling, sparingly, and only with measurement.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long does it take to see ROI from AI cost optimization work?
&lt;/h3&gt;

&lt;p&gt;Model pinning: a week or two. Script substitution and dispatcher architecture: weeks to a month, depending on workload complexity. Full sequence including decomposition, caching, and batch processing: a few months for a mature production system. The savings start showing up in the bill immediately after the first deployment, which makes the work easier to justify than most engineering projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the most common AI cost optimization mistakes?
&lt;/h3&gt;

&lt;p&gt;Starting at the wrong layer, going after caching and batch APIs before checking for misallocated LLM calls. Chasing token price without measuring quality, so you discover later that you switched to a cheaper model that fails more often. Hidden self-hosting costs that aren’t visible until the engineering time bill arrives. Adding dispatcher complexity on workloads that don’t have the complexity variance to benefit from routing. Every one of these traces back to reaching for tactical levers before doing the structural audit — treating the Accuracy-Speed-Cost Triangle as the diagnostic tool when it’s actually the ceiling.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>business</category>
    </item>
    <item>
      <title>Hermes Agent vs OpenClaw: When to Use Which (and When to Use Both)</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Fri, 15 May 2026 18:07:52 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/hermes-agent-vs-openclaw-when-to-use-which-and-when-to-use-both-6e5</link>
      <guid>https://dev.to/sebastian_chedal/hermes-agent-vs-openclaw-when-to-use-which-and-when-to-use-both-6e5</guid>
      <description>&lt;p&gt;Businesses comparing Hermes Agent and OpenClaw often treat it as a winner-loser question. That framing is wrong. They are not competing for the same job. They are different layers of the same stack, and the right architecture for most agentic systems runs both, nested together, with Hermes driving and OpenClaw containing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-13-J-hermes-vs-openclaw-02.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-13-J-hermes-vs-openclaw-02.svg" alt="Architecture diagram: OpenClaw Gateway outer container with blue infrastructure boxes and teal workflow agents on the left, Hermes Agent self-improving loop (Execute, Evaluate, Extract, Refine, Retrieve) on the right, connected via ACP" width="100" height="62.82051282051282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural disagreement
&lt;/h2&gt;

&lt;p&gt;Hermes Agent and OpenClaw share a lot of surface area. Both run on your own devices, connect to messaging channels, schedule cron jobs, store persistent memory, delegate to subagents, and integrate browser and terminal tools. Read the feature lists side by side and you would conclude they are competitors.&lt;/p&gt;

&lt;p&gt;They are not, because they disagree on what the center of an agent system should be. Hermes is built around &lt;a href="https://hermes-agent.nousresearch.com/docs/" rel="noopener noreferrer"&gt;a closed learning loop&lt;/a&gt;: the agent executes a task, evaluates how it went, extracts a skill, refines it during subsequent runs, and retrieves the relevant pieces on future tasks. The agent is the load-bearing element.&lt;/p&gt;

&lt;p&gt;OpenClaw inverts that. The center of OpenClaw is the Gateway, &lt;a href="https://docs.openclaw.ai/gateway/protocol" rel="noopener noreferrer"&gt;the single control plane and node transport&lt;/a&gt; for the whole system. Agents are containers the Gateway routes work to. The framework is the load-bearing element, and agents are interchangeable workers inside it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where OpenClaw wins
&lt;/h2&gt;

&lt;p&gt;OpenClaw is the right call when the system needs strong containment and predictable workflows more than it needs deep reasoning inside any single agent. Three strengths matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow state control.&lt;/strong&gt; The Gateway gives you an explicit, inspectable control plane for routing work between stages. When work fails, you know where it failed and what state it was in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent containerization.&lt;/strong&gt; Each agent is isolated — its own workspace, scoped tools, scoped permissions. One agent cannot accidentally run another agent’s code or read its files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool and skill scoping.&lt;/strong&gt; You declare which tools each agent can call. A research agent does not get write access to your CRM. A social-media agent does not get shell access to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shape of the work matters more than the agent’s IQ. If the job is “run this five-stage pipeline every day, route failures to a human, and never let stage three write to production without approval,” OpenClaw is built for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Hermes wins
&lt;/h2&gt;

&lt;p&gt;Hermes is the right call when the value of the system depends on what happens inside a single agent’s reasoning, not on the workflow that connects multiple agents.&lt;/p&gt;

&lt;p&gt;The differentiator is the self-reflective execution loop. Hermes does not just run tasks — it captures what worked, packages it as a reusable skill, improves the skill over time, and recalls the right piece of memory at the right moment. Nous Research describes Hermes as &lt;a href="https://hermes-agent.nousresearch.com/docs/" rel="noopener noreferrer"&gt;“the only agent with a built-in learning loop”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That difference compounds on two task types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-horizon work picked up across sessions.&lt;/strong&gt; Multi-week projects where context drifts and “what did we decide last time?” is the most-asked question. Hermes is built to remember.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher-order reasoning with tight tool chaining.&lt;/strong&gt; Tasks where the agent has to plan, execute a tool, evaluate the result, choose a different tool, and iterate. OpenClaw can do this, but the loop is not first-class. In Hermes, the loop &lt;em&gt;is&lt;/em&gt; the agent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0h9pe9w6ealzgrwztua.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0h9pe9w6ealzgrwztua.jpg" alt="Illustrated visualization of the Hermes Agent self-improving execution loop — a luminous brain-orb surrounded by five glowing feedback arcs representing continuous learning and skill refinement" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ACP
&lt;/h2&gt;

&lt;p&gt;The Agent Communication Protocol is what makes running both frameworks together a real architectural choice rather than a duct-tape job.&lt;/p&gt;

&lt;p&gt;ACP is a standard for how one piece of software talks to an AI agent. The agent runs in one process. Something else (an editor, a framework, an orchestrator) runs in another. ACP defines the message format between them, so the client can send work, watch progress, see which tools the agent is using, approve sensitive actions, and receive responses. Hermes adopted ACP early and can run as &lt;a href="https://hermes-agent.nousresearch.com/docs/user-guide/features/acp" rel="noopener noreferrer"&gt;an ACP server&lt;/a&gt; that any ACP-compatible client can drive.&lt;/p&gt;

&lt;p&gt;That last detail is the unlock. If Hermes can run as an ACP server, anything that speaks ACP — OpenClaw included — can use a Hermes agent as a node inside a larger system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e5g9lpgw50lli39pq6l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e5g9lpgw50lli39pq6l.jpg" alt="Illustrated bridge connecting two distinct AI framework architectures — blue geometric orchestration structure on the left linked to warm amber neural lattice on the right via the Agent Communication Protocol" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The “Hermes drives, OpenClaw contains” pattern
&lt;/h2&gt;

&lt;p&gt;In an agent system where you need both workflow containment and self-reflective reasoning, you nest them.&lt;/p&gt;

&lt;p&gt;OpenClaw is the outer container — control plane, messaging channels, scheduled jobs, multi-agent routing, tool and skill permissions. Inside, most agents are focused workflow executors. For the agents whose value depends on reasoning and learning over time, you run a Hermes agent as a node, exposed over ACP.&lt;/p&gt;

&lt;p&gt;A concrete example: an outbound ABM system. The orchestration — sequencing stages, managing timing between touches, handling bounces, routing hot responses to a human — is OpenClaw’s job. The reasoning inside research and personalization is where Hermes earns its place. For each target account, Hermes builds a living profile: who the real influencers are, what language resonates, which angles have gotten traction. Each interaction feeds back into the profile. Over time, Hermes develops a sharper model of each account.&lt;/p&gt;

&lt;p&gt;Hermes drives the reasoning inside the work. OpenClaw contains it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A two-question decision framework
&lt;/h2&gt;

&lt;p&gt;If you are deciding what to build, separate two questions before any vendor pitches you a framework.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Does the system need workflow containment, or higher-order reasoning inside a single agent?&lt;/strong&gt; Containment means predictable stages, isolated agents, scoped tools, explicit hand-offs. Higher-order reasoning means a single agent that gets smarter at your specific job over time. Different problems, different solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need both?&lt;/strong&gt; If the answer is “actually, both” — which for most production systems past a certain complexity it is — then a nested architecture is the answer. Hermes inside OpenClaw, communicating over ACP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a vendor pitches “we just use [single framework] for everything,” ask which of the two needs they are choosing not to meet. There is always a tradeoff, and a vendor who does not know what they are giving up is not the vendor you want building your agent system. &lt;a href="https://fountaincity.tech/resources/blog/top-ai-agent-development-companies/" rel="noopener noreferrer"&gt;Picking a builder&lt;/a&gt; is at least as consequential as picking a framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;Hermes packages a gateway around an agent; OpenClaw packages agents inside a gateway. The difference is which load-bearing element your system needs. For a single specialist agent that learns one domain over time, pick Hermes. For a multi-stage workflow with several agents, different permissions, and broad channel reach, pick OpenClaw. For sophisticated systems that need both — pick a vendor who knows how to nest them. &lt;a href="https://fountaincity.tech/services/agentic-development/" rel="noopener noreferrer"&gt;Agentic development&lt;/a&gt; is the discipline of architecting the whole system: frameworks, agents, tool scopes, deployment, monitoring, and recovery. The framework is the floor. The rest is the discipline.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Anthropic’s Multi-Agent Blueprint: What Production Constraints Add</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Mon, 11 May 2026 18:06:52 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/anthropic8217s-multi-agent-blueprint-what-production-constraints-add-2fm7</link>
      <guid>https://dev.to/sebastian_chedal/anthropic8217s-multi-agent-blueprint-what-production-constraints-add-2fm7</guid>
      <description>&lt;p&gt;&lt;a href="https://www.anthropic.com/engineering/multi-agent-research-system" rel="noopener noreferrer"&gt;Anthropic’s engineering team published one of the cleanest write-ups available on how a multi-agent system actually works in practice&lt;/a&gt;. The post is about Claude Research, an orchestrator-subagent pattern built for breadth-first research. The architecture is optimized for a particular task class, and the price of admission is a roughly fifteenfold token cost compared to a chat conversation. That cost is the tradeoff the system makes on purpose.&lt;/p&gt;

&lt;p&gt;Most production systems make different tradeoffs. They run under cost ceilings, accuracy SLAs, speed budgets, and error rates that the research context does not impose. The blueprint’s patterns travel — orchestrator delegation, parallel subagents, condensed-return artifacts, end-state evaluation — but the architecture that emerges from applying them under production pressure is rarely the architecture in the post. The choices look the same up close and different at the system level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-08-J-anthropic-multi-agent-blueprint-production-02-fixed.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-08-J-anthropic-multi-agent-blueprint-production-02-fixed.svg" width="100" height="40.816326530612244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The blueprint is for breadth-first research, and the cost multiplier travels with it
&lt;/h2&gt;

&lt;p&gt;Anthropic’s system is built for a specific kind of work: research where the question is large, the directions are independent, and the answer is worth a lot of tokens. The lead agent plans an approach, spins up subagents to explore in parallel, and reconciles their findings against citations. On Anthropic’s internal evaluation, a multi-agent setup with Claude Opus 4 as lead and Claude Sonnet 4 subagents &lt;a href="https://www.anthropic.com/engineering/multi-agent-research-system" rel="noopener noreferrer"&gt;outperformed single-agent Claude Opus 4 by 90.2%&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The number that matters more: multi-agent systems use about 15x more tokens than chat interactions. The cost multiplier is the price of admission to the architecture. If the task does not decompose into parallel directions, you pay it without earning it.&lt;/p&gt;

&lt;p&gt;Anthropic is direct about the limit: “domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today.” That is the boundary of where the architecture earns its keep. Tasks with tightly-coupled state, sequential dependencies, or shared mutable context will hit coordination overhead faster than they hit parallelism gains.&lt;/p&gt;

&lt;p&gt;The first decision is whether the task is in the right shape for the pattern. If it is a research-style problem with independent directions, parallel subagents are doing real work. If it is a workflow with chained dependencies, a single agent or a deterministic pipeline with smaller agents inside it usually wins on cost and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token budget, not prompt cleverness, is the dominant performance lever
&lt;/h2&gt;

&lt;p&gt;Anthropic’s variance analysis is the more useful diagnostic. In their BrowseComp evaluation, &lt;a href="https://www.anthropic.com/engineering/multi-agent-research-system" rel="noopener noreferrer"&gt;token usage by itself explained 80% of performance variance&lt;/a&gt;. Tool-call count and model choice were the other two factors. Prompt phrasing, instruction style, and the things teams typically iterate on did not show up as primary drivers.&lt;/p&gt;

&lt;p&gt;The implication is practical. When a single-agent system plateaus on a complex task, the first question is whether it is context-bound, not whether the prompt needs more polish. A polished prompt cannot exceed the model’s working context. A multi-agent system, with separate context windows for each subagent, can. That is the mechanism, more than better instruction-following or any cleverness in the orchestrator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzttm5kkn1kf99wor2hue.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzttm5kkn1kf99wor2hue.jpg" alt="Abstract isometric data lattice showing concentrated data flow representing token overhead in multi-agent orchestration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main contribution of a multi-agent system to performance is parallel reasoning across more aggregate context than a single agent can hold. If the task fits inside one agent’s effective working window, the multiplier is rarely worth it. If the task genuinely needs more context than one agent can hold and the directions are independent, parallelism earns the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Orchestrator delegation is a four-part contract that prevents agentic drift
&lt;/h2&gt;

&lt;p&gt;The orchestrator-subagent split looks simple from a diagram and gets complicated in practice. Anthropic’s contract for each subagent: an objective, an output format, guidance on which tools and sources to use, and clear task boundaries. Miss any of the four and the subagent drifts — not because the model is poorly behaved, but because the orchestrator did not specify enough for it to know what done looks like.&lt;/p&gt;

&lt;p&gt;Effort-scaling is part of that contract. Anthropic’s prompts embed concrete rules: &lt;a href="https://www.anthropic.com/engineering/multi-agent-research-system" rel="noopener noreferrer"&gt;1 agent for simple fact-finding, 2 to 4 subagents for direct comparisons, and more than 10 subagents for complex research&lt;/a&gt;. Without rules like these, the lead agent over-scales — spinning up subagents for problems a single call could answer — and the cost multiplier compounds against you.&lt;/p&gt;
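
&lt;p&gt;A minimal Python sketch of the contract and the effort-scaling tiers, under stated assumptions: the names &lt;code&gt;SubagentTask&lt;/code&gt;, &lt;code&gt;missing_parts&lt;/code&gt;, and &lt;code&gt;agent_budget&lt;/code&gt; are hypothetical and do not come from Anthropic’s post, and the upper bound of 20 for complex research is an illustrative cap, since the post only says “more than 10.”&lt;/p&gt;

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SubagentTask:
    """Hypothetical container for the four-part delegation contract."""
    objective: str       # what the subagent must find or produce
    output_format: str   # the shape the lead agent expects back
    tool_guidance: str   # which tools and sources to prefer
    boundaries: str      # what is out of scope, and when to stop


def missing_parts(task: SubagentTask) -> list[str]:
    """Return the names of contract fields that are empty or whitespace."""
    return [f.name for f in fields(task) if not getattr(task, f.name).strip()]


def agent_budget(complexity: str) -> tuple[int, int]:
    """Map a task tier to a (min, max) agent count.

    The tiers follow the rules quoted above. The cap of 20 is an
    assumption; the source only says "more than 10".
    """
    tiers = {
        "fact_finding": (1, 1),
        "comparison": (2, 4),
        "complex_research": (11, 20),
    }
    if complexity not in tiers:
        raise ValueError(f"unknown complexity tier: {complexity!r}")
    return tiers[complexity]


task = SubagentTask(
    objective="Compare pricing of vendors A and B",
    output_format="Table with one row per plan and a source URL per row",
    tool_guidance="Prefer vendor pricing pages over blog posts",
    boundaries="",  # left empty on purpose to show the check
)
print(missing_parts(task))         # ['boundaries']
print(agent_budget("comparison"))  # (2, 4)
```

&lt;p&gt;Rejecting a delegation whose &lt;code&gt;missing_parts&lt;/code&gt; list is non-empty is one cheap way to catch an underspecified task before it costs tokens.&lt;/p&gt;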

&lt;p&gt;Tool ergonomics is the other load-bearing piece. The contract is only as good as the tool surface it points to. Anthropic ran a tool-testing agent that exercised flawed MCP tool descriptions, identified the failure patterns, and rewrote the descriptions; future agents using the rewritten tools cut task completion time by 40%. The orchestrator’s instructions assume the tools they describe behave the way the descriptions claim. When tool descriptions are vague or misleading, every downstream agent pays the tax.&lt;/p&gt;

&lt;p&gt;Order of operations: get the four-part contract right, embed effort-scaling rules in the orchestrator prompt, then audit your tool descriptions before iterating on anything else. The contract and the tools are upstream of every other lever.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context handling is external-memory-first, not bigger-context-first
&lt;/h2&gt;

&lt;p&gt;The instinct on context limits is usually to ask for a larger window. Anthropic’s architecture does the opposite. The lead researcher saves its plan to memory before context fills, because &lt;a href="https://www.anthropic.com/engineering/multi-agent-research-system" rel="noopener noreferrer"&gt;past 200,000 tokens the context window can be truncated&lt;/a&gt; and the plan needs to survive. The architectural choice is to externalize early, not to chase larger windows.&lt;/p&gt;

&lt;p&gt;The artifact pattern earns its place here. Instead of subagents reporting findings back through chat-style returns, which are long, lossy, and expensive on lead-agent tokens, they write to a shared filesystem and return a lightweight reference. The lead agent does not re-read every detail; it gets a pointer and pulls what it needs. The pattern is not unique to Anthropic, whose post implies it through the memory system. Practitioners across the industry have been calling it the artifact pattern, and it solves a specific failure mode: the game of telephone, where information loses fidelity each time it passes from subagent to lead.&lt;/p&gt;
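
&lt;p&gt;The write-then-reference flow can be sketched in a few lines of Python. This is an illustration, not Anthropic’s implementation: &lt;code&gt;write_artifact&lt;/code&gt; and &lt;code&gt;read_artifact&lt;/code&gt; are hypothetical helpers, a temporary directory stands in for the shared filesystem, and the findings are made-up sample data.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def write_artifact(store: Path, subagent_id: str, findings: dict) -> dict:
    """Persist full findings to the shared store; return only a small reference."""
    path = store / f"{subagent_id}.json"
    path.write_text(json.dumps(findings), encoding="utf-8")
    return {
        "subagent": subagent_id,
        "path": str(path),
        "summary": findings["summary"],  # short text the lead agent keeps in context
    }


def read_artifact(reference: dict, key: str):
    """Lead agent pulls one field on demand instead of re-reading everything."""
    data = json.loads(Path(reference["path"]).read_text(encoding="utf-8"))
    return data[key]


store = Path(tempfile.mkdtemp())  # stand-in for the shared filesystem
ref = write_artifact(
    store,
    "pricing-subagent",
    {
        "summary": "Vendor A is cheaper at low volume.",
        "sources": ["https://example.com/a"],
    },
)
sources = read_artifact(ref, "sources")  # fetched only when the lead agent needs it
```

&lt;p&gt;The reference carries the summary and a path, so the full findings never pass through the lead agent’s context unless it asks for them.&lt;/p&gt;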

&lt;p&gt;Fresh-context resets between sub-tasks are a deliberate design choice. If state lives outside the agents, the agents do not need to carry it in their context windows. “Bigger context” also stops being the answer to most context problems; the right move when an agent struggles with a long task is usually to externalize state and reset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea2npu2zg5e5xjqv47l1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea2npu2zg5e5xjqv47l1.jpg" alt="Two developers in a modern office looking at architecture diagrams, reflecting on multi-agent ai system design" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation grades outcomes, not the path the agent took
&lt;/h2&gt;

&lt;p&gt;Evaluation is where multi-agent systems get the strangest. The path the agent takes through a complex task is rarely the path you would have prescribed in advance. Anthropic’s guidance: “judge whether agents achieved the right outcomes while also following a reasonable process.” Outcomes are graded; paths are observed but not required to match a template.&lt;/p&gt;

&lt;p&gt;The mechanism most teams reach for is LLM-as-judge with a structured rubric — factual accuracy, citation accuracy, tool efficiency — producing a 0.0 to 1.0 score per output. The score does not substitute for human review; it scales review across thousands of runs without reading every trace by hand.&lt;/p&gt;
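
&lt;p&gt;A sketch of the scoring side, assuming the judge model has already returned one score per criterion. The judge call itself is out of scope here; the unweighted mean, the function names, and the 0.6 review threshold are illustrative assumptions, not values from Anthropic’s post.&lt;/p&gt;

```python
RUBRIC = ("factual_accuracy", "citation_accuracy", "tool_efficiency")


def overall_score(scores: dict[str, float]) -> float:
    """Combine per-criterion judge scores into one 0.0 to 1.0 value."""
    for name in RUBRIC:
        if name not in scores:
            raise ValueError(f"judge omitted criterion: {name}")
        if scores[name] > 1.0 or 0.0 > scores[name]:
            raise ValueError(f"{name} is outside 0.0 to 1.0: {scores[name]}")
    # Unweighted mean is an assumption; a real rubric may weight criteria.
    return sum(scores[name] for name in RUBRIC) / len(RUBRIC)


def needs_human_review(scores: dict[str, float], threshold: float = 0.6) -> bool:
    """Route low-scoring runs to a person; the threshold is illustrative."""
    return threshold > overall_score(scores)


judge_output = {"factual_accuracy": 1.0, "citation_accuracy": 0.5, "tool_efficiency": 0.75}
print(overall_score(judge_output))       # 0.75
print(needs_human_review(judge_output))  # False
```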

&lt;p&gt;For state-mutating agents, end-state evaluation is the cleaner framing. Ignore the path entirely. Compare the final environment state to the goal state. Did the document get written, the ticket get closed, the file get moved? If yes, the agent succeeded — even if the trace looks meandering. Letting the agent iterate over its own process tends to produce better runs than prescribing the process up front, because the right path is often not knowable in advance.&lt;/p&gt;
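
&lt;p&gt;End-state evaluation reduces to a comparison between two snapshots. A minimal sketch, assuming the environment can be summarized as a flat dictionary; the keys and helper names here are hypothetical.&lt;/p&gt;

```python
def state_mismatches(final_state: dict, goal_state: dict) -> dict:
    """Return {key: (expected, actual)} for every goal key the final state misses."""
    return {
        key: (expected, final_state.get(key))
        for key, expected in goal_state.items()
        if final_state.get(key) != expected
    }


def end_state_reached(final_state: dict, goal_state: dict) -> bool:
    """Succeed when the goal is met, however many steps the agent took."""
    return not state_mismatches(final_state, goal_state)


goal = {"ticket_status": "closed", "file_location": "archive/report.pdf"}
final = {"ticket_status": "closed", "file_location": "inbox/report.pdf", "steps_taken": 14}
print(end_state_reached(final, goal))  # False
print(state_mismatches(final, goal))   # {'file_location': ('archive/report.pdf', 'inbox/report.pdf')}
```

&lt;p&gt;Extra keys such as &lt;code&gt;steps_taken&lt;/code&gt; are ignored by design: the check only asks whether the goal holds, not how the agent got there.&lt;/p&gt;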

&lt;p&gt;Scoring is necessary but not sufficient. Production agents need traces, audit trails, and the ability to investigate a failure that scored well on the rubric but cost too much or used the wrong tool. The governance layer for production agents sits underneath evaluation, supplying the visibility scoring alone cannot provide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production constraints reshape the decisions the blueprint leaves to defaults
&lt;/h2&gt;

&lt;p&gt;The blueprint and production part company here. Anthropic’s research context has no fixed daily cost ceiling, no hard accuracy SLA, no sub-second response budget, no error-rate threshold tied to revenue. Most production systems have at least one, often all four. The architecture decisions a team makes under those pressures are not the decisions the blueprint defaults to.&lt;/p&gt;

&lt;p&gt;A few of the gaps the blueprint leaves to the reader:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-running state across sessions.&lt;/strong&gt; The Claude Research system is session-bounded. A research run starts and finishes. Production agents often need to operate across days or weeks: a content pipeline that watches for new briefs, an operations agent that monitors a system continuously, an integration agent that processes events as they arrive. State across sessions is a different problem than state within one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure cascades when a subagent fails mid-orchestration.&lt;/strong&gt; The blueprint describes the happy path. Production has to handle a subagent that times out, returns malformed output, hits a rate limit, or fails its tool call. The lead agent needs to know whether to retry, fail over, return a partial result, or abort the whole run, and that logic is not in the blueprint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model pinning.&lt;/strong&gt; Anthropic uses one model family throughout. Production teams often need a specific model version pinned for a specific job — partly for accuracy stability across runs, partly for cost control, partly because behavior changes between model versions can break workflows that depended on the old behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runaway-spend protection.&lt;/strong&gt; The 15x cost multiplier compounds quickly when something misbehaves. A subagent that recursively spawns or a tool that returns oversized results can burn through a daily budget in minutes. The blueprint does not address circuit breakers, budget caps, or per-run cost ceilings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful resumption.&lt;/strong&gt; When a long-running agent fails, restarting from scratch is wasteful. Checkpointing so the agent can resume from its last decision point, not its first, changes the cost economics of long jobs significantly. The blueprint mentions resumption in passing but does not treat it as a first-class architectural concern.&lt;/li&gt;
&lt;/ul&gt;
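
&lt;p&gt;Stateful resumption, the last gap in the list above, can be sketched as a checkpoint file that records each finished step. This is one possible shape under stated assumptions: &lt;code&gt;run_with_checkpoint&lt;/code&gt; is a hypothetical helper, the steps are plain callables standing in for agent calls, and the timeout is simulated.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_with_checkpoint(steps: list[tuple[str, Callable[[], str]]], checkpoint: Path) -> dict:
    """Run named steps in order, skipping any already recorded in the checkpoint."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name, step in steps:
        if name in done:
            continue  # finished in an earlier attempt; do not pay for it again
        done[name] = step()
        checkpoint.write_text(json.dumps(done))  # persist after every step
    return done


calls = []


def plan() -> str:
    calls.append("plan")
    return "plan v1"


def flaky_search() -> str:
    calls.append("search")
    if calls.count("search") == 1:
        raise TimeoutError("simulated subagent timeout")
    return "3 sources"


checkpoint = Path(tempfile.mkdtemp()) / "run.json"
steps = [("plan", plan), ("search", flaky_search)]
try:
    run_with_checkpoint(steps, checkpoint)
except TimeoutError:
    pass  # first attempt dies mid-run; the plan is already checkpointed
result = run_with_checkpoint(steps, checkpoint)  # resumes at "search"
```

&lt;p&gt;On the second attempt the planning step is not re-run, which is where the cost saving on long jobs comes from.&lt;/p&gt;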

&lt;p&gt;One example of how production pressures push toward different choices: in a content pipeline that runs autonomous agents end-to-end, fixed downstream crons were replaced with &lt;a href="https://fountaincity.tech/resources/blog/completion-triggered-orchestration-ai-pipeline/" rel="noopener noreferrer"&gt;completion-triggered orchestration&lt;/a&gt; so that downstream stages fire the moment the previous stage finishes, instead of waiting for a scheduled tick. That is not a choice the blueprint suggests, because the blueprint is not session-spanning; production constraints make it obvious. Different pressures, different decisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t4nfml6y3vwiws81iik.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t4nfml6y3vwiws81iik.jpg" alt="Technical art showing a central processing node protected by thick amber hexagonal shields, symbolizing runaway-spend protection in ai agent architecture" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The general pattern across these gaps: the blueprint optimizes for a single bounded run with a research outcome as the deliverable, while production systems usually optimize for repeated runs with reliability, predictable cost, and operational containment as the deliverables. Those are not opposing goals, but they push the architecture toward different shapes. A research system can afford to retry an entire run when something goes wrong; a production system that does that on every failure burns its budget and its SLA. A research system can afford to use the strongest available model throughout; a production system often pins a smaller model for the subagent tier because the cost difference compounds across thousands of calls per week.&lt;/p&gt;

&lt;p&gt;Read the blueprint as a high-quality reference architecture for the task class it targets. Treat the patterns as primitives (orchestrator delegation, parallel subagents, condensed-return artifacts, end-state evaluation) and let the production constraints you are actually operating under decide how those primitives compose. The architecture lives in the composition, with each pattern earning its place in context.&lt;/p&gt;

&lt;h2&gt;
  
  
  When not to go multi-agent, and the question that comes first
&lt;/h2&gt;

&lt;p&gt;Before “should I use a multi-agent architecture?” comes a different question: what job am I trying to remove from human supervision?&lt;/p&gt;

&lt;p&gt;Multi-agent systems earn their keep when they reduce work; they fail when they multiply things to manage. A team running a single agent that already does its job well does not need a multi-agent architecture; it needs a clearer success metric and maybe a better tool surface. A team that has identified a research-shaped problem with independent directions and budget headroom for the cost multiplier is in the right place for the pattern.&lt;/p&gt;

&lt;p&gt;A few heuristics for when single-agent or deterministic-workflow architectures are usually the right call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tightly-coupled context.&lt;/strong&gt; If every agent needs the same shared state and changes propagate across the system, the coordination cost will exceed the parallelism gain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential dependencies.&lt;/strong&gt; If step B requires step A’s output and step C requires step B’s output, you have a pipeline, not a parallel workload. A pipeline of small agents is usually simpler and cheaper than an orchestrator-subagent decomposition for the same work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic workflow surface.&lt;/strong&gt; If the steps are knowable in advance and the failure modes are predictable, a deterministic workflow with self-improvement scoped to skill optimization will be more reliable than a general-purpose agent picking between dozens of tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insufficient budget for the cost multiplier.&lt;/strong&gt; If the daily or per-run budget cannot absorb the token overhead, the architecture is the wrong tool for the budget.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For mid-market teams, complexity is its own failure mode. Every additional agent is another component to manage, debug, monitor, and pay for. Lower-order simple agents nested inside larger loops often produce better outcomes than a general-purpose multi-agent system trying to do everything. The mistake to avoid is adding agents because the architecture diagram looks impressive; the goal is to remove jobs from human supervision, never to create more agents for a human to supervise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtnsb7gwtal3i3249n8e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtnsb7gwtal3i3249n8e.jpg" alt="Lattice wireframe fountain structure with emerald and amber nodes cascading downward, representing fluid data pathways" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A sharper question than “single or multi”: if I did not need to supervise this work, and the agent did it as well as or better than a person doing it today, what would that unlock? When the answer is concrete (a person freed up for higher-value work, a process that runs overnight, a backlog that clears without intervention), the architecture that earns its keep is the one that delivers that outcome with the fewest moving parts. The shape of the answer often points at &lt;a href="https://fountaincity.tech/resources/blog/level-5-ai-maturity-goal-directed-autonomous-agents/" rel="noopener noreferrer"&gt;where you are on the autonomy spectrum&lt;/a&gt; and what the next step is.&lt;/p&gt;

&lt;p&gt;Anthropic’s blueprint documents one such point well. For any team adopting it, the work is to know which pressure the system is being built under, and to let that pressure shape the architecture that emerges. Same patterns, different production constraints, different decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Anthropic’s multi-agent research system?
&lt;/h3&gt;

&lt;p&gt;Anthropic’s multi-agent research system, used in their Claude Research product, is an orchestrator-subagent architecture for breadth-first research. A lead agent plans the research approach and saves its plan to memory; it then spins up parallel subagents to explore independent directions, each with its own context window and tool access. Subagents return condensed findings, often via a shared memory store rather than long chat-style returns, and the lead agent reconciles them into a final answer with citations. On Anthropic’s internal research eval, this setup outperformed a single Claude Opus 4 agent by 90.2%.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the orchestrator-subagent (orchestrator-worker) pattern?
&lt;/h3&gt;

&lt;p&gt;The orchestrator-subagent pattern, sometimes called orchestrator-worker, is a multi-agent design where one agent decomposes a task and delegates pieces of it to other agents. The orchestrator does not do the work itself; it plans, dispatches, and integrates results. Each subagent receives an objective, an output format, guidance on which tools and sources to use, and clear task boundaries. The pattern fits tasks that decompose naturally into independent directions and where parallel exploration is faster than sequential execution. It does not fit tasks with tightly-coupled context or heavy dependencies between subagents.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use a multi-agent architecture vs. a single agent?
&lt;/h3&gt;

&lt;p&gt;Use multi-agent when the task is breadth-first, the directions are independent, the aggregate context exceeds what a single agent can hold, and the budget can absorb the cost multiplier. Use single-agent when the task fits inside one context window, when steps are sequential, when the workflow is deterministic enough to specify, or when the budget is tight. The blueprint itself flags shared-context and high-dependency domains as poor fits for multi-agent. Most production tasks land closer to single-agent or deterministic-pipeline shapes than to research-style multi-agent shapes.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Anthropic’s multi-agent system handle context limits?
&lt;/h3&gt;

&lt;p&gt;Anthropic’s system handles context limits by externalizing state to memory rather than chasing larger context windows. The lead researcher saves its plan to memory before context fills, because the context window can be truncated past a certain length. Subagents write findings to a shared filesystem and return lightweight references — the artifact pattern — so the lead agent does not re-read every detail through chat-style returns. Fresh-context resets between sub-tasks are part of the same strategy: state lives outside the agents, so agents can reset without losing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much more expensive is a multi-agent system than a single agent?
&lt;/h3&gt;

&lt;p&gt;Anthropic reports that multi-agent systems use roughly 15x more tokens than a chat conversation on the same surface task. The multiplier is the cost of running parallel subagents with their own context windows and tool calls. If the task is breadth-first and decomposes into independent directions, the multiplier buys parallelism that exceeds a single context window. If the task does not decompose, you pay the multiplier without earning it. Production teams often add cost circuit breakers and per-run budget caps because the multiplier compounds quickly when something misbehaves.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does Anthropic’s blueprint not cover about production agent systems?
&lt;/h3&gt;

&lt;p&gt;The blueprint focuses on session-bounded research and leaves several production concerns to the reader: long-running state across days or weeks, failure cascades when a subagent fails mid-orchestration, multi-model pinning for accuracy stability and cost control, runaway-spend protection through circuit breakers and budget caps, and stateful resumption from a checkpoint instead of a full restart. These are not flaws in the blueprint; they are concerns that emerge when the same patterns are applied under production constraints — cost ceilings, accuracy SLAs, speed budgets, error rates — that the research context does not impose.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building autonomous agent systems under production constraints is the work we do every day. If you’re evaluating multi-agent architecture for a real job and want a practitioner’s view on where the patterns earn their keep, our &lt;a href="https://fountaincity.tech/services/managed-autonomous-ai-agents/" rel="noopener noreferrer"&gt;managed autonomous AI agents&lt;/a&gt; service is the closest place to start.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AI Agent Deployment: The Operational Decision at Each Stage</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Fri, 08 May 2026 18:07:21 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/ai-agent-deployment-the-operational-decision-at-each-stage-5cn1</link>
      <guid>https://dev.to/sebastian_chedal/ai-agent-deployment-the-operational-decision-at-each-stage-5cn1</guid>
      <description>&lt;p&gt;Most teams running an AI agent pilot are being asked the same question right now: what do we build next? The published guidance is a stack of vendor maturity models that name the stages without naming the decisions inside them, and the team ends up debating models, prompts, and platforms while the pilot stalls.&lt;/p&gt;

&lt;p&gt;A March 2026 &lt;a href="https://www.digitalapplied.com/blog/ai-agent-scaling-gap-march-2026-pilot-to-production" rel="noopener noreferrer"&gt;Digital Applied&lt;/a&gt; survey found that 78% of surveyed enterprises had at least one agent pilot running and only 14% had scaled an agent to production-grade, organization-wide operation.&lt;/p&gt;

&lt;p&gt;The same dataset surfaced something that reframes the problem: organizations with production-scale deployments did not have larger AI budgets than the organizations whose pilots stalled. They allocated the budget differently. Less on model selection and prompt engineering, more on evaluation infrastructure, monitoring tooling, and operational staffing. The teams that crossed into production reallocated. They did not outspend.&lt;/p&gt;

&lt;p&gt;That finding changes what the deployment stages are for. Each stage has one operational decision that either reinforces the misallocation or breaks it. Get the decision right and the next stage gets cheaper. Get it wrong and you spend the next quarter rediscovering the same problems at higher volume.&lt;/p&gt;

&lt;p&gt;This article walks the four operational decisions: workflow scope at pilot, monitoring placement at single-agent production, shared-state ownership at multi-agent coordination, and completion triggers at autonomous orchestration. It also covers the shape of governance cost across the stages, when to stay one stage longer, and the mechanism we run at each stage in our own production pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-06-J-ai-agent-deployment-operational-decisions-02.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F05%2F2026-05-06-J-ai-agent-deployment-operational-decisions-02.svg" alt="Four AI agent deployment stages diagram — Pilot, Single Agent, Multi-Agent, and Orchestration with operational decisions and governance layers" width="100" height="67.3076923076923"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The deployment problem is mostly an allocation problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q7mtm6glap2k9emlpdu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q7mtm6glap2k9emlpdu.jpg" alt="Business professional reviewing data analytics dashboard showing budget allocation metrics in a modern office environment" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Digital Applied survey is the first dataset we have seen that quantifies what production-scale AI agent teams did differently. It is not what most vendor decks would predict. The teams that made it across had comparable AI budgets to the teams that stalled. The difference was where the dollars went.&lt;/p&gt;

&lt;p&gt;The blocking factors stalled organizations cited are mostly operational, not modeling. Output quality at volume, monitoring and observability, and organizational ownership are all the work that happens after a model is chosen, after a prompt is tuned, after the demo is approved. The single most-cited operational gap was monitoring and observability, named by 54% of stalled organizations as a blocking factor. That figure shows up again in the Dynatrace work cited later, and it is the one to anchor on: more than half of stalled deployments cannot see what their agents are doing.&lt;/p&gt;

&lt;p&gt;The misallocation pattern is recognizable. A team finishes a successful pilot. The next quarter’s budget conversation centers on which model to upgrade to, which prompt strategy to standardize on, which platform to consolidate on. The evaluation harness, the monitoring layer, and the operational headcount are deferred to “after we get the architecture right.” By the time the architecture is settled, the budget for the deferred work is gone, and the agents are running in production without the operational scaffolding they need to scale.&lt;/p&gt;

&lt;p&gt;Each of the four deployment stages has one decision that breaks this pattern. Each decision puts a load-bearing piece of operational scaffolding in place before the misallocation can compound. The decisions are not abstract. We have made each of them in our own production agent pipeline, watched the failure modes when we got each one wrong, and rebuilt accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pilot stage: the decision is workflow scope
&lt;/h2&gt;

&lt;p&gt;Most pilots are scoped for demo appeal. Someone picks a workflow that will produce a compelling video, the team ships an agent that handles the happy path, and the pilot is declared a success. Then production handoff begins, and integration complexity, the most-cited scaling gap in the Digital Applied data, surfaces all at once. The pilot was never scoped to the messy edges of the workflow it claimed to automate.&lt;/p&gt;

&lt;p&gt;The pilot decision is workflow scope. Scope governs every downstream cost. Pick a workflow with a clean input boundary, a measurable success metric, and a defined incident response, and the next three stages inherit a workable foundation. Pick a workflow that looks good in a slide deck, and you are paying for that scope decision for a year.&lt;/p&gt;

&lt;p&gt;The mechanism is to define exit criteria at pilot start, not at production handoff. Three concrete criteria, written down before the agent runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task volume threshold.&lt;/strong&gt; What rate of work does the agent need to handle to be worth running in production? If the answer is “we will figure it out,” the pilot is not scoped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality measurement.&lt;/strong&gt; What does a wrong answer look like, and how is it caught? The answer cannot be “the user will tell us.” Production agents cost money per run; you need a quality signal that does not depend on a human checking every output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response.&lt;/strong&gt; When the agent fails, what happens? Who gets paged? What runs in its place? “We will roll back” is not a plan if the agent is the only thing producing the work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the pilot cannot answer those three questions, the next stage is going to be operational firefighting. It is worth pairing this stage with an honest &lt;a href="https://fountaincity.tech/resources/blog/ai-readiness-evaluation/" rel="noopener noreferrer"&gt;AI readiness evaluation&lt;/a&gt; across data, governance, and culture before you commit to scaling the agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z5z5buonq86zvo5wmct.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z5z5buonq86zvo5wmct.jpg" alt="Single white AI robot at a workstation — representing a solo AI agent in a pilot deployment" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Single-agent production: the decision is monitoring placement
&lt;/h2&gt;

&lt;p&gt;The pilot’s quality gate was a human in the loop. Production needs a different gate, and “we will add observability later” is the dominant failure pattern at this stage. A separate &lt;a href="https://www.dynatrace.com/news/blog/agentic-ai-report-new-observability-strategy/" rel="noopener noreferrer"&gt;Dynatrace survey&lt;/a&gt; reports that a substantial share of leaders still rely on manual methods to monitor agent interactions — not an artifact of small deployments, but the operating reality of organizations that already have agents in production.&lt;/p&gt;

&lt;p&gt;The single-agent production decision is monitoring placement. It has to be set before the agent goes live, not bolted on after the first incident. Three layers belong in place at deploy time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traces.&lt;/strong&gt; Every agent run produces a structured trace: inputs, tool calls, outputs, duration, cost. Without traces, you cannot diagnose a failure that did not raise an exception.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation harness.&lt;/strong&gt; A reference set of inputs and expected behaviors that runs before any change to the prompt, the model, or the tooling. Without an eval harness, every change is a guess.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost circuit breaker.&lt;/strong&gt; A spending threshold that alerts at one level and halts the agent at another. Agents fail in directions that traditional monitoring does not catch. They keep running, just badly and expensively. Our own production pipeline holds to a predictable daily AI infrastructure baseline only because the cost-defense layers were built before the agents were turned on, not after the first runaway.&lt;/li&gt;
&lt;/ul&gt;
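
&lt;p&gt;The two-threshold behavior of the cost circuit breaker can be sketched as follows. This is a simplified illustration, not the production implementation described in this article: the class name, the dollar thresholds, and the use of &lt;code&gt;print&lt;/code&gt; as the alert channel are all assumptions.&lt;/p&gt;

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend reaches the halt threshold."""


class CostCircuitBreaker:
    def __init__(self, alert_usd: float, halt_usd: float):
        if alert_usd >= halt_usd:
            raise ValueError("alert threshold must be below halt threshold")
        self.alert_usd = alert_usd
        self.halt_usd = halt_usd
        self.spent_usd = 0.0
        self.alerted = False

    def record(self, cost_usd: float) -> None:
        """Add one run's cost; alert once, then halt at the hard ceiling."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.halt_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} USD, cap is {self.halt_usd:.2f} USD"
            )
        if self.spent_usd >= self.alert_usd and not self.alerted:
            self.alerted = True
            # Stand-in for a real pager or chat notification.
            print(f"ALERT: spend at {self.spent_usd:.2f} USD")


breaker = CostCircuitBreaker(alert_usd=5.0, halt_usd=10.0)
breaker.record(4.0)  # below both thresholds
breaker.record(2.0)  # crosses the alert level
try:
    breaker.record(5.0)  # crosses the halt level
except BudgetExceeded as exc:
    halted_reason = str(exc)
```

&lt;p&gt;Calling &lt;code&gt;record&lt;/code&gt; after every agent run, and letting the exception stop the loop, is what turns a dashboard number into an actual guard.&lt;/p&gt;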

&lt;p&gt;The order matters. Traces are the diagnostic substrate. The evaluation harness sits on top, using traces to score behavior. The cost circuit breaker is the last-resort guard for the failure modes that the evaluation harness does not catch in time. Build them in that order, and the next stage, multi-agent coordination, has the diagnostic data it needs. Skip the order, and the next stage is debugged from log files. The per-layer architecture is in the &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-cost-circuit-breaker/" rel="noopener noreferrer"&gt;cost circuit breaker article&lt;/a&gt;. The circuit breaker is the one piece of single-agent infrastructure we would not deploy without.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tv62e1io5oacf5phv4f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tv62e1io5oacf5phv4f.jpg" alt="Business professional monitoring AI agent system performance at a multi-screen workstation with observability dashboards" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-agent coordination: the decision is shared-state ownership
&lt;/h2&gt;

&lt;p&gt;Multi-agent failures look different from single-agent failures. They are not crashes. They are agents stepping on each other’s work, losing track of items in flight, and producing results that contradict each other because each agent inferred the state of the system from a different source. The loss is operational drift rather than catastrophic failure, which is harder to detect.&lt;/p&gt;

&lt;p&gt;The multi-agent decision is shared-state ownership. Most of these failures trace to a single cause: agents are assumed to be isolated when they are context-coupled. They touch the same work, but no one named the canonical source of truth.&lt;/p&gt;

&lt;p&gt;The mechanism is to name one explicit state owner for each piece of shared context, and require every agent to read and write through it. A file, a table, a queue, a database row: the form does not matter. What matters is that there is one place where the system’s state lives, and no agent infers state from another agent’s output.&lt;/p&gt;

&lt;p&gt;In our own pipeline, the canonical state lives in two structured files: one tracks the production status of every content item, and the other tracks topic-level metadata across the inventory. Every agent in the pipeline reads from those files at the start of its work and writes to them at the end. No agent guesses where the work is by reading another agent’s draft. That single architectural decision, a named state owner, eliminated an entire class of failure that had been showing up as “missing items” and “duplicate work” before we made it. The broader pipeline architecture is documented &lt;a href="https://fountaincity.tech/resources/blog/inside-autonomous-ai-content-pipeline/" rel="noopener noreferrer"&gt;in detail&lt;/a&gt;, but the load-bearing decision at this stage is the state-ownership one, not the pipeline shape.&lt;/p&gt;
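
&lt;p&gt;The pattern can be sketched in a few lines, assuming a single JSON file as the state owner. &lt;code&gt;StateStore&lt;/code&gt;, &lt;code&gt;claim&lt;/code&gt;, and the item and agent names are hypothetical, and the sketch is not safe under concurrent writers; a real deployment would add file locking or use a database transaction.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


class StateStore:
    """Single owner of pipeline state; agents never infer status elsewhere."""

    def __init__(self, path: Path) -> None:
        self.path = path
        if not self.path.exists():
            self.path.write_text("{}")

    def read(self) -> dict:
        return json.loads(self.path.read_text())

    def claim(self, item_id: str, agent: str, status: str) -> bool:
        """Record that `agent` moved `item_id` to `status`.

        Returns False when another agent still holds the item, which is
        the duplicate-work case a named state owner is meant to prevent.
        """
        state = self.read()
        current = state.get(item_id)
        held_by_other = (
            current is not None
            and current["owner"] != agent
            and current["status"] != "done"
        )
        if held_by_other:
            return False
        state[item_id] = {"owner": agent, "status": status}
        self.path.write_text(json.dumps(state, indent=2))
        return True


store = StateStore(Path(tempfile.mkdtemp()) / "status.json")
first = store.claim("post-42", agent="researcher", status="in_progress")
second = store.claim("post-42", agent="writer", status="in_progress")
store.claim("post-42", agent="researcher", status="done")
third = store.claim("post-42", agent="writer", status="in_progress")
```

&lt;p&gt;The writer’s first claim is refused because the researcher still holds the item; it succeeds only after the researcher marks the item done. Neither agent ever reads the other’s draft to work out where things stand.&lt;/p&gt;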

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7obph18k2zkypvtiddms.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7obph18k2zkypvtiddms.jpg" alt="Two white AI robots at adjacent workstations coordinating tasks — representing multi-agent AI deployment" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason this works: shared state is the point at which a multi-agent system becomes either a coordinated team or a set of agents producing parallel, inconsistent outputs. The investment goes into one well-designed shared structure, not into many ad-hoc handoffs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Autonomous orchestration: replace fixed schedules with completion triggers
&lt;/h2&gt;

&lt;p&gt;By the time a system has multiple agents in production, the orchestration layer becomes the bottleneck. Variable-duration AI work breaks fixed-schedule orchestration. The symptom is items waiting between stages: a research stage finishes at 11:14am, but the writing stage runs at noon, so the item sits for 46 minutes for no operational reason. Multiply that across a dozen stages and the lag compounds.&lt;/p&gt;

&lt;p&gt;The autonomous orchestration decision is to move from fixed schedules to completion triggers. Only the entry point of the pipeline runs on a clock. Every downstream stage fires when the previous stage signals completion. The plumbing is straightforward: a stage finishes, writes its output, and calls the next stage.&lt;/p&gt;
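
&lt;p&gt;A minimal in-process sketch of that plumbing, assuming each stage is a plain function. The &lt;code&gt;Stage&lt;/code&gt; class and the stage names are hypothetical; in a real pipeline the trigger would more likely be a queue message or a webhook than a direct call.&lt;/p&gt;

```python
from typing import Callable, Optional


class Stage:
    """A pipeline stage that triggers its successor when it completes."""

    def __init__(self, name: str, work: Callable[[str], str]) -> None:
        self.name = name
        self.work = work
        self.next_stage: Optional["Stage"] = None

    def then(self, stage: "Stage") -> "Stage":
        """Register the stage to fire on completion; returns it for chaining."""
        self.next_stage = stage
        return stage

    def run(self, item: str, log: list) -> str:
        output = self.work(item)         # do the work
        log.append(self.name)            # record completion
        if self.next_stage is not None:  # completion trigger, no clock involved
            return self.next_stage.run(output, log)
        return output


research = Stage("research", lambda topic: topic + " / notes")
write = Stage("write", lambda notes: notes + " / draft")
review = Stage("review", lambda draft: draft + " / reviewed")
research.then(write).then(review)

log = []
result = research.run("agent memory", log)
```

&lt;p&gt;Only &lt;code&gt;research.run&lt;/code&gt; would be called from a scheduler. Everything downstream fires as soon as the stage before it finishes, so no item waits for the next clock tick.&lt;/p&gt;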

&lt;p&gt;The numbers are concrete. Under our previous fixed-schedule design, a piece of work that could move through the pipeline in two to three hours was taking six to twelve. After replacing the fixed crons with completion triggers, the two-to-three-hour window held. The full design and the failure modes that drove it are in the &lt;a href="https://fountaincity.tech/resources/blog/completion-triggered-orchestration-ai-pipeline/" rel="noopener noreferrer"&gt;completion-triggered orchestration piece&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One caveat that matters more than the orchestration win itself: completion triggers compound failures faster than fixed schedules do. A bug in stage three under fixed scheduling waits until tomorrow’s run to surface. A bug under completion triggering fires the next stage immediately, which fires the next, which can produce a cascade of bad outputs in minutes. So this stage’s decision has a dependent decision attached: pair completion triggers with anti-loop guards, retry caps, and the cost circuit breaker from the single-agent stage. The orchestration speed-up is real. So is the failure speed-up. Both have to be designed for at the same time.&lt;/p&gt;
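
&lt;p&gt;One way to sketch the retry cap and the anti-loop guard together, assuming every stage execution is counted per item. &lt;code&gt;TriggerGuard&lt;/code&gt;, &lt;code&gt;StageFailure&lt;/code&gt;, and the limits are hypothetical names and values chosen for illustration.&lt;/p&gt;

```python
class StageFailure(Exception):
    """Stand-in for a transient stage error worth retrying."""


class LoopGuardError(RuntimeError):
    """Raised when an item exceeds its allowed number of stage executions."""


class TriggerGuard:
    def __init__(self, max_runs_per_item: int, max_retries: int) -> None:
        self.max_runs_per_item = max_runs_per_item
        self.max_retries = max_retries
        self.runs = {}  # item_id -> stage executions so far

    def run_stage(self, item_id, stage):
        """Run `stage` for `item_id`, retrying failures up to the cap."""
        attempts = 0
        while True:
            count = self.runs.get(item_id, 0) + 1
            if count > self.max_runs_per_item:
                raise LoopGuardError(f"{item_id}: run budget exhausted")
            self.runs[item_id] = count
            try:
                return stage(item_id)
            except StageFailure:
                attempts += 1
                if attempts > self.max_retries:
                    raise


calls = {"n": 0}


def flaky(item_id):
    """Fails twice, then succeeds, to exercise the retry cap."""
    calls["n"] += 1
    if calls["n"] > 2:
        return "ok"
    raise StageFailure(item_id)


guard = TriggerGuard(max_runs_per_item=5, max_retries=2)
result = guard.run_stage("item-1", flaky)

tripped = False
try:
    for _ in range(10):  # simulate stages re-triggering each other
        guard.run_stage("item-1", lambda item_id: "ok")
except LoopGuardError:
    tripped = True
```

&lt;p&gt;The flaky stage succeeds on its third attempt, inside the retry cap. The simulated loop is then cut off once the item has used its run budget, which is the behavior that keeps a cascade from running for minutes unchecked.&lt;/p&gt;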

&lt;h2&gt;
  
  
  The cost of governance is per-stage, and the curve is steeper than vendors imply
&lt;/h2&gt;

&lt;p&gt;Governance dollars do not scale linearly across the four stages. They scale by what the stage requires you to monitor. A single-agent production system needs evaluation and alerting. A multi-agent system adds shared-state audit and per-agent identity. An autonomous orchestration system adds completion-trigger guards, recovery infrastructure, and an anti-loop layer.&lt;/p&gt;

&lt;p&gt;The shape matters more than the dollar figure. Our own ranges are useful as a reference example, with the caveat that your numbers will differ based on agent count, workload, and model mix: across nine production agents and sixty-two scheduled jobs at the autonomous-orchestration stage, our daily AI infrastructure cost runs roughly $15 to $20. That is operational AI infrastructure cost only. It is not the full cost of running the system.&lt;/p&gt;

&lt;p&gt;What the curve looks like, by stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-agent production.&lt;/strong&gt; Evaluation harness, alerting, traces, cost circuit breaker. The cost is mostly tooling and the operational time to maintain reference sets and tune thresholds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent coordination.&lt;/strong&gt; Add shared-state audit and per-agent identity. The identity-visibility gap that surveys keep surfacing is theoretical until the multi-agent stage; once two agents share work, it becomes operational.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous orchestration.&lt;/strong&gt; Add completion-trigger guards, recovery crons, and per-stage cost limits. This is where agents can do the most damage in the shortest time, and the governance investment reflects that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The allocation thesis applies again here. Governance dollars belong in evaluation, monitoring, and identity. They do not belong in picking a different model. The per-control breakdown is in the &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-governance-practitioners-guide/" rel="noopener noreferrer"&gt;agent governance practitioners guide&lt;/a&gt;, mapped to the production stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Most teams should stay one stage longer than the vendor pitch implies
&lt;/h2&gt;

&lt;p&gt;Vendors are selling autonomy. Most organizations are mid-curve and are being pushed forward before the decisions at their current stage are settled. Published survey data consistently puts enterprise-wide mature adoption at a small minority of the field; the much larger group has shipped some agents but has not finished the operational scaffolding around them.&lt;/p&gt;

&lt;p&gt;Staying longer at a stage is not stalling. It is finishing the operational decision at the current stage before adding the next layer of failure modes. A team that has not settled monitoring placement at single-agent production will find the multi-agent stage harder, not easier. A team that has not named shared-state ownership in multi-agent will find autonomous orchestration produces faster cascades, not faster work.&lt;/p&gt;

&lt;p&gt;The question worth asking at the end of a quarter is not “are we ready for the next stage?” It is “have we settled the operational decision at the current stage?” If the answer is no, the next stage is going to be debugged on top of an unsettled one, and the cost of that compound failure shows up later as the kind of stall that the survey data is measuring.&lt;/p&gt;

&lt;p&gt;This is also where the conceptual maturity layer lives. &lt;a href="https://fountaincity.tech/resources/blog/level-5-ai-maturity-goal-directed-autonomous-agents/" rel="noopener noreferrer"&gt;The five levels of AI maturity&lt;/a&gt; name what each level looks like. The four operational decisions in this article name what to build at each level so the next one becomes possible. The two layers are companions, not duplicates. The decisions in this article are the work an organization has to do to actually move up the maturity curve, rather than just describe where it currently sits on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhg9sb07u0gx9eepgzbn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhg9sb07u0gx9eepgzbn.jpg" alt="AI robot in a vast server room corridor representing autonomous orchestration — AI agent deployment at production scale" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go from here
&lt;/h2&gt;

&lt;p&gt;If you have a working pilot, the next operational decision is not which model to upgrade to. It is which workflow to harden, where to place monitoring before the agent goes live, who owns shared state when two agents touch the same work, and how to replace fixed schedules with completion triggers when orchestration starts to drag. Those four decisions, made deliberately, are what the production-scale teams in the Digital Applied survey did with their reallocated budgets.&lt;/p&gt;

&lt;p&gt;If you want a partner who has already made each decision in a running production system and can build the infrastructure for your team, our &lt;a href="https://fountaincity.tech/services/managed-autonomous-ai-agents/" rel="noopener noreferrer"&gt;managed autonomous AI agents&lt;/a&gt; service runs the full operational stack: evaluation, monitoring, shared-state, orchestration, and governance, at a published price. The decisions are the same whether we run them or you do. The article above is the framework. The service is the implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I know when my AI agent pilot is ready to move to production?
&lt;/h3&gt;

&lt;p&gt;The pilot is ready when three exit criteria are met: the agent reliably handles a defined task volume, there is a quality measurement that does not depend on a human reviewing every output, and there is a defined incident response when the agent fails. If any of those is missing, production handoff will surface the gap as an integration failure rather than a pilot finding. Production-scale teams in the Digital Applied data wrote those criteria at pilot start, not at handoff.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the operational difference between single-agent and multi-agent deployment?
&lt;/h3&gt;

&lt;p&gt;A single agent fails in ways you can instrument at the agent itself: error rates, latency, output quality, and spend. Multi-agent systems fail through coordination drift. Agents lose track of each other’s work, step on each other, or produce inconsistent outputs because each inferred the state of the system differently. The operational shift is from instrumenting the agent to instrumenting the shared state the agents read and write through. If you cannot point to one canonical state owner that every agent uses, you are running multiple agents, not a multi-agent system.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does AI agent governance actually cost at each stage?
&lt;/h3&gt;

&lt;p&gt;The shape is more useful than the figure. At single-agent production, governance is tooling and operational time for evaluation and alerting. At multi-agent it adds shared-state audit and per-agent identity — closing the visibility and containment gap that &lt;a href="https://cloudsecurityalliance.org/" rel="noopener noreferrer"&gt;Cloud Security Alliance research&lt;/a&gt; has documented across organizations running agents. At autonomous orchestration it adds completion-trigger guards and recovery infrastructure. The curve, with costs concentrated in evaluation, monitoring, and identity rather than in model and prompt, is the part that generalizes across teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I scale AI agents without ballooning ongoing costs?
&lt;/h3&gt;

&lt;p&gt;Build the cost defense before the agents go live, not after the first runaway. That means daily and per-job spending limits, alerting thresholds set lower than halt thresholds, and an evaluation harness that catches behavioral drift before it shows up as a budget overrun. Cloud Security Alliance research found that 92% of organizations lack full visibility into AI agent identities, and most doubt they could detect or contain a compromised agent. That visibility deficit is what makes runaway costs expensive to catch later. Build identity, audit, and cost-defense into the deploy step. Our daily AI infrastructure cost has stayed in a predictable range as we have added agents and jobs because the limits were in place before the volume was.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I add a recovery or anti-loop layer to my agent system?
&lt;/h3&gt;

&lt;p&gt;At the autonomous orchestration stage, before the first completion-triggered run. Completion triggers move work faster, and they also propagate failures faster. A recovery layer of retry caps, anti-loop guards, and cost ceilings tied to the per-stage budget is the dependent decision that has to ship with completion triggering, not after it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do most AI agent pilots never reach production?
&lt;/h3&gt;

&lt;p&gt;The Digital Applied survey found that pilots stall within months on average. The blocking factors named (integration complexity, output quality at volume, monitoring deficit, organizational ownership, domain training data) are consistent with pilots scoped for demo appeal rather than for a workflow with measurable success criteria, scaled into production without monitoring placement decided, and operated without a clear shared-state owner. Each of those is the absence of a decision at the corresponding stage. The cumulative result is the pre-production failure rate that maturity-model coverage keeps surfacing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>business</category>
    </item>
    <item>
      <title>Agent Memory &amp; Knowledge Systems Compared (2026 Guide)</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Mon, 04 May 2026 18:07:06 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/agent-memory-038-knowledge-systems-compared-2026-guide-568p</link>
      <guid>https://dev.to/sebastian_chedal/agent-memory-038-knowledge-systems-compared-2026-guide-568p</guid>
      <description>&lt;p&gt;Most companies deploying AI agents hit the same wall about two months in: the agent forgets everything between sessions, can’t read the company’s actual knowledge (strategy docs, pricing logic, customer notes), and has no clean way to write what it learns back to the team’s knowledge base for human review. The toolkit for solving this is strong, but the question that matters for a mid-market team is different from the question developers ask. It isn’t “which API surface is cleanest.” It’s “how does a company actually maintain its knowledge, feed it to agents, let agents add to it, and keep humans in the loop?”&lt;/p&gt;

&lt;p&gt;As of April 2026, there are five named systems worth comparing (Mem0, Zep, Letta, Cognee, and Cloudflare Agent Memory) plus a sixth path: maintaining knowledge as plain markdown and giving agents read/write access through a semantic search index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The five questions to ask before you pick a memory system&lt;/li&gt;
&lt;li&gt;What’s off the shelf in 2026 — and what you can build yourself&lt;/li&gt;
&lt;li&gt;Mem0, Zep, Letta, Cognee, and Cloudflare Agent Memory, compared on the same scaffolding&lt;/li&gt;
&lt;li&gt;The markdown-vault path nobody else writes about&lt;/li&gt;
&lt;li&gt;A 4-step workflow for letting agents propose knowledge updates that humans review&lt;/li&gt;
&lt;li&gt;A decision framework matched to mid-market deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Bidirectional Sync&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mem0&lt;/td&gt;
&lt;td&gt;Vector + graph + KV&lt;/td&gt;
&lt;td&gt;Apache 2.0 / managed&lt;/td&gt;
&lt;td&gt;Partial (API only)&lt;/td&gt;
&lt;td&gt;Personalization, returning end-users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zep / Graphiti&lt;/td&gt;
&lt;td&gt;Temporal knowledge graph&lt;/td&gt;
&lt;td&gt;Open source / managed&lt;/td&gt;
&lt;td&gt;Partial (API only)&lt;/td&gt;
&lt;td&gt;Entity + time queries, CRM agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Letta&lt;/td&gt;
&lt;td&gt;Tiered RAM/disk (agent-managed)&lt;/td&gt;
&lt;td&gt;Apache 2.0 / managed&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Long-horizon agents, unlimited memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cognee&lt;/td&gt;
&lt;td&gt;Vector + knowledge graph from docs&lt;/td&gt;
&lt;td&gt;Open core / managed&lt;/td&gt;
&lt;td&gt;Partial (doc curation)&lt;/td&gt;
&lt;td&gt;Unstructured document ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare Agent Memory&lt;/td&gt;
&lt;td&gt;Typed (Facts/Events/Instructions/Tasks)&lt;/td&gt;
&lt;td&gt;Managed only (private beta)&lt;/td&gt;
&lt;td&gt;Partial (shared profiles)&lt;/td&gt;
&lt;td&gt;Teams already on Cloudflare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown vault + search&lt;/td&gt;
&lt;td&gt;Files + semantic index&lt;/td&gt;
&lt;td&gt;Infrastructure cost only&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Strong&lt;/strong&gt; (humans edit directly)&lt;/td&gt;
&lt;td&gt;Full ownership, humans as first-class authors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The memory problem every mid-market deployment hits in month two
&lt;/h2&gt;

&lt;p&gt;The first month of an agent deployment usually goes fine. Then three things start happening at once.&lt;/p&gt;

&lt;p&gt;First, the session reset. The agent forgets yesterday’s conversation and the user re-explains context every time. By week three, people are typing the same paragraph of background into the prompt every morning.&lt;/p&gt;

&lt;p&gt;Second, the knowledge gap. The agent doesn’t know the company’s pricing logic, brand voice rules, approved vendor list, or customer service notes. Those documents live in Notion, Obsidian, Google Drive, an internal wiki, or scattered Slack threads. The agent has no path to any of them.&lt;/p&gt;

&lt;p&gt;Third, the learning leak. The agent figures something out during a session (a customer preference, a corrected spec, a new policy detail) and the moment the session ends, that learning is gone.&lt;/p&gt;

&lt;p&gt;These three failures are usually framed as a context-window problem. They aren’t. They’re an organizational-knowledge problem. The question is not “how does the agent’s brain hold more information,” it is “where does the company’s knowledge live, who maintains it, and how does the agent participate in that loop without quietly rewriting things humans haven’t reviewed?” Every system below is a different answer to that question.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five questions to ask before you pick a memory system
&lt;/h2&gt;

&lt;p&gt;A buyer needs a self-diagnostic, a short list of questions to score against any candidate. Five questions cover the field:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Context management.&lt;/strong&gt; How does the agent decide what fits in its working memory right now? Some systems keep the last N messages, some retrieve relevant memories on every turn, some compress conversations into running summaries. The right answer depends on how long your sessions are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Connected knowledge body.&lt;/strong&gt; Where does the agent’s knowledge come from, and who maintains it? If the only knowledge the agent has is what users say during sessions, the system is closed-loop. If the agent can read the company wiki, customer records, or a curated knowledge graph, it’s connected. Mid-market deployments almost always need the connected version, because the team already has its knowledge somewhere and the agent needs to plug into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Automatic vs engineered memory.&lt;/strong&gt; Does the system decide what to remember on its own, or do you tell it explicitly? Automatic extraction is faster to deploy and harder to audit. Explicit memory is slower to set up and easier to control. Most mid-market teams want explicit at first and automatic only after they trust the system’s judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Human-agent merge.&lt;/strong&gt; Can humans read what the agent has learned, edit it, and contribute to the same knowledge base outside the agent loop? The agent should not be the only writer to its own memory. The human team needs a seat at the same table, ideally using normal tools (text editors, wikis, IDEs) rather than a separate “memory dashboard.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Current limits.&lt;/strong&gt; What does this system &lt;em&gt;not&lt;/em&gt; do today? Every memory system has gaps. Some don’t handle entity changes over time, some don’t support multi-tenant scoping, some are private beta with no published pricing. Naming the limits before you commit saves the second deployment from fighting the first one’s blind spots.&lt;/p&gt;

&lt;p&gt;These five run as a checklist against every system below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-02.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-02.svg" alt="Five questions to ask before picking an AI agent memory system — context management, connected knowledge body, automatic vs engineered, human-agent merge, current limits" width="100" height="61.578947368421055"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2026 landscape — what’s off the shelf, what you build yourself
&lt;/h2&gt;

&lt;p&gt;There are two paths through this market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off the shelf.&lt;/strong&gt; Opinionated APIs and managed infrastructure. Integration time is days. Trade-offs are vendor lock-in, less control over how memory gets extracted and stored, and pricing models that are usually opaque until you scale. The named players are Mem0, Zep (with its open-source component Graphiti), Letta (formerly MemGPT), Cognee, and Cloudflare Agent Memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build it yourself.&lt;/strong&gt; Maintain the company’s knowledge as files, usually markdown, in a versioned folder. Index them with a local semantic search tool. Give agents a query interface and, optionally, a write-to-a-review-folder interface. Integration is longer up front, you own the operational complexity, and no vendor will support you. The advantages: knowledge stays portable, humans use normal tools to maintain it, and the cost is essentially infrastructure-only.&lt;/p&gt;
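
&lt;p&gt;To make the build-it-yourself path concrete, here is a rough sketch of the two interfaces an agent would get: a query function and a write-to-review function. Plain keyword overlap stands in for a real semantic index, and the function names, file names, and &lt;code&gt;_review&lt;/code&gt; folder are assumptions made for the example.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def search(vault: Path, query: str, top_k: int = 3) -> list:
    """Rank markdown files by how many query words they contain.

    Keyword overlap is a stand-in for a real semantic index.
    """
    terms = set(query.lower().split())
    scored = []
    for path in sorted(vault.glob("*.md")):  # top level only, skips _review
        words = set(path.read_text().lower().split())
        score = len(terms.intersection(words))
        if score > 0:
            scored.append((score, path.name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


def propose(vault: Path, name: str, text: str) -> Path:
    """Agents write proposals to a review folder, never to the vault itself."""
    review = vault / "_review"
    review.mkdir(exist_ok=True)
    target = review / name
    target.write_text(text)
    return target


vault = Path(tempfile.mkdtemp())
(vault / "pricing.md").write_text("Enterprise pricing starts with an annual contract.")
(vault / "voice.md").write_text("Brand voice is plain and direct.")

hits = search(vault, "enterprise pricing")
draft = propose(vault, "pricing-update.md", "Proposed: add a monthly option.")
```

&lt;p&gt;The proposal lands in the review folder and does not show up in search results until a human moves it into the vault. That separation is what keeps humans in the loop without a separate memory dashboard.&lt;/p&gt;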

&lt;p&gt;There’s also an architectural axis that cuts across both paths. Memory systems tend to fall into one of three patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector-only.&lt;/strong&gt; Embed everything, retrieve by similarity. Fast, simple, weak on temporal and relational queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector plus knowledge graph.&lt;/strong&gt; Embed for similarity and extract entities/relationships for graph traversal. Better for “who owns what” and “what changed when” questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered or agent-managed.&lt;/strong&gt; The agent itself decides what to keep in working memory and what to page out to longer-term storage. More flexible, harder to reason about.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vectorize’s &lt;a href="https://vectorize.io/articles/best-ai-agent-memory-systems" rel="noopener noreferrer"&gt;2026 framework comparison&lt;/a&gt; introduced this taxonomy in clean form, and it’s a useful overlay when reading the rest of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five systems, compared
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mem0 — the personalization memory layer
&lt;/h3&gt;

&lt;p&gt;Mem0 is a vector + graph + key-value memory layer designed to give assistants and support agents persistent, scoped recall about end-users. Best for chatbots, support agents, and deployments where the same users return repeatedly.&lt;/p&gt;

&lt;p&gt;The architecture combines three storage layers (vector, graph, key-value) with a four-scope memory model: user_id, agent_id, run_id, app_id, plus an optional org_id. Memories are extracted automatically from conversations and stored against whichever scopes apply. According to &lt;a href="https://mem0.ai/blog/state-of-ai-agent-memory-2026" rel="noopener noreferrer"&gt;Mem0’s State of AI Agent Memory 2026 report&lt;/a&gt; (citing the &lt;a href="https://arxiv.org/abs/2504.19413" rel="noopener noreferrer"&gt;ECAI 2025 paper, Chhikara et al.&lt;/a&gt;), Mem0 scores 66.9% on the LOCOMO benchmark at 0.71s median latency using around 1,800 tokens per conversation, versus a full-context baseline of 72.9% at 9.87s and around 26,000 tokens. Put differently, the full-context baseline spends roughly 14x the tokens to gain 6 points of accuracy. The graph-enhanced variant (Mem0g) scores 68.4% at 1.09s. Mem0 publishes both its own scores and the comparators’ scores, so treat absolute numbers as vendor-favorable; the latency and token-cost gaps are directionally useful regardless.&lt;/p&gt;
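
&lt;p&gt;The trade-off is easier to see as ratios. The short calculation below uses only the figures quoted above; the variable names are ours.&lt;/p&gt;

```python
# Figures as quoted from Mem0's report; approximate token counts.
mem0 = {"accuracy": 66.9, "latency_s": 0.71, "tokens": 1800}
full_context = {"accuracy": 72.9, "latency_s": 9.87, "tokens": 26000}

token_ratio = full_context["tokens"] / mem0["tokens"]
latency_ratio = full_context["latency_s"] / mem0["latency_s"]
accuracy_gap = full_context["accuracy"] - mem0["accuracy"]

print(f"tokens:   {token_ratio:.1f}x")
print(f"latency:  {latency_ratio:.1f}x")
print(f"accuracy: {accuracy_gap:.1f} points")
```

&lt;p&gt;Full context costs about 14 times the tokens and about 14 times the latency for an accuracy gain of about 6 points. Whether that gain is worth it depends on how costly a wrong recall is in your workflow.&lt;/p&gt;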

&lt;p&gt;On the five questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context management:&lt;/strong&gt; retrieves relevant memories per turn, scoped by user/agent/run/app/org.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connected knowledge body:&lt;/strong&gt; partial. Mem0 holds what users say; pulling the company’s existing knowledge in is custom work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic vs engineered:&lt;/strong&gt; automatic extraction by default, with explicit add/update APIs available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-agent merge:&lt;/strong&gt; weak. Humans can call the API, but the workflow is developer-shaped, not knowledge-worker-shaped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current limits:&lt;/strong&gt; no native human-review workflow. The four-scope model is the closest the field gets to multi-stakeholder memory but it’s still agent-centric.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License: Apache 2.0 with around 48,000 GitHub stars per &lt;a href="https://dev.to/nebulagg/top-6-ai-agent-memory-frameworks-for-devs-2026-1fef"&gt;dev.to’s 2026 framework roundup&lt;/a&gt;. &lt;a href="https://atlan.com/know/best-ai-agent-memory-frameworks-2026/" rel="noopener noreferrer"&gt;Atlan’s 2026 comparison&lt;/a&gt; also notes Mem0 has raised $24M in funding and holds SOC 2 compliance. Repo: &lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;github.com/mem0ai/mem0&lt;/a&gt;. Managed cloud has a free tier; production pricing is usage-based.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zep / Graphiti — the temporal knowledge graph
&lt;/h3&gt;

&lt;p&gt;Zep models memory as a temporal knowledge graph: facts have a time dimension, so “Alice owned the budget until February, then Bob took over” is a first-class query rather than a string-similarity guess. The open-source component is &lt;a href="https://www.getzep.com/" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt;; Zep Cloud is the managed product on top.&lt;/p&gt;
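
&lt;p&gt;To show what a time dimension on facts buys you, here is a toy model of the Alice and Bob example. It is not Graphiti’s or Zep’s API: the &lt;code&gt;Fact&lt;/code&gt; class, the &lt;code&gt;who&lt;/code&gt; helper, and the specific dates are assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still true

    def holds_on(self, day: date) -> bool:
        started = day >= self.valid_from
        not_ended = self.valid_to is None or self.valid_to > day
        return started and not_ended


def who(facts, relation: str, obj: str, day: date) -> list:
    """Return every subject for which (relation, obj) held on `day`."""
    return [
        f.subject
        for f in facts
        if f.relation == relation and f.obj == obj and f.holds_on(day)
    ]


# Hypothetical dates for "Alice owned the budget until February".
facts = [
    Fact("Alice", "owns", "budget", date(2025, 9, 1), date(2026, 2, 1)),
    Fact("Bob", "owns", "budget", date(2026, 2, 1)),
]

january_owner = who(facts, "owns", "budget", date(2026, 1, 15))
march_owner = who(facts, "owns", "budget", date(2026, 3, 15))
```

&lt;p&gt;The same question returns Alice for a January date and Bob for a March date. A similarity search over the raw sentences has no principled way to make that distinction, because both names sit next to the word “budget.”&lt;/p&gt;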

&lt;p&gt;The temporal dimension matters most for production CRM and project agents, anywhere entities change relationships over time and the agent needs “what’s true now” separated from “what was true six months ago.” Zep groups conversations into episodes, summarizes them, and indexes the resulting graph. It scores 63.8% on LongMemEval per &lt;a href="https://atlan.com/know/best-ai-agent-memory-frameworks-2026/" rel="noopener noreferrer"&gt;Atlan’s comparison&lt;/a&gt;, the strongest published number for temporal queries, versus Mem0’s 49.0% on the same benchmark.&lt;/p&gt;

&lt;p&gt;One trade-off worth flagging: &lt;a href="https://blog.devgenius.io/ai-agent-memory-systems-in-2026-mem0-zep-hindsight-memvid-and-everything-in-between-compared-96e35b818da8" rel="noopener noreferrer"&gt;DevGenius’s builder comparison&lt;/a&gt; reports that immediate post-ingestion retrieval often misses correct answers because Zep’s graph processing runs in the background; correct answers tend to surface hours later once the graph catches up. The same piece notes Mem0’s published critique that Zep’s memory footprint can exceed 600,000 tokens per conversation versus Mem0’s ~1,800. That critique comes from Mem0, but the order-of-magnitude gap is consistent across third-party reports.&lt;/p&gt;

&lt;p&gt;On the five questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context management:&lt;/strong&gt; episode-grouped, summarized, retrieved with temporal awareness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connected knowledge body:&lt;/strong&gt; partial. Strong inside the graph it builds, weak at pulling external markdown or wiki content in without custom ingestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic vs engineered:&lt;/strong&gt; automatic extraction, explicit graph editing available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-agent merge:&lt;/strong&gt; weak. Humans interact with Zep through Zep’s tools, not their own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current limits:&lt;/strong&gt; retrieval delay until graph processing completes. No native human-review workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License: Graphiti is open source; Zep Cloud is usage-based. Around 24,000 GitHub stars per the dev.to roundup. SOC 2 compliant per Atlan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-03.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-03.svg" alt="Three AI agent memory architecture patterns in 2026: vector-only, vector plus knowledge graph, and tiered agent-managed memory" width="100" height="48.83720930232558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Letta (formerly MemGPT) — OS-inspired tiered memory
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://letta.com" rel="noopener noreferrer"&gt;Letta&lt;/a&gt; models agent memory after an operating system. Main context is RAM (what’s in the prompt right now). Archival memory is disk (long-term storage the agent can search). The agent itself decides what pages in and out via tool calls. Originally published as MemGPT, the project rebranded in 2024 and continues under the same architecture.&lt;/p&gt;

&lt;p&gt;Best for long-running agents that need effectively unlimited memory and where you’re willing to trust the agent with its own paging decisions: research assistants, coding assistants on multi-week projects, deployments running hundreds or thousands of turns. The trade-off is that “the agent decides what to remember” is harder to audit than “the system decides on rules you wrote.”&lt;/p&gt;

&lt;p&gt;On the five questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context management:&lt;/strong&gt; tiered RAM/disk model with agent-driven paging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connected knowledge body:&lt;/strong&gt; partial. Archival memory can hold ingested documents, but you’re operating Letta’s storage, not the company’s existing knowledge base.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic vs engineered:&lt;/strong&gt; agent-managed, a third path between fully automatic and explicitly engineered by the operator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-agent merge:&lt;/strong&gt; weak. Humans can call the API; no native co-edit workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current limits:&lt;/strong&gt; auditing what the agent chose to remember (and discard) is harder than with explicit-rule systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License: Apache 2.0, around 21,000 GitHub stars per the dev.to roundup. Managed cloud available; self-hosted deployment is well-documented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cognee — knowledge graph from unstructured data
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.cognee.ai/" rel="noopener noreferrer"&gt;Cognee&lt;/a&gt; is the closest existing system to “feed the company’s documents in and let the agent reason over them.” Its pipeline ingests raw documents, conversations, and external sources, extracts entities and relationships, builds a knowledge graph, and retrieves by graph traversal combined with vector search. The entry point is unstructured documents (not conversation logs) and the graph is the primary retrieval surface, which makes Cognee strong for institutional knowledge and weaker for fast conversational personalization. Best for research-heavy agents and deployments where the inputs are messy documents rather than clean conversations.&lt;/p&gt;

&lt;p&gt;On the five questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context management:&lt;/strong&gt; graph traversal plus vector retrieval; long-form document support is the strength.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connected knowledge body:&lt;/strong&gt; stronger here than the conversational-memory peers. Ingestion is the design center.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic vs engineered:&lt;/strong&gt; automatic extraction with configurable pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-agent merge:&lt;/strong&gt; partial. Humans curate the input documents, but Cognee’s representation of them is opaque to non-engineers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current limits:&lt;/strong&gt; no native human-review workflow on agent-added knowledge; managed-service pricing not transparent at the time of writing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License: open core with around 12,000 GitHub stars per the dev.to roundup. Managed cloud available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloudflare Agent Memory — the April 2026 entrant
&lt;/h3&gt;

&lt;p&gt;Cloudflare announced &lt;a href="https://blog.cloudflare.com/introducing-agent-memory/" rel="noopener noreferrer"&gt;Agent Memory&lt;/a&gt; in private beta on April 17, 2026. It’s the most significant new entrant this year, shipping as a managed service running on Workers, Durable Objects, and Vectorize.&lt;/p&gt;

&lt;p&gt;Five operations (ingest, remember, recall, forget, list) cover the API surface. Ingestion runs as a two-pass pipeline at 10,000-character chunks with two-message overlap, with an eight-check verifier filtering extracted memories before they land. Memories are typed into one of four classes: Facts (atomic stable knowledge), Events (timestamped happenings), Instructions (procedures), and Tasks (ephemeral). A profile model can be shared across multiple agents and humans, the closest any managed service gets to a multi-stakeholder memory layer. Cloudflare also committed publicly that customer memory is exportable (“your memories are yours; every memory is exportable”), which most managed services don’t.&lt;/p&gt;
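
&lt;p&gt;To make the chunking step concrete, here is a minimal sketch of one possible reading of “10,000-character chunks with two-message overlap.” It is not Cloudflare’s code: the function name &lt;code&gt;chunk_messages&lt;/code&gt;, its parameters, and the choice to carry the last two messages of each chunk into the next are all assumptions for illustration.&lt;/p&gt;

```python
def chunk_messages(messages, max_chars=10_000, overlap=2):
    """Group messages into chunks of roughly max_chars characters.

    Hypothetical illustration only. Each new chunk starts with the last
    `overlap` messages of the previous one, so the extractor sees some shared
    context. Carried-over messages count toward the size of the new chunk,
    and a chunk can exceed max_chars when a single message is very long.
    """
    chunks = []
    current = []
    size = 0
    for message in messages:
        if current and size + len(message) > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            current = current[max(len(current) - overlap, 0):]
            size = sum(len(m) for m in current)
        current.append(message)
        size += len(message)
    if current:
        chunks.append(current)
    return chunks
```

&lt;p&gt;The overlap trades a little duplicated extraction work for continuity: a fact stated across a chunk boundary still appears whole in at least one chunk.&lt;/p&gt;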

&lt;p&gt;On the five questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context management:&lt;/strong&gt; typed retrieval (Facts/Events/Instructions/Tasks) with verifier-gated ingestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connected knowledge body:&lt;/strong&gt; partial. Designed primarily for conversational and event-driven inputs; document ingestion is supported but not the design center.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic vs engineered:&lt;/strong&gt; automatic with a strong verifier in the loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-agent merge:&lt;/strong&gt; the shared-profile model gestures toward this, but the example in the launch post is “two agents share memory,” not “humans write the source of truth.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current limits:&lt;/strong&gt; private beta with no published pricing; Cloudflare-ecosystem dependency; production proof points are weeks old, not years.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;License: managed service, no open-source release. Pricing: not yet published as of April 2026. Best fit: teams already on Cloudflare who want the lowest-friction managed memory layer and are comfortable being early adopters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-05.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-05.svg" alt="Cloudflare Agent Memory operations flow: ingest, two-pass extraction, 8-check verifier, type classification into facts events instructions tasks, then remember recall forget list" width="100" height="64.21052631578948"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The build-it-yourself path: markdown vault plus semantic search
&lt;/h2&gt;

&lt;p&gt;A folder of markdown files plus a local semantic search index is a legitimate competitor to all five managed paths above, especially for mid-market companies that already maintain knowledge in Notion, Obsidian, or git repos. This is one of the patterns we’ve watched work in production; see &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-teams-business-operations/" rel="noopener noreferrer"&gt;how production agent teams handle memory in practice&lt;/a&gt; for the operational shape.&lt;/p&gt;

&lt;p&gt;The pattern is simple. Maintain company knowledge as plain markdown in a versioned folder (an Obsidian vault, a git repo, a GitHub wiki, a Notion export). Index it with a local semantic search tool. Give agents read access through a query tool that returns matching files (or excerpts) with provenance. Optionally, give the agent write access to a designated subfolder where new notes go for human review before promotion into the canonical base.&lt;/p&gt;
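
&lt;p&gt;As a rough sketch of the query tool, the function below walks a folder of markdown files and returns matching lines with their provenance (file path and line number). Plain keyword counting stands in for the semantic index, and the name &lt;code&gt;search_vault&lt;/code&gt; and its return shape are assumptions for illustration, not any particular product’s API.&lt;/p&gt;

```python
from pathlib import Path


def search_vault(vault_dir, query, max_results=3):
    """Return excerpts from markdown files that mention the query terms.

    Keyword counting is a stand-in for a real semantic index. Every hit
    carries the file path and line number so the agent can cite its source.
    """
    terms = [term.lower() for term in query.split()]
    hits = []
    for path in sorted(Path(vault_dir).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            score = sum(line.lower().count(term) for term in terms)
            if score > 0:
                hits.append({
                    "score": score,
                    "source": str(path.relative_to(vault_dir)),
                    "line": number,
                    "excerpt": line.strip(),
                })
    # Highest-scoring lines first; the sort is stable, so ties keep path order.
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:max_results]
```

&lt;p&gt;Swapping in a different index later means replacing the scoring loop; the provenance fields the agent relies on stay the same.&lt;/p&gt;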

&lt;p&gt;The advantages stack up quickly. Knowledge stays portable: no vendor owns your facts, and migrating to a different agent platform means changing the query tool, not exporting and reformatting a database. Humans edit knowledge using normal tools (text editors, Obsidian, IDEs, GitHub PR review), so there’s no separate “memory dashboard” anyone has to learn. The same knowledge base feeds multiple agents and the team simultaneously. Cost is infrastructure-only.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fa8ai8j8c5mhvppae59.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fa8ai8j8c5mhvppae59.jpg" alt="Professional reviewing knowledge documents and files at a modern office desk with natural light" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern has a documented public example. &lt;a href="https://eastondev.com/blog/en/posts/ai/20260227-openclaw-obsidian-sync/" rel="noopener noreferrer"&gt;A February 2026 walkthrough at eastondev.com&lt;/a&gt; describes configuring an agent platform’s Obsidian-vault skill to sync conversation memory as Markdown notes with bidirectional links and structured directories (session logs in one folder, knowledge base in another). Ask Perplexity about bidirectional human↔agent knowledge sync in 2026 and that walkthrough is the project it cites, as the only documented end-to-end pattern at the time of writing. For a longer-form view of the same shape, see &lt;a href="https://fountaincity.tech/resources/blog/inside-autonomous-ai-content-pipeline/" rel="noopener noreferrer"&gt;how a real production pipeline uses memory&lt;/a&gt; across multiple stages.&lt;/p&gt;

&lt;p&gt;Tools that fit this lane: &lt;a href="https://obsidian.md" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt; for the markdown editor and graph layer; a local semantic search index combining BM25 and vector search over the vault; &lt;a href="https://github.com/langchain-ai/langmem" rel="noopener noreferrer"&gt;LangMem&lt;/a&gt; or &lt;a href="https://docs.llamaindex.ai" rel="noopener noreferrer"&gt;LlamaIndex memory modules&lt;/a&gt; when you want a memory abstraction pairable with a markdown backend instead of a SaaS layer.&lt;/p&gt;
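
&lt;p&gt;One common way to combine a BM25 ranking with a vector-search ranking is reciprocal rank fusion. The sketch below assumes you already have two ranked lists of file names from whatever BM25 and vector tools you use; the function name is hypothetical, and &lt;code&gt;k=60&lt;/code&gt; is a commonly used default rather than a tuned value.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in, so
    documents ranked well by both BM25 and vector search rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["pricing.md", "policy.md", "faq.md"]
vector_ranking = ["policy.md", "roadmap.md", "pricing.md"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

&lt;p&gt;Rank fusion only needs the order of results, not comparable scores, which is why it suits pairing a keyword index with an embedding index.&lt;/p&gt;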

&lt;p&gt;When this path is the wrong answer: temporal entity tracking is non-trivial to build (use Zep), agent-managed paging across very long sessions is also non-trivial (use Letta), and if you genuinely don’t want any infrastructure to operate, the managed services exist for a reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bidirectional sync question — how knowledge flows both ways
&lt;/h2&gt;

&lt;p&gt;Most teams treat agent memory as one-way. The agent reads from some knowledge, operates on it, and the work product evaporates. The systems that actually work in production close the loop: agent reads, operates, writes back to a holding area, human reviews, knowledge gets promoted into the canonical base. Four steps, all of them necessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Source of truth lives with humans.&lt;/strong&gt; The canonical knowledge base, the place where the company’s strategy, pricing, customer details, and policies actually live, is something humans maintain primarily. An Obsidian vault, a Notion workspace, an internal wiki, a git repo of markdown files. Whatever it is, the humans on the team are the authoritative authors. This principle of &lt;a href="https://fountaincity.tech/resources/blog/how-can-my-business-own-and-control-its-own-ai-data/" rel="noopener noreferrer"&gt;building your own knowledge base&lt;/a&gt; rather than letting it live inside a vendor’s database is what makes the rest of the workflow possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Agent reads with provenance.&lt;/strong&gt; When the agent answers a question or makes a decision, it cites which document (or which memory record) the answer came from. No “trust me” responses. Provenance is non-optional, because without it humans can’t audit what the agent is doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Agent writes to a review queue, not the source of truth.&lt;/strong&gt; When the agent learns something new (a customer corrected a fact, a project changed scope, a pricing exception was approved) it writes that new note to a pending/ or inbox/ folder. Never directly to the canonical base. The agent’s job is to propose, not to publish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Human review promotes or rejects.&lt;/strong&gt; A periodic review pass (daily for high-velocity environments, weekly for most) either promotes the agent’s proposed notes into the canonical base or rejects them. The canonical base only grows under human authority. The review interface is whatever the team already uses: a folder, a Pull Request, a Notion page with a checkbox.&lt;/p&gt;
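
&lt;p&gt;Steps 3 and 4 can be sketched with nothing more than folders. The example below is a minimal illustration, assuming a vault laid out with &lt;code&gt;pending/&lt;/code&gt;, &lt;code&gt;canonical/&lt;/code&gt;, and &lt;code&gt;rejected/&lt;/code&gt; subfolders; the function names and the front-matter fields are hypothetical.&lt;/p&gt;

```python
from pathlib import Path
import shutil


def propose_note(vault_dir, name, body, source):
    """Agent side (step 3): write a proposal to pending/, never to canonical/."""
    pending = Path(vault_dir) / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    note = pending / name
    header = f"---\nproposed_by: agent\nsource: {source}\n---\n"
    note.write_text(header + body + "\n", encoding="utf-8")
    return note


def review_note(vault_dir, name, approve):
    """Human side (step 4): promote an approved note or archive a rejected one."""
    root = Path(vault_dir)
    target_dir = root / ("canonical" if approve else "rejected")
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / name
    shutil.move(str(root / "pending" / name), str(destination))
    return destination
```

&lt;p&gt;In a git-backed vault the same split maps onto a branch and a Pull Request: the agent commits to the branch, and the merge is the promotion.&lt;/p&gt;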

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-06.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-30-J-agent-memory-knowledge-systems-06.svg" alt="Four-step bilateral knowledge sync workflow: human canonical base, agent reads with provenance, agent writes to review queue, human review promotes or rejects" width="100" height="57.5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How each system maps to these steps tells you the most about whether it’s a fit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mem0:&lt;/strong&gt; step 2 strong (four-scope provenance), step 1 partial, steps 3 and 4 require custom work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zep:&lt;/strong&gt; step 2 strong (episode-level provenance), step 1 partial, steps 3 and 4 require custom work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Letta:&lt;/strong&gt; step 2 harder (paging decisions aren’t always traceable), steps 3 and 4 require careful tool wrapping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognee:&lt;/strong&gt; step 1 strongest (document ingestion is the design center), step 2 partial, steps 3 and 4 require custom work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare Agent Memory:&lt;/strong&gt; typed classification and shared profiles gesture at multi-stakeholder memory; step 4 is the gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown vault plus semantic search:&lt;/strong&gt; step 4 is just “humans editing a folder” or “merging a Pull Request.” That’s where this path quietly wins. Steps 1–3 require operational discipline rather than a vendor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of the managed systems natively implements step 4. All of them assume the agent has authority to update memory directly. The systems that come closest (Cloudflare’s shared profiles, Mem0’s scoped models) do so by accident, not by design. The markdown-vault path makes step 4 a workflow choice instead of a feature request.&lt;/p&gt;

&lt;h2&gt;
  
  
  A decision framework for picking the right system
&lt;/h2&gt;

&lt;p&gt;Read the framework as “if your situation is X, start with Y”:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Already on Cloudflare and want low-friction managed:&lt;/strong&gt; Cloudflare Agent Memory (private beta; confirm access first).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive personalization for end-users&lt;/strong&gt; (chatbot, support, returning users): Mem0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entities and relationships change over time&lt;/strong&gt; (“who owned this account in February”): Zep / Graphiti.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-horizon agents needing effectively unlimited memory:&lt;/strong&gt; Letta.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingesting unstructured documents, reasoning over a knowledge graph:&lt;/strong&gt; Cognee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full ownership, portability, humans as first-class authors:&lt;/strong&gt; markdown vault plus semantic search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Already on LangChain/LangGraph or LlamaIndex:&lt;/strong&gt; use their memory modules first; revisit only if you outgrow them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most mid-market deployments end up combining a markdown vault for canonical knowledge with one of the off-the-shelf layers for transient session memory. The vault holds what the team owns; the SaaS layer holds what the agent needs to remember about an active conversation. That split keeps canonical knowledge portable while letting the agent operate at the speed users expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open problems in the field
&lt;/h2&gt;

&lt;p&gt;The agent-memory category is roughly eighteen months old as a distinct discipline. A few caveats apply across all six paths above. None of the managed systems natively implements the human-review-promotion gate; all assume the agent has authority to update memory directly. LOCOMO and LongMemEval are useful but easy to overfit (Cloudflare’s launch post says so directly), so treat scores as directional. Most managed services route conversation extraction through their own LLMs, which is fine for some businesses and a deal-breaker for others. None publish per-query pricing in a way that lets a buyer model real-world cost ahead of time. Cloudflare publicly committed to memory export; most others have not. Voice agent memory is a distinct emerging sub-problem.&lt;/p&gt;

&lt;p&gt;The human-review gap is wide enough that one of the major systems will likely close it within twelve months.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the best AI agent memory system in 2026?
&lt;/h3&gt;

&lt;p&gt;There isn’t a single best. Mem0 leads on personalization and benchmark scores. Zep / Graphiti leads on temporal queries. Letta leads on long-horizon agent-managed memory. Cognee leads on unstructured-document ingestion. Cloudflare Agent Memory is the most significant new managed entrant. For deployments where humans need to be first-class authors of the knowledge base, a markdown vault plus a semantic search index is often the right answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Cloudflare Agent Memory open source?
&lt;/h3&gt;

&lt;p&gt;No. &lt;a href="https://blog.cloudflare.com/introducing-agent-memory/" rel="noopener noreferrer"&gt;Cloudflare Agent Memory&lt;/a&gt; is a managed service in private beta as of April 17, 2026, running on Workers, Durable Objects, and Vectorize. Cloudflare has committed publicly to making customer memory exportable, but the service itself is closed-source.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the difference between Mem0 and Zep?
&lt;/h3&gt;

&lt;p&gt;Mem0 is optimized for personalization, remembering things about end-users across sessions, with a four-scope memory model (user_id / agent_id / run_id / app_id). Zep is optimized for temporal knowledge, tracking how entities and relationships change over time using a knowledge graph. Mem0 is faster on retrieval; Zep is more accurate on “what was true when” questions. Per published benchmarks, Mem0 leads LOCOMO and Zep leads LongMemEval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Obsidian as memory for an AI agent?
&lt;/h3&gt;

&lt;p&gt;Yes. The pattern is to maintain company knowledge as markdown in an Obsidian vault, index it with a local semantic search tool, and give the agent a query interface. Optionally, give the agent write access to a review folder where humans promote or reject new notes. &lt;a href="https://eastondev.com/blog/en/posts/ai/20260227-openclaw-obsidian-sync/" rel="noopener noreferrer"&gt;A February 2026 walkthrough at eastondev.com&lt;/a&gt; documents one full implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I let an AI agent update my company’s knowledge base?
&lt;/h3&gt;

&lt;p&gt;Don’t let it write directly. Use the four-step bidirectional sync workflow: humans maintain the canonical knowledge base, the agent reads with provenance, the agent writes new learnings to a review folder (not the canonical base), and a periodic human review promotes or rejects them. None of the major managed memory systems implement step four natively, which is why the markdown-vault path is often the easiest fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you don’t want to build this
&lt;/h2&gt;

&lt;p&gt;If your business is hitting the memory wall and you don’t want to evaluate six options and stand up the bidirectional review workflow yourself, that’s the kind of work we do. &lt;a href="https://fountaincity.tech/services/managed-autonomous-ai-agents/" rel="noopener noreferrer"&gt;We can run the memory architecture and the human-review workflow with you&lt;/a&gt;, so the canonical knowledge stays yours and the agent participates in the loop you already trust.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>What MCP, A2A, and UCP Mean for Your Website in 2026</title>
      <dc:creator>Sebastian Chedal</dc:creator>
      <pubDate>Sat, 02 May 2026 18:06:58 +0000</pubDate>
      <link>https://dev.to/sebastian_chedal/what-mcp-a2a-and-ucp-mean-for-your-website-in-2026-3aij</link>
      <guid>https://dev.to/sebastian_chedal/what-mcp-a2a-and-ucp-mean-for-your-website-in-2026-3aij</guid>
      <description>&lt;p&gt;If you run a website in 2026, you have probably watched three different articles about MCP, A2A, and UCP scroll past in the last two weeks and wondered whether any of it changes what you should be doing this quarter. The short answer is yes, but probably less than the headlines suggest, and not in the direction the headlines point. The agentic protocol stack is real infrastructure that is now mainstream conversation, and most of the work the average website owner needs to do about it can be done in an afternoon.&lt;/p&gt;

&lt;p&gt;Three sources published the same underlying observation within roughly two weeks of each other. &lt;a href="https://backlinko.com/agentic-ai-protocols" rel="noopener noreferrer"&gt;Backlinko&lt;/a&gt; released a six-protocol primer on MCP, A2A, NLWeb, WebMCP, ACP, and UCP, framing them as “what robots.txt and XML sitemaps were to 2005 Google.” Addy Osmani, Google Cloud’s Director of Engineering, published an Agentic Engine Optimization framework along with an &lt;a href="https://github.com/addyosmani/agentic-seo" rel="noopener noreferrer"&gt;open-source audit tool&lt;/a&gt;. Conductor analyzed 13,770 domains and 17 million AI responses and named the resulting visibility layer “the parallel surface.” Three independent signals, same conclusion. Agentic protocols are now part of how websites get discovered, queried, and (eventually) transacted with by AI agents on behalf of their users.&lt;/p&gt;

&lt;p&gt;This article is the version for the person who runs a website and wants to know which of these protocols matter for their site, which ones they can ignore, and what is reasonable to actually do about any of it before the end of the quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What “Protocol-Ready” Means
&lt;/h2&gt;

&lt;p&gt;Protocol-ready means an AI agent can discover, query, and (where it makes sense) transact with a website through a standardized interface, instead of scraping HTML and guessing at structure. That is the whole definition.&lt;/p&gt;

&lt;p&gt;The closest historical parallel is the one Backlinko reaches for and gets right. Their framing: &lt;em&gt;“Think of how robots.txt and XML sitemaps became table stakes for search crawlers. Agentic protocols are shaping up to be that for AI agents.”&lt;/em&gt; Robots.txt started as a quiet text file nobody cared about and turned into existential SEO infrastructure within three years. The trajectory of the agentic protocol stack looks similar, though earlier on the curve.&lt;/p&gt;

&lt;p&gt;The signal that this is now mainstream rather than speculative is convergence. &lt;a href="https://www.digitalapplied.com/blog/ai-agent-protocol-ecosystem-map-2026-mcp-a2a-acp-ucp" rel="noopener noreferrer"&gt;DigitalApplied’s ecosystem map&lt;/a&gt; reports 97 million MCP downloads as of March 2026. Backlinko’s count of the PulseMCP directory has more than 10,000 MCP servers live as of early 2026. &lt;a href="https://www.conductor.com/academy/aeo-geo-benchmarks-report/" rel="noopener noreferrer"&gt;Conductor’s 2026 benchmark&lt;/a&gt; finds AI referral traffic averaging around 1% of total website traffic and growing roughly 1% per month. The 1% number is small, but the growth rate is the part to watch. The infrastructure has reached the volume where ignoring it stops being defensible, even if acting on it is still optional for most sites.&lt;/p&gt;

&lt;p&gt;For the content-side companion to the infrastructure questions in this article, see our &lt;a href="https://fountaincity.tech/resources/blog/agentic-seo-practitioner-guide/" rel="noopener noreferrer"&gt;agentic SEO practitioner guide&lt;/a&gt;, which covers what to publish so AI agents can actually use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Protocols That Matter Now (and the Three to Watch, Not Build For)
&lt;/h2&gt;

&lt;p&gt;Backlinko enumerates six protocols. The count is correct as a taxonomy, and misleading as a buying recommendation. For 2026 website-scale decisions, three deserve real attention. Three more are worth tracking and nothing more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-29-J-agentic-protocol-stack-agency-02.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-29-J-agentic-protocol-stack-agency-02.svg" alt="Comparison diagram: MCP, A2A, UCP agentic commerce protocols to build for now vs NLWeb, WebMCP, ACP to watch only" width="100" height="53.84615384615385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Build for now
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol).&lt;/strong&gt; The agent-to-tools layer. &lt;a href="https://backlinko.com/agentic-ai-protocols" rel="noopener noreferrer"&gt;Anthropic launched MCP in November 2024&lt;/a&gt;, and it is now governed by the Agentic AI Foundation under the Linux Foundation. The standard has been adopted by OpenAI, Google, and Microsoft. If your business has any internal system you would want AI tools to query (a product catalog, a CRM, a CMS, a support knowledge base, an inventory database), an MCP server is the standard interface for exposing that system to agents. It is the only protocol on this list that has cleared “is this real” status. If you have nothing for an agent to query, you do not need MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A (Agent-to-Agent).&lt;/strong&gt; The agent-to-agent layer. Google launched A2A in &lt;a href="https://backlinko.com/agentic-ai-protocols" rel="noopener noreferrer"&gt;April 2025 with more than 50 technology partners&lt;/a&gt;, including Salesforce, PayPal, SAP, Workday, and ServiceNow. The Linux Foundation now maintains it under Apache 2.0. A2A becomes relevant when a website operates more than one agent that needs to coordinate with another agent (yours or someone else’s). Most websites are not running multiple agents yet. If you are running one agent or none, A2A is informational. If you reach three or more by the end of 2026, you will need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UCP (Universal Commerce Protocol).&lt;/strong&gt; The agent-to-commerce layer. Sundar Pichai announced UCP at NRF 2026, co-developed by Google and Shopify with launch partners including &lt;a href="https://www.infoq.com/news/2026/01/google-agentic-commerce-ucp/" rel="noopener noreferrer"&gt;Target, Walmart, Wayfair, and Etsy, plus 20+ additional partners&lt;/a&gt; including Mastercard, Visa, Stripe, and American Express. UCP runs on top of OAuth 2.0 and PCI-DSS, with MCP and A2A bindings built in. &lt;a href="https://www.thestack.technology/walmart-target-join-google-to-launch-ecommerce-standard-for-ai-shopping/" rel="noopener noreferrer"&gt;UCP launched less than 14 weeks after OpenAI and Stripe announced ACP&lt;/a&gt;, the competing OpenAI-led commerce protocol. The two protocols overlap. UCP has the broader retailer coalition; ACP has live distribution inside ChatGPT. If your site sells products and you are picking one to keep on your radar today, UCP is the safer bet on coalition breadth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Watch, do not build for yet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;NLWeb.&lt;/strong&gt; A natural-language interface for websites, created by R.V. Guha, who also created RSS, RDF, and Schema.org. Heavy pedigree. &lt;a href="https://backlinko.com/agentic-ai-protocols" rel="noopener noreferrer"&gt;Early adopters include TripAdvisor, Shopify, Eventbrite, O’Reilly, and Hearst, announced at Microsoft Build 2025&lt;/a&gt;. Interesting long-term. Most websites do not need it yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebMCP.&lt;/strong&gt; A &lt;a href="https://backlinko.com/agentic-ai-protocols" rel="noopener noreferrer"&gt;Google-and-Microsoft W3C Community Group proposal, with an early preview shipping in Chrome in February 2026&lt;/a&gt;. Pre-standard. Worth watching, not worth implementing this quarter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ACP (Agentic Commerce Protocol).&lt;/strong&gt; OpenAI and Stripe’s commerce protocol. Live in &lt;a href="https://opascope.com/insights/ai-shopping-assistant-guide-2026-agentic-commerce-protocols/" rel="noopener noreferrer"&gt;ChatGPT Instant Checkout since September 2025&lt;/a&gt;, with 900 million weekly ChatGPT users and a reported 4% merchant fee per Opascope’s synthesis. Real, but overlapping with UCP. If you only have budget for one commerce protocol implementation, the broader-coalition standard wins on portability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run This on Your Own Site: A Five-Point Readiness Check
&lt;/h2&gt;

&lt;p&gt;Most websites only need to act on two or three of the five questions below. The point of running through all five is to know which two or three those are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-29-J-agentic-protocol-stack-agency-03.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffountaincity.tech%2Fwp-content%2Fuploads%2F2026%2F04%2F2026-04-29-J-agentic-protocol-stack-agency-03.svg" alt="Five-step protocol readiness audit: structured data, content recency check, manifest decision, MCP tool exposure, citation baseline" width="100" height="65.8974358974359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Structured-data baseline.&lt;/strong&gt; Schema.org coverage for Organization, Product, Service, FAQPage, and Article at minimum. If your structured data is incomplete, no protocol implementation will compensate, because agents still need the structured signals underneath. Run Osmani’s &lt;a href="https://github.com/addyosmani/agentic-seo" rel="noopener noreferrer"&gt;agentic-seo audit tool&lt;/a&gt; against your own domain. The tool runs ten checks across five categories (Discovery, Content, Token Efficiency, Agent Context, AI Usability) and scores out of 100. Free, public, fifteen minutes. Run it against a competitor’s domain in the same session if you want a calibration point.&lt;/p&gt;
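
&lt;p&gt;For reference, a minimal Organization record in Schema.org’s JSON-LD form can be generated like this. The helper name and the example values are placeholders; a real record would usually carry more properties, and the output belongs inside a script element of type &lt;code&gt;application/ld+json&lt;/code&gt; in the page head.&lt;/p&gt;

```python
import json


def organization_jsonld(name, url, logo_url):
    """Build a minimal Schema.org Organization record as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "logo": logo_url,
    }


# Placeholder values; replace with your own organization details.
markup = json.dumps(
    organization_jsonld(
        "Example Co", "https://example.com", "https://example.com/logo.png"
    ),
    indent=2,
)
print(markup)
```

&lt;p&gt;Product, Service, FAQPage, and Article follow the same shape with a different &lt;code&gt;@type&lt;/code&gt; and their own properties.&lt;/p&gt;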

&lt;p&gt;&lt;strong&gt;2. Content recency check.&lt;/strong&gt; Amsive reported that &lt;a href="https://www.bigmoves.marketing/blog/ai-in-marketing-5-predictions-for-b2b-marketing-in-2026-and-beyond" rel="noopener noreferrer"&gt;50% of AI-cited content is less than 13 weeks old&lt;/a&gt;. If your last cornerstone publish was six months ago, fix that before anything else. Recency is the precondition; protocols are the amplifier. Cornerstone-content cadence is a bigger lever for AI visibility right now than any single manifest decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. /.well-known/ manifest decision.&lt;/strong&gt; There are three candidate manifests, and not every site should publish all three. A UCP manifest at /.well-known/ucp is relevant if you sell products online. An llms.txt file is relevant for content-heavy sites that want to expose a curated reading order to AI agents. An AGENTS.md file at the repository root is relevant if your site or codebase is going to be navigated by coding agents. Most sites need one or two of these, so decide which ones apply to you before publishing anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. MCP tool exposure decision.&lt;/strong&gt; Do you have an internal API, database, or system an agent should reach? If yes, an MCP server wrapping that system is the right pattern. If no, and most brochure-site businesses are in this category, skip MCP entirely this quarter. There is no point building infrastructure for agents to use when there is nothing for them to use it for. If you do expose an internal system, build a &lt;a href="https://fountaincity.tech/resources/blog/ai-agent-cost-circuit-breaker/" rel="noopener noreferrer"&gt;cost circuit breaker pattern&lt;/a&gt; in front of it before going live. Runaway agent calls produce surprise bills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Citation baseline.&lt;/strong&gt; Before any protocol work, measure where your site is currently being cited in AI answers across Perplexity, ChatGPT, Gemini, Claude, and Google AI Mode. &lt;a href="https://www.conductor.com/academy/aeo-geo-benchmarks-report/" rel="noopener noreferrer"&gt;Conductor’s 2026 AEO/GEO Benchmarks&lt;/a&gt;, built on 13,770 domains and 17 million AI responses, give you the industry calibration. AI referral traffic averages around 1% of total and is growing roughly 1% per month. If you do not measure where you are cited today, you cannot tell whether anything you do tomorrow is working.&lt;/p&gt;

&lt;p&gt;Five questions, answerable in an afternoon. Most websites only need to act on two or three of them.&lt;/p&gt;
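
&lt;p&gt;Checks 2 and 3 reduce to simple rules, so they are easy to script. The following is a minimal Python sketch, not a reference implementation. It assumes the 13-week figure cited above can double as a freshness threshold, and the names &lt;code&gt;FRESH_WEEKS&lt;/code&gt;, &lt;code&gt;is_fresh&lt;/code&gt;, and &lt;code&gt;manifests_to_publish&lt;/code&gt; are hypothetical, chosen for illustration only.&lt;/p&gt;

```python
from datetime import date

# Assumption: reuse the 13-week figure cited above as a freshness threshold.
FRESH_WEEKS = 13


def is_fresh(published, today):
    """Return True when a page was published fewer than FRESH_WEEKS weeks before today."""
    whole_weeks = (today - published).days // 7
    # range() also rejects negative ages, i.e. publish dates in the future.
    return whole_weeks in range(FRESH_WEEKS)


def manifests_to_publish(sells_products, content_heavy, used_by_coding_agents):
    """Map the three yes/no questions from check 3 to the files worth publishing."""
    manifests = []
    if sells_products:
        manifests.append("/.well-known/ucp")
    if content_heavy:
        # Assumption: the file is served from the site root.
        manifests.append("/llms.txt")
    if used_by_coding_agents:
        # Lives at the repository root, not on the public site.
        manifests.append("AGENTS.md")
    return manifests


if __name__ == "__main__":
    # Illustrative inputs only; substitute your own publish dates and answers.
    print(is_fresh(date(2026, 1, 5), today=date(2026, 3, 2)))
    print(manifests_to_publish(True, True, False))
```

&lt;p&gt;Running the recency function over the publish dates of your cornerstone pages gives a quick list of what to refresh first. An empty list from the manifest function is a valid outcome and matches the “skip it” cases described in this article.&lt;/p&gt;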

&lt;p&gt;&lt;strong&gt;When you can skip this entirely.&lt;/strong&gt; Sites with fewer than 50 indexed pages, sites in regulated verticals where agent transactions are not yet legal (regulated financial advice, healthcare prescribing, anything that requires a licensed human in the loop), and sites whose current content strategy is not producing anything citable in the first place. The structured-data and content-recency checks above will surface this quickly. If both fail, fix those first; the protocol questions can wait.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8udisexsr7bgjxak54sh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8udisexsr7bgjxak54sh.jpg" alt="Two professionals reviewing multi-agent pipeline dashboards in a modern office — protocol deployment in practice" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Is Going (and What to Do About It)
&lt;/h2&gt;

&lt;p&gt;The direction of travel is clear, but the short-term impact is modest, and that is the framing to take into your next planning meeting. &lt;a href="https://backlinko.com/agentic-ai-protocols" rel="noopener noreferrer"&gt;Backlinko&lt;/a&gt;, Pipe17, and the Google Developers Blog all published their protocol primers in Q1 2026. Search Engine Journal, SEMrush, and Ahrefs will follow this year. Conductor has already named “the parallel surface of visibility” as the canonical 2026 framing. Protocol-readiness is going to show up as a normal RFP requirement on a 12-to-24-month horizon, not a “by July” deadline. The current AI-referral share is small. The growth rate is the part that compounds.&lt;/p&gt;

&lt;p&gt;Here is what is reasonable to do now if you run a website. Run Osmani’s &lt;a href="https://github.com/addyosmani/agentic-seo" rel="noopener noreferrer"&gt;agentic-seo tool&lt;/a&gt; on your domain (15 minutes). Audit your cornerstone content recency (1 hour). Decide whether you have an internal system that would benefit from MCP exposure (most websites do not, and “no” is a perfectly reasonable answer). If you sell products online, put a calendar reminder to revisit the UCP manifest question in Q3, when the retailer adoption curve will be clearer. None of this is a multi-quarter program. It is afternoon-scale work for most sites, and skip-entirely work for many of them.&lt;/p&gt;

&lt;p&gt;We are a technology studio that builds autonomous AI systems. The readiness work in this article sits in front of the platform layer we run for clients with bigger needs (clients running production agents, exposing internal systems through MCP, or building multi-agent workflows that coordinate over A2A) at &lt;a href="https://fountaincity.tech/services/managed-autonomous-ai-agents/" rel="noopener noreferrer"&gt;Fountain City’s managed autonomous AI agents&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is MCP (Model Context Protocol)?
&lt;/h3&gt;

&lt;p&gt;MCP is the standardized interface AI agents use to talk to tools and data sources. Anthropic launched MCP in November 2024, and it is now governed by the Agentic AI Foundation under the Linux Foundation, with adoption from OpenAI, Google, and Microsoft. According to Backlinko’s count of the PulseMCP directory, more than 10,000 MCP servers are live as of early 2026. Practically, if you have an internal system an AI tool should query, an MCP server is the standard wrapper.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is UCP (Universal Commerce Protocol)?
&lt;/h3&gt;

&lt;p&gt;UCP is the agent-to-commerce protocol announced by Google and Shopify at NRF 2026. Launch partners include Target, Walmart, Wayfair, Etsy, Mastercard, Visa, Stripe, and American Express, with 20+ additional partners endorsing the standard. UCP runs on OAuth 2.0 and PCI-DSS and includes MCP and A2A bindings. It exists so AI agents can complete purchases on behalf of shoppers using a standardized handshake instead of brittle scraping.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between MCP, A2A, and UCP?
&lt;/h3&gt;

&lt;p&gt;MCP connects agents to tools and data. A2A connects agents to other agents. UCP connects agents to commerce checkout. Different layers of the same stack, and most websites only need one or two of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does “protocol-ready” mean for a website?
&lt;/h3&gt;

&lt;p&gt;Protocol-ready means an AI agent can discover, query, and (where it makes sense) transact with the site through a standardized interface, instead of scraping HTML and guessing at structure. Concretely: structured-data coverage in place, recent cornerstone content, the right /.well-known/ manifest published, and (if internal systems are involved) an MCP server with auth and rate limits.&lt;/p&gt;
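
&lt;p&gt;As an illustration of the structured-data item, here is a minimal Python sketch that builds a Schema.org Organization object as JSON-LD. The function name &lt;code&gt;organization_jsonld&lt;/code&gt; and the example values are hypothetical, and a real site would usually add more properties and more types (Product, Service, FAQPage, Article). The printed JSON is meant to be embedded in the page inside a script element of type &lt;code&gt;application/ld+json&lt;/code&gt;.&lt;/p&gt;

```python
import json


def organization_jsonld(name, url, logo_url):
    """Build a minimal Schema.org Organization object as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "logo": logo_url,
    }


if __name__ == "__main__":
    # Placeholder values; replace with your own organization details.
    data = organization_jsonld(
        name="Example Co",
        url="https://example.com",
        logo_url="https://example.com/logo.png",
    )
    print(json.dumps(data, indent=2))
```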

&lt;h3&gt;
  
  
  Is this the same as GEO or AEO?
&lt;/h3&gt;

&lt;p&gt;Adjacent, not identical. GEO (Generative Engine Optimization) and AEO (Answer Engine Optimization) are about optimizing content to be cited by AI engines. Protocol readiness is the infrastructure layer underneath that: the standardized interfaces agents use to discover, query, and transact with a site. The five-point readiness check covers both, because the questions overlap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does my site need all six protocols?
&lt;/h3&gt;

&lt;p&gt;No. For 2026 decisions, three matter (MCP, A2A, UCP), and three are worth tracking but not building for yet (NLWeb, WebMCP, ACP). Most websites only need one or two of the build-for-now three. The five-point readiness check is the way to figure out which.&lt;/p&gt;

&lt;h3&gt;
  
  
  When can I skip this entirely?
&lt;/h3&gt;

&lt;p&gt;Sites with fewer than 50 indexed pages, sites in regulated verticals where agent transactions are not yet legal, and sites whose current content is not producing anything citable in the first place. If the structured-data and content-recency checks both fail, fix those first; the protocol questions can wait.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
      <category>seo</category>
    </item>
  </channel>
</rss>
