DEV Community: Kento IKEDA

Unit Prices Are Falling, So Why Are the Bills Going Up? Tokenomics for AI Platform Owners

Kento IKEDA — Thu, 25 Jun 2026 21:28:23 +0000

"Model unit prices keep falling, yet our monthly AI bill keeps climbing." If you use AI personally, you can feel the creep of your subscription and metered charges. If you own AI usage inside a company, the gap is even more pronounced.

Overseas, this feeling has started getting a name: Tokenomics. On June 3, 2026, the Linux Foundation announced its intent to launch the Tokenomics Foundation, dedicated to open standards for AI cost management. Google, Microsoft, Oracle, JPMorganChase, and others — both providers and large buyers — are on board.

https://www.linuxfoundation.org/press/linux-foundation-announces-the-intent-to-launch-the-tokenomics-foundation-to-establish-open-standards-for-ai-cost-management

This post isn't an explainer of the word itself. It's an account of what changes for the people who own internal generative AI usage — the platform owners, the FinOps practitioners, the engineering leaders watching the bills — once you have this word in your vocabulary.

What Tokenomics gives you isn't another saving technique. It changes the unit of measurement and the lens through which you read AI cost.

Why Tokenomics, why now

Tokenomics sits in the lineage of cloud FinOps. The FinOps Foundation now classifies Tokenomics as the "AI Value" dimension within FinOps for AI. Where cloud FinOps tracked the variable infrastructure costs (compute, storage, networking) against value, Tokenomics tracks the variable cost of intelligence itself. It's not a replacement; it adds a probabilistic, non-deterministic layer of variable cost on top.

Tokens here means what you see on every API price sheet and usage dashboard — the smallest unit a language model reads and writes, the unit of compute. The word "tokenomics" also exists in the crypto world, but that one is about issuance, distribution, and incentives on a blockchain — tokens as units of ownership. Same word, different economies.

https://www.finops.org/insights/token-economics-the-atomic-unit-of-ai-value/

The term gained traction from spring 2026 onward. Generative AI and agents moved from pilots to production, and tokens became the largest and fastest-growing line item in many technical budgets. Per-token prices fell, but usage volume rose even faster, and bills became harder to read. The Foundation launch is industry's response: a venue to align on a common yardstick for tokens, the way cloud costs were once aligned.

As a follow-on, the annual FinOps X conference will be renamed Tokenomicon starting 2027. The word is settling into its own institutional shape.

From here, four shifts in how a platform owner sees AI cost.

Shift 1: Budget on the trajectory of consumption, not on the unit price

The first thing to change is where you anchor your budget. Stop drawing comfort from "unit prices keep dropping" and start watching the trajectory of total consumption.

Per-million-token prices for general-purpose models fell sharply from 2023 to 2025. Recently they've plateaued, while the top-tier and reasoning models have actually gone up. Yet enterprise spending keeps growing. The reason is demand elasticity: when prices drop, organizations widen modalities (text → images → video), increase agent autonomy, and lengthen reasoning chains. The volume grows faster than the price falls.
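The elasticity argument is just multiplication, so it helps to see it as arithmetic. A minimal sketch with made-up numbers (the 50% price drop and 5x volume growth are illustrative assumptions, not figures from any provider):

```python
def monthly_bill(price_per_million_usd: float, tokens: float) -> float:
    """Bill = unit price x volume, where tokens is the monthly token count."""
    return price_per_million_usd * tokens / 1_000_000

# Illustrative assumption: the unit price halves while volume grows 5x.
before = monthly_bill(price_per_million_usd=10.0, tokens=2_000_000_000)
after = monthly_bill(price_per_million_usd=5.0, tokens=10_000_000_000)
growth = after / before

print(f"before=${before:,.0f} after=${after:,.0f} growth={growth:.1f}x")
```

With these inputs the bill grows even though the unit price fell, which is the whole point: the price term is only one of the two factors.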

The scale shows in numbers companies publish openly. At Google I/O 2026, Google announced monthly processing of 32 quadrillion tokens across its AI products, roughly 7x the 4.8 quadrillion of the previous year. AT&T reported scaling its internal "Ask AT&T" GenAI platform from about 8 billion tokens/day to about 27 billion tokens/day after restructuring orchestration into a multi-agent setup — 3x the volume at about 90% lower cost. The IEA noted that AI-related data center electricity demand grew about 50% in 2025 alone (against overall electricity demand growth of about 3%), and attributed the gap to a surge in AI usage (roughly 3x monthly active users and 5x revenue at major model providers).

What matters: consumption is not linear in user-visible activity. A single query that triggers a RAG pipeline, hits a reasoning model, and makes several tool calls can consume tens to hundreds of times the tokens of a direct prompt to a small model. Agent-to-agent communication is itself a cost. The research community has started calling this overhead "communication tax".

https://openreview.net/forum?id=0iLbiYYIpC

Breaking down where consumption accumulates, one request typically stacks up across five elements:

These multiply rather than add, which is why the total is unreadable from surface-level activity.

For a platform owner, the action is clear: stop projecting budgets from last quarter's actuals and price trendlines. Assume that any expansion of use case will spike consumption, and put the trajectory itself on the dashboard. Unit price is no longer the subject of the budget conversation. Total consumption is.

Shift 2: Treat tokens as an invisible cost category

The next shift is to see tokens as a hidden cost category and start watching it deliberately.

Cloud instances can be resized. Storage can be audited. Tokens lack that tactile feedback. They flow quietly through every agent loop, every retrieval call, every reasoning step, and pile up as a cost no one budgeted. This is the property the Tokenomics discussion keeps pointing at.

What amplifies the invisibility is metered billing hidden inside SaaS subscriptions. What looks like a flat monthly subscription to a developer tool or business app is, in reality, a token meter waiting to spin up. Roll out AI tools, and you can get bills the seat count can't explain. The examples are not hypothetical:

Cursor moved to usage-based pricing in June 2025. With long-context agent usage, effective spend ballooned by orders of magnitude for some users. On July 4, the CEO had to issue a public apology and offer refunds.

https://cursor.com/blog/june-2025-pricing

Kiro launched with a pricing model that charged spec and vibe requests at a 5:1 ratio, immediately drew criticism, and the company officially acknowledged a bug that caused requests to be over-consumed.

https://kiro.dev/blog/important-pricing-updates/

The common pattern: subscription prices no longer signal your budget. The seat fee is a floor. What you actually pay is determined by usage, not seat count.

What a platform owner should do first is finish visibility before reaching for optimization techniques. Build a state where you can break down — by model, by product, by team, by environment — who is consuming how much. Surface the tokens hiding inside SaaS, too. Without that foundation, the optimization conversation has nothing to stand on.
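As a rough sketch of what "break down who is consuming how much" can look like, assume each call is logged as a record tagged with team, model, and environment. The record shape and field names below are hypothetical, not any provider's export schema.

```python
from collections import defaultdict

# Hypothetical usage records. The field names are assumptions for illustration,
# not any provider's export schema.
USAGE = [
    {"team": "search", "model": "small", "env": "prod", "input_tokens": 1200, "output_tokens": 300},
    {"team": "search", "model": "large", "env": "prod", "input_tokens": 9000, "output_tokens": 2500},
    {"team": "support", "model": "large", "env": "staging", "input_tokens": 4000, "output_tokens": 1000},
    {"team": "support", "model": "small", "env": "prod", "input_tokens": 800, "output_tokens": 200},
]

def breakdown(records, *dims):
    """Sum input + output tokens per unique combination of the given dimensions."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in dims)
        totals[key] += record["input_tokens"] + record["output_tokens"]
    return dict(totals)

by_team = breakdown(USAGE, "team")
by_model_env = breakdown(USAGE, "model", "env")
```

The aggregation itself is trivial. The hard part is getting every call tagged consistently in the first place, which is why the tagging exercise tends to surface the organizational questions.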

Shift 3: Solve reduction by design, not by discipline

The third shift is in how you think about cost reduction. Reducing tokens isn't a matter of restraint; it's a design problem. And the levers from the supply side have arrived.

1. Model routing. Instead of sending every query to the top-tier model, route to the cheapest model that can still answer. FrugalGPT, an academic approach, tries smaller models first and only escalates when needed — reporting up to 98% cost reduction vs GPT-4. RouteLLM (UC Berkeley) reports up to 85% cost reduction while preserving conversational quality. Amazon Bedrock offers this as a managed service (intelligent prompt routing) with up to 30% reduction officially advertised. Routing is no longer research-only; it's a real option from both research and managed services.

https://arxiv.org/abs/2305.05176

https://arxiv.org/abs/2406.18665

https://aws.amazon.com/bedrock/intelligent-prompt-routing/
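The "try the cheap model first, escalate only when needed" idea can be sketched as a simple cascade. This is a toy under stated assumptions, not FrugalGPT's or RouteLLM's actual method: the models are stub functions, the confidence score is invented, and the costs are arbitrary integer units.

```python
from typing import Callable, List, Tuple

# (name, cost_per_call, answer_fn). answer_fn returns (answer, confidence).
# All of these are hypothetical stand-ins for real model calls and scorers.
Model = Tuple[str, int, Callable[[str], Tuple[str, float]]]

def cascade(query: str, models: List[Model], threshold: float = 0.8):
    """Call models from cheapest to most expensive; stop at the first
    answer whose confidence clears the threshold."""
    if not models:
        raise ValueError("at least one model is required")
    spent = 0
    for name, cost, answer_fn in models:
        answer, confidence = answer_fn(query)
        spent += cost
        if confidence >= threshold:
            break
    # If nothing cleared the threshold, the last (strongest) answer is returned.
    return name, answer, spent

def small_model(query: str):
    # Stub: pretend the small model is only confident on short queries.
    return "short answer", 0.9 if len(query) < 40 else 0.3

def large_model(query: str):
    return "detailed answer", 0.95

MODELS = [("small", 1, small_model), ("large", 30, large_model)]

easy = cascade("What is 2 + 2?", MODELS)
hard = cascade("Compare three migration strategies for a sharded database.", MODELS)
```

Note that an escalated query pays for both calls, so a cascade only saves money when most traffic stops at the cheap tier.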

2. Tool calls as code. Hand an agent a list of tool definitions and the definitions ride in the context every turn. Cloudflare's "Code Mode" has the agent write code that calls the tools instead. They report compressing the tool definitions of an MCP server exposing 2,500 APIs from about 1.17M tokens to about 1,000 tokens — 99.9% compression. Anthropic independently presented the same pattern as "Code Execution with MCP." This isn't a vendor-specific quirk anymore.

https://blog.cloudflare.com/code-mode-mcp/

https://www.anthropic.com/engineering/code-execution-with-mcp

3. Context compression. In a RAG pipeline, only a small fraction of the retrieved text contributes to the answer; the rest is noise that wastes tokens. If you prune it, you cut the tokens the LLM sees. Zilliz, a vector database vendor, reports 70–80% token reduction by sentence-level relevance filtering that drops weakly related sentences.

https://milvus.io/blog/semantic-highlighting-model-for-rag-context-pruning-and-token-saving.md
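To make sentence-level pruning concrete, here is a deliberately crude sketch. It scores sentences by word overlap with the question, which is only a stand-in for the learned relevance model the Zilliz post describes; the stopword list and the overlap threshold are my own assumptions.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"what", "is", "the", "a", "an", "of", "to", "in", "at", "are"}

def words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prune_context(question: str, passage: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap content words
    with the question; drop the rest before the text reaches the LLM."""
    query_words = words(question) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    kept = [s for s in sentences if len(query_words & words(s)) >= min_overlap]
    return " ".join(kept)

PASSAGE = (
    "Our office is in Osaka. "
    "Refund requests are accepted within 30 days. "
    "The cafeteria opens at noon."
)
pruned = prune_context("What is the refund policy?", PASSAGE)
```

Word overlap misses paraphrases and synonyms, so treat this only as a picture of where the filter sits in the pipeline, not as a relevance method worth shipping.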

4. Data format choice. The serialization format you hand the LLM directly affects token volume. Microsoft's Data Science engineering blog shows that function-calling-based structured output is more token-efficient than free-form JSON for the same result. For tabular data, CSV/TSV or newer LLM-oriented formats like TOON can use 30–60% fewer tokens than JSON. Data format is a functional decision and a cost decision at the same time.

https://medium.com/data-science-at-microsoft/token-efficiency-with-structured-output-from-language-models-be2e51d3d9d5
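A quick way to get a feel for the format effect on tabular data is to serialize the same rows both ways and compare sizes. The sketch below uses character count as a rough proxy; real token counts depend on the model's tokenizer, so measure with that before drawing conclusions. The rows are made up.

```python
import csv
import io
import json

# Made-up tabular data for illustration.
ROWS = [
    {"id": 1, "name": "alpha", "status": "ok"},
    {"id": 2, "name": "beta", "status": "failed"},
    {"id": 3, "name": "gamma", "status": "ok"},
]

def as_json(rows) -> str:
    return json.dumps(rows)

def as_csv(rows) -> str:
    """Write the header once, then one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

json_chars = len(as_json(ROWS))
csv_chars = len(as_csv(ROWS))
print(f"JSON: {json_chars} chars, CSV: {csv_chars} chars")
```

The saving comes from JSON repeating every key on every row while CSV states the keys once in the header, so the gap widens as the row count grows.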

Lining these up by reported savings and ease of adoption (difficulty is a rough indicator):

Lever	Reported reduction	Adoption difficulty
Data format choice	30–60% vs JSON	Low
Model routing	up to 98% (FrugalGPT), 85% (RouteLLM)	Medium
Context compression	70–80%	Medium
Tool calls as code (Code Mode)	~99.9% on MCP definitions	Medium–High

For a platform owner, the takeaway is that savings opportunities live in design, not in operations. Most of these can be set as organizational policy: pick a default output format, install routing, decide how tools are exposed. The move is not "try harder" at the team level but "decide the standard" at the platform level. Of the four, choosing a default output format is probably the lowest-friction starting point.

Shift 4: Measure by outcome, not by volume

The last shift is in what you measure. Move from raw consumption to cost per outcome.

Counting tokens as if they were uniform misses something real. Tokens spent on a retry due to insufficient quality versus tokens in a first-shot usable response carry the same cost but different value. Tokens an agent burns going in circles look like tokens but don't translate into outcome. LLM inference research has a name for this: goodput — the throughput that meets your SLOs (latency, quality targets). Benchmarks like SemiAnalysis's InferenceX have adopted this view. What an enterprise actually buys isn't raw token volume but the usable-output portion of it.

https://bentoml.com/llm/inference-optimization/llm-inference-metrics

https://inferencex.semianalysis.com/

When you only chase volume, cost judgment goes off. What you should be watching is the fraction of tokens that yielded usable results (the yield after retries and quality misses) and cost per inference / per workflow / per outcome.
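Both metrics fall out of a simple attempt log. A minimal sketch, assuming each attempt records its workflow, its token count, and whether the result was usable; the record shape and the $10 per million tokens price are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Attempt:
    workflow_id: str
    tokens: int
    usable: bool  # did this attempt produce a result someone could use?

def token_yield(attempts: List[Attempt]) -> float:
    """Fraction of all tokens that went into usable results."""
    total = sum(a.tokens for a in attempts)
    usable = sum(a.tokens for a in attempts if a.usable)
    return usable / total

def cost_per_outcome(attempts: List[Attempt], usd_per_million: float) -> float:
    """Total spend (retries and misses included) divided by the number of
    workflows that ended with at least one usable result."""
    total_cost = sum(a.tokens for a in attempts) * usd_per_million / 1_000_000
    outcomes = len({a.workflow_id for a in attempts if a.usable})
    return total_cost / outcomes

LOG = [
    Attempt("w1", 1000, False),  # retry needed
    Attempt("w1", 1000, True),
    Attempt("w2", 2000, True),   # first-shot success
    Attempt("w3", 4000, False),  # agent went in circles, no outcome
]
yield_ratio = token_yield(LOG)
usd_each = cost_per_outcome(LOG, usd_per_million=10.0)
```

In this log, half the tokens belong to a workflow that never produced anything. A volume dashboard would show w3 as the heaviest user; a yield view shows it as the largest waste.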

What matters most for a platform owner is keeping the balance between volume and value. Using 10x the tokens for 100x the value is economically right. Cutting tokens to a tenth and getting unusable output is not a saving. Conversely, token spend that doesn't translate into value is plain waste: verbose system prompts, oversized contexts, overuse of expensive models, tool design that ships full documents when MCP could extract only what's needed. There's also an organizational failure mode — using token usage itself as a performance metric encourages meaningless AI use just to game the number, as several reports have documented. Cost-per-outcome as the indicator prevents both directions of failure: the cost-cutting order that kills quality, and the value-disconnected consumption that gets ignored.

What the four shifts share

The four shifts look distinct, but they collapse into two underlying moves.

The first is changing what unit you look at. From unit price to consumption trajectory (Shift 1). From token volume to cost per outcome (Shift 4). Both reset the meter.

The second is making it visible, then putting your hands on it. Token spend hides inside SaaS and variable cost, so visibility is the prerequisite (Shift 2). Once visible, design levers — not team effort — drive the reduction (Shift 3).

Changing how you measure without acting changes nothing. Acting without changing how you measure tends to overshoot, killing quality in the name of savings. Each half alone falls short. When both arrive, AI cost shifts from something you watch by intuition to something you operate on evidence.

Pushback worth pre-empting

Four objections are worth addressing up front.

Isn't this just FinOps for AI? Largely yes. The FinOps Foundation itself positions Tokenomics within FinOps for AI, specifically in the "AI Value" topic. Tokenomics is not a new methodology; it's a chunk of FinOps for AI with its own name. That said, getting a proper name and an institutional vessel does something on its own. It doesn't mean cross-team discussion and cross-vendor comparison suddenly work — internal vocabulary takes time to spread, and shared data formats need adoption. But laying the foundation for a shared language is itself worth tracking. Think of it less as a new technique and more as infrastructure for agreement starting to form.

https://www.finops.org/topic/ai-value/

Doesn't Tokenomics narrow vision down to just tokens? A real concern. Tokens are the most measurable layer of AI cost. Beneath that sit SaaS-embedded variable costs and operational/governance costs. If you self-host models, you also carry GPU/compute/storage, data transfer, and training costs underneath.

Tokens get the spotlight because they're growing fastest, hiding hardest, and have the most-formed vocabulary. A reasonable starting point — not the whole story. Worth holding that distinction.

We don't use that many tokens. Possibly true. Possibly just invisible. The SaaS-embedded portion shows up as a flat monthly fee or a rolled-up invoice, not as itemized token usage. You can only tell "don't use" from "don't see" once usage is made visible. Building visibility while scale is small beats chasing it after the bill explodes.

Unit prices keep falling — why not just wait? Falling prices apply mostly to general-purpose models. Top-tier and reasoning models are a different story. Industry estimates consistently put agent-style workloads at 5–30x the token consumption of the same task in chat form. The lower-tier price drops get swallowed by the upper-tier consumption growth. Waiting works less well as your usage shifts toward the upper tiers.

https://www.bigeye.com/blog/how-to-track-ai-agent-costs-and-token-usage

https://arxiv.org/abs/2604.22750

Where to start

No universal recipe. The first step varies with maturity and with which layer (direct API use, SaaS-embedded, self-hosting) your AI usage sits on. Still, a common order exists.

Start with visibility. Before optimization techniques, build the state where you can break down — by user, model, product, environment — who's consuming how much. Without this, every later judgment is a guess. The tagging exercise itself raises questions worth surfacing: prod vs. staging splits, product and team boundaries, cost allocation logic that everyone can stomach. The setup work doubles as an on-ramp for FinOps awareness inside the organization.

Next, audit billing models. For each AI-bearing SaaS and API in use, lay out the floor (the recurring portion) and the variable behavior. Once you suspend the "subscription = fixed cost" assumption, the location of budget risk looks different. Provider-side moves matter too — for example, Anthropic's April 2026 pricing structure change. Decisions about extending the recurring footprint and managing variable-cost blow-up become separate agenda items.

Then set design levers as policy. The default output format, routing, how tools are exposed. Don't leave it to the field; pick the standard from the platform. As Shift 3 noted, the default output format is the lightest place to start exercising platform authority.

Finally, push the metric from volume toward outcome. Watching cost per outcome and token yield keeps the cost-cutting order from killing quality. It also blocks the gaming pattern where token usage as a KPI breeds meaningless AI use, as Shift 4 noted. The metric step comes last, but how you align it determines how well the previous three actually deliver.

Tokenomics isn't a new saving trick. It's a guide line for reading AI cost as an economy, that is, as the relationship between volume and value. With the word settling into shared use overseas, adopting the lens early, while you own AI usage inside your organization, is itself the first step.

What platform owners will be asked for going forward is exactly this kind of attention: not fixating on per-token price moves, but reading the relationship between volume and value.

AWS WAF Brought Back 402 Payment Required for the Agent Economy

Kento IKEDA — Sat, 20 Jun 2026 21:15:57 +0000

AWS WAF added a feature called AI traffic monetization. It lets you charge AI bots that hit your site or API for access to your content. Until now, dealing with AI crawlers meant a binary choice, allow or block. This adds a third path: take payment and let them through.

I won't go deep into the feature itself here, meaning how a content owner sets up monetization. The official announcement covers that.

https://aws.amazon.com/blogs/aws/aws-waf-adds-ai-traffic-monetization-capability-to-help-content-owners-charge-ai-bots-for-content-access/

What I want to look at is the other side. A story about the side defending content is, at the same time, a story about the side reading it: the side running autonomous agents. HTTP status code 402 was reserved for years and rarely used. Its becoming practical means the cost structure of an agent reading the web is starting to change. For the payment protocol itself, I'll work from the x402 spec.

The "let them read, recoup later" business model that AI broke

To set the stage: AI bot traffic is now too large to ignore. According to the AWS blog, for many content providers AI bots make up more than half of web traffic, and AI-specific crawlers have grown more than threefold year over year.

Traditional search crawlers built an index and sent readers back from the search results. AI bots, by contrast, use the same content to generate summaries and answers, and return almost no readers to the original site. The provider pays the infrastructure cost of serving the traffic while getting neither page views nor ad impressions. This asymmetry is where the idea of charging starts.

That is the view from the side defending content.

What Monetize returns with 402, and two constraints

First, the mechanics, just the essentials.

Monetization runs as part of AWS WAF's Bot Control. For a request judged to be a bot, you assign one of six actions (Monetize / Allow / Block / Count / CAPTCHA / Challenge); assign Monetize, and a request without payment receives an HTTP 402 Payment Required. Along with the 402 comes a manifest, written in a machine-parsable form, stating the price per request (in USDC), the accepted payment networks (Base and Solana), the destination address, and more. That format is x402, an open protocol Coinbase proposed for program-to-program payments.

https://x402.org

Two constraints are worth noting. First, the Monetize action is only available on web ACLs associated with Amazon CloudFront. Second, payment verification and on-chain settlement happen synchronously, inside the request path. The official docs state this explicitly as "synchronously," and the content is returned only after settlement is confirmed. That latency falls only on requests that involve payment; requests without payment are unaffected.

https://docs.aws.amazon.com/waf/latest/developerguide/waf-ai-traffic-monetization-how-it-works.html

Reading x402's payment internals from the source

x402 was proposed by Coinbase and is now developed as an open standard, with the repository moved to the x402 Foundation. Both the spec and the implementation are public on GitHub, so you can read directly what the AWS docs alone don't show. The official docs take you as far as the overall flow; beyond that, only the spec has it. So let's read the source and trace, in order, what goes into the 402, how it is signed, and where it is sent.

https://github.com/x402-foundation/x402

The 402 body is a manifest of payment terms

When a 402 Payment Required comes back, in x402 v2 the payment terms ride in the PAYMENT-REQUIRED header, Base64-encoded. Decode it and you get the object the spec calls PaymentRequired. It holds an array of payment methods the provider accepts (accepts), and each element declares which scheme, on which network, which currency, how much, and to which destination you pay to get through.

Here is what one looks like. This is a single element of the accepts array, taken from the v2 spec's example.

{
  "scheme": "exact",
  "network": "eip155:84532",
  "amount": "10000",
  "asset": "0x036CbD53842c5426634e7929541eC2318f3dCF7e",
  "payTo": "0x209693Bc6afc0C5328bA36FaF03C514EF312287C",
  "maxTimeoutSeconds": 60,
  "extra": { "name": "USDC", "version": "2" }
}

The fields mean:

scheme (the payment method, e.g. exact)
network (a chain ID such as eip155:84532)
amount (the amount, as an integer in the smallest unit; the 10000 here is 0.01 USD in USDC)
asset (the currency's contract address)
payTo (the receiving address)
maxTimeoutSeconds (the payment's expiry)

The paying side picks one set of terms it can pay from this array and builds the payment. Multiple networks or currencies may be listed, and which to choose is left to the payer.
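From the payer's side, handling the manifest comes down to three small steps: decode the header, pick an offer you can pay, and convert the amount into something a human can read. A minimal sketch under stated assumptions: the manifest is built locally rather than fetched, the first offer (Base mainnet, chain ID 8453) is hypothetical, only the accepts array is modeled, and the 6-decimal conversion applies to USDC specifically.

```python
import base64
import json
from decimal import Decimal

def decode_payment_required(header_value: str) -> dict:
    """Decode the Base64 PAYMENT-REQUIRED header into a dict."""
    return json.loads(base64.b64decode(header_value))

def choose_terms(payment_required: dict, supported_networks: set, max_amount: int):
    """Return the first offer on a supported network within the cap, else None."""
    for terms in payment_required["accepts"]:
        if terms["network"] in supported_networks and int(terms["amount"]) <= max_amount:
            return terms
    return None

def to_display_units(amount: str, decimals: int = 6) -> Decimal:
    """Smallest-unit integer string to a decimal amount (USDC uses 6 decimals)."""
    return Decimal(amount) / (Decimal(10) ** decimals)

# Locally built manifest for the demo. The first offer is hypothetical;
# the second reuses the values from the spec example above.
MANIFEST = {"accepts": [
    {"scheme": "exact", "network": "eip155:8453", "amount": "50000",
     "payTo": "0x209693Bc6afc0C5328bA36FaF03C514EF312287C"},
    {"scheme": "exact", "network": "eip155:84532", "amount": "10000",
     "payTo": "0x209693Bc6afc0C5328bA36FaF03C514EF312287C"},
]}
header = base64.b64encode(json.dumps(MANIFEST).encode()).decode()

decoded = decode_payment_required(header)
chosen = choose_terms(decoded, supported_networks={"eip155:84532"}, max_amount=20_000)
price = to_display_units(chosen["amount"])
```

Returning None when nothing fits is deliberate: "no acceptable offer" is a normal outcome the caller has to handle, not an exception.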

The payment is a signed authorization, not the transfer itself

What the payer creates is a PaymentPayload object, made of a signature and an authorization. The authorization holds: from whom (from), to whom (to), how much (value), the validity window (validAfter / validBefore), and a single-use value to prevent reuse (nonce).

{
  "signature": "0x2d6a7588...148b571c",
  "authorization": {
    "from": "0x857b06519E91e3A54538791bDbb0E22373e36b66",
    "to": "0x209693Bc6afc0C5328bA36FaF03C514EF312287C",
    "value": "10000",
    "validAfter": "1740672089",
    "validBefore": "1740672154",
    "nonce": "0xf3746613...62f13480"
  }
}

The values are all from the v2 spec's example. In response to the earlier accepts (the payment request), the payer signs and returns this. It is not the execution of a transfer; it is a signed permission to pull this amount under these terms. It uses EIP-3009's transferWithAuthorization, and the fact that the amount, the expiry, and the single-use nonce are bound by the signature is what matters for the budgeting design later.

https://eips.ethereum.org/EIPS/eip-3009

The header names changed between v1 and v2

x402 has a v1 and a v2, and how the payment-exchange headers work changed in v2. It is a difference worth knowing if you touch the implementation.

Purpose	v1	v2
402 payment request	response body	PAYMENT-REQUIRED
client's payment	X-PAYMENT	PAYMENT-SIGNATURE
settlement result	X-PAYMENT-RESPONSE	PAYMENT-RESPONSE

In v1, only the payment request came back in the response body, while the payment and the result were headers. v2 organizes all three into headers, and the spec states that x402 carries all of its protocol information in headers, leaving the response body to the server implementation. The reason AWS's docs mention a payment-signature header is that it follows v2.

The facilitator handles verification and settlement but holds no funds

So that the provider does not have to implement on-chain verification and transfer itself, an intermediary service called a facilitator exists. It has two roles: verify (check that the signed payment meets the terms) and settle (send it to the blockchain and wait for confirmation). What the spec states explicitly is that the facilitator holds no funds, that is, it is not a custodian. It only receives the signed authorization and carries out verification and execution on the provider's behalf. Facilitators are already running in production on several networks including Solana, Polygon, and Base.

https://www.x402.org/ecosystem

Tracing the flow briefly from the payer's side:

Send the request
A 402 with PAYMENT-REQUIRED comes back
Pick one set of terms from accepts
Resend with the signed authorization in PAYMENT-SIGNATURE
The provider verifies via the facilitator's verify, fetches the content, confirms with settle, and returns it with PAYMENT-RESPONSE

What AWS WAF handles is the provider side of this flow.

For that settle, AWS WAF uses the x402 facilitator provided by Coinbase Developer Platform. Note that if the origin returns 4xx or 5xx, no settlement runs and the payer is not charged.
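The payer's half of this flow fits in a few lines once the transport and the signer are injected. The sketch below simulates it against a fake in-process server. Everything here is an assumption for illustration: the fake server, the placeholder signature string, and the naive "take the first offer" choice. Real HTTP header names are case-insensitive, and a real signer would produce an EIP-3009 authorization with a wallet library.

```python
import base64
import json
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]      # (status, headers, body)
Send = Callable[[Dict[str, str]], Response]     # request headers -> response

def fetch_with_payment(send: Send, sign: Callable[[dict], str]) -> Response:
    """Request once; on 402, decode the terms, sign, and resend once."""
    status, headers, body = send({})
    if status != 402:
        return status, headers, body
    required = json.loads(base64.b64decode(headers["PAYMENT-REQUIRED"]))
    terms = required["accepts"][0]  # naive choice: first offer
    return send({"PAYMENT-SIGNATURE": sign(terms)})

def fake_server(request_headers: Dict[str, str]) -> Response:
    """Stand-in for a monetized endpoint; it verifies nothing."""
    if "PAYMENT-SIGNATURE" not in request_headers:
        manifest = {"accepts": [{"scheme": "exact", "amount": "10000"}]}
        encoded = base64.b64encode(json.dumps(manifest).encode()).decode()
        return 402, {"PAYMENT-REQUIRED": encoded}, ""
    return 200, {"PAYMENT-RESPONSE": "settled"}, "content"

def fake_sign(terms: dict) -> str:
    # Placeholder only; not a real signature.
    return "signed:" + terms["amount"]

result = fetch_with_payment(fake_server, fake_sign)
```

Note that this version pays automatically inside the fetch, so the caller never sees the 402. Whether that is what you want is exactly the layering question a payer has to settle.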

The agent runtime becomes a payment actor

A 402 coming back presupposes an agent that receives it, pays, and goes back for the content.

The flow just traced can be run almost autonomously by the paying side's agent runtime. The runtime handles receiving the 402 and resending with the signed authorization in PAYMENT-SIGNATURE, and the provider-side verification through settlement completes synchronously, inside the request path.

This is a new design concern for whoever runs the agent. Until now, the cost of an autonomous agent fetching the web was mostly just its own token usage and bandwidth. From here on, if the fetched destination is monetized, the content itself may carry a cost. The agent's runtime becomes a payment actor at the same time as being an HTTP client. That is the structural change.

If you run your own agent, this change brings in two new design decisions: whether to treat a 402 from a fetch destination as a failure or to pay and get through, and, if paying, with which wallet and up to how much automatically. The payment path now demands the same kind of decision that the auth path already does.

You might feel, "I am not crawling at scale, so this does not concern me." But this is a matter of premise, not scale. The moment even one destination returns a 402, the agent's design grows a branch for how to handle it. As more providers monetize, 402 will inevitably join the list of reasons a fetch fails. Better to hold the assumption ahead of time than to scramble later over whether to give up, pay, or route around.

Three things the payer's harness needs to decide

Once you understand the spec, the payer's design points get concrete. There are roughly three.

First, where to put the branch after receiving a 402. When the agent's HTTP client receives a 402, does it return it as a failure up the stack, or enter the payment path on the spot? If you pay automatically right inside the fetch, the agent does not have to be aware of the 402, but where and how much you paid becomes hard to see. Conversely, if you bring it up to the planning layer where the agent decides its next action, you can make the payment decision in the context of the task, but the branches scatter across the call paths. The familiar question of which layer attaches the auth token now has a counterpart for payments.

Second, how to cap the budget. This is where x402's signature design pays off. The authorization contains value (the amount), validBefore (the expiry), and nonce (single-use), fixed by the signature. So for each payment, you can decide the cap and the expiry at the moment of signing. Give the harness rules like "up to this much per request," "up to this cumulative amount per task," and "invalid past this time," then fold them into generating the authorization, and you prevent a runaway agent from paying without limit, by design. Being able to embed the cap in the signed authorization itself, rather than hard-coding it somewhere, is the strength of this design.
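Here is one way those harness rules could fold into authorization generation. It is a sketch, not an x402 SDK: the PaymentBudget class is hypothetical, signing is omitted entirely, and the cumulative spend is reserved at authorization time, which is a conservative choice of mine since a signed authorization may never be settled.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaymentBudget:
    per_request_cap: int    # smallest units, e.g. 10000 = 0.01 USDC
    per_task_cap: int       # cumulative cap for one task
    validity_seconds: int   # how long a signed authorization stays valid
    spent: int = 0

    def authorize(self, terms: dict, payer: str, now: Optional[int] = None) -> dict:
        """Build the unsigned authorization fields, or refuse if a cap is hit."""
        amount = int(terms["amount"])
        if amount > self.per_request_cap:
            raise PermissionError("amount exceeds per-request cap")
        if self.spent + amount > self.per_task_cap:
            raise PermissionError("amount exceeds per-task cap")
        now = int(time.time()) if now is None else now
        # Never promise a longer window than the provider's own timeout.
        window = min(self.validity_seconds,
                     int(terms.get("maxTimeoutSeconds", self.validity_seconds)))
        self.spent += amount  # reserve at signing time (conservative)
        return {
            "from": payer,
            "to": terms["payTo"],
            "value": str(amount),
            "validAfter": str(now),
            "validBefore": str(now + window),
            "nonce": "0x" + secrets.token_hex(32),  # random 32-byte single-use value
        }

TERMS = {"amount": "10000", "maxTimeoutSeconds": 60,
         "payTo": "0x209693Bc6afc0C5328bA36FaF03C514EF312287C"}
budget = PaymentBudget(per_request_cap=20_000, per_task_cap=15_000, validity_seconds=30)
auth = budget.authorize(TERMS, payer="0x857b06519E91e3A54538791bDbb0E22373e36b66", now=1_000)
```

A second call with the same terms would push the task total to 20000 and be refused. The refusal happens before anything is signed, so a runaway loop cannot produce a payable authorization.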

Third, verification before going to production. You should not run real currency through payment code on its first execution. x402 supports test networks (the eip155:84532 from the earlier example is Base Sepolia, a testnet), so you can run the whole flow, from the 402 returning to the authorization being created and verified, without moving real assets. Running through this once before building the payment feature into your implementation makes the first production rollout far less nerve-racking.

At the design stage, you consider these three: which layer receives the 402, whose wallet and how much to cap, and whether to run it through a testnet. No implementation yet. Just recognizing the premise has changed, and deciding where the judgment lives, means you will not scramble as more destinations to pay appear.

The gap robots.txt couldn't close

One more thing worth seeing from the payer's side is the relationship to robots.txt. robots.txt is only a gentleman's agreement; it had no force to stop crawlers that do not comply. There was no middle ground between allow and block, and blocking shuts out even the welcome bots that bring readers in via AI search. A 402, a mechanism that does have force, is moving to fill this gap.

Put from the payer's side: part of the web where "if you went to read it, you could read it" may turn into a web where "if you go to read it, you are asked for payment." According to the AWS blog, AWS WAF classifies over 650 types of AI bots and agents, splitting them into verified, confirmable by cryptographic signature or published IP range, and unverified. The provider can serve them differently, letting verified search crawlers through for free while setting a high unit price for unverified ones. In other words, if your own agent is classified as unverified, it may end up on the side asked for a relatively high price.

Where stablecoin settlement stands today

Next, the payment method. Payment today is centered on stablecoins like USDC. The receiving side specifies a wallet address on a network such as Base or Solana, and AWS neither intermediates the payment nor takes a fee. Converting the received stablecoins into fiat and moving them to a bank account is, officially, something you manage yourself or leave to your wallet provider. AWS WAF's role ends at receiving the payment; the conversion and bookkeeping beyond that are on you. It is not yet at a stage where it drops straight into an existing accounting flow.

Still, what matters for practice in the near term is a different point: before going to production, you can try the payment flow in test mode without running real currency. As a stage before putting a payment-involved feature into your implementation, this is a welcome design.

The industry's move toward monetization

A similar mechanism, Cloudflare's Pay Per Crawl, came first. With AWS following, one reading is that payment mechanisms are splitting per cloud, with each vendor using them for its own lock-in.

https://developers.cloudflare.com/ai-crawl-control/features/pay-per-crawl/what-is-pay-per-crawl/

But the fact that two companies moved in the same direction, toward program-to-program payment using 402, looks less like each vendor's individual play and more like a sign the industry is starting to lean toward web reading becoming monetizable. x402 as a protocol is not closed to any one cloud. If so, the question moves from "which cloud to use" to "which agent runtime supports this payment flow first." Seen as an AWS WAF feature it is one company's story, but stand on the side that has agents read the web and pay, and it looks like part of a wider industry shift.

Reading the source further, you find x402's scope is wider than web reading. x402's transports are defined not only for HTTP but also for MCP (Model Context Protocol) and A2A (Agent-to-Agent Protocol). In the MCP transport, when an agent calls a paid tool, the result comes back as an error first, and resending with payment executes it. In the A2A transport, one agent pays for another agent's service. In the spec's words, agents monetize their own services through on-chain cryptocurrency payments.

So AWS WAF's AI charging is x402 implemented on just one face: reading content over HTTP. Underneath it lies the idea of a payment layer in which agents pay with the same 402 and the same signature, whether they are calling tools or trading with each other. The better reading is that 402 was brought back less to monetize the web than to give agents a common payment endpoint for acting as economic actors.

Where to start

If you hold content, the first step is not setting up charging but looking, in the AI traffic analytics dashboard, at the proportion of AI bots coming to your site. If you have a site served via CloudFront, make the current state visible first and gather the material to judge which of block, charge, or allow fits. You do not need to enable Monetize right away.

If you run agents, you will want to check whether, among the destinations your agent currently fetches, there is content likely to start returning 402. Then leave even a single line in your harness's design notes on the policy: whether to pay automatically or not, and if paying, where to set the cap. No implementation yet. Just recognizing that the premise has changed will change your later decisions.
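The design note itself can stay a single line. If it helps to see what that line eventually has to pin down, here is a minimal sketch of a pay-or-skip policy. The PaymentPolicy fields, the decide function, and the amounts are hypothetical and not part of x402 or any agent runtime.

```typescript
// Hypothetical policy shape; field names are illustrative, not from any SDK.
interface PaymentPolicy {
  autoPay: boolean;      // pay without asking a human?
  maxPerRequest: number; // cap for a single fetch
  maxPerRun: number;     // cap for one agent run
}

type Decision = "pay" | "skip";

// Decide whether to answer a 402 with a payment, given what this run has already spent.
function decide(policy: PaymentPolicy, price: number, spentSoFar: number): Decision {
  if (!policy.autoPay) return "skip";
  if (price > policy.maxPerRequest) return "skip";
  if (spentSoFar + price > policy.maxPerRun) return "skip";
  return "pay";
}

const policy: PaymentPolicy = { autoPay: true, maxPerRequest: 10, maxPerRun: 25 };
console.log(decide(policy, 5, 0));   // within both caps
console.log(decide(policy, 12, 0));  // over the per-request cap
console.log(decide(policy, 10, 20)); // would exceed the per-run cap
```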

402, a status code that lay dormant for a long time, has resurfaced as the payment layer of the agent economy. You can read it as a story about defending content, but for the side that has agents read the web, the same event adds one more cost premise. The view changes with the side you stand on. Either way, you are standing at the threshold where reading the web may no longer be free.

What is AWS Blocks? How it differs from Amplify and App Studio, and what each one is aiming for

Kento IKEDA — Fri, 19 Jun 2026 22:12:14 +0000

On June 16, 2026, AWS announced AWS Blocks as a public preview. It is an open-source framework where the TypeScript you write for your backend becomes the AWS infrastructure that runs it.

https://github.com/aws-devtools-labs/aws-blocks

The first thing many AWS users will think is "another full-stack tool." If you want a tool for frontend developers to build full-stack apps in TypeScript, Amplify Gen2 already exists. For people who don't write code, there is App Studio. I compared those three earlier in a write-up on App Studio, Amplify Gen1, and Amplify Gen2. Now Blocks joins them.

This article organizes what AWS Blocks is from the official docs and the open-source code, lines it up against Amplify and App Studio, and finally sketches a map of what each one is aiming for. Rather than stopping at a feature-diff table, I want to get at why AWS is offering multiple entry points into full-stack development.

A note: this is based on the official docs right after the preview announcement and on reading the open-source code. It is not based on long-term production use of Blocks. Read it with the understanding that implementation details may change.

What is AWS Blocks

Here is how the official docs define it. AWS Blocks is a backend toolkit for full-stack applications, where each Block is a self-contained backend capability that bundles the application code, a local development setup, and the infrastructure to run it. Pick the Blocks you need and compose them, and the infrastructure following AWS best practices is defined automatically.

Type safety doesn't stop inside the backend. Types flow from the backend all the way to the client, reaching web frameworks (Next.js, Nuxt, Astro, React, Vue, Svelte, Angular) and native targets (Swift, Kotlin, Dart/Flutter). From a single backend, you can generate typed client code for both web and mobile.

https://github.com/aws-devtools-labs/aws-blocks

That said, at the time of the preview announcement the frontends officially listed are SPAs (Vite + React) and SSR frameworks (Next.js, Nuxt, Astro), with support expected to widen over time. Blocks itself adds no extra charge; you pay only for the AWS services you use, and you can deploy to all commercial AWS Regions.

https://aws.amazon.com/about-aws/whats-new/2026/06/aws-blocks-preview/

From here, let's walk through the concepts you can't skip to understand Blocks.

Block = an npm package bundling infrastructure, runtime, and local implementation

One Block is one npm package, holding the cloud resources, runtime code, and local implementation for a single capability. Instantiate one KVStore, for example, and you get all at once: a DynamoDB table auto-provisioned at deploy time, runtime code that runs on Lambda, and an in-memory implementation for local development.

https://docs.aws.amazon.com/blocks/latest/devguide/concepts.html

The available Blocks cover most backend needs: data (KVStore, DistributedTable, Database, DistributedDatabase, FileBucket), auth (AuthBasic, AuthCognito, AuthOIDC), async work (AsyncJob, CronJob), AI (Agent, KnowledgeBase), communication (Realtime, EmailClient), configuration (AppSetting), observability (Logger, Metrics, Tracer, Dashboard), and hosting (Hosting).

https://docs.aws.amazon.com/blocks/latest/devguide/what-is-blocks.html

What the official overview page doesn't give you, though, is a list of which AWS service each Block actually becomes inside. I was curious, so I read the open-source code (aws-devtools-labs/aws-blocks) to confirm the real services behind the main Blocks.

Block	AWS service inside
`KVStore` / `DistributedTable`	DynamoDB
`Database`	Aurora Serverless v2 (PostgreSQL 16.4)
`DistributedDatabase`	Aurora DSQL
`FileBucket`	S3
`AuthBasic`	no dedicated infrastructure
`AuthCognito` / `AuthOIDC`	Cognito
`AsyncJob`	SQS + Lambda
`CronJob`	EventBridge Scheduler
`Agent`	Strands Agents SDK + Bedrock
`KnowledgeBase`	Bedrock + S3 Vectors
`Realtime`	API Gateway v2 (WebSocket)
`EmailClient`	SES
`AppSetting`	SSM Parameter Store
`Logger` / `Metrics` / `Tracer` / `Dashboard`	CloudWatch
`Hosting`	CloudFront + S3 + WAF + Route 53 + ACM

A few findings the overview alone doesn't reveal:

DistributedDatabase is Aurora DSQL, so DSQL's own constraints surface directly in the development experience. You can't mix DDL and DML in the same transaction, and only one DDL statement is allowed per transaction. Blocks rejects these on the client side with validation.

KnowledgeBase uses S3 Vectors, which arrived in 2025, as the vector store for RAG.

Agent sits on top of Strands Agents SDK, AWS's open-source agent framework.

Hosting is less "CloudFront + S3" and more a heavier part that also bundles WAF, Route 53, and ACM.
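The two DSQL rules mentioned above are simple enough to check before a transaction is sent. The sketch below is an assumption about what such a client-side check could look like, not Blocks' actual validation: validateTransaction is a hypothetical name, statements are classified by their leading keyword only, and anything that is not CREATE, ALTER, or DROP is treated as DML.

```typescript
// Rough classification by leading keyword; a real validator would need a SQL parser.
const DDL_PATTERN = /^\s*(create|alter|drop)\b/i;

// Returns an error message, or null when the statement list is acceptable.
function validateTransaction(statements: string[]): string | null {
  const ddlCount = statements.filter((s) => DDL_PATTERN.test(s)).length;
  const dmlCount = statements.length - ddlCount;
  if (ddlCount > 0 && dmlCount > 0) {
    return "DDL and DML cannot share a transaction";
  }
  if (ddlCount > 1) {
    return "only one DDL statement is allowed per transaction";
  }
  return null;
}

console.log(validateTransaction(["CREATE TABLE todos (id int)"]));
console.log(validateTransaction(["CREATE TABLE todos (id int)", "INSERT INTO todos VALUES (1)"]));
console.log(validateTransaction(["CREATE TABLE a (id int)", "CREATE TABLE b (id int)"]));
```

Rejecting on the client keeps the failure close to the code that caused it, instead of surfacing as a database error after a round trip.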

IFC layer = the entry point where code becomes infrastructure

The backend entry point of Blocks lives in a single file, aws-blocks/index.ts. Instantiate Blocks and define your API there, and the infrastructure is derived directly from that code. No separate file for infrastructure definitions.

This idea of deriving infrastructure from code is called Infrastructure from Code (IFC), and in the source this backend part is named the IFC subpackage.

In other words, the code that describes infrastructure and the code that describes the app aren't separated. Infrastructure grows out of the app's code. This is the heart of Blocks' philosophy.

Conditional exports = the same import switches implementation by context

The reason a single file can play three roles at once, namely local development, deploy, and production runtime, is Node.js conditional exports, which route the same import statement to a different implementation per context.

Context	Resolved implementation	What happens
Local development	in-memory / filesystem	the app runs on localhost alone
CDK synthesis	CDK constructs	infrastructure is defined for CloudFormation
Lambda runtime	AWS SDK	real services are called in production
TypeScript / IDE	type definitions	completion and type checking work

The same line new KVStore(scope, 'todos') becomes a local store in development, a DynamoDB table at deploy time, and an SDK call in production. You never change the code, and you write no configuration by hand: module resolution picks the implementation per context. In the source package.json, the switch is made by conditions such as cdk, aws-runtime, types, and default (the local mock).
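The resolution rule can be modeled in a few lines. Node walks the keys of an exports condition object in order and takes the first key that is an active condition, with default always matching. The sketch below imitates that for a flat map. The condition names are the ones found in the source, while the file paths and the resolve function are hypothetical, and how Blocks activates each condition is not shown here.

```typescript
// Hypothetical export map: condition names come from the article, file paths do not.
const exportsMap: Record<string, string> = {
  types: "./dist/index.d.ts",
  cdk: "./dist/index.cdk.js",
  "aws-runtime": "./dist/index.runtime.js",
  default: "./dist/index.local.js",
};

// Walk the keys in declaration order and return the first target whose
// condition is active. "default" matches regardless of the active set.
function resolve(map: Record<string, string>, active: Set<string>): string | undefined {
  for (const condition of Object.keys(map)) {
    if (condition === "default" || active.has(condition)) {
      return map[condition];
    }
  }
  return undefined;
}

console.log(resolve(exportsMap, new Set<string>(["cdk"])));         // CDK synthesis
console.log(resolve(exportsMap, new Set<string>(["aws-runtime"]))); // Lambda runtime
console.log(resolve(exportsMap, new Set<string>()));                // local development
```

Because key order decides the winner, default has to come last. In Node, custom conditions such as cdk are switched on with the --conditions flag.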

ApiNamespace = type-safe RPC with no code generation

The part that calls the backend from the frontend is handled by ApiNamespace. The frontend imports and calls methods defined in the backend directly.

// Frontend: import the backend API directly
import { api } from '../aws-blocks/index.js';

const result = await api.greet('World');
// TypeScript knows the return type too

No code generation step, no API client initialization, no URL configuration. Change a backend method signature and the frontend gets a compile error instantly. Locally it goes through an HTTP server; in production it reaches Lambda through API Gateway.

The transport underneath is JSON-RPC 2.0. When you call a typed method, it is converted into a JSON-RPC request internally and delivered to the backend. The developer never assembles a payload by hand; the transport stays hidden behind the types.
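To make the hidden transport concrete, here is a minimal sketch of how a typed call can turn into a JSON-RPC 2.0 request. It is not Blocks' implementation: createClient, Transport, and the fake transport are hypothetical, and only the request fields (jsonrpc, method, params, id) come from the JSON-RPC 2.0 specification.

```typescript
// Backend surface as a type only. `greet` mirrors the article's example.
interface Api {
  greet(name: string): Promise<string>;
}

// Request object as defined by JSON-RPC 2.0 (positional params).
interface JsonRpcRequest {
  jsonrpc: "2.0";
  method: string;
  params: unknown[];
  id: number;
}

// Anything that can deliver a request and hand back the response's result.
type Transport = (req: JsonRpcRequest) => Promise<unknown>;

// Every property read becomes a function that sends a request named after the property.
function createClient<T extends object>(send: Transport): T {
  let nextId = 1;
  return new Proxy({} as T, {
    get: (_target, prop) =>
      (...params: unknown[]) =>
        send({ jsonrpc: "2.0", method: String(prop), params, id: nextId++ }),
  });
}

// Fake transport that records what went over the wire and answers locally.
const sent: JsonRpcRequest[] = [];
const fakeTransport: Transport = async (req) => {
  sent.push(req);
  return `Hello, ${String(req.params[0])}!`;
};

const client = createClient<Api>(fakeTransport);
const greeting = client.greet("World"); // typed as Promise<string>
greeting.then((text) => console.log(text, JSON.stringify(sent[0])));
```

The types exist only at compile time, which is why no generated client is needed: the compiler checks the call against the interface, and the proxy builds the payload at run time.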

Local-first = everything runs without an AWS account

Run npm run dev and the whole app starts locally. Blocks resolve to local implementations (an in-memory store, local auth, an embedded DB), running at http://localhost:3000 with hot reload. No AWS account, no internet connection, no cloud billing. Local data is persisted under .bb-data/ at the project root.

The "embedded DB" here turns out to be PGlite (a WebAssembly build of PostgreSQL). Not only Database but also DistributedDatabase, which uses Aurora DSQL, falls back to PGlite locally. It works because DSQL is PostgreSQL-compatible, so a near-real Postgres comes up locally without an AWS account.

When you want to check behavior against real cloud services, npm run sandbox deploys to a fast, disposable environment using hot-swap to Lambda. The mocks are swapped for real AWS services (DynamoDB, Aurora, S3, Lambda, and so on), and once you are done you can tear it all down with npm run sandbox:destroy. To ship to production, you run npm run deploy. The same backend code runs in all three.

Direct CDK access = if Blocks isn't enough, write it yourself

Every Blocks app is a CDK app. You can use arbitrary CDK constructs alongside Blocks, and you can embed Blocks into an existing CDK stack.

When you want to add a resource Blocks doesn't provide (SNS, Step Functions, and the like) or set up a custom domain, you write aws-blocks/index.cdk.ts and access CDK constructs directly. Normally the infrastructure is derived from the backend definition (aws-blocks/index.ts), so you don't need to touch CDK directly. When you do need it, you just write CDK and can build as far as you like, beyond the edges of the framework. By design, it is structurally hard to get trapped and stuck inside the abstraction.

AGENTS.md bundled = agents write correct code from the start

Blocks ships an agent-facing guide inside the npm package. Without adding a plugin, an AI coding agent is said to be steered toward writing correct code from the start.

Its actual form turned out to be an AGENTS.md placed in the project. Rather than carrying the full guide inline, the file is a pointer to references. The detailed explanation, how to choose a Block, and how to use each Block live under node_modules/@aws-blocks/blocks/ in README.md, docs/index.md, and docs/.md, and AGENTS.md tells the agent to read those. Alongside that, its Rules forbid anti-patterns: always use a Block for persistence (no local files, in-memory arrays, or local DBs), and don't assemble JSON-RPC payloads by hand.

Rather than pouring everything into the agent's context, you have it read only the docs it needs when it needs them. This pointer approach is sound as a design for agent-facing docs, and it is worth noting that an official AWS framework ships with it.

How it differs from Amplify and App Studio

Let me line up what we have covered against Amplify Gen2 and App Studio.

Amplify Gen2 is a development platform where you describe data models, business logic, and authn/authz in TypeScript and the appropriate cloud resources are auto-provisioned. It is built on CDK internally and offers categories like Data, Auth, Storage, and Functions out of the box. Per-developer cloud sandboxes, shared environments mapped one-to-one to Git branches, and the Amplify Console that bundles hosting and CI/CD are its hallmarks.

https://docs.amplify.aws/react/how-amplify-works/concepts/

App Studio sits on the low-code side, letting non-developers and beginners with little coding experience design and build apps. Where Amplify Gen2 and Blocks target developers, App Studio aims at a fundamentally different audience. I have organized the product details in the write-up I mentioned at the top.

https://zenn.dev/ikenyal/articles/c2c3ccf9fdd0cf

Lined up by aspect:

Aspect	App Studio	Amplify Gen2	AWS Blocks
Primary audience	non-developers	frontend developers	developers who also write backends
How infrastructure is handled	fully hidden	abstracted by category	derived from code (IFC)
Local development	cloud-first	per-developer cloud environment	fully local, no AWS account
Hosting / CI-CD	built in	bundled in the Amplify Console	Hosting is one Block, CI/CD is your own
How types flow	types aren't a concern	schema-driven Data types	type-safe RPC, no generation
Distance from CDK	far	used for extensions when needed	a CDK app from the start, write CDK directly
AI coding agents	out of scope	little explicit support	bundled AGENTS.md as a premise

A shared obsession with iteration speed shows up too. Amplify Gen2 advertises up to 8x faster iteration than Gen1, getting changes into the cloud sooner through hot-swap deploys to per-developer sandboxes. Blocks, on the other hand, completes locally, so in many cases there is no round trip to the cloud at all. The direction of "faster" differs, but the aim of shortening the write-then-check loop is shared.

What each one is aiming for

Rather than feature diffs, let me put into words what the three are trying to achieve.

What App Studio aims for is delivering apps to people who don't write code. Its abstraction is the highest, and it minimizes the developer's involvement the most.

What Amplify aims for is freeing frontend developers from infrastructure. Even though Gen2 was rebuilt on CDK, what developers face day to day are categories like Data, Auth, and Storage, and that managed layer covers the infrastructure underneath. It takes care of hosting and CI/CD together, so developers don't have to stitch individual AWS services together by hand. The key is concealment.

What Blocks aims for looks a little different. Like Amplify, it starts from sparing you the need to learn infrastructure tools, but its means lean toward making things transparent through code and types rather than hiding them. Infrastructure grows from the app's code, the same code runs both locally and in the cloud, and you can write CDK directly when needed. Rather than ease through concealment, it aims for reassurance through transparency and the absence of a ceiling.

In short, Amplify reaches the shared goal of making frontend developers' lives easier by keeping infrastructure out of mind, while Blocks does it by letting you stay aware of infrastructure without requiring you to. They take two routes to the same place. The former thickens the wall of abstraction; the latter makes it thin and transparent.

And Blocks carries one more premise that is distinctive of 2026. It takes for granted that AI agents write code, and the framework itself carries the correct way to write from the start. This is a design that lowers not only the cost of humans learning but the cost of agents making mistakes, and its starting point looks different from Amplify's design philosophy.

Where is Amplify headed

What follows is interpretation, not fact. I write it on the premise that none of it is certain.

The official docs explicitly call the relationship between Blocks and Amplify complementary. Amplify provides hosting, CI/CD, and a managed backend experience; Blocks focuses on type-safe infrastructure-from-code and local-first development.

Still, the overlap is not small. Amplify Gen2 also defines backends code-first in TypeScript, on top of CDK. Blocks' IFC layer and Amplify Gen2's backend definition are both a "write your backend in TypeScript" experience standing on CDK, so they sit close in philosophy.

The closeness shows on the implementation side too. Blocks' project-creation CLI auto-detects an existing Amplify Gen2 project and integrates with it, and there is even a dedicated amplify template. Adding Blocks to an Amplify Gen2 backend is an entry path assumed from the start. The complementary relationship is not just a line in the docs; it is implemented as tool behavior.

One possibility I read from this: Amplify shifts its center of gravity toward the hosting and managed-experience layer, while Blocks takes the composable-backend-parts and local-development layer. In fact, Hosting in Blocks is treated as one part bundling CloudFront, S3, and even WAF, and the integrated hosting and CI/CD experience remains a strength on Amplify's side.

What matters is that this doesn't necessarily mean Amplify is shrinking. It reads more naturally as AWS deliberately keeping multiple entry points into full-stack development. App Studio for non-developers, Amplify for those who want a managed experience, Blocks for those who want to command infrastructure with code and types. Rather than converging on a single right answer, it looks like a strategy of preparing a different door for each developer's stance.

Closing

AWS Blocks turned out to be not a replacement for Amplify but one more entry point into full-stack development. Amplify, which makes things easy by hiding; Blocks, which makes things transparent and hands you the controls. Which you choose is also a statement of how you want to relate to infrastructure.

AWS is not converging on a single answer. When the doors multiply, the question isn't which tool is better, but which kind of developer you want to be.

S3 annotations and the question of where object metadata should live

Kento IKEDA — Thu, 18 Jun 2026 23:59:16 +0000

On June 17, 2026, AWS Summit New York ran a long line of announcements that filled in the infrastructure for running agents in production. The official blog has the full roundup.

https://aws.amazon.com/blogs/aws/top-announcements-of-the-aws-summit-in-new-york-2026/

One of them reads as plain from the headline alone: S3 annotations. You can attach up to 1 GB of information directly to an object, so the first impression is "tags just got bigger."

Reading it as a capacity story misses the point. This changes a decision: where the information attached to an S3 object should live. For a long time, the answer was usually "outside the object." You kept it in an external database or a sidecar file and reconciled the two with a sync process. S3 annotations add another option to that answer: don't move it out. If you have kept data in S3 while managing its metadata off to the side, you know the cost of keeping the two in sync.

The official announcement is here:
https://aws.amazon.com/blogs/aws/amazon-s3-annotations-attach-rich-queryable-context-directly-to-your-objects/

The difference is character, not capacity

Line up annotations against the existing ways to describe an object, and what matters is the difference in character, not the numbers.

Mechanism	How much	Mutable	Character
System-defined metadata	Fixed fields	No	Intrinsic properties of the object
User-defined metadata	2 KB total, set at upload	Effectively no	Small incidental notes
Object tags	Up to 10	Yes	Labels for access control and lifecycle
annotations	Up to 1,000, 1 GB total	Yes, without rewriting	Structured knowledge that grows over time

The counts and sizes come from the official blog. The clear gap is in the bottom two rows. Tags are key-and-value labels, meant for access control and cost allocation. Annotations carry structure in JSON or YAML, and you can rewrite them as often as you like without rewriting the object. They travel with the object on copy and replication, and they are removed when the object is deleted.

A different character means a different job. Tags describe how to handle an object. Annotations hold what the object is and what is known about it, the kind of knowledge that accumulates after the fact: an AI-generated summary, an inference confidence score, processing history. That information does not fit in 2 KB, and you want to update it as the data changes. Until now, meeting that requirement meant moving the context outside the object.

What AWS says, and one step past it

The official blog describes annotations as removing the need for a separate metadata system. It is true that the pattern of double-writing to DynamoDB for cross-object search, syncing with Lambda, and watching for drift can be retired for some use cases. The annotations you attach flow through S3 Metadata into Apache Iceberg tables and become queryable from Amazon Athena.

But stopping at "you no longer need an external database" only repeats what AWS already said. The part worth pressing on is what else moves along with the location. When the context lived outside the object, who could touch it was decided directly by the access control on that external database. Once it sits on the object, you have to redesign who is allowed to change which context. Adding and reading annotations requires the IAM actions s3:PutObjectAnnotation and s3:GetObjectAnnotation. Coupling context tightly to data is the benefit. It also means tampering with the context turns directly into a misreading of the data.
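As a starting point for that redesign, read and write access to annotations can be granted separately. The sketch below builds an IAM policy document as a TypeScript object. The two action names are the ones cited above. The bucket name, the annotationPolicy helper, and the assumption that these actions are scoped to object ARNs like other object-level S3 actions are mine, so check the documentation before relying on it.

```typescript
// Hypothetical bucket name, used only to form an example ARN.
const BUCKET = "example-bucket";

interface Statement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: Statement[];
}

// Readers may only fetch annotations; writers may also attach or change them.
function annotationPolicy(role: "reader" | "writer"): PolicyDocument {
  const actions = ["s3:GetObjectAnnotation"];
  if (role === "writer") actions.push("s3:PutObjectAnnotation");
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: role === "writer" ? "WriteAnnotations" : "ReadAnnotations",
        Effect: "Allow",
        Action: actions,
        // Assumption: object-level ARN, as with other per-object S3 actions.
        Resource: `arn:aws:s3:::${BUCKET}/*`,
      },
    ],
  };
}

console.log(JSON.stringify(annotationPolicy("reader"), null, 2));
```

Keeping the writer role narrow matters more than it did with an external database, because whoever can rewrite the context can change how the data is read.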

The other thing that moves is responsibility for structure. Key-and-value tags left almost no room for design, but being able to hold structure means you have to decide which key represents what and at what granularity to split things. If an agent is meant to read it, that decision drives search quality. Dump everything in, and you end up with annotations that are useless in a later cross-object query. The freedom of capacity arrives bundled with the responsibility of design.
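One way to take on that responsibility is to write the schema down and validate against it before attaching anything. A minimal sketch, assuming a hypothetical schema whose three fields echo the examples mentioned earlier (summary, confidence score, processing history). ObjectContext and validateContext are illustrative names, not part of any AWS SDK.

```typescript
// Hypothetical annotation schema; names and limits are illustrative.
interface ObjectContext {
  schemaVersion: 1;
  summary: string;
  confidence: number; // expected range 0..1
  history: { step: string; at: string }[]; // `at` as an ISO 8601 timestamp
}

// Returns a list of problems; an empty list means the value fits the schema.
function validateContext(input: unknown): string[] {
  const c = input as Partial<ObjectContext> | null;
  if (typeof c !== "object" || c === null) return ["context must be an object"];
  const errors: string[] = [];
  if (c.schemaVersion !== 1) errors.push("schemaVersion must be 1");
  if (typeof c.summary !== "string" || c.summary.length === 0) {
    errors.push("summary is required");
  }
  if (typeof c.confidence !== "number" || c.confidence < 0 || c.confidence > 1) {
    errors.push("confidence must be between 0 and 1");
  }
  if (!Array.isArray(c.history)) errors.push("history must be an array");
  return errors;
}

const good = {
  schemaVersion: 1,
  summary: "Quarterly sales report",
  confidence: 0.92,
  history: [{ step: "ocr", at: "2026-06-18T00:00:00Z" }],
};
console.log(validateContext(good));
console.log(validateContext({ summary: "", confidence: 3 }));
```

A version field costs nothing now and gives a later cross-object query something to filter on when the schema changes.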

How much of your sync layer can you actually retire

"Then move everything to annotations" does not follow. There is a line on what you can move over.

In a verification by Classmethod, the Annotation Table took about 25 minutes to become active even in a small environment. That is a third-party measurement, but it lines up with the spec: reflection into the table is asynchronous. What you attach is not reflected into cross-object search the instant you write it.
https://dev.classmethod.jp/articles/s3-annotations-crud-athena-search/

The plain conclusion is about fit with low-latency reads. Information you want to pull in milliseconds on every screen render, or that needs a secondary-index lookup, still fits the older DynamoDB design better. Context that gets updated later but does not need immediacy, such as compliance status, history, or summaries, leans toward annotations. What you retire is part of the sync layer, not all of it.
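That line can be written down as a rule of thumb. The sketch below is one hedged way to encode it. ContextItem, its flags, and placeContext are hypothetical names, and the rule is only as good as the answers you give for each item.

```typescript
type Placement = "annotation" | "external-db";

// Hypothetical inventory entry for one piece of context you hold today.
interface ContextItem {
  name: string;
  needsMillisecondReads: boolean; // pulled on every screen render
  needsSecondaryIndex: boolean;   // looked up by something other than the object key
  needsImmediateSearch: boolean;  // must appear in cross-object queries right after a write
}

// Anything latency-sensitive stays external; the rest is a candidate for annotations.
function placeContext(item: ContextItem): Placement {
  if (item.needsMillisecondReads || item.needsSecondaryIndex || item.needsImmediateSearch) {
    return "external-db";
  }
  return "annotation";
}

const inventory: ContextItem[] = [
  { name: "compliance status", needsMillisecondReads: false, needsSecondaryIndex: false, needsImmediateSearch: false },
  { name: "thumbnail lookup", needsMillisecondReads: true, needsSecondaryIndex: false, needsImmediateSearch: false },
];
for (const item of inventory) {
  console.log(item.name, "->", placeContext(item));
}
```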

Cost does not move in one direction either. You can let go of the sync layer that watches for drift, but storing and reading annotations is billed at the same rates as S3 Standard storage and requests, and the S3 Metadata and Annotation Tables behind cross-object search carry their own processing and storage charges. You weigh the cost you remove against the cost you add. The pricing page has the breakdown.
https://aws.amazon.com/s3/pricing/

Region coverage differs too. Per the official blog, attaching an annotation works in nearly all Regions, while the Annotation Table used for cross-object search is limited to Regions where S3 Metadata is available. S3 Metadata coverage has been expanding in waves.
https://aws.amazon.com/about-aws/whats-new/2025/11/amazon-s3-metadata-expands-22-regions/

Coverage and per-feature availability change, so the documentation is the reliable place to check the current state.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/annotations-overview.html

Is an agent a precondition

Annotations are designed for agents, and the official explanation is written with that in mind. It is tempting to read "I don't run AI agents, so this isn't for me" and skip the rest, but hold off for a moment.

Agents or not, if you operate sidecar files or an external metadata database today, the shift in location is where the benefit shows up. At their core, annotations are an improvement to metadata management itself: structured context attached close to the object, queryable across objects. Natural-language search can be added later through the S3 Tables MCP server, so retiring the sync layer first is a fine way in. The agent is a possible first reader of the context you place, not a precondition.

From object-level to organization-level context

Widen the view, and annotations look less like a standalone feature and more like part of a movement. At the same Summit, AWS previewed AWS Context, which maps the data relationships across an organization into a knowledge graph. Its availability is stated as forthcoming.
https://aws.amazon.com/blogs/machine-learning/context-intelligence-for-your-data-and-ai-agents-at-scale/

If S3 annotations are context at the object level, AWS Context is context at the organization level. Both are designed to surface in Apache Iceberg format in S3, so they read as continuous. The movement is from each team holding its own context for RAG toward a shared context layer with managed access for the whole organization. Annotations cover the lowest layer of that, the part where context is attached close to the object.

Where to start

The first step is an inventory of the context you currently hold in external databases or sidecars. Pull out the context described above, the kind that gets updated later but does not need immediacy, and make it a candidate for annotations. For schema, deciding a handful of keys you want an agent to read on a single bucket is enough to begin. As long as you hold the premise that responsibility for structure is now yours, you lower the odds of getting stuck in a later cross-object query.

For a long time, keeping context outside the object was simply the premise. There is meaning in being able to question that premise at all. This is not a story about more capacity. It is a story about who is responsible for the context, and that answer has come back to the side of the object.

Your Claude automation starts metering today (June 15). A quick checklist to avoid surprise charges

Kento IKEDA — Sun, 14 Jun 2026 23:00:00 +0000

Update (June 16, 2026): After this article was published, Anthropic emailed eligible users to say it is not making this change, which had been set for June 15. For now, Agent SDK and claude -p usage continues to work with your subscription exactly as before, and there is no credit to claim. Subscription usage limits are unchanged. Anthropic says it will give advance notice before any future change takes effect.

So the "conversation vs automation" split described below is postponed for now. Whether the direction itself is withdrawn, or will return on a revised plan and timeline, is not clear from the current notice. The body below remains as an explanation of what was originally announced. Note that the official help page had not been updated at the time of writing and still shows the June 15 date.

A change to Claude's paid plans takes effect today, June 15, 2026. Your monthly fee is not going up. What changes is that usage which used to come out of a single pool splits into two: the part where a person is in a conversation, and the part where a program runs on its own. The details are in Anthropic's help center, and every number and scope in this post comes from that page.

Starting June 15, 2026, Claude Agent SDK and claude -p usage no longer counts toward your Claude plan's usage limits. Your subscription usage limits stay the same and stay reserved for interactive use of Claude Code, Claude Cowork, and Claude.

https://support.claude.com/en/articles/15036540-use-the-claude-agent-sdk-with-your-claude-plan

The side that gets affected is the second one. If you have been running automation inside your subscription plan, part of that usage leaves the flat-rate pool today and starts metering. To be precise, what you spend beyond a newly granted monthly credit either bills at standard API rates or stops. If you only chat through the app or the terminal, nothing changes for you.
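In arithmetic terms, the monthly credit absorbs non-interactive usage first, and only the remainder is billed or stopped. A minimal sketch, assuming hypothetical dollar amounts and a hypothetical onExhausted switch. settle is an illustrative name, not an Anthropic API.

```typescript
interface Settlement {
  covered: number;          // absorbed by the monthly credit
  billedAtApiRates: number; // charged beyond the credit
  blocked: number;          // usage that would not run at all
}

// Split one month's non-interactive usage against the credit.
function settle(usage: number, credit: number, onExhausted: "bill" | "stop"): Settlement {
  const covered = Math.min(usage, credit);
  const beyond = Math.max(0, usage - credit);
  return onExhausted === "bill"
    ? { covered, billedAtApiRates: beyond, blocked: 0 }
    : { covered, billedAtApiRates: 0, blocked: beyond };
}

// Hypothetical month: 30 of usage against a credit of 20.
console.log(settle(30, 20, "bill"));
console.log(settle(30, 20, "stop"));
```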

The first thing to get right is which of your own usage is in scope. Get this wrong and you risk either stopping automation you wanted to keep running, or triggering charges you did not expect. This post gives you a checklist to avoid surprise charges as a same-day fix, then connects it to the larger question of how to design your automation from here.

What is changing

There is one line to draw: is a person typing and waiting for a response, or is a program calling Claude on their behalf? The former keeps using your subscription usage limits exactly as before. The latter is what the new monthly credit covers.

Here is the official breakdown.

How you use it	What happens from June 15
Claude conversations on web, desktop, or mobile apps	Stays on subscription. No change
Interactive Claude Code in the terminal or IDE	Stays on subscription. No change
Claude Cowork	Stays on subscription. No change
`claude -p` (non-interactive mode)	Moves to the monthly credit
Claude Agent SDK (Python or TypeScript)	Moves to the monthly credit
Claude Code GitHub Actions integration	Moves to the monthly credit
Third-party apps that authenticate through the Agent SDK	Moves to the monthly credit

If you open the app and chat, or type into Claude Code in the terminal and wait for replies, June 15 looks the same as before. Those keep drawing from your subscription usage limits and never touch the new credit. If that is you, there is nothing to do today.

The dividing line is whether you operate Claude by hand or have code operate it for you: running claude -p from a shell script or cron, keeping a homegrown app built on the Agent SDK, or wiring Claude Code into a GitHub Actions CI step. If any of these sound familiar, the checklist below is for you.

Why it split into two

Before the checklist, the background helps you make the call. Note that Anthropic has not stated the reason outright; what follows is my reading of what the announcement implies.

The flat-rate subscription pool let programmatic use, which should run on metered API pricing, run far more cheaply than it otherwise would. Unlike a conversation, automation is not bound by human typing speed. A script or an agent fires requests back to back without pausing, so token consumption climbs even within the same monthly fee. The flat pool absorbed that non-stop consumption. Run the same work through metered API pricing instead, and the gap widens the longer it runs.

In other words, the flat-rate plan was effectively subsidizing the cost of automation. Conversation has a natural ceiling because the human pauses; automation has no such brake. Keep both in the same pool, and the heaviest automation users get the biggest discount, while the gap between cost and price keeps growing.

Read the split as ending that subsidy and it makes sense. The conversational part stays inside the flat rate; the part where code runs on its own moves toward metering that tracks real cost. Conversation and automation went into separate wallets because their cost structures were different to begin with. It is less a price hike than a separation of things that had been mixed together.

This reading carries into the decisions later in the post. Automation is no longer an extra bundled into a flat rate; it has become something you design with cost in mind.

A checklist to avoid surprise charges

To sort out whether you are in scope and to avoid surprise charges or unexpected stops, here is what to check today. Go through it in order and you will know which side you are on and what to set.

First, check whether your day-to-day use is interactive or non-interactive. Chatting on the app or web, typing into Claude Code in the terminal or IDE and waiting for a reply, and working in Claude Cowork all count as interactive. If that is all you do, you are out of scope and the rest of the checklist is unnecessary. Most people stop here.

Second, find every place that runs on its own. Are you calling claude -p from cron or a shell script? Is there a homegrown script or app (Python or TypeScript) built on the Agent SDK? Is Claude Code wired into CI, GitHub Actions in particular? All of these are in scope. The ones running quietly in the background are the easiest to miss, so it is worth opening your job definitions, crontab, and repository workflows once to check.

Third, check the authentication path of any third-party tools. If an editor extension or an external agent tool is linked to your Claude account and authenticates through the Agent SDK on your subscription, it is in scope. Even if you never wrote claude -p yourself, a tool making non-interactive calls behind the scenes counts.

Fourth, check whether those authenticate via subscription or an API key. Automation that already authenticates with a Claude Platform API key is out of scope for this change. Pay-as-you-go billing continues, unaffected by the monthly credit. This is the final fork for whether you are in scope.

If you only use it interactively, there is nothing to do today. Only if you fall into automation, third-party tools, or the Agent SDK, and you are running on subscription authentication, do you move on to the next step.
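The second and third checks can be partly mechanized. Below is a minimal sketch in Python that scans text you have already collected (crontab output, workflow files, scripts) for signs of non-interactive use. The pattern list is my own assumption, not an official detection list: it looks for claude -p (and the long form --print), a GitHub Actions step name containing claude-code-action, and an Agent SDK package name. The sample file contents are hypothetical.

```python
import re

# Patterns that suggest a non-interactive Claude invocation.
# These are illustrative assumptions, not an official detection list.
NON_INTERACTIVE_PATTERNS = [
    re.compile(r"\bclaude\s+(?:-p|--print)\b"),  # headless CLI call
    re.compile(r"claude-code-action"),           # GitHub Actions integration (assumed name)
    re.compile(r"claude[-_]agent[-_]sdk"),       # Agent SDK dependency or import (assumed name)
]


def find_automation(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (filename, line_number, line) for every line matching a pattern."""
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(p.search(line) for p in NON_INTERACTIVE_PATTERNS):
                hits.append((name, lineno, line.strip()))
    return hits


# Hypothetical contents standing in for `crontab -l` output and a workflow file.
sample = {
    "crontab": "0 6 * * * /usr/local/bin/claude -p 'summarize yesterday' >> /tmp/out.log\n",
    ".github/workflows/review.yml": "steps:\n  - uses: anthropics/claude-code-action@v1\n",
    "notes.txt": "Ask claude about the design tomorrow.\n",
}

for hit in find_automation(sample):
    print(hit)
```

A scan like this only produces candidates. Whether each hit authenticates through your subscription or through an API key, the fourth check, still has to be confirmed by hand.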

If you are affected, what to do

What you do divides into two stages on a time axis: the same-day fix you finish today, and the deeper rework you take your time over. The first keeps automation from stopping, but it is a stopgap; the real solution is in the second.

Quick fix: claim the credit and decide how it stops

Eligible plans get a monthly credit. Officially it is called the "Agent SDK credit," but it is not limited to people writing the SDK directly. The claude -p calls in cron or CI, the GitHub Actions integration, and the third-party tools you checked above all draw from this same credit. Even if you never touch the SDK, if something runs Claude non-interactively, you are in scope. The credit equals your plan's monthly fee.

Plan	Monthly credit
Pro	$20
Max 5x	$100
Max 20x	$200
Team (Standard)	$20
Team (Premium)	$100
Enterprise (usage-based)	$20
Enterprise (seat-based Premium)	$200
Enterprise (seat-based Standard)	$0

On seat-based Enterprise plans, Premium seats are eligible but Standard seats are not. You claim the credit once through your account (a one-time opt-in), and it refreshes automatically each billing cycle after that. From June 15, eligible users are said to receive an email with claim instructions.

Here the path forks. If you plan to keep using automation, the first move is to not miss that email and claim the credit. If instead you would rather avoid being charged for automation, or you are fine letting it stop, there is no rush to claim. It starts with deciding whether you will keep running automation at all.

Next, decide what happens once the credit runs out. Agent SDK usage draws from this credit first, and what happens beyond it depends on a setting. With "Extra Usage" enabled, the overage bills at standard API rates and automation keeps running. Without it, requests stop the moment the credit is exhausted and stay stopped until it refreshes.

For a production workflow you cannot afford to stop, enabling extra usage to keep it running is the safer choice, though it leaves room for an unexpectedly large bill. For experimental automation that can stop without hurting, you might deliberately leave it off and let the credit act as a spending ceiling that is never crossed. You weigh "how bad is it if it stops" against "how scared am I of surprise charges."
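To make the trade-off concrete, here is a small Python sketch of how one month settles under each setting. It is a simplified model under my own assumptions: demand is expressed in dollars at standard API rates, and the names settle_month and MonthResult are hypothetical, not part of any Anthropic API.

```python
from dataclasses import dataclass


@dataclass
class MonthResult:
    covered_by_credit: float  # usage absorbed by the monthly credit
    overage_billed: float     # usage billed beyond the credit
    unserved: float           # usage that never ran because requests stopped


def settle_month(credit: float, demand: float, extra_usage: bool) -> MonthResult:
    """Split one month's automation demand (USD at API rates) across the credit."""
    covered = min(credit, demand)
    beyond = max(demand - credit, 0.0)
    if extra_usage:
        # Automation keeps running; everything past the credit is billed.
        return MonthResult(covered, beyond, 0.0)
    # Requests stop at the credit; the rest of the demand is simply not served.
    return MonthResult(covered, 0.0, beyond)


# Max 5x credit of $100 against a hypothetical $140 of automation demand.
print(settle_month(100.0, 140.0, extra_usage=True))
print(settle_month(100.0, 140.0, extra_usage=False))
```

With Extra Usage on, the $40 beyond the credit shows up as a bill. With it off, the same $40 shows up as work that did not run.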

While you are at it, note the credit's constraints. It is granted per user and cannot be pooled or shared across a team. It resets monthly and unused credit does not roll over. If your team runs automation, the idea of "everyone drawing from one person's credit" does not work, and that is worth keeping in mind.

Deeper rework: choose the authentication path as a design decision

The quick fix was about getting by within the credit you are given. The deeper rework is reconsidering, from the ground up, which authentication path your automation rides on.

Anthropic positions the monthly credit as "sized for individual experimentation and automation." Read the other way, shared production automation is not expected to fit within the credit. In fact, Anthropic points teams running production automation to a Claude Platform API key. API-key usage is out of scope for this change: pay-as-you-go continues and no monthly credit is granted.

This is where the earlier "automation is no longer an extra" reading pays off. When you build automation, which path you put it on, subscription auth or API-key auth, has become a design decision that shapes both cost structure and behavior at the limit.

Auth path	How cost reads	When you hit the limit
Subscription auth (Agent SDK credit)	Monthly credit equal to your plan fee, resets monthly	Stops if extra usage is off, continues at standard API rates if on
API-key auth (Claude Platform)	Consumed pay-as-you-go from prepaid credit balance	Stops when the balance runs out, can continue via auto-reload

For light experiments or personal helper automation, the subscription credit covers it. The way it stops when the credit runs out works as a spending ceiling you never cross. For production automation that hurts when it stops, or workflows shared across a team, API-key auth is easier to handle: consumption shows up directly as billing so it reads cleanly, and you can manage the balance with auto-reload and usage alerts. Both stop when the balance runs out, but subscription auth is a flat pool tied to your plan fee, while the API key is a balance you fund yourself. To run something steadily in production, the API key, whose ceiling is not pinned to your plan fee, gives more operational freedom.

Objections worth raising

A few objections come to mind against the picture above. Let me raise them and answer each.

Objection 1: Just put everything on an API key

So why not put everything on an API key and stop worrying about two wallets?

There is something to this. Unify on an API key and conversation and automation both run on the same pay-as-you-go billing, with no need to watch a subscription credit and an API balance separately. But for conversation, moving to an API key tends to cost more. The subscription is designed so that, within your usage limits, conversation runs at a flat rate with no per-token charge. Move that to metered API pricing and every single conversational turn that used to be covered by your monthly fee turns into per-token billing. Keep conversation on the subscription's flat rate and split only the heavy automation onto an API key. That is the realistic landing point.

Objection 2: My automation is light, so it fits

My automation is light, so it fits within the credit I am granted, right?

In many cases, yes. If you only run a script now and then, the granted credit is enough. The problem is estimating "it probably fits" without knowing your own consumption. If you keep an agent running constantly or call it often in CI, you can use up the credit faster than expected. Even when the judgment that it fits turns out right, it is safer to confirm the basis once. Take the run frequency you found in the checklist and reconcile it against actual consumption on the billing screen after the change takes effect.
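A rough version of that reconciliation can be written down before the first bill arrives. The Python sketch below multiplies run frequency by a per-run cost and compares the result with the credit. The per-run cost of $0.25 and the 20% headroom are placeholder assumptions of mine; replace them with figures from your own billing screen.

```python
def estimate_monthly_cost(runs_per_day: float, cost_per_run_usd: float, days: int = 30) -> float:
    """Estimate monthly automation spend from run frequency and a per-run cost."""
    return runs_per_day * cost_per_run_usd * days


def fits_credit(estimate_usd: float, credit_usd: float, headroom: float = 0.2) -> bool:
    """True if the estimate fits within the credit while leaving some headroom."""
    return estimate_usd <= credit_usd * (1 - headroom)


# Hypothetical per-run cost; measure your own before trusting the result.
COST_PER_RUN = 0.25

light = estimate_monthly_cost(runs_per_day=1, cost_per_run_usd=COST_PER_RUN)
ci = estimate_monthly_cost(runs_per_day=4, cost_per_run_usd=COST_PER_RUN)

print(light, fits_credit(light, credit_usd=20))  # one run a day on a Pro credit
print(ci, fits_credit(ci, credit_usd=20))        # four CI runs a day on a Pro credit
```

Under these assumed numbers, one run a day fits comfortably inside a $20 credit while four CI runs a day do not, which is the kind of gap that "it probably fits" tends to hide.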

Objection 3: It is a price hike in the end

Whatever you call it, is this not a price hike in the end?

For users who ran heavy automation on the flat-rate pool, it is indeed a real increase in burden, because the part that was cheap moves closer to real cost. But this is closer to correcting a distortion than an unreasonable hike. Until now the flat-rate subscription subsidized the cost of automation, and the heavier the user, the larger the benefit. The split moves things toward how they arguably should be: conversation flat, automation at real cost. Conversation-centric users see no change in usage limits or feel; the ones affected are limited to the layer that had been benefiting most. It is not an across-the-board hike but the removal of a subsidy that had been riding along.

A good moment to revisit your automation design

Once the checking and the fix are done, what remains is a design question. This change is a good moment to revisit how you design your automation.

Even before this, there was a split in practice: API keys for automation built into production systems, subscription-authenticated Claude Code for scripts at hand. But as long as you ran on subscription auth, that choice never showed up in cost, because the flat pool absorbed it. From June 15, the choice of path feeds straight into cost and availability.

The more you build and run your own agents, the bigger the impact. There are three axes to judge by.

Does it hurt if it stops
Do you need consumption to be readable
Is it shared across a team

Against these three, you sort which workflows stay on the subscription credit and which move to an API key. Light experiments that can stop use the credit as a spending ceiling; production or shared workflows you cannot stop go on an API key where consumption reads cleanly. The change is a good occasion to inventory your automation once.
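The sorting can be sketched as a small Python function. The rule that any single "yes" moves a workflow to an API key is my own simplification of the three axes, and the Workflow type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    hurts_if_stopped: bool     # does it hurt if it stops?
    needs_readable_cost: bool  # do you need consumption to be readable?
    shared_across_team: bool   # is it shared across a team?


def choose_auth_path(w: Workflow) -> str:
    """Assumed rule: a 'yes' on any axis moves the workflow to an API key."""
    if w.hurts_if_stopped or w.needs_readable_cost or w.shared_across_team:
        return "api_key"
    return "subscription_credit"


# Hypothetical inventory.
inventory = [
    Workflow("nightly-release-notes", True, True, True),
    Workflow("personal-log-summarizer", False, False, False),
]

for w in inventory:
    print(w.name, "->", choose_auth_path(w))
```

Your own rule may weigh the axes differently. The point is to write the decision down per workflow instead of leaving it implicit.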

Put plainly, the era of treating automation as something that "just runs, loosely, inside a flat rate" ends here. Until now, running an agent left its cost dissolved into the flat rate, out of sight. From here, cost follows automation everywhere. Since it costs more the more it runs, you will weigh, from the design stage, whether a given automation earns its cost.

The same-day fix is simple. Check whether you are affected, and if so, claim the credit and decide how it stops. That is all. What remains beyond it is the homework of how to watch the cost and value of your automation. The cost of automation, which had been hidden inside the flat rate, has come into the open. That, I think, is the real substance of this change.

Datadog and AWS Shipped Ops Agents on the Same Day. What Are They Fighting Over?

Kento IKEDA — Fri, 12 Jun 2026 00:34:47 +0000

On June 9, 2026 (US time), two big announcements landed on the same day.

At the keynote of Datadog's annual event DASH 2026 in New York, the Bits AI family expanded significantly: Detection, Investigation, Remediation, Infrastructure, Code, Release, Testing, Data Analysis, Chat, Memories, and Evals. Counting by agent, that is more than ten, with over 100 new features announced together. The full picture is laid out in the keynote roundup.

https://www.datadoghq.com/blog/dash-2026-new-feature-roundup-keynote/

The same day, AWS announced FinOps Agent as a public preview. It bundles four data sources, Cost Explorer, Cost Anomaly Detection, Cost Optimization Hub, and Compute Optimizer, and delivers automated cost-anomaly investigation, natural-language cost questions, periodic cost reports, and aggregated optimization opportunities straight into Slack and Jira. The details are in the AWS blog.

https://aws.amazon.com/blogs/aws-cloud-financial-management/aws-finops-agent-is-now-public-preview/

AWS DevOps Agent had already gone GA in March, handling incident response. With FinOps Agent now added, AWS-built standard agents line up across the main operational domains. That said, DevOps Agent also covers multicloud and on-premises environments, so its scope differs from FinOps Agent, which targets AWS cost data.

https://aws.amazon.com/blogs/mt/announcing-general-availability-of-aws-devops-agent/

On the surface, this looks like two separate stories: Datadog the monitoring platform, AWS the cloud provider. But read the two announcements side by side, and you see both reaching for the same territory, Ops, through different entrances. Line up their features and most of them overlap, so a surface spec comparison won't show the difference. This article sorts the same-day releases by each company's positioning, asks what these very similar agent lineups are actually fighting over, and then lays out the axes for telling them apart and the predictions that follow.

This is written for people working in SRE, FinOps, and Platform Engineering.

Why the "Ops Agent" Category Is Taking Shape Now

FinOps, DevOps, and SRE go by different names, but they share the same structural problem: the round-trip cost between the operator who notices an anomaly and the developer who actually fixes it.

Concretely, this kind of flow happens every day. Take cost management. Cost Anomaly Detection raises an alert about a cost anomaly. Someone opens a dashboard and looks at the per-service breakdown. They isolate whether the spike came from EC2 or RDS, open CloudTrail, and cross-check who deployed what. They confirm with stakeholders on Slack, then open the repository to find the relevant IaC code. By this point you have moved back and forth across many screens and tools. Writing the fix and opening a PR comes after all that.

Incident response follows the same flow. An alert fires, you open a dashboard, read logs, follow APM traces, pull stakeholders into Slack, and apply a fix. It is a chain of context switches, and it wears you down even more when the response runs into the middle of the night.

What changed over the past year or so is that LLM accuracy reached a level where it can take on this interpretation-and-round-trip work. Reading monitoring data, interpreting the cause, and proposing a fix can now be run on your behalf. The desire to make monitoring and action seamless has been around for a while; the shift is simply that it can finally be shipped in the form of an autonomous agent. In fact, AWS started lining up its operational agents around re:Invent 2025 late last year, when it previewed DevOps Agent and Security Agent.

https://aws.amazon.com/about-aws/whats-new/2025/12/devops-agent-preview-frontier-agent-operational-excellence/

That DASH 2026 and AWS FinOps Agent landed on the same day is probably not a coincidence. Two companies from different territories are evolving in the same direction at the same time. It was a day that showed exactly that.

How Datadog and AWS Differ in Strategy

Let's compare the two strategies along three axes: data source, coverage, and how each positions its agents.

Data Source

AWS owns cost data and CloudTrail as first-party data. This is data that only AWS holds completely.

AWS FinOps Agent's anomaly investigation is designed to leverage this directly. It starts from a Cost Anomaly Detection event and correlates the CloudTrail events around it. Because CloudTrail records who called which API and when, it can automate all the way to identifying the operation that drove the cost change and the IAM user or role behind it, the owner who bears responsibility. The output is not interpretation that reaches into business factors, but a Jira ticket or Slack notification delivered straight to the responsible engineer.

This is a flow you cannot build unless you hold both the place where cost is generated and the place where resources are operated. Third-party FinOps SaaS can ingest the CUR (Cost and Usage Report), but matching it against CloudTrail at the same precision is not as easy as it is for AWS itself, given data lag and permissions. AWS can capture its own API calls in near real time, and there lies a structural advantage.
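To picture the correlation step, here is a minimal Python sketch that filters CloudTrail-style records to the window just before a cost anomaly. It is an illustration under my own assumptions, not a description of how AWS FinOps Agent works internally. The record fields (eventTime, eventSource, eventName, userIdentity.arn) follow the CloudTrail record format, while the function name, the six-hour window, and the sample events are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def events_near_anomaly(events, anomaly_time, service, window=timedelta(hours=6)):
    """Return records for `service` within `window` before the anomaly, newest first."""
    start = anomaly_time - window

    def parse(ts: str) -> datetime:
        # CloudTrail timestamps end in "Z"; normalize so older Pythons can parse them.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    matches = [
        e for e in events
        if e["eventSource"] == service and start <= parse(e["eventTime"]) <= anomaly_time
    ]
    return sorted(matches, key=lambda e: e["eventTime"], reverse=True)


# Hypothetical records; the account ID is a documentation-style placeholder.
events = [
    {"eventTime": "2026-06-09T08:00:00Z", "eventSource": "ec2.amazonaws.com",
     "eventName": "RunInstances",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/deploy-bot"}},
    {"eventTime": "2026-06-09T09:30:00Z", "eventSource": "rds.amazonaws.com",
     "eventName": "ModifyDBInstance",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
    {"eventTime": "2026-06-08T01:00:00Z", "eventSource": "ec2.amazonaws.com",
     "eventName": "RunInstances",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/batch"}},
]

anomaly = datetime(2026, 6, 9, 12, 0, tzinfo=timezone.utc)
for e in events_near_anomaly(events, anomaly, "ec2.amazonaws.com"):
    print(e["eventName"], e["userIdentity"]["arn"])
```

The filter itself is trivial. The hard part is having complete, timely records to run it on, which is where the first-party advantage sits.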

Datadog owns the telemetry of APM, logs, traces, RUM, and profilers as first-party data. Bits Detection looks at real-time metrics and logs plus historical baselines, service topology, ownership information, and source-code context, and continuously judges whether the current state is healthy. Bits Investigation traces the root cause, and Bits Code takes on the fix, producing a production-ready PR on GitHub.

Because the kind of data at the source differs, the anomalies each is good at differ too. AWS is strong at "money anomalies," Datadog at "performance and behavior anomalies." Both call themselves general-purpose Ops agents, but as long as the underlying data source differs, the center of gravity of what they are good at naturally differs.

Coverage

On the cost side, AWS stays within its own cloud. FinOps Agent does not cover multicloud or SaaS.

This is a weakness and a strength at once. For an organization that runs entirely on AWS, everything from cost to execution logs sits inside the same cloud. Conversely, for an organization where Snowflake, Databricks, Google Cloud, Vercel, and others are mixed in, AWS FinOps Agent alone cannot see the organization-wide cloud cost anomalies.

Datadog spans multicloud and SaaS, and on top of that pulls in log infrastructure placed on your own infrastructure. In addition to BYOC Logs, which lets you search logs while keeping them in your own cloud environment, DASH 2026 introduced Federated Logs, which lets you search across external data stores like Databricks and ClickHouse from Log Explorer using the same syntax. It is expanding in the direction of searching from Datadog no matter where logs are scattered.

https://www.datadoghq.com/blog/introducing-datadog-byoc-logs/

Which side you choose depends on how far your operation spans. For an organization that is 100% AWS, the former is enough; for one with multiple clouds or SaaS mixed in, you need the latter's reach. Even an organization using both can land on a split: AWS FinOps Agent for cost optimization, Datadog Bits for performance and availability.

How Each Positions Its Agents

This difference is what I most want to convey in this article.

AWS is trying to treat AI agents as building blocks that run on its own platform. Bedrock AgentCore is that foundation, with components coming together: Memory, Identity, Observability, Gateway, Runtime, and built-in tools like Browser Tool and Code Interpreter. DevOps Agent and FinOps Agent are positioned as AWS-built standard agents that ride on top of this layer.

In other words, AWS holds the execution environment for agents itself and rolls out concrete agents on top of it over time. AgentCore also supports third-party frameworks like LangGraph and CrewAI, so third-party agents are expected to ride the same foundation.

Datadog is trying to look at AI agents from the outside, as objects to monitor. The Datadog Agent Console announced the same day is a mechanism that gives a cross-cutting view of the coding agents used across an organization (Claude Code, Cursor, GitHub Copilot, and so on) alongside Datadog's own Bits AI. It answers, in a single UI, questions like who is using which agent and how much, whether it is paying off compared to not using AI, and where cost is being wasted and what can be fixed.

What matters here is that Datadog's own Bits AI is also among the monitored targets. Your own agents and competitors' agents can be compared side by side in the same UI. The stance is not "use our agent," but "we'll give you visibility into your organization's agent operations themselves."

AI Guard sits in the same line of thinking. It protects AI agents from attacks like prompt injection, tool misuse, and data exfiltration, but the protection is not limited to Datadog-made agents. It discovers unprotected agents in the environment and includes externally built agents as targets. Datadog is positioning itself to monitor, protect, and govern agents.

AWS, which tries to own agents, and Datadog, which tries to monitor them. This difference is the core of the two strategies.

A Mapping of the Territory Agents Cover

Here is how each company realizes the territory that AI agents cover, from operations through development. This is not a comparison of the clouds as a whole. The top half is the day-to-day work of operations and development (from detection, investigation, and remediation through code change, release, and testing); the bottom half is the foundation that supports it (memory, evaluation, monitoring, protection, and execution environment). In this table, a slash means multiple products are candidates, and a plus means several are combined.

Role	Datadog	AWS
Anomaly detection	Bits Detection	Cost Anomaly Detection / DevOps Agent
Anomaly investigation	Bits Investigation	FinOps Agent + CloudTrail / DevOps Agent
Auto-remediation	Bits Infrastructure (infra ops & remediation)	DevOps Agent
Code change	Bits Code	Kiro (spec-driven dev environment)
Release validation	Bits Release	CodePipeline / CodeBuild
Test automation	Bits Testing	Kiro
Natural-language questions	Bits Chat / Bits Data Analysis	FinOps Agent (cost only)
Knowledge / memory	Bits Memories	AgentCore Memory
Agent evaluation	Agent Evals	Bedrock's evaluation features
Agent monitoring	Datadog Agent Console (includes external)	AgentCore Observability (internal)
Agent protection	AI Guard	AgentCore Identity / Bedrock Guardrails
Agent execution base	The Datadog platform itself	AgentCore Runtime

Most of the roles overlap. The strengths show up in the areas that don't overlap, or where the strength clearly differs.

Datadog has the reach to bring even external agents under its monitoring, plus a design, via Bits Code and Bits Release, that closes the loop from detection through fix and validation inside Datadog.

AWS has root-cause identification for cost anomalies, sourced from data it owns in CloudTrail, plus AgentCore Runtime as a foundation for pulling third-party agents onto its own platform. The latter is a device for inviting third-party developers to "run your agents on AWS."

The Tipping Point Is "Who Changes the Code"

Look at the mapping and one point of contention surfaces: how much of the flow from detection through investigation, code change, and release validation each company completes inside its own UI. That is where the turf war between the two lies.

Datadog's Bits Code is the emblematic product on this front. According to the official description, Bits Code takes signals from Error Tracking, APM Recommendations, Continuous Profiler, Test Optimization, Code Security, and Bits Investigation, and takes over the chain of work an engineer usually does by hand: triaging the problem, locating the relevant code, writing the fix, running tests, and opening the PR. It handles all of it in one stroke.

In other words, it is trying to pull into Datadog the division of labor that used to be "monitoring tools alert you, fixes are the job of the IDE and GitHub." The repository itself lives on GitHub or GitLab, but the entity generating the PR is Datadog.

AWS DevOps Agent also goes from root-cause identification through fix proposal. But when the fix reaches into application code, that code does not necessarily live on AWS's own services, so AWS closes less of the loop here. For coding, AWS's main bets are Kiro, which is good at spec-driven development, and third-party agents via Bedrock, and DevOps Agent ends up on the side that coordinates with them. AWS's strategy is a different path: hold AgentCore as the agent execution environment, and put everything, including third parties, onto its own platform.

This is where the Datadog Agent Console comes in. By bringing the Claude Code, Cursor, and GitHub Copilot that an organization uses under monitoring, even the activity of coding agents running outside Datadog comes into view. Because Bits Code itself is also a monitored target, in-house and third-party can be compared side by side in the same UI.

So Datadog is taking a position where, even without owning the coding agent itself, it can hold the initiative by taking the stance of monitoring it. AWS, on the other hand, takes the stance that if everything is completed on top of AgentCore, monitoring is handled in-house too.

Which strategy works depends on where engineers spend their time. A team that stays long in the AWS console leans AWS; a team whose main stage is the Datadog UI and Slack leans Datadog. For organizations with many resources that can only be touched directly via the AWS console (IAM, VPC, Cost Management), AWS's advantage tends to hold. For organizations whose operations revolve around Slack notifications and PR reviews, the range Datadog can cover keeps extending.

Predictions from the Facts

From the facts laid out so far, here are a few predictions.

Prediction 1: AgentCore Repeats Bedrock's Strategy for Agents

AWS's strategy looks like a replay of its Bedrock strategy. Bedrock grew into a foundation that hosts models from Anthropic, OpenAI, Meta, and others on AWS, places them on the same footing as AWS's own Nova models, and puts everything on the AWS bill. Rather than betting on which model wins, it secures a share of the AI market by holding the execution environment and billing where models run.

AgentCore Runtime is trying to reproduce this same structure in the agent market. Rather than building a large number of agents in-house, it is a device for steering third-party agent developers to "run it on AWS." DevOps Agent and FinOps Agent are positioned as reference implementations that run on top of that foundation. Components like AgentCore Memory, Identity, Observability, and Gateway look like a setup where AWS takes on the common parts that agent developers find tedious to build themselves.

Whether this strategy works hinges on whether agent developers accept AgentCore's constraints. Just as Bedrock succeeded in pulling in model providers, if AgentCore succeeds in pulling in agent providers, AWS can keep monitoring, execution, and billing all inside its own house.

Prediction 2: Bits Code and Bits Release Together Signal Datadog's Closure Strategy

At DASH 2026, Datadog lined up Bits Code, which handles code change, and Bits Release, which handles release validation, in the same family. Bits Release is an agent that analyzes the intended impact of a code change, builds a validation plan, runs checks in staging, and watches the rollout. The two lining up together is itself a sign of strategic intent.

To complete the loop from detection through investigation, code change, and release validation inside your own walls, holding the code change (the fix PR) is the starting point. Once you hold the code change, release validation becomes a natural extension: "Datadog watches over the PR Datadog produced at release time." Conversely, if the code change is taken by another company's IDE-side agent or GitHub Copilot, holding release validation alone in-house does not make a closed loop.

Hold the loop's entry point with Bits Code, and pull its exit into your own walls with Bits Release. With these two lined up, a path to completing detection through fix and release inside Datadog's UI has come into view. Some of these features are still in preview, but the odds are high that Datadog finishes building the whole thing end to end.

Prediction 3: Integration into Chat and Tickets Is Part of a Big Shift to "Push UX"

Both companies emphasize chat and ticket-management tools as the output destination for their agents. AWS FinOps Agent lets you specify Slack or Jira as the output destination, and Datadog Bits also delivers investigation results and notifications to tools of the same kind.

This shows a big shift from a "go look at the dashboard" UX to a "delivered where the engineer already is" UX. Until now, both monitoring tools and FinOps tools were built on the premise that the user goes to look at the dashboard. The alert arrives by email, and you check the details on the dashboard.

In the age of agents, this flips. When a problem occurs, the interpretation and recommended action arrive directly in chat. You open the dashboard only when you want to dig deeper, or when you can't accept the agent's judgment. The dashboard drops to the position of a reference that quietly backs up the agent's judgment.

This change is reaching the billing model too. On top of the conventional per-view or per-seat billing, Datadog has already adopted AI credits tied to the agent's workload, and consumption-based billing per investigation. Not how much you look at the dashboard, but how much you put the agent to work. The axis of monitoring billing is starting to move that way.

Prediction 4: Coding-Agent Operations Move from "Individual Tool" to "Organizational Tool"

The very fact that the Datadog Agent Console launched shows that coding-agent operations are taking shape as a genre. When Claude Code, Cursor, and GitHub Copilot are adopted ad hoc, it is easy to end up in a state where no one can see who is using how much.

Each tool, too, is moving to bundle organizational usage, like GitHub Copilot's enterprise management features or Cursor's team features. But an individual tool's management screen is basically closed within that product. The Datadog Agent Console is trying to put a cross-cutting management layer on top of that. It is a different direction from, but close in aim to, GitHub Copilot starting to pull other companies' coding agents onto its own platform.

The category of an "organizational coding-agent operations dashboard" is already taking shape. What isn't visible yet is the next step: industry-standard tags for cost allocation, and per-agent productivity metrics (PR acceptance rate, error rate, cost per accepted PR, and so on).
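Since no industry standard exists yet, the definitions below are only one plausible reading of those metrics, written as a Python sketch. The function name, the formulas, and the sample numbers are all my own assumptions.

```python
def agent_metrics(prs_opened: int, prs_accepted: int, total_cost_usd: float) -> dict:
    """Compute per-agent metrics under assumed definitions (no standard exists)."""
    acceptance_rate = prs_accepted / prs_opened if prs_opened else 0.0
    # Cost per accepted PR is undefined when nothing was accepted.
    cost_per_accepted = total_cost_usd / prs_accepted if prs_accepted else None
    return {
        "pr_acceptance_rate": acceptance_rate,
        "cost_per_accepted_pr_usd": cost_per_accepted,
    }


# Hypothetical month for one coding agent: 40 PRs opened, 30 merged, $120 spent.
print(agent_metrics(prs_opened=40, prs_accepted=30, total_cost_usd=120.0))
```

Even with definitions this simple, comparing agents only works if every tool reports opened PRs, accepted PRs, and cost under the same tags, which is exactly the standardization that is still missing.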

Prediction 5: The Initiative War Develops into a Three-Way Contest

So far I've written in terms of the two parties, Datadog and AWS, but the real initiative war looks like a three-way contest. GitHub and GitLab, which hold the home of source code; AWS, Google Cloud, and Azure, which hold the execution platform; Datadog, Grafana Labs, and New Relic, which hold the monitoring platform. Each is after the initiative in the age of agents.

GitHub is moving in the direction of completing the loop from code change through release to monitoring inside GitHub, with the combination of Copilot and Actions. GitHub Advanced Security includes Copilot Autofix, which is already built out to the point of opening a PR with a proposed fix for a vulnerability detected by CodeQL.

The repository side, the execution-base side, and the monitoring side are each reaching, from the territory they own, for the loop of detection, fix, and release. Rather than any one of the three winning outright, each organization will likely pick its lead based on its own scale, industry, and existing stack. The likely outcome is that the winner differs by organization.

Prediction 6: An Organizational Shift in FinOps

What AWS FinOps Agent changes most may be not the technology but the shape of the organization.

FinOps has so far mostly run on a central FinOps team. A group of specialists builds cost reports and talks with the development teams in a monthly review. In recent years, the central team has been moving toward setting standards and guardrails while delegating responsibility to each team, but the carriers of that work were still people. AWS FinOps Agent is designed so individual engineers can ask about cost in natural language, and so that when an anomaly occurs, a Jira ticket arrives directly to the engineer.

For an organization that wants to lean toward self-service, this is a device that advances that shift a step, because an agent can take on the decentralization that people used to carry. If you take this path, the FinOps team's job moves away from teaching everyone and toward mapping accounts to owners, setting tag conventions, and designing the context handed to the agent. The FinOps specialist doesn't disappear; the substance of the work changes toward shaping the foundation. That option has now become realistic.
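As a picture of what that foundation work might look like, here is a minimal Python sketch of an account-to-owner map and a tag convention check. Everything in it is hypothetical: the team names, the required tag keys, the fallback queue, and the function itself are my own illustration, not an AWS FinOps Agent configuration format.

```python
# Hypothetical account-to-owner mapping maintained by the FinOps team.
ACCOUNT_OWNERS = {
    "111122223333": "team-payments",
    "444455556666": "team-search",
}

# Hypothetical tag convention: every resource should carry these keys.
REQUIRED_TAGS = {"owner", "cost-center", "env"}


def route_anomaly(account_id: str, resource_tags: dict) -> tuple[str, list[str]]:
    """Pick an owner for an anomaly and list the convention tags that are missing."""
    owner = resource_tags.get("owner") or ACCOUNT_OWNERS.get(account_id, "finops-triage")
    missing = sorted(REQUIRED_TAGS - set(resource_tags))
    return owner, missing


print(route_anomaly("111122223333", {"env": "prod"}))
print(route_anomaly("999900001111", {"owner": "team-x", "cost-center": "c1", "env": "dev"}))
```

The better this mapping is maintained, the less often an anomaly lands in a fallback queue where a person has to work out who owns it.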

A similar structure shows up in other areas. In SRE, there is the choice between watching monitoring directly and shaping the monitoring system. In Platform Engineering, between each team handling its own infrastructure and a team laying down a standard path with guardrails so that each team can. In security, between a specialist team reviewing one item at a time and embedding policy into the platform's templates. In each, there is a choice: does the specialist team handle the field directly, or move to the side that shapes a system the field can run on its own.

Neither is the one right answer. There are organizations where full centralization fits, and organizations that delegate heavily to the field. Many settle in the middle, where a center provides standards and guardrails and the field moves on its own within that range. AI agents have arrived as an option that can realize the "field moves on its own" side at a lower cost than human effort. The self-service shift in FinOps is one example. Adopting AI agents widens not only the how of monitoring and operation, but the very set of options for where an organization places the center of gravity of its roles.

Anticipated Objections

A few objections come to mind against the picture so far. Let me raise them and answer each.

Objection 1: Datadog's Multicloud Advantage

Datadog can see multicloud and SaaS too, so won't it win on coverage in the end?

The breadth of multicloud coverage is indeed Datadog's strength. But for cost data and CloudTrail, only AWS holds the first-party data, so AWS keeps the advantage of tracing the basis for a cost change at the highest precision. Even a multicloud organization can land on a split where it hands only its AWS cost governance to AWS FinOps Agent. Unified monitoring versus precision of the data source: these two hard-to-reconcile axes are what's at stake here.

Objection 2: The Disadvantage of Staying Inside Its Own Cloud

AWS closes itself inside its own cloud. Doesn't the narrow coverage put it at a disadvantage?

Narrow coverage is indeed a disadvantage. But developers who use AWS as their main platform already spend long hours in the AWS console. Being able to slip into the screen engineers already touch is a strong weapon for lowering the cost of behavior change. When it comes to tool adoption, not having to move people to a new UI counts for a lot.

Objection 3: The Both-at-Once Option

Rather than picking one, why not just adopt both?

In fact, this is a realistic option. In coding, the practice of using multiple agents in parallel has already spread: refining the design with Claude Code, handing implementation to Codex, and having one review the other's output. Use agents with different strengths, and have one verify the other's results. For Ops agents too, the same worldview holds up well.

But it holds up because humans have the judgment axis of "which to use for what, and which output to trust." Ops agents overlap heavily in function; anomaly investigation, auto-remediation, and natural-language questions can all be done by both. Put both in without a judgment axis, and all that's left is the cost of carrying the same experience in two places, with engineers wondering which one to ask. It is the same structure as the monitoring-tool sprawl of the past. Whether you make the most of using both comes down to the design of how you divide them.

Objection 4: The Possibility That Datadog Sees Everything Anyway

If agents on AgentCore are visible from Datadog too, won't Datadog hold it all in the end?

This is a future well worth considering. AgentCore Observability is OpenTelemetry-compatible and can already send telemetry to external monitoring tools, including Datadog. Even if AWS holds the execution base and billing, a split where the main stage of monitoring stays on the Datadog side is possible. Holding execution but opening monitoring to the outside: that gradient looks like the long-term fault line for initiative.

Objection 5: GitHub Entering the Field

Won't GitHub, which holds the repository, build the same loop first?

This connects directly to the three-way contest from Prediction 5. GitHub is actually moving in this direction, building a code-origin loop with the combination of Copilot Autofix and Actions. Datadog's Bits Code starts from monitoring, GitHub Copilot Autofix from code, AWS DevOps Agent from the execution base. One way to organize the picture: it is the same loop, with each vendor entering through a different door.

Where to Start

The axes for telling them apart are now in place: data source, coverage, and how each positions its agents. With those three, you can read each agent's character. So let's move to what to actually do first. I'll organize this assuming a common setup: AWS as the execution base, Datadog for monitoring, GitHub for code management.

Step 1: Put into Words What Your Operation Runs "From"

The first thing to do is not tool selection or a proof of concept, but putting the current state into words. When an incident or cost anomaly occurs, where do you look first, where do you identify the cause, and where do you apply the fix? Write out this round-trip path once, and you start to see which vendor your organization is most compatible with.

The judgment axis can be organized like this.

Central issue	Lead agent	Basis for the call
Cost optimization / cost-anomaly investigation	AWS FinOps Agent	Matching cost data against CloudTrail at this precision is hard for third parties to produce
Infrastructure failures / resource anomalies inside AWS	AWS DevOps Agent	If it stays inside AWS, even execution logs are reachable via its own APIs
Multicloud or cross-SaaS performance and availability	Datadog Bits	The amount of telemetry owned and the breadth of coverage pay off
Visibility into coding-agent operations	Datadog Agent Console	Seeing external agents across one view is currently nearly unique
Code-origin security fixes	GitHub Copilot Autofix	A fix flow tightly coupled to the repository is easy to run

In many organizations, several of these apply at once. If you use AWS as the execution base, Datadog for monitoring, and GitHub for code management, you probably match most rows of the table. That's exactly why, rather than putting everything in at once, you need to prioritize by which path has the highest round-trip cost right now.
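
One way to make "highest round-trip cost" concrete is to estimate, for each path, how often it occurs and how many human hours each occurrence takes, then rank by the product. The sketch below is only an illustration: the path names, the counts, and the hours are hypothetical placeholders, and hours per month is just one possible proxy for round-trip cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundTripPath:
    """One detect -> diagnose -> fix path, with rough monthly estimates."""
    name: str
    occurrences_per_month: int
    hours_per_occurrence: float

    @property
    def monthly_hours(self) -> float:
        # Estimated human hours spent on this path each month.
        return self.occurrences_per_month * self.hours_per_occurrence


def rank_paths(paths: list[RoundTripPath]) -> list[RoundTripPath]:
    """Sort paths by estimated human hours per month, highest first."""
    return sorted(paths, key=lambda p: p.monthly_hours, reverse=True)


# Hypothetical figures, for illustration only.
paths = [
    RoundTripPath("code-origin security fixes", 20, 0.5),
    RoundTripPath("cost-anomaly investigation", 12, 3.0),
    RoundTripPath("incident first response", 6, 4.5),
]
ranked = rank_paths(paths)
for path in ranked:
    print(f"{path.name}: {path.monthly_hours:.1f} h/month")
```

The path at the top of the ranking is the candidate to agentify first; the rest wait until the first trial has produced an evaluation axis.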

Step 2: Agentify Just the One Highest-Round-Trip Path First

Try to agentify everything at once and you end up just trying a bunch of agents at once with no settled evaluation axis. The realistic move is to narrow to the one path from Step 1 where the human round-trip happens most.

For example, an organization spending serious hours each month on cost-anomaly investigation could start by trying AWS FinOps Agent's public preview, configured with Slack notifications and a threshold filter (only investigate anomalies above a certain amount). If the initial investigation for incident response has become a late-night burden, pick either AWS DevOps Agent or Datadog Bits Investigation, choosing based on whether your incidents stay inside AWS or reach external SaaS. Narrow to one path, and you can concretely evaluate what got faster compared to doing it by hand, and what you can't yet trust.
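
The threshold filter mentioned above is simple to reason about in code. This is a minimal sketch of the idea only, not the AWS FinOps Agent configuration format: the `CostAnomaly` type, its fields, and the 500 USD cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostAnomaly:
    """Hypothetical anomaly record; field names are illustrative."""
    account_id: str
    service: str
    impact_usd: float  # estimated extra spend attributed to the anomaly


def anomalies_to_investigate(
    anomalies: list[CostAnomaly], min_impact_usd: float
) -> list[CostAnomaly]:
    """Keep only anomalies at or above the threshold, largest impact first."""
    selected = [a for a in anomalies if a.impact_usd >= min_impact_usd]
    return sorted(selected, key=lambda a: a.impact_usd, reverse=True)


# Hypothetical sample data.
detected = [
    CostAnomaly("111111111111", "AmazonEC2", 1200.0),
    CostAnomaly("222222222222", "AmazonS3", 40.0),
    CostAnomaly("333333333333", "AmazonRDS", 500.0),
]
queue = anomalies_to_investigate(detected, min_impact_usd=500.0)
for anomaly in queue:
    print(anomaly.account_id, anomaly.service, anomaly.impact_usd)
```

Starting with a high cutoff keeps the trial focused on the few anomalies worth a human's attention; the cutoff can be lowered once the agent's findings have earned some trust.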

Step 3: Sort Out Context and Tag Conventions During the Trial

An agent's accuracy changes greatly with the context you hand it. AWS FinOps Agent is designed to take context files like account-to-owner mappings, team definitions, tag conventions, and review cycles. Whether it can resolve "what is this team's cost" to the right set of accounts hinges on this context work.

What matters here is that sorting out context and tag conventions is an investment that won't go to waste no matter which vendor you choose. That includes the account-to-owner table, tag conventions for cost allocation, and service-to-team mappings. These serve as the prerequisite information handed to the agent, whether that agent is AWS FinOps Agent or Datadog's. Sort it out while you're still in the trial phase, and it carries over as is, whether you move to full adoption or switch to a different vendor. Sorting out the metadata at your feet before the agent itself is a shortcut that looks like a detour.
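
To show why this metadata is vendor-neutral, here is a minimal sketch that keeps the account-to-owner table and the tag convention as plain data and answers the two questions an agent needs answered: which accounts belong to a team, and which required tags a resource is missing. The account IDs, team names, and required tag keys are hypothetical, and this is not the context-file format of any particular product.

```python
# Hypothetical account-to-owner table: account ID -> owning team.
ACCOUNT_OWNERS = {
    "111111111111": "payments",
    "222222222222": "payments",
    "333333333333": "search",
}

# Hypothetical tag convention: keys every billable resource must carry.
REQUIRED_TAGS = ("team", "service", "env")


def accounts_for_team(team: str) -> list[str]:
    """Resolve 'what is this team's cost' to the set of accounts it owns."""
    return sorted(acct for acct, owner in ACCOUNT_OWNERS.items() if owner == team)


def missing_tags(resource_tags: dict[str, str]) -> list[str]:
    """Return required tag keys that are absent or empty on a resource."""
    return [key for key in REQUIRED_TAGS if not resource_tags.get(key)]


print(accounts_for_team("payments"))
print(missing_tags({"team": "search", "env": ""}))
```

Because the table and the convention live outside any agent, the same data can be exported into whatever context format the chosen vendor expects.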

Step 4: Start Visibility into Coding-Agent Operations Early, as a Separate Track

This is a slightly different topic from Ops or FinOps, but for running an organization's AI use in a healthy way, it's worth moving on in parallel.

If you use Claude Code, Cursor, and GitHub Copilot across the organization, can you say who uses how much, what it costs, and how many PRs get accepted? Many organizations likely struggle to get these numbers. Use that started as an individual trial eventually spreads to teams and then to the whole organization. To avoid scrambling at that stage, anticipate team and enterprise use from the start. Now that a visibility layer for agent operations like the Datadog Agent Console has appeared, even if you defer adoption, it's worth at least starting the discussion early on how to make your organization's coding-agent use visible.

The metrics worth tracking are the number of users and frequency of use, spend per agent, PR acceptance rate, and rework rate of fixes. With those visible, you can judge by the numbers which agent to weight investment toward, and where waste is occurring.
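
As a rough sketch of how these metrics could be computed from raw counts, the snippet below assumes particular definitions that the post itself does not fix: PR acceptance rate as merged PRs over opened PRs, and rework rate as merged PRs that later needed a follow-up fix over merged PRs. The agent name and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentUsage:
    """Monthly counts for one coding agent; all fields are illustrative."""
    agent: str
    active_users: int
    spend_usd: float
    prs_opened: int
    prs_merged: int
    prs_reworked: int  # merged PRs that later needed a follow-up fix


def acceptance_rate(u: AgentUsage) -> float:
    """Share of agent-opened PRs that were merged (assumed definition)."""
    return u.prs_merged / u.prs_opened if u.prs_opened else 0.0


def rework_rate(u: AgentUsage) -> float:
    """Share of merged PRs that needed a follow-up fix (assumed definition)."""
    return u.prs_reworked / u.prs_merged if u.prs_merged else 0.0


def spend_per_merged_pr(u: AgentUsage) -> Optional[float]:
    """Spend divided by merged PRs, or None when nothing was merged."""
    return u.spend_usd / u.prs_merged if u.prs_merged else None


# Hypothetical figures.
usage = AgentUsage("agent-a", active_users=40, spend_usd=3000.0,
                   prs_opened=200, prs_merged=150, prs_reworked=30)
print(f"{usage.agent}: acceptance={acceptance_rate(usage):.0%}, "
      f"rework={rework_rate(usage):.0%}, "
      f"spend/merged PR={spend_per_merged_pr(usage)}")
```

Whatever definitions you settle on, fixing them before comparing agents matters more than the exact formula, since each vendor's dashboard may count PRs differently.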

Because It's a Transitional Period, Build Switchability into the Design

What's common across these steps is the stance of not going all-in yet. Most Ops agents are still in preview, and their features and billing models will change.

That's exactly why you should avoid building in deep dependence on a specific vendor's conventions, and instead invest in assets that work no matter where you land, like the context and tag conventions from Step 3. Put the data source and round-trip path into words, sort out the metadata, and validate one path at a time. Proceed in that order, and whether you move from trial to full adoption, or the initiative landscape shifts, you can move without panic.

What's Happening Now

That the two companies' announcements landed on the same day is a signal that the Ops-agent category has entered a phase where multiple vendors reach for the same territory at the same time. Datadog, strong in monitoring, and AWS, with the cloud base, are each reaching for the initiative from the source of their own strength. The pattern where whoever holds the most data also holds the action that follows applies to AI agents generally.

AWS's strategy is platform-type: hold the lower layer of the agent execution environment, line up the agents that ride on it with its own products over time, and invite outside developers onto the same foundation. Datadog's strategy is meta-platform-type: hold the outer layer of the monitored target, and make both its own and others' products visible in the same UI. And as the third party in the three-way contest, GitHub and GitLab, which hold the home of source code, are entering the same territory.

Which strategy comes out ahead changes with what an organization runs its operations from, so there is no single answer. But the structure of this competition itself is a microcosm of the initiative war in the age of agents. Each reaches, from its own territory, for what lies beyond. The one who owns the infrastructure reaches for the execution environment of the software that runs on it. The one who monitors reaches for the action beyond the expanding monitored target. The one who keeps the code reaches for the loop of code generation and release.

What's being fought over is where, and in which UI, the loop from detection through fix to release gets completed. The same-day releases of June 9, 2026 brought that fight clearly into the open.

Why Coding Stays in Human-AI Collaboration: A Paradox in Stanford's 51 Deployments

Kento IKEDA — Sat, 06 Jun 2026 04:09:33 +0000

"We rolled out AI and saw no results" and "AI made our development dramatically faster" are being said in the same year, often inside the same company. Where does that gap come from?

Stanford Digital Economy Lab's The Enterprise AI Playbook: Lessons from 51 Successful Deployments (April 2026) goes after that question with real data. It analyzes 51 production deployments across 41 organizations and 9 industries, drawing on structured interviews and internal documents to separate what made deployments succeed from what made them fail.

Most of the coverage so far reads the report from a management angle: AI adoption as an organizational-change problem, the importance of process redesign and executive commitment. That framing is accurate. But the report also spans customer support, software engineering, marketing, and more, and there is plenty in there about software engineering that the management-focused takes barely touch.

Read it with an engineer's eye and one paradox jumps out. While customer support and IT operations move toward autonomous AI, coding alone stays in "human-AI collaboration." That runs against the prevailing mood that "AI coding is the frontier."

This post starts from that paradox. First I'll walk through the report's method and key findings, then analyze the structure that keeps coding in collaboration, and finally re-read the 51 cases from three vantage points: the individual engineer, the engineering lead, and whoever owns org-level development. I'll stay close to the report's findings and then push past them.

What the 51 cases actually are

A quick look at the study first, so the interpretation later lands properly.

Authors and background

The authors are Elisa Pereira, Alvin Wang Graylin, and Erik Brynjolfsson. Brynjolfsson is one of the most-cited researchers in the economics of information, known for early work measuring the productivity effects of IT investment. The "Productivity J-Curve" that Brynjolfsson and colleagues laid out in 2021 is one of the foundations of this study.

The J-Curve goes like this. A general-purpose technology like AI doesn't raise productivity just by being deployed. It needs complementary investment in intangibles: process redesign, training, reorganization. During that investment, productivity actually dips. Only once the investment pays off does productivity jump. The curve dips into a trough before springing up, hence the "J." The report's recurring message that organization matters more than technology rests on this premise.

Method

The study only looks at deployments that moved past the pilot stage into production and produced measurable business value. The selection criteria were:

Stable operation, integrated into real workflows
Used in decision-making by multiple teams for at least 3 months
Clear outcomes in productivity, revenue, or customer satisfaction
Replicable to other teams or regions

Interviews ran from August 2025 to February 2026, with at least one 60-minute structured interview per company, supplemented by internal metrics, project plans, and financial documents. The sample skews toward manufacturing, financial services, and technology.

What the study is telling us

The conclusion is simple. Using the same technology for the same purpose, outcomes varied widely by organization. What made the difference was not the AI model. It was how prepared the organization was, what processes it had, how its leaders engaged, and whether it had a culture that tolerated failure.

The findings most relevant to an engineering organization:

77% of the hardest challenges were intangible costs: change management, data quality, process redesign. The technology itself was consistently rated "the easiest part."
61% of successful projects had a failed AI project before the one that worked. Those sunk costs never show up in the success case's ROI.
For similar use cases, one company took weeks and another took years. The difference was not technology but executive engagement, existing infrastructure, and user willingness.
The "escalation model," where AI autonomously handles 80%+ and humans review only exceptions, had a median productivity gain of 71%, well above the 30% of approval-based models. (The report notes this gap may also partly reflect differences in task characteristics.)
Agentic AI implementations were still only 20% of cases, but their median productivity gain was 71%, above the 40% of high-automation approaches.
In 42% of cases, model choice was fully interchangeable. The durable advantage sat in the orchestration layer, not the foundation model.

The selection bias to keep in mind

Worth stressing: this is a study of successful deployments only. The report is explicit about the selection bias. Companies were asked about past failures and abandoned pilots too, but what ends up analyzed are the cases that created value.

So this study shows "what success looks like and what it takes to get there," not "how common success is." The report cites MIT's NANDA initiative study from 2025, "The GenAI Divide: State of AI in Business 2025" (which reported that 95% of generative-AI pilots produced no measurable financial impact), and positions itself as the inverse: a deep look at the side that succeeded. Read it with that asymmetry in mind.

The coding paradox

Here's the core. Chapter 3 of the report has a table organizing human-in-the-loop (HITL) involvement by business function. Read it as an engineer and the table feels off.

The three HITL levels

The report splits HITL into three levels:

Escalation: AI autonomously handles 80%+; humans review only exceptions (20% or less sampled)
Approval: AI does the work; a human approves each output before it executes
Collaboration: humans and AI both work hands-on, task by task

Autonomy is highest at escalation and lowest at collaboration. By function:

Function	HITL level	Median productivity gain
IT operations	Escalation	90%
Customer support	Escalation	71%
Claims processing	Escalation	50%
Field service	Approval	80%
Clinical documentation	Approval	66%
Coding	Collaboration	54%

(from Chapter 3, "How much human oversight is optimal?")

Coding is the only function in the collaboration tier. Clinical documentation sits in approval because medical records are legal documents a physician has to sign off on, one by one. Claims processing and customer support can move to escalation because they're high-volume, have clear success criteria, and tolerate recoverable mistakes.

So why does coding stay in collaboration? No regulation pins AI down here. And yet humans and AI keep working task by task.

The role shifted from "writing" to "reviewing"

The report describes the change on the coding floor like this: rather than completing a whole task themselves, engineers increasingly review AI-generated changes, make small adjustments, and merge the PR. At one Latin American fintech, AI agents migrated millions of lines of legacy code in a system serving 100M+ customers, compressing work originally estimated at 18 months and 1,000+ people into a few weeks. At an insurer, a legacy rebuild scoped at 5,000 hours, 7 people, and a 2027 finish was done in 600 hours with 3 people.

So coding isn't "not getting faster." The role moved from writing to reviewing, and productivity is up 54%. It just hasn't reached the full autonomy other functions have. There's a structural reason.

Checking coding against the 4 conditions for agentic success

The report lists four conditions under which agentic AI delivers:

High-volume, repetitive tasks
Clear success criteria
Recoverable errors
Access to data across multiple systems

Hold coding up against these four and the reason it stays in collaboration comes into view.

Procurement and alert triage cleanly satisfy all four. High volume, a clear right/wrong, recoverable mistakes. So they move to full autonomy.

Coding? Routine refactors, test generation, dependency bumps tend to satisfy the four. But feature work and architectural change break them. "Tests pass" isn't a sufficient success criterion when readability, maintainability, and fit with existing design are in play. And production migrations or schema changes can produce unrecoverable errors. Two of the conditions, "clear success criteria" and "recoverable errors," fail across a wide swath of coding.

Per the METR measurements the report cites (METR is a research org that measures AI autonomy), the length of software tasks frontier models can complete autonomously has been doubling roughly every 7 months, reaching about 15 human-expert-hours in early 2026. Anthropic, meanwhile, warns that around the 3.5-hour mark, API success rates drop below 50%. Coding agents that run autonomously for days and emit tens of thousands of lines are no longer rare, but production reliability falls off as tasks get longer and more complex. That's exactly why the human involvement of engineer review still governs quality.
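
To get a feel for what "doubling roughly every 7 months" implies, the trend can be written as hours(t) = base × 2^(t / 7), with t in months. The sketch below plugs in the figures quoted above (about 15 hours in early 2026, a 7-month doubling time). It is a naive extrapolation under the assumption that the trend simply continues, not a forecast from METR or from the report.

```python
def projected_task_hours(
    months_ahead: float,
    base_hours: float = 15.0,      # approx. figure quoted for early 2026
    doubling_months: float = 7.0,  # approx. doubling time quoted above
) -> float:
    """Extrapolate autonomous task length, assuming steady exponential growth."""
    return base_hours * 2 ** (months_ahead / doubling_months)


for months in (0, 7, 14, 21):
    print(f"+{months:>2} months: ~{projected_task_hours(months):.0f} expert-hours")
```

The arithmetic shows why the boundary is a moving target: under this assumption the completable task length quadruples in 14 months, even though reliability on long tasks may not improve at the same pace.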

Why engineering is a natural fit for collaboration

One step further. Coding stays in collaboration not because AI is weak or engineers are behind, but because software engineering already has a deeply layered culture of verification.

Type systems, unit tests, code review, CI, static analysis, canary releases. Engineering spent decades building a culture that distrusts even human-written code and puts it through layers of verification before production. Adding review on top of AI-written code is the most natural extension of that culture.

The flip side: full autonomy (escalation) collides head-on with that verification culture. "A human reviews only a 20% sample" works for alert triage, but "review only 20%" against production code runs against most engineers' instincts. Coding stays in collaboration partly as a technical limit and partly because engineering, as a discipline, is built around verification.

Seen this way, HITL level isn't a simple matter of "as models improve, things automatically advance to escalation." The stronger the verification culture in a domain, the longer collaboration persists. Coding's path to autonomy depends not just on model performance but on how much of the verification you can hand to AI itself, specifically, how you design the layer where AI writes tests and AI reviews. Whether review itself can be handed to AI is something I'll come back to later.

Re-reading from three vantage points

What coding-stays-in-collaboration means depends on where you stand. Three vantage points: the individual engineer, the engineering lead, and whoever owns org-level development. The same study hands each a different assignment.

The individual engineer: reviewing becomes the precondition for productivity

The change the report describes means an individual engineer's daily work is already shifting. Instead of writing from scratch, you read AI-generated changes, judge them, adjust, and merge. Time spent writing code gives way to time spent evaluating code.

This is where the nature of the collaboration model bites. Under escalation, a human only looks at exceptions. Under collaboration, human judgment governs quality on every single output. Let review get sloppy and defects in generated code flow straight to production.

The awkward part: review gets harder as AI gets better. The more "plausible" the output looks, the more humans skip the details. Automation complacency, the long-known phenomenon in aviation and process industries where over-trusting an automated system erodes attention, shows up in code review too. Obviously wrong code is easy to catch; code that's 90% right with a subtle 10% wrong slips past a skim. Collaboration's 54%, one of the more modest gains in the table (only claims processing, at 50%, is lower), can be read partly as that "cost of review" offsetting the productivity gain.

The question for the individual engineer is where to draw the line between what to delegate to AI and what to keep under your own judgment. The data says coding is still in the collaboration stage, meaning human judgment is directly tied to quality. Not taking AI output at face value and not treating review as a formality become preconditions for productivity at this stage. The direction of skill shifts too. More than the ability to write code from zero, the ability to spot defects in code others (or AI) wrote, and to read the intent behind a design, grows relatively more important.

The engineering lead: designing the HITL levels

For a lead, the job becomes designing which development tasks run at which HITL level. The report frames HITL choice as determined by error tolerance, regulatory requirements, and task complexity. That maps directly onto a team's design guidance.

The four conditions for agentic success (high-volume/repetitive, clear success criteria, recoverable errors, multi-system data access) double as a way to sort development tasks. Tied to practice, roughly:

Lean toward escalation: dependency bumps, lint/format autofixes, boilerplate generation, filling test-coverage gaps. High-volume, mechanically checkable success criteria, and CI stops failures.
Collaboration fits: feature implementation, refactoring, bug fixes. Success criteria involve readability and design fit, and human review governs quality.
Push toward approval: production migrations, schema changes, auth and billing. Errors can be unrecoverable, so a human approves each one even when AI does the work.

Leave that sorting vague and pick "let AI do everything" or "humans review everything," and the former invites incidents while the latter caps productivity. Varying the autonomy level by the nature of the task is where a lead earns their keep.
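
The sorting above can be written down as a small rule so that a team applies it consistently. This is a simplified sketch of the post's three buckets, not a rule taken from the report: it uses only three of the four conditions (the multi-system data access condition is left out because the sorting here does not turn on it), and the task names and flags are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class HitlLevel(Enum):
    ESCALATION = "escalation"        # AI acts; humans review exceptions
    APPROVAL = "approval"            # AI does the work; a human approves each output
    COLLABORATION = "collaboration"  # human and AI work task by task


@dataclass(frozen=True)
class DevTask:
    name: str
    high_volume: bool             # frequent and repetitive
    mechanically_checkable: bool  # success can be verified by CI alone
    recoverable: bool             # mistakes can be rolled back cheaply


def suggest_level(task: DevTask) -> HitlLevel:
    """Suggest a HITL level from the task's characteristics."""
    if not task.recoverable:
        # Unrecoverable errors: a human approves each change.
        return HitlLevel.APPROVAL
    if task.high_volume and task.mechanically_checkable:
        return HitlLevel.ESCALATION
    # Success depends on readability or design fit: keep humans hands-on.
    return HitlLevel.COLLABORATION


tasks = [
    DevTask("dependency bump", True, True, True),
    DevTask("feature implementation", False, False, True),
    DevTask("schema change", False, False, False),
]
for task in tasks:
    print(f"{task.name}: {suggest_level(task).value}")
```

Checking recoverability first is a deliberate choice: a task that can cause unrecoverable damage goes to approval even if it is high-volume and CI-checkable.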

There's another important finding. In 42% of cases model choice was interchangeable, and the durable competitive advantage was not the underlying large model (the foundation model) itself but the design of how you combine and use it, the orchestration layer. Which model you use is becoming a commodity for many use cases. What separates teams is how you decompose tasks, where you insert human involvement, and how you wire up multiple tools and data sources. In the report's words, the advantage isn't the model but how you compose it.

This is actually good news for the people doing the designing. You don't have to win the race of chasing the latest model; you can build an edge on the quality of task decomposition, HITL design, and tool integration. How you build the verification layer, the multi-stage scheme where AI writes tests and AI reviews, is exactly this orchestration-layer question, and it's the key to moving coding from collaboration to the next stage.

Org-level development: how to use the freed capacity and make it stick

At the org-development level, the question is what happens after productivity rises. The report shows, with concrete cases, that what you do with the freed capacity is decided by organizational choice, not technology.

The report lays out three strategies for that capacity:

Accelerate: keep headcount, pour it into development speed
Redeploy: move people from automated work to higher-value work
Reduce: cut headcount directly

At one PE-owned company, an 88% productivity gain in coding led to cutting the development team from 7 to 3. At an edtech company, a 20-30% improvement in coding went not to layoffs but to accelerating the roadmap: with a large product backlog, shipping features faster was worth more than trimming the team. The report notes growth-stage companies lean toward acceleration, while cost-focused ownership (PE, turnaround) leans toward reduction. The same productivity gain can tip toward reduction or acceleration. What decides is not technology but organizational strategy.

A second case, security operations, reads as a model of redeployment. A 6-person SOC (security operations center) team at one tech company was buried under 1,500 alerts a month. After automating first-pass triage with AI, the required headcount dropped to the equivalent of 1.5 people, but no one was laid off. The freed 4.5 FTE were redeployed to proactive threat hunting, security-design review, and team skill-building. The executive who led it put it this way: "AI isn't replacing the person you have; it's replacing the person you don't need to hire." In areas like SRE and security, chronically understaffed with a backlog of "things we want to do but can't," redeployment rather than reduction is the natural choice.

Org-development also has to think about how to bring cautious departments along. Per the report, the most cautious about AI adoption were not frontline users (23%) but staff functions like legal, HR, risk, and compliance (35%). Different stances care about different things: executives want ROI you can see in numbers, staff functions worry about procedural risk and where the blame lands, and the frontline fears losing their jobs. Each needs a different move. Spreading AI inside an engineering org likewise means looking past the dev team's own walls. How you bring legal, security, and HR along shapes how fast you can roll out. The report has several cases where handing those staff functions a governance role turned the cautious departments into active champions.

Executive involvement comes in stages too, the report says. Hands-on engagement, checking progress weekly and clearing blockers, accounted for 58% of the successful cases. The 7 cases that reached company-wide transformation all wired AI adoption into a corporate OKR and tied it to evaluation and compensation. For an engineering leader as well, the condition for making it stick is to connect AI use to organizational goals rather than leaving it as a "clever trick on the floor." And one more thing the report stresses repeatedly: a culture that doesn't punish failure. 61% of successful cases had a prior failure, and in none of the cases studied was anyone punished for a failed AI project. Since putting AI into production presupposes trial and error, building a culture that can forgive failure is one of the most important jobs at the org-development level.

What to add when you read the report

I've walked through the report's findings. But what's in the report isn't enough on its own. To carry it into practice, a few things need reading-in.

First, selection bias. As noted, this is a study of successful deployments. The report itself admits it doesn't show "how common success is." Take it as a description of patterns shared by organizations that succeeded, not a guarantee that the same approach will work. MIT's "95% fail" number and this study's "51 that succeeded" are the same phenomenon seen from opposite sides. Only by overlaying both do you get the full picture.

Second, the handling of reliability. As one critique points out, the report claims "messy data isn't a blocker if you design around it," yet flags reliability problems in 27% of cases while never once using the word "hallucination."

To an engineer, "you can design around it" doesn't quite hold for messy data and model instability. In coding especially, hallucination shows up as "plausible but wrong code" that can slip past review. Take the report's view on board, but also look hard and soberly at your own data quality and model reliability. The "vague success criteria" and "unrecoverable errors" I named earlier as reasons coding stays in collaboration are, in fact, another face of this same reliability problem.

Third, the time axis. The report's interviews ran from August 2025 to February 2026, when agentic AI was still nascent, and at the time of the study agentic implementations were only 20% of cases. The report itself notes that the redeployment and hiring-freeze patterns observed here are characteristics of an early-adoption phase, and that the distribution may shift as models mature and cost pressure builds. It also references separate Brynjolfsson-et-al. research showing that employment of younger workers in AI-exposed roles has already declined in relative terms, and warns this is an early sign of a larger shift. The coding-stays-in-collaboration picture here is likewise not fixed; read it as a snapshot of this moment. Given how fast the length of tasks AI can complete autonomously is growing (per METR), the boundaries should keep moving, from collaboration to approval, from approval to escalation.

Finally, separate from those three points, let me answer an objection likely to be aimed at this post's own argument. Even if coding stays in collaboration, why not just hand review to AI too? In fact, AI code review has spread fast, and AI now catches style violations and common bugs. For low-risk changes, some teams already run on AI review alone.

But having AI review AI-written code has a trap. The same model tends to share the same blind spots, so "plausible but wrong code" can be missed by both the author-AI and the reviewer-AI. The automation complacency described earlier compounds when one AI checks another. The one who ultimately decides the merge and owns the soundness of the design is, for now, a human. So even as AI review advances, it's more realistic to expect that human review won't go to zero but will narrow to the high-risk areas. Low-risk changes go to AI; areas carrying unrecoverable risk stay with humans. That lines up exactly with this post's view: vary the HITL level by task.
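To make "vary the HITL level by task" a little more concrete, here is a minimal sketch of a routing rule. Everything in it is my own illustration: the Change fields, the 200-line threshold, and the three review levels are assumptions, not something taken from the Stanford report.

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    AI_ONLY = "ai_only"              # AI review is enough to merge
    AI_THEN_HUMAN = "ai_then_human"  # AI pre-screens, a human approves
    HUMAN_LED = "human_led"          # a human owns the review, AI only assists


@dataclass
class Change:
    touches_migration: bool  # schema or data migrations are hard to undo
    touches_auth: bool       # authentication or permission code
    lines_changed: int
    has_rollback: bool       # can the change be reverted cleanly?


def review_level(change: Change) -> Review:
    # Unrecoverable risk stays with humans regardless of size.
    if change.touches_migration or not change.has_rollback:
        return Review.HUMAN_LED
    # Sensitive or large changes get an AI pre-screen plus human approval.
    if change.touches_auth or change.lines_changed > 200:
        return Review.AI_THEN_HUMAN
    return Review.AI_ONLY
```

The useful part is not the specific thresholds but that the rule is written down, so the team can argue about it and move the boundaries as models improve.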

Closing

Coding stays in human-AI collaboration not because the technology is immature or engineers are slow to change. Task complexity, the risk of unrecoverable errors, and the fact that software engineering already has a multi-layered culture of verification: these three overlap to keep coding at a stage where human judgment still governs quality.

That fact hands each vantage point a different assignment. The individual engineer: the responsibility to keep engaging with review that gets harder as AI advances. The lead: the role of deciding which task sits at which autonomy level and how to design the verification layer. The org-development owner: the work of choosing, as strategy, what to do with the freed capacity, and of building a culture that forgives failure plus org-wide adoption that sticks.

What Stanford's 51 cases show again and again is one thing: what separates outcomes is not technology but organization. When coding's autonomy moves to its next stage, what decides it won't be model performance but how the engineering org designs, judges, and builds in verification. The shift has already begun. The organizations that don't put off that design work are the ones that will pull ahead.

"Reinstalling Won't Fix It": A Cross-App Shared-Auth Deadlock After Switching Phones

Kento IKEDA — Sat, 30 May 2026 07:26:30 +0000

After migrating to a new Android phone, a few specific apps stopped launching. Amazon Shopping and Kindle would freeze on a blank white or black screen for a while, then close on their own. Reinstalling, clearing storage, updating the OS — none of it helped. Going through the usual support steps changed nothing.

What finally fixed it was clearing the storage of every Amazon app at once. When I traced the cause through ADB logs, it turned out that authentication data shared across multiple apps had become inconsistent, and the auth-retrieval step at startup was deadlocking.

The incident itself happened with a specific Pixel-and-Amazon combination, but structurally it's a pattern that can hit any app where "authentication data shared across apps" meets "many subsystems initialized in parallel at startup for speed." It's worth knowing about whether you design SDKs, build apps, or handle ops and support, so I'm leaving it here as a case study.

Note: This happened on my own device, with my own account. I'm reading the diagnostic output that the OS itself wrote out — not decompiling any app.

What happened after switching phones

Right after migrating data to a Pixel 10a, only certain apps refused to launch.

Affected: Amazon Shopping, Kindle
Symptom: open the app, it freezes on a blank white or black screen for about ten seconds, then closes on its own
Everything else: other apps work fine

The officially suggested remedies are generic — "reinstall the app," "clear the cache," "update the OS" — and none of them worked. When the root cause is in the OS or the data migration rather than the app itself, the standard support path has a hard time isolating it.

At first I couldn't even tell whether it was an OS problem, an app problem, or an Amazon account problem.

When nothing works, change the framing

I tried all the standard fixes.

Reinstalling the affected apps
Clearing storage of just the affected apps
Clearing the cache of the affected apps
Updating every app in the Play Store
Updating the OS
Restarting the device

The affected apps still wouldn't launch, no matter what.

The key realization: as long as you think of it as "a problem with one app," you won't fix it. Reinstalling just the affected app, or clearing just that app's storage, changes nothing. If that's the case, the cause probably isn't contained within a single app.

Only once you reframe it as "a problem across a group of apps" does the solution come into view.

Hypothesis: the shared authentication data is corrupted

Let me start with a hypothesis built only from public information. I'll back it up with logs in the second half.

Amazon's Login with Amazon SDK has a mechanism that lets other apps reuse the Amazon Shopping app's logged-in state. This is documented officially.

https://developer.amazon.com/docs/login-with-amazon/customer-experience-android.html

According to the docs, if the user is already signed in to the Amazon Shopping app, an app that integrates Login with Amazon won't ask them to re-enter account details — the SDK recognizes and reuses the auth state of the Amazon Shopping app or the Fire OS device. That's single sign-on (SSO). The SDK's internal package name is com.amazon.identity.auth.map.device, which also appears in Amazon's official migration guide.

https://developer.amazon.com/docs/login-with-amazon/upgrade-android-sdk.html

What we can say from this is that Amazon's authentication layer (referred to internally in the SDK as MAP) is designed so that other apps can reference the Amazon Shopping app's logged-in state. What the docs directly describe is SSO for apps using Login with Amazon, but it's reasonable to think that Amazon's own apps such as Kindle and Prime Video share the same auth layer too.

The hypothesis, then: during data migration, only part of the authentication data was carried over in a corrupted state, and the startup process that tries to fetch that shared data is getting stuck. If that's right, it explains why nothing short of wiping the apps that hold the shared data will fix it.

That said, specifics like "the first-installed app holds the auth data as the representative" or "only one particular app is the source" can't be asserted from public information. I'll check how solid the hypothesis is by reading the ANR trace in the second half.

The fix: clear storage for the whole group of related apps at once

Here's the fix up front. The concrete steps:

Open Settings > Apps > See all XX apps
List every Amazon app installed
For each one, run Storage & cache > Clear storage
Once they're all cleared, restart the device
Then open Amazon Shopping or Kindle — if a login screen appears, you're good

Examples of apps to target:

Amazon Shopping
Kindle
Amazon Prime Video
Amazon Music
Amazon Photos
Amazon Alexa

The important point is that what you need to wipe is not "the app that's failing" but "every app that shares the authentication data." In my case, the one that finally did it was clearing Prime Video's storage. It was an app I barely ever opened, and until I cleared it, clearing the other Amazon apps did nothing. It may well have been the source of the shared data.

Migration tools restore apps automatically from the old device's app list. As a result, you can end up with Amazon apps you haven't used in ages — ones you've forgotten you ever installed. In the user's mind it's an "app I don't use," but in the authentication-sharing network it's a full-fledged node, and the corrupted data sitting there drags down the apps you are launching. The instinct to "only clear the app that's failing" or "only clear the apps I actually use" backfires here.

Verifying the hypothesis with the ANR trace

Now the main part. Let's verify whether this really is a shared-auth deadlock, using the ANR (Application Not Responding) stack trace.

The first thing to establish: what's killing the app is an "ANR," not a "crash." A crash throws an exception and the process dies immediately; an ANR is the system force-closing the app after the main thread has failed to respond for a while (from several seconds up to about ten, depending on what the system is waiting for). Freezing on a blank screen and then closing is the classic ANR symptom: not an exception, but a timeout while waiting for a response.

Since this was happening on my own device with my own account, I connected the Pixel to a Mac over ADB and pulled the diagnostic log (the stack trace) that the OS wrote out when the ANR occurred. Again, I'm not decompiling the app — just reading the diagnostic output the OS left behind.

adb shell dumpsys dropbox --print data_app_anr | \
  grep -A 200 "Process: com.amazon.mShop.android.shopping"

The "DropBox" in dumpsys dropbox refers to DropBoxManager, the Android system-log mechanism that stores diagnostic entries (crashes, ANRs, and so on) over time. It has nothing to do with the cloud storage service of the same name. --print data_app_anr pulls only the entries tagged as app ANRs, filtered here by Amazon Shopping's process name.

The trace recorded several threads running in parallel at startup. The key part: they were waiting on each other's locks. Let's read them in order.

main thread (tid=1): the UI itself, stuck

The main thread was stuck while running a startup task called AndroidComponentDetectTask.

"main" prio=5 tid=1 Blocked
  at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(...)
  - waiting to lock <0x00eb4d79> held by thread 37
  at com.amazon.platform.service.ServiceRegistryImpl.getService(...)
  at com.amazon.mShop.appStart.AndroidComponentDetectTask.apply(...)
  ...
  at android.app.ActivityThread.handleBindApplication(...)

It's trying to acquire a lock, <0x00eb4d79>, and waiting for thread 37 to release it. The Service Registry (the common registry where subsystems register themselves and look each other up) takes this lock internally when fetching or creating a service. On Android the main thread is the UI thread, so when it stops here, nothing gets drawn and the screen stays blank.

thread 36: collateral damage waiting on the same lock

The error-reporting init task (thread 36) was waiting on the exact same lock as main.

"StagedExecutor2-pool-19-thread-1" prio=5 tid=36 Blocked
  at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(...)
  - waiting to lock <0x00eb4d79> held by thread 37
  at com.amazon.platform.service.ServiceRegistryImpl.getService(...)
  at com.amazon.mShop.sam.log.SAMLogManager.initialize(...)
  at com.amazon.mShop.errorReporting.ErrorReporter.startSession(...)

Also waiting on <0x00eb4d79>. This Service Registry lock is a congestion point that multiple threads fight over at startup.

thread 37: the culprit, holding a lock while waiting on auth data

The problem thread is thread 37. It was holding <0x00eb4d79> (the Service Registry lock) while trying to acquire another lock, <0x004a4835>, and getting stuck.

"StagedExecutor3-pool-20-thread-1" prio=5 tid=37 Blocked
  - waiting to lock <0x004a4835> held by thread 62
  at com.amazon.identity.auth.device.api.MAPAccountManager.getAccount(...)
  at com.amazon.mShop.minerva.MinervaWrapperMAPClient.fetchAndSetAccountAttributeForTeen(...)
  at com.amazon.mShop.minerva.MinervaWrapperMAPClient.<init>(...)
  at com.amazon.mShop.minerva.MinervaWrapperServiceImpl.initializeMinervaClientIfNeeded(...)
  at com.amazon.platform.service.ServiceRegistryImpl.instantiateService(...)
  at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(...)
  - locked <0x00eb4d79>

Reading bottom to top, the sequence is:

The Service Registry takes its internal lock <0x00eb4d79> to create a service (at this moment, main and thread 36 are made to wait)
Still holding that lock, it proceeds into initializing the metrics SDK (Minerva) client
Inside that, it calls MAPAccountManager.getAccount to fetch the currently logged-in account
The auth SDK (MAP) tries to take another lock, <0x004a4835>, internally
But that lock is held by thread 62 and never comes back

Thread 37 sits in the decisive position that triggers the deadlock: holding the Service Registry lock while frozen waiting for auth data. Because it won't release the lock it holds, main and thread 36 — which are waiting on it — stall in turn.

thread 27: another one waiting on the auth lock

On top of that, thread 27 (Weblab, fetching A/B-test flags) was also waiting on the same auth lock <0x004a4835> as thread 37.

"StagedExecutor1-pool-15-thread-1" prio=5 tid=27 Blocked
  - waiting to lock <0x004a4835> held by thread 62
  at com.amazon.identity.auth.device.api.MultipleAccountManager.getAccountForMapping(...)
  at com.amazon.mShop.sso.SSOUtil.getCurrentAccountFromDisk(...)
  at com.amazon.mShop.core.features.weblab.WeblabServiceImpl.getTreatmentAndCacheForAppStartWithTrigger(...)

Weblab also needs auth information at startup, and via getAccountForMapping it's waiting on the same auth lock to be released. Note that thread 37 reaches the lock through MAPAccountManager and thread 27 through MultipleAccountManager — two different APIs converging on one internal lock.

The big picture: auth-data retrieval is where every task converges

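Laid out, the dependencies look like this:

main (tid=1) and thread 36 wait on the Service Registry lock <0x00eb4d79>, which thread 37 holds
thread 37 and thread 27 wait on the auth lock <0x004a4835>, which thread 62 holds
thread 62 (examined below) waits on a response from another process and never gets one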

What stands out is that the tasks meant to run in parallel at startup (metrics, error reporting, A/B testing, component detection) all ultimately converge on a single point: "fetching the MAP auth data." Minerva and Weblab are supposed to be independent features, yet somewhere in initialization each one reaches for the same auth SDK to find out "who is logged in right now."

That auth-data retrieval never returns, because the shared data is corrupted. Every task that needs auth stalls; and because the task holding the Service Registry lock has stalled, even tasks unrelated to auth (main, error reporting) get dragged down. That's the full chain that leaves the screen blank until the ANR fires.

Following thread 62 — the one stuck while holding the auth lock — it was sending a query to another process via a ContentProvider and waiting for the response. A ContentProvider is Android's mechanism for sharing data between apps, and Amazon's apps appear to use it to pass authentication data around. It seems thread 62 was stuck holding the auth lock because one of the sharing sources never returned a response. Which app, and why it didn't respond, can't be pinned down from this trace alone. But the structure — "go fetch the shared auth data from the source, and it never comes back" — is consistent with the fact that wiping every Amazon app's storage fixed it.

Strictly speaking, this isn't a circular wait where two threads grab each other's locks (the textbook deadlock). It's a hang: a thread holding a lock freezes waiting on an external process, and the threads waiting on it stall in a chain. But since the outcome — "stuck holding a lock, with everyone waiting on it blocked forever" — is no different from a deadlock, I'm calling it a deadlock in this article.
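The chain-versus-cycle distinction can be checked mechanically. Below is a minimal sketch that reads the "waiting to lock ... held by thread N" lines from a trace and follows the holder links. The sample text is trimmed from the excerpts above; the function names are mine, and I'm assuming that "held by thread N" refers to the tid shown in each thread's header line.

```python
import re

THREAD_RE = re.compile(r'^"(?P<name>[^"]+)".*\btid=(?P<tid>\d+)\b')
WAIT_RE = re.compile(
    r"- waiting to lock <(?P<lock>0x[0-9a-f]+)> held by thread (?P<holder>\d+)"
)


def wait_edges(trace):
    """Map each blocked tid to (lock address, tid of the holder)."""
    edges = {}
    current = None
    for line in trace.splitlines():
        header = THREAD_RE.match(line.strip())
        if header:
            current = int(header["tid"])
            continue
        wait = WAIT_RE.search(line)
        if wait and current is not None:
            edges[current] = (wait["lock"], int(wait["holder"]))
    return edges


def follow(edges, start):
    """Walk holder links from `start`. Returns (path, is_cycle)."""
    path = [start]
    while path[-1] in edges:
        holder = edges[path[-1]][1]
        if holder in path:
            return path + [holder], True   # circular wait: textbook deadlock
        path.append(holder)
    return path, False                     # chain ends at a thread not waiting on a lock


TRACE = '''
"main" prio=5 tid=1 Blocked
  - waiting to lock <0x00eb4d79> held by thread 37
"StagedExecutor2-pool-19-thread-1" prio=5 tid=36 Blocked
  - waiting to lock <0x00eb4d79> held by thread 37
"StagedExecutor3-pool-20-thread-1" prio=5 tid=37 Blocked
  - waiting to lock <0x004a4835> held by thread 62
"StagedExecutor1-pool-15-thread-1" prio=5 tid=27 Blocked
  - waiting to lock <0x004a4835> held by thread 62
'''

edges = wait_edges(TRACE)
print(follow(edges, 1))  # ([1, 37, 62], False)
```

Starting from main, the path ends at thread 62 without looping back, which matches the reading above: a chain of stalls behind one stuck thread, not a cycle.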

The design pitfalls this case reveals

The textbook lessons — "acquire locks in a consistent order," "don't block the main thread" — apply here too, of course. But what the trace really surfaced is the pitfall that emerges when well-intentioned design decisions pile up. Parallelization for speed, auth lookups for functionality, cross-app data sharing for convenience. Each is reasonable on its own, but stacked together they become the following three pitfalls.

Pitfall 1: parallel-init speedups backfire on shared resources

Initializing subsystems in parallel to speed up startup looks like a correct optimization. Indeed, the trace recorded several init tasks running concurrently on separate threads — metrics, error reporting, A/B testing, component detection.

The thing is, many of them internally call the same shared operations: "register with the Service Registry" and "fetch the current logged-in account." Even run in parallel, they end up serialized on the shared resource's lock. On its own that just makes startup slower — but when the thread holding a lock stalls on something else, everyone waiting gets swept up all at once, as happened here.

Parallelization aimed at speed becomes effectively serial under shared-resource contention, and in the worst case deadlocks. The most dangerous assumption is "it parallelized, so it must be faster." When you add startup tasks, you have to look at how each one touches shared resources (registry, auth, settings store) as a set. Otherwise you not only miss the speedup from parallelism, you also raise the odds of a deadlock.
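One way to limit the blast radius is to never run a service's construction while holding a lock that other services also need. The Java documentation for ConcurrentHashMap.computeIfAbsent carries the same warning: other threads' updates may be blocked while the computation runs, so it should be short and simple. The sketch below shows the idea in Python with a hypothetical ServiceRegistry of my own; it is not how Amazon's registry works, just an illustration of a short shared lock plus a per-service lock.

```python
import threading


class ServiceRegistry:
    """Hypothetical registry: one short global lock, one lock per service."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the two dicts only
        self._locks = {}                # service name -> per-service lock
        self._services = {}             # service name -> built instance

    def get(self, name, factory):
        with self._guard:               # held briefly, never during factory()
            if name in self._services:
                return self._services[name]
            lock = self._locks.setdefault(name, threading.Lock())
        with lock:                      # only callers of `name` wait here
            with self._guard:
                if name in self._services:
                    return self._services[name]
            instance = factory()        # may be slow, or may hang
            with self._guard:
                self._services[name] = instance
            return instance


registry = ServiceRegistry()
release = threading.Event()


def stuck_metrics():
    release.wait()  # stands in for an auth lookup that does not return
    return "metrics"


t = threading.Thread(
    target=registry.get, args=("metrics", stuck_metrics), daemon=True
)
t.start()

# An unrelated service is still reachable while "metrics" is stuck.
print(registry.get("component_detect", lambda: "detector"))  # detector
release.set()
```

This does not make the stuck service work. It only keeps a stall in one initializer from freezing callers that wanted something else, which is the difference between a degraded startup and a blank screen.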

Pitfall 2: auth has become an implicit dependency of every feature

The most surprising thing in the trace was that both metrics and A/B testing — features that look unrelated to auth — were reaching for "who is logged in right now" during initialization. Metrics wants to attach user attributes; A/B testing wants to bucket by account. The reasons are each fair enough, but the result is that the auth SDK has become an implicit dependency point for the entire app.

When auth-data retrieval jams at a single point, it's not auth itself that stops — it's every feature that referenced auth, stalling in a chain. You need to recognize that auth isn't "a concern around the login screen" but "a critical path of the entire startup sequence." If you count how many subsystems call auth-data retrieval at startup in your own app, the number may be higher than you'd imagine.
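If a feature only wants the account for enrichment (attaching attributes, bucketing), it can treat auth as optional at startup: ask with a deadline and fall back to an anonymous default. Here is a minimal sketch; call_with_timeout and stuck_get_account are hypothetical names of mine, and the timeout value is arbitrary.

```python
import threading


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() on a helper thread; return fallback if it has not finished in time."""
    box = {}

    def worker():
        try:
            box["value"] = fn()
        except Exception:
            pass  # treat a failed lookup like a missing one

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    return box.get("value", fallback)


never = threading.Event()


def stuck_get_account():
    never.wait()  # simulates a shared-auth query that never answers
    return "account-123"


account = call_with_timeout(stuck_get_account, timeout_s=0.1, fallback=None)
bucket = "default" if account is None else f"user:{account}"
print(bucket)  # default
```

A deadline only unblocks the caller; the helper thread is still stuck in the background. It also does nothing if the caller is holding a shared lock while it waits, so this complements rather than replaces keeping auth lookups out of lock-holding code paths.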

Pitfall 3: ownership of shared data goes adrift during migration

A design where multiple apps share authentication data is convenient for users — sign in once and you don't need to log in again on the other apps. The problem is that "who owns this shared data, and who fixes it when it breaks" is left implicit.

Suppose there's an implicit rule like "the first-installed app is the representative." If migration doesn't reproduce the install order or state, the ownership relationship goes adrift. The owner sits there holding corrupted data while other apps go to reference it. The fact that nothing was fixed until I cleared Prime Video this time may have this ownership ambiguity in the background. Shared data needs a fallback — another app taking over, or safely regenerating the data — for when the owner disappears or its data breaks.
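The "safely regenerating" half of that fallback can be sketched as a reader that refuses to trust a record it cannot verify. This is purely illustrative: I don't know how the real shared data is stored, and seal, load_or_reset, and the record fields are hypothetical.

```python
import hashlib
import json


def seal(record):
    """Attach a checksum so readers can detect a partial or corrupted copy."""
    body = json.dumps(record, sort_keys=True)
    return {"body": body, "sha256": hashlib.sha256(body.encode()).hexdigest()}


def load_or_reset(stored):
    """Return the record if intact, else None so the caller asks for a fresh login."""
    if not isinstance(stored, dict):
        return None
    body, digest = stored.get("body"), stored.get("sha256")
    if not isinstance(body, str):
        return None
    if hashlib.sha256(body.encode()).hexdigest() != digest:
        return None
    return json.loads(body)


good = seal({"account": "a1", "owner_app": "shopping"})
broken = dict(good, body=good["body"][:10])  # simulate a partially migrated copy

print(load_or_reset(good)["account"])  # a1
print(load_or_reset(broken))           # None
```

Returning None here means "show the login screen," which is exactly the state the manual fix reached by clearing storage. The point is to get there automatically instead of hanging.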

Lessons for support and for users

Even if you're not in a position to change the design, knowing this structure changes how fast you can respond.

For support: keep in mind that "please reinstall" only works when the problem is contained within a single app. For a post-migration report of "only certain apps won't launch," suspect that migration left the shared data corrupted. Being able to offer the next move ("clear storage for the related apps as a group") changes the opening response. Even just asking "did this start right after switching phones?" up front can sometimes narrow the investigation considerably.

For users: think of the cleanup target as "every app from the same provider," not "the app that's giving you trouble." Keeping in mind that even an app you don't use is a node in the sharing network, and that corruption there causes collateral damage, raises your odds of getting out of it on your own.

Other cases where the same pattern can occur

It's not just Amazon — there are plenty of designs where multiple apps share authentication information.

Sharing auth tokens across apps via Android's standard AccountManager
Sharing login information across same-signature apps through a ContentProvider
Groups of apps with a common account platform (cross-app login across several apps from the same company, for example)

When these combine with a design that "initializes many subsystems in parallel at startup," the same conditions line up as in this case: when the shared data breaks, every app in the group fails to launch in a chain, and a single reinstall won't fix it. If your own app group meets these two conditions, it's worth checking how you guarantee consistency across a device migration, and how you degrade or regenerate when the sharing source breaks.

Closing

If you run into apps not launching after switching phones, remember one rule: if clearing a single app doesn't fix it, clear the whole group of related apps. That's the shortest fix from the user's side. When the sharing source is broken, you have to wipe the source along with the rest.

From the design side, the three pitfalls this trace surfaced are worth remembering: parallel init can backfire on shared resources; auth tends to become an implicit critical path for every feature; and ownership of shared data goes adrift during migration. Each is a "well-intentioned design" on its own, yet combined they produce an app that won't start.

"Please reinstall" only holds up as a universal fix for designs contained within a single app. For apps that hold shared data and carry a complex startup sequence, the same symptom recurs even after reinstalling. Just knowing this one structure makes a real difference in how fast you respond the next time you hit the same incident.

What AgentCore Managed Harness Takes Over, What It Leaves to You

Kento IKEDA — Fri, 08 May 2026 21:03:13 +0000

On April 22, 2026, AWS added a "managed agent harness" (preview) to Amazon Bedrock AgentCore. With this feature, you declare the model, system prompt, and tools as configuration, and the agent runs; the orchestration code lives on the AWS side as a managed service.

https://aws.amazon.com/blogs/machine-learning/get-to-your-first-working-agent-in-minutes-announcing-new-features-in-amazon-bedrock-agentcore/

What stands out about this release is less the feature itself and more AWS's adoption of the term "agent harness." Since Martin Fowler wrote his harness engineering essay in February 2026, Anthropic and OpenAI have started using "harness" officially, and now a cloud vendor has applied the same word to its own service.

https://martinfowler.com/articles/harness-engineering.html

From the perspective of someone who has been assembling a harness by hand, the question becomes: what does managed harness take over, and what stays in my hands? This article sorts out that dividing line. Drawing on experience running business-automation agents with Claude Desktop, multiple MCP servers, and Markdown-based knowledge, I lay out the correspondence with AgentCore managed harness.

A few "tried it out" articles have already been published, so this article positions itself as the prequel: it offers material for deciding whether to adopt, not adopt, or how to phase in. Drawing on the official blog, documentation, and existing explanatory articles as sources, I sort out the correspondence and the judgment criteria that emerge from self-built operation.

AWS released "managed harness"

The official blog mentioned above lays out the structure: every agent has an orchestration layer, and running that layer requires compute, a sandbox to safely execute code, tool connections, persistent storage, and error recovery as the underlying infrastructure—bundled together, they form the agent harness. Managed harness is AWS providing this harness as a managed offering, where the user declares the model, system prompt, and tools as configuration, and a working agent is the result.

Let me first align on what the word "harness" refers to. The term gets used both for what the vendor builds in (internal) and for what the user assembles around the agent (external), and the meaning shifts with context. In addition to Fowler's framing, watany has organized the internal/external confusion in a Zenn article.

https://zenn.dev/watany/articles/d8b692bbca65a3

This article is written from the position of "someone who has been assembling the external environment by hand"—the user-side harness, in operation. AgentCore managed harness can be read as the vendor-side internal harness now offered as managed, but from the user's perspective, it can also be read as: part of what we used to build for ourselves can now be delegated. This duality is the starting point for thinking about where responsibilities split with self-built operation.

Self-built harness composition, the four blank layers

Let me map my self-built harness to AgentCore's components. The environment I've been operating consists, broadly, of three elements, and I'll lay out how each one corresponds to something on the AgentCore side.

Self-built harness	AgentCore side	Degree of correspondence
Markdown knowledge files (under `agents/`, `knowledge/`)	AgentCore Memory	Similar role; persistence and retrieval mechanisms differ
MCP servers (task management / calendar / chat / document management, etc.)	AgentCore Gateway	MCP is becoming the standard, so they're close
Claude Desktop	AgentCore Runtime	The execution base for the agent loop, at a different scale
(none)	AgentCore Identity	Not implemented in self-built
(none)	AgentCore Policy	Not implemented in self-built
(none)	AgentCore Observability	Not implemented in self-built
(none)	AgentCore Evaluations	Not implemented in self-built

The top three are the correspondence between "what I assembled by hand" and "what AgentCore provides as managed for the same role." The bottom four are blank layers in the self-built harness—components AgentCore offers that aren't covered by my operation.

The natural question here is whether these four blank layers are "things I didn't write because I didn't need them" or "things I wanted but had given up on." The two are different. For the former, introducing managed harness yields little value; for the latter, it brings value.

Let me go through the four layers in order.

Identity is for managing authentication and permissions when multiple users access the agent. Since my self-built harness runs on a personal device, authentication can rely on the device login, and per-agent authentication wasn't necessary. This layer is unnecessary "as long as it's just me." The moment you try to share an agent across an organization, controlling who can call which MCP server for what becomes a problem, and the gap turns into something you want but have given up on building.

Policy is the mechanism for declaratively defining boundaries when the agent calls tools. It's based on Cedar, AWS's open-source policy language, and you can generate policies from natural language. In my self-built harness, I draw loose boundaries through MCP server scopes and by documenting "what not to do" in the knowledge files—but this is discipline, not enforcement. I had wanted to write strong, enforceable boundaries, but didn't have the motivation to build a Cedar-equivalent system myself, so I had given up on this area.
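To show what "enforcement" means here as opposed to "discipline," this is a minimal deny-by-default check placed in front of every tool call. It is not Cedar and not the AgentCore Policy API; the allowlist, agent names, and tool names are hypothetical.

```python
class ToolDenied(Exception):
    """Raised when an agent tries to call a tool it is not allowed to use."""


# Hypothetical allowlist of (agent, tool) pairs. Anything not listed is denied.
ALLOWED = {
    ("scheduler", "calendar.read"),
    ("scheduler", "calendar.write"),
    ("reporter", "tasks.read"),
}


def call_tool(agent, tool, handler, *args):
    """Refuse the call before it reaches the tool unless the pair is allowlisted."""
    if (agent, tool) not in ALLOWED:
        raise ToolDenied(f"{agent} may not call {tool}")
    return handler(*args)


print(call_tool("reporter", "tasks.read", lambda: ["t1", "t2"]))  # ['t1', 't2']
```

A note in a knowledge file asks the model not to do something; a check like this makes the call fail regardless of what the model decides. The gap between those two is what I had given up on building myself.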

Observability is the mechanism for emitting agent execution logs, traces, and metrics to CloudWatch for visualization. In my self-built harness, I have the conversation history in Claude Desktop and individual logs from each MCP server, but no mechanism to track "which agent called what when, and how it failed" across the board. For solo use, looking at the chat screen suffices, but this becomes necessary in organizational deployment, so it falls into the "wanted but gave up on" category.
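The cross-cutting record I'm missing could start as small as one structured log line per tool call. A minimal sketch, assuming a hypothetical traced wrapper and an in-memory list as the log sink; this is unrelated to how AgentCore Observability or CloudWatch are actually wired up.

```python
import json
import time


def traced(agent, tool, handler, log):
    """Wrap a tool handler so every call appends one JSON line to `log`."""

    def wrapper(*args):
        start = time.time()
        entry = {"agent": agent, "tool": tool, "ts": start}
        try:
            result = handler(*args)
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = "error"
            entry["error"] = repr(exc)
            raise
        finally:
            # Runs on both success and failure, so no call goes unrecorded.
            entry["duration_ms"] = round((time.time() - start) * 1000, 1)
            log.append(json.dumps(entry))

    return wrapper


log = []
add_task = traced("reporter", "tasks.create", lambda title: {"title": title}, log)
add_task("write summary")
print(json.loads(log[0])["status"])  # ok
```

Even this much answers "which agent called what, when, and how it failed" for a single process. The hard part at organizational scale is collecting these lines across agents and servers, which is where a managed layer earns its keep.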

Evaluations is the mechanism for continuously evaluating the agent's response quality, with built-in evaluators for dimensions like helpfulness, tool-selection accuracy, and correctness. In my self-built harness, I check subjectively through knowledge-file improvement history and daily work logs, but I have no quantitative quality monitoring. For solo use, subjective is enough; but for organizational operation or paid services, this becomes essential.

Looking back at the four layers, only Identity falls into "unnecessary as long as it's just me," while the other three fall into "would have been nice, but had given up on as self-built." The fact that the meaning of "blank" differs by layer affects the judgment of whether to adopt managed harness.

Layers managed harness takes over, layers it leaves

When you use managed harness, what stops being something you write, and what continues to require writing? This can be derived as fact from the official blog and documentation, so let me sort it out first.

What managed harness takes over is the following range:

The agent loop: calling the model, selecting tools, returning results, managing context, and recovering from errors
A microVM, filesystem, and shell isolated per session
Tool-connection orchestration via AgentCore Gateway
The framework portion based on Strands Agents

Conversely, what users still need to write even when using managed harness is the following range:

Which model to use
What to write in the system prompt
Which tools to make callable
What goes into AgentCore Memory and what doesn't
What boundaries to declare in AgentCore Policy

Since declaration-based configuration suffices, the amount of code drops significantly. However, for the five items above, only the place where you write them changes; the judgments themselves don't go away. They just move into the harness.json configuration file. Reading preview validation articles by people who have actually tried managed harness, you'll see that harness.json lists the model and tool list as declarations, while a separate system-prompt.md file holds the system prompt.

https://dev.classmethod.jp/articles/bedrock-agentcore-managed-harness-preview/

https://github.com/aws-samples/sample-AgentCore-Managed-Harness-News

This looks like what was previously written as Markdown system-prompt files and MCP connection definitions in the self-built harness, repackaged into AWS's configuration file format.

In other words, what managed harness takes over is "the labor of writing orchestration code," not "the judgment of designing the agent." Design judgments still rest with the user. AWS expresses this as removing the infrastructure barrier, but the non-infrastructure part—"what is this agent for, and how far should it be allowed to go"—remains on the human side, whether it's managed or self-built.

This distinction is an important perspective when judging whether to adopt managed harness. The pitch "you don't have to write code" is accurate, but reading it as "you don't have to think" makes it inaccurate.

Where self-built operation can articulate "the place of design judgments"

When you operate a self-built harness, you accumulate judgments about "where it's okay to move things, and where you must not." These don't go away when you adopt managed harness. The place where they appear shifts to the contents of harness.json, but the judgments themselves continue to rest on the human side. Let me name a few representative ones.

Knowledge file granularity. Whether to split your Markdown knowledge "by role" or "by task" is a judgment that, once made, eases subsequent operation. Splitting by role lets agent dispatch fall naturally out of context. Splitting by task scatters cross-task knowledge. There's no simple winner; the optimum depends on the number of agents you operate and how tasks overlap. Even with managed harness, the same question—what to combine in Memory and what to separate—remains.

MCP server combination design. This is the line between "how far to wire up as tools via MCP" and "how far to handle through local file operations." For example, task management is better suited to MCP via API for automation, while sensitive tasks are safer kept as local file operations—judgments that emerge through use. Managed harness's Gateway has to answer the same question, just translated into declarations in a tool list.

Agent-to-agent responsibility split. This is the design choice between having a coordinator agent that judges context and dispatches to specialist agents, or calling specialist agents directly from the start. The coordinator style depends on context-judgment accuracy; the direct-call style puts the discrimination burden on the user. This too remains as a design judgment in managed harness, in the form of how to arrange and connect multiple harnesses.

These three are judgments that are hard to articulate without operating self-built first. If you start from managed harness, these judgments end up looking "as if they were optimally placed from the beginning." In reality, you've just fixed the premises, but inside fixed premises, the existence of design judgments themselves becomes harder to see.

Why not just use managed harness from the start?

Here's a counterargument I anticipate: "If we just use managed harness from the start, we won't need to build anything ourselves."

I partially agree with this counterargument. If you're building a new production agent for an organization from zero, going in through a managed harness is probably faster. However, the design of an agent that will run in production is rarely visible from the start. Only by actually using the agent do the granularity of knowledge, the over- and under-supply of tools, and the boundaries of responsibility come into view. Whether you run this discovery process on a managed harness with fixed boundaries or on a self-built harness with more freedom changes how much you learn.

Another perspective: judgments gained from self-built operation can be reused as a blueprint when you migrate to managed harness. If you go into managed harness without a blueprint, you can produce something that appears to work, but a system remains where it's hard to explain why it was structured that way. Whether "let's just put it on managed harness and improve it as we go" works depends on whether one person is improving or multiple people are improving. For one person, the iteration speed gap between self-built and managed may be small; but at the stage where multiple people improve, the declarative changes in harness.json and the deploy-unit iteration cycle start to take a toll as operational debt.

Order of adoption: where personal and organizational use diverge

Whether to adopt managed harness can naturally branch by operational scale. Let me go through three stages.

In the personal-use stage, where one person is using the agent, the self-built harness is often sufficient. The editing and use of knowledge files are tightly coupled, and the iteration of "rewrite Markdown the moment you notice something while using it" runs fast. Both Identity and Observability are hard to recognize as gaps as long as you're operating solo, and end up in the "would-be-nice-to-have, maybe" zone. In the experimental stage, this freedom directly translates into learning speed.

At the stage of expanding to organizational operation where multiple people use the agent, the four blank layers all surface as problems at once. You need audit logs of who used which agent how (Observability); you start running into situations where shared environments must not allow tools to be called freely, so boundaries become necessary (Policy); you need to manage credentials per member (Identity); you want to continuously measure agent response quality (Evaluations). At this stage, the value of managed harness comes to the fore. Comparing the labor of writing the four layers yourself versus putting them on AgentCore, the latter becomes practical.

In the transition phase, you can take a hybrid strategy. Continue personal exploration on a self-built harness, and put only the settled paths used in organizational operation onto the managed harness. Move agents whose design has settled to AgentCore one by one, and keep the agents you are still learning from close at hand, running on the self-built harness.

There's also a guideline for the order of adoption. The first things needed for organizational deployment are Identity and Observability, then Policy, and finally Evaluations. Without Identity, sharing itself can't be established. Without Observability, the organization can't make operational judgments. Policy added after an incident is often too late, so placing it early in organizational deployment is safer. Evaluations can wait: introducing quality measurement once operation gets going is fine.

The harness was originally a concept lying at the boundary between those who build agents and those who use them. With AWS releasing managed harness, part of what we used to assemble by hand has shifted into a mechanism that runs simply by declaring it as configuration. The fact that layers like Identity, Observability, and Policy—which I had given up on as self-built—have come within reach is no small thing.

Even so, design judgments such as "what is this agent for," "what to leave in the knowledge," and "how far to grant tools authority" haven't been put into a form you can declare as configuration. The basis of these judgments will continue to live in the commit history and work logs of one's own repository. The experience of having built a self-built harness leaves behind, in your hands, knowledge that doesn't lose its value when you migrate to managed. With the arrival of managed harness, the boundary between "the layers we build ourselves" and "the layers only human judgment can carry" has become more clearly visible than before, you might say.

Claude Managed Agents: The Layer That Disappears, The Layer That Stays — A View from Business Automation Agents

Kento IKEDA — Tue, 05 May 2026 07:29:36 +0000

On April 8, 2026, Anthropic released Claude Managed Agents. The official framing is "meta-harness," and the engineering blog reports infrastructure-level improvements: p50 TTFT down about 60%, p95 down more than 90%. TTFT is the time from request to first response. p50 is the median, and p95 is the 95th percentile, the latency that only the slowest 5% of requests exceed. So the median fell by about 60% and the slow tail by more than 90%. These aren't numbers you get from a minor optimization; they're the kind of numbers an architectural change produces. Early adopters include Notion, Rakuten, Asana, Sentry, and Vibecode.
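
To make the p50/p95 reading concrete, here is a minimal sketch that computes both from a list of latency samples using the nearest-rank method. The sample values are made up for illustration and are not Anthropic's measurements; the function name is likewise my own.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so there is no float rounding to worry about.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds (illustrative only).
ttft_ms = [120, 130, 135, 140, 150, 155, 160, 170, 180, 200,
           210, 220, 230, 250, 260, 280, 300, 350, 900, 2400]

p50 = percentile(ttft_ms, 50)  # the median: half of the requests are at or below this
p95 = percentile(ttft_ms, 95)  # only the slowest 5% of requests are above this
print(f"p50={p50} ms, p95={p95} ms")
```

A couple of slow outliers barely move p50 but dominate p95, which is why the two figures are reported separately.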

https://www.anthropic.com/engineering/managed-agents

There are already several Japanese articles covering this — terminology breakdowns by watany, builder/user harness classifications by Mr. Katayama (paiza), and trial reports by kumamo_tone and galirage, among others.

https://zenn.dev/watany/articles/d8b692bbca65a3

https://note.com/rk611/n/n8424c56f4fa5

https://zenn.dev/kumamo_tone/articles/365845d65e6cf4

https://zenn.dev/galirage/articles/claude-managed-agents-quickstart

But the existing discussion almost entirely assumes coding agents. Notion (coding, spreadsheets, slides), Sentry (bug → PR automation), Vibecode (code generation infrastructure) — they all line up as coding-style use cases. For someone building business automation agents (morning briefings, monthly accounting, QA operations, style audits) with Markdown and MCP, where does Managed Agents fit? That perspective hasn't really been laid out yet.

I run a personal repository where agents/ holds 15 role-specific agent instructions, knowledge/ holds 40+ knowledge files, and prompts/ holds task templates. I run my work through Claude Desktop and MCP. The "harness engineering with Markdown only" idea I wrote about in a previous article is exactly this kind of setup.

https://dev.to/aws-builders/harness-engineering-with-nothing-but-markdown-g6b

With Managed Agents arriving, what happens to this self-hosted harness? Does all of it get replaced? Just part? Or is this a different conversation entirely?

What this article covers is the boundary between the layer Managed Agents provides and the layer you keep yourself. It's not just about what's technically possible to put on Managed Agents — it's also about why, even when it's possible, keeping it yourself can still be the better call. I'll lay this out in five points. The latter half of the article uses my own agents as a worked example, classifying them as "fits / partial fit / doesn't fit."

Coding Agents and Business Automation Agents Make Different Demands on the Harness

Before evaluating Managed Agents from a business automation angle, it's worth checking the assumptions behind the existing case studies.

The early adopters Anthropic highlights are these:

Notion: parallel coding, spreadsheet, and slide tasks delegated within the Notion workspace
Rakuten: specialist agents per department, each shipped within a week
Asana: AI Teammates picking up tasks inside projects
Sentry: bugs detected and turned autonomously into pull requests
Vibecode: code generation infrastructure with 10x faster setup

The mix runs from coding-leaning examples (Notion, Sentry, Vibecode) to department-business types (Rakuten, Asana), but they all share one thing: long-running, autonomous tasks that involve continuous operations on files and resources. Managed Agents features like "$0.08/session-hour," "checkpointing for long-running tasks," and "sandboxed code execution" are tuned for this kind of workload.

On the other hand, here are the kinds of uses you might see from someone building business automation agents with Markdown and MCP. Drawing from my own active and planned setups:

Morning briefings (calendar, Slack mentions, Gmail, news summaries pulled together each morning)
Monthly accounting support (pulling transactions from freee, aggregating in Excel, sharing with stakeholders) — under construction
QA operations (reviewing MagicPod test runs, recording problematic test cases in Confluence, sharing in Slack)
Style audits (checking article drafts against writing-style-guide.md)
1on1 prep (consolidating past notes, organizing discussion points)

Lining the two up, the demands on the harness can be sorted into roughly four points:

Aspect	Coding agent	Business automation agent
Main targets of operation	File system + repositories	SaaS APIs (Slack, Calendar, Gmail, freee, etc.)
Execution time	Long-running tasks of minutes to hours	Repeated short tasks
Where state lives	File state inside the container persists	State lives on the SaaS side, local state is transient
Triggers	Initiated by humans (chat UI)	Mix of schedules, events, and human prompts

The way Managed Agents is designed fits use cases where the file system persists and tasks run for a long time. Sandboxes, checkpoints, and session-runtime billing all make sense in that context. For business automation use — many short tasks each calling SaaS APIs — most of these features won't get fully used.

This isn't to say "business automation doesn't fit Managed Agents." Different use cases mean different things you actually get out of Managed Agents. With that in mind, the next sections cover how to combine it with a self-hosted harness.

What the Managed Agents meta-harness Actually Provides

Reading the official engineering post "Scaling Managed Agents: Decoupling the brain from the hands" (April 8, 2026) makes the design philosophy of Managed Agents clear.

https://www.anthropic.com/engineering/managed-agents

The opening sentence captures the whole thing.

Harnesses encode assumptions that go stale as models improve.

As a concrete example, Sonnet 4.5 had a behavior where it would wrap up tasks early just before hitting the context limit ("context anxiety"), and harnesses implemented context resets to compensate. Run the same harness on Opus 4.5, and the behavior is gone — the resets become dead weight. Corrections you bake into the harness become unnecessary as the model gets smarter. That's the observation.

So Anthropic chose to abstract the harness itself. Just as an OS virtualizes hardware behind abstractions like process and file, Managed Agents separates an agent into three pieces:

Session: an append-only event log. The source of truth for everything that happened
Harness: a stateless loop that calls Claude and routes tool calls
Sandbox: the execution environment for code and file operations

The harness calls the sandbox through a simple execute(name, input) → string interface. Containers, smartphones, Pokémon emulators — anything fits behind the same abstraction, as the official post puts it.

Where this decoupling pays off is that containers go from "pet" to "cattle." A pet is a uniquely cared-for, named individual; cattle are interchangeable, managed by number — that's the infrastructure ops metaphor. If a container dies, the harness receives it as a tool call error and provisions a new container. If the harness itself dies, you can wake(sessionId), call getSession(id) to retrieve the event log, and resume from the last event. Only the session log is persisted. That's the design.
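
The separation can be sketched in a few lines of Python. This is a toy model under my own assumptions, not Anthropic's implementation: the names Session, Sandbox, run_tool, and make_sandbox are hypothetical, and only the execute(name, input) → string shape comes from the post. The automatic retry is also a simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Session:
    """Append-only event log: the only state that is persisted."""
    events: List[dict] = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.events.append(event)


class Sandbox:
    """Anything that can answer execute(name, input) -> str."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools
        self.alive = True

    def execute(self, name: str, input: str) -> str:
        if not self.alive:
            raise RuntimeError("container is gone")
        return self.tools[name](input)


def run_tool(session: Session, sandbox: Sandbox,
             make_sandbox: Callable[[], Sandbox], name: str, input: str) -> Sandbox:
    """One stateless harness step; returns the sandbox to use next."""
    try:
        output = sandbox.execute(name, input)
    except RuntimeError as err:
        # A dead container surfaces as a tool-call error; replace it (cattle, not pets).
        session.append({"type": "tool_error", "name": name, "error": str(err)})
        sandbox = make_sandbox()
        output = sandbox.execute(name, input)
    session.append({"type": "tool_result", "name": name, "output": output})
    return sandbox


def make_sandbox() -> Sandbox:
    return Sandbox({"upper": str.upper})


session = Session()
box = make_sandbox()
box.alive = False                      # simulate the container dying
box = run_tool(session, box, make_sandbox, "upper", "hi")
print([event["type"] for event in session.events])
```

Because run_tool keeps nothing between calls, a replacement harness only needs the session log to pick up where the previous one stopped.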

The TTFT improvements mentioned at the start (p50 down about 60%, p95 down more than 90%) come from this decoupling. Inference can begin without waiting for container provisioning.

Anthropic positions its own service as a "meta-harness." Quoting the conclusion of the article:

Managed Agents is a meta-harness in the same spirit, unopinionated about the specific harness that Claude will need in the future.

In other words, what Anthropic provides is "a stable interface that any harness can sit on top of" (the virtualization of session/harness/sandbox), not a prescription for "this is your harness." Claude Code, task-specific harnesses, custom harnesses — all of them are meant to run on top of it.

That's what Managed Agents actually is.

https://platform.claude.com/docs/en/managed-agents/overview

The next section moves on to what sits on top: the contents of the self-hosted harness.

A Self-Hosted Harness Is Two Layers: Managed Agents Replaces the Bottom Half, You Keep the Top

What the official meta-harness design provides is the three abstractions: session / harness / sandbox. In other words, the OS layer that makes an agent run — the substrate that hosts what's above, like processes, file systems, and memory.

So what does a self-hosted harness put on top of that? In my own setup, it looks like this.

ikenyal-ai-agents/
├── agents/                  # role-specific agent instructions
│   ├── executive-assistant.md
│   └── ...                  # other role definitions
├── knowledge/               # knowledge base
│   ├── writing-style-guide.md
│   ├── article-strategy.md
│   └── ...                  # various contexts
├── prompts/                 # task templates
│   ├── morning-briefing.md
│   └── 1on1-prep.md
├── tasks/                   # task definitions
├── scripts/                 # analysis and automation scripts
├── docs/                    # work logs and operational docs
└── README.md                # root instructions

These sit on a different layer than the OS layer Managed Agents provides. They express what the agent "knows," "how it should behave," and "what's off limits" — the territory you might call the knowledge layer.

Layer	Provider	Contents	Examples
OS layer	Managed Agents (meta-harness)	agent loop, tool execution, sandbox, session persistence	the three abstractions of `session / harness / sandbox`
Knowledge layer	self-hosted repository	agent behavior instructions, organizational context, domain knowledge, style conventions	`agents/`, `knowledge/`, `prompts/`, `CLAUDE.md`, `AGENTS.md`, `SKILL.md`

The existing Japanese articles (watany's terminology breakdown, Mr. Katayama's builder/user harness classification) discuss the harness as one big thing. To read accurately what Managed Agents actually changes, you need to see this two-layer structure.

What Managed Agents replaces is the OS layer only. The knowledge layer stays as is.

That was the technical-fact part. From here, this article gets into its main argument. Managed Agents lets you register Skills (SKILL.md) and agent definitions, so technically you can put parts of the knowledge layer on it too. Even so, why is keeping it self-hosted the better choice? The next section breaks it down across five points.

Why You Shouldn't Hand the Knowledge Layer to Anthropic — Five Points

Reading the official Managed Agents docs, you can register the mcp_servers definition, the choice of tools, the system prompt, and Skills (SKILL.md) all as part of the agent definition. Technically, the knowledge layer can ride on Managed Agents too.

The argument of this article is that even so, keeping it self-hosted is the better call. Five reasons.

Point 1: Where the data lives changes

Business automation agents often include organizational context. In my case, agents/ contains things like the structure of my organization, operational know-how, contact information, and various judgment criteria for business decisions. If you register all of this as part of Managed Agents, it becomes a resource on Anthropic's side. Until you explicitly delete it, it stays there.

What's worth being aware of is the data retention character of Managed Agents. The official "API and data retention" doc states clearly:

https://platform.claude.com/docs/en/build-with-claude/api-and-data-retention

Claude Managed Agents is a stateful resource. You can delete session transcripts, but there is no automatic deletion.

There's a similar note for Skills (SKILL.md):

https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview

Agent Skills is not covered by ZDR arrangements. Data is retained according to the feature's standard retention policy.

ZDR (Zero Data Retention) is a contractual option Anthropic offers to enterprise API customers who go through Anthropic's review and sign individually. It guarantees that data sent through the API isn't retained on Anthropic's side. It's often cited as a precondition for handling internal data with AI. Managed Agents and Agent Skills are out of scope even under that strictest contract — that's the current positioning.

Whether or not you have a ZDR arrangement, your agent definitions, sessions, and skills are all retained on Anthropic's side. Unless you explicitly delete them, they don't go away on their own.

This isn't a "this is absolutely a no-go" kind of statement — it's a question of where the data lives and how you can move it around. Git management also typically uses external services like GitHub, so the storage being external is the same. The difference is that with git, you can choose the storage (GitHub, GitLab, self-hosted, or local-only without a remote), and the content stays as Markdown that can move anywhere. Once you register something in Managed Agents, the location is Anthropic, and the format is Anthropic's proprietary JSON structure — that's the fixed shape. When designing an agent that handles internal data, this difference becomes a factor in the decision.

Point 2: Friction in the Edit-and-Run Workflow

With a self-hosted harness, the update cycle for the knowledge layer goes like this. Open the editor, edit agents/executive-assistant.md, save. Claude Desktop picks it up on the next session — instant reflection. The whole thing takes seconds.

With Managed Agents, you edit the file, then make an API call (create / update agent) and restart the session. It's not instant — the API call adds a step in the middle.

Where this cost actually shows up is when edits happen "the moment you notice something while using it." While running an agent, you realize "this instruction is too verbose" or "I want to add this here," switch to the editor, fix the relevant file, save, see it on the next message — that flow happens routinely.

The bigger difference isn't the time itself, but whether the flow of thought breaks. With a self-hosted harness, edit-and-reflect is part of the "using it" flow. With Managed Agents, the API-call step interrupts. There's an option to write a "Markdown → API sync script" yourself, but that script then becomes its own maintenance target.
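
For a sense of what such a sync script involves, here is a minimal sketch that detects changed Markdown files by content hash and hands them to a push callback. The function name, the manifest dictionary, and the push hook are all hypothetical. The actual agent-update call is deliberately left out, since its shape depends on the API you wrap.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def sync_changed(root: Path, manifest: Dict[str, str],
                 push: Callable[[str, str], None]) -> List[str]:
    """Push every Markdown file whose content hash differs from the manifest."""
    changed = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = path.relative_to(root).as_posix()
        if manifest.get(key) != digest:
            push(key, text)          # hypothetical hook: wrap your own agent-update call here
            manifest[key] = digest   # remember what was mirrored
            changed.append(key)
    return changed


# Demo against a throwaway directory; `pushed` stands in for the remote side.
pushed: Dict[str, str] = {}
manifest: Dict[str, str] = {}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "agents").mkdir()
    (root / "agents" / "executive-assistant.md").write_text("Be concise.\n", encoding="utf-8")
    first = sync_changed(root, manifest, pushed.__setitem__)
    second = sync_changed(root, manifest, pushed.__setitem__)
print(first, second)
```

Even this small version needs a persisted manifest, error handling for failed pushes, and a decision about deletions, which is the maintenance burden in question.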

Point 3: Losing the Benefits of Git Management

The knowledge layer is a continuous loop of trial and error. You rewrite an agent's instructions, see what happens, rewrite again. git diff shows you what changed, git log lets you trace history, and git blame points you to the commit whose message explains why a line was added. If you don't like where it's going, branch off and experiment.

None of this works through a Managed Agents agent-definition API. Anthropic's side likely has version control of some kind, but the wider git toolchain ecosystem (GitHub, PRs, CI, code review, cherry-pick, rebase) doesn't apply.

The evolution of the knowledge layer has value when you can look back at it through git history. Being able to trace "when and why was that one line added to executive-assistant.md" alongside the commit message — that's a small thing that quietly props up your operational confidence.

Point 4: Open-Standard Portability

This is the point I personally weight the most.

In my previous DESIGN.md article, I covered how AGENTS.md and SKILL.md are open standards.

https://dev.to/aws-builders/agentsmd-skillmd-designmd-how-ai-instructions-split-into-three-layers-d0g

AGENTS.md is jointly promoted by OpenAI, Google, Sourcegraph, Cursor, Factory and others, and was donated to the Linux Foundation in December 2025. SKILL.md is the core of the Agent Skills standardized by agentskills.io.

https://agents.md/

https://agentskills.io/

Multiple AI agents — Codex, Claude Code, Cursor, GitHub Copilot — read the same files.

Managed Agents agent definitions, on the other hand, are an Anthropic-proprietary JSON structure that bundles fields like name, model, system, tools, mcp_servers, skills, etc. Registering SKILL.md to Managed Agents makes it work, but it's a registration confined to Anthropic — Codex and Cursor can't see it.

That's less "vendor lock-in" and more a re-lock-in: something that was just standardized gets locked back into one specific implementation. With AGENTS.md / SKILL.md spreading as open standards, there's no compelling reason to actively confine your own knowledge to a vendor-specific format.

A repo that holds AGENTS.md / SKILL.md itself is a way to keep a "neutral location" — referenceable equally from Managed Agents, Codex, Cursor, and other AI agents.

Point 5: Speed of Testing and Iteration

In the "growing it" phase of an agent, cycle speed is what determines quality. Rewrite a line of instruction, try it, see what happens, rewrite again. The faster that loop, the higher the agent's accuracy ends up.

With a self-hosted harness (Claude Desktop + Markdown), you rewrite, save, see it on the next message — seconds.

With Managed Agents, you call the agent-update API, rebuild the environment, restart the session, then test. Every cycle has API-mediated steps in it. For a "growing it" phase, that tends to work against you.

For long-running production tasks (sessions of multiple hours and up), Managed Agents' stability and scalability really shine. But many business automation agents stay in the "fed and grown daily" phase for a long time. My executive-assistant.md has been getting some kind of weekly tweak for several months now.

What Comes into View When You Bundle These Five

"Can be put on" and "should be put on" are different problems. Just as the official Managed Agents design points toward a meta-harness, the knowledge layer that sits on top can also reasonably stay outside the meta-harness, judged across the points above.

Just as the official side chose three-way separation (session / harness / sandbox) and a design that doesn't dictate the shape of what goes on top, the user side can equally make the choice of "keep the knowledge layer self-hosted" without dictating its shape. That feels like a natural conclusion when you see things through the two-layer structure.

Classifying My Own Agents into "Fits / Partial / Doesn't Fit"

Mapping the discussion onto my own agents and classifying them as "fits / partial fit / doesn't fit" for Managed Agents, the patterns look roughly like this.

Type of use	Main tools	Verdict	Reason
Personal-assistant style (calendar, mail, chat, local files)	calendar, mail, chat, ticket management, web search, local file ops	Partial	Main MCPs are remote; local file-ops MCPs don't exist as remote, that part doesn't fit
Infrastructure-ops support	cloud APIs, chat, docs	Fits	All SaaS, long-running tasks are also a comfortable assumption
Project / org-ops support	chat, ticket management	Fits	All main tools complete via remote MCP, no local dependency
Test-quality ops	test automation tool MCPs, chat, docs	Doesn't fit	Main test automation tool MCPs are local stdio-only, so they can't be called from Managed Agents which assumes remote
Department functions (accounting, legal, etc.)	SaaS APIs, chat, local Excel or doc references	Partial	The SaaS API side is fine on remote MCPs, but local Excel and doc references don't ride along, and how internal data is handled also needs an organizational governance call
Conversational thinking-organization (morning briefings, 1on1 prep, etc.)	calendar, chat, web search	Doesn't fit	Designed in tandem with the Claude Desktop conversational experience (digging in by dialog) — autonomous Managed Agents isn't the right fit

What's notable is that the knowledge layer stays self-hosted across every pattern. Even the agents I judged as "fits" still keep their role definitions, organizational context, and style conventions in the self-hosted repo. What's placed on the Managed Agents side is the OS-layer functionality (sandbox, harness loop, session persistence, auth vault).

This is the concrete instance of the two-layer structure shown in earlier sections. The OS layer is something you can hand off to Anthropic; the knowledge layer stays in your hands. Per agent, you decide "OS layer handed off / OS layer kept" — that's the shape of the call.

There are roughly four axes for judging where your own agents fit:

Whether the main tools complete on remote MCPs, or depend on local tools
Whether the data being handled includes internal-organization data, or doesn't
Long-running tasks, or repeated short tasks
"Growing it" phase, or "operating it" phase

Looking through these, decide per agent.
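
One way to keep the four axes in front of you is to write them down as a small checklist function. The sketch below is my own encoding, not a rule from the Managed Agents docs: the first-pass verdict comes from the tool axis alone, and the other three axes only add notes. It does not capture cases like the conversational row in the table, where the verdict comes from how the agent is used rather than from its tools.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AgentProfile:
    name: str
    tools: str            # "remote", "mixed", or "local" (where the main tools run)
    internal_data: bool   # handles internal-organization data?
    long_running: bool    # long-running tasks rather than repeated short ones?
    phase: str            # "growing" or "operating"


def review(agent: AgentProfile) -> Tuple[str, List[str]]:
    """Return a first-pass verdict from the tool axis, plus notes from the other three."""
    verdict = {"remote": "Fits", "mixed": "Partial", "local": "Doesn't fit"}[agent.tools]
    notes = []
    if agent.internal_data:
        notes.append("internal data: needs a governance call on where it is retained")
    if not agent.long_running:
        notes.append("short tasks: sandbox and checkpoint features may go unused")
    if agent.phase == "growing":
        notes.append("still growing: API-mediated edits slow the iteration loop")
    return verdict, notes


# Hypothetical profile loosely modeled on the test-quality row of the table.
qa = AgentProfile("test-quality-ops", tools="local", internal_data=False,
                  long_running=False, phase="operating")
print(review(qa))
```

The notes matter as much as the verdict: an agent can pass the tool axis and still be better left on the self-hosted side for now.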

Anticipated Counterarguments and Responses

Three counterarguments, taken in turn.

Counter 1: If the main MCPs are remote-capable, why not put it all on?

Main MCPs (Slack, Atlassian, Calendar, Gmail, freee) are remote-capable, so why not put all the business automation on Managed Agents?

True — as of May 2026, many of the main MCPs you'd want for business automation are remote-capable: Atlassian Rovo, Slack, Google Calendar/Gmail, freee, and so on. The pool of "agents you could put on it" is growing.

https://platform.claude.com/docs/en/managed-agents/mcp-connector

https://platform.claude.com/docs/en/agents-and-tools/remote-mcp-servers

That said, this article doesn't argue "don't put it on because you can't." It argues "even when you can, that doesn't always mean you should." The OS layer is a candidate for putting on; whether to put the knowledge layer on it has to be judged individually across the five points (where data lives, edit workflow, git management, open standards, iteration speed).

Counter 2: $0.08/hour seems acceptable, doesn't it?

$0.08/hour seems acceptable, doesn't it?

For short tasks, no problem. If a morning briefing finishes in 10 minutes, then 20 working days × 10 minutes is about 3.3 hours a month, and 3.3 hours × $0.08 ≈ $0.27, plus token billing. At that scale, fine.

The question is whether you move the agents you use daily on Claude Desktop. An "open all day during work hours" usage pattern doesn't translate cleanly to Managed Agents, because session billing and token billing scale directly with running time. Same usage, same outputs, and the cost is likely to go up.
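
The two usage patterns can be put side by side with a few lines of arithmetic. This sketch covers session-runtime cost only, at the $0.08 per session-hour rate quoted above. The 8-hours-a-day figure for an always-open session is my assumption, and token billing is left out because it depends on the workload.

```python
SESSION_RATE_USD_PER_HOUR = 0.08  # the session-hour rate quoted in this article


def monthly_session_cost(minutes_per_day: float, working_days: int = 20,
                         rate: float = SESSION_RATE_USD_PER_HOUR) -> float:
    """Session-runtime cost only; token billing is charged separately."""
    hours = minutes_per_day * working_days / 60
    return hours * rate


briefing = monthly_session_cost(10)       # 10-minute morning briefing
all_day = monthly_session_cost(8 * 60)    # assumption: a session left open 8 hours a day
print(f"briefing: ${briefing:.2f}/month, all day: ${all_day:.2f}/month")
```

Since the cost is linear in running time, the all-day pattern comes out at 48 times the briefing (480 minutes against 10) before any tokens are counted.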

"Put it on Managed Agents / use it on Claude Desktop / use both" is something to decide per agent based on use case and cost structure.

Counter 3: With existing case studies, why not put business automation on it too?

With case studies like Notion, Sentry, Vibecode, why not put business automation on it too?

These case studies are all coding-style (code generation, bug fixing, spreadsheet operations). Business automation agents have a different shape (SaaS integration, monthly reports, QA ops), and the demands on the harness differ, as covered earlier.

And in fact, none of these case studies are entirely closed inside Managed Agents either. Notion runs in Notion, Sentry in Sentry's infrastructure, Vibecode on Vibecode's platform — each with their own knowledge and UX. Managed Agents functions as the OS layer underneath. That lines up with the two-layer structure this article argues for.

Where to Place the First Move

If you're going to actually start somewhere, this kind of flow makes sense.

Pick out the SaaS-integration-centric agents in your own collection: agents without local file ops or desktop integration are the candidates
Check whether the MCPs each agent uses are remote-capable: Slack, Atlassian, Calendar, Gmail, freee are already remote; others, check individually
Put just one agent on Managed Agents and run it: don't migrate everything at once — get the operational feel from one
Keep the knowledge layer in the self-hosted repository: register the agent definition through the API, but keep agents/, knowledge/, prompts/ in git. Treat the Markdown as canonical, the API registration as a mirror
Expand the put-on range gradually, or don't: after running one, if cost, speed, and edit workflow are fine, move to the next; if they aren't, keep it self-hosted

"All rewritten" and "all self-hosted" are both extremes. Per agent — that feels like the realistic landing.

Harnesses encode assumptions that go stale as models improve. Those are the official words. With Managed Agents arriving, self-hosted OS-layer implementations do go stale. There's no longer a need to write the sandbox, agent loop, and session persistence yourself.

But the knowledge layer above continues to be the place where organizational and personal context lives. AGENTS.md and SKILL.md are referenced as open standards by multiple AI agents. Managed by git, edited in the editor in seconds. Things you've grown like a writing-style-guide.md of your own keep evolving in your own repository, not as a stateful resource on Anthropic's side.

The next step in harness engineering starts from thinking in layers.

AGENTS.md, SKILL.md, DESIGN.md: How AI Instructions Split into Three Layers

Kento IKEDA — Sat, 02 May 2026 21:35:11 +0000

In April 2026, Google Labs released a spec called DESIGN.md. It's a design system specification readable by AI agents, packaged with a CLI validator: npx @google/design.md lint.

With DESIGN.md in the picture, we now have three different file types for instructing AI agents. AGENTS.md has been spreading as an industry standard since 2025 (jointly developed by OpenAI, Google, Sourcegraph, Cursor, and Factory; donated to the Linux Foundation in December 2025). SKILL.md sits at the core of Anthropic's Claude Skills. And now DESIGN.md. The three handle different concerns and don't overlap.

This article is for developers using coding agents like Claude Code, Cursor, or Codex in their work, and for tech leads operating natural-language instruction files like CLAUDE.md and style guides. If your team is doing Spec-Driven Development (SDD), this should also reach you.

What I want to lay out is two things: how AI instructions are starting to split across three layers — behavior, individual tasks, and visual appearance — and how that connects with SDD as a parallel movement.

The Old Pattern: Natural-Language Documents

A few years into the ChatGPT era, most engineers have written some form of "rules I want the AI to follow" in a Markdown file. CLAUDE.md, styleguide.md, CONTRIBUTING.md, internal coding conventions. The locations vary, but the format is roughly the same: unstructured natural language.

A writing-style-guide.md file I've been building over the past few months is a typical example. It's a style guide I use when writing technical articles with Claude — a list of patterns common in AI-generated text, written down as forbidden phrases. By making Claude Desktop read it every session, the tone of my output stays consistent. It's part of a personal repository (ikenyal-ai-agents) I use as the harness for my business automation agents — the one I covered in my previous post.

https://dev.to/aws-builders/harness-engineering-with-nothing-but-markdown-g6b

The file contains roughly 150 lines: rules like "don't use em dashes," "avoid invitations like 'let's try…!'," "drop AI-style preambles like 'what's interesting is…'." The same repository has 15 instruction files under agents/, organized by team and role: executive-assistant.md, sre-support.md, qa-support.md, accounting.md. Each describes "the assumptions to operate under as this role" in plain natural language.

This approach has clear benefits. You can articulate tone, stance, and implicit rules. New team members can read the files and pick up the expectations. With CLAUDE.md, Claude Code reads it every session, so persona-level instructions land consistently.

There are limits, too. First, validation falls on humans. Whether a rule was followed or not gets decided by a human reading the output. Second, individual judgment leaks in. "Write politely" means different things to different reviewers.

The third limit is the actual subject of this article. Rules that are formally verifiable (forbidden phrases, em-dash usage, specific pattern matches) and rules that require judgment (tone, structural choices, how to open with empathy) sit in the same file. So even the verifiable parts end up depending on human review. That's the problem the three new file types are addressing.
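To make "formally verifiable" concrete, here is a minimal sketch of what a machine check for the verifiable half of a style guide could look like. The phrases come from the style guide described above; the rule names, the `FORBIDDEN_PATTERNS` table, and the `find_violations` helper are hypothetical names of mine, not part of any existing tool.

```python
import re

# Hypothetical rule table. The phrases are examples from the style guide
# described in this article; the rule names are illustrative assumptions.
FORBIDDEN_PATTERNS = {
    "em-dash": re.compile("\u2014"),
    "ai-preamble": re.compile(r"what's interesting is", re.IGNORECASE),
    "invitation": re.compile(r"let's try", re.IGNORECASE),
}


def find_violations(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every forbidden pattern found."""
    violations = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in FORBIDDEN_PATTERNS.items():
            if pattern.search(line):
                violations.append((line_number, rule_name))
    return violations


draft = "What's interesting is the split.\nThe layers differ \u2014 by design."
print(find_violations(draft))  # [(1, 'ai-preamble'), (2, 'em-dash')]
```

Rules about tone or empathy have no equivalent of this function, which is exactly why they keep needing a human reviewer.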

New Type 1: How DESIGN.md (Google Labs) Specifies Visual Appearance

On April 10, 2026, Google Labs published the DESIGN.md specification at google-labs-code/design.md. As of early May, the repo has over 11,000 stars. It's the reference implementation for Google Stitch (stitch.withgoogle.com), an AI-driven UI generation product.

https://github.com/google-labs-code/design.md

The specification doc lives on the Stitch side.

https://stitch.withgoogle.com/docs/design-md/specification

DESIGN.md covers the design system specification. You write machine-readable design tokens (colors, typography, spacing, components) in YAML at the top of the file, and human-readable design intent in the Markdown body underneath. Both live in the same file.

---
name: Heritage
colors:
  primary: "#1A1C1E"
  tertiary: "#B8422E"
typography:
  h1:
    fontFamily: Public Sans
    fontSize: 3rem
---

## Overview

Architectural Minimalism meets Journalistic Gravitas.

## Colors

- Primary (#1A1C1E): Deep ink for headlines and core text.
- Tertiary (#B8422E): "Boston Clay", the sole driver for interaction.

The headline feature of this format is the CLI validator that ships with it.

npx @google/design.md lint DESIGN.md

This checks token reference integrity, WCAG contrast ratios, and structural rule compliance, returning the result as JSON. Wire it into CI and you can verify design system consistency on every pull request. There's also a diff command that compares two DESIGN.md files and returns token-level changes in a structured form. Design system version control — historically a manual process — gains a verifiable layer.
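To show what a contrast check actually computes, here is a sketch of the WCAG 2.x contrast-ratio formula applied to the primary color from the Heritage example. This is not the linter's implementation, which I haven't inspected; the white background and the function names are my assumptions.

```python
def _linear_channel(value: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = value / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color written as '#RRGGBB'."""
    digits = hex_color.lstrip("#")
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return (0.2126 * _linear_channel(r)
            + 0.7152 * _linear_channel(g)
            + 0.0722 * _linear_channel(b))


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical colors) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(foreground: str, background: str) -> bool:
    """WCAG AA asks for at least 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5


# Assumed pairing: the Heritage primary color on a white background.
print(round(contrast_ratio("#1A1C1E", "#FFFFFF"), 2))
print(passes_aa_normal_text("#1A1C1E", "#FFFFFF"))
```

The point is that the rule reduces to arithmetic on two token values, so a machine can decide it without a reviewer.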

For Japanese UIs, the Google Labs spec alone falls short. It doesn't define the typography requirements specific to Japanese (CJK font fallback chains, line height, letter-spacing, kinsoku shori, mixed typesetting). The gap is filled by kzhrknt/awesome-design-md-jp, which publishes Japan-localized DESIGN.md files for over 10 services including Apple Japan, SmartHR, freee, note, MUJI, Mercari, LINE, and Toyota. For Japanese products, using both the Google Labs spec and the Japan edition together is the practical approach.

https://github.com/kzhrknt/awesome-design-md-jp

What DESIGN.md carries is the design system that used to be scattered across Figma files and style guide PDFs, now consolidated into a single file with both machine-readable and human-readable parts. Think of it as the spec foundation that lets AI agents generate UIs with a consistent look every time.

New Type 2: How SKILL.md (Anthropic) and AGENTS.md Specify Behavior

While DESIGN.md covers "appearance," SKILL.md and AGENTS.md cover "behavior" — defining what the agent is trying to do, how it should proceed, and what it must not do.

SKILL.md is the file format standardized by agentskills.io as part of the Agent Skills open standard. Anthropic's Claude Skills is one implementation of this standard; the same SKILL.md works across Claude Code, Claude.ai, and the Agent SDK. Because it's standards-compliant, the same file is also readable by other agents like OpenClaw and Hermes. The structure: declare metadata (skill name, description, allowed tools) in the YAML at the top of the file, and write the task procedure or domain knowledge in the Markdown body below.

https://agentskills.io/home

A clear example of SKILL.md is conorbronsdon/avoid-ai-writing. It's an English-only skill that detects and rewrites AI patterns in English text — transition phrases like "Moreover," significance inflation like "watershed moment," and roundabout verb constructions like "serves as." It uses a 100+ word replacement table organized into 3 tiers (Tier 1 always replaces, Tier 2 flags when 2+ words appear in the same paragraph, Tier 3 flags only at high density), and audits 36 pattern categories. Two modes: detect and rewrite.

https://github.com/conorbronsdon/avoid-ai-writing

What sets it apart from a one-shot prompt is the structured audit it returns. In rewrite mode, you get four discrete sections: identified issues, the rewritten text, a summary of changes, and a second-pass audit. What changed and why becomes transparent.
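The three-tier rule is mechanical enough to sketch. The version below is my own reading of the description above, not the skill's actual code: the word lists are tiny hypothetical stand-ins for its 100+ entry table, "2+ words" is interpreted as two or more distinct Tier 2 words, and the density threshold is an assumed value.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the skill's real replacement table.
TIER1 = {"leverage": "use"}                           # always replaced
TIER2 = {"moreover", "furthermore", "additionally"}   # flagged at 2+ per paragraph
TIER3 = {"robust", "seamless", "crucial"}             # flagged only at high density
TIER3_DENSITY = 0.05  # assumed threshold: share of all words in the paragraph


def audit_paragraph(paragraph: str) -> dict:
    """Apply the three tiers to one paragraph and return a structured result."""
    words = re.findall(r"[a-z']+", paragraph.lower())
    counts = Counter(words)
    tier2_hits = sorted(word for word in TIER2 if counts[word])
    tier3_total = sum(counts[word] for word in TIER3)
    density = tier3_total / len(words) if words else 0.0
    return {
        "replace": {word: TIER1[word] for word in TIER1 if counts[word]},
        "tier2_flags": tier2_hits if len(tier2_hits) >= 2 else [],
        "tier3_flagged": density >= TIER3_DENSITY,
    }


print(audit_paragraph("Moreover, we leverage caching. Furthermore, it works."))
```

A single "Moreover" in a paragraph passes untouched under this reading; only the combination trips the flag.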

AGENTS.md covers the agent's overall behavior: project assumptions, roles, prohibitions, escalation rules. It started with the Amp team at Sourcegraph; today OpenAI, Google, Cursor, and Factory jointly drive it, and, as I mentioned at the top, it was donated to the Linux Foundation in December 2025. Think of CLAUDE.md as the Claude-specific version of AGENTS.md. By its spec, Claude Code reads CLAUDE.md rather than AGENTS.md, but the pattern recommended by agents.md is to make AGENTS.md the actual file and symlink CLAUDE.md to it. In the personal repository I introduced earlier, the files under agents/ belong to this layer.

SKILL.md and AGENTS.md cover different ranges. AGENTS.md handles "overall context and boundaries." SKILL.md handles "an executable unit for a specific task."

The avoid-ai-writing English style auditor I mentioned is a specific task, so it ships as SKILL.md. A file like agents/genda/qa-support.md, which describes the assumptions and engagement style of a QA role, defines the agent's boundary — that goes on the AGENTS.md side.

The shared concern of these formats is "behavior and procedure," not visual appearance: what the agent knows, what it's tasked with, what it must avoid. The movement is to pin these down in a verifiable form.

The Three-Layer Split

Lining up the three file types, the layers each one handles become clear.

| Layer | Format | What it carries | Examples |
| --- | --- | --- | --- |
| Behavior | `AGENTS.md` / `CLAUDE.md` (natural language + rules) | Overall context, roles, prohibitions | `CLAUDE.md`, role-specific files like `agents/genda/qa-support.md` |
| Individual task | `SKILL.md` (YAML at top + Markdown body) | Reusable tasks, procedures, domain knowledge | avoid-ai-writing, in-house procedure skills |
| Appearance | `DESIGN.md` (YAML at top + Markdown body) | Design system spec, verifiable visual rules | The Google Labs reference, individual service files in `kzhrknt/awesome-design-md-jp` |

The three are complementary, not competing. CLIs like bergside/typeui are emerging as tools that can generate or update either SKILL.md or DESIGN.md, depending on what you choose — a sign of tooling that assumes the division of labor.

https://github.com/bergside/typeui

What's actually different across the layers is "where to place the balance between machine-readable and human-readable." AGENTS.md skews almost entirely human-readable; over-structuring it would block the contextual judgment and nuance it needs to convey. SKILL.md is partially structured by the YAML at the top, but the body stays human-readable — task granularity has to be readable by humans before it can be instructed. DESIGN.md puts machine-readable design tokens in the top YAML and human-readable design intent in the body, with the two cleanly separated.

The center of gravity between "machine-readable" and "human-readable" sits in different places per layer. That's just the standard structuring principle — "manage things at different layers in different files" — applied to AI agents. The file names themselves spell out the division: AGENTS.md ("instructions to the agent"), SKILL.md ("a reusable skill"), DESIGN.md ("the design system"). The names match what each one carries.

Teams that have been packing all their "AI rules" into a single CLAUDE.md now face a decision about what to split out. Open your CLAUDE.md and run these questions against it, and the split points start to surface:

Is there a section that spells out design system rules? → If yes, that goes to DESIGN.md
Are specific task procedures in there (monthly aggregation, test review, contract review)? → If yes, those go to SKILL.md
Is what's left the overall agent context and boundaries (roles, prohibitions, escalation criteria)? → That's the AGENTS.md equivalent, and it stays

The three-layer split works as a framework for splitting your file.

Connecting with SDD

Stepping back to look at the bigger picture: how does the three-layer split relate to the broader movement of "specs for AI"?

SDD is a development style where you write the spec — requirements, design, tasks, implementation — before generating the code. The underlying idea: "specs aren't disposable scaffolding, they're executable artifacts that produce code." AWS's Kiro provides a workflow that generates requirements.md, design.md, and tasks.md in order under .kiro/specs/{feature}/. GitHub's Spec Kit (over 90,000 stars) supports the same flow with slash commands like /specify, /plan, /tasks, /implement. The EARS notation (Easy Approach to Requirements Syntax) used by Kiro reduces ambiguity by formatting requirements into 5 fixed templates. SDD has spread quickly between 2025 and 2026.

https://kiro.dev/

https://github.com/github/spec-kit
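To show why fixed templates reduce ambiguity, here is a sketch that classifies a requirement sentence by the five basic EARS patterns (ubiquitous, event-driven, state-driven, unwanted behavior, optional feature). The regular expressions are my simplification of the published templates, and `classify_requirement` is a hypothetical helper, not something Kiro exposes.

```python
import re
from typing import Optional

# The five basic EARS patterns, tried in order; "ubiquitous" comes last
# because it has no leading keyword.
EARS_PATTERNS = [
    ("event-driven", re.compile(r"^When .+, the .+ shall .+", re.IGNORECASE)),
    ("state-driven", re.compile(r"^While .+, the .+ shall .+", re.IGNORECASE)),
    ("unwanted-behavior", re.compile(r"^If .+, then the .+ shall .+", re.IGNORECASE)),
    ("optional-feature", re.compile(r"^Where .+, the .+ shall .+", re.IGNORECASE)),
    ("ubiquitous", re.compile(r"^The .+ shall .+", re.IGNORECASE)),
]


def classify_requirement(sentence: str) -> Optional[str]:
    """Return the name of the first matching EARS pattern, or None if none match."""
    stripped = sentence.strip()
    for name, pattern in EARS_PATTERNS:
        if pattern.match(stripped):
            return name
    return None


requirements = [
    "When the user submits the form, the system shall validate the email address.",
    "If the payment fails, then the checkout service shall show an error message.",
    "The system shall log every request.",
    "Make checkout fast.",
]
for requirement in requirements:
    print(classify_requirement(requirement), "<-", requirement)
```

A sentence like "Make checkout fast." fits no template, and that failure to match is the signal that the requirement is still ambiguous. Combined requirements (a "While" clause followed by a "When" clause) are reported under their first keyword in this sketch.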

The three-layer split (AGENTS.md / SKILL.md / DESIGN.md) and SDD look like separate movements on the surface. The SDD community concentrates on Kiro and spec-kit usage; the DESIGN.md side concentrates on formal specs and validation tooling. You don't see many articles bridging the two.

But put their philosophies side by side and the overlap is striking.

| # | Shared philosophy | SDD (Kiro etc.) | `DESIGN.md` / `SKILL.md` / `AGENTS.md` |
| --- | --- | --- | --- |
| 1 | Specify before implementing | requirements → design → tasks → implementation | behavior → implementation, appearance → implementation |
| 2 | Mix machine-readable + human-readable | `requirements.md` (EARS notation) + natural language | YAML at top + Markdown body |
| 3 | Persistent context for the AI | reference `.kiro/specs/{feature}/` every time | reference `DESIGN.md` / `AGENTS.md` every time |
| 4 | Reduce ambiguity through structured syntax | EARS notation structures requirements (5 templates) | `lint` validates WCAG contrast ratios and structural rules |
| 5 | Fix "decisions made" as a place | spec files are where decisions live | spec files are where decisions live |

Both sit inside the larger "specs for AI" movement and share the same underlying philosophy.

That said, they're not the same thing. The biggest difference, in one phrase: time horizon.

| # | Axis | SDD | `DESIGN.md` / `SKILL.md` / `AGENTS.md` |
| --- | --- | --- | --- |
| 1 | Time horizon | Describes "what to build next" | Describes "rules that already exist" |
| 2 | Scope | Single feature / project lifecycle | Persistent rules and styles |
| 3 | Update rhythm | New per feature → consume → archive | Long-term maintenance, gradual growth |
| 4 | Subject | Requirements, design, tasks (procedure for action) | Rules for behavior, individual tasks, appearance |

SDD specs describe "what we're going to build." requirements.md is "what this feature needs to satisfy"; design.md is "how to implement this feature"; tasks.md is "how to break the feature into work." Once the feature ships, they finish their job and get archived.

The three-layer specs describe "what should always hold." DESIGN.md provides the color and typography rules every time you generate a UI; AGENTS.md provides the agent's assumptions across every session. They get maintained long-term and grow incrementally.

This time-horizon difference is why the two don't compete. Transient specs and persistent specs coexist in the same project. They can also reference each other. Imagine writing "use {colors.tertiary} for the button" inside .kiro/specs/checkout-feature/design.md — that lets a transient feature spec reference a color token from a persistent DESIGN.md. The pattern isn't widely established yet, but the structure fits cleanly.
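Here is a minimal sketch of how such a cross-reference could be resolved. It assumes the `{colors.tertiary}` placeholder syntax from the sentence above and a token dictionary that has already been parsed out of the DESIGN.md YAML block (values copied from the Heritage example). The helper names are hypothetical, and neither Kiro nor the DESIGN.md tooling is known to ship this.

```python
import re

# Tokens as they might look after parsing the YAML block of DESIGN.md.
# Parsing itself is out of scope for this sketch.
DESIGN_TOKENS = {
    "colors": {"primary": "#1A1C1E", "tertiary": "#B8422E"},
    "typography": {"h1": {"fontFamily": "Public Sans", "fontSize": "3rem"}},
}

# Matches placeholders such as {colors.tertiary}.
TOKEN_REF = re.compile(r"\{([A-Za-z0-9_.]+)\}")


def lookup(tokens: dict, dotted_path: str) -> str:
    """Walk a dotted path through nested dicts; fail loudly on unknown tokens."""
    node = tokens
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"unknown design token: {dotted_path}")
        node = node[key]
    if isinstance(node, dict):
        raise KeyError(f"token path is not a leaf value: {dotted_path}")
    return str(node)


def resolve_references(text: str, tokens: dict) -> str:
    """Replace every {path.to.token} placeholder with its value."""
    return TOKEN_REF.sub(lambda match: lookup(tokens, match.group(1)), text)


spec_line = "Use {colors.tertiary} for the button."
print(resolve_references(spec_line, DESIGN_TOKENS))  # Use #B8422E for the button.
```

Raising on an unknown token is deliberate: a feature spec that points at a color the persistent spec no longer defines should fail a check rather than pass silently.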

One thing worth noting: as of May 2026, the active areas of SDD (the Kiro community and similar) and the active areas of DESIGN.md / SKILL.md / AGENTS.md haven't really crossed paths. The SDD side concentrates on "how to build a feature"; the three-layer side concentrates on "how to deliver the rules."

You don't have to be doing SDD to start with the three-layer split — the split alone gets you to the door of "specs for AI." If your team is already on SDD, start referencing DESIGN.md tokens from inside your feature specs and you avoid maintaining the same rules in two places. The two movements look set to converge in the next phase.

Not Everything Becomes a Spec

The discussion of the three-layer split tends to drift toward "shouldn't we just spec everything," but in practice, that doesn't happen.

Rules that can't be formally verified stay as natural-language documents: tone, structural choices, cultural nuance. These are judgment-based qualities, like "how to open an article with empathy" or "how to give an ending the right amount of resonance." The cost of speccing them isn't the issue; their essence is lost when you try.

The judgment is straightforward: "is this formally verifiable?"

Color contrast ratios (verifiable) → DESIGN.md
Word substitutions like "leverage → use" (verifiable) → SKILL.md
Tone (soft assertions, not textbook-sounding), overall stance (organizing rather than teaching), and similar judgment calls (not verifiable) → stay in AGENTS.md / CLAUDE.md

For small teams, "one natural-language file" is often enough. If CLAUDE.md alone is keeping things running, there's no need to force a split. The trade-off between the cost of speccing and the load of operating it depends on team size and how long the operation has to last.

The three-layer split is something you adopt incrementally, just like SDD — you don't need to spec everything at once. Start with the complex areas, the areas where verification helps most.

In other words, the three-layer split isn't a goal. It's an option you adopt when the situation calls for it.

Where to Start

A few options come into view from this overview.

A reasonable first move is to open your CLAUDE.md or style guide and sort it into "formally verifiable" and "judgment-based" sections. Verifiable candidates include color and typography rules, word substitution lists, and structural rules. If a useful amount of verifiable content sits there, pick one piece to break out into either DESIGN.md (appearance) or SKILL.md (task). Don't try to split everything at once; start with the most independent piece.

Pulling in external skills is another route. Drop a ready-made SKILL.md like avoid-ai-writing into ~/.claude/skills/ and your stance as a writer doesn't change — only the verification gets handed off to the machine.

Teams already running Kiro or spec-kit are probably at the stage where they could try referencing DESIGN.md tokens from inside .kiro/specs/{feature}/design.md. The cross-reference between feature specs and persistent specs is still a thin area in terms of public examples.

The shared stance: don't try to spec everything at once. A staged migration (document split → operational trial → speccing) is the realistic path. The three-layer split isn't a finished form; it's safer to read it as a movement still in progress.

AI rules started splitting from a single natural-language document into three spec formats. That's another side of the same movement as SDD.

Not everything becomes a spec, but managing different roles in different files — that ordinary structuring is starting to apply to AI agents, too.

Harness Engineering with Nothing but Markdown

Kento IKEDA — Sun, 26 Apr 2026 23:54:18 +0000

If coding agents aren't your primary battlefield, "harness engineering" probably feels like a distant concept. Scrolling through a timeline full of articles written for Claude Code and Codex users, you may have thought, "This isn't about me."

My own agent use wasn't centered on coding either, so none of the articles out there seemed to apply to my case. But looking back, I'd been doing the same thing — it just didn't have a name yet.

I've been running a business automation agent via Claude Desktop (through MCP servers) for several months now. It gathers information across multiple work tools like Slack, Confluence, and Google Calendar, switches judgment criteria based on context, and produces outputs accordingly. What the agent refers to goes beyond surface-level rules — accumulated knowledge such as understanding of organizational structure, past decision-making history, and writing style guidelines forms the foundation for its judgment.

I haven't written a single line of code. All I write is Markdown. And most of that Markdown is generated by the agent itself — I just approve or give revision instructions through chat. I almost never open the files directly to edit them.

This article isn't for people already practicing harness engineering. It's for those who've heard the term but thought, "That's a coding thing, right?" — I'm sharing the structure I've found. Each example includes a ready-to-use sample, so if you're running a business automation agent with MCP, you can try them as-is.

What Is Harness Engineering?

Let me set the foundation.

In a February 2026 blog post, Mitchell Hashimoto, co-founder of HashiCorp, gave the name "Engineer the Harness" to a practice he'd cultivated in his AI agent workflow. The approach: when an agent makes a mistake, instead of fixing the prompt, build an environment where the same mistake can't happen again.

https://mitchellh.com/writing/my-ai-adoption-journey

Days later, OpenAI published a practice report titled "Harness engineering." A small engineering team spent five months building a product using only Codex agents with zero hand-written code, and the repository reached roughly one million lines. The back-to-back publication of Hashimoto's blog and this report cemented "harness engineering" as a term.

https://openai.com/index/harness-engineering/

In the coding agent context, this translates to implementations like banning specific patterns with ESLint, defining commands in AGENTS.md, and running automated reviews via pre-commit hooks.

From "asking" (prompts) to "building" (environment). That's the core.

Up to this point, the story seems confined to the world of coding agents. But in 2025, MCP became widespread and rapidly expanded the practical scope of non-coding agents. Once agents gained direct access to business tools like Slack, Confluence, Google Calendar, and Jira, the risk of "agents making mistakes on their own" spilled beyond coding. Harnesses are no longer just for coding agents.

I Kept Rewriting Prompts

When you incorporate agents into business workflows, you run into experiences like these.

You write "don't make financial judgments" and it makes them anyway. You write "don't post directly to Slack — create a draft" and it tries to post. You write "commit and push at the end of the session" and it forgets.

Each time, I'd rewrite the prompt, under the assumption that "if I write it more clearly, it'll understand."

At some point, I realized the assumption itself was wrong. No matter how much you polish a prompt, the agent makes the same mistake in the next session. Instructions get buried in long contexts. When the session ends, memory disappears entirely. Requests are volatile.

Stop expecting the agent to remember. Change the environment instead. Looking back, this was the entry point to harness engineering.

Harnesses for Non-Coding Agents

When I lined up what I'd been doing in my repository, the same structure as coding agent harnesses emerged.

| Coding Agent Environment | Non-Coding Agent Environment |
| --- | --- |
| ESLint / TypeScript strict type enforcement | Prohibited actions section under `agents/` |
| `AGENTS.md` command definitions | Context routing rules in instruction files |
| Pre-commit hooks | Mandatory actions at session end |
| CI gates (can't merge unless tests pass) | Forced knowledge accumulation rules under `knowledge/` |

The materials on each side are completely different. One uses linters and hooks, the other uses Markdown files. But the design intent is the same: building an environment outside the agent where the agent can behave correctly.

One prerequisite to note: most AI chat tools have a designated place for instruction files that are loaded automatically at session start. In Claude Desktop it's Project Knowledge; in ChatGPT it's Custom Instructions. What I call "instruction files" in this article are Markdown files placed in this mechanism. Unlike instructions typed into the prompt each time, they are loaded automatically into a position that is hard to bury as the conversation grows longer.

Here are three concrete examples, each with a ready-to-use sample.

Structuring Prohibited Actions

Say you've delegated Slack posting to your agent. Even if you write "don't post directly — create a draft" in the prompt, it forgets across sessions.

The solution is to create a prohibited actions section in the instruction file and structure it so it's loaded every session. Move the instruction's location from prompt (volatile) to file (persistent).

## Prohibited Actions

Follow these without exception.

- Do not auto-post to company Slack (draft only; user handles posting)
- Do not make definitive financial judgments (always ask user for confirmation)
- Do not treat replies to clients as final versions (always get user approval)
- Do not make judgments about personnel evaluations or compensation
- When including confidential information (salaries, contract amounts, etc.) in summaries, explicitly note this

Instead of telling someone verbally each time, place rules in a fixed location and reference them every time. It's that simple, but it changes the lifespan of rules from per-session to permanent.

Forcing Actions at Session End

You want to leave a work log at the end of each session. Even if you write "create a work log and commit & push at the end" in the prompt, the agent just wraps up when the conversation gets lively.

The solution is to define trigger conditions and mandatory actions as a set in the instruction file.

## Mandatory Actions at Session End

When the user indicates work completion with phrases like "done," "thanks," or "commit,"
execute the following. Skipping is prohibited.

1. Create a work log at `docs/work-logs/YYYY-MM-DD-{topic}.md`
   - Include: background, options considered, key decisions, deliverables, next steps
2. Append a summary of changes to `CHANGELOG.md`
3. Execute git commit & push

The difference from the prohibited actions example is that trigger conditions for "when to fire" are also defined. By explicitly stating end signals like "done," "thanks," and "commit," the agent can more easily judge "this is the moment." It's not perfect, but the firing rate goes up significantly compared to writing "execute at the appropriate timing" with vague triggers.

The key is the single line: "Skipping is prohibited." If you leave room for the agent to judge, it will decide on its own that "it's probably fine to skip this time" when conversations get long. Removing discretion stabilizes behavior.

There's a secondary benefit too. When rules are defined in the instruction file, a simple "leave a log" or "commit" is enough for the agent to instantly understand "that action." No need to explain from scratch each time. The instruction file becomes shared vocabulary between human and agent.

Forced Knowledge Accumulation

The third is an example of a "can't proceed without passing the check" structure.

In conversations with agents, information worth accumulating comes up frequently — things decided in meetings, conclusions from tool selection, facts discovered during troubleshooting. Even if you write "save important information" in the prompt, it predictably forgets.

The solution is to embed a "knowledge check" protocol in the instruction file.

## Knowledge Accumulation (Mandatory Check)

Before each response, internally execute the following check. Skipping is prohibited.

Check: Does the user's immediately preceding statement, or your own response,
contain new information matching any of the following?

1. Factual information: team composition, tech stack, account info, environment configuration
2. Decisions: architecture selection, tool adoption, policy changes
3. Learnings: facts discovered during troubleshooting, gotchas, operational tips
4. Client-specific: contact names, contact info, project progress

→ If applicable: In addition to the normal response, append the following at the end.

💾 Knowledge capture proposal:
  File: knowledge/{project-name}/{filename}.md
  Content: (summary of content to add)
  Reason: (why this should be accumulated)

→ If not applicable: Append nothing.

The intended structure is "can't produce a response without passing the check." Of course, LLMs can skip instructions, so the enforcement isn't as strong as a mechanical gate. Still, by embedding the check into the system, the probability of capturing information rises significantly even when the human forgets to say "save that."

Since implementing this system, knowledge files have been steadily accumulating in the knowledge directory.

Acknowledging the Enforcement Gap

Let me address the strongest counterargument upfront. "Markdown prohibitions don't have the same enforcement power as a linter." That's correct.

Linters and type checkers mechanically detect rule violations. Depending on configuration, they can even block builds and merges entirely. Markdown prohibitions, on the other hand, carry the risk of the agent reading past them. If buried in a long instruction file, effectiveness drops.

However, the comparison here isn't against "mechanical enforcement" — it's against "writing it in the prompt each time." Why does writing in a file work better than a prompt? Two reasons.

First, the reference mechanism is different. As noted earlier, instructions placed in Project Knowledge or Custom Instructions are passed to the agent in a separate channel from regular messages. They sit in a position that's harder to bury as conversations grow longer, which structurally raises the probability that they get referenced.

Second, accumulation becomes irreversible. Instructions written in a prompt don't exist in the next session. Write them in a file, and they persist unless deleted. The cycle of "write a good instruction → forget → write again" becomes "write a good instruction → append to file → automatically referenced from then on."

Lining up enforcement strength from weakest to strongest:

"Write in the prompt each time" → "Place in a persistent file and reference every time" → "Mechanically block with linters and hooks"

Non-coding agents currently sit at the middle position. That is definitely stronger than the left end, though it doesn't reach the right end. Moving to the middle is still better for agent stability than staying at the left.

Repository Structure as a Design Decision

So far I've written about individual rules, but where to put the rules is itself a design decision.

The repository structure that solidified through operation looks like this:

ai-agents/
├── agents/                  # Role-specific instruction files
│   ├── assistant.md         # Main instructions (prohibitions, mandatory actions)
│   ├── project-a/
│   │   ├── sre-support.md   # SRE-specific instructions
│   │   ├── qa-support.md    # QA-specific instructions
│   │   └── ...
│   └── project-b/
│       ├── accounting.md    # Accounting-specific instructions
│       └── ...
├── knowledge/               # Accumulated knowledge
│   ├── project-a/
│   ├── project-b/
│   └── writing-style-guide.md
├── docs/work-logs/          # Per-session work logs
└── CHANGELOG.md

This structure shares two principles with coding agent harness design.

The first is "separation of concerns." OpenAI's report documents the experience of a monolithic AGENTS.md not working well. When everything in the context is "important," nothing is important. In my own repository too, I initially crammed everything into a single Markdown file. Separating files by role and having the agent reference only what's needed improved instruction effectiveness.

What enables this is context routing rules. Define routing in the main instruction file so the agent can reference the appropriate specialized instructions based on conversation content.

## Context Routing Rules
Judge which context the user's statement belongs to and reference the appropriate specialized instructions.

- Project A context signals: AWS, infrastructure, SRE, QA, team member names → Reference files under `agents/project-a/`
- Project B context signals: billing, contracts, accounting, legal → Reference files under `agents/project-b/`
- Ambiguous: Ask which project this is about

This is the same structure as the AGENTS.md "design as a pointer" principle. The main file handles routing only, delegating details to specialized files. OpenAI's report describes keeping AGENTS.md to roughly 100 lines, functioning as a map. For non-coding agents, I've observed the same tendency — the longer the instruction file, the more effectiveness drops.
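In my setup the agent makes this routing judgment itself from the Markdown rules; no code is involved. Purely to make the logic of the sample routing rules explicit, here is a deterministic sketch. The signal keywords are taken from the sample (team member names are left out), and the `ROUTES` table and `route` function are hypothetical.

```python
import re
from typing import Optional

# Signal keywords copied from the sample routing rules above.
ROUTES = {
    "agents/project-a/": {"aws", "infrastructure", "sre", "qa"},
    "agents/project-b/": {"billing", "contracts", "accounting", "legal"},
}


def route(statement: str) -> Optional[str]:
    """Return the instruction directory to load, or None when the agent should ask."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    matches = [path for path, signals in ROUTES.items() if words & signals]
    # Exactly one matching project is unambiguous; zero or several means "ask".
    return matches[0] if len(matches) == 1 else None


print(route("The SRE on-call found an AWS outage"))
print(route("Check the billing for the AWS account"))
```

The second call returns None because signals from both projects appear, which corresponds to the "Ambiguous: ask which project this is about" line in the rules.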

The second is "version control." By placing instruction files in a Git repository, change history is preserved. "When was this prohibition added?" "Which rule change made things stable?" — all traceable via diff. Slack messages and ad-hoc prompts don't preserve this history. Additionally, since it's a Git repository, you're not tied to a specific PC. Keep it on a remote, and you can launch the same harness from any device.

OpenAI's team makes the same point. Slack discussions, Google Docs content — if it's not in the repository, it's inaccessible to the agent and might as well not exist. This applies equally to non-coding agents.

Getting Started

You don't need to structure everything from the start when beginning harness engineering for non-coding agents.

In my case too, the early days were spent rewriting prompts. The order in which structure solidified was:

When the agent makes the same mistake twice, write it in a file instead of a prompt
When the file gets bloated, split by role
When information is lost between sessions, build an accumulation system

It's the same pattern Mitchell Hashimoto describes. "When the agent makes a mistake, build a system where that mistake can't happen again." For coding, you build it with linters and hooks. For non-coding, you build it with Markdown file structure. The material differs, but the thinking loop is the same.

Here's a minimal starter template. Place it in Claude Desktop's Project Knowledge or ChatGPT's Custom Instructions and it works as-is.

# Assistant Instructions

## Your Role

An AI assistant that supports user workflows.
Use MCP tools like Slack, Google Calendar, and Confluence for information gathering and organization.

## Prohibited Actions

- Do not auto-post to company Slack (draft only)
- Do not make definitive financial judgments (always ask user for confirmation)
- When including confidential information in summaries, explicitly note this

## Mandatory Actions at Session End

When the user indicates work completion, execute the following. Skipping is prohibited.

1. Create a work log at `docs/work-logs/YYYY-MM-DD-{topic}.md`
2. If there are changes, execute git commit & push

## Knowledge Accumulation (Mandatory Check)

Before each response, internally execute the following check. Skipping is prohibited.

Check: Does the immediately preceding conversation contain new information matching any of the following?
1. Factual information (team composition, tech stack, environment configuration)
2. Decisions (architecture selection, tool adoption, policy changes)
3. Learnings (facts discovered during troubleshooting, gotchas)

→ If applicable: Append a knowledge capture proposal at the end
→ If not applicable: Append nothing

This template is roughly 30 lines. Start here, and add one line to the prohibited actions every time the agent makes a mistake. In a few months, you'll have a harness built specifically for you.

The Question Harnesses Share

Harness engineering isn't a coding-specific technique. It's a design philosophy: giving agents a reliable execution environment.

Coding agents build that environment with types, linters, and hooks. Non-coding agents build it with structured Markdown and forced referencing. The materials differ, but the question is the same: "When this agent makes a mistake, where is the system that prevents it from happening a second time?"

Since shifting from "I just need to write better prompts" to "I need to build a structure where the same mistake can't happen," my agents have been running more stably.