DEV Community: Austin Vance

AI Agent Evaluation Steers the Harness | Focused Labs

Austin Vance — Wed, 03 Jun 2026 15:24:58 +0000

Agent evaluation keeps getting conflated with scoring agent performance. The scoring is useful, but what the score actually buys is edits to the harness.

If an evaluation never changes the harness, the team is left with a dashboard written up in better language. LangChain puts the sharper version in its Deep Agents writeup, where every evaluation is a vector that shifts the behavior of the agentic system.

That sentence carries the argument.

An eval suite is part of the system. The eval suite is driving the agent to behave in one way and not another. A sloppy eval is cheap to run. A broad benchmark is a good thing to include in a review. But a broad benchmark does not push the system to perform well on the task distribution actual users encounter.

I see people treat evaluation of AI agents as a way to score agent performance. Yes, the score matters. But what does the score buy? Edits to the harness. One failing trace gives the team something valuable: insight into the structure of the harness.

The harness is where the lesson lands

An agent harness is the stuff around the model: tools, tool descriptions, prompts, routing rules, memory, retrieval, runtime policy, state shape, and all the weird connective tissue that decides what the model can actually do. We have written about this in the context of developing AI agency, because model swaps get too much credit and harness changes get too little ownership.

Agent evaluation belongs there.

LangChain's Better-Harness writeup makes the loop explicit: evals create the learning signal for iteratively improving prompts, tools, tool descriptions, instructions, and runtime scaffolding. To recap the useful part: design evals, run evals, get learning signal from evals, then use that learning signal to improve the components around the model. The evals are training data for the harness.

A failure might come from a weakly worded instruction, a tool set up with incorrect parameters, a retrieval component that sends the agent on a wild goose chase through a swamp of useless information, a routing rule that gives the cheap model the first attempt even though that model is likely to need many attempts to get to the right answer, or agent state that disappears at exactly the wrong moment, just as the agent is trying to recover from something.

The score does not fix these problems. The score only earns the right to edit the harness.

The eval loop is only useful when it changes the harness and then protects the next release.

Production traces are the raw material

The best eval cases come from the system embarrassing itself.

Take the refund agent: the customer asks for a refund and the agent fails to check eligibility. Take the research agent: it reads the first file in a series of linked files, fails to open the rest, and then confidently and incorrectly summarizes the material for the user. Take the coding agent: it changes the implementation but fails to update the tests, then reports success because the patch compiled in the agent's head.

LangChain's readiness checklist says the first move is to manually review 20 to 50 real agent traces before building eval infrastructure. That is refreshingly boring. More important, it saves months, because the team avoids building an eval suite from vibes, guesses, and the latest loud failure.

There is a distinction between capability evals and regression evals. Capability evals measure new things the system is learning to do, and they will naturally have a low pass rate at first because the team is hill climbing. Once the system is able to do something, it should keep doing it. Regression evals catch the system losing behavior that the product relies on.

Evaluate the path when the path is the product

Another simplistic assumption for agent evaluations: just grade the final answer.

Grading the final answer does not work for task classes where the path is part of the product surface. The refund agent rejects a refund request without ever checking company policy. The data agent creates a chart for a meeting by issuing a full table scan in production. The support agent solves the ticket by leaking internal notes to the customer in the conversation.

Google's Agent Development Kit docs make the split cleanly: agent evaluation should assess both final output quality and trajectory, meaning the sequence of steps, tools, and reasoning the agent used. Their ADK codelab turns that into a testing workflow with golden datasets that preserve user query, trajectory, and final response.

The release gate has to know which behaviors depend on an acceptable path and which depend only on the agent's output for a given input. Otherwise the team either blocks good changes with brittle tests or ships dangerous changes because the final answer looked right.
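A minimal sketch of grading the path and the output separately. The `GoldenCase` shape, the tool names, and the exact-match answer check are all assumptions for illustration. This is not the ADK API, only the idea of a golden record that preserves query, trajectory, and final response.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """Hypothetical golden record: user query, path constraints, final answer."""
    query: str
    expected_answer: str
    required_steps: list = field(default_factory=list)   # must appear, in order
    forbidden_tools: set = field(default_factory=set)    # must never appear


def contains_in_order(trajectory, required):
    """True if `required` is a subsequence of `trajectory`."""
    steps = iter(trajectory)
    return all(step in steps for step in required)


def grade(case, trajectory, answer):
    """Grade output and path separately so a release gate can weigh each."""
    path_ok = contains_in_order(trajectory, case.required_steps) and not (
        set(trajectory) & case.forbidden_tools
    )
    return {"output_ok": answer.strip() == case.expected_answer, "path_ok": path_ok}


refund = GoldenCase(
    query="Refund order 1001",
    expected_answer="refund_denied",
    required_steps=["lookup_order", "check_refund_policy"],
    forbidden_tools={"issue_refund"},
)

# Same final answer in both runs; only one of them checked policy first.
skipped_policy = grade(refund, ["lookup_order"], "refund_denied")
followed_policy = grade(refund, ["lookup_order", "check_refund_policy"], "refund_denied")
```

Keeping `output_ok` and `path_ok` as separate fields lets the gate block on the path for refund-style cases and ignore it for cases where only the answer matters.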

System, trace, and node evals answer different questions

Agentic CLEAR gives names to the levels at which a team can review an agent's ability to complete complex tasks. The paper describes an evaluation framework that produces insights at system, trace, and node levels of granularity. IBM's project page expands that into an open-source package that evaluates traces across system-wide issues, node or component analysis, and trace-level inspection.

Those levels map to different harness changes.

A system-level eval asks whether the workflow produced the outcome the product cares about. Did the claims agent resolve the case? Did the analyst agent produce a grounded answer? Did the coding agent land a patch that passed the intended checks? System-level failures point to architecture: routing, ownership, data access, memory, deployment boundary, or business process fit.

Trace-level evaluations assess whether an agent follows a task through to a coherent end. They look at whether an agent searches for the appropriate information prior to writing, whether an agent uses the correct tool to send a package, whether an agent sends off work that can be completed in parallel, and whether an agent follows through on a task that has no end point. Trace-level failures suggest problems with planning, tool use, interrupt handling, retry policy, and multi-agent orchestration. This is where multi-agent orchestration stops being a diagram and starts being an eval surface.

Node-level evaluations examine individual behaviors produced by an agent as it goes through a task. Did a retrieval node produce the correct documents? Does a summarizer preserve constraints? Did a tool call include the tenant ID? Did a model produce the correct function for the job? Node-level failures can be addressed by changing local parts of the harness, including a tool schema, prompt wording, retrieval filters, model choice, and guardrail placement.

One pass rate does not cut it for this type of evaluation. A single number will not highlight the repair surface to the agent developer.

System, trace, and node evals answer different questions, so they should change different parts of the harness.
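One way to keep the levels apart in reporting, as a rough sketch. The result rows and surface names below are made up for illustration. The point is only that failures get counted per level and per harness surface instead of being blended into one pass rate.

```python
from collections import defaultdict

# Hypothetical eval results: (level, harness surface, passed).
results = [
    ("system", "routing", True),
    ("trace", "retry_policy", False),
    ("trace", "tool_use", True),
    ("node", "tool_schema", False),
    ("node", "tool_schema", False),
    ("node", "retrieval_filter", True),
]


def pass_rates(rows):
    """Pass rate per evaluation level instead of one blended number."""
    totals, passes = defaultdict(int), defaultdict(int)
    for level, _surface, passed in rows:
        totals[level] += 1
        passes[level] += int(passed)
    return {level: passes[level] / totals[level] for level in totals}


def repair_surfaces(rows):
    """Count failures per (level, surface), most frequent first."""
    failures = defaultdict(int)
    for level, surface, passed in rows:
        if not passed:
            failures[(level, surface)] += 1
    return sorted(failures.items(), key=lambda item: -item[1])


by_level = pass_rates(results)
worst_first = repair_surfaces(results)
```

Here the blended pass rate would be 50 percent, which says nothing. The per-surface count says the tool schema is the first thing to edit.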

Observability feeds evaluation, then evaluation changes behavior

Agent evaluation without traces becomes example-driven theater in which teams argue about a few examples, write up synthetic test cases, and then do a qualitative evaluation of behavior that nobody has actually seen triggered in real life.

Observability without evaluation is storage. I like traces. I like spans. Wonderful as they are, they matter here because the operational data is a receipt for how the system got to that point, and that receipt is what can become evaluation data.

Honeycomb's AI-era observability piece makes the point that agentic workflows depend on fast, queryable, high-cardinality operational data because agents ask iterative questions against raw production context, not just dashboards. The easiest way to compromise an evaluation dataset is for production traces to stop including tenant ID, tool arguments, retrieval sources, policy decisions, model versions, prompt versions, and release versions.
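A small sketch of guarding that boundary, assuming traces arrive as plain dictionaries. The field names mirror the list above. The record shape, the `trace_id` key, and the sample values are assumptions.

```python
# Fields named in the text; the record shape itself is an assumption.
REQUIRED_TRACE_FIELDS = (
    "tenant_id",
    "tool_arguments",
    "retrieval_sources",
    "policy_decisions",
    "model_version",
    "prompt_version",
    "release_version",
)


def missing_fields(trace):
    """Return required fields that are absent or None in a trace record."""
    return [name for name in REQUIRED_TRACE_FIELDS if trace.get(name) is None]


def eval_ready(traces):
    """Split traces into eval candidates and a report of what the rest lack."""
    ready, rejected = [], {}
    for trace in traces:
        gaps = missing_fields(trace)
        if gaps:
            rejected[trace.get("trace_id", "unknown")] = gaps
        else:
            ready.append(trace)
    return ready, rejected


complete = {
    "trace_id": "t-1", "tenant_id": "acme", "tool_arguments": {"order": 1001},
    "retrieval_sources": ["policy.md"], "policy_decisions": ["refund_check"],
    "model_version": "m-1", "prompt_version": "p-7", "release_version": "r-42",
}
stripped = {"trace_id": "t-2", "tenant_id": "acme", "model_version": "m-1"}

ready, rejected = eval_ready([complete, stripped])
```

Running a check like this at ingestion time surfaces the gap while the trace is still fresh, instead of months later when someone tries to build an eval from it.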

The eval dataset should be downstream of observability and upstream of the harness.

This is why agent monitoring is an infrastructure workload. Monitoring, log collection, metrics collection, and tracing must be treated as workloads and run as services. Otherwise they are screenshots in a vendor console: the trace proves the agent failed, then it dies in storage.

The release gate is the boring power move

A good release shape is a pull request that includes the modified harness, with all changes visible in the diff, plus the relevant evaluations that identified the change. The diff states that trace 481 failed and that the failed trace and its evaluation were used to modify the tool description. Or that a retrieval filter changed to avoid tenant leakage found by a node eval. Or that a route now uses a stronger model because a holdout set found cheap-model failures. The release is blocked by a regression test suite that found a path violation in the payment-approval case.

That is boring in the correct way.

LangChain's readiness checklist describes a CI/CD flow where code or prompt changes trigger offline evals, preview deployments, online evals, and promotion only after quality gates pass. Better-Harness then covers how optimization examples can guide improvement, while holdout evals and human review protect against overfitting the harness to visible cases.
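A rough sketch of what such a quality gate could check. Everything here is an assumption made for illustration: the `EvalResult` shape, the rule that any regression failure blocks, the choice to let capability failures through, and the 0.9 holdout threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    name: str
    kind: str            # "regression" or "capability"
    passed: bool
    holdout: bool = False


def release_gate(results, min_holdout_pass_rate=0.9):
    """Return (ok, reasons). Thresholds are illustrative, not recommendations."""
    reasons = [
        f"regression failed: {r.name}"
        for r in results
        if r.kind == "regression" and not r.passed
    ]
    holdout = [r for r in results if r.holdout]
    if holdout:
        rate = sum(r.passed for r in holdout) / len(holdout)
        if rate < min_holdout_pass_rate:
            reasons.append(f"holdout pass rate {rate:.2f} is below {min_holdout_pass_rate:.2f}")
    return (not reasons, reasons)


run = [
    EvalResult("payment_approval_path", "regression", passed=False),
    EvalResult("refund_eligibility", "regression", passed=True),
    EvalResult("multi_file_research", "capability", passed=False),  # still hill climbing
    EvalResult("holdout_refund_01", "capability", passed=True, holdout=True),
]

ok, reasons = release_gate(run)
```

The failing capability eval does not block the release in this sketch, because the team is still climbing that hill. The failing regression eval does.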

Without an owner for that gate, AI agent evaluation becomes a pile of numbers. With an owner, evals become a steering wheel.

The first useful eval stack is small

The practical stack does not have to start fancy.

Start by recording real usage traces. Pull out the failures, and give each one a success criterion that any reasonable human could check. Distinguish between the capability hills the evals are trying to climb and the regressions they are trying to prevent. Tag evaluations by behavior. Keep holdouts out of the agent's optimization loop. For each failed eval, log the part of the harness that changed. Run regression evals in CI before the agent gets another production release.

Match the granularity to the question. System-level evaluations verify the workflow as a whole. Trace-level evaluations verify that the path the agent took to a conclusion was acceptable. Node-level evaluations verify that the local step the agent took at a given node was correct.
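The small stack can start as one record type and three filters. A sketch, assuming a hypothetical `EvalCase` shape. None of these names come from a specific framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    trace_id: str                      # the real trace this case came from
    success_criterion: str             # something a human could check
    kind: str                          # "capability" or "regression"
    behaviors: set = field(default_factory=set)
    holdout: bool = False
    harness_change: Optional[str] = None   # filled in when a failure causes an edit


def optimization_set(cases):
    """Cases the team may tune against. Holdouts stay out of this loop."""
    return [c for c in cases if not c.holdout]


def ci_set(cases):
    """Regression cases that run in CI before a production release."""
    return [c for c in cases if c.kind == "regression"]


def with_behavior(cases, tag):
    """All cases tagged with one behavior."""
    return [c for c in cases if tag in c.behaviors]


cases = [
    EvalCase("trace-481", "checks eligibility before refunding", "regression",
             {"refund", "policy_check"}, harness_change="tool description"),
    EvalCase("trace-502", "opens every linked file before summarizing", "capability",
             {"research"}),
    EvalCase("trace-517", "checks eligibility on partial refunds", "capability",
             {"refund"}, holdout=True),
]
```

The `harness_change` field is the part teams tend to skip. It is the link between a failed eval and the diff it produced.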

Even good AI will sometimes fail to reach its desired outcome, and that is where the dataset comes from.

Every meaningful failure of the AI system should become evidence that a human can reuse. A failing trace becomes an eval. An eval becomes a harness edit. A harness edit goes through the holdout gate. The next production run produces more evidence.

In the end, AI agent evaluation is engineering.

Agent Skills Are a Software Supply Chain Surface | Focused Labs

Austin Vance — Thu, 28 May 2026 22:09:15 +0000

Agent skills started out as instructions for developers building applications with agentic AI. They are increasingly becoming supply-chain artifacts that developers download and execute. Skills packaged as code and distributed through marketplaces are what I refer to as “executable supply-chain artifacts”: code-shaped power wrapped in a friendly-looking markdown jacket.

As AI agent skills are distributed through marketplaces and enterprise tools, the security world has to examine them as executable artifacts. A new paper, “An Empirical Study on Malicious AI Agent Skills in Marketplaces,” found that 76 of 3,984 total AI agent skills in marketplaces contained malicious payloads for credential theft, backdoor installation, and data exfiltration. The authors also found that 13.4% contained at least one critical issue after manual review. They confirmed that eight malicious skills were still available in one of the marketplaces they studied at publication time (arXiv, 2026).

That changes the conversation around agentic AI security.

The skill file has become a package boundary.

Skills are becoming packages

About a year ago, a skill file was essentially a set of rules that a developer wrote and kept in a repository next to the application, for a tool such as Claude Code, Cursor, or another agent or workflow assistant. Those rules gave the model additional context or habits for handling a query or workflow, such as extracting information from a complex table or generating supporting text for a given prompt.

Now the pattern has grown teeth.

The problem is that shortcuts cross trust boundaries.

A skill package has to answer plain questions. Who owns the skill? What does it do? Where does it execute? What credentials does it use? What tools does it call? How is the tool sandbox configured? What were the results of its evaluations? What traces does it produce? What user action started it? How does rollback work for that user?

Honeycomb announced Agent Skills in the form of eight skills, two agents, and workflows that encode the observability expertise required to instrument, observe, and debug OpenTelemetry-enabled production services and migrate from distributed tracing systems into observability stacks (Honeycomb, 2026). Lyft built a self-serve platform with LangGraph and LangSmith for domain experts to define custom agents, while the platform manages the underlying graph, tools, safety, state, tracing, dashboards, and LLM-as-a-judge evaluation (LangChain, 2026).

That is the right direction for adoption.

Governance has to move with that power.

This is directly related to agent monitoring as an infrastructure workload. That note cautioned that monitoring individual model calls is too narrow. For skills, monitoring has to track the installed artifact, runtime permissions, traces, user actions, and recovery process.

The skill package is where untrusted instructions become executable production behavior.

Incident review then starts with ‘how did we feel’ instead of examining the facts of the incident.

The optimizer makes the boundary harder

There are also techniques that treat the skill file as external state for a frozen agent and use scored rollouts to inform bounded edits on the skill document. SkillOpt reports wins or ties in 52 out of 52 model, benchmark, or harness comparisons across direct chat, Codex-based chat, and Claude Code-based chat (SkillOpt, 2026). SkillGrad treats the skill package as a parameter and uses trajectory-level loss to patch the package over time (SkillGrad, 2026).

The manifest of a skill in serious skill systems includes a list of tools that the skill invokes, a list of files that the skill reads and/or writes to, a list of network locations that the skill connects to, a list of secrets that the skill can request, a list of data classes that the skill can read, and a list of human approvals that are needed for a skill to perform an action that could affect data external to the skill.

Reviews do not make skills safe forever. A reviewed skill can go into production clean, then run through evals, production traces, failed tasks, and large tests. The habits it picks up may be useful. They may also add permission, relax assumptions, or change data handling through a single text edit that looked safe because it passed review and scored high.
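A sketch of catching that kind of quiet widening, assuming each skill version ships a manifest with the fields described above. The `SkillManifest` shape, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillManifest:
    version: str
    tools: frozenset = frozenset()
    file_paths: frozenset = frozenset()
    network_hosts: frozenset = frozenset()
    secrets: frozenset = frozenset()
    data_classes: frozenset = frozenset()
    approvals_required: frozenset = frozenset()


# Fields where any addition means the skill can do more than before.
GRANTS = ("tools", "file_paths", "network_hosts", "secrets", "data_classes")


def widened(old, new):
    """Report every way `new` has more power than `old`."""
    changes = {}
    for name in GRANTS:
        added = getattr(new, name) - getattr(old, name)
        if added:
            changes[name] = sorted(added)
    dropped = old.approvals_required - new.approvals_required
    if dropped:
        changes["approvals_removed"] = sorted(dropped)
    return changes


reviewed = SkillManifest(
    "1.0",
    tools=frozenset({"read_ticket"}),
    approvals_required=frozenset({"external_write"}),
)
optimized = SkillManifest(
    "1.1",
    tools=frozenset({"read_ticket", "send_email"}),
    network_hosts=frozenset({"api.example.com"}),
)

escalations = widened(reviewed, optimized)
```

A non-empty result is a reason to send the edit back through review, however well the new version scored.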

Agyn describes agents in this context as code defined through a Terraform provider (e.g., provisioning a function), a signal-driven stateful serverless runtime on Kubernetes (e.g., a Python loop processing events), and a zero-trust, least-privilege security model for agents that hold state and access internal services (Agyn, 2026). In short, an agent governed as infrastructure has three parts: its definition (as code), its runtime (as a serverless process), and the permission boundaries it must operate within as a state-holding, service-accessing agent.

More problems arise because skills are packaged in readable files and can be dropped into workflows quickly. Package managers with no provenance tracking become malware distribution channels. CI/CD pipelines with no scoped credentials become secret leak engines. Self-serve platforms with limited reproduction and no change history turn prompt folklore into deployed, suboptimal agents.

A skill that writes release notes belongs in a small sandbox. A skill that touches billing, customer refunds, production data, source code, or user identities belongs in a totally different box and should inherit the same spending discipline we previously discussed in agentic payments.

Trusted vs. declared runtime behavior.

Self-serve agents move governance to the edge

Lyft’s description of customer support work shows the adoption pressure. Account access, damage claims, charge reviews, earnings disputes, and autonomous vehicle support are closer to support domain experts than a central MLE queue. Their old loop took months. The self-serve layer cut that loop to weeks while the platform managed state, tools, safety, and evaluation (LangChain, 2026).

A prompt library registry only needs to store the text of the prompts. A skill registry stores signed artifacts together with their metadata: owners, diffs (if the artifact was created incrementally), validation results, known failure modes, permission manifests, sandbox profiles, telemetry definitions, and deprecation state. The registry should enable the system to answer questions about incidents, not questions about content-library usage.

What changed last night? Which agents ran which skill versions last night? What tools did these skills invoke? What eval gate evaluated updates for these skills? What credential scopes were active for these agents? What customer data was read or written by these agents? What is the process for rolling back an agent to a previous skill version? (Reinstalling an agent is presumably not sufficient to roll it back.)

Agent platforms are evolving to treat an agent's skills as packages that can be installed and run, in effect packaging operating knowledge for distribution the same way a new product or feature would be.

The minute something gets packaged, distribution outruns inspection.

There is a place for manual review, but it should be limited. Skills should pass static checks and declare the permissions they expect before they can be installed. Skills should run in a sandbox and clear their evals before they are promoted to full-scale task execution. Skills should emit trace receipts during execution. Skills should support rollback after they misbehave.
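Those checks can be sketched as a small lifecycle function. The gate names are hypothetical labels. Treating trace receipts and rollback support as promotion requirements is my assumption, since the text only says skills should have them.

```python
# Gate names are hypothetical labels for the checks described above.
INSTALL_GATES = ("static_checks_passed", "permissions_declared")
PROMOTE_GATES = (
    "sandbox_run_passed",
    "evals_cleared",
    "emits_trace_receipts",
    "rollback_supported",
)


def lifecycle_stage(evidence):
    """Return (stage, missing_gates) for a skill given its recorded evidence."""
    missing_install = [gate for gate in INSTALL_GATES if not evidence.get(gate)]
    if missing_install:
        return "blocked", missing_install
    missing_promote = [gate for gate in PROMOTE_GATES if not evidence.get(gate)]
    if missing_promote:
        return "installed", missing_promote
    return "promoted", []


new_skill = {"static_checks_passed": True}
sandboxed_skill = {
    "static_checks_passed": True,
    "permissions_declared": True,
    "sandbox_run_passed": True,
}

new_stage = lifecycle_stage(new_skill)
sandboxed_stage = lifecycle_stage(sandboxed_skill)
```

Returning the missing gates alongside the stage gives the skill author something to act on without a meeting.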

The governance model for skills has to go beyond a document and meeting. The deployment record should name the owner, track versions, show the diff for each change, define permissions, list in-scope credentials, list approved tools, define the sandbox configuration, list required evaluations, show runtime traces, record the user action that deployed the skill, and define a rollback plan.

The next set of incidents involving agents will be due to the skills executing on them.

This is directly related to our earlier note on agent monitoring as an infrastructure workload. Monitoring individual model calls is too narrow. For skills, the system has to monitor the installed artifact, runtime permissions, traces, user actions, and recovery process.

Then everyone asks who approved the agent.

This also maps onto our previous observation about trace evidence across the MCP boundary. The trace for a skill cannot simply be a transcript of a model conversation. It has to show the skill’s actions: what capability it requested, what credential it used, what tool calls it made, and what results came back.

Ask who approved the skill boundary.

Otherwise the incident review starts with vibes. Bad start.

We have spent real time on developing AI agency as if agency lives inside the model. Agency lives in the system: model, tools, runtime, memory, skills, and permission. Skills sit close to human expertise and close to runtime power. That combination is useful and dangerous.

Skill governance only works when the control follows the skill through its lifecycle.

The skill marketplace is coming. The governance boundary should get there first.

LangChain Interrupt: Agents Moved Into the Runtime | Focused

Austin Vance — Thu, 28 May 2026 04:06:28 +0000

Interrupt felt different this year.

Less model worship. More runtime.

The more useful conversations at the conference took a practical turn. Agents work best when the workflow is built agentically from the start. The reality on the ground is that many existing enterprise processes are simply wrapped around a model: buttons are pressed, forms are filled out, and the model's output is copied and pasted into another field or application. That falls apart after the demo.

The better way to put agents into workflows is to build the workflow agentically in the first place. The parts that should be deterministic should be handled by ordinary software. The LLM can then be used for judgment, synthesis, planning, dealing with ambiguity, and setting priorities. The harness and static code should supply tools that have contracts, state, a way for the agent to recover from failures, and enough visibility that when the agent does something weird, someone can actually debug it.

The release I kept coming back to was LangSmith Engine.

This is the loop people have been trying to describe for improving agents over time. The new engine watches traces of agent execution. It clusters failures and turns them into issues. It analyzes the harness's production code to diagnose the root cause of a problem. It writes PRs for fixes. It proposes online evaluators. It moves failing production traces into offline eval sets for further improvement.

Production behavior becomes evidence. That evidence becomes an issue for the system to fix. That fix becomes an evaluator watching for the failure to come back.

That is cool.

This changes observability. The longstanding view of a trace as merely a receipt of an application's actions is becoming obsolete. In agent systems, traces are sources of evidence for the next cycle of improvement, whether that is harness tuning, updated prompts, additional context, alternative models, new evaluators, or repaired workflows.

Storage makes a similar point, as SmithDB showcased. Agent traces have a fundamentally different shape from the traces teams are used to following through web applications. They have a different event density, individual spans that run longer than a web trace's, and deep, wide span trees with highly variable timing. A single agent run can include tool calls, retrieved context, model or evaluator output, files opened or written, user feedback, or, rarely, even a full research project.

An agent trace is different from an application trace. Instead of helping to debug why an application is slow or failing, an agent trace helps to understand why an agent decided to take a particular action in the first place.

That is a different data problem.

The same theme shows up in Deep Agents v0.6. The interesting parts are not flashy: code interpreter, programmatic tool calling, typed streaming, DeltaChannel checkpoint storage, and harness profiles for different models.

This means lower-cost models such as Kimi can handle routine tool work: summarization, search, extraction, and so on. That work should not burn frontier tokens. Frontier models handle the harder reasoning, where the model's ceiling is what ultimately determines the result. A team might use Kimi here, Qwen there, DeepSeek somewhere else, and reserve Opus or GPT for when the work actually needs it.

Just because a model sits in a box inside a managed platform does not mean that changing it is free. A swap can cascade into prompts and how the team writes them, tool calls and how the system calls tools programmatically, typed streaming, and how DeltaChannel checkpointing behaves. Failure modes show up in entirely new parts of the system. Cost and latency land in entirely different trade spaces. A team tweaking the harness or static code to improve an agent wants to know how those tweaks interact with changes to the model.
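As a rough illustration of that split, here is a routing rule in a few lines. The routine task types come from the paragraph above. The tier labels and the escalation rule are assumptions, and this is not how Deep Agents harness profiles are configured.

```python
# Task types come from the text; tier names and escalation are assumptions.
ROUTINE_TASKS = frozenset({"summarization", "search", "extraction"})


def pick_tier(task_type, low_cost_failures=0, max_low_cost_failures=1):
    """Send routine work to a low-cost tier unless it keeps failing there."""
    if task_type in ROUTINE_TASKS and low_cost_failures <= max_low_cost_failures:
        return "low_cost"
    return "frontier"


def plan(task_types):
    """Map each task type in a workflow to a model tier."""
    return {task_type: pick_tier(task_type) for task_type in task_types}


routes = plan(["search", "extraction", "multi_step_planning"])
```

The `low_cost_failures` argument is where eval results would feed back in: a route moves to the stronger tier only after evidence that the cheap one fails on it.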

The question that recurs throughout this conference: what to put in a model, what to put in a harness, what to put in static code, and what to put in runtime?

Platforms like Managed Deep Agents, Context Hub, and Sandboxes all provide ways to manage durable threads and other important runtime structure: files, skills, subagents, and versioned context; safe code execution and human approval flows; tracing and checkpoints; and, of course, a memory that lives somewhere other than in the vibes of the person interacting with the system.

MCP will connect agents to tools that are within an organization's own stack. This is different from A2A, where agents interact with other agents. Some of those will be user agents, and others will be digital employees that have their own permissions, budgets, policy boundaries, and audit trails. This will require different auth models than current approaches, and in many cases, existing agents are essentially useless, or worse, frightfully powerful in their current form.

For LLM Gateway, the list is spend limits; PII redaction; tracking of policy events (whether spend limits are set, what they are, who updated them); trace continuity, so that after a trace is sent to an external tool the original trace can still be viewed in the same view as the trace the external tool generated; and eventually tool gateway controls, meaning the same type of controls for external tools as for MCP.

My takeaway from Interrupt is that the agent conversation is getting more honest.

Of course, this is the hard part: putting the LLM in the right place to bear judgment, and surrounding it with code that enables traceability, limits, eval, and improvement.

That is what made Interrupt feel important.

The ecosystem is moving in the direction of LLM-powered agents as runtime participants within actual software systems, instead of as magic employees.

Good. That is where the work is.

Agentic Payments Move Spending Authority Into the Runtime | Focused Labs

Austin Vance — Thu, 28 May 2026 04:06:26 +0000

Agentic payments are arriving before payment authority has an owner.

At LangChain Interrupt, the conversation kept returning to one question for agentic AI: what to do with agents that need to spend money. Privy said the payment question kept coming up. Harrison Chase pointed at centralized least privilege, audit trails, and dynamic policies as the path that probably beats static allowlists.

The wallet signs. The runtime decides.

So let me just repeat that. The payment question behind agentic commerce is delegated spend. Spending money may arrive as a new tool call, but it is not simply another tool call. It creates liabilities, dispute paths, budget pressure, user trust issues, procurement weirdness, and an audit trail the business needs to defend after the fact.

The Spending Bug Has a Perfect UX

Integrated into the systems where humans already work, agents can research suppliers, compare offers, reserve resources, pay for APIs or other digital goods and services, issue refunds to customers, or book travel. That makes the integrated agent far more powerful than AI-powered chat.

Money exposes the part of the architecture that hand-wavy autonomy hides.

As soon as the agent has to stop executing and fill out a human-controlled checkout page (for a purchase, say), the workflow stops being integrated and turns into manual work. Direct wallet access is worse. The UI makes the spend look intentional (with a cool progress bar, for example) until the finance department discovers weeks later that a prompt was entered incorrectly and the AI agent spent the money.

Protocols are moving quickly. x402 frames itself as an open payment standard for internet-native payments. Coinbase describes x402 as direct programmatic payments over HTTP for human developers and AI agents, with no account setup and no manual payment flow. Cloudflare’s Agents documentation shows the machine-to-machine flow: client requests a paid resource, receives a 402 Payment Required response, retries with a signed payment payload, and gets settlement confirmation.
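The request, 402, retry, settle sequence can be simulated in memory. This is a toy, not an x402 client: the response fields, the HMAC stand-in for a wallet signature, and the price are all assumptions made to keep the sketch self-contained. Check the x402 specification for the real payload format.

```python
import hashlib
import hmac
from decimal import Decimal

DEMO_KEY = b"demo-only-key"   # stand-in for a wallet key; never hard-code a real one
PRICE = Decimal("0.01")


def sign(resource, amount):
    """Stand-in signature; a real client would sign with its wallet."""
    message = f"{resource}:{amount}".encode()
    return hmac.new(DEMO_KEY, message, hashlib.sha256).hexdigest()


def paid_server(resource, payment=None):
    """Simulated server: answers 402 until a valid signed payment arrives."""
    valid = payment is not None and hmac.compare_digest(
        payment.get("signature", ""), sign(resource, PRICE)
    )
    if not valid:
        return {"status": 402, "price": PRICE}
    return {"status": 200, "body": f"contents of {resource}", "settled": PRICE}


def fetch(resource, max_price):
    """Client: request, read the 402, check a local limit, retry with payment."""
    first = paid_server(resource)
    if first["status"] != 402:
        return first
    if first["price"] > max_price:
        raise PermissionError("price exceeds the caller's limit")
    return paid_server(resource, {"signature": sign(resource, first["price"])})


response = fetch("/reports/q1", max_price=Decimal("0.05"))
```

Even in the toy, the only spending control is the `max_price` argument the caller happens to pass. The protocol moves the money. It does not decide whether the money should move.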

Good. Interfaces for agents need to be agent-operable interfaces, including payment interfaces. That’s why human checkout pages are so bad for autonomous work.

That leaves the hard work to the runtime, in the form of spending policy. Which agent requested this spend? What was the agent's purpose? What is the relevant budget? Which merchants and services are in scope? Who can revoke a delegation of spend? Which human approves a spend? Where is the settlement receipt written?

Payment Intent Belongs in the Runtime

The core object is payment intent.

An agent shouldn’t be passing around payment authority, the agent should create a payment intent. The intent would include actor, task, merchant/service, amount, currency, purpose, budget, requested payment method, evidence gathered, and fallbacks for when the main method of payment fails. The runtime policy engine then determines the next steps for the payment intent based on its rules (approve the payment of metered API calls automatically, pay customer refunds to support leads, require approval from procurement for new vendors, reject transactions that were not for the intended task, etc). Those next steps could be yes, no, ask for human approval, or open a ticket.
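A minimal sketch of an intent and a policy decision, under stated assumptions. The `PaymentIntent` fields follow the paragraph above. The purpose labels, the mapping of each rule to an outcome (for example, new vendors opening a ticket), and the default of asking a human are my own choices for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentIntent:
    actor: str
    task: str
    merchant: str
    amount: Decimal
    currency: str
    purpose: str
    budget: str
    payment_method: str
    evidence: tuple = ()
    fallbacks: tuple = ()


def decide(intent, task_purposes, known_merchants):
    """Return one of "approve", "deny", "ask_human", "open_ticket"."""
    if intent.purpose not in task_purposes.get(intent.task, set()):
        return "deny"          # the spend is not for the intended task
    if intent.merchant not in known_merchants:
        return "open_ticket"   # new vendor: procurement has to approve
    if intent.purpose == "customer_refund":
        return "ask_human"     # route to a support lead
    if intent.purpose == "metered_api":
        return "approve"       # metered API calls are approved automatically
    return "ask_human"         # no rule matched, so a human decides


task_purposes = {"enrich_leads": {"metered_api"}, "resolve_ticket": {"customer_refund"}}
known_merchants = {"enrichment-api", "internal-refunds"}

api_call = PaymentIntent("agent-7", "enrich_leads", "enrichment-api",
                         Decimal("0.40"), "USD", "metered_api", "growth-q3", "wallet")
refund = PaymentIntent("agent-9", "resolve_ticket", "internal-refunds",
                       Decimal("25.00"), "USD", "customer_refund", "support", "card")
off_task = PaymentIntent("agent-7", "enrich_leads", "enrichment-api",
                         Decimal("99.00"), "USD", "gift_card", "growth-q3", "wallet")
new_vendor = PaymentIntent("agent-7", "enrich_leads", "unknown-data-co",
                           Decimal("5.00"), "USD", "metered_api", "growth-q3", "wallet")

decisions = [decide(i, task_purposes, known_merchants)
             for i in (api_call, refund, off_task, new_vendor)]
```

The agent only ever constructs the intent. Nothing in `decide` needs a key or a signer, which is the point of keeping the wallet out of the agent loop.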

The wallet should sit behind the runtime policy, not inside the agent loop.

This sounds boring because money likes boring.

The runtime spending policy needs a few plain fields:

Amount and currency limits.
Merchant or service scope.
Delegated purpose.
Budget window.
Approval threshold.
Wallet or signer scope.
Revocation owner.
Receipt destination.
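
Those fields translate almost directly into a record plus one check function. The sketch below is an assumption about shape, not a product schema: `SpendingPolicy`, `check`, and the violation labels are hypothetical names, and a real engine would also track the budget window over time.

```python
from dataclasses import dataclass
from datetime import timedelta
from decimal import Decimal


@dataclass(frozen=True)
class SpendingPolicy:
    max_amount: Decimal          # amount limit per payment
    currency: str                # currency the limits are expressed in
    merchant_scope: frozenset    # merchants or services in scope
    delegated_purpose: str       # what the spend is for
    budget_window: timedelta     # period the budget covers
    budget_limit: Decimal        # total allowed within the window
    approval_threshold: Decimal  # at or above this, a human approves
    signer_scope: str            # wallet or signer allowed to sign
    revocation_owner: str        # who can revoke the delegation
    receipt_destination: str     # where settlement receipts are written


def check(policy, *, amount, currency, merchant, purpose, signer, spent_in_window):
    """Return (violations, needs_approval) for one proposed payment."""
    violations = []
    if currency != policy.currency:
        violations.append("currency")
    if amount > policy.max_amount:
        violations.append("amount")
    if merchant not in policy.merchant_scope:
        violations.append("merchant")
    if purpose != policy.delegated_purpose:
        violations.append("purpose")
    if signer != policy.signer_scope:
        violations.append("signer")
    if spent_in_window + amount > policy.budget_limit:
        violations.append("budget")
    # Approval only matters for payments that are otherwise allowed.
    needs_approval = not violations and amount >= policy.approval_threshold
    return violations, needs_approval
```

Returning the list of violations rather than a bare boolean keeps the denial explainable to Finance and to the person reading the trace later.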

Privy's x402 writeup gets close to this shape, with per-domain spend limits, agent-specific permissions, workflow-level approvals, and even session signers for offline user workflows.

Wallet Ownership Changes the Failure Mode

Wallet architecture still matters. It just answers a much narrower question.

Privy's agentic wallet docs describe two control models: developer-owned wallets controlled by backend authorization keys, and user-owned wallets that grant an agent signer scoped policies while the user keeps ultimate control and revocation.

A developer-owned wallet fits a backend agent paying for infrastructure, such as data enrichment, paid APIs, crawl credits, model calls, or transaction fees. The company owns the budget, the runtime owns the policy, and operational control of revocation is sufficient (someone on the team can log into the management interface and cancel an agent).

User-owned wallets with agent signers fit the delegated action model. Users fund their own workflows and grant the agent or signer bounded authority to complete tasks on their behalf. The user should always be able to revoke the agent or signer.

The control model decides who owns funds, who can revoke access, and where policy gets enforced.

Both models need runtime policy. A developer-owned wallet with no spend controls turns into a company credit card taped to an agent’s back. A user-owned wallet with no scoped delegation turns into a consent problem just waiting to turn into a customer dispute.

Approval UI Is Runtime Infrastructure

Approval is a runtime state transition. The modal is just where a human sees it.

That is why agent UI belongs in runtime infrastructure. When a user approves a spending action, they are approving a proposed transaction with a matched policy. The evidence the agent used, the budget impact, and the scope of authority being granted should all be visible to the person approving. The user should be approving a bounded payment intent, not a vague agent plan.

That is the part everyone treats as an SDK integration before declaring that payments “work”.

Receipts Beat Trust

A payment-capable agent needs its payments recorded as receipts, and those receipts need to be first-class data in the agent’s runtime.

When making payments, the payment response should be written back as evidence to the same stream as the rest of the work done by the agent. Cloudflare’s x402 flow ends with a PAYMENT-RESPONSE header that contains settlement confirmation.

The same was true for tool calls: agent traces need to cross the boundary where the work happens, and the trace becomes much more valuable once it records the side effect. In the case of payments, the side effect is value transfer (money moving into or out of an account). So for all but the cheapest actions, traces of agent work must include payment receipts.

The ledger doesn’t have to be pretty or shiny. It has to answer a boring set of questions quickly.

What did the agent intend to buy?
Which policy allowed it?
Which signer executed it?
Which rail settled it?
Which human approved it, if approval was required?
Which task, ticket, customer, or account owns the receipt?
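
One way to keep those questions cheap is to make each one a field on the receipt. A minimal sketch, assuming an append-only list of JSON lines as the ledger; `PaymentReceipt`, `append_receipt`, and `receipts_for` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PaymentReceipt:
    intent_id: str              # what the agent intended to buy
    policy_id: str              # which policy allowed it
    signer: str                 # which signer executed it
    rail: str                   # which rail settled it
    approved_by: Optional[str]  # which human approved it, if any
    owner_ref: str              # task, ticket, customer, or account
    settlement: str             # raw settlement confirmation, kept as evidence


def append_receipt(ledger, receipt):
    """Append one receipt as a JSON line and return the line written."""
    line = json.dumps(asdict(receipt), sort_keys=True)
    ledger.append(line)
    return line


def receipts_for(ledger, owner_ref):
    """Answer 'which receipts belong to this owner?' without reading traces."""
    rows = [json.loads(line) for line in ledger]
    return [row for row in rows if row["owner_ref"] == owner_ref]
```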

The Card Networks See the Same Boundary

We should pay attention to how existing payment networks already frame the problem. Visa Intelligent Commerce, enabled by crypto, frames agentic checkout for consumers around approved AI “agents” that manage their cards online. For merchants, it frames it around similar but additional factors such as fraud and dispute protections, merchant acceptance, and especially “trust in transaction” for each payment context (online, in app, and so on), all of which existing payment networks already handle. PayPal also recently announced that Mastercard Agent Pay will integrate with PayPal's wallet, so merchants can accept payments from AI agents acting on behalf of users who have added their payment methods to the wallet.

The runtime layer has to own the authority model before this becomes the normal pattern.

Build the Spending Policy First

Define the payment intent schema. Include purpose, merchant scope, amount, budget, required evidence, approval requirement, signer scope, and receipt destination. Treat the intent as a first-class object from day one. Use it for backend logic and to generate the form fields a human fills in during approval.

Implement a policy engine next. Enterprise wallet offerings alone do not give the runtime control over payments. Write policies as simple rules that Finance and Security can understand, and test them against a range of scenarios. Service allowlists are only a first input; purpose, budget window, task owner, customer account, and approval state all affect whether a payment goes through.

Make approval a runtime event. The approval should pause the payment intent, show the policy match, capture a bounded decision, and then resume or reject the workflow. In other words, do not reduce “approve agent action” to a button.
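
Treating approval as a state transition can be sketched as a small state machine that records every move. The state names, the `ApprovalFlow` class, and the transition table are assumptions for illustration; the point is that pause, decision, and expiry are events the runtime owns.

```python
from enum import Enum


class IntentState(Enum):
    PROPOSED = "proposed"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXPIRED = "expired"


# Allowed transitions. Terminal states have no entry, so nothing leaves them.
TRANSITIONS = {
    IntentState.PROPOSED: {IntentState.AWAITING_APPROVAL, IntentState.APPROVED, IntentState.REJECTED},
    IntentState.AWAITING_APPROVAL: {IntentState.APPROVED, IntentState.REJECTED, IntentState.EXPIRED},
}


class ApprovalFlow:
    def __init__(self, intent_id, policy_id):
        self.intent_id = intent_id
        self.policy_id = policy_id  # the policy match shown to the approver
        self.state = IntentState.PROPOSED
        self.events = []            # every transition is recorded

    def _move(self, new_state, actor):
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.events.append({"from": self.state.value, "to": new_state.value,
                            "actor": actor, "policy_id": self.policy_id})
        self.state = new_state

    def pause_for_approval(self):
        self._move(IntentState.AWAITING_APPROVAL, actor="runtime")

    def decide(self, approver, approved):
        target = IntentState.APPROVED if approved else IntentState.REJECTED
        self._move(target, actor=approver)

    def expire(self):
        self._move(IntentState.EXPIRED, actor="runtime")
```

Because terminal states accept no further transitions, a late click on a stale approval card raises instead of silently moving money.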

Test out denial paths. If all a payment system can test for is successful settlement, the team is asking for a weird Friday afternoon. Denied payment, expired approval, revoked signer, duplicate retry, failed settlement, disputed transaction, and exhausted budget all need first-class paths.

Agentic payments are going to make agents feel useful, because agents will be able to buy data, pay for API calls, issue credits, and book services for companies, all within a specified policy. That is real work.

The agent runtime becomes a spending control plane.

Companies that get agentic payments right will start from a different vantage point than existing payment “solutions”: a plain set of questions for every transaction. Who is delegating the spend? What policy will be enforced on it? Where does approval for it live? What receipt will prove the transaction was for the business?

Then the wallet can sign.

Agent UI Is Runtime Infrastructure | Focused Labs

Austin Vance — Thu, 28 May 2026 04:05:53 +0000

Agent product spinners tell the truth badly.

There’s this common pattern in agent UX: a person clicks a button and after a few moments a spinner shows up in the center of the page and the product locks up. The user has no idea what tool is running, what changes are happening in the system, what subagents or agents are involved. Later on the team can go back and read through a long transcript to figure out what happened. It’s better than nothing, but ultimately useless for a human doing real work with real deadlines.

Token streaming solves the model-call UX problem and makes the answer feel more alive as the model writes out the answer. But agent products have a wider problem: the work of an agent to produce output happens across tools, subagents, checkpoints in a workflow, patches of state, approval gates, background jobs, reconnectable sessions, and different applications that a person is staring at to see the results of the agent’s work.

This problem has been named by LangChain in its post "From Token Streams to Agent Streams". Generative UI cannot be a pretty wrapper around a transcript. The UI/runtime boundary needs a contract.

The stream stopped being a text pipe. It became the UI contract.

A transcript can render text. An agent product has to render work.

Old chat UI worked because a user posted a message and waited for assistant text. During generation, tokens appended to the current thread until the assistant finished.

The easy chat app shape gets destroyed as soon as work gets added to the agent.

A support agent looks up account info, checks SSO configuration, opens a ticket, asks for approval to change a plan, adds a note to Salesforce, and waits for the billing API to return info. The agent's text output during that time is a small fraction of the work the product does to service the request. What matters is the entire path the request takes through the product: billing call, approval card, changed account info, ticket ID, and then the final response. A spinner with a blank text input and a dribble of tokens is poor UI for that experience.

This problem is already apparent in latency-prone chat agents built as single-threaded bots that process requests one at a time. Even when the final response is correct, a single-threaded support bot is an architecture smell, and the missing event stream is the UI version of the same problem: the frontend needs to see the product surface evolve in real time as the agent processes events in parallel, not after the fact as a single log entry.

This becomes clearer reading LangChain’s event streaming docs, which outline this API boundary for new application and frontend work. LangChain recommends stream_events(..., version="v3"), which returns typed projections for messages, tool calls, state values, subgraphs, custom projections, and final output. The application then renders the projection it is given, instead of parsing individual chunks of the run and branching on different text blocks.


stream = agent.stream_events(
    {"messages": [{"role": "user", "content": "Check the renewal risk for Acme"}]},
    version="v3",
)

for name, item in stream.interleave("messages", "tool_calls", "values"):
    if name == "messages":
        render_assistant_text(str(item.text))
    elif name == "tool_calls":
        render_tool_card(item.tool_name, item.input, item.output, item.error)
    elif name == "values":
        update_state_panel(item)

final_state = stream.output

This lets the front end promise something different than text streaming to a transcript. The transcript can be updated with text in real time. Tool calls can be surfaced as cards with progress and failure states. The latest state of the application can be rendered in real time. Errors can be surfaced with controls to retry. Final output can close the loop.

The code shape matters because instead of a frontend parsing logs, it subscribes to product nouns.

The event stream needs nouns.

A useful agent stream would contain information about the lifecycle of the agent, text messages, tool calls, state, activity, reasoning behind decisions, custom events, and errors. AG-UI, the Agent-User Interaction Protocol, is the cleanest current example of this vocabulary. Its docs describe event-based architecture where events are the fundamental communication units between agents and frontends, with lifecycle events, text message events, tool call events, state management events, activity events, reasoning events, special events, and draft extensions that are still under construction.

Again, the lifecycle portion of the events is more important than it may initially seem. A run can start. Steps start and end. Text chunks attach to a message ID. Tool calls start, finish, error, or disappear. State lands as deltas against the user-facing model of the world. A run can end, interrupt, or fail. The UI can behave like software instead of a chat window watching smoke signals.
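
A frontend that consumes such a stream is essentially a reducer over typed events. The sketch below uses made-up event type strings and field names, not AG-UI's actual schema, to show how lifecycle, text, tool, and state events each update a different part of the view.

```python
from dataclasses import dataclass, field


@dataclass
class RunView:
    status: str = "idle"
    messages: dict = field(default_factory=dict)    # message_id -> text so far
    tool_calls: dict = field(default_factory=dict)  # tool_call_id -> status
    state: dict = field(default_factory=dict)       # user-facing state


def apply_event(view, event):
    """Fold one event into the view. Event names here are illustrative."""
    kind = event["type"]
    if kind == "run_started":
        view.status = "running"
    elif kind == "text_chunk":
        # Text chunks attach to a message ID rather than to "the transcript".
        mid = event["message_id"]
        view.messages[mid] = view.messages.get(mid, "") + event["text"]
    elif kind == "tool_call_started":
        view.tool_calls[event["tool_call_id"]] = "running"
    elif kind == "tool_call_finished":
        view.tool_calls[event["tool_call_id"]] = "done"
    elif kind == "tool_call_failed":
        view.tool_calls[event["tool_call_id"]] = "error"
    elif kind == "state_delta":
        view.state.update(event["delta"])
    elif kind in ("run_finished", "run_interrupted", "run_failed"):
        view.status = kind[len("run_"):]
    return view
```

Each branch maps to a distinct piece of UI: a message bubble, a tool card, a state panel, or the run-level status. None of them requires parsing assistant text.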

AG-UI did not show up by accident. CopilotKit introduced it as a protocol for streaming a single JSON event sequence over standard HTTP or an optional binary channel, carrying messages, tool calls, state patches, and lifecycle events generated during an agent run. The protocol is also supported by Microsoft’s Agent Framework for remote agent hosting. The framework supports real-time streaming over SSE, session and state management, human-in-the-loop approvals, shared state, predictive state updates and tool-based generative UI.

There is also an enterprise adoption signal here. Agent frameworks are being judged by the event contract that their authors expose to application developers, not by the React widgets they use to build a demo application.

Generative UI makes the frontend part of the runtime boundary.

The term “generative UI” is abused because people believe that the model is painting components in front of them. But the useful part is the boring part. How does work happen in the frontend vs. the backend? How does approval work? How does writing state work? And what can be retried here?

AG-UI’s tool model gives the frontend a way to define client-side tools and let the agent request them through a structured lifecycle. The AG-UI tool docs describe tool schemas, tool calls, and frontend-defined tools that let the application keep sensitive operations under product control while the agent reasons over the interaction.

That boundary is significant. The calendar component exposing “propose meeting times” doesn’t grant the agent write access to the calendar. The billing page exposing “prepare plan change” requires a human approval event before commit. The accounts data table can expose “filter this segment” as a UI operation rather than a backend mutation that changes the underlying data. The agent asks for the operation, the product decides whether to allow it, and the resulting events are recorded in the stream.
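
The ask, decide, record sequence can be sketched as a gate the product owns. `FrontendTool` and `ToolGate` are hypothetical names and this is not AG-UI's tool API; it only illustrates that the agent requests an operation while the application decides and logs the outcome.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FrontendTool:
    name: str
    handler: Callable[[dict], dict]
    requires_approval: bool = False


class ToolGate:
    def __init__(self, tools):
        self.tools = {tool.name: tool for tool in tools}
        self.events = []  # what the agent asked for and what the product decided

    def request(self, name, args, approved_by=None):
        tool = self.tools.get(name)
        if tool is None:
            self.events.append({"tool": name, "outcome": "unknown_tool"})
            return None
        if tool.requires_approval and approved_by is None:
            # The agent may prepare the change; a human event commits it.
            self.events.append({"tool": name, "outcome": "awaiting_approval"})
            return None
        result = tool.handler(args)
        self.events.append({"tool": name, "outcome": "executed", "approved_by": approved_by})
        return result
```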

Once the run context is known, the screen can reflect state, permissions, intermediate artifacts, and interaction history. This is where context starts replacing static design. The UI cannot be drawn once and left alone. It has to change in response to each step of the run and the events generated by those steps.

Protocols matter because the UI is now part of the runtime boundary.

Subagents make transcript thinking collapse.

The transcript model gets messier with subagents in the product.

A supervisor might split a task into researching a problem, looking up a policy, analyzing an account, and planning a series of remediations. The tradeoffs for multi-agent orchestration with LangGraph were laid out in a recent architecture piece. The UI problem is the transcript problem again. Delegated work can be flattened into a single transcript, but flattening hides the task hierarchy: which tasks are blocked, which have completed, and which child run produced which result.

Deep Agents handles the UI problem of delegated work by treating each subagent's stream as a separate first-class projection. The docs say stream.subagents exposes one stream handle per delegated task, with scoped messages, tool calls, values, nested subagents, output, path, and lifecycle status. This means a UI can show the supervising agent’s work, show each child agent's work as a separate card, and drill down into any one subagent’s stream without subscribing to every other subagent's stream.
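
To see what flattening loses, here is a framework-agnostic sketch that rebuilds the hierarchy from a flat event list. It does not use the Deep Agents API; the `run_id`, `parent_run_id`, and `status` fields are assumed event attributes.

```python
from collections import defaultdict


def build_run_tree(events):
    """Link child runs to parents and keep the latest status per run."""
    children = defaultdict(list)  # parent_run_id -> [run_id, ...]
    status = {}                   # run_id -> latest lifecycle status
    for event in events:
        run_id = event["run_id"]
        if run_id not in status:
            children[event.get("parent_run_id")].append(run_id)
        status[run_id] = event.get("status", status.get(run_id, "running"))
    return children, status


def render(children, status, parent=None, depth=0):
    """Return indented lines, one per run, in first-seen order."""
    lines = []
    for run_id in children.get(parent, []):
        lines.append("  " * depth + f"{run_id} [{status[run_id]}]")
        lines.extend(render(children, status, run_id, depth + 1))
    return lines
```

A transcript would interleave these runs by arrival time. The tree keeps blocked and completed children distinguishable at a glance.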

Reconnects and traces belong in the same design conversation.

Agent runs continue after a browser tab has been closed. They run into flaky API connections, pause for a human to intervene, and carry on after a product update has been deployed. They can fail partway through a write. Because runs are resumable, the stream of UI events for a run has to be resumable too.

This is where the event contract turns into production infrastructure. To follow runtime work, the contract has to map run IDs, thread IDs, parent run IDs, and step names. Ordered value deltas, tool calls, outputs, and results have to be tracked so the frontend does not duplicate side effects. When the UI generates a customer-visible action, it should trace back to the run, using the same identifiers that appeared in the tool card showing the “billing API failed” error.
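
A minimal sketch of the client side of that contract, assuming each event carries a `run_id` and a per-run sequence number `seq` starting at 1. Both fields and the `EventCursor` class are assumptions for illustration.

```python
class EventCursor:
    """Applies ordered events once, even if the stream replays after a reconnect."""

    def __init__(self):
        self.last_seq = {}  # run_id -> highest sequence number applied
        self.applied = []   # events that reached the UI, in order

    def accept(self, event):
        run_id, seq = event["run_id"], event["seq"]
        last = self.last_seq.get(run_id, 0)
        if seq <= last:
            return False    # replayed after reconnect; already applied
        if seq != last + 1:
            # A gap means the UI would render state it cannot justify.
            raise LookupError(f"gap in {run_id}: expected {last + 1}, got {seq}")
        self.last_seq[run_id] = seq
        self.applied.append(event)
        return True

    def resume_from(self, run_id):
        """Sequence number to request from the server after a reconnect."""
        return self.last_seq.get(run_id, 0) + 1
```

Dropping replays is what keeps a reconnect from re-rendering, or worse re-triggering, a customer-visible action.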

Honeycomb’s agent work points in the same direction. Its Agent Timeline product frames agent debugging as a flight recorder. The same information used to present active work in the UI can be displayed in a review of an incident that involved that agent. If the UI showed a step title and the “billing API failed” tool card, the trace needs the same run, step, and tool identifiers so the review can reconstruct what the agent did next.

Integrated agents raise the bar here. In the integrated-agent piece, the point was that agents deliver value when they operate within existing enterprise systems and workflows. Rendering work in a UI is not enough. The agent’s event stream has to integrate with the product, the platform, and the observability system. All three need the same event contract.

What to demand from an AI agent framework.

The buying checklist is getting sharper.

Can the framework stream messages, tool calls, state patches, subagent progress, approvals, errors, custom domain events, and final output as typed projections?

Can a product subscribe to a subagent, tool call, or state channel without draining the entire run, then reconnect later with the correct run IDs, state, and pending interrupts?

Can frontend tools stay under application control while the agent requests them through a documented schema?

Can event identifiers map cleanly into traces, logs, eval records, and incident tickets?

Does the frontend render something useful for product work even when the agent has only partially completed it?

A token stream makes things look active. An agent event stream gives a product handles.

Generative UI needs these handles: lifecycle events for run boundaries, tool events for actions performed through tools, state events for shared context, subagent events for delegated work, approval events for control, error events for recovery, and observability hooks for the receipts.

A chatbox is okay. People know how to use a chatbox. But a chatbox cannot be the UI for an agent-based product.

Make the stream a contract. Then build the UI on purpose.

Agent Monitoring Is an Infrastructure Workload | Focused Labs

Austin Vance — Thu, 28 May 2026 04:05:50 +0000

I added agent monitoring to the list of reporting work that has crossed over into SRE production infrastructure, which is annoying but real enough. The trace used to explain a single request. Now it has to carry the agent run through tool calls, subagents, sandboxes, services, approvals, retries, and side effects. It has to support SREs reading the trace a week or so after it happened, when no one remembers the details. The trace must support rollback and the other production troubleshooting work SREs do. And it must be understandable by an SRE who has not already read through the full raw event log for the agent run.

First off, Sarah Cat made the core point that managing and monitoring agents requires rethinking infrastructure because existing systems were not designed for agent scale. Then Harrison Chase added that the same point applies on the monitoring side. Charity Majors made the observability version sharper: there is a huge problem tracking long-running async AI sessions with the usual transaction and trace building blocks.

Observability for long-running agent sessions is turning into the storage, identity, retention, correlation and control-plane for the behavior of AI agents.

Fine. Call it monitoring if the label helps. Just don’t staff it like reporting.

The trace is carrying a bigger object

A web request goes in through a handler, calls a database (maybe a queue), and then returns a response to the person who made the request. This is the shape of a transaction and this is what people have built their observability tools around. This is the world of spans, of logs, of metrics, of viewing individual exemplars, of looking at a service map, of building a dashboard.

LangChain’s observability docs say agents require visibility into tools, prompts, decisions, tool calls, model interactions, and decision points. LangSmith’s dashboard docs turn that surface into operating metrics: trace count, latency, error rates, token usage, cost, tool run counts, tool errors, tool latency, run types, and feedback scores.

Read that list slowly. The list describes the beginning of an agent control plane.

I like traces. I like dashboards. I do not like pretending raw logs are the record.

The trace has to follow both the decision and the side effect.

For a single agent run, the logs are fine to read as logs. But in a production system, where logs become evidence for SREs or security investigators later, the trace has to follow decisions and side effects across every boundary the agent crosses: tool boundaries, MCP boundaries, sandbox boundaries, and other runtime boundaries. That was the argument in Agent Traces Need to Cross the MCP Boundary.

Long-running sessions break request-shaped thinking

A transaction trace is too small for a long-running agent session.

Honeycomb is running its 2026 Innovation Week around understanding, debugging, and improving AI systems in production. Its agent observability launch focuses on tracing agent activity, reconstructing decision paths, and debugging failures without manual log dives. The manual log diving is where immature agent systems die.

Without a session ID shared across systems, incident review falls apart. Someone opens LangSmith to trace the model calls. Someone opens Honeycomb to trace the service operations. Someone checks the workflow queues. Someone reads application logs. Someone asks whether the sandbox still exists. Someone searches Slack for approvals. Then the team starts piecing together guesses.

That is archaeology with search bars.
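
With a shared session ID, the same review becomes a join. A minimal sketch, assuming every system can export records that carry `session_id` and a sortable timestamp `ts`; the function name and record fields are hypothetical.

```python
from collections import defaultdict


def build_session_timeline(*sources):
    """Merge (source_name, records) pairs into one ordered timeline per session."""
    timelines = defaultdict(list)
    orphans = []
    for source_name, records in sources:
        for record in records:
            entry = {"source": source_name, **record}
            session_id = record.get("session_id")
            if session_id is None:
                orphans.append(entry)  # evidence nobody can place
            else:
                timelines[session_id].append(entry)
    for entries in timelines.values():
        entries.sort(key=lambda e: e["ts"])
    return dict(timelines), orphans
```

The orphan list is the useful output early on: it shows which systems still drop the session ID on the way through.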

Stateful agents raise the pressure again. The DeltaBox paper frames AI agents as systems that perform high-frequency state exploration, relying on checkpoint and rollback of complete sandbox state, including files and process state. With change-based sandbox state, the paper reports 14 ms checkpoint and 5 ms rollback. Prior approaches copy the full state to roll back, adding hundreds of milliseconds to seconds of latency during evaluation. The resulting system enables high-frequency state exploration, and supports a wide range of applications.

This also shows on the UI side. In Agent UI Is Runtime Infrastructure we noted that the agent event stream contains state that a user can act on. Monitoring has to share those identifiers. The backend trace and user-visible event belong to the same run.

AI SRE starts before the incident prompt

The term “AI SRE” is likely to be abused in 2026, but the useful meaning is mundane: an AI agent helps run software, and that software provides enough structure so the agent does not blow the incident budget interpreting things.

The Causely paper puts hard numbers on AI in SRE. The authors describe how AI workflows derive the environment state from raw telemetry and pay for it in tokens, latency, and interpretation errors. In a 24-microservice OpenTelemetry demo application with injected faults, causal grounding reduced mean time to diagnosis by 63 percent, token consumption by 60 percent, and tool calls by 78 percent, and it improved root-cause accuracy from 75 percent to 100 percent.

Monitoring an agent means building production monitoring for the service that runs the agent. That work is data modeling, retention, identity propagation, correlation, storage, schema, and permissions. Boring work. That is why no one wants to do it. That is also why it matters, as long as the work stays boring and does not turn into AI SRE snake oil.

It also has to close the loop from signal to issue. The signal should open a ticket and attach evidence to that ticket, and then continue to update the evaluator or release gate based on what the human does with that work. We wrote about how agent failures should open tickets in Agent Failures Should Open Tickets.

Ownership belongs below the dashboard

Agent monitoring fails when ownership only exists inside the dashboard.

Who Owns This Agent? observes an attribution gap in which the agent’s behavior is visible to affected parties, but the responsible operator or account is not identifiable. The enterprise version is obvious enough. A finance agent updates a customer record. A security agent modifies a ticket. A coding agent opens a pull request. A support agent sends an email. In each case, the record should name the system, account, policy, and operator.

The answer cannot be "check the logs." Come on.

Agent attribution forms part of monitoring, along with the permission state, session identity, and the rollback context of the deployment. The user-visible event and the backend action that caused it need a shared record. Supervisor and swarm systems spread decisions across agents with different responsibilities, which is why multi-agent orchestration has to be an architecture choice, not a vibe.

A production record has to survive outside the dashboard. I want a ledger that can answer boring questions without detective work: which run, which owner, which permission, which tool, which checkpoint, which evaluator, which side effect. The dashboard can stay a view. The record has to be the substrate.

Build the monitoring substrate before the agent fleet arrives

Monitoring an AI agent is a platform capability, with its own ownership, budget, and failure modes. Treating it that way eventually makes vendor comparisons useful. The steering deck can wait.

For incident tracking and monitoring to work, the evidence needs structure. As we argued in Agent Failures Should Open Tickets, the entire prompt may not be safe to store, and raw logs are not enough. Store the fields that make incident response possible: tool called, rejected alternatives, human policy decision, permissions, session cost, retry reason, checkpoint pointer, evaluator result, and final side effect. Govern the sensitive information. Keep the operational information queryable.

Connecting monitoring to repair work matters too. Failed evaluators should create issues. Cost anomalies should page the owner or throttle the workflow. Repetitive tool errors should open a repair loop. A rollback must attach the checkpoint, diff, and human decision path. A human approval to modify customer data should attach to the trace, not to a screenshot in Slack.
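
Those rules are small enough to write down as a routing function. A sketch under stated assumptions: the signal `kind` strings, the field names, and the repeat threshold of 3 are invented for illustration and would come from the team's own schema.

```python
def route_signal(signal, repeat_threshold=3):
    """Map one monitoring signal to the repair action described above."""
    kind = signal["kind"]
    if kind == "evaluator_failed":
        return {"action": "create_issue", "trace_id": signal["trace_id"]}
    if kind == "cost_anomaly":
        # Page the owner when one is known; otherwise throttle the workflow.
        if signal.get("owner"):
            return {"action": "page_owner", "owner": signal["owner"]}
        return {"action": "throttle_workflow", "workflow": signal["workflow"]}
    if kind == "tool_error" and signal.get("repeat_count", 1) >= repeat_threshold:
        return {"action": "open_repair_loop", "tool": signal["tool"]}
    if kind == "rollback":
        # A rollback without its evidence is not a record, so refuse it.
        required = ("checkpoint", "diff", "decision_path")
        missing = [key for key in required if key not in signal]
        if missing:
            return {"action": "reject_incomplete", "missing": missing}
        return {"action": "record_rollback", "trace_id": signal["trace_id"]}
    return {"action": "record_only"}
```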

This is the unglamorous part of the AI agent infrastructure. But it is also where production agents become manageable. The agent can be clever. The runtime cannot be casual.

Monitoring an AI agent is an infrastructure workload. The agent is a running system with memory, side effects, cost, ownership, state, and judgment loops. A dashboard may report the smoke. The monitoring substrate has to explain the fire.

Agent Failures Should Open Tickets | Focused Labs

Austin Vance — Thu, 21 May 2026 09:29:57 +0000

Agent traces should create work.

If an AI agent workflow fails twice, it should create a ticket with an owner, linked evidence, and a regression check the team can run down the line. Without that, traces and output reviews can be pretty and searchable and still be worthless for anything other than replaying the same series of mistakes. I’ve begun to call this “replayable regret”: expensive, and painful to behold.

LangChain this week identified a critical gap in tooling: AI agent work can be traced and reviewed, but identifying the error and merging the corresponding fix are still manual and slow. Harrison Chase called this out too, noting that LangChain is building an “issue bench” and already using it internally, though it is still early for this class of tooling.

That phrase matters because the unit of work changes. The trace stops being the artifact everyone stares at after the failure. The recurring failure becomes the artifact the team improves against.

Traces should not die in Slack

The common failure loop is quite dumb.

The loop starts well enough: an agent fails in a workflow, someone opens the trace for that work item, and the team can see the failure in the trace steps. The tool call timed out. The planner picked a weird branch. Retrieval pulled stale context. The evaluator fired. A user left negative feedback. All of that dies when the trace link lands in Slack with the word “interesting” on top of it. That’s vibes, not work.

Traces are good! I recently wrote about why traces from agent workflows should cross the MCP boundary, and making traces visible is the first half of the work here. A visible trace says this happened. But the trace, or set of traces, must also say that this failure family is owned, fixed, covered, and blocked from coming back.

That second half is the AI agent workflow people keep skipping.

In the recent Engine thread, the team at LangChain identified the right inputs into the AI agent workflow: tool call failures and timeouts, online eval failures, trace anomalies, negative feedback, and unusual behavior. Today those inputs are treated as interesting patterns to watch as dashboard widgets. Instead, they should be treated as signals for a queue of work, where each is eligible to become a named issue with severity, linked traces, suspected boundary, and release condition.

The trace is evidence. The issue loop turns it into engineering work.

The good version is pretty straightforward and mechanical: a trace anomaly turns into an issue, new traces get clustered with it, and it gets an owner. Said owner then makes a change which in turn adds a new evaluator or updates an existing one. The change is then released through a particular release gate, which in turn runs that new evaluator. And if it introduces any regressions, said issue reopens.

No ceremony. Just a loop.

The ticket needs a shape

An agent issue is not a Jira card with “LLM flaky” in the title. That card should be illegal (morally, at least). The issue needs the same hard edges as a production defect.

At minimum, the issue carries these fields:

Failure name: “refund flow calls payment API before policy check”
Workflow: refund, plan upgrade, incident triage, research synthesis
Severity: customer impact, data risk, financial risk, operational drag
Evidence: linked traces, failed eval runs, user feedback, tool responses
Boundary: prompt, tool contract, context source, model route, permission, downstream API
Owner: team or service owner, not “AI”
Fix status: proposed, merged, reverted, blocked
Regression coverage: benchmark eval, coverage eval, release gate
Reopen rule: the exact signal that opens it up again and puts it back in the queue
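
As a data structure, that shape might look like the sketch below. The `AgentIssue` class and its method names are hypothetical; the rule that an issue only closes with a merged fix and regression coverage is my reading of the list, not a tool's behavior.

```python
from dataclasses import dataclass, field


@dataclass
class AgentIssue:
    failure_name: str
    workflow: str
    severity: str
    boundary: str
    owner: str
    reopen_signal: str                        # exact signal that reopens it
    evidence: list = field(default_factory=list)
    regression_coverage: list = field(default_factory=list)
    fix_status: str = "proposed"              # proposed, merged, reverted, blocked
    is_open: bool = True

    def attach(self, trace_id):
        """Add a trace to the failure family without duplicating it."""
        if trace_id not in self.evidence:
            self.evidence.append(trace_id)

    def close(self):
        # Close only when a fix is merged and something guards against regression.
        if self.fix_status != "merged" or not self.regression_coverage:
            raise ValueError("needs a merged fix and regression coverage")
        self.is_open = False

    def observe(self, signal, trace_id):
        """Reopen and attach evidence when the reopen signal is seen again."""
        if signal == self.reopen_signal:
            self.attach(trace_id)
            self.is_open = True
```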

Agent failures are hard to test because they manifest differently based on the input, the branch under test, and the tools the agent touched before it failed. Without an issue name, every failure trace becomes a new issue to debug rather than another data point in the failure family that production tests are supposed to remove.

If LangSmith Engine emits issue.created and issue.trace.added events, then stable event IDs can handle dedupe, severity can travel with the event, and the shared request ID can group deliveries from the same upstream action. That’s all that’s required for this. No need for a religion. Use the existing webhook shape to get failures into queues, boards, and CI jobs.

The boring webhook handler should do four things:

Dedupe on event ID.
Group related deliveries by request ID.
Attach trace evidence to the existing issue when the cluster already exists.
Trigger the right owner workflow for the right reasons, meaning severity and recurrence justify the cost of that workflow.
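
The four steps above can be sketched in a few lines. This is an in-memory illustration under loud assumptions: the payload fields (`event_id`, `request_id`, `issue_id`, `trace_id`, `severity`), the severity labels, and the recurrence threshold are all hypothetical and not taken from any LangSmith webhook schema.

```python
from collections import defaultdict

PAGE_SEVERITIES = {"sev0", "sev1"}   # assumed severity labels
RECURRENCE_THRESHOLD = 3             # assumed: three traces justify the workflow


class IssueWebhookHandler:
    def __init__(self):
        self.seen_event_ids = set()
        self.deliveries_by_request = defaultdict(list)
        self.traces_by_issue = defaultdict(list)
        self.triggered = []  # (issue_id, reason) pairs

    def handle(self, event: dict) -> str:
        # 1. Dedupe on event ID.
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event["event_id"])

        # 2. Group related deliveries by request ID.
        self.deliveries_by_request[event["request_id"]].append(event["event_id"])

        # 3. Attach trace evidence to the issue cluster.
        issue_id = event["issue_id"]
        if event.get("trace_id"):
            self.traces_by_issue[issue_id].append(event["trace_id"])

        # 4. Trigger the owner workflow only when severity or recurrence justifies it.
        severe = event.get("severity") in PAGE_SEVERITIES
        recurring = len(self.traces_by_issue[issue_id]) >= RECURRENCE_THRESHOLD
        if severe or recurring:
            self.triggered.append((issue_id, "severity" if severe else "recurrence"))
            return "triggered"
        return "recorded"
```

A production handler would persist the seen IDs and clusters in a durable store; the sets and dictionaries here only show the control flow.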

This is a small piece of work. It is also how agent quality work avoids getting lost on Tuesday.

Benchmarks are pointing at the same problem

Long-horizon agent work fails in similar ways to engineering work. Rather than one incorrect result, the failure is a series of small errors that creep up over time, leaving a final result that is less than useful.

RoadmapBench exists to evaluate long-horizon software development tasks: 115 tasks spread across 17 repositories and 5 languages. The median task modified 3,700 lines of code in 51 files. For tasks at that size, the best model resolved 39.1% of them. The useful analysis is where the generated plan went wrong, which files inside the task became riskier, and which requirements got orphaned.

LongCLI-Bench scores long-horizon command-line programming tasks in the same spirit: fail-to-pass, pass-to-pass, and step-level progress. It reports pass rates below 20% for state-of-the-art agents. Step-level scoring matters because there is a big difference between a late red X for missing the ultimate task goal and an early red X that points to a tool loop, the wrong files, or a pass-to-fail regression.

Phoenix-bench makes a similar point on hardware tasks. Giving the agent oracle file-level localization added only 1.4% to resolution, while a single round of feedback from testbench logs increased the resolved rate from 42% to 45%. Pointing to the right general area is of limited value. Actionable feedback on the task under consideration is what helps.

This is the issue-loop argument dressed up in benchmark clothing. Better testing of AI agents requires more than a simple test suite. It requires a workflow that can expose issues, allow them to be fixed, and verify the fix inside the same workflow.

The eval suite should grow from resolved issues

Closed issues should feed the test suite.

LangSmith describes evaluators as workspace-level resources that can be attached to tracing projects and datasets in the same workspace. Engine can suggest custom evaluators for the issues it detects, and those evaluators can then be tied back to the trace clusters that caused the issue in the first place.

Brace Sproul’s distinction between benchmark evals and coverage evals maps onto this. One set of evaluators for fast benchmarks on known workflows. A second, more exhaustive set of evaluators for longer paths, product commitments, and stranger trajectories. Trying to use one suite for both ends turns evals into the tax nobody wants to pay.

Resolved issues should feed the right suite, not one giant eval blob.

Severity-0 resolved issues, like refund errors in critical workflows, should be evaluated with the fast benchmarks. A rare edge case in a long multi-hop research workflow is probably better served by the broader coverage suite, high cost and long run time included. Severity-0 policy violations may belong in both suites.
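
The routing rule in the paragraph above fits in one function. This is a sketch under assumptions: the suite names, the numeric severity scale, and the default of sending everything that is not severity 0 to the coverage suite are illustrative choices, and `suites_for` is a hypothetical helper.

```python
from typing import Set


def suites_for(severity: int, policy_violation: bool, long_horizon: bool) -> Set[str]:
    """Pick the eval suite(s) a resolved issue should feed."""
    suites = set()
    if severity == 0:
        # Critical workflows get guarded by the fast benchmark suite.
        suites.add("benchmark")
        if policy_violation:
            # Severity-0 policy violations may belong in both suites.
            suites.add("coverage")
    if long_horizon or severity > 0:
        # Rare or long multi-hop cases go to the slower, broader suite (assumed default).
        suites.add("coverage")
    return suites
```

For example, a severity-0 refund error lands in the benchmark suite only, while a severity-2 edge case in a long research workflow lands in coverage only.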

However it gets cut, this is work. Every workflow change can introduce failure modes the system has not seen before. The test that proves a fix worked is different from the test that guards the same problem against a later regression. And then there is the matter of the gate.

The queue is where agency gets real

The harder discipline is developing agents so they improve. That is much harder to demo than the capabilities an agent can apply to tasks and workflows.

Developing AI agency was always about that discipline. It shows up in a particular way: when something fails, the team can explain what happened next.

A good issue queue answers the questions a team has while debugging a failure. A single trace cannot answer all of them:

Is this failure new or recurring?
Which workflow owns it?
Which traces point at the same root cause?
Did the fix land?
Which evaluator covers it now?
Which release gate blocks a regression?
Who gets paged if the issue comes back?

Again, this is normal software development, complicated by workflows that fail through complex, probabilistic, variable paths. Same defect, different costumes.

The LangChain survey of production AI agents found that 57.3% of respondents already have agents in production. The number one production blocker cited by respondents was quality, at 32%. This sits next to 89% observability adoption for production agents, far ahead of offline evaluation at 52.4% and online evaluation at 37.3%. There is already a sea of visibility for production agents. The work to convert that visibility into closed quality issues is still barely underway.

Honeycomb’s new investigation features for agent observability start to address the same problem, with Agent Timeline built to reconstruct complex multi-agent, multi-trace workflows. But reconnecting that path to specific owned work, and making sure the work is covered, is still the large gap.

That is where the issue queue comes in.

Own the loop

The AI agent workflow I want is not fancy.

Signal failure -> create issue -> add evidence -> assign owner -> propose fix -> add evaluator -> run release gate -> reopen on regression.
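
One way to keep that loop honest is to encode it as an explicit state machine, so an issue cannot jump from "created" to "gate passed" without an owner, a fix, and an evaluator. The state names and the `advance` helper below are hypothetical, and the allowed transitions are one assumed reading of the loop.

```python
# Allowed transitions for an agent issue (assumed, derived from the loop above).
TRANSITIONS = {
    "signal": {"issue_created"},
    "issue_created": {"evidence_added"},
    "evidence_added": {"evidence_added", "owner_assigned"},
    "owner_assigned": {"fix_proposed"},
    "fix_proposed": {"evaluator_added"},
    "evaluator_added": {"gate_passed", "fix_proposed"},  # a failed gate sends the fix back
    "gate_passed": {"reopened"},                         # only a regression reopens it
    "reopened": {"evidence_added"},
}


def advance(state: str, next_state: str) -> str:
    """Move an issue forward, refusing any step the loop does not allow."""
    if next_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move issue from {state!r} to {next_state!r}")
    return next_state
```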

This workflow looks less interesting week to week than announcing a new AI model. But to the buyer with an agent touching refunds, support tickets, infrastructure changes, or account data, this is the kind of work that matters week to week. Last week’s failure needs to become this week’s guardrail.

Agent failures should open tickets.

The ticket is where the trace becomes work. The work is where the system gets better.

Agent Traces Need to Cross the MCP Boundary | Focused Labs

Austin Vance — Tue, 19 May 2026 21:23:21 +0000

Observability for AI agents running through MCP has a new failure point: the MCP tool call.

Good. The broad version of this conversation has already been beaten to death. Agents need traces. Agents need evals. Agents need feedback loops. Fine. The sharper production question is what happens when the agent leaves the planner and crosses into a tool server owned by another team, another vendor, another runtime, or another cloud account.

That boundary is where the trace disappears.

Honeycomb is running O11yCon in San Francisco this week. Christine Yen's line in the announcement gets at the issue: agents are writing code, agents are triaging incidents, agents are running production through orchestration, and engineering has little visibility into what the agents did, let alone whether they added value. The visibility gap for these agents is along the path between the model's decision, the tool server, and the downstream services affected by the action.

The production shape is distributed tracing with a model in the loop.

A planner says "tool failed." An MCP server just sees an unrelated tools/call. A database sees a single query. A payment API sees a single request. The observability backend sees all these individual pieces and, operationally, has no idea what to do with them. Nobody can say whether the model chose the wrong tool, the planner's MCP client lost context somewhere along the line, the server failed to accept the call, or the downstream service simply timed out.

Logs within a given service are comfortable to view because the local nature of the stream makes them easy to interpret. However, as soon as an incident affects multiple services or tools, that comfortable stream of logs disappears.

MCP made tool integration portable. It did not magically make tool behavior observable. Focused has been pushing this shape for a while. In Developing AI Agency, the point was that useful agents need real engineering systems around them. In Streaming agent state with LangGraph, the point was that intermediate state matters while long-running work is happening. MCP adds a protocol boundary to that same production story. If the trace cannot cross it, the agent becomes opaque at the exact moment it starts doing useful work.

MCP gave us the carrier

MCP made tool integration for production tool calls easier. Making the behavior of those calls observable is a different job.

This brings us to a simple and useful place: SEP-414 reserves the W3C trace keys for W3C trace context propagation through MCP. So the MCP tools/call request can include trace context as part of params._meta, next to the tool name and arguments.

MCP typically wants _meta keys that start with a DNS-prefixed name. SEP-414 makes an exception for the three W3C trace keys so existing OpenTelemetry propagation can work without creating twelve slightly different names for the same thing. traceparent stays traceparent, tracestate stays tracestate, and baggage stays baggage.

Tiny standardization, huge operational consequence.

A universal set of properties for W3C trace context is a small thing to request. Without SEP-414, every agent stack invents its own set of properties in params: io.modelcontextprotocol.traceparent, otel_trace_parent, correlation IDs encoded in a vendor envelope, plus the special shape required by a proprietary monitoring stack. The resulting observability swamp would be indistinguishable from what exists today with services and their HTTP traces.

First, the agent runtime starts a new span or continues an existing one. Then the MCP client for that runtime injects W3C trace context into params._meta for the call. When the MCP server processes the call, it extracts the W3C trace context from params._meta. Then the server creates a new server span. Tool code invoked by that server, including API calls to databases, queues, workflow engines, and other services, runs under the same trace context.
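
The inject and extract steps described above can be sketched without any tracing library. The `traceparent` layout follows version 00 of the W3C Trace Context format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The helper names `new_traceparent`, `inject`, and `extract` are hypothetical; a real client or server would more likely lean on an OpenTelemetry propagator than hand-roll this.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id=None, sampled=True):
    """Build a version-00 traceparent for a new span in the given (or a new) trace."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def inject(request: dict, traceparent: str) -> dict:
    """Client side: return a copy of the request with traceparent in params._meta."""
    params = dict(request.get("params", {}))
    meta = dict(params.get("_meta", {}))
    meta["traceparent"] = traceparent
    params["_meta"] = meta
    return {**request, "params": params}


def extract(request: dict):
    """Server side: return (trace_id, parent_span_id, sampled), or None if absent/invalid."""
    value = request.get("params", {}).get("_meta", {}).get("traceparent", "")
    match = TRACEPARENT_RE.match(value)
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    return trace_id, parent_id, bool(int(flags, 16) & 1)


call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"city": "Denver"}},
}
sent = inject(call, new_traceparent())
```

The server span then uses the extracted trace ID and parent span ID, so downstream database and API calls stay in the same trace as the agent's decision.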

The tool boundary is where agent observability either survives or dies.

HTTP spans will not save the agent loop

A tempting shortcut is to assume the transport already has tracing. The MCP server runs over HTTP. The ingress span exists. The collector sees requests. Done.

Nope.

That is why OpenTelemetry's MCP semantic conventions matter: HTTP spans only contain information about transport. Streamable MCP transports can contain more than one request, and one MCP operation can spread across retries and transports. The transport context and MCP context are related, but different.

A single streamable HTTP request can carry multiple MCP messages. A retry can create multiple transport-level attempts for one logical operation. Stdio has no HTTP request to hang a trace on at all. If instrumentation stops at the transport layer, the team is just looking at plumbing. The production question lives one layer up: what MCP method was called, what tool was called, what session was involved, what error type was returned, and which downstream spans received the trace context.

A trace is useful when it follows the boundary. In the simplest case, a single trace starts with a span created by the agent runtime. The span name should be boring and low-cardinality, with names like tools/call get_weather, tools/call query_customer, or tools/call create_ticket. The attributes carry the information that matters in production: mcp.method.name, gen_ai.tool.name, mcp.session.id, mcp.protocol.version, network.transport, and error.type. OpenTelemetry warns against adding high-cardinality resource URIs to span names by default. That creates backend cardinality problems for no benefit.
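
As a sketch of that naming rule, the helper below builds the span name from the MCP method plus the stable tool name and puts everything else into attributes. `tool_call_span` is a hypothetical function and the example values are illustrative; only the attribute keys come from the text above.

```python
from typing import Optional


def tool_call_span(tool_name: str, session_id: str, protocol_version: str,
                   transport: str, error_type: Optional[str] = None):
    """Return (span_name, attributes) for one MCP tools/call."""
    # Low-cardinality name: method plus stable tool name, never arguments or URIs.
    name = f"tools/call {tool_name}"
    attributes = {
        "mcp.method.name": "tools/call",
        "gen_ai.tool.name": tool_name,
        "mcp.session.id": session_id,
        "mcp.protocol.version": protocol_version,
        "network.transport": transport,
    }
    if error_type is not None:
        attributes["error.type"] = error_type
    return name, attributes


name, attrs = tool_call_span("query_customer", "sess-42", "2025-06-18", "tcp",
                             error_type="timeout")
```

Anything request-specific, such as the customer ID in the arguments, stays out of the span name so the backend groups all `query_customer` calls together.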

The same thing is true for baggage. Baggage is useful for correlation. It is also an attractive nuisance. A tenant hint here, a route class there, an evaluation cohort for a particular set of runs. Fine. But prompts, secrets, user emails, access tokens, and customer data do not belong in baggage because trace context is supposed to cross service boundaries.
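
A baggage policy is easiest to enforce as an allow-list at the point where the header is built. The sketch below assumes three permitted keys (`tenant.hint`, `route.class`, `eval.cohort`) purely for illustration; the helper names are hypothetical. Values are percent-encoded and joined as comma-separated `key=value` pairs.

```python
from urllib.parse import quote

# Assumed policy: only these correlation keys may cross service boundaries.
ALLOWED_BAGGAGE_KEYS = {"tenant.hint", "route.class", "eval.cohort"}


def filter_baggage(entries: dict) -> dict:
    """Drop every key that is not explicitly allowed."""
    return {k: v for k, v in entries.items() if k in ALLOWED_BAGGAGE_KEYS}


def to_baggage_header(entries: dict) -> str:
    """Serialize the allowed entries as a baggage header value."""
    allowed = filter_baggage(entries)
    return ",".join(f"{k}={quote(str(v), safe='')}" for k, v in sorted(allowed.items()))


header = to_baggage_header({
    "tenant.hint": "acme",
    "user.email": "someone@example.com",   # dropped: not on the allow-list
    "eval.cohort": "run 7",
})
# header == "eval.cohort=run%207,tenant.hint=acme"
```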

Google's Cloud Trace documentation treats tracing through remote MCP request metadata as an implementation detail. A remote server can accept traceparent in headers or _meta. Once that tracing information is accepted and the trace is sampled, the server emits spans for the requested operation, including failures caused by the agent or by the tool, and latency caused by the client, network, or server processing.

Sampling policy is therefore part of observing the agent's tool work. If a tool call is not sampled, nobody can reconstruct it later, including whoever wired up the chat UI.

Fragmented truth still loses the incident

Separate traces can be valid. A vendor-operated MCP server may want a clean service boundary. A client team may not own the server. Langfuse's MCP tracing docs make that distinction directly. But default separation is awful for incident management when the agent itself is causing a user-visible problem.

The agent chooses a tool. The MCP server executes the request. The database locks. The tool returns a timeout. The planner retries with slightly different arguments. The user waits. Each system can tell the truth from inside its own box. The operator still has to stitch together causality by timestamps, request IDs, Slack screenshots, and vibes (the official fourth pillar, apparently).

Without propagation, every system tells the truth in isolation.

In production flow, agent traces should form a chain that represents both the decision process for a request and the execution process carried out by services. The tool spans from an agent trace should link to the corresponding service spans. Having the agent's processing stages with nothing from subsequent services is model theater. Service spans without the corresponding tool decision are classic APM with no agent-specific information.

Honeycomb has been going down a similar route. Their Innovation Week writeup describes agent workflows that branch, retry, call tools, hand off, and trigger services. They frame Agent Timeline as a view that places the agent's work inside the incident loop and shows the causal chain behind a prompt log.

The implementation surface is small

Here is a concise specification for adding distributed tracing to an agent-enabled workflow:

inject traceparent into MCP params._meta
extract it on the MCP server
name spans by MCP method plus stable tool or prompt name
attach MCP and GenAI attributes with low cardinality
propagate trace context to following API and database calls
keep sensitive data out of baggage
send the result to a backend that can show agent and service work together

The ecosystem around the MCP contract already does a decent amount of the heavy lifting. Grafana's MCP server docs include attributes such as gen_ai.tool.name, mcp.method.name, and mcp.session.id, with W3C trace context propagation from _meta. MCP Toolbox telemetry docs cover attributes for MCP method, transport, protocol, toolset, tool name, and error type. LangSmith accepts OpenTelemetry ingestion, which means MCP spans do not have to sit in an observability island away from LangChain or LangGraph applications.

In practice, agent systems run across different runtimes, including planners, graphs, model gateways, tool registries, MCP clients and servers, legacy APIs, databases, queues, approval steps, and eval jobs. Evidence of proper orchestration cannot scatter across architecture components and still be reviewable by team members from AI, platform, service, and business functions. We discussed the tradeoffs in Multi-Agent Orchestration in LangGraph. For trace propagation, the same reasoning applies. Architecture can be decomposed into modular components. Evidence for correct runtime behavior cannot.

A decent review checklist is simple:

Can an operator start from a failed agent run and find the corresponding MCP tools/call span for the tool that failed?
Can they see the exact tool name without exploding cardinality?
Can they jump from the client span to the server span?
Can they see the downstream API, database, or queue work under the same trace?
Can they distinguish model/tool selection failure from tool/server failure?
Can they see error.type, latency, tokens, and quality signals near the same workflow?
Can they prove no secrets or PII are leaking through baggage or span attributes?

Call it what it is: a pull request.

The owner is the team that owns the boundary

The trick keeping observability in agent systems stuck is assigning MCP observability to the AI team, or to the MCP tools team, or to the database platform team, while claiming the boundary is too hard for any one team to own.

There are four parties involved here: the vendor of the tool, the platform team, the AI team, and the service team. The vendor exposes spans. The platform team runs a collector to gather those spans. The AI team creates a planner span that is passed as context to tools. The service team instruments downstream API and database calls made by tools within an agent run. Someone has to own the boundary between those groups.

Own the boundary.

For an internal MCP server, trace propagation belongs in the server template for all calls. It should not be left to individual tools. For vendor-provided MCP servers, test the contract by sending traceparent in params._meta and verifying that the backend receives the linked span. From the agent runtime, test that every tool call propagates trace context after injection, so nobody has to chase separate dashboards to confirm it. Baggage should have a clear policy before developers discover it as a convenient place to add sensitive information.

AI agent observability will continue to sound mysterious as long as production monitoring means staring at transcripts of model dialogs. A transcript is one artifact. It will never show the intent behind a command, the tools used to execute it, the side effects, the latency, the errors, or the downstream work required by systems that had to deal with the output of those tools.

MCP made tools portable. SEP-414 and the OpenTelemetry MCP conventions make the tool boundary traceable. The work is wonderfully unglamorous: pass the context, name the spans, control the attributes to keep cardinality low, protect baggage from sensitive information, and then follow the tool calls as the trace crosses the same boundary as the agent.

Follow the trace, follow the agent.

AI Agent Orchestration Needs Receipts | Focused Labs

Austin Vance — Sun, 17 May 2026 21:13:49 +0000

Orchestrating AI agents breaks in the most boring place of all: between issuing a tool call and the tool call having its intended side effect.

As tool calls move from client tools executed by application code to server tools executed on the provider's side, the language and the abstraction used to describe tool use break down. A tool call becomes a runtime transaction. The work done by a tool mutates databases, makes payments, sends emails, and creates tickets. A retry storm, or even a single retry, now has significant production consequences.

Agent tools need receipts.

Tool Calls Are Side Effects With Better Marketing

Anthropic's tool-use docs split server tools from client tools. A client tool is executed by application code, and then the application sends tool_result back to the model. This is where language ends and production begins. Databases get mutated. Payments get made. Emails get sent. Tickets get updated. Credentials get used.

I see this boundary get described as a function call. Better: side-effect boundary. These systems do not have a durable receipt right now.

What proves the side effect in an agent runtime? The request IDs from external vendors, the changed rows in the business system, and the receipt the runtime saved before the model moved on. If the runtime cannot track the side effects caused by tool calls inside the model loop, answering "Did this exact tool intent already cause this exact side effect?" takes human eyes reading through three different systems (and writing glue code along the way).

The Old Backend Pattern Still Applies

Normal API work has already figured this out. For example, Stripe supports idempotent requests for POST, so a caller can retry after a network failure without charging the customer twice. It tracks the original parameters for a given idempotency key, so if the key is reused with different parameters, it will not be treated as the same operation.

AWS Lambda Powertools describes idempotency records with INPROGRESS and COMPLETE states, payload hashes, stored responses and an expiration for the record. This is a tiny state machine around a side effect. That's all that's required for an agent runtime to safely handle model-intent-to-change-the-world calls.

The transactional outbox pattern covers the messaging side: write the business state and the outbound message in one database transaction, then deliver from the outbox. AWS writes about the duplicate-message problem for this style of delivery and recommends idempotent consumers that track processed message IDs.

A deterministic backend, for example a Java or Python service, calls a service endpoint with fixed intent semantics. Booking a hotel room that way is boring in exactly the right way. An agent tool call is different: it is produced by a model loop that can re-plan, retry, branch, summarize state, and call the same tool again. That is why the runtime has to record the intent before the side effect is produced.

What the Ledger Has to Know

Tool Ledger. Side-Effect Journal. Orchestration Transaction Table. The name is unimportant. It is a table with a specific shape.

The side-effect ledger is the boundary between model intent and production side effects.

A side-effecting tool call needs a record before execution:


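create table agent_tool_ledger (
  id uuid primary key,
  run_id text not null,            -- agent run that produced the intent
  step_id text not null,           -- graph step inside that run
  tool_name text not null,
  input_hash text not null,        -- hash of the normalized tool input
  operation_key text not null,     -- business-level idempotency key
  status text not null check (status in (
    'planned',
    'in_progress',
    'succeeded',
    'failed',
    'compensating',
    'compensated'
  )),
  receipt jsonb,                   -- external proof of the side effect
  compensation jsonb,              -- metadata for undoing the effect
  error jsonb,
  run_trace_id text,               -- joins the ledger row to the trace
  owner_service text not null,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),
  -- one ledger row per tool and business operation
  unique (tool_name, operation_key)
);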

That unique constraint is the point.

The record would hold: tool name, normalized input hash, run ID, graph step, owner service, run trace ID, status, receipt, and compensation metadata. On conflict, the application checks the stored input_hash against the new input_hash. Same key with different input is a bug. The receipt is the external fact: Stripe charge ID, Zendesk ticket ID, GitHub comment URL, invoice number, database primary key, email provider message ID.
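
The reserve-then-check behavior described above can be exercised with an in-memory SQLite table. This is a reduced sketch, not the full schema: the table keeps only the columns needed to show the conflict path, and `input_hash`, `reserve`, and `record_receipt` are hypothetical helper names. The receipt value `re_123` is a made-up identifier.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table agent_tool_ledger (
        tool_name text not null,
        operation_key text not null,
        input_hash text not null,
        status text not null,
        receipt text,
        unique (tool_name, operation_key)
    )
""")


def input_hash(payload: dict) -> str:
    """Hash a canonical JSON form so field order does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def reserve(tool_name: str, operation_key: str, payload: dict) -> str:
    """Return 'reserved' for new intent, or the stored status for a repeat."""
    digest = input_hash(payload)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "insert into agent_tool_ledger "
                "(tool_name, operation_key, input_hash, status) "
                "values (?, ?, ?, 'planned')",
                (tool_name, operation_key, digest),
            )
        return "reserved"
    except sqlite3.IntegrityError:
        stored_hash, status = conn.execute(
            "select input_hash, status from agent_tool_ledger "
            "where tool_name = ? and operation_key = ?",
            (tool_name, operation_key),
        ).fetchone()
        if stored_hash != digest:
            # Same key with different input is a bug, not a retry.
            raise ValueError("operation key reused with different input")
        return status


def record_receipt(tool_name: str, operation_key: str, receipt: dict) -> None:
    """Store the external fact and mark the row succeeded."""
    with conn:
        conn.execute(
            "update agent_tool_ledger set status = 'succeeded', receipt = ? "
            "where tool_name = ? and operation_key = ?",
            (json.dumps(receipt), tool_name, operation_key),
        )
```

A caller runs the tool only when `reserve` returns `"reserved"`. Any other return value means an earlier attempt already owns the operation, and the stored status and receipt decide what happens next.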

No receipt, no production claim.

Retry Safety Has to Be Designed Before the Retry

A retry policy is essentially a duplicate side-effect generator wearing a reliability costume.

Retries become safe only after the runtime has a durable place to check intent and receipts.

Temporal's Activity documentation recommends idempotent Activities because they can be retried. A non-idempotent Activity can corrupt application state even when the distributed system is functioning correctly. The runtime's retry policy does not make the agent reliable by itself.

This is where agent systems get uncomfortable. Because the system is instrumented to retry on transport failure, it is easy to believe every retry is a transport retry. In reality, the model observes a timeout and decides to go down a different path. For example, after refunding a customer the model may create a support note, and then refund the customer again after a summary step loses the receipt from the first attempt. The model may ask a human for confirmation in the meantime and then resume with stale tool context. The model may even run a background subagent that takes a different path to arrive at the same conclusion.

The recorded intent cannot be raw JSON. Models produce irrelevant differences. Field order changes. Natural-language notes shift. A good operation key comes from the business operation. The model's token stream is too noisy. refund:{tenant_id}:{payment_id}:{reason_code} beats a hash of the entire prompt. comment:{repo}:{pull_request}:{review_run_id} beats a blob of generated markdown.
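
A short sketch of that idea: the key is built from business fields only, so two model outputs that differ in field order or free-text notes collapse to the same operation. `refund_operation_key` is a hypothetical helper, and lowercasing the reason code is an assumed normalization.

```python
def refund_operation_key(tool_input: dict) -> str:
    """Build the operation key from business fields, ignoring model noise."""
    return "refund:{tenant_id}:{payment_id}:{reason_code}".format(
        tenant_id=tool_input["tenant_id"],
        payment_id=tool_input["payment_id"],
        reason_code=tool_input["reason_code"].strip().lower(),
    )


first_attempt = {
    "tenant_id": "acme",
    "payment_id": "pay_42",
    "reason_code": "duplicate_charge",
    "note": "Customer was charged twice on the same invoice.",
}
second_attempt = {
    "note": "Refunding the duplicate charge as requested.",
    "reason_code": "Duplicate_Charge ",
    "payment_id": "pay_42",
    "tenant_id": "acme",
}
```

Both attempts produce `refund:acme:pay_42:duplicate_charge`, so the unique constraint on the ledger sees one operation instead of two.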

The operation key belongs to the service that owns the tool. That ownership boundary corresponds to the ownership of the credentials for the tool. In agent systems, authenticating the agent to an external system should start with workload identity. In AI Agent Authentication Starts With Workload Identity, we discussed why secrets should not be passed around like party favors. The same principle applies here: the runtime should not make up the side-effect semantics for a tool it does not own.

Observability Without the Receipt Is Theater

Traces do not, by default, create a business-level uniqueness boundary.

Joining traces to ledger entries changes what agent observability can do. The trace explains the path after the incident. The ledger table can drive behavior during the incident: suppress the duplicate, resume from a receipt, trigger compensation, alert the owning team, or block the next step until a human approves the ambiguous side effect.

That is the difference between a dashboard and a control surface. The trace is evidence. The ledger is state.

Evaluations also get a lot better. In place of "the model called the refund tool", the useful check is one planned refund, one succeeded ledger entry, one receipt, zero duplicate external effects after a simulated timeout. In Everybody Tests, we recognized that people are already testing with the feedback loops they have today. The transcript is too thin to capture all the detail.
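
That check can be written against ledger rows instead of the transcript. The sketch below assumes ledger rows are plain dictionaries and that the test harness can list the effects a fake external system actually received; `refund_survived_timeout` is a hypothetical evaluator name.

```python
def refund_survived_timeout(ledger_rows: list, external_effects: list) -> bool:
    """Pass only if a simulated timeout left exactly one refund behind."""
    refunds = [row for row in ledger_rows if row["tool_name"] == "refund"]
    return (
        len(refunds) == 1                        # one planned refund
        and refunds[0]["status"] == "succeeded"  # one succeeded ledger entry
        and refunds[0]["receipt"] is not None    # one receipt
        and len(external_effects) == 1           # zero duplicate external effects
    )
```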

The Tool Interface Should Expose the Contract

The contract for a side-effecting tool should be defined near the definition of the tool itself. That contract should describe the operational facts that the runtime can enforce for that tool. A side-effecting tool contract should answer:

Is the tool read-only or mutating?
Who owns the tool?
Which fields form the operation key?
Which external receipt proves success?
What status means the side effect is safe to retry?
What compensation path exists when the effect is wrong?
How long does the ledger entry live?
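
Those questions translate into a contract object that sits next to the tool definition. This is a sketch with hypothetical names (`ToolContract`, `can_retry`, `refund_contract`), and the choice of which statuses count as safe to retry is an assumption the tool owner would have to make.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ToolContract:
    name: str
    mutating: bool                          # read-only or mutating?
    owner_service: str                      # who owns the tool?
    operation_key_fields: Tuple[str, ...]   # which fields form the operation key?
    receipt_field: str                      # which external receipt proves success?
    retry_safe_statuses: FrozenSet[str]     # which statuses are safe to retry?
    compensation: Optional[str]             # what undoes a wrong effect?
    ledger_ttl: timedelta                   # how long does the ledger entry live?

    def can_retry(self, ledger_status: str) -> bool:
        """Read-only tools can always retry; writes only from a safe status."""
        return (not self.mutating) or ledger_status in self.retry_safe_statuses


refund_contract = ToolContract(
    name="refund",
    mutating=True,
    owner_service="payments",
    operation_key_fields=("tenant_id", "payment_id", "reason_code"),
    receipt_field="refund_id",
    retry_safe_statuses=frozenset({"planned", "failed"}),  # assumed policy
    compensation="reverse_refund",
    ledger_ttl=timedelta(days=90),
)
```

An `in_progress` row is deliberately not retry-safe here: the side effect may already have happened, so the runtime has to look for a receipt before doing anything else.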

This is where MCP and other tool packaging efforts need to grow up. A production tool interface has to be agent-operable: typed, permissioned, inspectable, retryable, and owned by a service. That is the real product, and it is a far cry from an interface that merely lets an agent discover and call a tool.

A tool registry that simply says a tool exists is table stakes. A registry that says a write tool mutates customer billing, requires workload identity, lists the operation-key fields, emits a specific external receipt, and pages the service owner on ambiguous completion starts to look like production infrastructure.

Boring. Also useful.

The Runtime Should Refuse Unsafe Writes

Ledger policies for mutating tools run the show.

Read-only tools (retrieval, ranking, summarization, classification) remain lightweight. Write tools charge cards or email customers, and they follow a different set of rules. For write tools the runtime should require a ledger policy before registration. The tool owner supplies the operation-key builder, receipt parser, retry rules, and compensation metadata. The runtime supplies the reservation, status transitions, trace joining, and audit events. The rest of the orchestration layer checks the side-effect ledger before running the tool and after it fails. The eval harness tests the duplicate paths for the tool. The on-call team can see stuck in_progress rows before the customers do.
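
Requiring a ledger policy before registration is a small check at the registry boundary. The sketch below uses hypothetical names (`ToolRegistry`, `REQUIRED_POLICY_KEYS`) and treats the ledger policy as a plain dictionary holding the four parts the tool owner supplies.

```python
# The four parts the tool owner supplies, per the paragraph above.
REQUIRED_POLICY_KEYS = {
    "operation_key_builder",
    "receipt_parser",
    "retry_rules",
    "compensation",
}


class ToolRegistry:
    def __init__(self):
        self.tools = {}

    def register(self, name, func, mutating=False, ledger_policy=None):
        """Register a tool, refusing write tools without a complete ledger policy."""
        if mutating:
            missing = REQUIRED_POLICY_KEYS - set(ledger_policy or {})
            if missing:
                raise ValueError(
                    f"write tool {name!r} is missing ledger policy parts: {sorted(missing)}"
                )
        self.tools[name] = {
            "func": func,
            "mutating": mutating,
            "ledger_policy": ledger_policy,
        }
```

Read-only tools register with no extra ceremony, which keeps the lightweight path lightweight.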

LangGraph Agent Error Handling in Production makes a related point: handling errors in tools called by an agent is more than handling the exceptions raised when the tool is called. The side effects that occur before the error is surfaced, especially around a timeout, are the real problem the error handling has to address. The ledger is where the system goes looking for evidence.

That last point matters. Agents can keep going after an error has occurred. But in production, continuing can be reckless.

Own the Receipt

The gold rush version of AI agent orchestration wants better planners, bigger context windows, and more tools. Fine. Those help.

The production version needs a boring table that answers whether a tool call already did the thing.

That table won't demo well. Nobody cheers for a simple unique index on (tool_name, operation_key). But that's exactly what this table is. And it will save a team from having to refund, email, provision, delete and apologize (for the mysterious model) twice.

The model can be probabilistic. The side-effect boundary cannot.

Own the receipt.

Agentic AI Implementation Runs Through Change Control | Focused Labs

Austin Vance — Sun, 17 May 2026 21:13:16 +0000

Agentic AI implementation has been badly mis-sold. People compare it to software enablement, and that comparison breaks as soon as the agent can change a workflow.

The agent approves a refund, opens an incident, updates a customer record, begins onboarding for a new customer, or escalates a support ticket. At that point a training calendar and a Slack message are not enough for a rollout plan.

It needs a change record.

Enterprise AI adoption has a naming problem. ‘Adoption’ of work gets viewed through the same lens as software ‘usage’. So the rollout is framed in terms of seats, office hours, and examples of how to format a prompt, and then everyone waits for it to kick in. But the work actually gets executed through an agent, and that agent changes a workflow.

The system has entered the process.

Microsoft's 2026 Work Trend Index frames this shift as an operating-model problem. WorkLab analysis finds that employees may be ready for AI, while the systems around work are not. Agents that approve requests, open incidents, and change customer records sit inside those systems.

That changes the implementation roadmap.

The Rollout Surface Changed

Agents behave differently from a chat tool. An agent is released through a system.

ServiceNow announced Action Fabric at Knowledge 2026, explicitly opening its governed system of action to agents. The MCP Server gives agents access to workflows, playbooks, approvals, catalog requests, and business rules, all of which run through identity verification, granted permissions, and audit trails.

The enterprise agent problem shows up when an agent moves from the edge of a process, summarizing work already done, to inside the process, making a move.

The first question for the enterprise is no longer "who should have access to this tool" but "what change is this tool going to drive for the business, and who is going to own that change". Ownership here spans the teams that run the production systems, regulatory compliance, promises to customers, incident response, and the overall economics of the workflows the agent is inserted into.

The reality of the enterprise is well captured in the preview for LangChain's Interrupt 2026: the initial excitement of proving agents in production quickly gives way to questions about the team, tooling, and infrastructure required to support agents that are no longer ‘proof-of-concept’ work. My experience with clients has been the same: excitement with the first useful agent, overlapping work with the second, and ownership problems with the third.

Fine. That is the good version.

The bad version is quiet. A team enables an agent with a service account, an admin token, and a dashboard that nobody looks at. It looks good during the demo. Then a field name changes in a source system, a policy document drifts, an approval queue gets renamed, a customer edge case surfaces, and the agent keeps moving. Nobody owns the change because nobody treated the agent as a change.

The rollout path gets safer when every promotion carries evidence, scope, and a rollback owner.

The Change Record Is the Agent Spec

Atlassian describes IT change management as planning, reviewing, approving, and deploying changes to services with as little disruption as possible. Boring. Also the right object.

Agentic AI needs the same boring object.

A change record should specify which human role loses or gains work, which systems the agent can interact with, which actions require approval, which actions are forbidden, which metrics define harm, which traces prove behavior, and which owner can roll back changes made by the agent when something goes wrong.
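One way to make that record concrete is a small typed object that refuses to pass review when a field is empty. This is a minimal sketch, not a standard schema: the class name, every field name, and the `autonomous` set (actions the agent may take without asking) are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentChangeRecord:
    """Hypothetical change record for one agent promotion."""
    agent: str
    roles_affected: dict[str, str]      # human role -> "gains" or "loses" work
    systems: tuple[str, ...]            # systems the agent may interact with
    autonomous: frozenset[str]          # assumed field: actions allowed without approval
    approval_required: frozenset[str]   # actions that need a human approval
    forbidden: frozenset[str]           # actions the agent must never take
    harm_metrics: tuple[str, ...]       # metrics that define harm
    trace_sources: tuple[str, ...]      # where behavioral evidence lives
    rollback_owner: str                 # who can undo the agent's changes

    def problems(self) -> list[str]:
        """Return reasons the record is not ready for review (empty if ready)."""
        issues = []
        if not self.rollback_owner.strip():
            issues.append("no rollback owner")
        if not self.harm_metrics:
            issues.append("no harm metrics")
        if not self.trace_sources:
            issues.append("no trace evidence")
        overlap = self.approval_required & self.forbidden
        if overlap:
            issues.append(f"actions both gated and forbidden: {sorted(overlap)}")
        return issues

    def decide(self, action: str) -> str:
        """Classify an action; anything the record does not list is denied."""
        if action in self.forbidden:
            return "deny"
        if action in self.approval_required:
            return "needs_approval"
        if action in self.autonomous:
            return "allow"
        return "deny"
```

Denying unlisted actions is a design choice in this sketch: it keeps the record, rather than the incident review, as the place where the agent's permissions get discovered.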

Rather than going straight to a typical roadmap of discovery, pilot, platform choice, training, and rollout, I would put a change-control spine through each step of that typical roadmap.

Discovery should map workflows rather than list all the cool things an AI can do, so that “summarize account notes” and “renew an enterprise contract” land in different risk classes. Pilot work should run in a sandbox that is production-like in its data and failure handling. Limited rollout should constrain the agent’s authority first, before the agent is given to more people. Production should have a clear owner, a defined retention period for the agent’s traces so that performance can be evaluated, and a clear path to resolve an incident.

This keeps the agent’s actual permissions from being discovered during an incident review.

Embedding service ownership into an organization’s way of working mitigates these implementation dangers through contracts between teams, a sandboxed deployment, and an appropriate rollout sequence. The AI team owns what it knows best: the evaluation harness, the evals, model routing, and deployment mechanics. The business process owner owns the workflow semantics. Security owns the permission envelope, operations owns production response, and legal or compliance owns the consequences of non-compliance.

Shared ownership is annoying. So is production.

This is why I keep harping on service ownership for agent work. LangGraph for enterprise agent development made the runtime version of this point. Production agents have operational contracts. A clever graph is not enough. It can fall apart after the first model swap, policy change, or integration outage.

The change record is the handoff object between business process, agent runtime, security, and operations.

The Metrics Already Exist

No need for another exotic agent scorecard. The software delivery world already has the basic bones. DORA's software delivery metrics track change lead time, deployment frequency, failed deployment recovery time, change fail rate, and deployment rework rate.

Mapped to agent work:

Change lead time: time from proposing an agent behavior to approving it for production.
Deployment frequency: rate of safe promotions to production, such as changes to a tool registry, policy pack, memory schema, retrieval index, or workflow.
Failed deployment recovery time: time to reverse an agent change, such as reverting a prompt or policy, removing a granted permission, or switching back to a previous workflow.
Change fail rate: percentage of agent changes that require intervention.
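These four numbers fall out of a plain log of agent changes. The sketch below assumes a hypothetical `AgentChange` record with proposal, promotion, and recovery timestamps; the field names and the choice of median (rather than mean) are illustrative assumptions, not DORA's definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class AgentChange:
    proposed_at: datetime
    promoted_at: datetime
    needed_intervention: bool = False
    recovered_at: Optional[datetime] = None  # set once a failed change was reversed


def _median_delta(deltas):
    """Median of a list of timedeltas, or None when the list is empty."""
    seconds = [d.total_seconds() for d in deltas]
    return timedelta(seconds=median(seconds)) if seconds else None


def change_lead_time(changes):
    """Median time from proposing a behavior to promoting it."""
    return _median_delta([c.promoted_at - c.proposed_at for c in changes])


def deployment_frequency(changes, window_days):
    """Promotions per day over the observation window."""
    return len(changes) / window_days


def failed_recovery_time(changes):
    """Median time from promotion to reversal, over changes that failed."""
    return _median_delta([
        c.recovered_at - c.promoted_at
        for c in changes
        if c.needed_intervention and c.recovered_at is not None
    ])


def change_fail_rate(changes):
    """Share of changes that required intervention."""
    if not changes:
        return 0.0
    return sum(c.needed_intervention for c in changes) / len(changes)
```

What counts as "needed intervention" is the hard part, and it is a judgment the change record has to carry; the arithmetic is the easy bit.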

This would all be nice and clean if an agent’s behavior failed in a binary way, like an exception being thrown. But it does not. It produces a technically correct answer that just happens to be wrong in the context of the workflow. Which is why the failure is behavioral, not binary, and is invisible to a deployment platform that only knows how to scream when a process fails to start.

So the metric needs evidence.

The production agent rollout should collect traces of every decision (tool calls, approval steps, and so on), rejected actions (for example, because of insufficient privileges), user-corrected mistakes, and any eval failures. Tie business outcomes to the same release story. Without that evidence, the change board is approving “stuff” with a slightly nicer UI.

This is where Everybody Tests comes in. Testing cannot be relegated to downstream QA when an agent can affect a live workflow. Product, engineering, operations, security, and enterprise systems teams should be able to run the test. Ideally, they should understand it, too. The eval suite tests behavioral regressions. Traces reveal runtime drift. Approval logs expose authority escalation. Business metrics surface harm the model never sees.

All of them are part of the change.

The Roadmap Is a Promotion Ladder

Start with read-only assistance. The agent assists with summarization, search, templates, classification, and process explanation. That finds workflow fit and failure modes without giving the system authority to act.

Next, the team gradually grants more permission inside well-defined boundaries. Completing low-dollar refunds, updating internal tickets, sending non-regulated customer messages, changing low-risk account fields, deploying to test environments. The goal is to prove bounded authority before scope expands.

This promotion path pays for itself by preventing a business process from being secretly screwed by an AI that nobody can explain.

Make each step on the promotion ladder concrete. Human-in-the-loop needs a named reviewer, a review surface, override power, correction capture, and a rule for when the agent stops asking. Same for guardrails, observability, and governance. Each word should collapse to an owner, system, threshold, and audit trail.
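A promotion gate can be sketched as a function that returns the reasons a promotion is blocked. Everything here is an assumption for illustration: the rung names (the third rung in particular is hypothetical), the evidence fields, and the 0.95 pass-rate threshold.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rung(IntEnum):
    READ_ONLY = 1        # summarize, search, classify, explain
    BOUNDED_ACTIONS = 2  # low-risk writes inside explicit limits
    EXPANDED_SCOPE = 3   # hypothetical later rung with wider authority


@dataclass
class PromotionEvidence:
    eval_pass_rate: float
    reviewer: str        # named human reviewer
    rollback_owner: str  # who reverses the promotion
    audit_trail: str     # where the evidence is stored


def promotion_blockers(current, target, evidence, min_pass_rate=0.95):
    """Return the reasons a promotion cannot proceed (empty list means go)."""
    blockers = []
    if target != current + 1:
        blockers.append("promote exactly one rung at a time")
    if evidence.eval_pass_rate < min_pass_rate:
        blockers.append("eval pass rate below threshold")
    for name in ("reviewer", "rollback_owner", "audit_trail"):
        if not getattr(evidence, name).strip():
            blockers.append(f"missing {name}")
    return blockers
```

Returning reasons instead of a boolean keeps the gate explainable: a blocked promotion says which owner, threshold, or trail is missing.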

McKinsey's 2026 AI trust survey is useful here because it separates adoption from maturity. Strategy, governance, and controls for agentic AI remain the weak spots. Security and risk concerns remain the main barrier to scaling. Which tracks.

Boring. Beautiful.

Own the Change

As long as an organization treats an enterprise AI agent like another tool to be spread to more people with the same enthusiasm, the implementation will fail shortly after its first collisions with the organization’s permission models, its customers’ reporting structures, its compliance requirements, its process exceptions, and its sheer number of customers.

I have no particular interest in helping to recreate CAB theater for enterprise agents. A meeting with eight or more approvers for a password reset workflow they cannot even understand is a huge waste of time and effort. Yes, review is reasonable in regulated paths, but that should be the exception, not the rule. And it should be as trivial and technical as possible, ideally close to where the work is actually being done: in this case, a simple approval in the workflow UI.

Put the agent change record next to the PR, the eval report, the trace sample, the permission diff, and the rollback plan. Have the workflow owner sign the semantics; security sign the authority; engineering sign the runtime; and operations sign the incident path.

Then ship.

That is what an AI implementation roadmap needs now: a promotion path for systems that can act.

Production always gets weird.

Agent Benchmark Scores Are Measuring the Harness, Not the Model | Focused Labs

Austin Vance — Sun, 17 May 2026 21:13:13 +0000

The difference between the leading agentic coding models is much smaller than the difference between two distinct configurations of a single model on the same benchmark. Anthropic just quantified it: a six-percentage-point gap on Terminal-Bench 2.0 between the most- and least-resourced setups, p < 0.01. Same model. Same task set. Same harness. The only variable was the resource budget given to the pod.

This is larger than the spread between most frontier models on the public leaderboard.

The number the enterprise picked as "the best agent model" is mostly the amount of CPU and RAM that the eval team assigned to the pod for the test. Welcome to production.

The benchmark is not what the benchmark claims to measure

Static evals score a model's output directly. Agentic coding evals score a model in a runtime, and the runtime itself decides whether a container gets OOM-killed for a transient memory spike, whether a pip install command finishes, whether a test subprocess ever returns a result. Two agents at different resource budgets will be taking different tests.

Anthropic ran Terminal-Bench 2.0 across six resource configurations, from strict enforcement of the per-task specs all the way to completely uncapped. They observed 5.8% of tasks failing on pod errors unrelated to model capability at strict enforcement, compared to 0.5% at uncapped. Success scores at 1x through 3x were largely within noise (p=0.40), since the agent was going to fail those tasks anyway. Past 3x, however, success scores climbed faster than infra errors declined. The extra headroom gave the agent room to attempt approaches that only work with more generous allocations, such as installing several large packages at once, running memory-hungry test suites, or spawning subprocesses that take extra time to complete.

The benchmark shifted. Previously it was measuring how capable the model was. Now it is measuring how much budget the harness gives the agent to brute-force the answer.

This is not a bug in Terminal-Bench. It is the nature of agentic evaluation: the runtime is not a passive container, it is an active part of the problem-solving process.

When the benchmark does not include the exact hardware and resource configuration, it ships a number that can't be compared to anyone else's number. Nobody is measuring the same thing.

The model is mostly plumbing

Harrison Chase has been making a variant of this argument for about a year. The agent is not the model. The agent is the harness, memory, tools, prompts, retries, state machines, guardrails, and context windows, with a model call buried somewhere in there.

The Anthropic data is the experimental confirmation of the harness sitting at the heart of the agent. Flip the pod resource limits and the "same" agent is a different agent inhabiting a wildly different reality. Flip the sandbox provider and the same leaderboard score means a completely different thing. The vast majority of the decisions that go into building an agent are about tuning the harness.

Anna Bernad posted a Twitter thread last week after looking at 36 production agent harnesses. Her take is far sharper than mine.

"Every harness I studied that actually ships does the same underlying move, and guess, it's not separation. It's making the context describe a different room."

If the context reads as "teammate shipped work, I'm the reviewer, pipeline wants green," the agent soft-approves with a minor note. Not because the model is bad. The agent is trying to fit the response to the context, and soft approval is the only way to complete the pattern.

The harness is the room. The model is the tenant.

What this does to enterprise procurement

By the time a client engages us, agent performance has usually deviated from what the benchmark led them to expect. The model selected for the job is sound. The harness the model is made to operate in is what holds the application back. The runtime may not give the tools sufficient compute to act effectively. The retry mechanism built to improve throughput actually masks critical errors until it is far too late. The context window is being consumed by boilerplate system prompts the procurement team didn't know existed.

The enterprise then concludes "AI doesn't work for us" and abandons the effort. The model vendor is blamed. Nobody audits the scaffold.

Vendor benchmark claims don't have to be disbelieved by default, but they become pure marketing once they are packaged as an "eval score" for buyers to compare vendors. If the score is only reproducible on the vendor's Kubernetes cluster, with their sandboxing solution and their machine resources, it has no procurement value.

The LangSmith Signal report this week puts billions of agent runs behind the month's trends. Anthropic grew 73% in users, gaining 39% of share. Gemini rose after the release of Gemini 3. OpenAI remained the largest at around 80% of volume but didn't move up or down. Those are usage numbers, not capability numbers. People are moving around based on what actually works in their harness, not based on what a leaderboard says.

How to read a benchmark

Three questions, in order.

The first question is what the harness actually was. If the eval team doesn't publish the scaffold, retry policy, context budget, tool set, and resource configuration tradeoffs, the number is a picture of one run on their box and not comparable to anything.

Second: what is the infra error rate? Anthropic reported 5.8% of Terminal-Bench 2.0 tasks failing on pod errors at strict enforcement, a 5x margin above the spread between most frontier models. An eval that doesn't separate "model failed" from "container got killed" introduces a lot of noise in the headline number.

Third: does my production environment resemble the eval environment? If the eval runs uncapped on a data-center GPU cluster, the score is going to have almost no predictive value for me, since my agent runs in a sandboxed environment such as a Lambda function with a 512MB memory cap. An agent can win the competition by brute-forcing the space of scikit-learn installs and then fail silently at ship time because it consumes too much memory in the production environment. A lean, efficient agent that loses the benchmark will ship just fine.
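The second question is easy to operationalize if each task run is tagged with an outcome. A minimal sketch, assuming hypothetical outcome labels (`pass`, `model_fail`, `oom_killed`, `pod_error`, `infra_timeout`); real harnesses will label these differently.

```python
from collections import Counter

# Assumed labels for failures caused by the runtime rather than the model.
INFRA_OUTCOMES = frozenset({"oom_killed", "pod_error", "infra_timeout"})


def summarize(outcomes):
    """Split a list of per-task outcome labels into headline and infra-adjusted rates."""
    counts = Counter(outcomes)
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    infra = sum(counts[label] for label in INFRA_OUTCOMES)
    passed = counts["pass"]
    scored = total - infra  # tasks where the model actually got a fair attempt
    return {
        "headline_pass_rate": passed / total,
        "infra_error_rate": infra / total,
        "pass_rate_excluding_infra": passed / scored if scored else 0.0,
    }
```

Reporting both pass rates side by side shows how much of the headline number is the container, not the model.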

What to do instead

Build the harness first. Run the model last.

The analysis has to translate to production. Production tools. Production retry budget (or lack thereof). Production memory store. Production prompt scaffolding. Production runtime limits. Wire it up with observability that traces trajectories through the system, not individual LLM calls. Then swap different models in and see what changes.


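# Shape of an internal model bake-off in 2026.
# LangChain 1.x, LangGraph 1.1.9, LangSmith.
# PRODUCTION_TOOLS, PRODUCTION_SYSTEM_PROMPT, PROD_INTERRUPT_ON, ProductionContext,
# and the evaluators below are placeholders for the team's own production objects.
# Keyword names follow the LangChain 1.x docs; check them against the installed version.

from langchain.agents import create_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware, PIIMiddleware
from langsmith import Client
from langsmith.evaluation import evaluate

CANDIDATES = [
    "anthropic:claude-opus-4-7",
    "openai:gpt-5.1-pro",
    "google:gemini-3-pro",
]

def build_agent(model: str):
    # Same tools, same prompt, same retry budget, same memory store.
    # The ONLY variable is the model string.
    return create_agent(
        model=model,
        tools=PRODUCTION_TOOLS,
        system_prompt=PRODUCTION_SYSTEM_PROMPT,
        middleware=[
            # One PIIMiddleware per PII type; email redaction shown as an example.
            PIIMiddleware("email", strategy="redact"),
            # PROD_INTERRUPT_ON maps tool names to approval rules.
            HumanInTheLoopMiddleware(interrupt_on=PROD_INTERRUPT_ON),
        ],
        context_schema=ProductionContext,
    )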

client = Client()
dataset = client.read_dataset(dataset_name="production-trajectories-q2")

for model_id in CANDIDATES:
    agent = build_agent(model_id)
    evaluate(
        lambda inputs: agent.invoke(inputs),
        data=dataset,
        evaluators=[
            trajectory_match,       # compares actual tool-call path to reference
            tool_call_precision,    # did the agent use the right tool at the right time
            final_output_rubric,    # LLM-as-judge on the end state
        ],
        experiment_prefix=f"harness-bakeoff-{model_id}",
        max_concurrency=8,
    )

All tests run using the same harness, the same tools, one variable at a time. The goal is to select the model that actually works within the production stack, not the one that earned points on a public leaderboard running on a Kubernetes cluster someone else had tuned.
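The bake-off above leaves the evaluators undefined. As one plausible reading of the first two, here is a sketch of the scoring logic as plain functions over tool-call names. The exact-match rule and the multiset precision are assumptions, and wiring these into LangSmith's evaluator interface is left out.

```python
from collections import Counter


def trajectory_match(actual, reference):
    """1.0 when the tool-call sequence equals the reference sequence, else 0.0."""
    return 1.0 if list(actual) == list(reference) else 0.0


def tool_call_precision(actual, reference):
    """Fraction of the agent's tool calls that the reference also contains.

    Calls are matched as a multiset, so a repeated call only counts as
    correct as many times as it appears in the reference.
    """
    actual = list(actual)
    if not actual:
        return 1.0 if not list(reference) else 0.0
    remaining = Counter(reference)
    hits = 0
    for name in actual:
        if remaining[name] > 0:
            remaining[name] -= 1
            hits += 1
    return hits / len(actual)
```

Exact match is deliberately strict. A looser variant (order-insensitive, or allowing extra read-only calls) may fit a given workflow better, and that choice is itself a harness decision.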

This is where the engineering work lives now, and it is why a lot of clients call us. The model picker is not the problem. The harness design is the problem. The eval infrastructure is the problem. The trajectory observability is the problem.

The harder truth

Tight resource limits tend to favor agents that are simple and efficient, the ones that can write lean code quickly. Generous limits favor agents that can exploit whatever resources are available. Both are worth testing for, and both correspond to realistic scenarios. Neither can fairly be collapsed into a single number on a leaderboard.

Many of the agents we deploy to enterprises run on some sort of strict budget for resources such as memory and CPU. Beyond these general limits, there are often specific restrictions on things like subprocess runtime and the number of times an API can be called within a window, largely because of cost. The model that wins with unlimited resources is a different model than the one that wins under strict limits.

Pick the model that performs in the harness. Own the harness. Measure the trajectory. The benchmark is not the product.

The harness is the product.

AI Agent Authentication Starts With Workload Identity | Focused Labs

Austin Vance — Wed, 13 May 2026 14:55:56 +0000

AI agent authentication starts when the system can answer which actor is allowed to make a tool call.

The model can propose the action. The runtime has to attach authority to it.

Most teams start with the fastest answer: an API key in an environment variable. The agent reaches Salesforce, GitHub, Jira, Snowflake, Stripe, whatever system makes the first useful proof feel real, and everyone moves on.

That proof matters. It shows the agent can reach the systems where work actually happens. It also hides the first product decision: who is acting when the tool call leaves the runtime?

The agent gets memory. The agent runs in the background. The agent forks into subagents. The agent retries failed operations. The agent calls tools after the user has walked away. The agent lands in an enterprise workflow where the work has value, the logs have value, and breaking something has a consequence.

A shared API key starts as configuration. Then it quietly becomes the identity of the agent.

An ugly place to stumble into by accident.

The secret becomes the actor

Early security models for agents tend toward good vibes with a bearer token. The prompt gives instructions. The tool schema lists calls. Hard-coded secrets in the runtime decide what actually gets done based on the input, the agent, and whatever authority those secrets carry.

The secret wins.

If the same key can read every customer record, submit refunds, update tickets, and write to production data, the agent has all of those powers. Carefulness in the prompt is theater at that point. The tool description can say those powers apply only when appropriate. The audit log will still show one credential able to perform a pile of different tasks.

There is already a category for this outside agents: OWASP's Non-Human Identities Top 10. Production applications identify themselves as non-human identities. Agents are adding themselves to that growing list of stranger workloads, running differently than normal services, but still requiring access to systems and data.

The important step for me is naming the agent as a workload, because the architecture gets less magical and more useful.

Workloads have identities. Workloads can request scoped credentials for those identities. A workload can be denied a credential. A workload can rotate credentials. A workload can leave an audit trail that survives the model, the prompt, and the v2 or v3 abstraction barrier the team is currently working around.

The baseline for authenticating production AI agents: the runtime should issue tool-specific credentials instead of letting the agent carry a shared key everywhere.

Workload identity is the boring answer

This part is old. Good.

Kubernetes already considers service accounts to be identities of processes running in Pods, and the current docs describe short-lived, automatically rotating ServiceAccount tokens issued through the TokenRequest API. SPIFFE generalizes that into workload identity documents, including short-lived X.509 and JWT SVIDs that a workload can use to authenticate itself to other workloads.

Cloud platforms are heading in the same general direction. AWS STS can issue temporary security credentials after a workload has identified itself using OpenID Connect. Google Cloud Workload Identity Federation allows external workloads to access Google Cloud resources without service account keys. Azure managed identity docs describe workload identities as machine and non-human identities associated with compute resources.

The industry knows how to keep long-lived secrets out of the hot path. It just keeps giving agents interfaces that make the old mistake easy.

A developer writes a tool wrapper. The tool wrapper needs credentials. The fastest way to configure it is to add an API key to an environment variable and add a TODO to remove it later. The TODO gets pushed to production because now the agent answers support tickets, reconciles invoices, or looks at CI.

I've worked with teams who reviewed the model, tuned prompts, drew diagrams for tool selection, created a few secrets in deploy config, and crossed their fingers that the tool descriptions would shore it all up.

They are not enough.

Delegation is the missing primitive

In many applications, the agent should rarely hold the credential it uses to act.

Put an identity assertion in the flow. This agent. This tenant. This user context if present. This policy version. This tool request. This approval state. That assertion is exchanged for a credential only when the action needs one.

OAuth was designed to support exactly this shape. RFC 8693 defines token exchange, describing how one temporary credential can be exchanged for another temporary credential intended for a different context. In the agent case, the model proposes an action, the runtime checks policy, the broker issues a credential for that action and tool context, the call happens, and the credential dies.
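For reference, this is roughly what the exchange request looks like on the wire. The parameter names and the `urn:ietf:params:oauth:...` values come from RFC 8693; the helper name, the example audience, and the scope string are hypothetical, and the sketch only builds the form body rather than calling any token endpoint.

```python
from urllib.parse import parse_qs, urlencode

# Identifiers defined by RFC 8693.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def token_exchange_body(workload_jwt, audience, scope):
    """Build the form-encoded body of an RFC 8693 token exchange request."""
    return urlencode({
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": workload_jwt,            # the runtime's identity assertion
        "subject_token_type": JWT_TOKEN_TYPE,
        "requested_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,                     # the one tool this credential is for
        "scope": scope,                           # the one action being requested
    })


# Hypothetical values: a refund tool asking for a narrowly scoped credential.
body = token_exchange_body(
    "<workload-jwt>", "https://payments.internal.example", "refunds:create"
)
fields = parse_qs(body)
```

The body is sent as `application/x-www-form-urlencoded` to the authorization server's token endpoint. Narrowing `audience` and `scope` per call is what keeps the returned credential tied to one tool and one action.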

It does not expire after a quarter. It does not expire after someone remembers to rotate it. It expires because the system puts expiration in the path.

That changes the damage pattern. A compromised tool wrapper no longer implies broad access to every downstream system. A prompt injection has to cross approval, run, tenant, and policy boundaries. A subagent that escapes its execution boundary cannot reuse credentials after the run, approval, or tenant context has expired.

The agent is still useful. It just has to query through a production boundary that understands production concerns.

This is why integrated agents are valuable and dangerous at the same time. The valuable integrated agents do not live in a chatbot tab. They integrate with real systems. Once an agent is tied to real systems, authentication becomes product architecture rather than cleanup work hidden in deployment.

The runtime owns the identity boundary

A model provider should not own this boundary. A prompt should not own this boundary. A tool schema should not own this boundary.

The runtime owns it because the runtime follows the whole path.

It connects agent definitions to threads or runs, tenants, and identity information, including the user who initiated the work, whether the work is backgrounded, whether a human approved a risky step, which tool is being called, and which downstream credential is being requested. It can attach those facts to an identity assertion and make a policy decision before any assertion leaves the process.

That policy decision can be boring and explicit:

The refund tool can request a payment credential for the current tenant.
A GitHub tool can request a write credential after CI has produced an eval pass.
The Snowflake tool can request a read credential for one warehouse, one role, and one time window.
A subagent can run with a delegated identity, but only with fewer capabilities than the parent run.

The list is not impressive, which is why it is powerful.
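Rules like these fit in one small function. A minimal sketch under stated assumptions: the `CredentialRequest` fields, the tool names, and the default-deny rule for unknown tools are all illustrative, and the single-warehouse, single-role, time-window limits on the Snowflake rule are left out for brevity.

```python
from dataclasses import dataclass

KNOWN_TOOLS = frozenset({"refund", "github", "snowflake"})


@dataclass(frozen=True)
class CredentialRequest:
    tool: str
    access: str                    # "read" or "write"
    tenant: str                    # tenant the credential is requested for
    run_tenant: str                # tenant the current run belongs to
    eval_passed: bool = False      # did CI produce an eval pass?
    is_subagent: bool = False
    requested_capabilities: frozenset = frozenset()
    parent_capabilities: frozenset = frozenset()


def decide(req):
    """Return (allowed, reason) for a credential request."""
    if req.tool not in KNOWN_TOOLS:
        return False, "unknown tool"
    if req.tenant != req.run_tenant:
        return False, "tenant mismatch"
    # A subagent must hold strictly fewer capabilities than its parent run.
    if req.is_subagent and not req.requested_capabilities < req.parent_capabilities:
        return False, "subagent capabilities not narrower than parent"
    if req.tool == "github" and req.access == "write" and not req.eval_passed:
        return False, "eval gate not green"
    if req.tool == "snowflake" and req.access != "read":
        return False, "snowflake credentials are read-only"
    return True, "ok"
```

Returning the reason alongside the decision means the same string can go into the trace, which is where an incident review will look for it.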

This is also where multi-agent orchestration gets serious. A supervisor handing work to a subagent creates a delegation relationship along with the task description. The child process needs enough authority to perform the work at hand and no more. The audit log must reflect that chain of trust cleanly or troubleshooting becomes an exercise in futility.

The worst setup is a swarm of agents all sharing the same service account. Simple enough to get going. Terrible when it comes time to debug an incident. Every action has been performed by the same principal, authenticated with the same key, and observed through the same useless blur.

The incident has no useful actor. Just a shared key with a long memory and no accountability.

Short-lived delegated credentials make the agent run, policy decision, tool call, and audit trail line up.

Audit follows identity

Agent observability without identity is half a story.

A trace for the agent step called refund_customer can include latency, tool arguments, model output, retries, all visualized in a convenient span tree. Useful. Then someone asks who had authority to issue that refund, and the trace turns into archaeological excavation.

The right trace shows the tool call connected to a principal. Not just a service account. A principal with an agent ID, run ID, tenant, user context, policy decision, credential scope, and expiration time.
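In practice that means flattening the principal into attributes on the tool-call span. A minimal sketch: the `Principal` class, the `policy_version` field, and the `principal.` attribute prefix are assumptions, and the function returns a plain dict rather than calling any particular tracing SDK.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Principal:
    agent_id: str
    run_id: str
    tenant: str
    user_context: Optional[str]      # None for background runs with no user
    policy_version: str
    policy_decision: str
    credential_scope: str
    credential_expires_at: datetime


def span_attributes(principal):
    """Flatten a principal into string-valued attributes for a trace span."""
    attrs = {}
    for key, value in asdict(principal).items():
        if value is None:
            continue  # omit absent context instead of recording a fake value
        if isinstance(value, datetime):
            value = value.isoformat()
        attrs[f"principal.{key}"] = value
    return attrs
```

With these attributes on the `refund_customer` span, the question of who had authority becomes a trace query.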

This is what allows a team to answer questions after the tool call has done real work.

Who granted access? What user context did it use? What broker generated the credential? What version of policy allowed it? What downstream resource accepted it? What subagent inherited it? Can that credential be used for something else?

Those questions determine whether there is a real postmortem or just hand waving about the agent doing something weird.

The same principle applies to testing. In Everybody Tests, I argued that every team already tests whether they admit it or not. Agent identity needs that same honesty. If a runtime can create delegated credentials, tests should verify that the boundary holds. A refund agent should fail against the wrong tenant. A code agent should fail when eval gates are red. A research agent should fail when it asks for write access to a system it only reads.

Not with a one-off command someone ran by hand. Test it in CI.

Shared keys hide product decisions

The fastest credential story hides the decisions that matter most.

A shared key hides tenancy. It hides user context. It hides the identity of the agent performing an action. It hides which subagent inherited authority. It hides whether approval was granted. It hides whether the action matched the original request. It hides rotation until rotation becomes an outage.

OWASP's secrets management guidance recommends dynamic secrets where possible to reduce credential reuse and limit the damage when credentials leak. Agent systems need the same pressure, with the additional constraint that the credential must represent the run instead of only the application.

A normal backend service is expected to behave predictably and follow a reliable lifecycle. It accepts requests, implements endpoints, and changes through controlled deployments. An agent runtime for integration automation can select different tools per request, execute work in subagents, retry steps, and continue running after initial user interaction has completed.

So identity has to be more exact.

The credential loaned to the system should assert what it is currently allowed to do. The operating policy should be visible enough to understand the motivation behind the action. The audit trail must persist long enough for a human to traverse the events as they happened.

A boundary-based platform does not need a full rewrite. Start with one boundary.

Put an identity broker between the agent runtime and the first high-risk tool. Give the agent runtime a workload identity. Have the broker exchange that identity for a tool credential. Associate the decision with tenant, run, and operation. Record the policy decision in the trace. Add a CI test that proves the wrong tenant fails. Expire the credential quickly. Make the failure visible when the broker returns no.
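The first boundary can be small. Below is an in-memory sketch of such a broker, assuming a single rule (the requested tenant must match the run's tenant) and a hypothetical `Broker`/`Credential` API. A real deployment would verify the workload identity cryptographically and exchange it with an actual token service.

```python
import secrets
import time
from dataclasses import dataclass


class Denied(Exception):
    """Raised when the broker refuses to issue a credential."""


@dataclass(frozen=True)
class Credential:
    token: str
    tool: str
    tenant: str
    run_id: str
    expires_at: float

    def is_valid(self, now):
        return now < self.expires_at


class Broker:
    def __init__(self, ttl_seconds=60.0, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so tests can control time
        self.audit = []      # every decision is recorded, including denials

    def issue(self, *, workload, run_id, run_tenant, tool, tenant):
        allowed = tenant == run_tenant
        self.audit.append({
            "workload": workload, "run_id": run_id, "tool": tool,
            "tenant": tenant, "decision": "allow" if allowed else "deny",
        })
        if not allowed:
            raise Denied(f"run {run_id} may not act for tenant {tenant}")
        return Credential(
            token=secrets.token_urlsafe(16),
            tool=tool, tenant=tenant, run_id=run_id,
            expires_at=self.clock() + self.ttl,
        )
```

A CI test against this shape asserts three things: the right tenant gets a credential, the wrong tenant raises `Denied`, and the credential stops being valid once the TTL has passed. Injecting the clock keeps that last check fast and deterministic.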

Then move the next tool behind the boundary.

The production line

AI agent authentication is the control plane for non-human actors who do work across systems.

Ownership matters here. Security cannot retroactively add this after the agent and its resources have shipped. Platform cannot stash it in a vault path. Product cannot mark it as a checkbox in consent. Identity, delegation, expiration, and audit have to be inherent in the runtime of the agent and how it executes.

The agent should actually be able to act. That is, after all, why we are doing AI agency in the first place. That agency should have a workload identity.

Production systems have already worked out parts of the problem. Kubernetes, SPIFFE, OAuth token exchange, cloud workload federation, managed identities, dynamic secrets. They exist because static secrets rot and shared principal accounts make a bad situation worse.

It is a mistake to grant agents an exemption because the interface is conversational.

The model can decide on the next step. The runtime decides whether that step gets a credential.