<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Austin Vance</title>
    <description>The latest articles on DEV Community by Austin Vance (@austinbv).</description>
    <link>https://dev.to/austinbv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F305023%2Fc978f899-9fa5-4b1e-9ad9-e3f60313fd65.jpeg</url>
      <title>DEV Community: Austin Vance</title>
      <link>https://dev.to/austinbv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/austinbv"/>
    <language>en</language>
    <item>
      <title>Enterprise AI Agents Have a Control Plane Now | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 25 Jun 2026 22:39:16 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/enterprise-ai-agents-have-a-control-plane-now-focused-labs-4enb</link>
      <guid>https://dev.to/focused_dot_io/enterprise-ai-agents-have-a-control-plane-now-focused-labs-4enb</guid>
      <description>&lt;p&gt;Cheap to create, expensive to manage.&lt;/p&gt;

&lt;p&gt;Now that they are embedded in Microsoft 365, Google Cloud, ServiceNow, Slack, data warehouses, support queues, and custom applications, enterprise AI agents have become an estate that someone has to operate. The market is increasingly focused on operating agents rather than on the small trick of getting an agent to respond in a chat. I just read about &lt;a href="https://learn.microsoft.com/en-us/microsoft-agent-365/overview" rel="noopener noreferrer"&gt;Microsoft Agent 365, which is framed around observing, governing, and securing agents through a unified registry&lt;/a&gt;. &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform" rel="noopener noreferrer"&gt;Google announced Gemini Enterprise Agent Platform&lt;/a&gt; around four verbs: build, scale, govern, and optimize. &lt;a href="https://www.servicenow.com/uk/blogs/2026/ai-summit-how-overcome-4-ai-barriers" rel="noopener noreferrer"&gt;ServiceNow is talking about AI Control Tower&lt;/a&gt;. &lt;a href="https://www.langchain.com/blog/introducing-langsmith-fleet" rel="noopener noreferrer"&gt;LangChain describes LangSmith Fleet&lt;/a&gt; as the management layer for ownership, authentication, auditing, sharing, and permissions.&lt;/p&gt;

&lt;p&gt;The layer to care about is the control plane above the builder.&lt;/p&gt;

&lt;h2&gt;
  
  
  The builder layer is no longer enough
&lt;/h2&gt;

&lt;p&gt;For the first phase of enterprise AI, the focus was on building AI agents quickly. A platform team could create a vendor intake workflow agent in minutes. A data team could create an agent that reads from a warehouse with a click of a button. A consulting team could create a research agent within an afternoon and deploy it to Slack in minutes. The focus now moves to who owns an agent, whose credentials it uses to access systems, what systems it touches, and what happens when the owner leaves, the workflow changes, the prompt drifts, the cost spikes, or the tool that the agent was built for gets deprecated.&lt;/p&gt;

&lt;p&gt;As agent creation gets cheap, the problem of managing agents does not go away by itself. A business team describes a workflow. A platform team wraps a Jira action. A data team grants read access to a warehouse. A consulting team builds the Slack research bot. The thing that looked hard last quarter becomes a thing a team can ask for before lunch.&lt;/p&gt;

&lt;p&gt;Who owns the agent?&lt;/p&gt;

&lt;p&gt;Whose credentials does it use?&lt;/p&gt;

&lt;p&gt;What systems can it touch?&lt;/p&gt;

&lt;p&gt;What happens when the owner leaves, the workflow changes, the prompt drifts, the cost spikes, or the tool gets deprecated?&lt;/p&gt;

&lt;p&gt;It turns out that agent programs are more like semi-autonomous workers than ordinary applications. They have memory and operating instructions. They access and manipulate data through APIs. They work through human collaboration surfaces, which means agent behavior spreads across app state, permissions, approvals, and the frontend runtime surfaces of applications. We have written before about &lt;a href="https://focused.io/lab/enterprise-ai-agents-are-leaving-the-server" rel="noopener noreferrer"&gt;enterprise agents leaving the server&lt;/a&gt;. The same problem shows up one level higher. The enterprise needs a single place to name, manage, govern, monitor, and eventually retire the operating agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F2wpnzitfzjtqbxtb8u8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F2wpnzitfzjtqbxtb8u8w.png" alt="Architecture diagram showing multiple enterprise AI agents governed by a shared management control plane with registry, owner, identity, policy, observability, cost, approvals, and decommissioning." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The control plane is where agent creation turns into an operating model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The market has already named the missing layer
&lt;/h2&gt;

&lt;p&gt;Another way to look at the control plane is that it is the administrative surface for the set of enterprise AI agents: registry, owner, identity, policy, and all the boring bits that follow. &lt;a href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ai-agents/governance-security-across-organization" rel="noopener noreferrer"&gt;Microsoft's Cloud Adoption Framework says every agent must be observable, governed, and secure&lt;/a&gt;. Leaders have to know the AI agents in the organization, who owns them, what they do, and which ones should be stopped. That is an operating model before it is a prompt-engineering checklist.&lt;/p&gt;

&lt;p&gt;Agent 365 turns these ideas into a surface for administering agents. Register. Manage. Permissions. Policies. Reviews. Entra. Purview. Defender. The builder of the agent becomes another object managed inside the enterprise by Agent 365's administrative surface.&lt;/p&gt;

&lt;p&gt;Google is moving Vertex AI into the same broader control-plane view. The &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform" rel="noopener noreferrer"&gt;Gemini Enterprise Agent Platform announcement&lt;/a&gt; details Agent Identity, Agent Registry, Agent Gateway, runtime, memory, sandboxing, runtime monitoring, and governance. Build and connect sit in the interface, while govern, optimize, and monitor are the verbs that make it an enterprise platform.&lt;/p&gt;

&lt;p&gt;ServiceNow approaches the problem from operations. Shadow AI, adoption problems, inefficiencies at scale, and fragmented data all have to be addressed. No surprise that the AI governance vendor also describes how to manage AI. AI Control Tower allows IT and business leaders to see what has been deployed, review usage of models and associated skills, and ask whether the work is aligned to company strategy.&lt;/p&gt;

&lt;p&gt;LangChain's Fleet post starts from the mess created when agent programs become easy to create. The hard part becomes who owns which agents, how they authenticate across tools, who can audit what the agents are doing, and how a good agent gets shared safely. Same shape again. Registry. Identity. Permissions. Audit. Sharing. The control plane is emerging because the builder layer has finally reached critical mass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sprawl is the failure mode
&lt;/h2&gt;

&lt;p&gt;I argue that enterprise AI fails as an unmanaged worker estate: semi-autonomous workers with tool access, ambiguous purposes, expanding scopes, and stale owners. These enterprise AI agents have shared identities, little or no audit trail, silent cost growth, overlapping jobs, and no clear way to be retired or decommissioned. One embarrassing misstatement by a chatbot is what everyone sees; the unmanaged worker estate with tool access is what has to be governed.&lt;/p&gt;

&lt;p&gt;A governance maturity paper calls this &lt;a href="https://arxiv.org/html/2604.16338v1" rel="noopener noreferrer"&gt;agent sprawl: redundant, ungoverned, conflicting agents across business functions&lt;/a&gt;. A healthcare lifecycle paper describes the regulated version: duplicated agents, unclear accountability, inconsistent controls, tool permissions that persist beyond the original use case, and decommissioning tied to credential revocation and audit logging. The control-plane layers that travel across both domains are the useful part: identity registry, mediation, bounded context, runtime policy, lifecycle, decommissioning, credentials, audit logging.&lt;/p&gt;

&lt;p&gt;Governance follows the execution path. &lt;a href="https://focused.io/lab/ai-agent-governance-follows-the-execution-path" rel="noopener noreferrer"&gt;That earlier piece&lt;/a&gt; matters because written policies cannot possibly know that the vendor-intake AI has just gained email capabilities through a new tool. A launch approval from three months ago does not know that a different region is using the same sales-research agent. A static spreadsheet will never know that two teams have built agents to calculate renewal risk with different scoring.&lt;/p&gt;

&lt;p&gt;Agents act through paths. Therefore, the control plane has to live near those paths.&lt;/p&gt;

&lt;p&gt;As an alternative to building a single massive platform for every enterprise AI problem, the control plane can be assembled from basic data structures and operational systems already lying around: identity, a registry, a gateway, observability data, CI processes, runtime policy, and existing platform management data. A company could run that inside Microsoft, Google, ServiceNow, LangSmith, or a custom internal platform. Importantly, someone in the organization needs to own the control plane, the inventory of agents, and the action boundary for that inventory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identity turns management into live work
&lt;/h2&gt;

&lt;p&gt;Agent identity is where the conversation starts to become concrete.&lt;/p&gt;

&lt;p&gt;Microsoft's Agent 365 sharing docs detail three forms of access: delegated access, app agent access, and an agent with its own user identity. The third one is spicy. &lt;a href="https://learn.microsoft.com/en-us/microsoft-agent-365/share" rel="noopener noreferrer"&gt;Agents with their own identity can be added to Teams, Outlook, Office documents, SharePoint, and OneDrive&lt;/a&gt;. The agent can accumulate access over time and receive responses based on the agent's full access unless guardrails exist.&lt;/p&gt;

&lt;p&gt;That is a runtime boundary.&lt;/p&gt;

&lt;p&gt;An agent with persistent identity is a workforce of one that should be &lt;a href="https://focused.io/lab/ai-agent-authentication-workload-identity" rel="noopener noreferrer"&gt;managed like a workload&lt;/a&gt;. The agent has a purpose, an owner, a scope, credentials that need to move through a lifecycle, and a pile of boring audit questions that need answers after the agent does something useful or dumb. Who made the change? Who invoked the agent? What policy allowed the tool call? What data did the agent have access to? What trace shows how the agent arrived at that decision? What kill switch can be flipped at 2:00 a.m.?&lt;/p&gt;

&lt;p&gt;Also key to our view of AI inside the enterprise is the notion that &lt;a href="https://focused.io/lab/ai-agent-governance-runs-before-the-tool-call" rel="noopener noreferrer"&gt;policy has to run at the action boundary&lt;/a&gt;. Policy that runs after the action leaves the team doing a post mortem. The lock has to be on the write operation before it occurs, whether the action is a ticket transition, a file move, a payment operation, a database query, or a customer email.&lt;/p&gt;
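
&lt;p&gt;To make the action boundary concrete, here is a minimal sketch of a gate that runs before a tool function is invoked. Everything in it is an assumption for illustration: the &lt;code&gt;Action&lt;/code&gt; shape, the &lt;code&gt;WRITE_ALLOWED&lt;/code&gt; allow-list, and the tool names are hypothetical, and a real policy engine would be far richer.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    agent_id: str  # runtime identity of the agent (hypothetical naming)
    tool: str      # hypothetical tool name, e.g. "jira.transition"
    effect: str    # "read" or "write"


class PolicyDenied(Exception):
    """Raised before the tool runs, so there is nothing to undo."""


# Hypothetical allow-list of (agent, tool) pairs that may perform writes.
WRITE_ALLOWED = {("vendor-intake", "jira.transition")}


def is_allowed(action: Action) -> bool:
    # Reads pass in this sketch; writes must be explicitly allowed.
    if action.effect == "read":
        return True
    return (action.agent_id, action.tool) in WRITE_ALLOWED


def guarded_call(action: Action, tool_fn: Callable[[], str]) -> str:
    # The policy check runs first; tool_fn is only invoked on allow.
    if not is_allowed(action):
        raise PolicyDenied(f"{action.agent_id} may not call {action.tool}")
    return tool_fn()
```

&lt;p&gt;The design choice that matters is ordering: a denied write never reaches &lt;code&gt;tool_fn&lt;/code&gt;, so there is no post mortem to run for it.&lt;/p&gt;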

&lt;p&gt;Truly managing AI agents, as with any workload, means managing the chain from identity to action to evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lifecycle is longer than creation
&lt;/h2&gt;

&lt;p&gt;Control planes make everything after creation visible.&lt;/p&gt;

&lt;p&gt;The work must become boring after creation: register the agent, assign the owner, determine the purpose, bind identity, approve tools and data sources, deploy the agent through known and managed deployment paths, observe behavior in the runtime environment, review cost and usage, update evidence of changes, suspend the agent when it behaves badly, retire it when the workflow ceases to exist, and revoke the credentials. Boring is the point. Boring will survive growth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fp0987jmnjy5d96th25d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fp0987jmnjy5d96th25d1.png" alt="Swimlane diagram showing the lifecycle of an enterprise AI agent from creation through registration, identity binding, policy approval, deployment, observation, update, suspension, and retirement." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creation is the easy row. The control plane owns the rest of the lifecycle.&lt;/p&gt;
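
&lt;p&gt;One way to picture that lifecycle is as a small state machine. The sketch below is illustrative only: the state names, the allowed transitions, and the rule that credentials are live only while deployed are my assumptions, not any vendor's model.&lt;/p&gt;

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    REGISTERED = "registered"
    DEPLOYED = "deployed"
    SUSPENDED = "suspended"
    RETIRED = "retired"


# Allowed transitions (an assumption for this sketch); anything else is rejected.
TRANSITIONS = {
    State.CREATED: {State.REGISTERED},
    State.REGISTERED: {State.DEPLOYED, State.RETIRED},
    State.DEPLOYED: {State.SUSPENDED, State.RETIRED},
    State.SUSPENDED: {State.DEPLOYED, State.RETIRED},
    State.RETIRED: set(),
}


class AgentRecord:
    def __init__(self, name: str, owner: str):
        self.name = name
        self.owner = owner
        self.state = State.CREATED
        self.credentials_active = False
        # History is never cleared, so evidence survives suspension and retirement.
        self.history = [State.CREATED]

    def move_to(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.state = new_state
        self.history.append(new_state)
        # Credentials are live only while deployed; suspending or retiring revokes them.
        self.credentials_active = new_state is State.DEPLOYED
```

&lt;p&gt;Retirement is terminal here, which forces a retired agent to be re-created and re-registered rather than quietly switched back on.&lt;/p&gt;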

&lt;p&gt;Microsoft's manage-agents guidance translates create and register into enterprise words: integrate, manage, operate, standardize, secure, comply, retire. &lt;a href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ai-agents/integrate-manage-operate" rel="noopener noreferrer"&gt;Agents have to move from isolated pilots into managed assets with deployment, operation, standardization, cost control, security, compliance, and retirement&lt;/a&gt;. Without that oversight, Microsoft warns about shadow AI proliferation, budget overruns, and unused agents expanding the attack surface.&lt;/p&gt;

&lt;p&gt;This pattern predates AI. Cloud, SaaS, RPA, and Kubernetes follow a similar lifecycle. First, there is the initial thrill of having an accelerator. Later, as the bill arrives, the speed is replaced by naming, owners, policies, access, monitoring, incident response, and lifecycle management. Agents add the nastier bit: the managed objects can reason about tasks, call tools, and maintain context between tasks.&lt;/p&gt;

&lt;p&gt;No mysterious new governance discipline for AI. Normal operating discipline is enough for a new type of actor in the runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would look for
&lt;/h2&gt;

&lt;p&gt;When I evaluate an enterprise AI agent platform now, I care less about the first five minutes and more about month five.&lt;/p&gt;

&lt;p&gt;Does it list all deployed agents and show the owner of each agent? Does it keep builder identities separate from runtime identities? Can it list the tools, data sources, channels, and memory stores each agent can interact with? Before tool calls are made, can it enforce the right policies? Can it show receipts for the actions performed by the agents, instead of a pile of logs? Can it connect usage and cost to a business workflow? Can it suspend an agent without deleting evidence of what the agent did when running? Can it later retire that agent and revoke credentials without a scavenger hunt?&lt;/p&gt;
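
&lt;p&gt;A few of those questions can be asked of even a crude registry. The following is a rough sketch, assuming a hypothetical &lt;code&gt;RegistryEntry&lt;/code&gt; record with made-up field names; it flags agents with no owner, agents that run as their builder, and agents that share a runtime identity.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RegistryEntry:
    name: str
    owner: Optional[str]     # None means nobody answers for this agent
    builder_identity: str    # who created the agent
    runtime_identity: str    # what the agent runs as
    tools: list[str] = field(default_factory=list)


def audit(entries: list[RegistryEntry]) -> dict[str, list[str]]:
    """Return findings per agent name; agents with no findings are omitted."""
    findings: dict[str, list[str]] = {}
    seen_runtime: dict[str, str] = {}  # runtime identity -> first agent using it
    for entry in entries:
        problems = []
        if not entry.owner:
            problems.append("no owner")
        if entry.runtime_identity == entry.builder_identity:
            problems.append("runs as its builder")
        if entry.runtime_identity in seen_runtime:
            problems.append(f"shares identity with {seen_runtime[entry.runtime_identity]}")
        else:
            seen_runtime[entry.runtime_identity] = entry.name
        if problems:
            findings[entry.name] = problems
    return findings
```

&lt;p&gt;If producing the input list for a check like this is itself a scavenger hunt, that is the finding.&lt;/p&gt;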

&lt;p&gt;A single control plane is still an early product claim. Do not get suckered by a vendor screenshot of a control plane. Match that against the actual operating model. It is entirely possible to compose existing tools (IAM, CI, policy management, distributed tracing, deployment metadata) into a rough control plane for AI agents. The real question is whether the estate has &lt;a href="https://focused.io/lab/agentic-ai-implementation-change-control" rel="noopener noreferrer"&gt;change records and rollback owners&lt;/a&gt; for agents that are doing real work for the business.&lt;/p&gt;

&lt;p&gt;This also changes the buyer question. The weak question is, "How fast can this tool create an agent?" Speed matters, but it is table stakes now.&lt;/p&gt;

&lt;p&gt;The better question is what the agent does after it starts acting for the business.&lt;/p&gt;

&lt;p&gt;Registry. Owner. Identity. Tools. Data. Channels. Runtime evidence. Cost. Updates. Suspension. Retirement. If these elements exist, the organization has the germ of an operating model. A prompt. A Slack channel. A shared API key. A dashboard that nobody ever looks at. If that is all that exists, the organization has a drifting agent estate.&lt;/p&gt;

&lt;p&gt;Enterprise AI agents are not waiting for a management layer. Microsoft, Google, ServiceNow, LangChain, and the research community are circling this primitive. The builder creates the agent. The control plane decides whether the agent belongs in live systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>MCP Security Starts After Tool Approval | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 24 Jun 2026 04:35:06 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/mcp-security-starts-after-tool-approval-focused-labs-48b3</link>
      <guid>https://dev.to/focused_dot_io/mcp-security-starts-after-tool-approval-focused-labs-48b3</guid>
      <description>&lt;p&gt;Approving an MCP server once for production is the first step in securing MCP. The real danger comes after that when the surface that the model is interacting with changes slowly but fundamentally.&lt;/p&gt;

&lt;p&gt;A read-only customer lookup tool becomes an export tool. A database helper adds a required raw-SQL parameter. A local file search tool starts calling an external API. Same server. Same connection. Same green check from the original approval screen.&lt;/p&gt;

&lt;p&gt;The agent does not remember the approval granted last Tuesday. All it sees at runtime is the current tool description, the current schema, the current return shape, and the current affordance (i.e. what the tool allows the model to do). The agent then acts accordingly.&lt;/p&gt;

&lt;p&gt;This is the runtime security problem that MCP is walking into.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool metadata is runtime authority
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25/server/tools" rel="noopener noreferrer"&gt;official MCP server tools specification&lt;/a&gt; explicitly calls out that tools are model-controlled, i.e. the language model can automatically discover and use tools based on context and prompts. It defines &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; for clients, as well as &lt;code&gt;notifications/tools/list_changed&lt;/code&gt; for servers that notify clients when the available tool list changes.&lt;/p&gt;

&lt;p&gt;That small protocol shape matters. A tool definition is no longer something a team documents next to an API. It is the API’s documentation brought to life, teaching the model about the world of actions the API supports.&lt;/p&gt;

&lt;p&gt;The description tells the model the intent of the tool. The input schema tells the model how to use the tool to ask a question. The annotations tell the host and user interface how the tool behaves. Finally, the response from &lt;code&gt;tools/call&lt;/code&gt; becomes evidence for the agent's next step. Using a framework like LangChain, a team can for example use &lt;code&gt;MultiServerMCPClient.get_tools()&lt;/code&gt; to fetch all the MCP tools for a client, and then pass them to &lt;code&gt;create_agent()&lt;/code&gt;, as explained in the &lt;a href="https://docs.langchain.com/oss/python/langchain/mcp" rel="noopener noreferrer"&gt;current LangChain MCP guide&lt;/a&gt;. This matters because a changed tool surface now maps directly to changed executable behavior of the agent.&lt;/p&gt;

&lt;p&gt;I like MCP because it makes the work of integrating different things look similar. As Focused argued in a recent article, &lt;a href="https://focused.io/lab/salesforce-mcp-turns-crm-integration-into-an-agent-runtime-problem" rel="noopener noreferrer"&gt;MCP turns integration into runtime work&lt;/a&gt;. A shared shape for integration work does not automatically make that runtime safe. But it does make it worth securing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approval is a snapshot
&lt;/h2&gt;

&lt;p&gt;Admission-time controls answer the narrow question of whether it is safe to connect to a given server at the moment of admission. They cover the server’s identity, its trust roots, signature information, the registry it comes from, its scopes, and the consent decisions made by a human administrator. These are good decisions to make.&lt;/p&gt;

&lt;p&gt;The fresher question is whether, at the time of the call, the tool being executed still falls within the capability surface the team approved for that server, long after the admission-time checks have passed.&lt;/p&gt;

&lt;p&gt;This issue came up recently in an MCP community discussion (May 2026) about &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/2826" rel="noopener noreferrer"&gt;runtime tool drift detection after admission-time security&lt;/a&gt;. Admission-time security checks verify the server’s identity, that it sits inside the correct trust boundary, and so on. After the connection is made, however, a tool on that server could change from read-only to fully mutating, add new PII data classes, and so on, while the server’s identity stays exactly the same.&lt;/p&gt;

&lt;p&gt;OWASP calls this type of attack a ‘rug pull’: after an MCP server has been approved, the server changes its tool definitions. The &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/MCP_Security_Cheat_Sheet.html" rel="noopener noreferrer"&gt;OWASP MCP Security Cheat Sheet&lt;/a&gt; identifies the problem and also lists the broader MCP attack surface, including tool poisoning, tool shadowing, confused deputy, over-scoped tokens, replay attacks, and even sandbox escapes. The cheat sheet suggests hashing or pinning tool definitions and alerting on changes to them, including their descriptions, parameter names, parameter types, and return schemas. That is the basis for sane MCP security practice for this category of problem. Treat the entire tool definition as security-sensitive material.&lt;/p&gt;
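
&lt;p&gt;Pinning can be as simple as hashing a canonical form of the tool definition at approval time and comparing it later. This is a minimal sketch rather than a reference implementation: the choice of which fields count as the security-relevant surface is my assumption, using the field names I understand the MCP tools spec to use (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;inputSchema&lt;/code&gt;, &lt;code&gt;outputSchema&lt;/code&gt;, &lt;code&gt;annotations&lt;/code&gt;).&lt;/p&gt;

```python
import hashlib
import json


def surface_hash(tool: dict) -> str:
    """Hash the security-relevant fields of a tool definition.

    Canonical JSON (sorted keys, fixed separators) keeps the hash stable
    across key order and whitespace differences.
    """
    surface = {
        "name": tool.get("name"),
        "description": tool.get("description"),
        "inputSchema": tool.get("inputSchema"),
        "outputSchema": tool.get("outputSchema"),
        "annotations": tool.get("annotations"),
    }
    canonical = json.dumps(surface, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(pinned_hash: str, live_tool: dict) -> bool:
    # True when the live definition no longer matches what was approved.
    return surface_hash(live_tool) != pinned_hash
```

&lt;p&gt;A hash only says that something changed. Deciding how much the change matters is a separate classification step.&lt;/p&gt;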

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9gwbnpz73790rh9bpl5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9gwbnpz73790rh9bpl5w.png" alt="Architecture diagram showing admission-time MCP tool approval compared with runtime drift verification before a tool call." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Approval is a snapshot. Runtime security checks the tool that actually ran.&lt;/p&gt;

&lt;h2&gt;
  
  
  The runtime has to carry a capability manifest
&lt;/h2&gt;

&lt;p&gt;An approved capability manifest is a useful primitive for &lt;a href="https://focused.io/lab/ai-agent-governance-runs-before-the-tool-call" rel="noopener noreferrer"&gt;runtime governance before the tool call&lt;/a&gt;. That primitive is not just paperwork to be approved before deploying a server to the environment. It is an artifact that the runtime can compare with the actual server and tool surface before executing any tool call.&lt;/p&gt;

&lt;p&gt;The practical details of what goes into the capability manifest can be worked out later. A simple way to look at it is as a list of fields corresponding to points of drift. The description field steers the model. The input and output schema fields open up new paths. The effects field changes the nature of a lookup. The data-class field turns a metadata call into a personal-data call. All of these fields must be approved by a human, but not every change deserves the same response: a new optional description field should not require a page to the CISO, while a new required input parameter, external destination, sensitive data class, or declared effect should quarantine the tool until its surface has been reconciled with the approved surface for that tool.&lt;/p&gt;

&lt;p&gt;The runtime surface of the tools under management is then compared against the manifest for the server prior to execution, as early as when a new tool definition is added to the managed tool set. Does the new surface look the same as the approved manifest, or is it a different shape to be classified?&lt;/p&gt;

&lt;p&gt;Classify it. A cosmetic change to the name of a tool will likely generate little interest from the CISO or the security team. A new optional string field, however, may require review by the approval team. A new required parameter, external destination, or sensitive data class, or an increase in the declared effects of a tool (say, one that currently acts as a read-only metadata lookup feeding a log or dashboard), should quarantine the tool until a human has reconciled its surface.&lt;/p&gt;
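
&lt;p&gt;A rough sketch of that classification follows. The manifest keys (&lt;code&gt;required&lt;/code&gt;, &lt;code&gt;optional&lt;/code&gt;, &lt;code&gt;effects&lt;/code&gt;, &lt;code&gt;data_classes&lt;/code&gt;, &lt;code&gt;destinations&lt;/code&gt;) are hypothetical, and treating a changed description as review-worthy is my own assumption, made because the description steers the model.&lt;/p&gt;

```python
from enum import Enum


class Severity(Enum):
    COSMETIC = 0
    REVIEW = 1
    QUARANTINE = 2


def classify_drift(approved: dict, live: dict) -> Severity:
    """Classify the difference between an approved manifest entry and the live surface.

    Meant to be called once a change has already been detected. Both dicts
    use hypothetical keys whose values are lists, plus a "description" string.
    """
    def added(key: str) -> set:
        # Values present in the live surface that were never approved.
        return set(live.get(key, [])) - set(approved.get(key, []))

    # Anything that widens what the tool can do or touch blocks the tool.
    if any(added(k) for k in ("required", "effects", "data_classes", "destinations")):
        return Severity.QUARANTINE
    # New optional inputs open new paths, but can wait for human review.
    if added("optional"):
        return Severity.REVIEW
    # Assumption: a changed description steers the model, so it gets reviewed too.
    if approved.get("description") != live.get("description"):
        return Severity.REVIEW
    return Severity.COSMETIC
```

&lt;p&gt;The ordering is deliberate: the most severe rule is checked first, so a change that is both cosmetic and capability-widening is never downgraded.&lt;/p&gt;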

&lt;p&gt;This is where &lt;a href="https://focused.io/lab/ai-agent-governance-runs-before-the-tool-call" rel="noopener noreferrer"&gt;runtime governance before the tool call&lt;/a&gt; is more than just policy. The runtime will either allow the call, deny the call, prompt a human for input, or isolate the tool until the server's live surface has been reconciled with the approved surface.&lt;/p&gt;

&lt;p&gt;Auditing of tool capabilities before deployment is still relevant. In &lt;a href="https://arxiv.org/html/2603.21641v1" rel="noopener noreferrer"&gt;Auditing MCP Servers for Over-Privileged Tool Capabilities&lt;/a&gt; (the “mcp-sec-audit” paper), the authors describe a protocol-aware toolkit for auditing the code and the metadata of MCP server capabilities. The paper covers static auditing rules as well as dynamic sandboxing using Docker and eBPF. In contrast to runtime drift detection, this kind of audit is meant to be conducted before a server enters the environment where it will be used. Runtime drift detection then takes over as those audit reports go stale.&lt;/p&gt;

&lt;h2&gt;
  
  
  List changes are not enough
&lt;/h2&gt;

&lt;p&gt;There is already a signal in the current MCP specification: &lt;code&gt;notifications/tools/list_changed&lt;/code&gt;. It is useful because it allows a server to notify the client that the list of available tools has changed, as described in the &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25/server/tools" rel="noopener noreferrer"&gt;tools capability section of the spec&lt;/a&gt;. Use it for what it is: a signal, not a security control.&lt;/p&gt;

&lt;p&gt;Similar ideas have already been brought up regarding &lt;a href="https://focused.io/lab/stop-eager-loading-mcp-tools" rel="noopener noreferrer"&gt;lazy MCP tool loading&lt;/a&gt;. Loading the description of every single approved tool into the context window has already been argued to be detrimental to model focus. From a security standpoint it is even worse: each approved but unused tool is not just sitting in the context window, it can be prompted into authority. Instead, load the smallest set of tools relevant to the task, and then bind each call to each of those tools to the current evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-call receipts beat trust vibes
&lt;/h2&gt;

&lt;p&gt;A security team cannot debug an incident from a single screenshot of an approval dialog (as cool as that may be). They need the actual call record.&lt;/p&gt;

&lt;p&gt;I would also look at per-call signed records. The GitHub discussion points strongly in that direction: input, outcome, effects, authorization decision, risk determination, and a hash chain back to the decision made for that call. A drift detector generates recomputable evidence: approved surface hash, current surface hash, classified delta, policy decision, and observed effect.&lt;/p&gt;
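
&lt;p&gt;As an illustration of what a recomputable, chained record could look like, here is a minimal sketch using an HMAC for the signature and a SHA-256 hash that links each receipt to the previous one. The receipt fields and the shared-key signing are assumptions for the example; a real deployment might well prefer asymmetric signatures and managed keys.&lt;/p&gt;

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first receipt


def _canonical(record: dict) -> bytes:
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_receipt(chain: list[dict], key: bytes, body: dict) -> dict:
    """Append a signed receipt whose hash covers the previous receipt's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = {"prev": prev_hash, "body": body}
    receipt = {
        **payload,
        "hash": hashlib.sha256(_canonical(payload)).hexdigest(),
        "sig": hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest(),
    }
    chain.append(receipt)
    return receipt


def verify_chain(chain: list[dict], key: bytes) -> bool:
    """Recompute every link, hash, and signature from the stored fields."""
    prev_hash = GENESIS
    for receipt in chain:
        payload = {"prev": receipt["prev"], "body": receipt["body"]}
        expected_sig = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
        if receipt["prev"] != prev_hash:
            return False
        if receipt["hash"] != hashlib.sha256(_canonical(payload)).hexdigest():
            return False
        if not hmac.compare_digest(receipt["sig"], expected_sig):
            return False
        prev_hash = receipt["hash"]
    return True
```

&lt;p&gt;The body of each receipt is where the evidence listed above would go: approved surface hash, current surface hash, classified delta, policy decision, and observed effect.&lt;/p&gt;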

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7bm89m2t3kq7ru5fic0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7bm89m2t3kq7ru5fic0q.png" alt="Data-flow diagram showing user intent, policy decision, MCP tool call, side effects, and a signed call record feeding traces and evals." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The runtime should leave a receipt for the decision, the call, and the effect.&lt;/p&gt;

&lt;p&gt;We at Focused use the term &lt;a href="https://focused.io/lab/ai-agent-orchestration-needs-a-side-effect-ledger" rel="noopener noreferrer"&gt;side-effect ledger&lt;/a&gt; for this reason. “Log” is too weak a word to describe what we need here. A log simply reports that something has happened. A receipt, on the other hand, contains a record of decisions made, operation keys, tool versions, etc., and the effects of those decisions, plus a record of how to undo any harm caused by incorrect decisions.&lt;/p&gt;

&lt;p&gt;This leads directly into the next point. As we have argued before, &lt;a href="https://focused.io/lab/agent-traces-need-to-cross-the-mcp-boundary" rel="noopener noreferrer"&gt;trace evidence across the MCP boundary&lt;/a&gt; is required, because an agent trace that stops at &lt;code&gt;tools/call&lt;/code&gt; hides the crucial information that production teams need to investigate: the MCP call, the downstream API span, the policy decision, and the returned data class. All of this information must be included in the same investigation path.&lt;/p&gt;

&lt;p&gt;The Honeycomb production feedback framing also makes sense here. As Austin Parker writes in &lt;a href="https://www.honeycomb.io/blog/your-questions-about-ai-agents-production-feedback-answered" rel="noopener noreferrer"&gt;Honeycomb's AI agents and production feedback Q&amp;amp;A&lt;/a&gt;, the term drift can be used for two things: first for agents that deviate from their intended functionality, and second for agents that silently produce wrong output while the output still looks correct. To detect this kind of drift, the observability platform needs to be able to observe the agent's interactions with the outside world. For MCP tool drift, this outside-world interaction itself has changed shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP security belongs in the runtime
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://redcanary.com/blog/threat-detection/mcp-ai-workflows" rel="noopener noreferrer"&gt;The double-edged sword of MCP&lt;/a&gt; is blunt about the ownership line. The protocol standardizes access. The runtime owns enforcement.&lt;/p&gt;

&lt;p&gt;A production MCP runtime should do five boring things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bind each approved tool to a capability manifest.&lt;/li&gt;
&lt;li&gt;Diff live tool definitions against that manifest.&lt;/li&gt;
&lt;li&gt;Score drift by effect, data class, external reach, schema shape, and credential scope.&lt;/li&gt;
&lt;li&gt;Quarantine high-severity drift before the model can use the tool.&lt;/li&gt;
&lt;li&gt;Write per-call receipts into traces, incident evidence, and eval inputs.&lt;/li&gt;
&lt;/ul&gt;
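
&lt;p&gt;The first four items can be sketched in a few lines. This is a minimal illustration, not an MCP API: the &lt;code&gt;ToolManifest&lt;/code&gt; fields, the weights, and the quarantine threshold are all assumptions chosen for the example.&lt;/p&gt;

```python
from dataclasses import dataclass


# Hypothetical manifest shape. MCP itself does not define these fields.
@dataclass(frozen=True)
class ToolManifest:
    name: str
    effect: str                 # e.g. "read" or "write"
    data_class: str             # e.g. "public", "internal", "pii"
    external_hosts: frozenset   # hosts the tool is allowed to reach
    schema_keys: frozenset      # argument names in the input schema
    credential_scope: str


# Illustrative weights only; a real scale would come from the security team.
WEIGHTS = {
    "effect": 5,
    "data_class": 4,
    "external_hosts": 4,
    "credential_scope": 3,
    "schema_keys": 1,
}
QUARANTINE_AT = 4


def diff_manifest(approved, live):
    """Return the names of the fields that differ between approved and live."""
    return [name for name in WEIGHTS if getattr(approved, name) != getattr(live, name)]


def score_drift(changed_fields):
    """Sum the weights of every changed field."""
    return sum(WEIGHTS[name] for name in changed_fields)


def should_quarantine(approved, live):
    """True when the drift score reaches the quarantine threshold."""
    return score_drift(diff_manifest(approved, live)) >= QUARANTINE_AT
```

&lt;p&gt;With these example weights, a tool whose effect flips from read to write is quarantined, while a new optional argument in the schema is only recorded as low-severity drift.&lt;/p&gt;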

&lt;p&gt;Basic, but important. MCP moves the integration surface closer to the model, so security moves closer to the call.&lt;/p&gt;

&lt;p&gt;I'm not envisioning a one-and-done approve button, nice as it might be with a prettier modal dialog. I'm envisioning something that works at runtime and can describe what an agent was allowed to do, what a tool claimed to be able to do, what actually changed, and who approved the change.&lt;/p&gt;

&lt;p&gt;That is MCP security after tool approval.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI Agent Governance Runs Before the Tool Call | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 24 Jun 2026 04:34:34 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/ai-agent-governance-runs-before-the-tool-call-focused-labs-ldk</link>
      <guid>https://dev.to/focused_dot_io/ai-agent-governance-runs-before-the-tool-call-focused-labs-ldk</guid>
      <description>&lt;p&gt;AI agent governance after the tool call is audit paperwork, not control.&lt;/p&gt;

&lt;p&gt;Even small actions have side effects: sending an email, updating a customer's details, approving a refund, updating a Salesforce opportunity, or even simply spending money from a wallet. Each of these actions can have an intended side effect and an unintended one. If the runtime does not have enough context to prevent the unintended one, the only thing left is to audit what happened.&lt;/p&gt;

&lt;p&gt;Gartner predicted in May that uniform governance applied to AI agents with different autonomy levels and access rights would cause either over-restriction or under-restriction. Its forecast is that &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2026-05-26-gartner-says-applying-uniform-governance-across-ai-agents-will-lead-to-enterprise-ai-agent-failure" rel="noopener noreferrer"&gt;by 2027, 40% of enterprises will either demote or decommission autonomous AI agents&lt;/a&gt; after discovering governance-related gaps while running them in production.&lt;/p&gt;

&lt;p&gt;It is bad enough that a read-only summarizer and an agent that modifies account data go through the same governance review. That a drafting assistant and a production operations agent are subject to the same approval rules is simply inane. Likewise, an agent that recommends a refund and an agent that actually issues the refund should be in different runtime envelopes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The action boundary is the governance boundary
&lt;/h2&gt;

&lt;p&gt;Agentic AI governance is cleanest to enforce at the action boundary, the moment when the model has proposed an action and the runtime still decides what to do with it. At that moment the runtime knows the user, the thread, the run, the attempt, the name of the proposed tool, the arguments for that tool, the prior state in the runtime's store, the approval state (what each agent has been approved to do), and the credentials available. Until the tool has run and caused a side effect, the action boundary is a clean place for governance.&lt;/p&gt;

&lt;p&gt;On agentic AI governance, Focused has already written about &lt;a href="https://focused.io/lab/ai-agent-governance-follows-the-execution-path" rel="noopener noreferrer"&gt;AI agent governance in the execution path&lt;/a&gt;. The bigger problem is the form that this control takes and in particular how identity, policy, evals, traces, approvals, and recovery can be compiled into a single decision at the point of execution.&lt;/p&gt;

&lt;p&gt;A useful contract can answer the simplest questions about an agent's actions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;May this agent call this tool for this user?&lt;/li&gt;
&lt;li&gt;Was this action part of a workflow that granted this agent a permission envelope to act?&lt;/li&gt;
&lt;li&gt;Does this action comply with the account and region level constraints?&lt;/li&gt;
&lt;li&gt;Is approval required for this action?&lt;/li&gt;
&lt;li&gt;What is the appropriate recovery if this action violates a soft constraint?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without a runtime decision, governance is just a Slack archaeological dig: people search for the prompt, the tool schema, the human approval, the corresponding trace, the Jira ticket, the diff of the last deployment, and the metric change that caused all this trouble in the first place. Everybody pretends the system had a control plane because all the artifacts were somewhere.&lt;/p&gt;

&lt;p&gt;Existence is not enforcement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1rbojx66q0il1vnzjmdu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1rbojx66q0il1vnzjmdu.png" alt="Architecture diagram showing a behavioral contract evaluator running before an agent tool call and writing a decision receipt." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The contract gate belongs before the side effect, while the runtime still has a choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Behavioral contracts give governance a runtime shape
&lt;/h2&gt;

&lt;p&gt;A growing body of academic research on governing agentic AI focuses on contracts for agents. A recent paper on &lt;a href="https://arxiv.org/html/2602.22302v1" rel="noopener noreferrer"&gt;Agent Behavioral Contracts&lt;/a&gt; represents an agent's behavior as a contract decomposed into Preconditions, hard Invariants, soft Invariants, hard/soft Governance Constraints, and Recovery mechanisms. It reports results from 1,980 sessions: 88-100% hard constraint compliance, "drift" bounded by 0.27 (cases below that bound are considered within tolerance), and average per-action processing time under 10 ms. The contracts surfaced 5.2 to 6.8 soft violations per session that uncontracted baselines would otherwise have missed.&lt;/p&gt;

&lt;p&gt;A behavioral contract turns a trace into evidence. The trace of a contracted agent can become a list of records that links to &lt;a href="https://focused.io/lab/agentic-ai-implementation-change-control" rel="noopener noreferrer"&gt;change records and permission envelopes&lt;/a&gt;, the machinery the organization already uses in production rollout. A read-only summarizer is a different thing from a production-accountable autonomous AI agent. A drafting assistant is different from a production operations agent. A refund recommender and a refund issuer each fall within a different runtime envelope.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runtime context is the missing input
&lt;/h2&gt;

&lt;p&gt;Runtime governance sits closest to the tool layer. LangChain's runtime documentation explains that agents built with &lt;code&gt;create_agent&lt;/code&gt; run on LangGraph's runtime, which makes context, store, stream writer, execution info, and server info available to tools and to middleware. The docs expose user context (e.g. authenticated user id, user email), store access, thread id, run id, attempt number, server info (e.g. server id, version), and related runtime details through the &lt;code&gt;ToolRuntime&lt;/code&gt; surface. &lt;a href="https://docs.langchain.com/oss/python/langchain/runtime" rel="noopener noreferrer"&gt;That is exactly the data a governance decision needs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Guardrails can be enabled at the agent level to intercept the start and end of execution, as well as calls to models and tools. There is pre-built, pluggable middleware, including PII detection and human-in-the-loop approval for sensitive operations that must be approved before they execute. &lt;a href="https://docs.langchain.com/oss/python/langchain/guardrails" rel="noopener noreferrer"&gt;The docs put enforcement around execution&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Open Policy Agent already &lt;a href="https://openpolicyagent.org/docs/latest/" rel="noopener noreferrer"&gt;decouples policy decision-making from enforcement&lt;/a&gt; and has a large user base across Kubernetes admission control, API gateways, service authorization, CI, and infrastructure policy.&lt;/p&gt;

&lt;p&gt;Agent governance needs to go down a similar route with agent-specific inputs. The policy decision function needs to know the proposed action: which tool the agent is about to invoke, and for which user. It needs the workflow info to determine the permission envelope for that invocation. It needs the account and region constraints for that action, and whether the action requires approval. Finally, the decision function should return more than allow or deny. It should be able to allow the action and issue a receipt (e.g. linking all change history for that record), block the action (e.g. reject the payment), attempt recovery (e.g. retry the remediation after a delay), or escalate the action to a human and resume the same thread after approval.&lt;/p&gt;
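
&lt;p&gt;As a rough sketch of such a decision function, the example below returns one of four outcomes instead of a boolean. Everything in it is hypothetical: the policy tables, the tool names, the &lt;code&gt;amount&lt;/code&gt; argument, and the order of the checks are assumptions made for illustration, and a real system might delegate the rules to a policy engine.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    RECOVER = "recover"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ProposedAction:
    agent_id: str
    user_id: str
    tool: str
    args: dict
    workflow: str
    region: str


@dataclass(frozen=True)
class Decision:
    outcome: Outcome
    reason: str


# Hypothetical policy data; in practice this would be versioned with the contract.
ENVELOPES = {"refund-workflow": {"refunds.recommend", "refunds.issue"}}
REGION_BLOCKED_TOOLS = {"eu": {"marketing.email"}}
NEEDS_APPROVAL = {"refunds.issue"}
SOFT_LIMITS = {"refunds.issue": 500}


def decide(action, approved=False):
    """Evaluate a proposed tool call before it runs."""
    if action.tool not in ENVELOPES.get(action.workflow, set()):
        return Decision(Outcome.BLOCK, "tool outside permission envelope")
    if action.tool in REGION_BLOCKED_TOOLS.get(action.region, set()):
        return Decision(Outcome.BLOCK, "region constraint")
    limit = SOFT_LIMITS.get(action.tool)
    if limit is not None and action.args.get("amount", 0) > limit:
        # Soft constraint: send the action back for revision instead of blocking.
        return Decision(Outcome.RECOVER, "soft limit exceeded")
    if action.tool in NEEDS_APPROVAL and not approved:
        return Decision(Outcome.ESCALATE, "human approval required")
    return Decision(Outcome.ALLOW, "within contract")
```

&lt;p&gt;The caller, not the policy function, is responsible for acting on the outcome, which keeps the decision separate from enforcement in the same spirit as the Open Policy Agent model.&lt;/p&gt;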

&lt;h2&gt;
  
  
  Proportional governance beats one big gate
&lt;/h2&gt;

&lt;p&gt;Uniform governance looks clean in a committee deck. All agents are reviewed the same way. They go through the same checklist. They inherit the same constraints. That is how low-risk AI agents become stupid and high-risk AI agents become dangerous: the first get buried in process, while the second get controls written for a chatbot.&lt;/p&gt;

&lt;p&gt;A proportional model for governance starts with the agent's level of autonomy and its potential blast radius, i.e. the amount of damage it can cause if left unwatched or let loose from its usual boundaries. Observers need scoped read access, with credentials and scopes set programmatically and logged, and they need to be testable. Advisors are dangerous because their outputs are taken at face value by other agents and by humans, so their output quality must be checked against a range of test cases. Approvers, the agents that approve actions for other agents, have the greatest potential for good and evil: they can be configured to approve in the absence of a human, but their decisions must then be subject to human review in turn, with audit trails and incident reports for when things go wrong. Finally, fully Autonomous agents must be constrained in real time by hard invariants, with provisions for recovery from failure, circuit breakers, owner-based escalation of problems to other agents or teams, and continuous monitoring of their performance.&lt;/p&gt;
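
&lt;p&gt;One way to make that proportionality checkable is a table from autonomy level to required controls. The sketch below is an assumption-heavy illustration: the control names are invented labels for the controls described above, and the rule that each level inherits the controls of the levels below it is a design choice of this example, not something the model above prescribes.&lt;/p&gt;

```python
from enum import IntEnum


class Autonomy(IntEnum):
    OBSERVER = 1
    ADVISOR = 2
    APPROVER = 3
    AUTONOMOUS = 4


# Hypothetical control names, one set per autonomy level.
CONTROLS = {
    Autonomy.OBSERVER: {"scoped_read_credentials", "access_logging"},
    Autonomy.ADVISOR: {"output_quality_evals"},
    Autonomy.APPROVER: {"human_review", "audit_trail", "incident_reporting"},
    Autonomy.AUTONOMOUS: {
        "hard_invariants",
        "recovery",
        "circuit_breaker",
        "owner_escalation",
        "continuous_monitoring",
    },
}


def required_controls(level):
    """Assumption: controls accumulate, so each level inherits the ones below it."""
    required = set()
    for candidate in Autonomy:
        if level >= candidate:
            required.update(CONTROLS[candidate])
    return required


def missing_controls(level, implemented):
    """Controls an agent at this level still needs before it may ship."""
    return required_controls(level) - set(implemented)
```

&lt;p&gt;A release check can then refuse to promote an agent while &lt;code&gt;missing_controls&lt;/code&gt; returns anything for its declared level.&lt;/p&gt;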

&lt;p&gt;&lt;a href="https://focused.io/lab/ai-agent-authentication-workload-identity" rel="noopener noreferrer"&gt;Workload identity&lt;/a&gt; matters here too. The contract cannot reason about authority if the tool call arrives as a generic API key sitting in a prompt or environment variable. It needs workload identity, user or actor identity, a scoped credential, and a record of why that credential was usable for this action. Without those pieces, people are back to hoping a black box behaves.&lt;/p&gt;

&lt;p&gt;Skills, MCP tools, and imported agent capabilities are executable artifacts too. &lt;a href="https://focused.io/lab/agent-skills-are-a-software-supply-chain-surface" rel="noopener noreferrer"&gt;Executable agent artifacts&lt;/a&gt; should carry provenance, version, permission scope, eval evidence, and rollback behavior. The runtime should be able to inspect those artifacts before letting them act.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5vlek4lhxr9v58xbb2cw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5vlek4lhxr9v58xbb2cw.png" alt="Matrix mapping agent autonomy levels to runtime governance controls." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Governance gets heavier as the agent gets closer to unsupervised side effects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The receipt is the product of governance
&lt;/h2&gt;

&lt;p&gt;The governance receipt is closely related to &lt;a href="https://focused.io/lab/ai-agent-orchestration-needs-a-side-effect-ledger" rel="noopener noreferrer"&gt;side-effect receipts&lt;/a&gt;, but it is produced before the primary side effect of a command's execution. The timing matters. The governance receipt records the policy decision before the mutation runs.&lt;/p&gt;

&lt;p&gt;A governance receipt should contain enough information for someone to reproduce and review the decision that accepted the proposed action, and to find the trace that resulted from it along with the related change record and incident channel. It should include: 1) a description of the proposed action, 2) the contract version(s) used to evaluate the action, 3) the runtime input(s) the contract used to make its decision, 4) the policy branch(es) and approval/recovery path(s) the contract executed to arrive at the decision, and 5) a link to the resulting trace.&lt;/p&gt;
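
&lt;p&gt;Those five fields map naturally onto a small immutable record. The following is a minimal sketch under stated assumptions: the field names are hypothetical, and the SHA-256 digest over a canonical JSON form is one possible way to make a receipt tamper-evident, not a required one.&lt;/p&gt;

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GovernanceReceipt:
    proposed_action: dict    # tool name and arguments as proposed by the model
    contract_version: str    # version of the contract that evaluated the action
    runtime_inputs: dict     # user, thread, run, attempt, approval state, ...
    policy_branch: str       # which rule produced the decision
    approval_path: str       # approval or recovery path taken, if any
    trace_url: str           # link to the resulting trace

    def digest(self):
        """Stable hash of the receipt, so later edits are detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

&lt;p&gt;Writing the digest into the trace span alongside the receipt lets a reviewer confirm that the record they are reading is the one produced before the mutation ran.&lt;/p&gt;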

&lt;p&gt;Benchling's LangChain interview is a good production reminder here. Their team runs multiple model providers on scientific tasks because provider families make different mistakes, and they review production traces with a rotating fire chief. &lt;a href="https://www.langchain.com/blog/benchling-max-agency-podcast" rel="noopener noreferrer"&gt;Trace review is part of the operating model&lt;/a&gt;, not a dashboard someone checks when everything catches fire.&lt;/p&gt;

&lt;p&gt;Honeycomb's Canvas feature turns the existing investigation interface into a collaborative workspace where teams can pin queries, charts, findings, recommended actions, hypotheses, and text summaries. Auto-investigations can be triggered from existing triggers and anomalies. &lt;a href="https://www.honeycomb.io/blog/honeycomb-canvas-multiplayer-workspace-for-agentic-era" rel="noopener noreferrer"&gt;That shared investigation space&lt;/a&gt; is much more powerful when the trace being investigated includes governance receipts instead of just the span for the tool that went wrong.&lt;/p&gt;

&lt;p&gt;When things go sideways, the receipt is what explains what happened. The customer record was updated when it shouldn't have been. A payment was issued twice. The AI-generated email was outside of policy. The autonomous operations loop kept trying to remediate a problem with a bad solution. The question will not be what the approval process for using AI was. It will be which runtime decision allowed the AI to take that action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Own the contract like production code
&lt;/h2&gt;

&lt;p&gt;AI agent governance is becoming a software artifact: it has versions and owners, it can be tested, deployed as part of a larger workload, and rolled back, and it comes with audit evidence, release processes, and incident processes. Treat it like production code.&lt;/p&gt;

&lt;p&gt;Contracts are produced for the agent runtime, and thus get treated similarly to other tool definitions. They go through review. They get tests for representative actions, regression cases for bad actions, fixtures for policy inputs, and trace assertions for receipts. They can be reverted when a contract blocks healthy work or allows unsafe work to continue.&lt;/p&gt;

&lt;p&gt;This is less glamorous than an enterprise AI governance platform slide. Good. Glamour is a terrible control surface.&lt;/p&gt;

&lt;p&gt;Everything else is paperwork unless it changes a decision before a side effect occurs.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Agent Handoffs Turn Routing Into Runtime State | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 24 Jun 2026 04:34:31 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/agent-handoffs-turn-routing-into-runtime-state-focused-labs-col</link>
      <guid>https://dev.to/focused_dot_io/agent-handoffs-turn-routing-into-runtime-state-focused-labs-col</guid>
      <description>&lt;p&gt;Agent handoffs are where multi-agent systems stop being routing diagrams and start becoming production software.&lt;/p&gt;

&lt;p&gt;A handoff, i.e. the transfer of a customer from one agent to another, typically from one specialist to another for a different question, is not trivial in practice, and it shapes the trace of that customer's conversation. The first handoff is where things go wrong: billing believes that support is still answering the same question, support believes that billing has taken over the conversation with the customer, and fraud has the customer's whole story but still needs the approval status from the first agent.&lt;/p&gt;

&lt;p&gt;The hard part of a multi-agent system is the handoff. Handoff contracts turn AI agent orchestration into runtime software worth buying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ownership moves, or it does not
&lt;/h2&gt;

&lt;p&gt;OpenAI's orchestration guide makes an important distinction between handoffs, where a specialist takes over the conversation with the customer, and agents as tools, where a manager calls other agents to arrive at a final answer while retaining responsibility for it (&lt;a href="https://developers.openai.com/api/docs/guides/agents/orchestration" rel="noopener noreferrer"&gt;OpenAI describes that split directly&lt;/a&gt;). This is the question buyers of a multi-agent system need to ask: is responsibility for the customer’s query transferred to a specialist, or does a manager retain it while using agents to gather information and do work?&lt;/p&gt;

&lt;p&gt;Further down in the architecture stack, other terms refer to the same structure. In the Microsoft Agent Framework, handoffs are modeled as a directed graph where the nodes are the agents in the system and the edges are the allowed handoffs between them. Each handoff is exposed as a synthetic tool that the first agent can call (&lt;a href="https://devblogs.microsoft.com/agent-framework/a-tour-of-handoff-orchestration-pattern" rel="noopener noreferrer"&gt;Jacob Alber’s tour of the pattern is unusually concrete&lt;/a&gt;). Amazon Bedrock refers to supervisors and collaborators: a supervisor can be configured to function as a router, in which case the collaborator can send the final answer to the customer (&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/create-multi-agent-collaboration.html" rel="noopener noreferrer"&gt;Bedrock’s setup docs spell out the two modes&lt;/a&gt;). LangChain models handoffs as state-driven behavior, where each tool updates the current step in the conversation or designates a new active agent (&lt;a href="https://docs.langchain.com/oss/python/langchain/multi-agent/handoffs" rel="noopener noreferrer"&gt;the LangChain docs use those state variables explicitly&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Different vocabulary. Same pressure point.&lt;/p&gt;

&lt;p&gt;A handoff is the point at which a multi-agent system decides who speaks next, who acts next, and what evidence from prior actions the next agent keeps. This should be treated as a runtime state change, not as a clever paragraph in a prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3ajz9r8klhbhqynvox3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3ajz9r8klhbhqynvox3v.png" alt="Comparison matrix showing how handoffs, agents as tools, and supervisor routing differ by ownership, context, tools, trace, and state." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern choice is a runtime ownership choice.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://focused.io/lab/3-things-i-learned-while-building-my-first-multi-agent-architecture" rel="noopener noreferrer"&gt;multi-agent architecture decisions&lt;/a&gt; are more than neat routing diagrams for customer calls. Choices about supervisors, specialists, routers, swarms, and handoff graphs all answer a single business question in production: who will create the next customer-visible result?&lt;/p&gt;

&lt;h2&gt;
  
  
  The context boundary is the product boundary
&lt;/h2&gt;

&lt;p&gt;LangChain documents a production landmine for handoffs inside conversation history. When a handoff uses &lt;code&gt;Command.PARENT&lt;/code&gt;, the parent history must contain the &lt;code&gt;AIMessage&lt;/code&gt; that called the tool and a matching &lt;code&gt;ToolMessage&lt;/code&gt; acknowledging the handoff. Without both messages, the receiving model has to interpret malformed history (&lt;a href="https://docs.langchain.com/oss/python/langchain/multi-agent/handoffs#context-engineering" rel="noopener noreferrer"&gt;LangChain calls out the pair in its handoff context section&lt;/a&gt;). That little detail is the abstraction. The handoff is a state update with a receipt.&lt;/p&gt;
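
&lt;p&gt;A check for that pair is easy to express as a test. The sketch below deliberately uses plain dictionaries rather than LangChain's message classes, so the &lt;code&gt;role&lt;/code&gt;, &lt;code&gt;tool_calls&lt;/code&gt;, and &lt;code&gt;tool_call_id&lt;/code&gt; keys and the &lt;code&gt;transfer_to_billing&lt;/code&gt; tool name are assumptions about how the history has been serialized, not the library's API.&lt;/p&gt;

```python
def has_handoff_pair(history, tool_name):
    """True if every AI call to tool_name has a tool message acknowledging it."""
    call_ids = {
        call["id"]
        for msg in history
        if msg.get("role") == "ai"
        for call in msg.get("tool_calls", [])
        if call.get("name") == tool_name
    }
    answered = {
        msg.get("tool_call_id") for msg in history if msg.get("role") == "tool"
    }
    # At least one handoff call must exist, and none may be left unanswered.
    return bool(call_ids) and call_ids.issubset(answered)


# Hypothetical serialized history containing both halves of the handoff.
complete_history = [
    {"role": "human", "content": "I was charged twice."},
    {
        "role": "ai",
        "content": "",
        "tool_calls": [{"id": "call_1", "name": "transfer_to_billing"}],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "Transferred to billing."},
]
```

&lt;p&gt;Running a check like this in a regression suite catches malformed history before the receiving model has to interpret it.&lt;/p&gt;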

&lt;p&gt;Context engineering research supplies useful vocabulary for the transfer boundary. The arXiv paper on context engineering lists relevance, sufficiency, isolation, economy, and provenance (&lt;a href="https://arxiv.org/abs/2603.09619" rel="noopener noreferrer"&gt;the arXiv abstract lays out the five criteria&lt;/a&gt;). Applied to a handoff, the transferred context should carry the facts relevant to the current task, enough state to complete the task at hand, isolation from competing contexts, a payload sized to limit cost and attention, and provenance, i.e. a record of where the information came from in the first place.&lt;/p&gt;

&lt;p&gt;A fuzzy boundary causes problems in both directions. In the worst case, the entire raw conversation history is passed to the next agent in the chain, including tool chatter, unfinished analysis, stale approvals, and duplicated information that the previous specialist gathered about the customer. Token cost goes up sharply, and the next specialist redoes work that was already done. At the other extreme, the summary can be too thin, and the specialist ends up asking a question the previous specialist already answered. Time disappears from the system, and it becomes hard to audit what was done by whom.&lt;/p&gt;

&lt;p&gt;The contract should be boring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;current owner&lt;/li&gt;
&lt;li&gt;allowed next owners&lt;/li&gt;
&lt;li&gt;state delta&lt;/li&gt;
&lt;li&gt;context payload&lt;/li&gt;
&lt;li&gt;tool envelope&lt;/li&gt;
&lt;li&gt;approval status&lt;/li&gt;
&lt;li&gt;checkpoint id&lt;/li&gt;
&lt;li&gt;trace span id&lt;/li&gt;
&lt;li&gt;receipt for the transfer&lt;/li&gt;
&lt;/ul&gt;
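
&lt;p&gt;The list above fits in one immutable state record plus a transfer function. This is a minimal sketch: the class, the field types, and the shape of the receipt dictionary are all hypothetical, and a real runtime would persist the checkpoint rather than just carry its id.&lt;/p&gt;

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HandoffState:
    current_owner: str
    allowed_next_owners: frozenset
    context_payload: dict
    tool_envelope: frozenset
    approval_status: str
    checkpoint_id: str
    trace_span_id: str


def transfer(state, next_owner, next_allowed, tool_envelope, state_delta, span_id):
    """Move ownership and return the new state together with a transfer receipt."""
    if next_owner not in state.allowed_next_owners:
        raise ValueError(f"{state.current_owner} may not hand off to {next_owner}")
    new_state = replace(
        state,
        current_owner=next_owner,
        allowed_next_owners=frozenset(next_allowed),
        context_payload={**state.context_payload, **state_delta},
        tool_envelope=frozenset(tool_envelope),
        trace_span_id=span_id,
    )
    receipt = {
        "from": state.current_owner,
        "to": next_owner,
        "state_delta": state_delta,
        "approval_status": state.approval_status,
        "checkpoint_id": state.checkpoint_id,
        "trace_span_id": span_id,
    }
    return new_state, receipt
```

&lt;p&gt;Because the state is frozen, the only way ownership changes is through &lt;code&gt;transfer&lt;/code&gt;, which means every change of owner leaves a receipt.&lt;/p&gt;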

&lt;p&gt;Boring is good. Boring survives incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handoff graphs are guardrails
&lt;/h2&gt;

&lt;p&gt;Microsoft’s e-commerce example shows the kind of structure production teams will want, one that supports more than simple triage routing. Billing, returns, and the returns-to-fraud escalation are explicit edges in the graph, with the last one leading to a potential fraud decision point (&lt;a href="https://devblogs.microsoft.com/agent-framework/a-tour-of-handoff-orchestration-pattern" rel="noopener noreferrer"&gt;the Microsoft blog walks through that topology&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The developer owns the topology as code. The agent owns the routing decision within that topology.&lt;/p&gt;
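
&lt;p&gt;Topology as code can be as plain as an adjacency map with a few assertions around it. The agents and edges below are a hypothetical example loosely inspired by the e-commerce scenario, not a copy of Microsoft's graph or of any framework's API.&lt;/p&gt;

```python
# Hypothetical topology: each key is an agent, each value the agents it may hand off to.
HANDOFF_GRAPH = {
    "triage": {"billing", "returns"},
    "billing": {"triage"},
    "returns": {"triage", "fraud"},
    "fraud": set(),
}


def can_hand_off(graph, source, target):
    """The agent picks the route, but only along edges the developer declared."""
    return target in graph.get(source, set())


def predecessors(graph, target):
    """Every agent that is allowed to hand off to target."""
    return {source for source, targets in graph.items() if target in targets}


def validate_graph(graph):
    """Fail fast if an edge points at an agent that was never declared."""
    unknown = {
        target
        for targets in graph.values()
        for target in targets
        if target not in graph
    }
    if unknown:
        raise ValueError(f"undeclared agents in handoff graph: {sorted(unknown)}")
```

&lt;p&gt;A unit test asserting that &lt;code&gt;predecessors(HANDOFF_GRAPH, "fraud")&lt;/code&gt; contains only the returns agent turns a routing rule into something a code reviewer can see change.&lt;/p&gt;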

&lt;p&gt;This is also where &lt;a href="https://focused.io/lab/the-agent-harness-is-the-new-lock-in-layer" rel="noopener noreferrer"&gt;the harness owns routing and state&lt;/a&gt;. The model is developed by the production team. The durable behavior lives in the layer that determines which agent is active, which tools that agent mounts, which tool envelope it uses, which approvals are pending, and which traces support a decision by that agent.&lt;/p&gt;

&lt;p&gt;The handoff graph should be reviewable in code review and testable with examples. It should generate a trail of traces: the active agent before and after each transfer, the tool call that initiated the handoff, and the tool envelope mounted after the transfer. I want to see permission changes as runtime events, for example the fraud agent gaining access to the case-management tool after the case has been moved to fraud. That is &lt;a href="https://focused.io/lab/ai-agent-governance-runs-before-the-tool-call" rel="noopener noreferrer"&gt;runtime governance before tool calls&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Otherwise, the team is watching a model politely describe the org chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F2htybs9posj06du5l1b2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F2htybs9posj06du5l1b2.png" alt="Flow diagram showing an agent handoff from user turn through handoff tool, state update, ToolMessage receipt, active agent switch, checkpoint, and trace." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The useful handoff leaves a state delta and a receipt, not just a prompt handoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traces have to show responsibility transfer
&lt;/h2&gt;

&lt;p&gt;A trace that records the calls an agent made, without recording which agent owned the turn when it made them, is an incomplete trace of a handoff.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://focused.io/lab/agent-monitoring-is-an-infrastructure-workload" rel="noopener noreferrer"&gt;agent monitoring becomes infrastructure&lt;/a&gt;. The trace of a handed-off case can now be monitored through runtime events such as active-agent changes and governance decisions. As Honeycomb’s 10-year manifesto notes, with AI-generated code the work of the software developer will increasingly move from generating software to learning from it and operating it safely (&lt;a href="https://www.honeycomb.io/blog/honeycomb-10-year-manifesto-part-1" rel="noopener noreferrer"&gt;Charity Majors makes that case in Honeycomb’s 10-year manifesto&lt;/a&gt;). That is equally true for AI agent orchestration. Monitoring transfers of work between specialists is part of using it safely and effectively.&lt;/p&gt;

&lt;p&gt;After a handoff, side effects started by the first specialist can have unexpected costs. The typical case is a refund initiated by the first agent before the case is handed over to the billing specialist. To continue the work, the billing specialist needs more than the sentence that initiated the refund. It needs the operation key, the current status, the receipt, the owner, and the retry rules. Without these, a retry, a resume, or even the work of another specialist can duplicate the side effect. Thus, &lt;a href="https://focused.io/lab/ai-agent-orchestration-needs-a-side-effect-ledger" rel="noopener noreferrer"&gt;side-effect receipts after a handoff&lt;/a&gt; must also be included in the runtime.&lt;/p&gt;
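
&lt;p&gt;The duplicate-refund case is the classic argument for keying side effects by operation. Below is a minimal in-memory sketch of that idea; the &lt;code&gt;SideEffectLedger&lt;/code&gt; class, its method names, and the receipt fields are hypothetical, and a production ledger would need durable storage and concurrency control.&lt;/p&gt;

```python
class SideEffectLedger:
    """Runs each side effect at most once per operation key."""

    def __init__(self):
        self._receipts = {}

    def run_once(self, operation_key, owner, effect):
        """Execute effect unless a receipt for operation_key already exists."""
        existing = self._receipts.get(operation_key)
        if existing is not None:
            # A retry or a second agent gets the original receipt back.
            return existing
        result = effect()
        receipt = {
            "operation_key": operation_key,
            "owner": owner,
            "status": "completed",
            "result": result,
        }
        self._receipts[operation_key] = receipt
        return receipt

    def lookup(self, operation_key):
        """Return the stored receipt, or None if the operation never ran."""
        return self._receipts.get(operation_key)
```

&lt;p&gt;If the support agent starts the refund and the billing specialist retries it after the handoff with the same operation key, the second call returns the first receipt and the refund is issued once.&lt;/p&gt;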

&lt;p&gt;The transfer receipt, the approval state, the tool envelope, and the currently active agent should all be queryable. That queryability is the difference between an operating model for routing and controlling AI calls and a pretty transcript of a conversation, although both might be useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Central controllers turn into bottlenecks
&lt;/h2&gt;

&lt;p&gt;Another common trap is to bring in a central manager for safety or control. It sits in the middle of the communication, delegates tasks, merges the results from the various sub-agents, and then provides a final answer to the customer. Control sounds safe and clean. In reality, such a manager becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;Research on decentralized multi-agent systems finds that centralized systems can run sub-agents in parallel, but that this does not produce parallel coordination on the main task, because each piece of work is routed through the main agent, which rewrites it and selectively exposes it to the other agents (&lt;a href="https://arxiv.org/html/2606.10662v1" rel="noopener noreferrer"&gt;the DeLM paper names that coordination bottleneck directly&lt;/a&gt;). The alternative uses agents, a shared verified context, and a task queue where each task item represents verified progress that other agents can build on.&lt;/p&gt;

&lt;p&gt;That does not mean every supervisor has to give way to a fully decentralized shared context by default. The handoff contract for a handoff to a supervisor still matters. If all the useful state of a support system lives in the supervisor's prompt, the system pays a coordination tax on every move. Verified progress should live in runtime state. Then agents can coordinate around the facts of the situation rather than around rewritten manager messages.&lt;/p&gt;

&lt;p&gt;Use a manager when the manager owns the answer. Use a handoff to transfer ownership of the turn. Use shared verified state when multiple agents need to build on verified progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  The production checklist
&lt;/h2&gt;

&lt;p&gt;A production handoff should declare more than a target agent.&lt;/p&gt;

&lt;p&gt;A production handoff needs to contain the owner of the next response, the legal next agents, the state delta, and the message receipt. It needs to declare the context that crosses with the agent and the context that stays behind. It needs to declare the tool envelope that becomes active after transfer of ownership. It needs to declare approval and resume behavior. The handoff should contain enough trace evidence that an engineer debugging a customer interaction can read it without having to read the entire conversation.&lt;/p&gt;
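
&lt;p&gt;As a rough illustration, that declaration could be a typed record with a check that runs before the transfer is accepted. This is a sketch under assumptions: the &lt;code&gt;Handoff&lt;/code&gt; class, its field names, and the three checks in &lt;code&gt;problems&lt;/code&gt; are hypothetical and do not come from any particular framework.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    next_owner: str          # who owns the next response
    legal_next_agents: list  # agents this handoff is allowed to target
    state_delta: dict        # what changed under the previous owner
    message_receipt: str     # evidence that the handoff message was delivered
    context_crossing: list = field(default_factory=list)  # travels with the agent
    context_staying: list = field(default_factory=list)   # stays behind
    tool_envelope: list = field(default_factory=list)     # tools active after transfer
    requires_approval: bool = False
    resume_token: str = ""   # how to resume once approval arrives

    def problems(self):
        """Return the reasons this handoff is not ready to accept."""
        found = []
        if self.next_owner not in self.legal_next_agents:
            found.append("next_owner is not a legal next agent")
        if not self.message_receipt:
            found.append("missing message receipt")
        if self.requires_approval and not self.resume_token:
            found.append("approval required but no resume token")
        return found
```

&lt;p&gt;An empty list from &lt;code&gt;problems()&lt;/code&gt; means the declared fields are consistent. It says nothing about whether the state delta itself is correct, which still needs trace evidence.&lt;/p&gt;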

&lt;p&gt;That is the bar for AI agent orchestration with multiple specialists. Anything less is prompt routing with a prettier interface and different branding.&lt;/p&gt;

&lt;p&gt;I like handoffs. In their simplest form, they map well to organizational models. Triage hands off to specialists. Then the specialist hands off to risk. Risk hands off to human approval. Human approval then hands back to the system. Just like that, simple handoffs mirror how real work moves across an organization, translated into software. That work has owners. There are ledgers. There are receipts. There are permissions. And there is state. All of that dull, boring stuff that we hate so much needs to be included in agent handoffs as well.&lt;/p&gt;

&lt;p&gt;Routing picks the next node.&lt;/p&gt;

&lt;p&gt;A handoff moves responsibility.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Enterprise AI Agents Are Leaving the Server | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 16 Jun 2026 22:09:22 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/enterprise-ai-agents-are-leaving-the-server-focused-labs-5740</link>
      <guid>https://dev.to/focused_dot_io/enterprise-ai-agents-are-leaving-the-server-focused-labs-5740</guid>
      <description>&lt;p&gt;Enterprise AI agents are leaving the server boundary.&lt;/p&gt;

&lt;p&gt;That boundary looks deceptively small until the agent starts acting on behalf of a person inside a browser tab, a desktop application, a row on a grid, a locally saved draft, a clipboard, a device permission, an approval flow, and the rest of the mess. That person’s work does not always translate into a server-side record, so server-only agent tools are insufficient as the primary integration model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backend tools cannot see the product moment
&lt;/h2&gt;

&lt;p&gt;A server tool can update an account, search a knowledge base, create a ticket, or call an ERP workflow. Each of those acts on the record after the product has turned intent into a stored fact.&lt;/p&gt;

&lt;p&gt;The product moment arrives earlier.&lt;/p&gt;

&lt;p&gt;A user selects three bullets from a proposed set of actions in a workflow. A sales engineer is editing the pricing for a set of products and has made unsaved changes to the discount for each. A support rep is viewing an incident timeline for a set of customers. A product manager has selected a cohort of customers for analysis. The client knows where the cursor is, what the user has selected, the scroll position in the product, the current route the user is on, the unsaved form data for the current step in the workflow, the dimensions of the current viewport, the current browser permission state, and the last UI action that the user performed. The server knows nothing, or it knows a stale object model for a set of records.&lt;/p&gt;

&lt;p&gt;That gap is why &lt;a href="https://www.langchain.com/blog/agents-and-applications" rel="noopener noreferrer"&gt;LangChain's architecture for headless tools&lt;/a&gt; is so important. To the model, the tool is just another normal tool with a name, a description, a schema for the parameters, and a result. What matters is that the tool executes on the client.&lt;/p&gt;

&lt;p&gt;This also shifts the focus of enterprise integration significantly. As we wrote about &lt;a href="https://focused.io/lab/salesforce-mcp-turns-crm-integration-into-an-agent-runtime-problem" rel="noopener noreferrer"&gt;CRM integration moving into the agent runtime&lt;/a&gt;, identity, approval, retry, idempotency, and tracing decide whether the integration is safe. And as we laid out this week, that same model is now crossing over into the browser as well. The backend runtime is still the right place for integration with backend services. But the selected object in Figma, the unsaved field in a CRM modal, and even the browser permission prompt are all now in the agent’s execution path.&lt;/p&gt;

&lt;p&gt;The client runtime becomes part of the execution surface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92jtex7qvg9fehttq473.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92jtex7qvg9fehttq473.png" alt="Side-by-side architecture diagram comparing server-only agent tools with client-runtime frontend tools." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The client runtime becomes part of the execution surface when the agent has to act on state that only exists in the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontend tools are contracts, not UI glue
&lt;/h2&gt;

&lt;p&gt;The lazy approach is the side channel: serialize application state, send that off to the server as a big ol’ binary blob, let the model generate a reply, then ask the app to patch the UI from the result. Sure, that works the first time. Then the shape of the serialized data changes in a way that is not obvious even to the author of the code, the model starts operating off stale context, and nobody knows whether the current UI came from a user action, a tool execution, or the model making a blind guess while the app team followed it.&lt;/p&gt;

&lt;p&gt;Frontend tools make the contract explicit. &lt;a href="https://docs.ag-ui.com/concepts/tools" rel="noopener noreferrer"&gt;AG-UI describes tools&lt;/a&gt; as frontend-defined functions passed to the agent at runtime with a name, a description, and a JSON Schema for the parameters. The frontend validates the arguments, executes the tool once the agent's call is complete, and inserts the tool’s result into the conversation history. Simple.&lt;/p&gt;

&lt;p&gt;The important part is the control the frontend has over the capabilities passed to the agent. For each tool, the frontend can decide whether it should be added to or removed from the runtime based on user permissions, application context, and state (&lt;a href="https://docs.ag-ui.com/concepts/tools" rel="noopener noreferrer"&gt;AG-UI tools&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;A quote editor, for example, might allow &lt;code&gt;insertApprovedClause&lt;/code&gt; only when the quote's record is editable, the clause was chosen from the approved library, and the user has permission to change quotes. A support console, on the other hand, might allow &lt;code&gt;draftCustomerReply&lt;/code&gt; freely but require &lt;code&gt;sendCustomerReply&lt;/code&gt; to be approved. A design tool might allow &lt;code&gt;summarizeSelectedFrame&lt;/code&gt; without approval but require &lt;code&gt;replaceSelectedFrameCopy&lt;/code&gt; to be approved.&lt;/p&gt;
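
&lt;p&gt;Those rules can live in plain code that runs before the tool list is handed to the agent. The sketch below is an assumption-heavy illustration in Python, not AG-UI's API: &lt;code&gt;available_tools&lt;/code&gt;, &lt;code&gt;needs_approval&lt;/code&gt;, and the always-offered &lt;code&gt;summarizeQuote&lt;/code&gt; tool are hypothetical, while the other tool names are taken from the examples above.&lt;/p&gt;

```python
def available_tools(record_editable, clause_from_approved_library, user_can_edit_quotes):
    """Return the tool names a hypothetical quote editor exposes right now."""
    tools = ["summarizeQuote"]  # hypothetical read-only tool, always offered
    if record_editable and clause_from_approved_library and user_can_edit_quotes:
        tools.append("insertApprovedClause")
    return tools


# Tools that may run only after an explicit user approval.
REQUIRES_APPROVAL = {"sendCustomerReply", "replaceSelectedFrameCopy"}


def needs_approval(tool_name):
    return tool_name in REQUIRES_APPROVAL
```

&lt;p&gt;Because the decision is made per request from current permissions and state, a tool the user cannot use never reaches the model in the first place.&lt;/p&gt;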

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51uefebs9yqtwumcxgn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51uefebs9yqtwumcxgn3.png" alt="Swimlane diagram showing a frontend tool call lifecycle across agent, server runtime, client runtime, user approval, local action, and trace receipt." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A client-side tool call carries validation, approval, execution, and evidence through one lifecycle.&lt;/p&gt;

&lt;p&gt;We argued earlier that &lt;a href="https://focused.io/lab/agent-ui-is-runtime-infrastructure" rel="noopener noreferrer"&gt;agent UI is runtime infrastructure&lt;/a&gt; because event streams give products typed handles for tools, state, approvals, subagents, errors, and observability. Client-executed tools make that argument less theoretical. A product UI is no longer merely a shell around an agent. It owns executable capabilities the agent cannot safely fake from the server.&lt;/p&gt;

&lt;h2&gt;
  
  
  AG-UI is the protocol layer showing up on schedule
&lt;/h2&gt;

&lt;p&gt;MCP provides a standard interface between agents and tools and data. A2A provides a standard interface for agents to interact with other agents. AG-UI targets the interface between the agent and the user-facing application. In this space, the UI has to deal with events (programmatic or human-triggered), streaming updates, multimodal input (e.g., speech and ink), shared state, frontend tool calls, and human-in-the-loop interrupts. This is the scope of the functionality currently defined by &lt;a href="https://docs.ag-ui.com/introduction" rel="noopener noreferrer"&gt;AG-UI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s a clear boundary in the system at the point where the user-facing application can determine the facts of runtime: who is currently present; what has the user selected; what has changed locally on the user’s workstation that will affect the tool results; what can be undone on the user’s workstation; and what, on the user’s workstation, requires a human click before a particular set of side effects can occur on the server. The &lt;a href="https://focused.io/lab/mcp-is-packaging-agent-operable-interfaces-are-the-product" rel="noopener noreferrer"&gt;agent-operable interface is the product&lt;/a&gt; once the tool moves from brochureware integration within the product to production action within the product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/agent-framework/integrations/ag-ui/" rel="noopener noreferrer"&gt;Microsoft's Agent Framework AG-UI integration&lt;/a&gt; points in the same direction. Its documentation lists real-time streaming, session and thread management, state synchronization and sharing, human-in-the-loop approval workflows, custom and generative UIs, tool execution, and tool-result streaming for web and mobile clients.&lt;/p&gt;

&lt;p&gt;A demo can get away with sending text, for example “Approved,” to a panel and checking that it shows up in the right place. &lt;a href="https://focused.io/lab/langchain-bridging-the-gap-to-production-grade-ai-agents" rel="noopener noreferrer"&gt;Production-grade enterprise AI agents&lt;/a&gt; have to account for the client action requested, the user's approval, the data the action ran against, and whether the action was actually sent somewhere else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visual builders will not own this boundary
&lt;/h2&gt;

&lt;p&gt;OpenAI's AgentKit page now says that Agent Builder and Evals will wind down after November 30, 2026 (&lt;a href="https://openai.com/index/introducing-agentkit/" rel="noopener noreferrer"&gt;OpenAI AgentKit update&lt;/a&gt;). The same update points teams toward the Agents SDK for workflows that should continue as code and Workspace Agents for natural-language prompting. Visual builders can still sketch intent. Durable agent integration keeps returning to application-owned code.&lt;/p&gt;

&lt;p&gt;A canvas can sketch a workflow. It cannot check whether the active browser selection still matches the tool call arguments. It cannot own a local permission rule unless the application gives it one. It cannot prove that an approval prompt reflected the side effect about to occur. For enterprise AI agents, the durable boundary is application architecture: typed tools, scoped credentials, state synchronization, reviewable side effects, and traces that follow the action.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://focused.io/lab/ai-agent-governance-follows-the-execution-path" rel="noopener noreferrer"&gt;AI agent governance follows the execution path&lt;/a&gt;. Whether the agent is built with LangGraph, AG-UI, headless tools, or SDKs, governance follows the execution path of the application the agent operates. It does not follow the server path, which makes it distinctly different from governance of server-side applications. The key to governing AI agents is still the same as for any application: a product team must own the application and its agent, define the agent's capabilities, and review the runtime facts of what the agent did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Client actions have to be observable
&lt;/h2&gt;

&lt;p&gt;Backend-only traces don’t work when the browser is executing part of the agent’s plan. The agent sends a command to a client tool. The client tool validates local state. The user approves the action. The browser calls an external API. And the backend stores the result of the action. If these spans do not form a connected trace, then incident review turns into screenshots and Slack messages read one at a time in reverse chronological order.&lt;/p&gt;

&lt;p&gt;Honeycomb recently published a write-up on using OpenTelemetry in the browser (&lt;a href="https://www.honeycomb.io/blog/observable-frontends-opentelemetry-browser" rel="noopener noreferrer"&gt;Honeycomb on OpenTelemetry in the browser&lt;/a&gt;). As the author points out, instrumenting frontend code is a difficult, messy problem because the code runs in unpredictable environments, under simultaneous and unpredictable user input. The post describes how browser instrumentation can propagate trace context to subsequent backend requests, and discusses session IDs as a way to correlate the traces that frontend code generates within the same user session.&lt;/p&gt;

&lt;p&gt;Honeycomb’s &lt;a href="https://www.honeycomb.io/blog/beyond-backend-honeycomb-frontend-observability-ga" rel="noopener noreferrer"&gt;frontend observability GA&lt;/a&gt; post emphasizes end-to-end user flows, high-cardinality data, user interaction context, custom attributes, and debugging specific user-impacting behavior. Add an agent to the frontend and the trace has to carry agent identifiers, tool-call IDs, approval decisions, permission outcomes, state versions, and receipt IDs for every action executed on the client.&lt;/p&gt;

&lt;p&gt;A good result from a tool running on the client is more than just “ok: true”. It needs to include the command that was executed, the state that the tool read, the permission that was exercised, who approved the action, the change that was made, the actions that can be undone, and the trace ID.&lt;/p&gt;

&lt;h2&gt;
  
  
  Own the client runtime before the agent does
&lt;/h2&gt;

&lt;p&gt;The production checklist is straightforward.&lt;/p&gt;

&lt;p&gt;Define client tools as code, which means typed contracts, not callback-style functions buried inside a component. Enforce permissions with rules on the tool rather than heuristics in system prompts. Include the latest state version in each tool call so the client can reject stale requests. Route approvals through the product workflow with exact side-effect descriptions. Record a receipt for every client-executed action. Follow the execution path across browser, agent runtime, backend service, and external API. Build undo paths for actions that modify local or remote state. Someone has to own the interface.&lt;/p&gt;
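
&lt;p&gt;Two of those checklist items, rejecting stale requests and recording a receipt, fit in a few lines. The Python sketch below is illustrative only: the &lt;code&gt;run_client_tool&lt;/code&gt; function, the shape of the &lt;code&gt;call&lt;/code&gt; dictionary, and the receipt fields are assumptions, and a real client would perform the action where the comment indicates.&lt;/p&gt;

```python
import uuid


def run_client_tool(call, current_state_version, approved_by):
    """Reject stale or unapproved calls, then return a receipt, not a bare flag."""
    if call["state_version"] != current_state_version:
        # The UI changed after the agent read it: refuse rather than guess.
        return {"ok": False, "reason": "stale_state", "expected": current_state_version}
    if call.get("requires_approval") and approved_by is None:
        return {"ok": False, "reason": "approval_required"}
    # ... the actual local action would run here ...
    return {
        "ok": True,
        "receipt_id": str(uuid.uuid4()),
        "tool": call["tool"],
        "state_version_read": call["state_version"],
        "approved_by": approved_by,
        "trace_id": call["trace_id"],
        "undoable": True,  # assumption: this action has an undo path
    }
```

&lt;p&gt;The receipt carries the trace ID it was called with, so the client-side span can be joined to the agent and backend spans later.&lt;/p&gt;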

&lt;p&gt;Enterprise AI agents are leaving the server because the work was never only on the server. The work is in the messy middle where application state, user intent, approval, and side effects meet. This is where AI agent integration lives now.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>AI Agent Cost Is a Runtime Signal | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 16 Jun 2026 22:09:19 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/ai-agent-cost-is-a-runtime-signal-focused-labs-5772</link>
      <guid>https://dev.to/focused_dot_io/ai-agent-cost-is-a-runtime-signal-focused-labs-5772</guid>
      <description>&lt;p&gt;AI agent cost management fails when cost is treated as a monthly bill as opposed to a runtime signal.&lt;/p&gt;

&lt;p&gt;A coding agent does not spend money the way a bill reports it. It spends by selecting a model for a specific task, dragging in context from past tasks, calling tools along the way, and passing work to subagents. And the spending happens in a loop of retries, re-evaluation, and delegation until a harness or developer stops the agent.&lt;/p&gt;

&lt;p&gt;LangChain concisely outlined the problem in its June 15 post about making coding-agent spend predictable: &lt;a href="https://www.langchain.com/blog/how-we-made-coding-agent-spend-predictable" rel="noopener noreferrer"&gt;a heavy user can spend thousands per week before anyone notices&lt;/a&gt;. Anthropic's 2026 coding-agent report gives the other side of the same pressure. Developers already use AI in roughly &lt;a href="https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf" rel="noopener noreferrer"&gt;60% of their work while fully delegating only 0 to 20% of tasks&lt;/a&gt;. So even before an agent's behavior is predictable and therefore controllable, agents are already active enough to consume tokens, run up tool costs, and take reviewer time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bill arrives after the behavior
&lt;/h2&gt;

&lt;p&gt;The spend is treated as finance information and finance sees the spend after the system has already behaved. Engineering owns the behavior of the coding agent.&lt;/p&gt;

&lt;p&gt;A normal cloud bill has a simple shape. The cost can be attributed to a service, environment, team, or cluster. Coding-agent spend has a different shape. An expensive run can come down to one simple behavior, for example a retry policy, a context window that grows over the lifetime of a task, or a fallback policy that promotes a cheap model call to an expensive reasoning model. The cost is also heavily influenced by the tool loops the agent executes, especially a search that stops only when a confidence threshold is reached.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://orq.ai/blog/ai-agent-finops" rel="noopener noreferrer"&gt;Orq defined an AI agent FinOps framing&lt;/a&gt; that points at the right unit. Agent costs are determined by runtime behavior. When finance and agents come together, the useful metric is cost per outcome rather than cost per token. A five dollar workflow that produced the correct fix for a support case can be cost efficient compared to a fifty cent workflow that opened trace after trace of garbage work for a human to resolve.&lt;/p&gt;

&lt;p&gt;The monthly invoice cannot tell the difference.&lt;/p&gt;

&lt;p&gt;This is why the cost conversation has to happen next to traces, evals, owner fields, and policy changes. We have been making this point for months, just in different words: traces have to become operational evidence, not mere exhaust. Cost is another field in that evidence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3if0wpx7xx06cb2cwdzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3if0wpx7xx06cb2cwdzj.png" alt="Runtime spend signal loop showing an agent request flowing through an LLM gateway, model and tool calls, trace spans with cost attribution, eval results, and harness policy changes." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cost control works when spend, traces, outcomes, and harness policy sit in one loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caps are crude without traces
&lt;/h2&gt;

&lt;p&gt;This same approach shows up in LangChain's internal rollout of cost control to coding agent usage, using LangSmith LLM Gateway budgets scoped by &lt;a href="https://www.langchain.com/blog/how-we-made-coding-agent-spend-predictable" rel="noopener noreferrer"&gt;organization, workspace, user, and API key&lt;/a&gt;. That system includes monthly, weekly, daily, and hourly budgets, with allowances for users and projects in special circumstances to spend more than a default window allows. Treat those as production controls, not account codes.&lt;/p&gt;

&lt;p&gt;A simple monthly budget can be blown by one expensive task while the rest of the month looks fine. The same hourly limit that stops an expensive task in its tracks can cut into a long-running cheap task that is producing value. A cap only helps when the team has a good sense of the cost's runtime path.&lt;/p&gt;

&lt;p&gt;The good questions here are mundane and hard to answer: which trace went over budget; what model route did it use; who or what workflow triggered it; which tool calls were involved; what retry or eval loop drove up the cost; and was the outcome worth the burn?&lt;/p&gt;

&lt;p&gt;The trace of the harness or agent for a single run is also the information required to &lt;a href="https://focused.io/lab/debugging-llm-pipelines-with-langsmith-why-prompting-alone-isnt-enough" rel="noopener noreferrer"&gt;debug a pipeline rather than a single prompt&lt;/a&gt;. A prompt trace alone will not expose enough information to diagnose the run. The harness trace shows whether the agent cost more because it was doing hard work, was stuck in a failure-and-retry loop, or was carrying junk in the context backpack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget windows should expose the loop
&lt;/h2&gt;

&lt;p&gt;Agent cost controls fail when they are simply yes or no.&lt;/p&gt;

&lt;p&gt;Instead of simply approving or denying cost, the budget window has to do a real job. It warns early; attributes the trace that is blowing through the money; throttles expensive loops with no end; permits named exceptions; and preserves enough detail for the next time the harness sees the same shape.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pb6r5ej3w8o9zgy9eau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pb6r5ej3w8o9zgy9eau.png" alt="Budget windows versus runaway loop diagram showing hourly, daily, weekly, and monthly budget controls with warning, throttle, exception, and hard-stop paths around the same coding-agent workflow." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The budget window should expose the loop before the invoice does.&lt;/p&gt;
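
&lt;p&gt;One way to read those jobs is as a decision function with more than two answers. The Python sketch below is a simplification under assumptions: &lt;code&gt;budget_decision&lt;/code&gt;, its parameters, the 80% warning ratio, and the action names are hypothetical and are not taken from LangSmith or any gateway product.&lt;/p&gt;

```python
def budget_decision(spent, limit, warn_ratio=0.8, has_exception=False, loop_detected=False):
    """Map spend inside one budget window to an action, not just yes or no."""
    if spent >= limit:
        # Over the window: only a named exception keeps the work running.
        return "allow_exception" if has_exception else "hard_stop"
    if loop_detected:
        # Under the limit but looping: slow it down before it burns the rest.
        return "throttle"
    if spent >= limit * warn_ratio:
        return "warn"
    return "allow"
```

&lt;p&gt;The same function can be evaluated per hourly, daily, weekly, and monthly window, with the strictest answer winning. How a loop is detected is left out here and would come from the trace.&lt;/p&gt;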

&lt;p&gt;A coding-agent workflow can include searching for code in the repository, writing and running tests, setting up packages, opening a browser, and asking a stronger model for more reasoning around a failure. Once agents and models get integrated into developer work, the cost of running that agent starts to matter alongside the developer's time.&lt;/p&gt;

&lt;p&gt;Finout's 2026 spending analysis shows volatility in individual inference calls. The cost of a single call can vary with &lt;a href="https://www.finout.io/blog/predicting-ai-spending-in-2026-what-the-data-actually-tells-us" rel="noopener noreferrer"&gt;model choice, context length, and agentic workflow shape&lt;/a&gt;. Small changes in routing or context can create large changes in cost with little change in the feature being built. Aggregate spend hides the actual path.&lt;/p&gt;

&lt;p&gt;The path is the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness owns the budget
&lt;/h2&gt;

&lt;p&gt;Spend ownership lands in the harness.&lt;/p&gt;

&lt;p&gt;The harness determines the model route to be used for the agent. The harness determines the retries for the agent. The harness determines if the context retrieved from storage should be trimmed, summarized, retrieved in full, or left untouched for future retrieval. The harness determines if subagents should be spawned to perform subtasks. The harness determines which tools the agent may use to complete a task. The harness determines what to do with the failing eval: retry, escalate, open a ticket, or stop.&lt;/p&gt;

&lt;p&gt;That is why &lt;a href="https://focused.io/lab/the-agent-harness-is-the-new-lock-in-layer" rel="noopener noreferrer"&gt;cost policy belongs in the harness around the model&lt;/a&gt;. A proxy that sits alone is too thin. It can only say yes or no.&lt;/p&gt;

&lt;p&gt;The per-call price gap between a cheap model and an expensive model is large, but it does not decide what a task costs. A cheap model can end up far more expensive because of how it is used, namely by looping on the same task. Conversely, an expensive model can end up cost-effective because it completes the task quickly and does not have to be called again. Retrieval steps work the same way. They may save money by reducing context, or they may waste money by including irrelevant information. Tools can eliminate token churn, or they can create retry storms because the API contract for the tool is mushy.&lt;/p&gt;

&lt;p&gt;Cost policy has to understand the architecture of cost. Let incident work for customers cost more. Stop refactoring work once its spend crosses a threshold within an hourly window. Use stronger models only after cheaper models fail in specific ways. Stop retrying, even if retry is configured, after the same error occurs three times in the same trace. File an issue if the same workflow keeps costing more over time.&lt;/p&gt;
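
&lt;p&gt;The retry rule is the easiest of these to show in code. The Python sketch below assumes a hypothetical &lt;code&gt;RetryPolicy&lt;/code&gt; class that the harness consults after each failure, with the error reduced to a string signature. How errors are normalized into signatures is left open.&lt;/p&gt;

```python
from collections import Counter


class RetryPolicy:
    """Stop retrying once the same error repeats too often within one trace."""

    def __init__(self, max_same_error=3):
        self.max_same_error = max_same_error
        self._seen = Counter()

    def should_retry(self, trace_id, error_signature):
        # Count this occurrence, then allow a retry only while the count
        # for this (trace, error) pair is still below the limit.
        key = (trace_id, error_signature)
        self._seen[key] += 1
        return self.max_same_error > self._seen[key]
```

&lt;p&gt;With the default limit, the first two occurrences of an error in a trace allow a retry and the third does not. A different trace, or a different error in the same trace, starts its own count.&lt;/p&gt;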

&lt;p&gt;This does not belong in a spreadsheet. Put it in the harness. The harness already controls model selection and usage for the agent workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost per outcome is the only number that travels
&lt;/h2&gt;

&lt;p&gt;Token counts are useful until they become a proxy for value.&lt;/p&gt;

&lt;p&gt;Reducing tokens is a good thing right until the cheaper answer is still a bad answer. Caching prompts, trimming context, selecting a different model, or reducing evals all decrease token counts, but the work still has to execute successfully.&lt;/p&gt;

&lt;p&gt;Cost per outcome asks a different question: for each outcome the customer wants, how much did the system have to spend to produce it? Cost per merged PR, cost per resolved ticket, cost per validated migration, cost per successful workflow, cost per human escalation avoided. Those are easier to take into a budget conversation than a pile of token counts. A lower token number says the team reduced tokens. Fine. The answer still has to work.&lt;/p&gt;
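
&lt;p&gt;The arithmetic is simple enough to sketch in Python. The &lt;code&gt;cost_per_outcome&lt;/code&gt; helper and the run data in the usage note are made up for illustration. The only assumption is that each run records its cost and whether it produced the outcome.&lt;/p&gt;

```python
def cost_per_outcome(runs):
    """runs: list of (cost_in_dollars, succeeded) tuples for one workflow."""
    total_cost = sum(cost for cost, _ in runs)
    successes = sum(1 for _, succeeded in runs if succeeded)
    if successes == 0:
        return float("inf")  # money was spent and nothing was delivered
    return total_cost / successes
```

&lt;p&gt;With invented numbers, a fifty-cent workflow that succeeds once in twenty runs costs $10.00 per outcome, while a five-dollar workflow that succeeds four times in five costs $6.25 per outcome. The per-run price alone would have ranked them the other way around.&lt;/p&gt;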

&lt;p&gt;Honeycomb's stance on OpenTelemetry for AI makes sense because it highlights a real market tension. Richer dimensions of observation are required to understand how AI and agentic systems work: &lt;a href="https://www.honeycomb.io/blog/how-honeycomb-supercharges-opentelemetry-for-ai" rel="noopener noreferrer"&gt;inference, prompts, tokens, RAG context, and request behavior&lt;/a&gt;. The platform cost conversation wants to reduce all of that to a single number. A team cutting the wrong field to hit a spend number removes the evidence required to spend money in a way that drives value.&lt;/p&gt;

&lt;p&gt;The LangSmith documentation on trace routing takes a good stance here. Traces can be routed to LangSmith, OpenTelemetry, or both with &lt;a href="https://docs.langchain.com/langsmith/log-traces-to-project#route-between-langsmith-and-opentelemetry-destinations" rel="noopener noreferrer"&gt;runtime trace destination controls&lt;/a&gt;. The trace and its spend data should travel with the work through systems for reliability, review, security, and finance.&lt;/p&gt;

&lt;p&gt;Cost is an attribute on the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expensive traces should become owned work
&lt;/h2&gt;

&lt;p&gt;The owner looks at the work that crossed the budget window and follows it through. The alert carries the trace, user, team, workflow, model route, tooling, retries, evals, outcome, and cost. If the work was worth it, the owner can relax the cost policy for similar work in the future. If it was waste, the owner alters the harness to prevent it.&lt;/p&gt;

&lt;p&gt;This work is the same as &lt;a href="https://focused.io/lab/agent-failures-should-open-tickets" rel="noopener noreferrer"&gt;turning recurring agent failures into owned issues&lt;/a&gt;. An expensive trace carries a bug report with a dollar sign attached to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://focused.io/lab/ai-agent-evaluation-steers-the-harness" rel="noopener noreferrer"&gt;Evaluation steers the harness&lt;/a&gt; by connecting cost, behavior, and outcome. Evaluations that detect repeated low-confidence tool use can function as cost control. A regression suite that proves a cheaper model still fulfills the task is cost control already. A trace query that shows a customer workflow accounting for a large part of spend is a product signal already.&lt;/p&gt;

&lt;p&gt;The best cost work looks suspiciously like reliability work. Same owners. Same traces. Same release discipline. Same boring meetings where somebody has to explain why the graph did what it did.&lt;/p&gt;

&lt;p&gt;Good.&lt;/p&gt;

&lt;h2&gt;
  
  
  Treat the invoice as a postmortem artifact
&lt;/h2&gt;

&lt;p&gt;The invoice is useful, and late.&lt;/p&gt;

&lt;p&gt;Agent spend is incurred in runtime decisions: where work flows, how context is aggregated, which retries are attempted, which tools or models are called, which calls get escalated to a stronger model. By the time finance sees the money spent, the relevant engineering question has already been asked and answered by the runtime.&lt;/p&gt;

&lt;p&gt;AI cost optimization sounds like something to purchase through procurement until the cost work is wired into the runtime of the agents, inside the platform harness. At that point it is reliability work, with the same owners, the same traces, and the same release discipline.&lt;/p&gt;

&lt;p&gt;Put cost where the agent spends it.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Agent Harness Is the New Lock-In Layer | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 09 Jun 2026 15:44:49 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/the-agent-harness-is-the-new-lock-in-layer-focused-labs-3mnf</link>
      <guid>https://dev.to/focused_dot_io/the-agent-harness-is-the-new-lock-in-layer-focused-labs-3mnf</guid>
      <description>&lt;p&gt;The agent harness is where model lock-in gets expensive.&lt;/p&gt;

&lt;p&gt;In cloud computing, compute itself is rarely what locks a company into a provider. The tools and layers around that compute are. ARM templates, CloudFormation, Terraform, and others created a new infrastructure boundary that a team could control. A similar shift is now occurring with AI agents. Model calls are easy enough to switch out. Painful, yes, but evaluation work already absorbs that pain. The lock-in surface is shifting into the agent harness.&lt;/p&gt;

&lt;p&gt;Neil Dahlke wrote a useful post on &lt;a href="https://www.langchain.com/blog/model-neutrality" rel="noopener noreferrer"&gt;model neutrality and AI vendor lock-in&lt;/a&gt;. As Dahlke wrote in that post, compute is rarely the lock-in point. The tooling above the compute is. ARM templates, CloudFormation, and Terraform tried to provide a neutral, provider-agnostic control layer above the compute providers. The useful layer let a team mix and match, negotiate, and keep options open.&lt;/p&gt;

&lt;p&gt;Raw compute (i.e. servers) is available in the cloud at competitive cost, so the tooling around the compute is what matters. Teams spend their time on ARM templates, Terraform, CloudFormation, CI, policy, and review, and that investment effectively becomes a new lock-in surface. With agents, the lock-in surface has moved up one more level of abstraction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiog2zzzl039n3m5ig59h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiog2zzzl039n3m5ig59h.png" alt="Side-by-side stack comparison showing cloud-era infrastructure lock-in and agent-era harness lock-in." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lock-in surface moved from infrastructure templates into the agent harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The expensive dependency lives above the model
&lt;/h2&gt;

&lt;p&gt;As noted above, model calls, whether backed by closed models like Claude and GPT or by open models, are getting easier to switch out. Easier does not mean painless. As with anything that involves evaluation work, the pain migrates down the call stack in ways that are not immediately apparent: prompts drift, the behavior of tool calls in the workflow drifts, and latency and cost profiles shift.&lt;/p&gt;

&lt;p&gt;The real dependency is in the tooling surrounding the model call.&lt;/p&gt;

&lt;p&gt;The lock-in surface has moved up the stack from raw compute to the tooling inside the agent harness, and provider SDKs are reaching up to meet it. &lt;a href="https://developers.openai.com/api/docs/guides/agents" rel="noopener noreferrer"&gt;OpenAI's agents SDK guide&lt;/a&gt; covers orchestration, tool evaluation, state, approval, and related controls for multi-step work performed by specialist agents inside the harness. &lt;a href="https://openai.github.io/openai-agents-python/tracing/" rel="noopener noreferrer"&gt;OpenAI's tracing docs&lt;/a&gt; describe traces for generations, tool calls, handoffs, guardrails, and workflow spans. The &lt;a href="https://github.com/anthropics/claude-agent-sdk-python" rel="noopener noreferrer"&gt;Claude Agent SDK Python repo&lt;/a&gt; contains APIs for session-based work, hooks, permissioned tool use, resumable conversations, and a bundled Claude Code CLI. &lt;a href="https://docs.cloud.google.com/agent-builder" rel="noopener noreferrer"&gt;Vertex AI Agent Builder&lt;/a&gt; enables users to build, govern, and run production agents within the Gemini Enterprise Agent Platform.&lt;/p&gt;

&lt;p&gt;That is the move. Providers are not stopping at the endpoint. They are moving into the harness.&lt;/p&gt;

&lt;p&gt;Once workflow logic is locked into a provider-specific harness, switching models means migrating the business logic in that harness. Traces, evals, and retries have to move with it, along with the memory, credentials, and release gates the harness already manages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent neutrality happens during the run
&lt;/h2&gt;

&lt;p&gt;Cloud neutrality was a boardroom phrase until something broke. A renewal came up. A region failed. A bill got weird. A team had to decide whether it could move a workload.&lt;/p&gt;

&lt;p&gt;Model neutrality has a shorter clock. It shows up inside a single agent run.&lt;/p&gt;

&lt;p&gt;An agent supporting a customer service representative could use one model to classify customer intent and another to write policy-grounded answers. A third could turn that answer into structured JSON for a customer relationship management (CRM) system, while a smaller model summarizes the conversation for memory. A software agent might use a planning model to plan a series of actions, a code-edit model to make the changes, a long-context model to search a large repository for specific information, and an open model for background triage. The governed workflow is the unit that matters.&lt;/p&gt;

&lt;p&gt;That is why &lt;code&gt;agent harness&lt;/code&gt; is the useful phrase here. Generic talk of lock-in misses that the surface has moved from infrastructure templates to the harness that runs agents and models.&lt;/p&gt;

&lt;p&gt;We have already covered why &lt;a href="https://focused.io/lab/agentic-ai-architecture-needs-model-routing" rel="noopener noreferrer"&gt;agentic AI architecture needs model routing&lt;/a&gt;. The argument now moves up a level, to whether routing is still acceptable when it is offered by the provider as a feature. Provider-supported routing can help. Provider-owned routing can become another part of the tool the customer is locked into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness is the control plane for behavior
&lt;/h2&gt;

&lt;p&gt;Sydney Runkle of LangChain wrote recently on &lt;a href="https://www.langchain.com/blog/how-to-build-a-custom-agent-harness" rel="noopener noreferrer"&gt;building a custom agent harness&lt;/a&gt;. Her view is model + harness. The model supplies language and reasoning. The harness supplies the context for the model to run within, including tools, policies, environments, memory, delegation, approval, retries, and the rest of what makes the model function as a system.&lt;/p&gt;

&lt;p&gt;That definition is more than a neat formula. It explains why enterprise agent work keeps drifting toward infrastructure.&lt;/p&gt;

&lt;p&gt;The harness is where a team defines what context the model will have. The harness is where a team defines what tools the agent will use. The harness is where business policies sit. The harness is where environments get tested. The harness is where memory, delegation, approval, retry logic, guardrails, cost logic, data boundaries, task boundaries, and workflow boundaries live.&lt;/p&gt;

&lt;p&gt;This is why at Focused we have long taken the opinionated position of &lt;a href="https://focused.io/lab/langgraph-enterprise-agent-development" rel="noopener noreferrer"&gt;LangGraph as the enterprise agent foundation&lt;/a&gt;. Fans can get worked up about their pet frameworks, but what matters about infrastructure is who owns the graph, the state, the interrupts, the model routes, and the traces. Production contracts for release, usage, and operation are what matter. That is why we keep pulling enterprise agent architecture toward LangGraph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runtime routing is already a code path
&lt;/h2&gt;

&lt;p&gt;The model-neutrality argument can sound abstract until it shows up as a few lines of middleware. LangChain's current docs for &lt;a href="https://docs.langchain.com/oss/python/deepagents/models#select-a-model-at-runtime" rel="noopener noreferrer"&gt;selecting a model at runtime in Deep agents&lt;/a&gt; use runtime context, &lt;code&gt;wrap_model_call&lt;/code&gt;, &lt;code&gt;ModelRequest&lt;/code&gt;, &lt;code&gt;init_chat_model&lt;/code&gt;, and &lt;code&gt;request.override(model=model)&lt;/code&gt;. The core LangChain docs show the same pattern for &lt;a href="https://docs.langchain.com/oss/python/langchain/middleware/custom#dynamic-model-selection" rel="noopener noreferrer"&gt;dynamic model selection middleware&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents.middleware&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ModelResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wrap_model_call&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chat_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_chat_model&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;init_chat_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;init_chat_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-5.4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;long_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;init_chat_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_genai:gemini-3.5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;init_chat_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-5.4-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@wrap_model_call&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;neutral_model_route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;ModelRequest&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ModelResponse&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ModelResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;middleware&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;neutral_model_route&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;context_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Draft a refund-policy summary.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84c70d3jfhdh1b7tsfyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84c70d3jfhdh1b7tsfyd.png" alt="Flow diagram showing a neutral agent harness routing one workflow across different models while keeping policy and traces shared." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Model choice should happen inside one governed run, not in a rewrite project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability belongs above the provider
&lt;/h2&gt;

&lt;p&gt;Provider-native traces are pretty great, and there is nothing wrong with allowing them inside an org. An SDK that records spans for generations, tool calls, handoffs, and guardrails is useful.&lt;/p&gt;

&lt;p&gt;But the enterprise observability problem is cross-provider by nature. The production question is not only, "what happened inside the OpenAI run?" It is, "which workflow ran, which route was selected, which credentials were exposed, which tools mutated state, which evaluator fired, and did the same failure family appear when the planner used a different model?"&lt;/p&gt;

&lt;p&gt;That question belongs to the harness owner.&lt;/p&gt;

&lt;p&gt;Governing a software agent rhymes with evaluating a language model, except the evaluated unit is the harness around the model. &lt;a href="https://focused.io/lab/ai-agent-evaluation-steers-the-harness" rel="noopener noreferrer"&gt;AI agent evaluation steering the harness&lt;/a&gt; and &lt;a href="https://focused.io/lab/agent-benchmarks-measure-the-harness" rel="noopener noreferrer"&gt;agent benchmark scores measure the harness&lt;/a&gt; both point at the same control surface: routing, prompts, tool selection, policy, and release gates.&lt;/p&gt;

&lt;p&gt;A neutral harness keeps traces and evals attached to the business workflow. A provider harness keeps them attached to the provider path. That difference may seem small until an incident review needs to span three models, two tool systems, and one customer-facing workflow. It is hard to replay that in a vendor console.&lt;/p&gt;
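
&lt;p&gt;A rough sketch of the cross-provider question from earlier in this section, assuming the harness writes its own span records with hypothetical &lt;code&gt;workflow&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, and &lt;code&gt;failure_family&lt;/code&gt; fields. The schema is invented for illustration, not taken from any vendor:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical harness-level span records. The field names are assumptions.
SPANS = [
    {"workflow": "refund_review", "model": "model_a", "failure_family": "stale_policy_cited"},
    {"workflow": "refund_review", "model": "model_b", "failure_family": "stale_policy_cited"},
    {"workflow": "refund_review", "model": "model_a", "failure_family": None},
    {"workflow": "ticket_triage", "model": "model_b", "failure_family": "tool_timeout"},
]


def cross_model_failures(spans):
    """Return failures that showed up on more than one model.

    Keys are (workflow, failure_family) pairs, values are sorted model names.
    """
    seen = defaultdict(set)
    for span in spans:
        if span["failure_family"] is None:
            continue  # a healthy span, nothing to group
        seen[(span["workflow"], span["failure_family"])].add(span["model"])
    # Each set holds at least one model, so "not exactly one" means two or more.
    return {key: sorted(models) for key, models in seen.items() if len(models) != 1}


shared = cross_model_failures(SPANS)
print(shared)
```

&lt;p&gt;A failure that follows the workflow across models usually points at the harness (prompt, tool, or policy) rather than at one provider, and that is only visible when the spans share a workflow key.&lt;/p&gt;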

&lt;h2&gt;
  
  
  The enterprise mistake is buying the harness by accident
&lt;/h2&gt;

&lt;p&gt;Enterprise teams rarely set out to give a model provider control over the agent operating model. It happens through convenience.&lt;/p&gt;

&lt;p&gt;A provider will have an excellent quickstart that gets a team up and running in minutes. Another provider might have a great hosted trace viewer. When a team is first implementing AI agents at a company, someone will want a human to approve an action before the agent takes it (e.g. sending an email or creating a new record). Naturally, that provider's SDK will have the hooks to implement that approval logic. Memory will be stored there. Evaluators will be added there. The agent will be deployed there. Credentials will be managed there. After six weeks or so, even though the agent can technically be used with other models, the entire system will have been set up within the confines of that single provider's SDK.&lt;/p&gt;

&lt;p&gt;This is how lock-in works when the surface area is productive for long enough.&lt;/p&gt;

&lt;p&gt;The evidence needed for review should come from the harness and relate directly to the workflow. If a routing change needs a pull request, that pull request should contain the updated route table, the diff between previous and new eval sets, a sample trace showing tool permissions stayed bounded, and a rollback section. Provider-console inspection covers one path. Harness-owned evidence covers the workflow that operations and compliance will inspect after the first production incident.&lt;/p&gt;
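
&lt;p&gt;One way to make that pull-request checklist mechanical is a small check in CI. This is a sketch under assumptions: the evidence keys and the shape of the &lt;code&gt;change&lt;/code&gt; record are hypothetical, and the model identifiers are placeholders.&lt;/p&gt;

```python
# Hypothetical evidence keys, mirroring the four items listed above.
REQUIRED_EVIDENCE = ("route_table", "eval_diff", "sample_trace", "rollback")


def missing_evidence(pull_request):
    """Return the evidence items a routing change has not supplied."""
    return [item for item in REQUIRED_EVIDENCE if not pull_request.get(item)]


# An illustrative routing change that forgot to attach a sample trace.
change = {
    "route_table": {"planner": "provider_a:model_x", "background": "provider_b:model_y"},
    "eval_diff": "evals/route-change-diff.md",
    "sample_trace": "",
    "rollback": "Revert the route table commit and redeploy.",
}

gaps = missing_evidence(change)
print("blocked, missing:" if gaps else "ready for review", gaps)
```

&lt;p&gt;The check does not judge the quality of the evidence. It only keeps a route change from merging without it.&lt;/p&gt;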

&lt;p&gt;Reject provider SDKs? No, because they are productive. But don't let them inadvertently build out an operating model that becomes hard to undo later. Decide which layer owns the durable contracts, and make sure that layer sits outside any one provider SDK.&lt;/p&gt;

&lt;p&gt;Focus on owning the workflows and pass the enterprise contracts into the provider SDKs. A workflow should be able to run against Claude, Gemini, GPT, or an open model by selecting the route and deploying to the platform the organization supports. As models and vendors come and go, the harness, the highest-value component of the architecture, stays constant, and only the neutrality layer changes with vendor differences.&lt;/p&gt;

&lt;p&gt;Useful model neutrality preserves the right to exploit model differences.&lt;/p&gt;

&lt;p&gt;Claude might be best for a single turn of dialog. Gemini could follow for subsequent turns. GPT might be right for structured output. An open model might handle background work outside the critical path. The value of enterprise AI is letting the business choose the model that fits the workflow, then inspecting and testing that choice inside a neutral harness.&lt;/p&gt;

&lt;p&gt;The agent harness is the new lock-in layer.&lt;/p&gt;

&lt;p&gt;Better to own it on purpose.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Salesforce MCP Turns CRM Integration Into an Agent Runtime Problem | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 09 Jun 2026 15:44:17 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/salesforce-mcp-turns-crm-integration-into-an-agent-runtime-problem-focused-labs-37lp</link>
      <guid>https://dev.to/focused_dot_io/salesforce-mcp-turns-crm-integration-into-an-agent-runtime-problem-focused-labs-37lp</guid>
      <description>&lt;p&gt;&lt;a href="https://x.com/hwchase17/status/2062917963934560397" rel="noopener noreferrer"&gt;Harrison Chase wrote the useful version of the trend in one sentence&lt;/a&gt;: every platform will have a headless version. Salesforce made that trend concrete with Headless 360. &lt;a href="https://www.salesforce.com/news/stories/salesforce-headless-360-announcement/" rel="noopener noreferrer"&gt;Salesforce Headless 360 exposes Salesforce capabilities through APIs, MCP tools, and CLI commands&lt;/a&gt;, so no one has to pretend the agent's only path is to click through Lightning.&lt;/p&gt;

&lt;p&gt;This is a more significant integration change than the announcement suggests. The caller changes from a person or batch job running through a deterministic code path to an agent running through a reasoning loop. That agent can pick tools, decide the order in which to invoke them, re-run failed work, invoke other products, and handle cross-product setup and coordination for the selected task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CRM caller changed
&lt;/h2&gt;

&lt;p&gt;Salesforce describes Headless 360 as a platform shift for agents that call APIs, invoke MCP tools, and run CLI commands. &lt;a href="https://trailhead.salesforce.com/content/learn/modules/salesforce-headless-360-quick-look/get-to-know-salesforce-headless-360" rel="noopener noreferrer"&gt;Trailhead's quick look says Salesforce becomes a programmable system where data, workflows, and business logic are accessible through APIs, MCP tools, and CLI commands&lt;/a&gt;. Integration patterns are not changed by this new platform offering, yet the call scale, execution order, metadata exchanged, idempotence, governance, and testing all do change. &lt;a href="https://www.salesforce.com/blog/headless-360-integration-architecture/" rel="noopener noreferrer"&gt;Salesforce's own integration architecture post names the new consumer plainly: in 2026, the new consumer is the agent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Enterprise teams should realize that the protocol is making the Salesforce platform available to the agent runtime. Claude Code, Codex, Agentforce, Cursor, or a custom LangGraph agent can all sit on the other end of the protocol. Availability is table stakes. Runtime behavior is where safety lives.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8gmogv9r5thhbr3o8is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8gmogv9r5thhbr3o8is.png" alt="Side-by-side architecture showing old CRM UI integration beside an agent runtime calling Salesforce MCP tools." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Headless Salesforce moves the fragile boundary from browser clicks to the runtime that owns the agent loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  UI guardrails stop counting
&lt;/h2&gt;

&lt;p&gt;The simplest way to get in trouble with Headless CRM is to view MCP as the cleaner automation channel: less screen scraping, fewer fragile black-box UI selectors, better tool descriptions. Fine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.salesforce.com/blog/headless-360-integration-architecture/" rel="noopener noreferrer"&gt;Salesforce warns that a business rule that exists only in the user interface will not apply to agents interacting through MCP&lt;/a&gt;. Page and object layouts, Dynamic Forms, read-only fields and objects, guided screen flows, and the rest apply to humans interacting with the UI. They do not automatically apply to agents running tools and automations via MCP.&lt;/p&gt;

&lt;p&gt;A tool that updates an opportunity stage, for example, has to enforce the appropriate rule in the platform or service layer. A tool that cancels bookings has to handle duplicates: what happens if an agent triggers it again before the cancellation actually takes effect? A tool that creates quotes has to know which permissions apply, which fields the model is allowed to modify, and what side effects updating a quote would have. That class of tool can be clever and powerful. It is also bad if designed lazily.&lt;/p&gt;

&lt;p&gt;To work well, even the simplest tool has to be able to answer basic questions: who or what is behind the current tool call, which Salesforce identity backs it, what permissions belong to that identity, whether the call is safe to retry, which side effects need approval, what happens if tool B is called before tool A, and which trace shows the model decision that led to the CRM record mutation.&lt;/p&gt;
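
&lt;p&gt;Several of those questions can be answered in one place before any tool body runs. The sketch below is illustrative only: &lt;code&gt;ToolCall&lt;/code&gt;, &lt;code&gt;ToolGate&lt;/code&gt;, and the identity and tool names are hypothetical, and nothing here is a Salesforce or MCP API. It checks the identity's permissions, short-circuits retries with an idempotency key, and holds side effects that need approval.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    identity: str          # the identity backing the call (hypothetical value)
    idempotency_key: str   # stable per intended action, so retries are recognizable
    args: dict


@dataclass
class ToolGate:
    allowed: dict          # maps an identity to the set of tools it may call
    needs_approval: set    # tools whose side effects require a human decision
    completed: dict = field(default_factory=dict)  # idempotency key to prior result

    def run(self, call, execute, approved=False):
        if call.tool not in self.allowed.get(call.identity, set()):
            return {"status": "denied"}
        if call.idempotency_key in self.completed:
            # A retry of work that already happened: return the stored result.
            return self.completed[call.idempotency_key]
        if call.tool in self.needs_approval and not approved:
            return {"status": "pending_approval"}
        result = {"status": "done", "output": execute(call.args)}
        self.completed[call.idempotency_key] = result
        return result


executed = []  # records how many times the side effect really ran


def cancel_booking(args):
    executed.append(args["booking_id"])
    return "cancelled"


gate = ToolGate(allowed={"svc-agent": {"cancel_booking"}}, needs_approval=set())
call = ToolCall("cancel_booking", "svc-agent", "cancel:B-1", {"booking_id": "B-1"})
first = gate.run(call, cancel_booking)
second = gate.run(call, cancel_booking)  # the agent retries the same action
print(first, second, executed)
```

&lt;p&gt;The in-memory &lt;code&gt;completed&lt;/code&gt; dictionary is the weak point of the sketch. In production that record has to live somewhere durable, and the platform-side rule still has to exist, because the gate only protects calls that pass through it.&lt;/p&gt;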

&lt;p&gt;These are boring questions. They are also the integration architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP packages the interface
&lt;/h2&gt;

&lt;p&gt;MCP is doing real work here. &lt;a href="https://developer.salesforce.com/blogs/2025/06/introducing-mcp-support-across-salesforce" rel="noopener noreferrer"&gt;Salesforce's MCP support announcement says one MCP server can plug into any AI app or agent that understands MCP&lt;/a&gt;. Agentforce includes a native MCP client, central registry support, policy support, and MuleSoft support to connect existing APIs to MCP servers.&lt;/p&gt;

&lt;p&gt;That packaging layer is valuable. The catalog still has to be loaded, selected, scoped, and audited, which is why &lt;a href="https://focused.io/lab/stop-eager-loading-mcp-tools" rel="noopener noreferrer"&gt;eager-loading every MCP tool into context&lt;/a&gt; turns into a runtime decision rather than a feature added for convenience.&lt;/p&gt;

&lt;p&gt;The interface an agent uses to do work in a service is a product, not a protocol checkbox. A platform can be fully packaged as a service with a good interface for humans, and the interface an agent uses to execute work in that platform as a programmable system is still a different animal. It is the same argument we made in &lt;a href="https://focused.io/lab/mcp-is-packaging-agent-operable-interfaces-are-the-product" rel="noopener noreferrer"&gt;MCP Is Packaging. Agent-Operable Interfaces Are the Product&lt;/a&gt;: the product is the safe agent-operable interface.&lt;/p&gt;

&lt;p&gt;The Salesforce Hosted MCP Server post makes the consequence concrete. &lt;a href="https://developer.salesforce.com/blogs/2026/05/connect-claude-with-salesforce-hosted-mcp-servers" rel="noopener noreferrer"&gt;Salesforce says Claude Desktop and Claude Code can run SOQL queries, modify records, execute actions, invoke flows, call Apex, use Apex REST, call AuraEnabled methods, and work through Named Query APIs&lt;/a&gt;. Standard Salesforce user permissions still apply, and connection happens through an External Client App and OAuth.&lt;/p&gt;

&lt;p&gt;Workload identity is the concept of having a delegate or workload authenticate to a system or service. The runtime should own the credential for delegated identity and manage expiration, auditing, and scope. This is the same boundary in &lt;a href="https://focused.io/lab/ai-agent-authentication-workload-identity" rel="noopener noreferrer"&gt;AI Agent Authentication Starts With Workload Identity&lt;/a&gt;, now applied to Salesforce MCP APIs instead of generic tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  The runtime owns the loop
&lt;/h2&gt;

&lt;p&gt;Agent runtime is an abstract term, partly because it gets abused. Here it is concrete. The Salesforce MCP call becomes a unit of work stored in a task queue. If the agent created a quote and the worker died before the opportunity update, the runtime has to decide whether to replay the lost task, compensate the quote, or ask a human for assistance.&lt;/p&gt;
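
&lt;p&gt;That three-way decision can be written down as a small policy. This is a sketch, not LangGraph's or LangSmith's API. The step records and their &lt;code&gt;safe_to_retry&lt;/code&gt; and &lt;code&gt;compensable&lt;/code&gt; flags are assumptions about what a checkpoint would need to carry.&lt;/p&gt;

```python
def recovery_plan(steps):
    """Decide what to do after a worker dies partway through a workflow.

    `steps` is an ordered list of hypothetical checkpoint records with:
      name          - the step name
      status        - "done" if the side effect landed, "lost" if it did not
      safe_to_retry - True if re-running the step cannot duplicate a side effect
      compensable   - True if a finished step can be undone (for example, void a quote)
    """
    lost = [step for step in steps if step["status"] == "lost"]
    done = [step for step in steps if step["status"] == "done"]
    if not lost:
        return ("nothing_to_do", [])
    if all(step["safe_to_retry"] for step in lost):
        return ("replay", [step["name"] for step in lost])
    if all(step["compensable"] for step in done):
        # Undo finished work in reverse order.
        return ("compensate", [step["name"] for step in reversed(done)])
    return ("ask_human", [step["name"] for step in lost])


# The scenario above: the quote exists, the opportunity update never ran.
STEPS = [
    {"name": "create_quote", "status": "done", "safe_to_retry": False, "compensable": True},
    {"name": "update_opportunity", "status": "lost", "safe_to_retry": True, "compensable": False},
]

plan = recovery_plan(STEPS)
print(plan)
```

&lt;p&gt;The policy is deliberately conservative: it replays only when every lost step is safe to retry, and it falls through to a human when neither replay nor compensation is clearly safe.&lt;/p&gt;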

&lt;p&gt;LangChain has recently refined its vocabulary here. &lt;a href="https://www.langchain.com/blog/agent-frameworks-runtimes-and-harnesses-oh-my" rel="noopener noreferrer"&gt;Harrison Chase laid out a framework/runtime/harness view, with LangChain as the framework, LangGraph as the runtime for DeepAgents, and DeepAgents as a harness for LangGraph work&lt;/a&gt;. Then &lt;a href="https://www.langchain.com/blog/runtime-behind-production-deep-agents" rel="noopener noreferrer"&gt;LangChain’s production Deep Agents runtime post lays out what the runtime means in that model: durable execution of runs of arbitrary length, memory, multi-tenancy, human-in-the-loop, observability, sandboxes, scheduling, MCP, A2A, and webhooks&lt;/a&gt;. Applied to CRM work, Model Context Protocol tools expose CRM actions to the agent, and a managed task queue automatically checkpoints work that is interrupted for any reason, for example a long-running account cleanup cut short by a restart of the deploy that was running the agent. &lt;a href="https://docs.langchain.com/langsmith/core-capabilities" rel="noopener noreferrer"&gt;The LangSmith core capabilities docs spell out managed task queues with automatic checkpointing as a core runtime capability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqh69734g9anl0n831zi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqh69734g9anl0n831zi.png" alt="Layered flow showing an agent loop passing through runtime controls before reaching Salesforce MCP tools and CRM systems." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The runtime turns a model's tool choice into an owned, retryable, observable unit of CRM work.&lt;/p&gt;

&lt;p&gt;A CRM integration without a runtime layer to manage long-running tasks such as account cleanup will work until a task pauses for human approval, the deploy running the agent restarts, and nobody knows which Salesforce record was modified by which model decision. The runtime conversation from &lt;a href="https://focused.io/lab/langchain-interrupt-moved-into-runtime" rel="noopener noreferrer"&gt;LangChain Interrupt Moved into the Runtime&lt;/a&gt; matters outside of LangChain announcements. Salesforce MCP makes CRM actions agent-operable. The runtime makes those actions operable and governable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Browser automation does not go away
&lt;/h2&gt;

&lt;p&gt;Headless platforms do not kill off browser agents, because many products still expose only human surfaces. Browser automation lets agents drive those surfaces through compact snapshots, refs, sessions, recording and playback, control over network activity, and even on-disk state. One example is &lt;a href="https://agent-browser.dev/" rel="noopener noreferrer"&gt;Vercel's agent-browser, built for products that are exposed only through human interfaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If Salesforce exposes MCP tools and the corresponding APIs to agents, then forcing an agent through the human interface by browsing is self-inflicted latency. &lt;a href="https://arxiv.org/html/2410.16464v3" rel="noopener noreferrer"&gt;API-Based Web Agents reported WebArena success rates of 14.8% for regular browsing agents, 29.2% for agents accessing APIs directly, and 38.9% for hybrid agents&lt;/a&gt;. For a large set of tasks, machine interfaces produce substantially shorter trajectories than the corresponding human interface, and shorter trajectories are easier to get right.&lt;/p&gt;

&lt;p&gt;Shorter trajectories cut both ways. Badly chosen tools fail fast, which is useful. Better tools, or even just better work done with them, turn into better data. And a portable protocol makes a wrong action portably wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traces have to cross the CRM boundary
&lt;/h2&gt;

&lt;p&gt;I would add observability and tracing to the list of Headless 360 features worth taking seriously. Agentic CRM work will fail in fascinating ways, and the logging used for traditional integration work will not be enough. A purely deterministic integration records every request and every response. An agentic integration also has to record the planner’s state, the context that was retrieved, the model output that was generated, the tools that were chosen, the Salesforce actions that were permitted, the field-level validation, the automations that were triggered, the retry behavior, and the approval activity for handoffs that require a human.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/blog/honeycomb-is-built-for-the-agent-era-pt1" rel="noopener noreferrer"&gt;Honeycomb's agent-era observability post frames agent telemetry as spans and fields&lt;/a&gt; which answer a set of questions: what changed after a deploy, where handoffs failed, which retries occurred, and how tool usage, cost, latency and quality changed.&lt;/p&gt;

&lt;p&gt;This builds on the argument in &lt;a href="https://focused.io/lab/agent-traces-need-to-cross-the-mcp-boundary" rel="noopener noreferrer"&gt;Agent Traces Need to Cross the MCP Boundary&lt;/a&gt;. An integrator should be able to follow one trace for one run and see both the action that was taken and everything the agent did to plan and prepare for it. That trace must include the planner span, the MCP client span, the MCP server span, the individual CRM API call or calls, and the automated actions that follow. It must also carry enough trace context across each hop for the integrator to understand the run in full.&lt;/p&gt;
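
&lt;p&gt;One way to picture that propagation is the W3C Trace Context &lt;code&gt;traceparent&lt;/code&gt; header, which keeps one trace ID across hops and mints a new span ID at each one. The sketch below is an assumption about how context could be carried; the hop names are hypothetical, and whether a given MCP server or Salesforce API forwards the header is not something the sources above confirm.&lt;/p&gt;

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a trace at the planner span (sampled flag set)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and flags, mint a new span ID for the next hop."""
    match = TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"invalid traceparent: {parent!r}")
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header: str) -> str:
    """Extract the trace ID field from a traceparent header."""
    return header.split("-")[1]


# Hypothetical hops for one run: planner, MCP client, MCP server, CRM API call.
hops = ["planner", "mcp_client", "mcp_server", "crm_api"]
header = new_traceparent()
headers = {hops[0]: header}
for hop in hops[1:]:
    header = child_traceparent(header)
    headers[hop] = header

for hop, value in headers.items():
    print(hop, value)
```

&lt;p&gt;Every printed header shares the same trace ID, which is what lets a reviewer stitch the planner, MCP, and CRM spans into one run.&lt;/p&gt;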

&lt;p&gt;There are open issues, though. &lt;a href="https://arxiv.org/html/2603.13417v1" rel="noopener noreferrer"&gt;This 2026 MCP deployment patterns paper lays out standardization of MCP discovery and invocation, plus production gaps around identity propagation, tool budgeting, structured error semantics, server contracts, user context, timeouts, and runtime observability for agent execution traces&lt;/a&gt;. That list reads like a pretty good backlog for full Salesforce MCP adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build the runtime before the agent gets useful
&lt;/h2&gt;

&lt;p&gt;Salesforce MCP is a tremendous asset to integration-heavy enterprises because it reveals the agent-usable surfaces on the platform, instead of simply putting a chat bot on top of the existing human CRM interface.&lt;/p&gt;

&lt;p&gt;Beyond the prompts, I would start with an “ownership map” for this runtime. Within Salesforce, which tools can an agent execute? How do credentials for those tools get set up and managed? How do business rules that currently live in the UI get ported to the runtime? What are the properties of each tool (read-only, idempotent, approval-gated, side-effecting, …)? How does each action call get a correlation ID passed in as a parameter? How does trace context propagate across the MCP boundary? Where are the handoffs to humans for approval, and how does the runtime resume after approval is granted?&lt;/p&gt;
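
&lt;p&gt;A rough sketch of what such an ownership map could look like in code, assuming a simple in-process registry. The tool names, owners, and the &lt;code&gt;prepare_call&lt;/code&gt; helper are hypothetical; a real map would come from the organization's own tool inventory.&lt;/p&gt;

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolEntry:
    name: str
    owner: str            # team accountable for the tool
    read_only: bool
    idempotent: bool
    approval_gated: bool


# Hypothetical tool names and owners, for illustration only.
OWNERSHIP_MAP = {
    entry.name: entry
    for entry in [
        ToolEntry("query_accounts", "sales-ops", True, True, False),
        ToolEntry("update_opportunity", "sales-ops", False, True, False),
        ToolEntry("issue_credit", "finance", False, False, True),
    ]
}


def prepare_call(tool_name: str, arguments: dict) -> dict:
    """Refuse unregistered tools and stamp every call with a correlation ID."""
    entry = OWNERSHIP_MAP.get(tool_name)
    if entry is None:
        raise KeyError(f"tool not in ownership map: {tool_name}")
    return {
        "tool": entry.name,
        "owner": entry.owner,
        "arguments": arguments,
        "correlation_id": str(uuid.uuid4()),
        "needs_approval": entry.approval_gated,
        "safe_to_retry": entry.read_only or entry.idempotent,
    }


call = prepare_call("issue_credit", {"account_id": "001-example", "amount": 50})
print(call["owner"], call["needs_approval"], call["safe_to_retry"])
```

&lt;p&gt;Keeping the tool properties next to the owner means the retry and approval questions get answered once, in data, instead of per prompt.&lt;/p&gt;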

&lt;p&gt;Then let the agent work. The enterprises that get this right will not brag about a chat window updating Salesforce. They will show how records were touched in a series of steps by an agent, with specific actions along the way. In each case a trace can be reviewed by the integrators, developers, finance, security, and engineering. The trace represents the complete workflow: automated work, manual approvals, safe retries, and the unsafe action that failed.&lt;/p&gt;

&lt;p&gt;In summary, Salesforce MCP matters because it opens up a new class of work that agents can do in production. The protocol is easy; the runtime is hard.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>AI Agent Governance Follows the Execution Path | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 09 Jun 2026 15:44:14 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/ai-agent-governance-follows-the-execution-path-focused-labs-2gc4</link>
      <guid>https://dev.to/focused_dot_io/ai-agent-governance-follows-the-execution-path-focused-labs-2gc4</guid>
      <description>&lt;p&gt;AI agent governance fails at the intersection of governance and permission. A permission that appears to grant narrow use of a tool can rapidly devolve into something much wider. In the end, what failed? Governance that was a permission spreadsheet with vibes attached.&lt;/p&gt;

&lt;p&gt;Governance happens on the live execution path. Microsoft is tackling runtime security for autonomous agents in the open with the &lt;a href="https://opensource.microsoft.com/blog/2026/04/02/introducing-the-agent-governance-toolkit-open-source-runtime-security-for-ai-agents/" rel="noopener noreferrer"&gt;Agent Governance Toolkit&lt;/a&gt;, which covers deterministic policy enforcement, identity, isolation, audit, and reliability controls for a variety of agent frameworks. The toolkit is open source and MIT-licensed. Governance is moving out of the prompt and into the runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The action is only half the question
&lt;/h2&gt;

&lt;p&gt;Traditional access control is a simple yes or no: can this subject take this action on this resource? That question gets more complex once the same action, reached by a different series of steps, becomes a vastly different thing. A verified billing change for a customer may send a single email to that customer. An AI agent scraping a long-dormant escalation queue could send 4,000 such emails, each potentially with its own complexity, and that is a wildly different event. Similarly, a remediation script run in response to a verified alert is a vastly different action from the same script run because an untrusted prompt injection raised the alert in the first place.&lt;/p&gt;

&lt;p&gt;The paper &lt;a href="https://arxiv.org/abs/2603.16586" rel="noopener noreferrer"&gt;Runtime Governance for AI Agents: Policies on Paths&lt;/a&gt; gives a more precise definition of what to expect from a runtime governance system. A policy should map four inputs to a probability of policy violation: 1) the identity of the AI agent; 2) the agent's partial execution path; 3) the agent's proposed next action; and 4) the organizational state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjam56i8t6rybsav0pk55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjam56i8t6rybsav0pk55.png" alt="Runtime policy gate evaluating identity, execution path, next action, policy state, and trace evidence before a production tool call." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The policy decision belongs on the execution path, before the side effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompts can shape behavior. They cannot enforce policy.
&lt;/h2&gt;

&lt;p&gt;There is a key distinction between prompt-level safety and the safety of the application code. A prompt can steer a stochastic system toward requesting safe actions, but it cannot guarantee them. Application code that invokes tools can, as a rule, intercept the tool call before it executes, which gives deterministic safety instead of the stochastic safety of the prompt. The &lt;a href="https://github.com/microsoft/agent-governance-toolkit" rel="noopener noreferrer"&gt;Agent Governance Toolkit README&lt;/a&gt; quick-start does just this: it governs tool calls by wrapping them in a function, for example &lt;code&gt;govern(my_tool, policy="policy.yaml")&lt;/code&gt;, which logs information about the call, makes the decision, and returns &lt;code&gt;GovernanceDenied&lt;/code&gt; when policy blocks the action.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://focused.io/lab/ai-agent-authentication-workload-identity" rel="noopener noreferrer"&gt;agent identity has to be workload identity, not a shared key&lt;/a&gt;. “The agent did it” does not help during incident review. Which agent? Which run? Which delegated subagent? Which credential? Which policy version? Which approval? Shared credentials turn the trail into mush.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness is the governance surface
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://openreview.net/forum?id=nM5tDHrQsx" rel="noopener noreferrer"&gt;Agent Systems with Harness Engineering&lt;/a&gt; paper describes the harness as the infrastructure that supports workflow execution, memory, skills, orchestration of groups of agents, error handling, and adaptive context management. This matches what we see in real builds today. The harness provides the tool registry, memory lookup, retry strategy, subagent delegation, context, approval, traces, and handling of side effects.&lt;/p&gt;

&lt;p&gt;As it currently stands, the model plans the agent's actions: it chooses the tools, fills in the arguments (e.g. file paths, email addresses), requests credentials, and repeats loops of behavior. The runtime acts on those requests. It can mint the requested credential as a scoped token with an expiration time and trace information, and the harness can stop the agent from continuing down a path when its behavior crosses a policy boundary.&lt;/p&gt;

&lt;p&gt;The harness is therefore becoming &lt;a href="https://focused.io/lab/the-agent-harness-is-the-new-lock-in-layer" rel="noopener noreferrer"&gt;the lock-in layer for agent governance&lt;/a&gt;. Far more components accumulate on the harness than on a model, including runtime policy, identity, memory, tool routing, and evals. Microsoft has a long &lt;a href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/agent-governance-toolkit-architecture-deep-dive-policy-engines-trust-and-sre-for/4510105" rel="noopener noreferrer"&gt;architecture deep dive&lt;/a&gt; into the components that form the Agent OS, including Agent Mesh, Agent Hypervisor, Agent Runtime, Agent SRE, and Agent Compliance. The areas of concern map closely onto familiar SRE domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  The runtime sits between the model and the application
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2603.00495" rel="noopener noreferrer"&gt;AI Runtime Infrastructure&lt;/a&gt; paper describes AI Runtime Infrastructure as execution-time infrastructure that observes, reasons, and intervenes in the behavior of an AI agent for goals related to task success, latency, token efficiency, reliability, and safety. Governance typically occurs after a model has determined a course of action and before an application is invoked to execute the resulting action.&lt;/p&gt;

&lt;p&gt;I like the frame because it forces the boring implementation question. Where does the action pause? Who signs the decision? What trace carries the receipt? At Focused, we keep coming back to that seam because it is the last safe place to turn intent into governed execution before the agent touches production.&lt;/p&gt;

&lt;p&gt;A runtime policy gate needs to see more than just the name of the tool, e.g. “run SQL.” It needs to know the identity of the agent, the task the agent is trying to complete, the path the agent is following, the retrieved context, previous denials, the current set of approvals, the current resource budget, the customer boundaries, the data classification, and the organization state. For example, a customer-service agent can refund an order of up to $200 as long as the account has been verified and no unrelated support notes have been retrieved on the current path. A coding agent can open a pull request after the tests for the associated code change have passed. It cannot merge auth code without first receiving approval from the security team.&lt;/p&gt;
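
&lt;p&gt;The refund rule above can be sketched as a deterministic check that runs before the side effect. This is a hand-rolled illustration, not the Agent Governance Toolkit API; &lt;code&gt;GateInput&lt;/code&gt;, &lt;code&gt;Decision&lt;/code&gt;, &lt;code&gt;evaluate&lt;/code&gt;, the tool names, and the policy version label are all assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass

REFUND_LIMIT_USD = 200.0


@dataclass(frozen=True)
class GateInput:
    agent_id: str
    tool: str
    amount_usd: float
    account_verified: bool
    path: tuple = ()  # tools and retrievals already taken on this run


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    policy_version: str = "refund-policy-v1"  # illustrative version label


def evaluate(request: GateInput) -> Decision:
    """Decide on the next action using identity, path, and state."""
    if request.tool != "refund_order":
        return Decision(False, "tool not covered by this policy")
    if not request.account_verified:
        return Decision(False, "account not verified")
    if request.amount_usd > REFUND_LIMIT_USD:
        return Decision(False, "amount exceeds refund limit")
    # Same tool, same arguments, different path: the decision changes.
    if "retrieve_unrelated_support_notes" in request.path:
        return Decision(False, "unrelated support notes on the current path")
    return Decision(True, "within policy")


clean = GateInput("support-agent-7", "refund_order", 120.0, True, ("lookup_order",))
tainted = GateInput(
    "support-agent-7", "refund_order", 120.0, True,
    ("lookup_order", "retrieve_unrelated_support_notes"),
)
for request in (clean, tainted):
    print(evaluate(request))
```

&lt;p&gt;The two requests differ only in their path, which is the point: a gate that sees only the tool name would have allowed both.&lt;/p&gt;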

&lt;h2&gt;
  
  
  Trace evidence makes governance inspectable
&lt;/h2&gt;

&lt;p&gt;Blocking an unsafe action is better than allowing it, and turning that block into an eval case is how the system improves. &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/" rel="noopener noreferrer"&gt;OpenTelemetry’s GenAI agent semantic conventions&lt;/a&gt; specify spans for creating an agent, invoking an agent, invoking a workflow, and executing a tool. A trace should carry the run information, the tool attempted, the policy version, the decision and the reason for it, the approval state, any revised decision after a later approval, and the result of downstream actions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/blog/your-questions-about-ai-assisted-development-answered" rel="noopener noreferrer"&gt;Honeycomb’s Q&amp;amp;A for AI-assisted development&lt;/a&gt; correctly treats the AI agent as a program that makes network calls and reads and writes data, with latency characteristics and failure modes, and that should be observed with the same tools that observe other software: traces and logs. The trace for a blocked tool call is therefore a receipt, as is the trace for a risky approval or a policy override. Those receipts feed incident review, become eval cases, and drive changes to the harness. The data is only useful to the degree that it is kept as evidence rather than averaged into a metric on a dashboard. That is how &lt;a href="https://focused.io/lab/ai-agent-accuracy-is-an-observability-problem" rel="noopener noreferrer"&gt;accuracy becomes observable evidence&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmntf6k3wdtwpahhrq9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmntf6k3wdtwpahhrq9u.png" alt="Feedback loop showing runtime policy decisions becoming trace evidence, eval cases, policy updates, and harness updates." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Governance gets better when denied actions and risky approvals become eval and policy evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance has to own the boring artifacts
&lt;/h2&gt;

&lt;p&gt;Name the agents. Scope the credentials. Register the tools. Sign the skills. Record the policy version. Log the decision. Attach the trace. Route the approval. Preserve the receipt. Turn bad paths into evals. Roll out policy changes like software changes.&lt;/p&gt;

&lt;p&gt;Skills also belong in this framework, because skills are executable, have tool expectations, have hidden assumptions, and have an upgrade path. I wrote a post recently explaining why &lt;a href="https://focused.io/lab/agent-skills-are-a-software-supply-chain-surface" rel="noopener noreferrer"&gt;agent skills are a software supply-chain surface&lt;/a&gt;. A skill can change what an agent tries to do before policy ever sees the tool call, so it must be governed for the same reasons that we run software through test cases and sanity checks before promoting it through the supply chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The governance boundary follows the side effect
&lt;/h2&gt;

&lt;p&gt;To find the boundaries of governance for an AI system, look at where the agent can do damage. An agent can cause harm through tool calls, sending messages, writing files, updating databases, making payments, deploying software, merging pull requests, and launching subagents. Governance needs to occur right before the side effect of each of these, and the decision to allow an action needs to travel with the evidence of that action.&lt;/p&gt;

&lt;p&gt;AI agent governance follows the execution path because the risk of an agent's actions follows that path. Governance is decided for the next action, in that run, after those prior steps, under that policy, with that identity, before that side effect. Follow the execution path and own it, or leave the agent ungoverned and let it decide what to do with that path.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>AI Agent Accuracy Is an Observability Problem | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 09 Jun 2026 15:43:41 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/ai-agent-accuracy-is-an-observability-problem-focused-labs-3bpe</link>
      <guid>https://dev.to/focused_dot_io/ai-agent-accuracy-is-an-observability-problem-focused-labs-3bpe</guid>
      <description>&lt;p&gt;Agent accuracy is a bad headline metric.&lt;/p&gt;

&lt;p&gt;It sounds responsible. It gives the dashboard a number. It lets a team say the agent is 92% accurate, 84% accurate, or not ready until the number crosses a threshold invented during a planning meeting.&lt;/p&gt;

&lt;p&gt;Then the agent reaches production, and that is where the accuracy number turns into theater, and from there into the wall of shame.&lt;/p&gt;

&lt;p&gt;The failures that matter are the wrong refund, the missed escalation, the silent tool call, the policy decision nobody can replay, and the answer that looked fine until a customer acted on it. Average answer quality does not sort these risks. AI agent observability is the work of sorting failures by consequence and evidence. Without it, accuracy becomes a vibes score wearing a decimal point.&lt;/p&gt;

&lt;p&gt;I see this as the biggest pitfall for teams that use accuracy as their primary metric: the salient form of failure (hallucination) can mask a far greater number of problems that are far more insidious. The agent may state things that are true enough to read cleanly in a final answer, yet be operationally incorrect (e.g., it correctly summarizes the ticket but identifies the wrong account, or it picks the right tool but calls it with the wrong scope). An agent can also retry correctly and return a successful final answer while the side effects of that response are silently wrong (e.g., it updates the invoice in a way that creates garbage in a dependent system that accepts it by default).&lt;/p&gt;

&lt;p&gt;Accuracy has to answer a boring production question: what broke, who owns it, and what trace proves the next change made the system safer?&lt;/p&gt;

&lt;h2&gt;
  
  
  The score hides the consequence
&lt;/h2&gt;

&lt;p&gt;A single accuracy percentage can mask a variety of errors.&lt;/p&gt;

&lt;p&gt;One error is a style miss. The agent writes a clumsy sentence in a support reply and a human edits it. Annoying, cheap (maybe), recoverable.&lt;/p&gt;

&lt;p&gt;Another error is one of state change: the AI’s answer caused something to happen, and that something was incorrect. The phrasing of the answer was not the problem in these cases; the answer was factually correct, but the system as a whole was not.&lt;/p&gt;

&lt;p&gt;A single accuracy percentage does not capture operational consequence, because different errors carry different consequences. For accuracy to mean anything, the harness has to record more than the model's score. That is why monitoring the production AI agent has become &lt;a href="https://focused.io/lab/agent-monitoring-is-an-infrastructure-workload" rel="noopener noreferrer"&gt;production infrastructure&lt;/a&gt;. The monitoring harness has to contain the full trace of the production agent’s steps: the plan it executed, the context it retrieved for each step, the full tool calls and results, the agent’s decisions, the side effects of each step, the fallback behavior for each step, and the final answer.&lt;/p&gt;

&lt;p&gt;Honeycomb puts AI agent accuracy in the proper context. &lt;a href="https://www.honeycomb.io/blog/fast-and-close-to-right-how-accurate-should-ai-agents-be" rel="noopener noreferrer"&gt;Telemetry data is already a lossy representation of system state&lt;/a&gt;, and the AI adds another lossy layer on top of that. Observability teams already work with lossy signals (traces, spans, etc.), and that is acceptable as long as the signals stay operationally useful and let the team route around failures during an investigation. AI agents are no different: the agent's output is one more lossy representation, and it needs the same operational treatment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe00o50344js7w9qcth9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe00o50344js7w9qcth9b.png" alt="Matrix showing agent accuracy errors sorted by operational consequence and response." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Accuracy becomes useful when errors are sorted by consequence, not averaged into one score.&lt;/p&gt;

&lt;h2&gt;
  
  
  Early agents fail structurally first
&lt;/h2&gt;

&lt;p&gt;Mature production agents fail in interesting ways. Immature production agents fail in less interesting ways first, and the team is left confused as to why the agent failed at all.&lt;/p&gt;

&lt;p&gt;The retrieval path points at the wrong corpus. The account context is incomplete. The agent has access to a tool but not the permissions to use it safely. The approval boundary sits after the side effect rather than before it.&lt;/p&gt;

&lt;p&gt;A June 2026 paper, &lt;a href="https://arxiv.org/abs/2606.02494" rel="noopener noreferrer"&gt;Monitoring Agentic Systems Before They're Reliable&lt;/a&gt;, comes to the same conclusion: early production AI agent systems are full of structural problems, which hide task-level errors. The paper describes monitoring a system in three scopes (within a run, across runs, and structurally) and then using variance to route the findings to either automated tracking or human investigation. In the paper, 97% of the findings could be tracked automatically, and 2% or so were left for the team to investigate case by case.&lt;/p&gt;

&lt;p&gt;Accuracy at the end of the run is always late because the causes of the error will be within the structure of the workflow that generated the answer.&lt;/p&gt;

&lt;p&gt;The first observability pass should ask three questions.&lt;/p&gt;

&lt;p&gt;Within a run, did the agent complete each stage with the expected inputs, outputs, and receipts? Across runs, does the same case drift because the model is exploring or because the integration is unstable? Structurally, does the harness expose the state required to make the task possible?&lt;/p&gt;

&lt;p&gt;This is the part that people gloss over because it does not seem to involve AI. But the harness is what exposes the state that the AI is using to begin with. Without the correct plumbing in place, the AI will never have the correct information to make accurate decisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnb3rfdu381qfrn7jp0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnb3rfdu381qfrn7jp0r.png" alt="Layered monitoring stack showing within-run, cross-run, and structural scopes feeding triage and agent improvement." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Early agent monitoring should find the structure hiding the task-level signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Treat accuracy like an error budget
&lt;/h2&gt;

&lt;p&gt;Rather than demanding perfect accuracy from the agent, the production team should define a budget for errors of each consequence.&lt;/p&gt;

&lt;p&gt;A draft-only content agent can be sloppy with wording because its output is reviewed in any case. A billing agent must be extremely accurate with account numbers because a wrong account number can send bills to the wrong people, with potentially disastrous results. A developer agent can propose a risky diff if CI, review, and rollback are all in place to handle the worst case. An access-management agent must treat identity ambiguity as a hard stop because incorrectly granting access to data about another person could be a serious breach of privacy.&lt;/p&gt;

&lt;p&gt;A harness should say, for each task, which types of errors are tolerable, which can be retried, which can be routed to someone else, which are a hard stop and must be blocked, which must be escalated, and which become a new eval for the model. All of these decisions should be embedded in the harness, next to the tools and state transitions of the system, and not in a polite prompt asking the model to be careful. Prompt text cannot handle a side effect; runtime code can.&lt;/p&gt;
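
&lt;p&gt;A small sketch of how that table could live in the harness as data, assuming hypothetical task names and error classes. Nothing here comes from a specific framework, and the choice to turn every block or escalation into an eval case is one possible reading of the list above.&lt;/p&gt;

```python
from enum import Enum


class Disposition(Enum):
    TOLERATE = "tolerate"
    RETRY = "retry"
    ROUTE = "route"
    BLOCK = "block"
    ESCALATE = "escalate"


# Hypothetical tasks and error classes, for illustration only.
ERROR_BUDGET = {
    ("draft_content", "clumsy_wording"): Disposition.TOLERATE,
    ("billing_update", "tool_timeout"): Disposition.RETRY,
    ("billing_update", "wrong_account"): Disposition.BLOCK,
    ("access_grant", "identity_ambiguous"): Disposition.BLOCK,
    ("code_change", "risky_diff"): Disposition.ROUTE,
}

# Dispositions that should also produce a new eval case.
BECOMES_EVAL = {Disposition.BLOCK, Disposition.ESCALATE}


def disposition_for(task: str, error_class: str) -> Disposition:
    """Unknown combinations escalate instead of silently passing."""
    return ERROR_BUDGET.get((task, error_class), Disposition.ESCALATE)


def should_become_eval(task: str, error_class: str) -> bool:
    """Blocked and escalated errors feed the eval set."""
    return disposition_for(task, error_class) in BECOMES_EVAL


print(disposition_for("billing_update", "wrong_account"))
print(disposition_for("billing_update", "never_seen_before"))
print(should_become_eval("draft_content", "clumsy_wording"))
```

&lt;p&gt;Defaulting unknown pairs to escalation keeps a new failure class from inheriting a tolerance nobody chose.&lt;/p&gt;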

&lt;p&gt;The evaluation that uncovers a failure in the agent should also drive the change to the harness. &lt;a href="https://focused.io/lab/ai-agent-evaluation-steers-the-harness" rel="noopener noreferrer"&gt;Evaluation should steer changes to the harness&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The dull implementation detail here is that every action the agent takes (risky or otherwise) should have a &lt;a href="https://focused.io/lab/ai-agent-orchestration-needs-a-side-effect-ledger" rel="noopener noreferrer"&gt;side-effect receipt&lt;/a&gt;. So rather than just logging a change of state to a database, the system would generate a receipt for the action that includes: input, the selected account, the tool contract/schema used, the scope of the authorization used, the model’s output, the human approval state, the resulting external identifier(s), etc.&lt;/p&gt;
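
&lt;p&gt;As a sketch, a receipt with those fields could be an immutable record that serializes to JSON. The field names and sample values below are assumptions made for illustration, not a schema from any of the linked posts.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SideEffectReceipt:
    run_id: str
    tool: str
    tool_schema_version: str
    input_summary: str
    selected_account: str
    auth_scope: str
    model_output: str
    approval_state: str   # e.g. "not_required", "approved", "denied"
    external_ids: tuple   # identifiers returned by the external system
    recorded_at: str


def make_receipt(**fields) -> SideEffectReceipt:
    """Stamp the receipt with a UTC timestamp when it is created."""
    return SideEffectReceipt(
        recorded_at=datetime.now(timezone.utc).isoformat(), **fields
    )


receipt = make_receipt(
    run_id="run-123",
    tool="update_invoice",
    tool_schema_version="2",
    input_summary="customer asked to move the due date",
    selected_account="acct-example",
    auth_scope="invoices:write",
    model_output="set due_date to 2026-07-01",
    approval_state="not_required",
    external_ids=("inv-example-1",),
)
print(json.dumps(asdict(receipt), indent=2))
```

&lt;p&gt;Freezing the dataclass is deliberate: a receipt that can be edited after the fact is a log line, not evidence.&lt;/p&gt;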

&lt;p&gt;That is how accuracy turns into operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The feedback loop has to close
&lt;/h2&gt;

&lt;p&gt;The manual improvement loop as described recently in the LangSmith Engine note from LangChain follows this pattern: &lt;a href="https://x.com/LangChain/status/2062625696292225500" rel="noopener noreferrer"&gt;trace to failure, then prompt or code to fix, then eval and test, then ship and repeat&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is good to see monday’s LangSmith case study, where &lt;a href="https://www.langchain.com/blog/customers-monday" rel="noopener noreferrer"&gt;evaluation feedback loops for Service reports became 8.7x faster, down from 162 seconds to 18 seconds&lt;/a&gt;, with offline evaluation safety nets, and online production monitoring against traces. This is a concrete example of how evaluation can become code-backed production practice (as opposed to occasional spreadsheets).&lt;/p&gt;

&lt;p&gt;So long as an agent is run as a transcript of prompt and answer, there is little release evidence of what was actually tested. &lt;a href="https://focused.io/lab/evaluation-pipelines-for-langgraph-agents" rel="noopener noreferrer"&gt;Evaluation pipelines for LangGraph agents&lt;/a&gt; turn the run of an agent into a versioned object, comparable, gatable, and improvable by the team.&lt;/p&gt;

&lt;p&gt;For production-grade AI agents, accuracy is a property of the loop: offline evals catch known regressions, online monitors catch real drift, trace analysis names new failure classes, harness changes move the boundary, and release gates keep the fix from breaking another path.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to instrument first
&lt;/h2&gt;

&lt;p&gt;Start with the places where accuracy can become damage.&lt;/p&gt;

&lt;p&gt;Every tool call (even scripts) should have a full record of the selected entity (an account, for example), the scoped credentials (such as auth tokens) used to call the tool, all of the arguments passed to the tool, the full response from the tool, and the full external side-effect receipt (for example, which records were updated). Every retrieval should have a record of all the source IDs and versions it considered, plus the reason it chose the retrieval context it used. For every human approval boundary, it should be possible to determine who approved, what changed as a result of that approval, and whether the agent continued from the exact same state after approval (as opposed to after another action).&lt;/p&gt;

&lt;p&gt;Add &lt;a href="https://focused.io/lab/agent-rubrics-turn-evaluation-into-runtime-qa" rel="noopener noreferrer"&gt;evaluator verdicts&lt;/a&gt; where they can change behavior. A groundedness score, for instance, is useless if nobody reads it. But a verdict that blocks a deploy, routes a case to review, opens a ticket, or updates the regression set of failed evals is useful because it produces operational evidence of accuracy.&lt;/p&gt;

&lt;p&gt;The AgentOps paper &lt;a href="https://arxiv.org/abs/2507.11277" rel="noopener noreferrer"&gt;describes the lifecycle of an AI Agent as: observe the agent’s behavior, collect metrics, detect issues, analyze the root cause, recommend optimizations, and automate runtime operations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The seductive version of observability is a trace store full of nice looking spans while the agent continues to make the same mistake. And nothing changes. Observability has to produce action. Open the ticket. Add the eval. Patch the tool contract. Move the approval gate. Kill the unsafe route.&lt;/p&gt;

&lt;p&gt;Own the next change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accuracy belongs to the system
&lt;/h2&gt;

&lt;p&gt;Once an agent is integrated with tools, memory, policy, approvals, and external systems, accuracy becomes an attribute of the entire runtime.&lt;/p&gt;

&lt;p&gt;A clean percentage is misleading because it pretends that the model vendor is responsible for the accuracy of the AI agent. It is not.&lt;/p&gt;

&lt;p&gt;The accuracy of an agent increases as long as the system can distinguish between errors of concern and errors that do not matter at all. Traces have to prove the consequences of the actions the agent performed. Evals have to preserve regression cases so that a past failure is caught if it returns after retraining. The harness has to route risky behavior to the correct human(s). The release process has to prove that a fix for an error actually fixed that error.&lt;/p&gt;

&lt;p&gt;And the important note here: average accuracy is just a starting point for an AI-powered agent. In production, accuracy means creating an observable loop of error, cause, fix, and updated runtime that again and again catches the right mistakes, explains them to us, and then fixes them before someone else has to pay for that same mistake.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>observability</category>
    </item>
    <item>
      <title>Agent Rubrics Turn Evaluation Into Runtime QA | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 09 Jun 2026 15:43:39 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/agent-rubrics-turn-evaluation-into-runtime-qa-focused-labs-1emk</link>
      <guid>https://dev.to/focused_dot_io/agent-rubrics-turn-evaluation-into-runtime-qa-focused-labs-1emk</guid>
      <description>&lt;p&gt;Agent evaluation has been too detached from the run of the agent.&lt;/p&gt;

&lt;p&gt;For production AI agents, we typically score the trace after the fact (i.e. offline). We compare the final answers in a benchmark harness. Sometimes we call in a human reviewer to verify that the agent’s output “smells right”. That gives us decent research notes and nice screenshots for an exec readout, but that does not tell us if the agent is “done” in production.&lt;/p&gt;

&lt;p&gt;Runtime QA has to sit inside the agent loop.&lt;/p&gt;

&lt;p&gt;LangChain has released &lt;a href="https://www.langchain.com/blog/introducing-rubrics-for-deepagents" rel="noopener noreferrer"&gt;RubricMiddleware for Deep Agents&lt;/a&gt;. Evaluation of AI agents has focused on the grading of one LLM by another LLM. There is already discourse in this space around judge bias, model drift, scoring methodology, and benchmark theater. Where the evaluation of an AI agent’s performance actually happens is inside the runtime QA harness, i.e. where the agent is put through its paces against specific criteria. This is where RubricMiddleware adds interesting functionality.&lt;/p&gt;

&lt;p&gt;It runs before the agent gets to leave.&lt;/p&gt;

&lt;p&gt;In traditional eval, &lt;a href="https://focused.io/lab/agent-benchmarks-measure-the-harness" rel="noopener noreferrer"&gt;agent benchmark scores measure the harness, not just the model&lt;/a&gt;. The score of the trace after the fact determines how “good” the deep agent was. By steering the harness, AI agent evaluation affects how a deep agent is used for work. Useful, but incomplete. The runtime definition of done still matters: when is a customer-service agent done with a chat? When is a coding agent done with a coding task? When is a research agent done with a question? The score after the fact only gives an impression of what might happen in production. It does not guarantee anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Offline evals do not close the loop
&lt;/h2&gt;

&lt;p&gt;Offline evals still matter. As I wrote before, &lt;a href="https://focused.io/lab/ai-agent-evaluation-steers-the-harness" rel="noopener noreferrer"&gt;AI agent evaluation steers the harness&lt;/a&gt;. Datasets, trajectories, scorers, and regression suites all end up steering the AI agent system over time. A new prompt, tool, model, memory policy, or AI agent orchestration change gets evaluated, and the team learns whether the system got better or worse.&lt;/p&gt;

&lt;p&gt;They do not protect the individual run.&lt;/p&gt;

&lt;p&gt;The issue is that a customer-service agent can pass the evaluation suite on Tuesday, yet botch a live account-change request on Wednesday because it relied on stale order status, unclear account-change policy, or a tool response the dataset missed. A coding agent can pass its benchmark suite, yet hand back a “refactored” codebase that fails the migration test. A research agent can churn out a fluent answer and still miss the citation the downstream workflow required.&lt;/p&gt;

&lt;p&gt;The gap is the runtime definition of done.&lt;/p&gt;

&lt;p&gt;LangChain has described a pattern here in the &lt;a href="https://docs.langchain.com/oss/python/deepagents/rubric" rel="noopener noreferrer"&gt;Deep Agents rubric docs&lt;/a&gt;. The basic loop is simple enough: the deep agent arrives at a final-looking output, a grader sub-agent reviews the transcript against the rubric, and each criterion comes back as &lt;code&gt;needs_revision&lt;/code&gt; or &lt;code&gt;satisfied&lt;/code&gt;. When a criterion needs revision, that feedback gets injected into the next pass. The loop exits when the agent satisfies the rubric, hits the iteration cap, fails at runtime, or the grader fails.&lt;/p&gt;
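
&lt;p&gt;The loop described above can be sketched in plain Python. This is not the RubricMiddleware API: &lt;code&gt;run_with_rubric&lt;/code&gt;, the toy agent, and the toy grader are hypothetical stand-ins, and only the two verdict strings come from the docs. The sketch shows the four exits: rubric satisfied, iteration cap, agent failure, and grader failure.&lt;/p&gt;

```python
def run_with_rubric(agent, grader, task, rubric, max_iterations=3):
    """Re-run the agent until every criterion is satisfied or the cap is hit."""
    feedback = []
    output = None
    for iteration in range(1, max_iterations + 1):
        try:
            output = agent(task, feedback)
        except Exception as exc:
            return {"status": "agent_error", "iterations": iteration, "error": str(exc)}
        try:
            verdicts = grader(output, rubric)  # maps criterion name to verdict
        except Exception as exc:
            return {"status": "grader_error", "iterations": iteration, "error": str(exc)}
        failed = [name for name, verdict in verdicts.items() if verdict == "needs_revision"]
        if not failed:
            return {"status": "satisfied", "iterations": iteration, "output": output}
        # Failed criteria become feedback for the next pass.
        feedback = [f"Criterion not met: {name}" for name in failed]
    return {"status": "iteration_cap", "iterations": max_iterations, "output": output}


def toy_agent(task, feedback):
    # Stand-in agent: adds a citation only after the grader asks for one.
    return task + " [cited]" if feedback else task


def toy_grader(output, rubric):
    # Stand-in grader: each criterion is a simple predicate over the output.
    return {
        name: "satisfied" if check(output) else "needs_revision"
        for name, check in rubric.items()
    }


rubric = {"has_citation": lambda text: "[cited]" in text}
result = run_with_rubric(toy_agent, toy_grader, "Refund policy summary", rubric)
print(result["status"], result["iterations"])
```

&lt;p&gt;In this toy run the first pass fails the citation criterion and the second pass satisfies it, so the run ends on a verdict after two iterations.&lt;/p&gt;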

&lt;p&gt;That extends the evaluation harness into the runtime of the AI agent. The criteria used to grade the agent offline can now be checked against the actual traces the agent produces in real production work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rubrics turn done into runtime state
&lt;/h2&gt;

&lt;p&gt;A rubric is valuable in making the word done less squishy.&lt;/p&gt;

&lt;p&gt;For a legal research agent, done can mean covered jurisdictions, cited authority for every conclusion, and no cited case outside the relevant date range. For a customer-service agent, it can mean verified account ID, the billing action required by policy, and a clear audit note. For an engineering agent, it can mean passing tests, a diff inside allowed paths, and a migration script that exists.&lt;/p&gt;

&lt;p&gt;Dull? Yes, it is. And that is exactly as it should be. Production QA is typically dull and thankless work, but it is necessary, so someone has to do it.&lt;/p&gt;

&lt;p&gt;The criteria of the rubric can be incorporated into the run of the agent. The grader sub-agent can review the transcript, call tools, and return structured feedback. The transcript can then be forced through another pass because the rubric was not satisfied. This is managed by the middleware and configured through the grader model, system prompt, optional grader tools, iteration cap, and &lt;code&gt;on_evaluation&lt;/code&gt; callback. There are also custom stream events for &lt;code&gt;rubric_evaluation_start&lt;/code&gt; and &lt;code&gt;rubric_evaluation_end&lt;/code&gt;, with the result, explanation, and per-criterion verdicts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkswmm6ifg6xcf90b24ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkswmm6ifg6xcf90b24ah.png" alt="Flow diagram of a deep agent, grader sub-agent, rubric verdicts, feedback injection, and observable rubric events." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The run ends on a verdict, not on the agent feeling finished.&lt;/p&gt;

&lt;p&gt;But beyond these basic facts, there is more to the story of observability. Honeycomb has described the need for production telemetry in AI-enabled workflows, and its March release describes &lt;a href="https://www.honeycomb.io/blog/honeycomb-advances-observability-for-ai-powered-software-development" rel="noopener noreferrer"&gt;AI agents investigating production issues with the same telemetry an SRE uses&lt;/a&gt;. Rubric verdicts, including failed criteria, form part of the evidence stream for runtime behavior of an AI-powered workflow.&lt;/p&gt;

&lt;p&gt;That is operational evidence, not vibes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The grader has to be owned
&lt;/h2&gt;

&lt;p&gt;If evaluation simply gets relabeled as a quality assurance function, it goes wrong quickly.&lt;/p&gt;

&lt;p&gt;That will fail.&lt;/p&gt;

&lt;p&gt;Once it sits in the loop, the grader sub-agent is no longer mere quality assurance for deep agents. It becomes infrastructure with its own model, prompt, tools, timeouts, error modes, and cost profile. It can be too lenient. It can be too strict. It can fail with a false pass. It can also be slow enough to bury the run in retries. Or worse, every hard judgment routes through a frontier model and turns QA for the agent into a new and unforeseen cloud bill.&lt;/p&gt;

&lt;p&gt;Another useful point from the &lt;a href="https://reference.langchain.com/python/deepagents/middleware/rubric" rel="noopener noreferrer"&gt;LangChain Python reference&lt;/a&gt; is that when grader feedback is injected into the conversation, it is tagged as synthetic feedback from &lt;code&gt;rubric_grader&lt;/code&gt;. The feedback also contains source metadata, so it is clear that this feedback was generated by a grader and not by the user of the chat. Synthetic QA feedback must be distinguishable from real user input.&lt;/p&gt;

&lt;p&gt;This is where ownership shows up.&lt;/p&gt;

&lt;p&gt;People misinterpret this pattern. They bolt a grader on a deep agent and suddenly the grader is QA. That fails quickly. The pattern works only if the team has a shared understanding of what the rubric for quality means. Product and domain experts accept the rubric as acceptance criteria for the product. Engineering executes those terms in a loop. QA enforces pass/fail on those terms for the agent in a run. Observability shows the evidence that proves those terms were met.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification has a cost curve
&lt;/h2&gt;

&lt;p&gt;On the same day that LangChain published the above piece on using rubrics as QA feedback, it co-published &lt;a href="https://www.langchain.com/blog/designing-efficient-verifiers-for-legal-agents" rel="noopener noreferrer"&gt;designing efficient verifiers for legal agents&lt;/a&gt; with Harvey. This piece is the practical follow-on to that previous post and outlines the key tradeoffs involved in effective verification.&lt;/p&gt;

&lt;p&gt;The post discusses verifier cost on Harvey’s Legal Agent Benchmark (LAB). LAB is not a toy benchmark. It currently consists of &lt;a href="https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark" rel="noopener noreferrer"&gt;1,250 long-horizon legal tasks across 24 practice areas and 75,000-plus expert-written rubric criteria&lt;/a&gt; that a verifier would check against an agent’s output. The initial results are brutal but useful. Under Harvey’s strict all-pass standard, &lt;a href="https://www.harvey.ai/blog/legal-agent-benchmark-initial-results" rel="noopener noreferrer"&gt;frontier models completed less than 10 percent of tasks end to end, with the top configuration around 50.90 dollars per task and about 22 minutes&lt;/a&gt;. This is the kind of task where a false pass would have serious consequences.&lt;/p&gt;

&lt;p&gt;A failed criterion can go to review. A criterion that should have failed but passed is worse because it can become client risk.&lt;/p&gt;

&lt;p&gt;The verifier-cost post lands on a useful conclusion: per-criterion verification with a frontier model gives a narrow judgment window, but gets expensive fast. Batched verification can bring repeated input-token costs down by about an order of magnitude, but agreement drops. Cheaper open models drop cost again, but the false-pass rate has to be watched as a production risk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qn5almjinbi4dzpo0o8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qn5almjinbi4dzpo0o8.png" alt="Tradeoff matrix comparing frontier per-criterion, batched, and cheaper open verifiers by cost, latency, agreement, and false-pass risk." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Verifier design is a cost curve with a risk budget, not a checkbox.&lt;/p&gt;

&lt;p&gt;Buyers feel that cost curve. Production QA for AI agents is quite different from calling a frontier verifier on every criterion for every task. It consists of designing a system that routes each criterion to the appropriate type of verification, depending on risk. That is &lt;a href="https://focused.io/lab/agentic-ai-architecture-needs-model-routing" rel="noopener noreferrer"&gt;model routing applied to verification&lt;/a&gt;, with QA policy deciding which verifier belongs on which criterion. So the system can use a cheap checker for low-risk formatting work, stricter verifier settings for high-risk domain work, and combine tool-generated evidence with human review where the risk deserves it.&lt;/p&gt;
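
&lt;p&gt;One way to picture that QA policy is a small routing table keyed on risk. The sketch below is an assumption-heavy illustration: the risk labels, verifier names, the medium tier, and the criteria are hypothetical and not taken from LangChain or Harvey.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    criterion_id: str
    risk: str  # "low", "medium", or "high" (hypothetical labels)


# Hypothetical QA policy: which verifier handles which risk tier.
ROUTES = {
    "low": {"verifier": "cheap_checker", "human_review": False},
    "medium": {"verifier": "batched_verifier", "human_review": False},
    "high": {"verifier": "frontier_per_criterion", "human_review": True},
}


def route(criterion):
    # Unknown risk labels fall back to the strictest route.
    return ROUTES.get(criterion.risk, ROUTES["high"])


criteria = [
    Criterion("heading_format", "low"),
    Criterion("jurisdiction_covered", "high"),
    Criterion("unlabeled_criterion", "unknown"),
]
plan = {c.criterion_id: route(c)["verifier"] for c in criteria}
print(plan)
```

&lt;p&gt;Defaulting unknown risk to the strictest route is a deliberate choice in this sketch: an unclassified criterion costs more to verify, but it cannot slip through on the cheap path.&lt;/p&gt;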

&lt;p&gt;The runtime should know the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failed criteria should become work
&lt;/h2&gt;

&lt;p&gt;A rubric failure is a better incident than a vague agent failure.&lt;/p&gt;

&lt;p&gt;Failing a criterion is better than not having a review process at all. When the agent fails to meet a criterion, the owner can revise it, the reviewer can inspect the failure, the platform can collect data around that failure, and the owner can add tests to verify that the agent meets the criterion in future cases.&lt;/p&gt;

&lt;p&gt;I wrote a similar post called &lt;a href="https://focused.io/lab/agent-failures-should-open-tickets" rel="noopener noreferrer"&gt;Agent Failures Should Open Tickets&lt;/a&gt;, where I argued that agent failures should become work in the engineering system instead of vanishing in chat. Rubrics then supply the specific acceptance test for a criterion that failed grading in live runs. The work item does not say “agent failed”. It says “rubric criterion 4 failed” after the billing-policy retrieval tool returned stale policy text in later retrieval runs.&lt;/p&gt;

&lt;p&gt;That changes the improvement loop.&lt;/p&gt;

&lt;p&gt;The team can now see whether errors fall into categories, e.g. lack of context for the agent, bad tools, bad rubric criteria, bad prompts, or verifier drift. The team can cross-reference failures in production against failures in offline eval. The team can make release gates for individual criteria stricter if those errors escape and cause defects in production.&lt;/p&gt;

&lt;p&gt;This is also where a rubric becomes the runtime side of change control. A rubric is an acceptance test with a language-model-shaped grader. That does not make it magic. It makes it easier to express criteria that were already implicit in a human review process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability makes rubric QA real
&lt;/h2&gt;

&lt;p&gt;RubricMiddleware exposes an &lt;code&gt;on_evaluation&lt;/code&gt; callback as well as stream events. I consider the runtime and QA process around these events the thing that turns a self-correction loop into something testable. The harness is still producing a trace. Runtime acceptance criteria, as expressed by the rubric, should be testable as well. The verifier design and related runtime verification should be exposed like any test case.&lt;/p&gt;

&lt;p&gt;A production agent should write out a full runtime trace for every query and emit a record for every acceptance test, including the rubric ID, all the criterion IDs, the grader model, the final verdict, the full explanation, the number of iterations it took to reach a conclusion, the retry cap, any tool evidence collected during evaluation, the total latency, the total cost of the verifier, and the final disposition of the query. The UI and the trace should clearly indicate when feedback was injected, and injected feedback should be distinguishable from human-generated messages, such as customer complaints. Traces from previous runs should be easy to compare so that any drift in the verifiers shows up over time. The dashboard should also clearly highlight the point at which a grader, or set of graders used by the system for verification, failed due to missing credentials or a change in validator tools.&lt;/p&gt;
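
&lt;p&gt;As a rough illustration of comparing runs for verifier drift, the sketch below assumes each evaluation record is a plain dict with a &lt;code&gt;verdicts&lt;/code&gt; map of criterion ID to verdict. The record shape, rubric ID, grader name, and sample data are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def pass_rate_by_criterion(records):
    """Share of 'satisfied' verdicts per criterion ID."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in records:
        for criterion_id, verdict in record["verdicts"].items():
            totals[criterion_id] += 1
            if verdict == "satisfied":
                passed[criterion_id] += 1
    return {cid: passed[cid] / totals[cid] for cid in totals}


def drift(previous, current):
    """Change in pass rate for criteria present in both windows."""
    before = pass_rate_by_criterion(previous)
    after = pass_rate_by_criterion(current)
    return {cid: round(after[cid] - before[cid], 3) for cid in before if cid in after}


def record(verdicts):
    # Hypothetical record shape; real records would carry more fields.
    return {"rubric_id": "billing-v2", "grader_model": "grader-a", "verdicts": verdicts}


last_week = [
    record({"c1": "satisfied", "c4": "satisfied"}),
    record({"c1": "satisfied", "c4": "needs_revision"}),
]
this_week = [
    record({"c1": "satisfied", "c4": "needs_revision"}),
    record({"c1": "satisfied", "c4": "needs_revision"}),
]
print(drift(last_week, this_week))
```

&lt;p&gt;A shift like this does not say whether the agent or the grader changed. It only says where to look, which is the point of keeping the records comparable.&lt;/p&gt;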

&lt;p&gt;That is the work. Boring, specific, necessary.&lt;/p&gt;

&lt;p&gt;I find it helpful to read the optimists and the pessimists on AI and arrive at the same point: there is risk here, and there is no natural feedback loop. Charity Majors wrote this week that AI enthusiasts and skeptics are both reacting to real risk, and her line, &lt;a href="https://charitydotwtf.substack.com/p/ai-enthusiasts-are-in-a-race-against" rel="noopener noreferrer"&gt;“There is no natural feedback loop,”&lt;/a&gt; is the frame I keep coming back to. Rubrics can become part of that feedback loop if verdicts are exposed, owned, and acted on.&lt;/p&gt;

&lt;p&gt;If the loop is invisible, it becomes another agent trick. If the loop is observable, it becomes runtime QA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Own the criteria before scaling the agent
&lt;/h2&gt;

&lt;p&gt;The hard part is not adding middleware.&lt;/p&gt;

&lt;p&gt;It turns out that deciding what done means, and what to do when each type of failure occurs, is hard product and engineering work.&lt;/p&gt;

&lt;p&gt;Manual review cannot keep pace with fan-out work across tools, files, and documents. Blind self-correction cannot safely verify that work either. A runtime rubric provides verification before the agent delivers work, while giving the team evidence when the system cannot complete the loop.&lt;/p&gt;

&lt;p&gt;The offline harness is separate from runtime QA. &lt;a href="https://focused.io/lab/agentic-ai-implementation-change-control" rel="noopener noreferrer"&gt;Change control&lt;/a&gt; is separate from runtime verification and enforcement. In fact, the harness, release process, and runtime rubric form a chain: the harness drives the test runs; the release process gates the new baseline; and the runtime rubric verifies on a per-run basis whether the agent can finish the QA loop for that work item.&lt;/p&gt;

&lt;p&gt;Own that boundary.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>AI Agent Evaluation Steers the Harness | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 03 Jun 2026 15:24:58 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/ai-agent-evaluation-steers-the-harness-focused-labs-3c9k</link>
      <guid>https://dev.to/focused_dot_io/ai-agent-evaluation-steers-the-harness-focused-labs-3c9k</guid>
      <description>&lt;p&gt;Agent evaluation is being conflated with scoring agent performance. Such scoring is useful, but what one gets from such scoring are edits to the harness.&lt;/p&gt;

&lt;p&gt;If an evaluation fails to affect the harness, the team is simply left with a dashboard written up in better language. LangChain puts the sharper version in its Deep Agents writeup, where &lt;a href="https://www.langchain.com/blog/how-we-build-evals-for-deep-agents" rel="noopener noreferrer"&gt;every evaluation is a vector that shifts the behavior of the agentic system&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That sentence carries the argument.&lt;/p&gt;

&lt;p&gt;An eval suite is part of the system. The eval suite is driving the agent to behave in one way and not another. A sloppy eval is cheap to run. A broad benchmark is a good thing to include in a review. But a broad benchmark does not push the system to perform well on the task distribution actual users encounter.&lt;/p&gt;

&lt;p&gt;I see people treat evaluation of AI agents as the same thing as scoring their performance. Yes, the score matters. But what does it buy? Edits to the harness. One failing trace gives the team something valuable: insight into the structure of the harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness is where the lesson lands
&lt;/h2&gt;

&lt;p&gt;An agent harness is the stuff around the model: tools, tool descriptions, prompts, routing rules, memory, retrieval, runtime policy, state shape, and all the weird connective tissue that decides what the model can actually do. We have written about this in the context of &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;developing AI agency&lt;/a&gt;, because model swaps get too much credit and harness changes get too little ownership.&lt;/p&gt;

&lt;p&gt;Agent evaluation belongs there.&lt;/p&gt;

&lt;p&gt;LangChain's Better-Harness writeup makes the loop explicit: &lt;a href="https://www.langchain.com/blog/better-harness-a-recipe-for-harness-hill-climbing-with-evals" rel="noopener noreferrer"&gt;evals create the learning signal for iteratively improving prompts, tools, tool descriptions, instructions, and runtime scaffolding&lt;/a&gt;. To recap the useful part: design evals, run evals, get learning signal from evals, then use that learning signal to improve the components around the model. The evals are training data for the harness.&lt;/p&gt;

&lt;p&gt;A failing eval might trace back to a weakly worded instruction, a tool set up with incorrect parameters, a retrieval component that sends the agent on a wild goose chase through a swamp of useless information, a routing rule that gives the cheap model the first attempt even though that model is likely to struggle to reach the right answer, or agent state that disappears at exactly the wrong moment while the agent is trying to recover from something.&lt;/p&gt;

&lt;p&gt;The score does not fix these problems. The score only earns the right to edit the harness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1p75ciwkejhk1zwhlvan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1p75ciwkejhk1zwhlvan.png" alt="Circular flow showing production traces turning into eval datasets, scores, harness edits, holdout gates, and release." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The eval loop is only useful when it changes the harness and then protects the next release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production traces are the raw material
&lt;/h2&gt;

&lt;p&gt;The best eval cases come from the system embarrassing itself.&lt;/p&gt;

&lt;p&gt;Take a refund agent: the customer asks for a refund and the agent fails to check eligibility. A research agent reads the first file in a series of linked files but fails to open the rest, then confidently and incorrectly summarizes the material for the user. A coding agent changes the implementation but fails to update the tests, then reports success because the patch compiled in the agent's head.&lt;/p&gt;

&lt;p&gt;LangChain's readiness checklist says the first move is to &lt;a href="https://www.langchain.com/blog/agent-evaluation-readiness-checklist" rel="noopener noreferrer"&gt;manually review 20 to 50 real agent traces before building eval infrastructure&lt;/a&gt;. Reading 50 real traces is refreshingly boring and, more importantly, it saves months, because the team avoids building an eval suite from vibes, guesses, and the latest loud failure.&lt;/p&gt;

&lt;p&gt;There is a distinction between capability evals and regression evals. Capability evals measure the system's ability to do new things and will naturally have a low pass rate at first because the team is hill climbing. Once the system is able to do something, it should continue to do that thing in the future. Regression evals catch the system losing behavior that the product relies on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate the path when the path is the product
&lt;/h2&gt;

&lt;p&gt;Another simplistic assumption for agent evaluations: just grade the final answer.&lt;/p&gt;

&lt;p&gt;For a variety of reasons, grading the final answer simply does not work for task classes where the path is part of the product surface. The refund agent rejects a refund request because it skipped a check against company policy. The data agent in a meeting creates a chart by issuing a full table scan in production. The support agent solves the ticket by leaking internal notes to the customer in a conversation.&lt;/p&gt;

&lt;p&gt;Google's Agent Development Kit docs make the split cleanly: agent evaluation should assess &lt;a href="https://adk.dev/evaluate/" rel="noopener noreferrer"&gt;both final output quality and trajectory, meaning the sequence of steps, tools, and reasoning the agent used&lt;/a&gt;. Their ADK codelab turns that into a testing workflow with &lt;a href="https://codelabs.developers.google.com/adk-eval/instructions" rel="noopener noreferrer"&gt;golden datasets that preserve user query, trajectory, and final response&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The release gate has to know what to test for: which behaviors rely on an acceptable path and which rely only on the output of the agent for a given input. Otherwise the team either blocks good changes with brittle tests or ships dangerous changes because the final answer looked right.&lt;/p&gt;

&lt;h2&gt;
  
  
  System, trace, and node evals answer different questions
&lt;/h2&gt;

&lt;p&gt;Agentic CLEAR names the different levels at which a team can review and assess an agent's ability to complete complex tasks. The paper describes an evaluation framework that produces insights at &lt;a href="https://arxiv.org/abs/2605.22608" rel="noopener noreferrer"&gt;system, trace, and node levels of granularity&lt;/a&gt;. IBM's project page expands that into an open-source package that evaluates traces across &lt;a href="https://ibm.github.io/CLEAR/" rel="noopener noreferrer"&gt;system-wide issues, node or component analysis, and trace-level inspection&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those levels map to different harness changes.&lt;/p&gt;

&lt;p&gt;A system-level eval asks whether the workflow produced the outcome the product cares about. Did the claims agent resolve the case? Did the analyst agent produce a grounded answer? Did the coding agent land a patch that passed the intended checks? System-level failures point to architecture: routing, ownership, data access, memory, deployment boundary, or business process fit.&lt;/p&gt;

&lt;p&gt;Trace-level evaluations assess whether an agent follows a task through to a coherent end. They look at whether an agent searches for the appropriate information prior to writing, whether an agent uses the correct tool to send a package, whether an agent sends off work that can be completed in parallel, and whether an agent follows through on a task that has no end point. Trace-level failures suggest problems with planning, tool use, interrupt handling, retry policy, and multi-agent orchestration. This is where &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration&lt;/a&gt; stops being a diagram and starts being an eval surface.&lt;/p&gt;

&lt;p&gt;A node-level eval examines the individual steps an agent takes as it works through a task. Did the retrieval node return the correct documents? Did the summarizer preserve constraints? Did the tool call include the tenant ID? Did the model produce the correct function call for the job? Node-level failures can be addressed by changing local parts of the harness: a tool schema, prompt wording, retrieval filters, model choice, or guardrail placement.&lt;/p&gt;

&lt;p&gt;One pass rate does not cut it for this type of evaluation. A single number does not show the agent developer which repair surface to work on.&lt;/p&gt;
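
&lt;p&gt;As a rough illustration of why one number hides the repair surface, here is a minimal Python sketch that reports pass rates per level and lists the harness areas each failing level points to. The &lt;code&gt;EvalResult&lt;/code&gt; type, the &lt;code&gt;REPAIR_SURFACE&lt;/code&gt; table, and the eval names are hypothetical and are not taken from CLEAR or any other library.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical lookup: harness areas that failures at each level tend to
# point to, following the three paragraphs above.
REPAIR_SURFACE = {
    "system": ["routing", "ownership", "data access", "memory"],
    "trace": ["planning", "tool use", "retry policy", "orchestration"],
    "node": ["tool schema", "prompt wording", "retrieval filters", "model choice"],
}


@dataclass
class EvalResult:
    name: str
    level: str  # "system", "trace", or "node"
    passed: bool


def pass_rate_by_level(results):
    """Return {level: fraction passed} instead of one blended number."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result.level] += 1
        passes[result.level] += int(result.passed)
    return {level: passes[level] / totals[level] for level in totals}


def failing_surfaces(results):
    """Map each level with at least one failure to its candidate repair areas."""
    failed_levels = sorted({r.level for r in results if not r.passed})
    return {level: REPAIR_SURFACE[level] for level in failed_levels}


# Illustrative results only; these eval names are made up.
results = [
    EvalResult("claim_resolved", "system", True),
    EvalResult("searched_before_writing", "trace", True),
    EvalResult("tool_call_has_tenant_id", "node", False),
    EvalResult("summary_keeps_constraints", "node", True),
]

overall = sum(r.passed for r in results) / len(results)
rates = pass_rate_by_level(results)
surfaces = failing_surfaces(results)
print(overall, rates, surfaces)
```

&lt;p&gt;The blended rate here is 0.75, which says nothing about where to look. The per-level view shows the system and trace evals passing and the node level at 0.5, which narrows the fix to local harness parts such as the tool schema.&lt;/p&gt;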

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajt2m2dswnphrwdt22uo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajt2m2dswnphrwdt22uo.png" alt="Layered stack showing system, trace, and node levels of AI agent evaluation feeding harness changes." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;System, trace, and node evals answer different questions, so they should change different parts of the harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability feeds evaluation, then evaluation changes behavior
&lt;/h2&gt;

&lt;p&gt;Agent evaluation without traces becomes theater: teams argue about a few examples, write up synthetic test cases, and then qualitatively judge behavior that nobody has actually seen triggered in real life.&lt;/p&gt;

&lt;p&gt;Observability without evaluation is storage. I like traces. I like spans. But the operational data a system produces is a receipt for how the system got to a given point, and that receipt is what can become evaluation data.&lt;/p&gt;

&lt;p&gt;Honeycomb's AI-era observability piece argues that agentic workflows depend on &lt;a href="https://www.honeycomb.io/blog/evaluating-observability-tools-for-the-ai-era" rel="noopener noreferrer"&gt;fast, queryable, high-cardinality operational data because agents ask iterative questions against raw production context, not just dashboards&lt;/a&gt;. The same holds for evaluation. The easiest way to compromise an evaluation dataset is to let production traces drop tenant ID, tool arguments, retrieval sources, policy decisions, model versions, prompt versions, and release versions.&lt;/p&gt;
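
&lt;p&gt;One way to keep that from happening is to check each trace for those fields before it is allowed into the eval dataset. The sketch below assumes traces arrive as plain Python dictionaries. The field names are hypothetical spellings of the list above, not a schema from Honeycomb, LangSmith, or OpenTelemetry.&lt;/p&gt;

```python
# Hypothetical field names for the trace attributes listed above.
REQUIRED_TRACE_FIELDS = (
    "tenant_id",
    "tool_arguments",
    "retrieval_sources",
    "policy_decisions",
    "model_version",
    "prompt_version",
    "release_version",
)


def missing_fields(trace):
    """Return the required fields that are absent or None in a trace dict."""
    return [name for name in REQUIRED_TRACE_FIELDS if trace.get(name) is None]


def split_eval_candidates(traces):
    """Separate traces complete enough for evals from those with gaps."""
    usable, rejected = [], []
    for trace in traces:
        gaps = missing_fields(trace)
        if gaps:
            rejected.append((trace.get("trace_id"), gaps))
        else:
            usable.append(trace)
    return usable, rejected


# Illustrative traces only.
complete = {
    "trace_id": "t-1",
    "tenant_id": "acme",
    "tool_arguments": {"order_id": 7},
    "retrieval_sources": ["kb/returns.md"],
    "policy_decisions": ["refund_allowed"],
    "model_version": "m-1",
    "prompt_version": "p-3",
    "release_version": "r-12",
}
incomplete = {
    key: value
    for key, value in complete.items()
    if key not in ("tenant_id", "prompt_version")
}
incomplete["trace_id"] = "t-2"

usable, rejected = split_eval_candidates([complete, incomplete])
print(rejected)
```

&lt;p&gt;An empty list or dictionary still counts as present in this sketch, since "the agent retrieved nothing" is itself useful evidence. Only a missing or null field is treated as a gap.&lt;/p&gt;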

&lt;p&gt;The eval dataset should be downstream of observability and upstream of the harness.&lt;/p&gt;

&lt;p&gt;That is the argument behind &lt;a href="https://focused.io/lab/agent-monitoring-is-an-infrastructure-workload" rel="noopener noreferrer"&gt;Agent Monitoring Is an Infrastructure Workload&lt;/a&gt;. Monitoring, log collection, metrics collection, and tracing must be treated as workloads and run as services. Otherwise they are screenshots in a vendor console: the trace proves the agent failed, then it dies in storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The release gate is the boring power move
&lt;/h2&gt;

&lt;p&gt;A good release shape is a pull request that includes the modified harness, with every change visible in the diff, plus the evaluations that motivated the change. The diff says that trace 481 failed and that the failed trace and its evaluation led to a change in the tool description. Or that a retrieval filter changed to avoid tenant leakage found by a node eval. Or that a route now uses a stronger model because a holdout set found cheap-model failures. And when the regression suite finds a path violation in the payment-approval case, the release is blocked.&lt;/p&gt;

&lt;p&gt;That is boring in the correct way.&lt;/p&gt;

&lt;p&gt;LangChain's readiness checklist describes a CI/CD flow where &lt;a href="https://www.langchain.com/blog/agent-evaluation-readiness-checklist" rel="noopener noreferrer"&gt;code or prompt changes trigger offline evals, preview deployments, online evals, and promotion only after quality gates pass&lt;/a&gt;. Better-Harness then covers how &lt;a href="https://www.langchain.com/blog/better-harness-a-recipe-for-harness-hill-climbing-with-evals" rel="noopener noreferrer"&gt;optimization examples can guide improvement, while holdout evals and human review protect against overfitting the harness to visible cases&lt;/a&gt;.&lt;/p&gt;
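
&lt;p&gt;A gate like that can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not LangChain's implementation: the &lt;code&gt;GateResult&lt;/code&gt; type, the suite names, and the rule that regression and holdout failures block while capability failures only inform are all hypothetical choices.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    eval_id: str
    suite: str  # "regression", "holdout", or "capability"
    passed: bool


# Assumption: only these suites can block a release. Capability evals are
# still hill-climbing targets, so a failure there informs but does not block.
BLOCKING_SUITES = {"regression", "holdout"}


def release_decision(results):
    """Return whether to promote, plus the eval ids that block promotion."""
    blockers = [
        r.eval_id
        for r in results
        if r.suite in BLOCKING_SUITES and not r.passed
    ]
    return {"promote": not blockers, "blockers": blockers}


# Illustrative run: the payment-approval path check fails in regression.
results = [
    GateResult("payment_approval_path", "regression", False),
    GateResult("tenant_filter_applied", "regression", True),
    GateResult("cheap_model_holdout", "holdout", True),
    GateResult("multi_step_refund", "capability", False),
]

decision = release_decision(results)
print(decision)
```

&lt;p&gt;In CI, the job would exit non-zero whenever &lt;code&gt;decision["promote"]&lt;/code&gt; is false and print the blocking eval ids next to the diff, so the pull request shows both the harness change and the reason it cannot ship yet.&lt;/p&gt;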

&lt;p&gt;Without that owner, AI agent evaluation becomes a pile of numbers. With that owner, evals become a steering wheel.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first useful eval stack is small
&lt;/h2&gt;

&lt;p&gt;The practical stack does not have to start fancy.&lt;/p&gt;

&lt;p&gt;Start by recording real usage traces and the failures in them, each with success criteria that any reasonable human could check. Separate the capability evals, which measure a hill the team is trying to climb, from the regression evals, which guard behavior the team cannot afford to lose. Tag evaluations by behavior. Keep holdouts out of the agent's optimization loop. Log the part of the harness that changed for each failed eval. Run regression evals in CI before the agent gets another production release.&lt;/p&gt;
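
&lt;p&gt;A starting point for that stack might look like the following Python sketch. It assumes each eval case is a small record and that holdout membership is decided by hashing the case id, so a case never drifts between the optimization set and the holdout set. The &lt;code&gt;EvalCase&lt;/code&gt; fields, the 20 percent default, and the sample cases are hypothetical.&lt;/p&gt;

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    case_id: str
    trace_id: str          # production trace the case came from
    success_criteria: str  # something a reasonable human could check
    kind: str              # "capability" or "regression"
    behaviors: list = field(default_factory=list)  # behavior tags
    harness_component: str = ""  # filled in once a harness fix lands


def is_holdout(case_id, holdout_percent=20):
    """Deterministically assign a case to the holdout set by hashing its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 in range(holdout_percent)


def split_cases(cases, holdout_percent=20):
    """Split cases into (optimization, holdout) without overlap."""
    optimization, holdout = [], []
    for case in cases:
        if is_holdout(case.case_id, holdout_percent):
            holdout.append(case)
        else:
            optimization.append(case)
    return optimization, holdout


# Illustrative cases only.
cases = [
    EvalCase(
        case_id="case-" + str(n),
        trace_id="trace-" + str(480 + n),
        success_criteria="Tool call includes the tenant ID",
        kind="regression",
        behaviors=["tool_use", "tenant_isolation"],
    )
    for n in range(10)
]

optimization, holdout = split_cases(cases)
print(len(optimization), len(holdout))
```

&lt;p&gt;Hashing the id instead of sampling at random keeps the split stable across runs, which matters because a holdout case that leaks into the optimization loop even once stops protecting against overfitting.&lt;/p&gt;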

&lt;p&gt;Keep the granularity straight. System-level evaluations judge the workflow as a whole. Trace-level evaluations verify that the path an agent took to reach its conclusion was acceptable. Node-level evaluations verify that the local step an agent took at a given node was correct.&lt;/p&gt;

&lt;p&gt;Even a good agent will sometimes fail to reach the desired outcome, and those failures are where the dataset comes from.&lt;/p&gt;

&lt;p&gt;Every meaningful failure should become evidence that a human can reuse. A failure trace, once a human has reviewed it, becomes an eval. An eval becomes a harness edit. A harness edit goes through the holdout gate. The next production run produces more evidence.&lt;/p&gt;

&lt;p&gt;In the end, AI agent evaluation is engineering.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
