<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Focused</title>
    <description>The latest articles on DEV Community by Focused (@focused_dot_io).</description>
    <link>https://dev.to/focused_dot_io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1686%2F237dba8d-1803-4c89-8e66-fdb283d0aa4a.png</url>
      <title>DEV Community: Focused</title>
      <link>https://dev.to/focused_dot_io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/focused_dot_io"/>
    <language>en</language>
    <item>
      <title>AI Agent Orchestration Needs Receipts | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Sun, 17 May 2026 21:13:49 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/ai-agent-orchestration-needs-receipts-focused-labs-2ho</link>
      <guid>https://dev.to/focused_dot_io/ai-agent-orchestration-needs-receipts-focused-labs-2ho</guid>
      <description>&lt;p&gt;Orchestrating AI agents breaks in the most boring place of all: between issuing a tool call and the tool call having its intended side effect.&lt;/p&gt;

&lt;p&gt;As tool calls move from client tools executed by application code to server tools executed on the model provider's platform, the function-call abstraction used to describe tool use breaks down. A tool call becomes a runtime transaction. The work done by a tool changes databases, makes payments, sends emails, and creates tickets. A retry storm, or even a single retry, now has significant production consequences.&lt;/p&gt;

&lt;p&gt;Agent tools need receipts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Calls Are Side Effects With Better Marketing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview" rel="noopener noreferrer"&gt;Anthropic's tool-use docs split server tools from client tools&lt;/a&gt;. A client tool is executed by application code, and then the application sends &lt;code&gt;tool_result&lt;/code&gt; back to the model. This is where language ends and production begins. Databases get mutated. Payments get made. Emails get sent. Tickets get updated. Credentials get used.&lt;/p&gt;

&lt;p&gt;I see this boundary get described as a function call. Better: side-effect boundary. These systems do not have a durable receipt right now.&lt;/p&gt;

&lt;p&gt;What proves the side effect in an agent runtime? The request IDs from external vendors, the changed rows in the business system, and the receipt the runtime saved before the model moved on. If the runtime cannot track the side effects caused by tool calls inside the model loop, answering a question like "Did this exact tool intent already cause this exact side effect?" takes a human reading through three different systems and writing glue code along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Backend Pattern Still Applies
&lt;/h2&gt;

&lt;p&gt;Normal API work has already figured this out. For example, &lt;a href="https://docs.stripe.com/api/idempotent_requests" rel="noopener noreferrer"&gt;Stripe supports idempotent requests for POST&lt;/a&gt;, so a caller can retry after a network failure without charging the customer twice. Stripe compares the incoming parameters with those of the original request for a given idempotency key and returns an error when they differ, so a reused key with different parameters is never treated as the same operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/powertools/python/latest/utilities/idempotency/" rel="noopener noreferrer"&gt;AWS Lambda Powertools describes idempotency records&lt;/a&gt; with INPROGRESS and COMPLETE states, payload hashes, stored responses and an expiration for the record. This is a tiny state machine around a side effect. That's all that's required for an agent runtime to safely handle model-intent-to-change-the-world calls.&lt;/p&gt;
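
&lt;p&gt;A minimal sketch of that state machine, assuming an in-memory dict stands in for the durable record store and leaving record expiration out for brevity. &lt;code&gt;IdempotencyStore&lt;/code&gt;, &lt;code&gt;run_once&lt;/code&gt;, and &lt;code&gt;issue_refund&lt;/code&gt; are hypothetical names for illustration; this is not the Powertools API.&lt;/p&gt;

```python
import hashlib
import json


class PayloadMismatch(Exception):
    """The same idempotency key arrived with a different payload."""


class IdempotencyStore:
    """Hypothetical in-memory stand-in for a durable idempotency table."""

    def __init__(self):
        self.records = {}

    def run_once(self, key, payload, side_effect):
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        record = self.records.get(key)
        if record is None:
            # Reserve the key before touching the outside world.
            record = {"status": "INPROGRESS", "payload_hash": digest, "response": None}
            self.records[key] = record
            try:
                record["response"] = side_effect(payload)
            except Exception:
                # Assumes a clean failure; an ambiguous timeout would need a
                # receipt lookup before the key is released.
                del self.records[key]
                raise
            record["status"] = "COMPLETE"
            return record["response"]
        if record["payload_hash"] != digest:
            raise PayloadMismatch(key)
        if record["status"] == "INPROGRESS":
            raise RuntimeError("operation still in progress: " + key)
        # COMPLETE: hand back the stored response instead of re-running.
        return record["response"]


calls = []


def issue_refund(payload):
    # Stand-in for the real side effect; counts how often it actually runs.
    calls.append(payload)
    return {"refund_id": "re_demo_" + str(len(calls))}


store = IdempotencyStore()
first = store.run_once("refund:pay_9", {"amount": 10}, issue_refund)
retry = store.run_once("refund:pay_9", {"amount": 10}, issue_refund)
```

&lt;p&gt;The retry returns the stored response and the side effect runs once. Reusing the key with a different payload raises instead of silently running a second operation.&lt;/p&gt;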

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/transactional-outbox.html" rel="noopener noreferrer"&gt;transactional outbox pattern&lt;/a&gt; covers the delivery side: write the business state and the outbound message in one database transaction, then deliver from the outbox. AWS writes about the duplicate-message problem for this style of delivery and recommends idempotent consumers that track processed message IDs.&lt;/p&gt;
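
&lt;p&gt;The outbox and the idempotent consumer can be sketched together. This is an illustration under stated assumptions: an in-memory SQLite database stands in for the service database, and the table names, &lt;code&gt;close_ticket&lt;/code&gt;, &lt;code&gt;deliver&lt;/code&gt;, and &lt;code&gt;consume&lt;/code&gt; are made up for the example.&lt;/p&gt;

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tickets (id text primary key, status text not null);
    create table outbox (
        message_id text primary key,
        body text not null,
        delivered integer not null default 0
    );
    create table processed (message_id text primary key);
""")


def close_ticket(ticket_id):
    # One transaction: the state change and the outbound message
    # commit together or not at all.
    with conn:
        conn.execute(
            "insert or replace into tickets (id, status) values (?, 'closed')",
            (ticket_id,),
        )
        conn.execute(
            "insert into outbox (message_id, body) values (?, ?)",
            ("ticket-closed:" + ticket_id, json.dumps({"ticket_id": ticket_id})),
        )


def consume(message_id, body, handled):
    # Idempotent consumer: a message id that was already processed is ignored.
    with conn:
        cur = conn.execute(
            "insert or ignore into processed (message_id) values (?)",
            (message_id,),
        )
        if cur.rowcount == 1:
            handled.append(json.loads(body))


def deliver(handled):
    # Relay: push undelivered outbox rows to the consumer, then mark them.
    rows = conn.execute(
        "select message_id, body from outbox where delivered = 0"
    ).fetchall()
    for message_id, body in rows:
        consume(message_id, body, handled)
        with conn:
            conn.execute(
                "update outbox set delivered = 1 where message_id = ?",
                (message_id,),
            )


handled = []
close_ticket("T-1")
deliver(handled)
# Simulated duplicate delivery of the same message.
consume("ticket-closed:T-1", json.dumps({"ticket_id": "T-1"}), handled)
```

&lt;p&gt;The duplicate delivery is absorbed by the &lt;code&gt;processed&lt;/code&gt; table, so the handler runs once per message ID.&lt;/p&gt;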

&lt;p&gt;The deterministic backend, for example a Java or Python service, calls a service endpoint with fixed intent semantics. Booking a hotel room is boring in exactly the right way. An agent tool call is produced by a model loop that can re-plan, retry, branch, summarize state, and call the same tool again. The runtime has to record the intent before the side effect is produced.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Ledger Has to Know
&lt;/h2&gt;

&lt;p&gt;Tool Ledger. Side-Effect Journal. Orchestration Transaction Table. The name is unimportant. It is a table with a specific shape.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f8qrxylenw05rlpay8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f8qrxylenw05rlpay8t.png" alt="Architecture diagram showing an agent runtime routing mutating tool calls through a side-effect ledger with idempotency keys and receipts." width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The side-effect ledger is the boundary between model intent and production side effects.&lt;/p&gt;

&lt;p&gt;A side-effecting tool call needs a record before execution:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
create table agent_tool_ledger (
  id uuid primary key,
  run_id text not null,
  step_id text not null,
  tool_name text not null,
  input_hash text not null,
  operation_key text not null,
  status text not null check (status in (
    'planned',
    'in_progress',
    'succeeded',
    'failed',
    'compensating',
    'compensated'
  )),
  receipt jsonb,
  compensation jsonb,
  error jsonb,
  run_trace_id text,
  owner_service text not null,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),
  unique (tool_name, operation_key)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That unique constraint is the point.&lt;/p&gt;

&lt;p&gt;The record holds the tool name, normalized input hash, run ID, graph step, owner service, run trace ID, status, receipt, and compensation metadata. On conflict, the application checks the stored &lt;code&gt;input_hash&lt;/code&gt; against the new &lt;code&gt;input_hash&lt;/code&gt;. Same key with different input is a bug. The receipt is the external fact: Stripe charge ID, Zendesk ticket ID, GitHub comment URL, invoice number, database primary key, email provider message ID.&lt;/p&gt;
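
&lt;p&gt;The reserve-then-check flow can be sketched as follows. This is an illustration, not a reference implementation: it assumes SQLite in place of Postgres, trims the table to the columns the flow needs, and uses hypothetical helper names (&lt;code&gt;reserve&lt;/code&gt;, &lt;code&gt;record_receipt&lt;/code&gt;, &lt;code&gt;KeyReuseError&lt;/code&gt;) and placeholder hash values.&lt;/p&gt;

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table agent_tool_ledger (
        id text primary key,
        tool_name text not null,
        input_hash text not null,
        operation_key text not null,
        status text not null,
        receipt text,
        unique (tool_name, operation_key)
    )
""")


class KeyReuseError(Exception):
    """Same operation key, different input: treated as a bug."""


def reserve(tool_name, operation_key, input_hash):
    # The unique constraint decides who owns execution: only the first
    # insert for (tool_name, operation_key) creates a row.
    with conn:
        cur = conn.execute(
            "insert or ignore into agent_tool_ledger "
            "(id, tool_name, input_hash, operation_key, status) "
            "values (?, ?, ?, ?, 'in_progress')",
            (str(uuid.uuid4()), tool_name, input_hash, operation_key),
        )
        created = cur.rowcount == 1
    row = conn.execute(
        "select id, input_hash, status, receipt from agent_tool_ledger "
        "where tool_name = ? and operation_key = ?",
        (tool_name, operation_key),
    ).fetchone()
    if row[1] != input_hash:
        raise KeyReuseError(operation_key)
    receipt = json.loads(row[3]) if row[3] else None
    return {"id": row[0], "status": row[2], "receipt": receipt, "created": created}


def record_receipt(row_id, receipt):
    # Store the external fact and close out the entry.
    with conn:
        conn.execute(
            "update agent_tool_ledger set status = 'succeeded', receipt = ? "
            "where id = ?",
            (json.dumps(receipt), row_id),
        )


first = reserve("refund", "refund:t1:pay_9:duplicate", "hash-a")
record_receipt(first["id"], {"refund_id": "re_demo_1"})
retry = reserve("refund", "refund:t1:pay_9:duplicate", "hash-a")
```

&lt;p&gt;The retry gets &lt;code&gt;created&lt;/code&gt; set to false along with the stored receipt, so the caller can return the receipt instead of calling the tool again.&lt;/p&gt;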

&lt;p&gt;No receipt, no production claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry Safety Has to Be Designed Before the Retry
&lt;/h2&gt;

&lt;p&gt;A retry policy is essentially a duplicate side-effect generator wearing a reliability costume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr995om0l6r4xovbtj4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr995om0l6r4xovbtj4j.png" alt="Timeline comparing an unsafe agent retry that duplicates a side effect with a safe retry that checks a side-effect ledger and returns an existing receipt." width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Retries become safe only after the runtime has a durable place to check intent and receipts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.temporal.io/activity-definition" rel="noopener noreferrer"&gt;Temporal's Activity documentation recommends idempotent Activities&lt;/a&gt; because they can be retried. A non-idempotent Activity can corrupt application state even when the distributed system is functioning correctly. The runtime's retry policy does not make the agent reliable by itself.&lt;/p&gt;

&lt;p&gt;This is where agent systems get uncomfortable. Because the system is instrumented to retry on transport failure, it is easy to believe that every retry is a transport retry. In reality, the model observes a timeout and decides to go down a different path. For example, after refunding a customer the model may create a support note, then refund the customer again in a summary step because the receipt from the first attempt was lost. The model may ask a human for confirmation in the meantime and then resume with stale tool context. The model may even run a background subagent that takes a different path to the same conclusion.&lt;/p&gt;

&lt;p&gt;The operation key cannot be derived from raw JSON. Models produce irrelevant differences. Field order changes. Natural-language notes shift. The model's token stream is too noisy, so a good operation key comes from the business operation. &lt;code&gt;refund:{tenant_id}:{payment_id}:{reason_code}&lt;/code&gt; beats a hash of the entire prompt. &lt;code&gt;comment:{repo}:{pull_request}:{review_run_id}&lt;/code&gt; beats a blob of generated markdown.&lt;/p&gt;
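
&lt;p&gt;A small sketch of both pieces: a key built from business identifiers and an input hash that ignores model noise. The function names and the choice of &lt;code&gt;note&lt;/code&gt; as the ignored free-text field are assumptions for illustration; a real tool owner would decide which fields define the operation.&lt;/p&gt;

```python
import hashlib
import json


def operation_key(kind, *parts):
    """Join business identifiers into a stable key."""
    return ":".join([kind] + [str(part) for part in parts])


def input_hash(tool_input, ignored_fields=("note",)):
    """Hash only the fields that define the operation, in canonical order."""
    relevant = {k: v for k, v in tool_input.items() if k not in ignored_fields}
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two model outputs for the same refund: field order and wording differ.
attempt_a = {
    "payment_id": "pay_9",
    "tenant_id": "t1",
    "reason_code": "duplicate",
    "note": "Customer was charged twice.",
}
attempt_b = {
    "tenant_id": "t1",
    "reason_code": "duplicate",
    "payment_id": "pay_9",
    "note": "Refunding the duplicate charge.",
}

key = operation_key("refund", "t1", "pay_9", "duplicate")
same_operation = input_hash(attempt_a) == input_hash(attempt_b)
```

&lt;p&gt;Sorting keys and dropping the free-text field means both attempts map to the same key and the same hash, which is what lets the ledger recognize the second one as a repeat.&lt;/p&gt;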

&lt;p&gt;The service that owns the tool also owns the operation-key semantics, and that ownership boundary corresponds to the ownership of the tool's credentials. In agent systems, authentication of the agent to the external system should start with the workload identity. In &lt;a href="https://focused.io/lab/ai-agent-authentication-workload-identity" rel="noopener noreferrer"&gt;AI Agent Authentication Starts With Workload Identity&lt;/a&gt;, we discussed why secrets should not be passed around like party favors. The same principle applies here. The runtime should not make up the side-effect semantics for a tool that it does not own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Without the Receipt Is Theater
&lt;/h2&gt;

&lt;p&gt;Traces do not, by default, create a business-level uniqueness boundary.&lt;/p&gt;

&lt;p&gt;Joining traces to ledger entries changes what agent observability can do. The trace explains the path after the incident. The ledger table can drive behavior during the incident: suppress the duplicate, resume from a receipt, trigger compensation, alert the owning team, or block the next step until a human approves the ambiguous side effect.&lt;/p&gt;

&lt;p&gt;That is the difference between a dashboard and a control surface. The trace is evidence. The ledger is state.&lt;/p&gt;

&lt;p&gt;Evaluations also get a lot better. In place of "the model called the refund tool", the useful check is one planned refund, one succeeded ledger entry, one receipt, zero duplicate external effects after a simulated timeout. In &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt;, we recognized that people are already testing with the feedback loops they have today. The transcript is too thin to capture all the detail.&lt;/p&gt;
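
&lt;p&gt;That check can be written as an assertion over ledger rows and recorded external calls. The sketch below assumes the eval harness can capture both as lists of dicts; &lt;code&gt;check_single_effect&lt;/code&gt; and the row shapes are hypothetical.&lt;/p&gt;

```python
def check_single_effect(ledger_rows, external_calls, key):
    """Return a list of violations; an empty list means the run passed."""
    rows = [r for r in ledger_rows if r["operation_key"] == key]
    effects = [c for c in external_calls if c["operation_key"] == key]
    problems = []
    if len(rows) != 1:
        problems.append("expected 1 ledger entry, found " + str(len(rows)))
    if [r["status"] for r in rows] != ["succeeded"]:
        problems.append("ledger entry did not succeed")
    if any(r.get("receipt") is None for r in rows):
        problems.append("missing receipt")
    if len(effects) != 1:
        problems.append("expected 1 external effect, found " + str(len(effects)))
    return problems


key = "refund:t1:pay_9:duplicate"
ledger = [
    {"operation_key": key, "status": "succeeded", "receipt": {"refund_id": "re_demo_1"}}
]

# Safe run: the retry after the simulated timeout reused the receipt.
safe_run = check_single_effect(ledger, [{"operation_key": key}], key)

# Unsafe run: the retry hit the external system a second time.
unsafe_run = check_single_effect(
    ledger, [{"operation_key": key}, {"operation_key": key}], key
)
```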

&lt;h2&gt;
  
  
  The Tool Interface Should Expose the Contract
&lt;/h2&gt;

&lt;p&gt;The contract for a side-effecting tool should be defined near the definition of the tool itself. That contract should describe the operational facts that the runtime can enforce for that tool. A side-effecting tool contract should answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the tool read-only or mutating?&lt;/li&gt;
&lt;li&gt;Who owns the tool?&lt;/li&gt;
&lt;li&gt;Which fields form the operation key?&lt;/li&gt;
&lt;li&gt;Which external receipt proves success?&lt;/li&gt;
&lt;li&gt;What status means the side effect is safe to retry?&lt;/li&gt;
&lt;li&gt;What compensation path exists when the effect is wrong?&lt;/li&gt;
&lt;li&gt;How long does the ledger entry live?&lt;/li&gt;
&lt;/ul&gt;
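
&lt;p&gt;One way to keep those answers next to the tool definition is a small typed record. The sketch below is an assumption about shape only: &lt;code&gt;ToolContract&lt;/code&gt; and its field names are invented for illustration and do not come from MCP or any existing registry.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract record; one field per question in the list."""

    name: str
    mutating: bool
    owner_service: str
    operation_key_fields: tuple = ()
    receipt_field: Optional[str] = None
    retry_safe_statuses: tuple = ()
    compensation: Optional[str] = None
    ledger_ttl_days: Optional[int] = None

    def missing_answers(self):
        """List the questions a mutating tool has left unanswered."""
        if not self.mutating:
            return []
        required = (
            "operation_key_fields",
            "receipt_field",
            "retry_safe_statuses",
            "compensation",
            "ledger_ttl_days",
        )
        return [name for name in required if not getattr(self, name)]


search = ToolContract(name="search_docs", mutating=False, owner_service="knowledge")
refund = ToolContract(
    name="refund",
    mutating=True,
    owner_service="billing",
    operation_key_fields=("tenant_id", "payment_id", "reason_code"),
    receipt_field="refund_id",
    retry_safe_statuses=("failed",),
    compensation="manual_reversal",
    ledger_ttl_days=90,
)
half_done = ToolContract(name="send_email", mutating=True, owner_service="support")
```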

&lt;p&gt;This is where &lt;a href="https://focused.io/lab/mcp-is-packaging-agent-operable-interfaces-are-the-product" rel="noopener noreferrer"&gt;MCP and other tool packaging efforts&lt;/a&gt; need to grow up to support tools that agents use in production. Such interfaces are more than packaging. They must be agent-operable: typed, permissioned, inspectable, retryable, and owned by a service. This is the real product, and it is a far cry from a mere interface for the agent to discover and call a tool.&lt;/p&gt;

&lt;p&gt;A tool registry that simply says a tool exists is table stakes. A registry that says a write tool mutates customer billing, requires workload identity, lists the operation-key fields, emits a specific external receipt, and pages the service owner on ambiguous completion starts to look like production infrastructure.&lt;/p&gt;

&lt;p&gt;Boring. Also useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Runtime Should Refuse Unsafe Writes
&lt;/h2&gt;

&lt;p&gt;Ledger policies for mutating tools run the show.&lt;/p&gt;

&lt;p&gt;Read-only tools (retrieval, ranking, summarization, classification) remain lightweight. Write tools charge cards or email customers, so they follow a different set of rules. For write tools the runtime should require a ledger policy before registration. The tool owner supplies the operation-key builder, receipt parser, retry rules, and compensation metadata. The runtime supplies the reservation, status transitions, trace joining, and audit events. The rest of the orchestration layer checks the side-effect ledger before running the tool and after it fails. The eval harness tests the duplicate paths for the tool. The on-call team can see stuck &lt;code&gt;in_progress&lt;/code&gt; rows before the customers do.&lt;/p&gt;
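
&lt;p&gt;The registration rule can be sketched as a registry that refuses a write tool without a complete ledger policy. &lt;code&gt;ToolRegistry&lt;/code&gt;, &lt;code&gt;UnsafeToolError&lt;/code&gt;, and the policy part names are hypothetical and assume the policy arrives as a plain dict.&lt;/p&gt;

```python
class UnsafeToolError(Exception):
    """A mutating tool was registered without a complete ledger policy."""


# Parts the tool owner has to supply (names are assumptions).
REQUIRED_POLICY_PARTS = (
    "operation_key_builder",
    "receipt_parser",
    "retry_rules",
    "compensation",
)


class ToolRegistry:
    def __init__(self):
        self.tools = {}

    def register(self, name, func, mutating, ledger_policy=None):
        policy = ledger_policy or {}
        if mutating:
            missing = [part for part in REQUIRED_POLICY_PARTS if part not in policy]
            if missing:
                raise UnsafeToolError(name + " is missing: " + ", ".join(missing))
        self.tools[name] = {"func": func, "mutating": mutating, "policy": policy}


registry = ToolRegistry()

# Read-only tools register without ceremony.
registry.register("search_docs", lambda query: [], mutating=False)

# A write tool without a policy is refused at registration time.
try:
    registry.register("send_email", lambda to, body: None, mutating=True)
    refusal = None
except UnsafeToolError as exc:
    refusal = str(exc)
```

&lt;p&gt;Failing at registration keeps the unsafe path from ever reaching the model loop, rather than discovering the gap on the first duplicate.&lt;/p&gt;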

&lt;p&gt;See &lt;a href="https://focused.io/lab/langgraph-agent-error-handling-production" rel="noopener noreferrer"&gt;LangGraph Agent Error Handling in Production&lt;/a&gt;. Handling errors in tools called by an agent is more than handling the exceptions raised when the tool is called. The side effects that occur before the error surfaces, especially around a timeout, are the real problem the error handling has to address. The ledger is where the system goes looking for evidence.&lt;/p&gt;

&lt;p&gt;That last point matters. Agents can keep going after an error has occurred. But in production, continuing can be reckless.&lt;/p&gt;

&lt;h2&gt;
  
  
  Own the Receipt
&lt;/h2&gt;

&lt;p&gt;The gold rush version of AI agent orchestration wants better planners, bigger context windows, and more tools. Fine. Those help.&lt;/p&gt;

&lt;p&gt;The production version needs a boring table that answers whether a tool call already did the thing.&lt;/p&gt;

&lt;p&gt;That table won't demo well. Nobody cheers for a simple unique index on &lt;code&gt;(tool_name, operation_key)&lt;/code&gt;. But that's exactly what this table is. And it will save a team from having to refund, email, provision, delete and apologize (for the mysterious model) twice.&lt;/p&gt;

&lt;p&gt;The model can be probabilistic. The side-effect boundary cannot.&lt;/p&gt;

&lt;p&gt;Own the receipt.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Agentic AI Implementation Runs Through Change Control | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Sun, 17 May 2026 21:13:16 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/agentic-ai-implementation-runs-through-change-control-focused-labs-37pi</link>
      <guid>https://dev.to/focused_dot_io/agentic-ai-implementation-runs-through-change-control-focused-labs-37pi</guid>
      <description>&lt;p&gt;Agentic AI implementation has been mis-sold. People compare it to software enablement, but that comparison breaks when the agent can change a workflow.&lt;/p&gt;

&lt;p&gt;The agent approves a refund, opens an incident, updates a customer record, begins onboarding for a new customer, or escalates a support ticket. At that point a training calendar and a Slack message are not enough for a rollout plan.&lt;/p&gt;

&lt;p&gt;It needs a change record.&lt;/p&gt;

&lt;p&gt;Enterprise AI adoption has a naming problem. Work ‘adoption’ gets viewed through the same lens as software ‘usage’. The rollout is framed in terms of seats, office hours, and examples of how to format a prompt, followed by waiting for it to kick in. But then the work actually gets executed through an agent that in turn changes a workflow.&lt;/p&gt;

&lt;p&gt;The system has entered the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/worklab/work-trend-index/agents-human-agency-and-the-opportunity-for-every-organization" rel="noopener noreferrer"&gt;Microsoft's 2026 Work Trend Index&lt;/a&gt; frames this shift as an operating-model problem. WorkLab analysis finds that employees may be ready for AI, while the systems around work are not. Agents that approve requests, open incidents, and change customer records sit inside those systems.&lt;/p&gt;

&lt;p&gt;That changes the implementation roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rollout Surface Changed
&lt;/h2&gt;

&lt;p&gt;Agents behave differently from a chat tool. An agent is released through a system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-opens-its-full-system-of-action-to-every-AI-Agent-in-the-enterprise/default.aspx" rel="noopener noreferrer"&gt;ServiceNow announced Action Fabric at Knowledge 2026&lt;/a&gt;, explicitly opening its governed system of action to agents. The MCP Server gives agents access to workflows, playbooks, approvals, catalog requests, and business rules, all of which run through identity verification, granted permissions, and audit trails.&lt;/p&gt;

&lt;p&gt;The enterprise agent problem shows up when an agent moves from the edge of a process, where it summarizes work done, to inside the process, where it makes a move.&lt;/p&gt;

&lt;p&gt;The first question for the enterprise is no longer "who should have access to this tool" but rather "what change is this tool going to drive for the business, and who is going to own that change". The owners are the teams responsible for the production systems, regulatory compliance, promises to customers, incident response, and the overall economics of the workflows the agent is inserted into.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.langchain.com/blog/previewing-interrupt-2026-agents-at-enterprise-scale" rel="noopener noreferrer"&gt;LangChain Interrupt 2026 preview&lt;/a&gt; captures the enterprise reality well: the initial excitement of having agents prove themselves in production quickly gives way to questions about the team, tooling, and infrastructure required to support agents that are no longer ‘proof-of-concept’ work. My experience with clients has been the same: excitement with the first useful agent, overlapping work with the second, and ownership problems with the third.&lt;/p&gt;

&lt;p&gt;Fine. That is the good version.&lt;/p&gt;

&lt;p&gt;The bad version is quiet. A team enables an agent with a service account, an admin token, and a dashboard that nobody looks at. It looks good during the demo. Then a field name changes in a source system, a policy document drifts, an approval queue gets renamed, a customer edge case surfaces, and the agent keeps moving. Nobody owns the change because nobody treated the agent as a change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm67oni2b700t3f1emvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm67oni2b700t3f1emvd.png" alt="Agent rollout path from prototype to change record, sandbox, canary, and production" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rollout path gets safer when every promotion carries evidence, scope, and a rollback owner.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Change Record Is the Agent Spec
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.atlassian.com/itsm/change-management" rel="noopener noreferrer"&gt;Atlassian describes IT change management&lt;/a&gt; as planning, reviewing, approving, and deploying changes to services with as little disruption as possible. Boring. Also the right object.&lt;/p&gt;

&lt;p&gt;Agentic AI needs the same boring object.&lt;/p&gt;

&lt;p&gt;A change record should specify which human role loses or gains work, which systems the agent can interact with, which actions require approval, which actions are forbidden, which metrics define harm, which traces prove behavior, and which owner can roll back changes made by the agent when something goes wrong.&lt;/p&gt;
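
&lt;p&gt;A change record with those fields can double as an enforceable permission envelope. The sketch below is one possible shape, not a standard: &lt;code&gt;ChangeRecord&lt;/code&gt;, its field names, and the example systems and actions are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical change record for one agent in one workflow."""

    agent: str
    workflow_owner: str
    rollback_owner: str
    systems: frozenset
    approval_required: frozenset
    forbidden: frozenset
    harm_metrics: tuple = ()

    def decide(self, system, action):
        """Classify a proposed action as deny, needs_approval, or allow."""
        if system not in self.systems or action in self.forbidden:
            return "deny"
        if action in self.approval_required:
            return "needs_approval"
        return "allow"


record = ChangeRecord(
    agent="support-refund-agent",
    workflow_owner="support-operations",
    rollback_owner="billing-oncall",
    systems=frozenset({"ticketing", "billing"}),
    approval_required=frozenset({"issue_refund"}),
    forbidden=frozenset({"delete_customer"}),
    harm_metrics=("duplicate_refunds", "reopened_tickets"),
)

decisions = [
    record.decide("ticketing", "add_note"),
    record.decide("billing", "issue_refund"),
    record.decide("billing", "delete_customer"),
    record.decide("crm", "update_contact"),
]
```

&lt;p&gt;Anything outside the listed systems is denied by default, which keeps the record honest about what the agent can touch.&lt;/p&gt;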

&lt;p&gt;Rather than going straight to a typical roadmap of discovery, pilot, platform choice, training, and rollout, I would put a change-control spine through each step of that typical roadmap.&lt;/p&gt;

&lt;p&gt;Discovery should map workflows rather than list all the cool things an AI can do, so that “summarize account notes” and “renew an enterprise contract” land in different risk classes. Pilot work should run in a sandbox that is production-like in its data and failure handling. A limited rollout should constrain the agent’s authority before the agent reaches more people. Production should have a clear owner, a defined retention period for the agent’s traces so they can be evaluated for performance, and a clear path to resolve an incident.&lt;/p&gt;

&lt;p&gt;This keeps the agent’s actual permissions from being discovered during an incident review.&lt;/p&gt;

&lt;p&gt;Embedding service ownership into the organization’s way of working mitigates these implementation dangers through contracts between teams, a sandboxed deployment, and an appropriate rollout sequence. The AI team owns what it knows best: the evaluation harness, model routing, and deployment mechanics. The business process owner owns the workflow semantics. Security owns the permission envelope, operations owns production response, and legal or compliance owns the consequences of non-compliance.&lt;/p&gt;

&lt;p&gt;Shared ownership is annoying. So is production.&lt;/p&gt;

&lt;p&gt;This is why I keep harping on service ownership for agent work. &lt;a href="https://focused.io/lab/langgraph-enterprise-agent-development" rel="noopener noreferrer"&gt;LangGraph for enterprise agent development&lt;/a&gt; made the runtime version of this point. Production agents have operational contracts. A clever graph is not enough. It can fall apart after the first model swap, policy change, or integration outage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre0n2yfo5djf2k3juxli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre0n2yfo5djf2k3juxli.png" alt="Change record connecting workflow owner, permission envelope, eval gate, telemetry, and rollback path" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The change record is the handoff object between business process, agent runtime, security, and operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metrics Already Exist
&lt;/h2&gt;

&lt;p&gt;No need for another exotic agent scorecard. The software delivery world already has the basic bones. &lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;DORA's software delivery metrics&lt;/a&gt; track change lead time, deployment frequency, failed deployment recovery time, change fail rate, and deployment rework rate.&lt;/p&gt;

&lt;p&gt;Change lead time: the time from proposing an agent behavior to approving it for production. Deployment frequency: the rate at which agent changes are safely promoted to production, such as a change to a tool registry, policy pack, memory schema, retrieval index, or workflow. Failed deployment recovery time: the time to reverse an agent change, such as reverting a prompt or policy, removing a permission granted to an agent, or switching back to a previous workflow. Change fail rate: the percentage of agent changes that require intervention.&lt;/p&gt;
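
&lt;p&gt;These mappings reduce to simple arithmetic over a log of agent changes. The sketch below assumes such a log exists with the fields shown; the sample records, field names, and the 28-day window are invented for illustration and are not DORA definitions.&lt;/p&gt;

```python
from datetime import datetime
from statistics import median

# Hypothetical log of agent changes promoted during a 28-day window.
changes = [
    {"proposed": datetime(2026, 5, 1, 9), "approved": datetime(2026, 5, 3, 9),
     "needed_intervention": False, "recovery_minutes": None},
    {"proposed": datetime(2026, 5, 4, 9), "approved": datetime(2026, 5, 5, 9),
     "needed_intervention": False, "recovery_minutes": None},
    {"proposed": datetime(2026, 5, 10, 9), "approved": datetime(2026, 5, 13, 9),
     "needed_intervention": True, "recovery_minutes": 30},
    {"proposed": datetime(2026, 5, 20, 9), "approved": datetime(2026, 5, 21, 9),
     "needed_intervention": False, "recovery_minutes": None},
]


def change_lead_time_hours(log):
    # Median time from proposing a behavior to approving it for production.
    return median(
        (c["approved"] - c["proposed"]).total_seconds() / 3600 for c in log
    )


def deployments_per_week(log, window_days):
    return len(log) / (window_days / 7)


def change_fail_rate(log):
    # Share of changes that required intervention.
    return sum(1 for c in log if c["needed_intervention"]) / len(log)


def recovery_time_minutes(log):
    # Median time to reverse a change, over the changes that needed it.
    return median(
        c["recovery_minutes"] for c in log if c["recovery_minutes"] is not None
    )


summary = {
    "lead_time_hours": change_lead_time_hours(changes),
    "deployments_per_week": deployments_per_week(changes, window_days=28),
    "change_fail_rate": change_fail_rate(changes),
    "recovery_minutes": recovery_time_minutes(changes),
}
```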

&lt;p&gt;This would all be nice and clean if an agent’s behavior failed in a binary way, like an exception being thrown. But it does not. It produces a technically correct answer that just happens to be wrong in the context of the workflow. Which is why the failure is behavioral, not binary, and is invisible to a deployment platform that only knows how to scream when a process fails to start.&lt;/p&gt;

&lt;p&gt;So the metric needs evidence.&lt;/p&gt;

&lt;p&gt;A production agent rollout should collect traces of decisions (tool calls, approval steps, and so on), rejected actions (for example, because of insufficient privileges), user-corrected mistakes, and any failures of the eval routine. Add business outcomes to that list and the team has a release story with evidence behind it. Without that evidence, the change board is approving “stuff” with a slightly nicer UI.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt; comes in. Testing cannot be relegated to downstream QA when an agent can affect a live workflow. Product, engineering, operations, security, and enterprise systems teams should be able to run the test. Ideally, they should understand it, too. The eval suite tests behavioral regressions. Traces reveal runtime drift. Approval logs expose authority escalation. Business metrics surface harm the model never sees.&lt;/p&gt;

&lt;p&gt;All of them are part of the change.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Roadmap Is a Promotion Ladder
&lt;/h2&gt;

&lt;p&gt;Start with read-only assistance. The agent assists with summarization, search, templates, classification, and process explanation. That finds workflow fit and failure modes without giving the system authority to act.&lt;/p&gt;

&lt;p&gt;Next, the team gradually grants more permission inside well-defined boundaries: completing low-dollar refunds, updating internal tickets, sending non-regulated customer messages, changing low-risk account fields, deploying to test environments. The goal is to prove bounded authority before scope expands.&lt;/p&gt;

&lt;p&gt;This promotion path pays for itself by preventing a business process from being secretly screwed by an AI that nobody can explain.&lt;/p&gt;

&lt;p&gt;Make each step on the promotion ladder concrete. Human-in-the-loop needs a named reviewer, a review surface, override power, correction capture, and a rule for when the agent stops asking. Same for guardrails, observability, and governance. Each word should collapse to an owner, system, threshold, and audit trail.&lt;/p&gt;
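
&lt;p&gt;A promotion gate along these lines can be sketched in a few lines. The rung names and the required fields are assumptions chosen to mirror the text (evidence, scope, a rollback owner, a named reviewer); a real ladder would be defined by the teams that own the workflow.&lt;/p&gt;

```python
# Hypothetical rungs, from least to most authority.
LADDER = ("read_only", "bounded_write", "expanded_write")

# What every promotion request has to carry (assumed field names).
REQUIRED = ("evidence", "scope", "rollback_owner", "reviewer")


def promote(current, request):
    """Return (rung, missing). The rung only advances when nothing is missing."""
    missing = [name for name in REQUIRED if not request.get(name)]
    if missing:
        return current, missing
    index = LADDER.index(current)
    return LADDER[min(index + 1, len(LADDER) - 1)], []


complete = promote(
    "read_only",
    {
        "evidence": "eval-report-17",
        "scope": "low-dollar refunds",
        "rollback_owner": "billing-oncall",
        "reviewer": "support-lead",
    },
)
incomplete = promote(
    "read_only",
    {"evidence": "eval-report-17", "scope": "low-dollar refunds"},
)
```

&lt;p&gt;The incomplete request stays on its current rung and names the gaps, so the conversation is about a missing owner rather than a vague sense that the agent is not ready.&lt;/p&gt;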

&lt;p&gt;&lt;a href="https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/tech-forward/state-of-ai-trust-in-2026-shifting-to-the-agentic-era" rel="noopener noreferrer"&gt;McKinsey's 2026 AI trust survey&lt;/a&gt; is useful here because it separates adoption from maturity. Strategy, governance, and controls for agentic AI remain the weak spots. Security and risk concerns remain the main barrier to scaling. Which tracks.&lt;/p&gt;

&lt;p&gt;Boring. Beautiful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Own the Change
&lt;/h2&gt;

&lt;p&gt;If an organization treats an enterprise AI agent as just another tool to roll out to more people with the same enthusiasm, the implementation will fail shortly after its first collisions with the organization’s permission models, its customers’ reporting structures, its compliance requirements, its process exceptions, and its sheer number of customers.&lt;/p&gt;

&lt;p&gt;I have no particular interest in helping to recreate CAB theater for enterprise agents. Meetings with 8 approvers (or more!) for a password reset workflow that they cannot even understand are a huge waste of time and effort. Yes, review is reasonable in regulated paths, but that should be the exception, not the rule. And it should be as trivial and technical as possible, ideally close to where the work is actually being done (in this case, a simple approval in the workflow UI).&lt;/p&gt;

&lt;p&gt;Put the agent change record next to the PR, the eval report, the trace sample, the permission diff, and the rollback plan. Have the workflow owner sign the semantics; security sign the authority; engineering sign the runtime; and operations sign the incident path.&lt;/p&gt;

&lt;p&gt;Then ship.&lt;/p&gt;

&lt;p&gt;That is what an AI implementation roadmap needs now: a promotion path for systems that can act.&lt;/p&gt;

&lt;p&gt;Production always gets weird.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Agent Benchmark Scores Are Measuring the Harness, Not the Model | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Sun, 17 May 2026 21:13:13 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/agent-benchmark-scores-are-measuring-the-harness-not-the-model-focused-labs-145l</link>
      <guid>https://dev.to/focused_dot_io/agent-benchmark-scores-are-measuring-the-harness-not-the-model-focused-labs-145l</guid>
      <description>&lt;p&gt;The difference between the leading agentic coding models is much smaller than the difference between two distinct configurations of a single model on the same benchmark. &lt;a href="https://www.anthropic.com/engineering/infrastructure-noise" rel="noopener noreferrer"&gt;Anthropic just quantified it&lt;/a&gt;: a six-percentage-point gap on Terminal-Bench 2.0 between the most- and least-resourced setups, p &amp;lt; 0.01. Same model. Same task set. Same harness. The only variable was the resource budget given to the pod.&lt;/p&gt;

&lt;p&gt;This is larger than the spread between most frontier models on the public leaderboard.&lt;/p&gt;

&lt;p&gt;The number the enterprise picked as "the best agent model" is mostly the amount of CPU and RAM that the eval team assigned to the pod for the test. Welcome to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark is not what the benchmark claims to measure
&lt;/h2&gt;

&lt;p&gt;Static evals score a model's output directly. Agentic coding evals score a model in a runtime, and the runtime itself decides whether a container gets OOM-killed for a transient memory spike, whether a &lt;code&gt;pip install&lt;/code&gt; command finishes, whether a test subprocess ever returns a result. Two agents at different resource budgets will be taking different tests.&lt;/p&gt;

&lt;p&gt;Anthropic ran Terminal-Bench 2.0 across six resource configurations, from strict enforcement of the per-task specs all the way to completely uncapped. They observed 5.8% of tasks failing on pod errors unrelated to model capability at strict enforcement, compared to 0.5% at uncapped. Success scores at 1x through 3x were largely within noise (p=0.40), since the tasks that crashed under tight limits were mostly ones the agent was going to fail anyway. Past 3x, however, success scores climbed faster than infra errors declined. The extra headroom gave the agent room to attempt approaches that only work with more generous allocations, such as installing several large packages at once, running memory-hungry test suites, or spawning subprocesses that take extra time to complete.&lt;/p&gt;

&lt;p&gt;The benchmark shifted. Previously it was measuring how capable the model was. Now it is measuring how much budget the harness gives the agent to brute-force the answer.&lt;/p&gt;

&lt;p&gt;This is not a bug in Terminal-Bench. It is the nature of agentic evaluation: the runtime is not a passive container, it is an active part of the problem-solving process.&lt;/p&gt;

&lt;p&gt;When the benchmark does not include the exact hardware and resource configuration, it ships a number that can't be compared to anyone else's number. Nobody is measuring the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model is mostly plumbing
&lt;/h2&gt;

&lt;p&gt;Harrison Chase has been making a variant of this argument for about a year. The agent is not the model. The agent is the harness, memory, tools, prompts, retries, state machines, guardrails, and context windows, with a model call buried somewhere in there.&lt;/p&gt;

&lt;p&gt;The Anthropic data is the experimental confirmation of the harness sitting at the heart of the agent. Flip the pod resource limits and the "same" agent is a different agent inhabiting a wildly different reality. Flip the sandbox provider and the same leaderboard score means a completely different thing. The vast majority of the decisions that go into building an agent are about tuning the harness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/AnnaBernad50664/status/2046626400296174052" rel="noopener noreferrer"&gt;Anna Bernad posted a Twitter thread&lt;/a&gt; last week after looking at 36 production agent harnesses. Her take is far sharper than mine.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Every harness I studied that actually ships does the same underlying move, and guess, it's not separation. It's making the context describe a different room."  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the context reads as "teammate shipped work, I'm the reviewer, pipeline wants green," the agent soft-approves with a minor note. Not because the model is bad. The agent is trying to fit the response to the context, and soft approval is the only way to complete the pattern.&lt;/p&gt;

&lt;p&gt;The harness is the room. The model is the tenant.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does to enterprise procurement
&lt;/h2&gt;

&lt;p&gt;By the time a client engages us, the agent's performance has usually fallen short of what the benchmark promised. The model selected for the agent is sound. The harness the model has to operate through is what holds the application back. The runtime may not give the tools enough compute to act effectively. The retry mechanism built to improve throughput masks critical errors until it is far too late. The context window is being consumed by boilerplate system prompts the procurement team didn't know existed.&lt;/p&gt;

&lt;p&gt;The enterprise then concludes "AI doesn't work for us" and abandons the effort. The model vendor is blamed. Nobody audits the scaffold.&lt;/p&gt;

&lt;p&gt;Vendor benchmark claims don't have to be disbelieved by default, but they become pure marketing once they are packaged as an "eval score" for buyers to compare vendors. If the eval score is only reproducible on the vendor's Kubernetes cluster with their sandboxing solution and their machine resources, the score has no procurement value.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://x.com/LangChain/status/2046303329312227787" rel="noopener noreferrer"&gt;LangSmith Signal report this week&lt;/a&gt; puts billions of agent runs behind the month's trends. Anthropic grew 73% in users, gaining 39% of share. Gemini rose after the release of Gemini 3. OpenAI remained the largest at around 80% of volume but didn't move up or down. Those are usage numbers, not capability numbers. People are moving around based on what actually works in their harness, not based on what a leaderboard says.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to read a benchmark
&lt;/h2&gt;

&lt;p&gt;Three questions, in order.&lt;/p&gt;

&lt;p&gt;The first question is what the harness actually was. If the eval team doesn't publish the scaffold, retry policy, context budget, tool set, and resource configuration tradeoffs, the number is a picture of one run on their box and not comparable to anything.&lt;/p&gt;

&lt;p&gt;Second: what is the infra error rate? Anthropic reported 5.8% of Terminal-Bench 2.0 tasks failing on pod errors at strict enforcement, a 5x margin above the spread between most frontier models. An eval that doesn't separate "model failed" from "container got killed" introduces a lot of noise in the headline number.&lt;/p&gt;
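&lt;p&gt;One way to keep those two failure types apart is to label every failed run before computing the headline number. The sketch below is a minimal illustration, not how Terminal-Bench or Anthropic score runs: &lt;code&gt;TaskResult&lt;/code&gt;, &lt;code&gt;INFRA_REASONS&lt;/code&gt;, and the reason strings are hypothetical names invented for the example.&lt;/p&gt;

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical failure labels. A real harness would derive these from pod
# events and exit codes rather than hand-written strings.
INFRA_REASONS = {"oom_killed", "pod_evicted", "sandbox_timeout"}


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool
    failure_reason: Optional[str] = None  # None when the task passed


def summarize(results: list) -> dict:
    """Report the infra error rate next to the success rate."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    counts = Counter(
        "passed" if r.passed
        else "infra_error" if r.failure_reason in INFRA_REASONS
        else "model_failure"
        for r in results
    )
    # Runs the infrastructure allowed to finish, whether they passed or not.
    scored = total - counts["infra_error"]
    return {
        "success_rate": counts["passed"] / total,
        "infra_error_rate": counts["infra_error"] / total,
        "model_failure_rate": counts["model_failure"] / total,
        "success_rate_excluding_infra": counts["passed"] / scored if scored else 0.0,
    }
```

&lt;p&gt;Publishing &lt;code&gt;infra_error_rate&lt;/code&gt; alongside &lt;code&gt;success_rate&lt;/code&gt; lets a reader see how much of the headline number the infrastructure decided.&lt;/p&gt;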

&lt;p&gt;Third: does my production environment resemble the eval environment? If the eval runs uncapped on a data-center GPU cluster, the score is going to have almost no predictive value for me, since my agent runs in a sandboxed environment such as a Lambda function with a 512MB memory cap. An agent can win the competition by brute-forcing the space of &lt;code&gt;scikit-learn&lt;/code&gt; installs and then fail silently at ship time because it consumes too much memory in the production environment. A lean, efficient agent that loses the benchmark will ship just fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do instead
&lt;/h2&gt;

&lt;p&gt;Build the harness first. Run the model last.&lt;/p&gt;

&lt;p&gt;The analysis has to translate to production. Production tools. Production retry budget (or lack thereof). Production memory store. Production prompt scaffolding. Production runtime limits. Wire it up with &lt;a href="https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded" rel="noopener noreferrer"&gt;observability that traces trajectories through the system, not individual LLM calls&lt;/a&gt;. Then swap different models in and see what changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Shape of an internal model bake-off in 2026.
# LangChain 1.x, LangGraph 1.1.9, LangSmith.
# PRODUCTION_TOOLS, PRODUCTION_SYSTEM_PROMPT, PROD_PII_CONFIG, PROD_POLICY,
# ProductionContext, and the three evaluators are assumed to be defined elsewhere.
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents.middleware&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanInTheLoopMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PIIMiddleware&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith.evaluation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;

&lt;span class="n"&gt;CANDIDATES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-5.1-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google:gemini-3-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Same tools, same prompt, same retry budget, same memory store.
&lt;/span&gt;    &lt;span class="c1"&gt;# The ONLY variable is the model string.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PRODUCTION_TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PRODUCTION_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;middleware&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;PIIMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;PROD_PII_CONFIG&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanInTheLoopMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interrupt_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROD_POLICY&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;context_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProductionContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-trajectories-q2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CANDIDATES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;trajectory_match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# compares actual tool-call path to reference
&lt;/span&gt;            &lt;span class="n"&gt;tool_call_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# did the agent use the right tool at the right time
&lt;/span&gt;            &lt;span class="n"&gt;final_output_rubric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# LLM-as-judge on the end state
&lt;/span&gt;        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;harness-bakeoff-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All tests run using the same harness, the same tools, one variable at a time. The goal is to select the model that actually works within the production stack, not the one that earned points on a public leaderboard running on a Kubernetes cluster someone else had tuned.&lt;/p&gt;

&lt;p&gt;This is where the engineering work is. It is what we mean when we say &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;the agent harness is where the engineering work lives now&lt;/a&gt;, and it is why a lot of clients call us. The model picker is not the problem. The harness design is the problem. The eval infrastructure is the problem. The trajectory observability is the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harder truth
&lt;/h2&gt;

&lt;p&gt;Tight resource limits favor agents that write lean, efficient code quickly. Generous limits favor agents that can exploit every resource they are given. Both are worth testing for, and both correspond to realistic scenarios. Neither can fairly be collapsed into a single number on a leaderboard.&lt;/p&gt;

&lt;p&gt;Many of the agents we deploy to enterprises run on some sort of strict budget for resources such as memory and CPU. Beyond these general limits, there are often specific restrictions on things like subprocess runtime and the number of times an API can be called within a window, largely because of cost. The model that wins with unlimited resources is a different model than the one that wins under strict limits.&lt;/p&gt;

&lt;p&gt;Pick the model that performs in the harness. Own the harness. Measure the trajectory. The benchmark is not the product.&lt;/p&gt;

&lt;p&gt;The harness is the product.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>AI Agent Authentication Starts With Workload Identity | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 13 May 2026 14:55:56 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/ai-agent-authentication-starts-with-workload-identity-focused-labs-418</link>
      <guid>https://dev.to/focused_dot_io/ai-agent-authentication-starts-with-workload-identity-focused-labs-418</guid>
      <description>&lt;p&gt;AI agent authentication starts when the system can answer which actor is allowed to make a tool call.&lt;/p&gt;

&lt;p&gt;The model can propose the action. The runtime has to attach authority to it.&lt;/p&gt;

&lt;p&gt;Most teams start with the fastest answer: an API key in an environment variable. The agent reaches Salesforce, GitHub, Jira, Snowflake, Stripe, whatever system makes the first useful proof feel real, and everyone moves on.&lt;/p&gt;

&lt;p&gt;That proof matters. It shows the agent can reach the systems where work actually happens. It also hides the first product decision: who is acting when the tool call leaves the runtime?&lt;/p&gt;

&lt;p&gt;The agent gets memory. The agent runs in the background. The agent forks into subagents. The agent retries failed operations. The agent calls tools after the user has walked away. The agent lands in an enterprise workflow where the work has value, the logs have value, and breaking something has a consequence.&lt;/p&gt;

&lt;p&gt;A shared API key starts as configuration. Then it quietly becomes the identity of the agent.&lt;/p&gt;

&lt;p&gt;An ugly place to stumble into by accident.&lt;/p&gt;

&lt;h2&gt;
  
  
  The secret becomes the actor
&lt;/h2&gt;

&lt;p&gt;Early security models for agents tend toward good vibes with a bearer token. The prompt gives instructions. The tool schema lists calls. Hard-coded secrets in the runtime decide what actually gets done based on the input, the agent, and whatever authority those secrets carry.&lt;/p&gt;

&lt;p&gt;The secret wins.&lt;/p&gt;

&lt;p&gt;If the same key can read every customer record, submit refunds, update tickets, and write to production data, the agent has all of those powers. Carefulness in the prompt is theater at that point. The tool description can say those powers apply only when appropriate. The audit log will still show one credential able to perform a pile of different tasks.&lt;/p&gt;

&lt;p&gt;There is already a category for this outside agents: &lt;a href="https://owasp.org/www-project-non-human-identities-top-10/" rel="noopener noreferrer"&gt;OWASP's Non-Human Identities Top 10&lt;/a&gt;. Production applications identify themselves as non-human identities. Agents are adding themselves to that growing list of stranger workloads, running differently than normal services, but still requiring access to systems and data.&lt;/p&gt;

&lt;p&gt;For me, the important step is naming the agent as a workload, because then the architecture gets less magical and more useful.&lt;/p&gt;

&lt;p&gt;Workloads have identities. Workloads can request scoped credentials for those identities. A workload can be denied a credential. A workload can rotate credentials. A workload can leave an audit trail that survives the model, the prompt, and the v2 or v3 abstraction barrier the team is currently working around.&lt;/p&gt;

&lt;p&gt;Baseline authentication for production AI agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke2gp8x404se04fz457.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke2gp8x404se04fz457.png" alt="A runtime identity boundary showing an agent requesting scoped credentials from an identity broker before calling external systems." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The runtime should issue tool-specific credentials instead of letting the agent carry a shared key everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workload identity is the boring answer
&lt;/h2&gt;

&lt;p&gt;This part is old. Good.&lt;/p&gt;

&lt;p&gt;Kubernetes already considers service accounts to be identities of processes running in Pods, and the current docs describe &lt;a href="https://kubernetes.io/docs/concepts/security/service-accounts/" rel="noopener noreferrer"&gt;short-lived, automatically rotating ServiceAccount tokens&lt;/a&gt; issued through the TokenRequest API. SPIFFE generalizes that into workload identity documents, including &lt;a href="https://spiffe.io/docs/latest/spiffe-about/spiffe-concepts/" rel="noopener noreferrer"&gt;short-lived X.509 and JWT SVIDs&lt;/a&gt; that a workload can use to authenticate itself to other workloads.&lt;/p&gt;

&lt;p&gt;Cloud platforms are heading in the same general direction. AWS STS can &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html" rel="noopener noreferrer"&gt;issue temporary security credentials&lt;/a&gt; after a workload has identified itself using OpenID Connect. Google Cloud Workload Identity Federation allows external workloads to &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;access Google Cloud resources without service account keys&lt;/a&gt;. Azure managed identity docs describe workload identities as &lt;a href="https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview" rel="noopener noreferrer"&gt;machine and non-human identities&lt;/a&gt; associated with compute resources.&lt;/p&gt;

&lt;p&gt;The industry knows how to keep long-lived secrets out of the hot path. It just keeps giving agents interfaces that make the old mistake easy.&lt;/p&gt;

&lt;p&gt;A developer writes a tool wrapper. The tool wrapper needs credentials. The fastest way to configure it is to add an API key to an environment variable and add a TODO to remove it later. The TODO gets pushed to production because now the agent answers support tickets, reconciles invoices, or looks at CI.&lt;/p&gt;

&lt;p&gt;I've worked with teams who reviewed the model, tuned prompts, drew diagrams for tool selection, created a few secrets in deploy config, and crossed their fingers that the tool descriptions would shore it all up.&lt;/p&gt;

&lt;p&gt;They are not enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delegation is the missing primitive
&lt;/h2&gt;

&lt;p&gt;In many applications, the agent should rarely hold the credential it uses to act.&lt;/p&gt;

&lt;p&gt;Put an identity assertion in the flow. This agent. This tenant. This user context if present. This policy version. This tool request. This approval state. That assertion is exchanged for a credential only when the action needs one.&lt;/p&gt;

&lt;p&gt;OAuth already has a standard for exactly this shape. &lt;a href="https://www.rfc-editor.org/rfc/rfc8693" rel="noopener noreferrer"&gt;RFC 8693 defines token exchange&lt;/a&gt;: a client presents one security token to an authorization server and receives another, issued for a different audience or context. In the agent case, the model proposes an action, the runtime checks policy, the broker issues a credential for that action and tool context, the call happens, and the credential dies.&lt;/p&gt;
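&lt;p&gt;As a rough sketch, the exchange request itself is just a form post. The parameter names and URNs below come from RFC 8693. The function names and the idea that the runtime's identity assertion is a JWT are assumptions made for the example, and real audience and scope values depend on the broker.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Grant type and token type identifiers defined by RFC 8693.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_token_exchange_form(assertion_jwt: str, audience: str, scope: str) -> dict:
    """Build the form fields for an RFC 8693 token exchange request.

    assertion_jwt is the runtime's identity assertion (assumed to be a JWT).
    audience and scope name the one tool and operation the credential is for.
    """
    return {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": assertion_jwt,
        "subject_token_type": JWT_TOKEN_TYPE,
        "requested_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,
        "scope": scope,
    }


def encode_form(form: dict) -> str:
    """Encode the fields as an application/x-www-form-urlencoded body."""
    return urlencode(form)
```

&lt;p&gt;The runtime would POST that body to the broker's token endpoint. A successful response carries &lt;code&gt;access_token&lt;/code&gt;, &lt;code&gt;issued_token_type&lt;/code&gt;, &lt;code&gt;token_type&lt;/code&gt;, and usually &lt;code&gt;expires_in&lt;/code&gt;, which is where the short lifetime gets enforced.&lt;/p&gt;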

&lt;p&gt;It does not expire after a quarter. It does not expire after someone remembers to rotate it. It expires because the system puts expiration in the path.&lt;/p&gt;

&lt;p&gt;That changes the damage pattern. A compromised tool wrapper no longer implies broad access to every downstream system. A prompt injection has to cross approval, run, tenant, and policy boundaries. A subagent that escapes its execution boundary cannot reuse credentials after the run, approval, or tenant context has expired.&lt;/p&gt;

&lt;p&gt;The agent is still useful. It just has to query through a production boundary that understands production concerns.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://focused.io/lab/2026-year-of-the-integrated-agent" rel="noopener noreferrer"&gt;integrated agents&lt;/a&gt; are valuable and dangerous at the same time. The valuable integrated agents do not live in a chatbot tab. They integrate with real systems. Once an agent is tied to real systems, authentication becomes product architecture rather than cleanup work hidden in deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The runtime owns the identity boundary
&lt;/h2&gt;

&lt;p&gt;A model provider should not own this boundary. A prompt should not own this boundary. A tool schema should not own this boundary.&lt;/p&gt;

&lt;p&gt;The runtime owns it because the runtime follows the whole path.&lt;/p&gt;

&lt;p&gt;It connects agent definitions to threads or runs, tenants, and identity information, including the user who initiated the work, whether the work is backgrounded, whether a human approved a risky step, which tool is being called, and which downstream credential is being requested. It can attach those facts to an identity assertion and make a policy decision before any assertion leaves the process.&lt;/p&gt;

&lt;p&gt;That policy decision can be boring and explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The refund tool can request a payment credential for the current tenant.&lt;/li&gt;
&lt;li&gt;A GitHub tool can request a write credential after CI has produced an eval pass.&lt;/li&gt;
&lt;li&gt;The Snowflake tool can request a read credential for one warehouse, one role, and one time window.&lt;/li&gt;
&lt;li&gt;A subagent can run with a delegated identity, but only with fewer capabilities than the parent run.&lt;/li&gt;
&lt;/ul&gt;
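&lt;p&gt;Rules like the ones above fit in a few lines of ordinary code. This is a minimal sketch under assumptions: &lt;code&gt;CredentialRequest&lt;/code&gt;, &lt;code&gt;decide&lt;/code&gt;, and &lt;code&gt;can_delegate&lt;/code&gt; are hypothetical names, and a real policy would also cover warehouses, roles, and time windows.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CredentialRequest:
    tool: str                  # e.g. "refund", "github", "snowflake"
    tenant: str                # tenant the credential is requested for
    run_tenant: str            # tenant the current run belongs to
    access: str                # "read" or "write"
    eval_passed: bool = False  # whether CI produced a passing eval


def decide(request: CredentialRequest) -> tuple:
    """Return (allowed, reason) so the reason can be recorded in the trace."""
    if request.tenant != request.run_tenant:
        return False, "credential tenant does not match run tenant"
    if request.tool == "github" and request.access == "write" and not request.eval_passed:
        return False, "write access requires a passing eval in CI"
    if request.tool == "snowflake" and request.access != "read":
        return False, "snowflake tool is read-only"
    return True, "allowed by policy"


def can_delegate(parent_caps: frozenset, child_caps: frozenset) -> bool:
    """A subagent may only receive a strict subset of its parent's capabilities."""
    return child_caps.issubset(parent_caps) and child_caps != parent_caps
```

&lt;p&gt;Returning the reason along with the decision is the point of the design: a denial should be as easy to find in the trace as an approval.&lt;/p&gt;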

&lt;p&gt;The list is not impressive, which is why it is powerful.&lt;/p&gt;

&lt;p&gt;This is also where &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration&lt;/a&gt; gets serious. A supervisor handing work to a subagent creates a delegation relationship along with the task description. The child process needs enough authority to perform the work at hand and no more. The audit log must reflect that chain of trust cleanly or troubleshooting becomes an exercise in futility.&lt;/p&gt;

&lt;p&gt;The worst setup is a swarm of agents all sharing the same service account. Simple enough to get going. Terrible when it comes time to debug an incident. Every action has been performed by the same principal, authenticated with the same key, and observed through the same useless blur.&lt;/p&gt;

&lt;p&gt;The incident has no useful actor. Just a shared key with a long memory and no accountability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qaovc23fundj7hk9akt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qaovc23fundj7hk9akt.png" alt="A token lifecycle showing an agent run creating an identity assertion, exchanging it for a scoped token, calling a tool, writing audit evidence, and expiring the credential." width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Short-lived delegated credentials make the agent run, policy decision, tool call, and audit trail line up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audit follows identity
&lt;/h2&gt;

&lt;p&gt;Agent observability without identity is half a story.&lt;/p&gt;

&lt;p&gt;A trace for the agent step called &lt;code&gt;refund_customer&lt;/code&gt; can include latency, tool arguments, model output, retries, all visualized in a convenient span tree. Useful. Then someone asks who had authority to issue that refund, and the trace turns into archaeological excavation.&lt;/p&gt;

&lt;p&gt;The right trace shows the tool call connected to a principal. Not just a service account. A principal with an agent ID, run ID, tenant, user context, policy decision, credential scope, and expiration time.&lt;/p&gt;
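&lt;p&gt;A small sketch of what that principal could look like when attached to a span. The &lt;code&gt;Principal&lt;/code&gt; class, the &lt;code&gt;principal.&lt;/code&gt; attribute prefix, and the field formats are assumptions for illustration. The fields themselves are the ones listed above.&lt;/p&gt;

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    agent_id: str
    run_id: str
    tenant: str
    user_context: Optional[str]  # None for background runs with no user present
    policy_decision: str         # e.g. the policy version and outcome
    credential_scope: str
    expires_at: str              # ISO 8601 timestamp, UTC


def span_attributes(principal: Principal) -> dict:
    """Flatten the principal into string attributes for a trace span.

    Fields that are None are omitted instead of being written as the
    string "None".
    """
    return {
        f"principal.{key}": str(value)
        for key, value in asdict(principal).items()
        if value is not None
    }
```

&lt;p&gt;Flat key/value attributes are a deliberate choice here: most tracing backends can filter on them, so "every refund issued under this policy version" becomes a query instead of a dig.&lt;/p&gt;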

&lt;p&gt;This is what allows a team to answer questions after the tool call has done real work.&lt;/p&gt;

&lt;p&gt;Who granted access? What user context did it use? What broker generated the credential? What version of policy allowed it? What downstream resource accepted it? What subagent inherited it? Can that credential be used for something else?&lt;/p&gt;

&lt;p&gt;Those questions determine whether there is a real postmortem or just hand waving about the agent doing something weird.&lt;/p&gt;

&lt;p&gt;The same principle applies to testing. In &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt;, I argued that every team already tests whether they admit it or not. Agent identity needs that same honesty. If a runtime can create delegated credentials, tests should verify that the boundary holds. A refund agent should fail against the wrong tenant. A code agent should fail when eval gates are red. A research agent should fail when it asks for write access to a system it only reads.&lt;/p&gt;
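&lt;p&gt;Boundary tests like these can be small. The sketch below uses a toy in-memory broker so it runs anywhere. &lt;code&gt;ToyBroker&lt;/code&gt;, &lt;code&gt;Credential&lt;/code&gt;, and &lt;code&gt;CredentialDenied&lt;/code&gt; are hypothetical stand-ins, and a real suite would point the same assertions at the actual broker.&lt;/p&gt;

```python
from dataclasses import dataclass


class CredentialDenied(Exception):
    """Raised when the broker refuses to issue a credential."""


@dataclass(frozen=True)
class Credential:
    tenant: str
    scope: str
    expires_at: float  # seconds on the same clock as the `now` arguments

    def is_valid(self, now: float) -> bool:
        return self.expires_at > now


class ToyBroker:
    """Hypothetical in-memory stand-in for a real identity broker."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl_seconds = ttl_seconds

    def issue(self, run_tenant: str, requested_tenant: str, scope: str,
              evals_green: bool, now: float) -> Credential:
        if requested_tenant != run_tenant:
            raise CredentialDenied("wrong tenant")
        if scope.endswith(":write") and not evals_green:
            raise CredentialDenied("eval gate is red")
        return Credential(requested_tenant, scope, now + self.ttl_seconds)


def assert_denied(**kwargs) -> None:
    """Fail unless the broker refuses the request described by kwargs."""
    try:
        ToyBroker().issue(**kwargs)
    except CredentialDenied:
        return
    raise AssertionError("broker issued a credential it should have denied")


def test_wrong_tenant_is_denied() -> None:
    assert_denied(run_tenant="acme", requested_tenant="globex",
                  scope="refunds:write", evals_green=True, now=0.0)


def test_red_eval_gate_blocks_writes() -> None:
    assert_denied(run_tenant="acme", requested_tenant="acme",
                  scope="repo:write", evals_green=False, now=0.0)


def test_credential_expires() -> None:
    cred = ToyBroker(ttl_seconds=60.0).issue("acme", "acme", "orders:read", False, now=0.0)
    assert cred.is_valid(now=30.0)
    assert not cred.is_valid(now=61.0)
```

&lt;p&gt;The functions follow the &lt;code&gt;test_&lt;/code&gt; naming convention, so a runner such as pytest would collect them as written.&lt;/p&gt;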

&lt;p&gt;Not a single &lt;code&gt;npx this and that&lt;/code&gt; in the whole codebase. Test it in CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared keys hide product decisions
&lt;/h2&gt;

&lt;p&gt;The fastest credential story hides the decisions that matter most.&lt;/p&gt;

&lt;p&gt;A shared key hides tenancy. It hides user context. It hides the identity of the agent performing an action. It hides which subagent inherited authority. It hides whether approval was granted. It hides whether the action matched the original request. It hides rotation until rotation becomes an outage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html" rel="noopener noreferrer"&gt;OWASP's secrets management guidance recommends dynamic secrets where possible&lt;/a&gt; to reduce credential reuse and limit the damage when credentials leak. Agent systems need the same pressure, with the additional constraint that the credential must represent the run instead of only the application.&lt;/p&gt;

&lt;p&gt;A normal backend service is expected to behave predictably and follow a reliable lifecycle. It accepts requests, implements endpoints, and changes through controlled deployments. An agent runtime for integration automation can select different tools per request, execute work in subagents, retry steps, and continue running after initial user interaction has completed.&lt;/p&gt;

&lt;p&gt;So identity has to be more exact.&lt;/p&gt;

&lt;p&gt;The credential loaned to the system should assert what it is currently allowed to do. The operating policy should be visible enough to understand the motivation behind the action. The audit trail must persist long enough for a human to traverse the events as they happened.&lt;/p&gt;

&lt;p&gt;Moving to a boundary-based platform does not require a full rewrite. Start with one boundary.&lt;/p&gt;

&lt;p&gt;Put an identity broker between the agent runtime and the first high-risk tool. Give the agent runtime a workload identity. Have the broker exchange that identity for a tool credential. Associate the decision with tenant, run, and operation. Record the policy decision in the trace. Add a CI test that proves the wrong tenant fails. Expire the credential quickly. Make the failure visible when the broker returns no.&lt;/p&gt;
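
&lt;p&gt;The steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production broker: &lt;code&gt;WorkloadIdentity&lt;/code&gt;, &lt;code&gt;ToolCredential&lt;/code&gt;, &lt;code&gt;IdentityBroker&lt;/code&gt;, and &lt;code&gt;BrokerDenied&lt;/code&gt; are hypothetical names, the policy is a plain dict, and the audit log is a list standing in for real trace storage.&lt;/p&gt;

```python
import secrets
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkloadIdentity:
    """Identity of the agent runtime for one run (hypothetical shape)."""
    agent_id: str
    run_id: str
    tenant: str


@dataclass(frozen=True)
class ToolCredential:
    """Short-lived credential scoped to one tenant and one operation."""
    token: str
    tenant: str
    operation: str
    expires_at: float

    def is_valid(self, tenant: str, operation: str, now: float) -> bool:
        # Valid only for the exact tenant and operation, and only before expiry.
        return (
            tenant == self.tenant
            and operation == self.operation
            and self.expires_at > now
        )


class BrokerDenied(Exception):
    """Raised so that a 'no' from the broker is a visible failure."""


@dataclass
class IdentityBroker:
    # Assumed policy shape: agent_id -> set of (tenant, operation) grants.
    policy: dict
    ttl_seconds: float = 60.0
    audit_log: list = field(default_factory=list)

    def exchange(self, identity, tenant, operation, now=None):
        now = time.time() if now is None else now
        grants = self.policy.get(identity.agent_id, set())
        allowed = tenant == identity.tenant and (tenant, operation) in grants
        # Record every decision, allow or deny, against tenant, run, and operation.
        self.audit_log.append({
            "agent_id": identity.agent_id,
            "run_id": identity.run_id,
            "tenant": tenant,
            "operation": operation,
            "decision": "allow" if allowed else "deny",
        })
        if not allowed:
            raise BrokerDenied(f"{identity.agent_id} may not {operation} for {tenant}")
        return ToolCredential(
            token=secrets.token_urlsafe(16),
            tenant=tenant,
            operation=operation,
            expires_at=now + self.ttl_seconds,
        )
```

&lt;p&gt;A CI test against something like this would assert that a request for the wrong tenant raises &lt;code&gt;BrokerDenied&lt;/code&gt;, that the denial still lands in the audit log, and that a credential stops validating once its TTL has passed.&lt;/p&gt;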

&lt;p&gt;Then move the next tool behind the boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The production line
&lt;/h2&gt;

&lt;p&gt;AI agent authentication is the control plane for non-human actors who do work across systems.&lt;/p&gt;

&lt;p&gt;Ownership matters here. Security cannot retroactively add this after the agent and its resources have shipped. Platform cannot stash it in a vault path. Product cannot mark it as a checkbox in consent. Identity, delegation, expiration, and audit have to be inherent in the runtime of the agent and how it executes.&lt;/p&gt;

&lt;p&gt;The agent should actually be able to act. That is, after all, why we are doing &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;AI agency&lt;/a&gt; in the first place. That agency should have a workload identity.&lt;/p&gt;

&lt;p&gt;Production systems have already worked out parts of the problem: Kubernetes, SPIFFE, OAuth token exchange, cloud workload federation, managed identities, dynamic secrets. They exist because static secrets rot and shared principal accounts make a bad situation worse.&lt;/p&gt;

&lt;p&gt;It is a mistake to grant agents an exemption because the interface is conversational.&lt;/p&gt;

&lt;p&gt;The model can decide on the next step. The runtime decides whether that step gets a credential.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Agentic AI Architecture Needs Model Routing</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Fri, 08 May 2026 01:57:35 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/agentic-ai-architecture-needs-model-routing-1e1k</link>
      <guid>https://dev.to/focused_dot_io/agentic-ai-architecture-needs-model-routing-1e1k</guid>
      <description>&lt;p&gt;Agentic AI architecture is stuck on model loyalty.&lt;/p&gt;

&lt;p&gt;The same graph. The same provider. One giant model doing every job because one model is easier to defend than a routing policy.&lt;/p&gt;

&lt;p&gt;I get why people want to pick one model: it makes demos, evaluation, and procurement easier, and it sometimes makes debugging only slightly worse. Every agent call looks the same, every trace looks the same, and the team can blame one provider instead of four.&lt;/p&gt;

&lt;p&gt;Fine. But production agents do not do one kind of work.&lt;/p&gt;

&lt;p&gt;Classify intent. Search. Summarize. Write code. Choose a tool. Check if a tool's result smells wrong. Write a customer-facing answer when something failed. Decide whether approval is required. Wait for something to happen. Retry something that failed. Recover from something gone wrong.&lt;/p&gt;

&lt;p&gt;Production agents run a pile of distinct workloads.&lt;/p&gt;

&lt;p&gt;Harrison Chase notes that &lt;a href="https://x.com/hwchase17/status/2051745855812882576" rel="noopener noreferrer"&gt;LLMs are getting expensive, and open source models matter for that reason&lt;/a&gt;. LangChain is pushing the same direction from a product perspective, noting that &lt;a href="https://x.com/LangChain/status/2051367244060598312" rel="noopener noreferrer"&gt;Fleet agents no longer have to be constrained by a single model and can instead use multi-model support&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those are the same production reality arriving through two doors.&lt;/p&gt;

&lt;p&gt;The agent architecture must determine which model should perform which work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Same Model Everywhere Is an Architecture Smell
&lt;/h2&gt;

&lt;p&gt;This is surprising. Many current agent stacks treat model selection as just another config parameter of the environment, equivalent to tradeoff parameters or batch sizes. Set &lt;code&gt;MODEL=claude-whatever&lt;/code&gt; or &lt;code&gt;MODEL=gpt-whatever&lt;/code&gt; and deploy the agent.&lt;/p&gt;

&lt;p&gt;That's fine for a chatbot, but lazy for an agent.&lt;/p&gt;

&lt;p&gt;Agents introduce variance internally. What looks simple to a user becomes retrieval, planning, transformation, checking, execution, generation and scheduling inside the system. Some of these steps need to be deep, some fast, some cheap. Some need a model that is good at generating code, others an open-weight model because the data cannot legally leave the boundary, or because it is simply too expensive to move around the company.&lt;/p&gt;

&lt;p&gt;Using the same frontier model across the board is comforting. It also conceals the waste.&lt;/p&gt;

&lt;p&gt;Instead of one glaring failure, I get slow, expensive, bureaucratic agent production. A team looks at the dashboard. Cost rises, latency rises, and people say the model is too expensive or the prompts are too long. The real cause is that the architecture is linear and every step goes to one place.&lt;/p&gt;

&lt;p&gt;What gets under my skin is the compute monolith. Everywhere else we have learned to separate compute classes properly (queues are not databases, lambdas are not batch workers, CDNs are not origin servers). Then some clever agent comes along and suddenly every cognitive function has to go through the biggest model in the account.&lt;/p&gt;

&lt;p&gt;Come on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing Has to Do More Than Fallbacks
&lt;/h2&gt;

&lt;p&gt;Model routing usually enters the conversation through reliability. If OpenAI is down, try Anthropic. If a deployment is overloaded, try another one. If a provider rate-limits, retry somewhere else.&lt;/p&gt;

&lt;p&gt;This is important. &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;LiteLLM's router docs&lt;/a&gt; explain load balancing, cooldowns, fallbacks, timeouts, retries, and Redis-based production rate limiting. &lt;a href="https://openrouter.ai/docs/guides/routing/provider-selection" rel="noopener noreferrer"&gt;OpenRouter's provider routing docs&lt;/a&gt; explain provider ordering, fallbacks, performance, price, and data policy constraints. Boring infrastructure at its best.&lt;/p&gt;

&lt;p&gt;But routing cannot stop at uptime.&lt;/p&gt;

&lt;p&gt;In a production agent workflow, the router should understand why a task exists. It should see the agent step, the tool context, the risk, latency budget, data boundary and previous run quality. Then it can pick the appropriate model class for the work at hand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wvfq4fqt38sx7vkh4p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wvfq4fqt38sx7vkh4p8.png" alt="Architecture diagram showing an agent graph sending a typed task into a model router with a router policy that chooses among fast, reasoning, code, and open-weight models, with telemetry and evaluation feedback returning to the policy." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The router belongs in production architecture, where policy can be tested.&lt;/p&gt;

&lt;p&gt;This is where things get more interesting for agentic AI architecture, compared to just building an LLM app. The router turns the agent’s internal structure into an execution policy.&lt;/p&gt;

&lt;p&gt;A planner step can go to a reasoning model. A normalization step can go to a fast model. A code-editing subagent can go to a model tuned for code. A bulk summarization step can go to an open-weight model. A regulated data step can stay inside the boundary. A customer-facing final answer can take the slower path because that is where quality matters most: the customer sees it.&lt;/p&gt;

&lt;p&gt;The pattern is already familiar, which is the point. It has the same shape as &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration in LangGraph&lt;/a&gt;, but I like it better down at this level. The graph determines what work exists, and the router determines which model class should process that work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Router Needs Typed Work
&lt;/h2&gt;

&lt;p&gt;Prompt-based routing is where it all goes wrong.&lt;/p&gt;

&lt;p&gt;A team adds "Use the cheaper model when the task is simple." The agent is amiable, but ignores the team's intent at exactly the wrong time. The AI guesses or routes based on whatever words match the current prompt. The result is a vibe with a model attached.&lt;/p&gt;

&lt;p&gt;The router needs typed work.&lt;/p&gt;

&lt;p&gt;My ideal is for the agent to report task metadata &lt;em&gt;before&lt;/em&gt; the model call occurs: task kind, expected output shape, sensitivity of input data, allowed tools, user-facing risk, latency/cost budgets, required capability, and retry posture. I do not need a full taxonomy to start. Most teams can begin with something tiny: &lt;code&gt;classify&lt;/code&gt;, &lt;code&gt;retrieve&lt;/code&gt;, &lt;code&gt;reason&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;act&lt;/code&gt;. The key is moving model choice from prose to runtime.&lt;/p&gt;

&lt;p&gt;This is a lesson already learned elsewhere in agent architecture. In &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;Developing AI Agency&lt;/a&gt;, explicit mechanisms for planning, tools, memory, and verification beat one giant prompt pretending to be architecture. Model selection is another version of this.&lt;/p&gt;

&lt;p&gt;The router can start dumb and be a simple lookup table driven by task type. It can be configured to dispatch to the code model for code tasks, the fast model for low-risk summaries, the local model for sensitive data, and the quality model for final text written for specific customers. First, ship that. Verify that it works. Then gradually make the router less dumb and add nuance.&lt;/p&gt;
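
&lt;p&gt;A minimal sketch of that lookup table, assuming the six task kinds listed earlier. The &lt;code&gt;Task&lt;/code&gt; shape, the route names, and the specific kind-to-route mapping are illustrative assumptions, not a recommendation for any particular model.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical mapping from task kind to model class; tune per team.
ROUTES = {
    "classify": "fast",
    "retrieve": "fast",
    "reason": "reasoning",
    "write": "quality",
    "code": "code",
    "act": "reasoning",
}


@dataclass(frozen=True)
class Task:
    """Typed work reported by the agent before the model call."""
    kind: str
    sensitive: bool = False      # input data must stay inside the boundary
    user_facing: bool = False    # output is read by a customer


def route(task: Task) -> str:
    """Pick a model class from task metadata, never from prompt text."""
    if task.kind not in ROUTES:
        raise ValueError(f"unknown task kind: {task.kind!r}")
    if task.sensitive:
        return "local"           # data boundary wins over everything else
    if task.user_facing:
        return "quality"         # customer-facing text takes the slower path
    return ROUTES[task.kind]


print(route(Task("code")))                      # code
print(route(Task("classify", sensitive=True)))  # local
```

&lt;p&gt;Because the policy is plain data plus a few ordered rules, it can be unit tested and reviewed like any other code path.&lt;/p&gt;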

&lt;p&gt;The first mistake is expecting the team to find the single best router before shipping anything. The second mistake is letting the model design the router policy inside the same prompt it is supposed to execute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Makes Routing Honest
&lt;/h2&gt;

&lt;p&gt;A router that does not publish telemetry data becomes an additional place where opinions get hidden.&lt;/p&gt;

&lt;p&gt;An engineer's affection for a particular design, the score of a benchmark, and the features listed on a vendor's web page are all useful, but ultimately insufficient. The only relevant test is whether the routing rule improves the production agent's performance on the tasks it actually faces.&lt;/p&gt;

&lt;p&gt;This means we need to consider cost, latency, error rate, retry rate, approval rate, human correction rate and eval score when deciding the routing for a request. So these statistics need to attach to the routing decision itself, not just to the trace.&lt;/p&gt;
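
&lt;p&gt;One way to attach those statistics to the routing decision is to record an outcome per routed call and aggregate by route. The sketch below is an assumption about shape only: &lt;code&gt;RouteOutcome&lt;/code&gt; and &lt;code&gt;summarize&lt;/code&gt; are hypothetical names, and a real system would pull these fields from its tracing backend.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteOutcome:
    """One routed model call and what happened to it (hypothetical fields)."""
    route: str
    cost_usd: float
    latency_ms: float
    error: bool = False
    retried: bool = False
    human_corrected: bool = False


def summarize(outcomes):
    """Aggregate outcomes per route so policy changes can be compared."""
    grouped = defaultdict(list)
    for outcome in outcomes:
        grouped[outcome.route].append(outcome)

    summary = {}
    for route, rows in grouped.items():
        n = len(rows)
        summary[route] = {
            "calls": n,
            "avg_cost_usd": sum(r.cost_usd for r in rows) / n,
            "avg_latency_ms": sum(r.latency_ms for r in rows) / n,
            "error_rate": sum(r.error for r in rows) / n,
            "retry_rate": sum(r.retried for r in rows) / n,
            "correction_rate": sum(r.human_corrected for r in rows) / n,
        }
    return summary
```

&lt;p&gt;With a summary keyed by route, a regression shows up as "the fast route's retry rate doubled" instead of "the agent feels worse."&lt;/p&gt;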

&lt;p&gt;LangSmith's platform language is already pointing in this direction. It treats traces as the record of an agent’s actions and reasoning, and says teams should monitor &lt;a href="https://www.langchain.com/langsmith-platform" rel="noopener noreferrer"&gt;cost, latency, errors, and qualitative online evals&lt;/a&gt;. Fleet's product page puts &lt;a href="https://www.langchain.com/langsmith/fleet" rel="noopener noreferrer"&gt;model choice next to admin controls, observability, approvals, MCP connections, and export via APIs&lt;/a&gt;. This is the signal.&lt;/p&gt;

&lt;p&gt;Model selection has moved from dropdown aesthetics into operational control. It affects the performance of a wide array of business processes.&lt;/p&gt;

&lt;p&gt;Once routing is visible, the discussion shifts. The team can stop arguing over which model is best and start figuring out which route failed: fast model for tool argument generation, reasoning model for eval lift, open-weight model for internal summarization, code model for patch generation.&lt;/p&gt;

&lt;p&gt;Those are engineering questions.&lt;/p&gt;

&lt;p&gt;The answers need to inform the router policy, or else the agent keeps making yesterday's decisions with today's realities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open-Weight Models Are Part of the Architecture
&lt;/h2&gt;

&lt;p&gt;The open-model conversation is often deeply ideological. People tend to think in terms of closed models versus open models, frontier quality versus control, benchmarks, and vibes.&lt;/p&gt;

&lt;p&gt;Production is less dramatic.&lt;/p&gt;

&lt;p&gt;Open-weight models give teams another execution path. They are useful when the task is bounded, when the data boundary matters, when throughput matters, when the cost curve gets ugly, or when the model only needs to be good enough for an internal step the user never sees.&lt;/p&gt;

&lt;p&gt;Having access to a frontier model does not mean every call should route through it. That misconception is common. Routing makes the difference.&lt;/p&gt;

&lt;p&gt;A team can still use a frontier model for the high-risk reasoning step. And yes, the final answer can still go through a strong hosted model. But the retrieval cleanup, first-pass summarization, metadata extraction, and internal critique may not automatically deserve the same spend.&lt;/p&gt;

&lt;p&gt;There is no best model for this problem. The more useful question is: Which model owns this step under these constraints?&lt;/p&gt;

&lt;p&gt;Interface portability matters for the same reason. LangChain says &lt;a href="https://x.com/LangChain/status/2051715028567437359" rel="noopener noreferrer"&gt;Deep Agents ships with ACP so the same harness can run across multiple interfaces&lt;/a&gt;. The &lt;a href="https://docs.langchain.com/oss/python/deepagents/cli/overview" rel="noopener noreferrer"&gt;Deep Agents CLI docs&lt;/a&gt; show a coding agent with provider credentials, model switching, tools, memory, skills, MCP tools, and LangSmith tracing. The interface can change. The harness can change. The routing policy has to be portable across both.&lt;/p&gt;

&lt;p&gt;Model choice that lives in a UI dropdown is prone to drift. Model choice that lives in the agent runtime can be tested, traced, reviewed and rolled back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Own the Decision Boundary
&lt;/h2&gt;

&lt;p&gt;The old agent stack revolved around a model call. The next one revolves around a decision boundary.&lt;/p&gt;

&lt;p&gt;That boundary decides which work deserves which model, which provider, which data path, how many retries to attempt, what approval loop to operate in, and which evaluation loop to use. Less glamorous than a chart, to be sure, but more relevant to production workflows. Most production architecture is less glamorous than the thing that sells the demo.&lt;/p&gt;

&lt;p&gt;The teams that get this right won’t talk about having one “agent model”. They’ll talk about routes: Fast route. Deep route. Code route. Local route. Human-review route. And for each route, they’ll know when to use it, how much it costs, how often it fails, and whether the next release made it better.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://focused.io/lab/2026-year-of-the-integrated-agent" rel="noopener noreferrer"&gt;integrated agents&lt;/a&gt; become useful. The agent owns execution decisions instead of wrapping a model call in a little workflow theater.&lt;/p&gt;

&lt;p&gt;The code that matters controls the router, the telemetry and the eval loop.&lt;/p&gt;

&lt;p&gt;The model will keep changing. The decision boundary should belong to the team shipping the agent.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Stop Eager-Loading MCP Tools Into the Context Window</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 05 May 2026 20:31:01 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/stop-eager-loading-mcp-tools-into-the-context-window-3mjl</link>
      <guid>https://dev.to/focused_dot_io/stop-eager-loading-mcp-tools-into-the-context-window-3mjl</guid>
      <description>&lt;p&gt;&lt;em&gt;MCP servers should not eagerly load every tool schema into an agent's context window. Lazy-load tools by intent, then govern and audit execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of&lt;/em&gt;&lt;a href="https://focused.io" rel="noopener noreferrer"&gt; &lt;em&gt;Focused&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think the problem with the current state of MCP is way deeper than just resizing the context window.&lt;/p&gt;

&lt;p&gt;The protocol itself is decent: tool discovery, schema negotiation, and the JSON-RPC architecture all feel solid and well engineered. However, the default behavior of populating the agent's context at session start with every tool definition from every connected server makes running production agents virtually impossible.&lt;/p&gt;

&lt;p&gt;One developer &lt;a href="https://joshowens.dev/mcps-are-dead/" rel="noopener noreferrer"&gt;measured 67,300 tokens consumed&lt;/a&gt; before typing a single question. Seven MCP servers. Tool schemas alone ate up a third of the available context. Another measured 81,986 tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eager-Loading Tax
&lt;/h2&gt;

&lt;p&gt;When an agent starts a session with MCP servers connected, it loads the full library of every tool, every session. It never filters down to just the tools needed for the job at hand.&lt;/p&gt;

&lt;p&gt;My browser automation server is loading 21 tool definitions. A GitHub server loads 27. My web search server bundles 8 providers behind 20 tools. I've not sent a single message yet and I'm already consuming significant context.&lt;/p&gt;

&lt;p&gt;The numbers from &lt;a href="https://arxiv.org/abs/2602.14878" rel="noopener noreferrer"&gt;a study of 856 tools across 103 MCP servers&lt;/a&gt; make this worse than it sounds. Fully augmented MCP tool descriptions add 67% more execution steps for a 5.85 percentage point accuracy gain. The tool definitions don't just eat context. They also slow agents down when they actually use the tools.&lt;/p&gt;

&lt;p&gt;We wrote about &lt;a href="https://focused.io/lab/evaluation-pipelines-for-langgraph-agents" rel="noopener noreferrer"&gt;evaluation pipelines for production agents&lt;/a&gt;. One failure mode of context pollution from tool definitions that I never see anyone mention is the agent becoming less effective over time. It doesn't necessarily die or crash or throw an error. Real conversation history simply gets pushed out of the working window by the tool schemas.&lt;/p&gt;

&lt;p&gt;Even with child agents the context budget gets severely curtailed. Each child agent inherits the MCP configuration. Each child gets a fresh context window, but immediately losing tens of thousands of tokens to tool schemas that the subagent may never use is antithetical to the whole point of using subagents in the first place: focused context. We covered the architecture patterns for &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration in LangGraph&lt;/a&gt;, but even great orchestration can't fix a context budget that's already half spent before the first tool call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsght88sk9728j25u1gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsght88sk9728j25u1gi.png" alt="Split comparison of eager MCP tool loading versus lazy tool discovery preserving the context window." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The waste is architectural: eager loading spends the context budget before the agent starts working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloudflare Just Admitted This Is Broken
&lt;/h2&gt;

&lt;p&gt;Cloudflare launched &lt;a href="https://blog.cloudflare.com/welcome-to-agents-week/" rel="noopener noreferrer"&gt;Agents Week&lt;/a&gt; on April 12, and buried in their enterprise MCP reference architecture is an admission that the tool-definition model doesn't scale.&lt;/p&gt;

&lt;p&gt;Their solution is called &lt;a href="https://blog.cloudflare.com/enterprise-mcp/" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt;. It condenses all of the individual MCP tools down into two meta-tools: &lt;code&gt;portal_codemode_search&lt;/code&gt; and &lt;code&gt;portal_codemode_execute&lt;/code&gt;. Rather than loading every tool definition into context, the agent writes JavaScript to search for and invoke tools on demand.&lt;/p&gt;

&lt;p&gt;In Cloudflare's example, 4 internal MCP servers exposing 52 tools would normally consume 9,400 tokens just for definitions. Code Mode drops that to 600 tokens. A 94% reduction. For Cloudflare's own API, which would consume over 2 million tokens as a traditional MCP server (twice the largest context window available right now), the reduction hits 99.9%.&lt;/p&gt;

&lt;p&gt;That last number deserves to sit for a second. Cloudflare, one of the companies most aggressively adopting MCP across their entire enterprise, had to build a system that essentially replaces MCP's tool discovery mechanism because the original approach would literally overflow the context window. With one server.&lt;/p&gt;
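
&lt;p&gt;The arithmetic behind those percentages is easy to check. In the sketch below, the 9,400 and 600 token figures come from the numbers quoted above; the 2,000-token residual for the full API is an assumed value, chosen only to show roughly what a 99.9% reduction of 2 million tokens implies.&lt;/p&gt;

```python
def reduction_pct(before_tokens: int, after_tokens: int) -> float:
    """Percentage of context tokens saved by moving from before to after."""
    return 100.0 * (before_tokens - after_tokens) / before_tokens


# 52 tools across 4 servers: 9,400 tokens eager vs 600 with two meta-tools.
print(round(reduction_pct(9_400, 600), 1))        # 93.6, quoted as 94%

# Assumed residual of 2,000 tokens against a 2,000,000-token eager load.
print(round(reduction_pct(2_000_000, 2_000), 1))  # 99.9
```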

&lt;p&gt;The MCP spec team &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1300" rel="noopener noreferrer"&gt;acknowledged context overload as the most frequent community concern&lt;/a&gt; in their tool filtering proposal. Quality decreases rapidly after around 10 tools, far fewer than most production setups connect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lazy-Loading Is the Fix
&lt;/h2&gt;

&lt;p&gt;This is not just theory. I'm seeing lazy-loading work in multiple production environments, each implementing it slightly differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare's Code Mode&lt;/strong&gt; turns the agent into its own tool browser. Give it a search function, give it an execute function, and let it figure out which tools matter for the job at hand. The context cost for exploring MCP servers stays the same regardless of how many servers are connected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's also the Skills pattern.&lt;/strong&gt; Instead of representing all of the tool schemas in detail upfront, agents encode the knowledge needed for a given task in lightweight skill files (typically 200 to 1,500 tokens each) that can be loaded as needed based on intent matching. A skill for browser automation might cost around 2,000 tokens to activate, as opposed to 13,600 tokens to load the full MCP server at startup. GitHub operations drop from 18,000 tokens to maybe 500 or so. Web search goes from 14,100 down to 550.&lt;/p&gt;

&lt;p&gt;That's not marginal. That's an order of magnitude.&lt;/p&gt;
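
&lt;p&gt;A rough sketch of intent-matched skill loading, using the token figures quoted above. The keyword lists and the &lt;code&gt;match_skills&lt;/code&gt; helper are hypothetical; a real implementation would likely use a classifier or embeddings instead of substring matching.&lt;/p&gt;

```python
# Token costs quoted in the text: eager MCP load vs. activating one skill.
EAGER_TOKENS = {"browser": 13_600, "github": 18_000, "search": 14_100}
SKILL_TOKENS = {"browser": 2_000, "github": 500, "search": 550}

# Hypothetical intent keywords per skill, for illustration only.
SKILL_KEYWORDS = {
    "browser": ("click", "screenshot", "page"),
    "github": ("pull request", "issue", "repo"),
    "search": ("search", "look up", "find"),
}


def match_skills(request: str) -> list:
    """Return the skills whose keywords appear in the request text."""
    text = request.lower()
    return [
        name
        for name, words in SKILL_KEYWORDS.items()
        if any(word in text for word in words)
    ]


def lazy_cost(request: str) -> int:
    """Context tokens spent when only the matched skills are loaded."""
    return sum(SKILL_TOKENS[name] for name in match_skills(request))


eager_cost = sum(EAGER_TOKENS.values())
print(eager_cost)                                   # 45700
print(lazy_cost("Open a pull request for the fix"))  # 500
```

&lt;p&gt;Even if a request activated all three skills, the total would be 3,050 tokens against 45,700 for the eager load, which is where the order-of-magnitude claim comes from.&lt;/p&gt;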

&lt;p&gt;&lt;strong&gt;Arcade's MCP Gateway&lt;/strong&gt; in &lt;a href="https://blog.langchain.com/arcade-dev-tools-now-in-langsmith-fleet/" rel="noopener noreferrer"&gt;LangSmith Fleet&lt;/a&gt; takes a third approach by centralizing 7,500+ tools and optimizing the tool descriptions for language models. These tools are not simply API wrappers. They are mapped to actions that agents can perform, with descriptions written specifically for how language models select and call upon them.&lt;/p&gt;

&lt;p&gt;Harrison Chase wrote about this from the other side of the spectrum. His &lt;a href="https://blog.langchain.com/continual-learning-for-ai-agents/" rel="noopener noreferrer"&gt;continual learning framework&lt;/a&gt; identifies three realms where agents improve: model weights, harness code, and context. The context layer is "the most common and most exciting area right now." However, optimizing for context only works if there is room in the context budget to do so. An agent can't learn from its interactions if the space for learning is already completely filled by tool schemas it loaded at boot time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59hrnl33bnrdh6d5p6l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59hrnl33bnrdh6d5p6l3.png" alt="Flow diagram showing task intent routing through tool discovery, policy approval, needed tool schemas, agent execution, and audit logging." width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lazy-loading turns tool discovery into a governed routing path instead of a context-window tax.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;With the current LangChain infrastructure, the eager version of these agents registers all tools when the agent is built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_mcp_adapters.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultiServerMCPClient&lt;/span&gt;

&lt;span class="n"&gt;MCP_SERVERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3004/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_eager_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiServerMCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MCP_SERVERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# all tools, all servers, every session
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lazy approach is not a magic discovery tool that mutates the running agent's tool set. The boring version is a router: decide which MCP servers matter for this task, load only those tools, then build the agent for that run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_mcp_adapters.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultiServerMCPClient&lt;/span&gt;

&lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;find&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;look up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3004/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;select_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Plain substring match: a short trigger like "pr" also fires inside words such as "improve".&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="n"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_with_lazy_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;selected_servers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;select_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;selected_servers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No matching MCP servers. Available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiServerMCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# only tools from the routed servers
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first version I wrote had a terrible context profile because it loaded definitions for every tool on every server into context. The next version routed first, then loaded only the tools from the servers the task needed. In a production system with 5 to 10 MCP servers, that saves tens of thousands of tokens every session.&lt;/p&gt;

&lt;p&gt;Holding all of that tool schema in context is expensive. But more importantly, every token of tool schema that sits in context is a token that could be spent on reasoning, conversation history, or user-specific memory. We wrote about why &lt;a href="https://focused.io/lab/persistent-agent-memory-in-langgraph" rel="noopener noreferrer"&gt;persistent agent memory&lt;/a&gt; is critical for production agents. Memory is useless if there isn't room for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shadow MCP Is the Enterprise Problem Nobody Expected
&lt;/h2&gt;

&lt;p&gt;Cloudflare's reference architecture introduces another concept worth paying attention to: &lt;a href="https://blog.cloudflare.com/enterprise-mcp/" rel="noopener noreferrer"&gt;Shadow MCP detection&lt;/a&gt;. They scan for unauthorized MCP server connections across the organization by monitoring hostnames and URI paths, and even by using DLP-based body inspection to spot JSON-RPC method calls like &lt;code&gt;tools/call&lt;/code&gt; and &lt;code&gt;initialize&lt;/code&gt;.&lt;/p&gt;
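
&lt;p&gt;To make the body-inspection idea concrete, here is a minimal sketch of the check, not Cloudflare's implementation. The function name &lt;code&gt;looks_like_mcp_request&lt;/code&gt; is hypothetical, and the method set contains only the two names mentioned above; a real scanner would presumably watch for more.&lt;/p&gt;

```python
import json

# Method names taken from the paragraph above; a real scanner would cover more.
MCP_METHODS = {"tools/call", "initialize"}


def looks_like_mcp_request(body):
    """Return True if a request body parses as a JSON-RPC 2.0 call to an MCP method."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # Not JSON (or not text/bytes at all), so not an MCP call.
        return False
    # JSON-RPC 2.0 allows batches, so normalize to a list of messages.
    messages = payload if isinstance(payload, list) else [payload]
    return any(
        isinstance(message, dict)
        and message.get("jsonrpc") == "2.0"
        and isinstance(message.get("method"), str)
        and message["method"] in MCP_METHODS
        for message in messages
    )
```

&lt;p&gt;A check like this only flags candidates. Deciding whether the destination is an approved server is still a policy lookup against hostnames and paths.&lt;/p&gt;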

&lt;p&gt;MCP has its own shadow IT problem. Developers set up their own MCP server and wire it into their existing agents, and security never finds out. That code can execute locally on developer machines, reach out to internal APIs, and bypass security controls. No audit trail, no credential governance, no DLP.&lt;/p&gt;

&lt;p&gt;Cloudflare's answer is a monorepo governance model: centralized MCP team, AI governance approval, templates that inherit default-deny write controls and audit logging out of the box. New governed MCP servers deploy in minutes because the governance is baked into the platform, not bolted on after the fact.&lt;/p&gt;

&lt;p&gt;I see this pattern constantly with clients. The MCP gold rush has teams spinning up servers faster than security can evaluate them. We wrote about why &lt;a href="https://focused.io/lab/mcp-is-packaging-agent-operable-interfaces-are-the-product" rel="noopener noreferrer"&gt;agent-operable interfaces are the product&lt;/a&gt;. The same principle applies to the tools agents use. If an employee can't access a system without approval, the agent shouldn't be able to either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix Is Architecture, Not Bigger Windows
&lt;/h2&gt;

&lt;p&gt;"Context windows keep getting bigger." They do. And the waste doesn't get smaller.&lt;/p&gt;

&lt;p&gt;A million-token window doesn't help if 67,000 tokens of tool schemas the agent will never use still get loaded. The underlying issue is architectural: eager-loading is the wrong pattern for tool discovery in production agents.&lt;/p&gt;

&lt;p&gt;Lazy-load tools based on task intent. Gate discovery behind a search mechanism. Keep tool definitions out of the context until the agent actually needs them.&lt;/p&gt;

&lt;p&gt;Honeycomb published &lt;a href="https://www.honeycomb.io/blog/icymi-is-this-code-worth-running-heres-how-know" rel="noopener noreferrer"&gt;a set of principles for the AI era&lt;/a&gt; that apply here: cost is a system attribute, not an afterthought, and pre-production testing doesn't prepare for the load that comes from real systems in a real environment. Tool context overhead is exactly the kind of emergent cost that only shows up in production, when real agents connect to real MCP servers and the token bills start making people uncomfortable.&lt;/p&gt;

&lt;p&gt;The protocol isn't the problem. The eager-loading default is the problem. Own the architecture decision. Lazy-load.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>MCP Is Packaging. Agent-Operable Interfaces Are the Product | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Mon, 04 May 2026 14:25:47 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/mcp-is-packaging-agent-operable-interfaces-are-the-product-focused-labs-49gp</link>
      <guid>https://dev.to/focused_dot_io/mcp-is-packaging-agent-operable-interfaces-are-the-product-focused-labs-49gp</guid>
      <description>&lt;p&gt;&lt;em&gt;MCP packages tools, but the real product is the narrow, typed, auditable interface an agent can actually operate.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of&lt;/em&gt;&lt;a href="https://focused.io" rel="noopener noreferrer"&gt; &lt;em&gt;Focused&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP is not the hard part.&lt;/p&gt;

&lt;p&gt;The hard part is designing a system that an agent can actually use, rather than guess at, wander through, or mangle. The protocol is the distribution, not the architecture.&lt;/p&gt;

&lt;p&gt;This is kind of important. Every enterprise AI conversation I’ve had boils down, at some point, to this: we have a model, we have a workflow, and we have a tangle of internal tools designed for humans to use through a web interface at human speed. Then the question becomes “should we make an MCP server to handle all of this?”&lt;/p&gt;

&lt;p&gt;Fine. But for what?&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol makes it easy for applications to expose tools and model context&lt;/a&gt;. That’s useful, and I’m not opposing MCP. I am opposing the use of the protocol to pass off a useless shortcut as a useful one.&lt;/p&gt;

&lt;p&gt;Harrison Chase broke down the lock-in problem well: &lt;a href="https://x.com/hwchase17/status/2050470473310572849" rel="noopener noreferrer"&gt;switching model providers is easy, switching harnesses is less so, and model providers want to lock teams in through the harness&lt;/a&gt;. The harness is where the agent learns about the actions in an application, the state, the model’s memory, what can be retried, what needs approval, and what telemetry gets written down.&lt;/p&gt;

&lt;p&gt;But then there is the interface below the harness, which gets little recognition.&lt;/p&gt;

&lt;p&gt;A bad interface can turn an excellent harness into a nightmare. A good interface keeps even a weak harness workable.&lt;/p&gt;

&lt;p&gt;This is why “just build an MCP server” isn’t the entire answer. An MCP server can expose a messy action. It can just as easily wrap a sharp one. But deciding which actions exist in the first place is up to the team, and that is a design and experience problem, not an engineering one.&lt;/p&gt;

&lt;p&gt;Teams build integrations for internal agents by wrapping existing APIs, and those APIs are often shaped by awkward frontend decisions, which is why a response comes back as an object inside an object inside an object. An endpoint might update state as a side effect because it was built for an admin screen. The rough edges include error messages written only for humans, implicit permissions, opaque pagination parameters, no dry-run support, and no idempotency keys. The verb the system is missing most is “after policy rules apply, approve this one invoice,” and what the agent gets instead is &lt;code&gt;updateInvoice&lt;/code&gt;. Stricter prompts don’t fix that.&lt;/p&gt;

&lt;p&gt;Welcome to production.&lt;/p&gt;

&lt;p&gt;After reading yet another question about whether a given subsystem has an MCP server, I paused to ask myself whether I was missing something. We shouldn’t be asking “is there an MCP server?” We should be asking whether the system in question has handles for the agent that just got invited in.&lt;/p&gt;

&lt;p&gt;A handle is a small, typed, boring action, describing what it intends to do with some data. It describes what the data contains, what the operation needs from it, and what it will look like afterward. It fails in a way that the caller can understand. Handle-based operations are easy to test without a full model. Finally, handles leave traces of their prior actions.&lt;/p&gt;

&lt;p&gt;The recent examples reinforce the point. Google’s &lt;a href="https://github.com/googleapis/mcp-toolbox" rel="noopener noreferrer"&gt;MCP Toolbox for Databases&lt;/a&gt; is not interesting because “database plus MCP” is a magical phrase. It is interesting because databases require controlled, auditable work that can be inspected when a software agent is the one doing it. MathWorks has released an official &lt;a href="https://github.com/matlab/matlab-mcp-core-server" rel="noopener noreferrer"&gt;MATLAB MCP server&lt;/a&gt;, which is interesting because a direct interface to MATLAB’s mature technical environment is vastly more appropriate than a chat window. Browserbase and LangChain are demonstrating Deep Agents with &lt;a href="https://docs.langchain.com/oss/python/integrations/providers/browserbase" rel="noopener noreferrer"&gt;search, fetch, and browser subagents&lt;/a&gt;: a cheap, light subagent performs quick retrieval, followed by a heavier browser-based operation only if necessary.&lt;/p&gt;

&lt;p&gt;I don’t mean that every single thing suddenly becomes an MCP server. I mean that more of the important tools in a business can become something controlled through an agent instead of through a browser tab or terminal command.&lt;/p&gt;

&lt;p&gt;There is a difference.&lt;/p&gt;

&lt;p&gt;An MCP server is just one package boundary among several, each with its own strengths and weaknesses. An agent-operable interface is a product decision, choosing specific verbs, inputs, outputs, reversible operations, and mandatory human pause actions. A protocol can then move that interface around, but it cannot make the interface good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pryn0qq9ln3xc56py8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pryn0qq9ln3xc56py8.png" alt="Side-by-side architecture diagram comparing a thin MCP wrapper around a messy API with an agent-operable interface that has narrow verbs, dry runs, typed errors, and audit records." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP moves an interface around. It does not make the verbs worth trusting.&lt;/p&gt;

&lt;p&gt;This is the same anti-pattern we saw with APIs. Companies would publish a REST API to tremendous fanfare, convinced that integration problems were now solved. In practice, the nouns and mutations provided by the API would prove inadequate for anything beyond the simplest cases. Docs would sometimes contradict behavior. And while most of the workflow might be automatable, the remaining chunk still required a human being logged into the admin console.&lt;/p&gt;

&lt;p&gt;The gap costs more as agents move further into it, because agents rarely flag the ambiguities at the boundary. Instead they pick a tool, fill in missing fields, retry operations, and summarize the results as if they were progress. Agents do not set out to fail in workflows. They are handed an irregular surface, with no clear mandate for working on it, and are left to act competent anyway.&lt;/p&gt;

&lt;p&gt;A useful way to think about this is &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;Developing AI Agency&lt;/a&gt;. The word “agency” comes with unfortunate connotations of personality, so I try to think about it in terms of the required affordances for any agent: a goal, some tools to pursue it with, memory, feedback, and permission to act. When the tool layer is too vague, the AI ends up with fake agency. It can talk about work and even generate a lot of thoughtful-sounding design language, but it can’t actually do the work.&lt;/p&gt;

&lt;p&gt;The current gold rush of building MCP servers obscures this problem because when people say “server” they think of code and physical hardware. Code and hardware are tangible. There is a repo, a README, and a demo in which a client, usually Claude or Cursor, calls the tool and something happens.&lt;/p&gt;

&lt;p&gt;That demo is not the test.&lt;/p&gt;

&lt;p&gt;Test whether the interface still behaves when the request is boring, partial, duplicated, late, unauthorized, or wrong. Test whether a reviewer can always reconstruct what happened to an object after the agent touched its handle. Test whether the action can be replayed in staging without accidentally sending the email to customers. &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt;, even when the thing under test is an agent holding a tool handle.&lt;/p&gt;

&lt;p&gt;A useful agent-operable interface has a few properties.&lt;/p&gt;

&lt;p&gt;The verbs are narrow. A verb for “create refund request” instead of “update order.” A verb for “draft response” instead of “send message.” A verb for “propose schema migration” instead of “run SQL.” Narrow verbs help by letting the operation name strongly suggest the operation’s intent.&lt;/p&gt;

&lt;p&gt;Inputs are typed the way the domain expects, not just JSON Schema for the sake of it. Real domain constraints are used where possible, to reflect the kind of validation that matters in the application. That means an account ID that actually exists in the system, a payment amount that carries a currency, and a date and time with timezone rules that mean something to the user. And enum values should be meaningful strings, not just the values used in the demo.&lt;/p&gt;
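
&lt;p&gt;As a rough illustration of domain-typed input, here is a sketch using only the standard library. Every name in it (&lt;code&gt;PaymentInput&lt;/code&gt;, &lt;code&gt;Currency&lt;/code&gt;, &lt;code&gt;KNOWN_ACCOUNTS&lt;/code&gt;) is hypothetical, and the in-memory account set stands in for a lookup against the real system.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from enum import Enum

# Hypothetical account directory; a real handle would query the system of record.
KNOWN_ACCOUNTS = {"acct_001", "acct_002"}


class Currency(Enum):
    USD = "USD"
    EUR = "EUR"


@dataclass(frozen=True)
class PaymentInput:
    account_id: str
    amount: Decimal
    currency: Currency
    due_at: datetime

    def __post_init__(self):
        # Validate against the domain, not just against the field types.
        if self.account_id not in KNOWN_ACCOUNTS:
            raise ValueError(f"unknown account: {self.account_id}")
        if not self.amount > 0:
            raise ValueError("amount must be positive")
        if self.due_at.tzinfo is None:
            raise ValueError("due_at must be timezone-aware")


# Usage: Currency("USD") parses the wire value and rejects anything unlisted.
ok = PaymentInput(
    "acct_001",
    Decimal("19.99"),
    Currency("USD"),
    datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc),
)
```

&lt;p&gt;The point of the sketch is that a bad account, a zero amount, or a naive timestamp fails at the boundary, before the agent's call reaches anything that mutates state.&lt;/p&gt;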

&lt;p&gt;Outputs should be machine-readable and human-readable at the same time. The agent expects certain fields to be populated. A human reviewer wants to read a simple statement of what changed, what didn’t change, and what still needs work.&lt;/p&gt;

&lt;p&gt;There’s a dry-run path. A dry run is the cheapest safety mechanism available, and almost nobody shipping generated code tries it first. A dry run turns “can the agent do this?” into “can the agent explain the diff before doing this?” That is where human judgment is better.&lt;/p&gt;
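
&lt;p&gt;A dry-run path can be as small as a flag that decides whether the computed diff gets written. The sketch below is one way to shape it, under stated assumptions: &lt;code&gt;apply_refund&lt;/code&gt;, &lt;code&gt;Change&lt;/code&gt;, and the in-memory &lt;code&gt;ORDERS&lt;/code&gt; store are all hypothetical names, and dry run is the default so that writing is the explicit choice.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical in-memory store standing in for the real system of record.
ORDERS = {"ord_1": {"total_cents": 5000, "refunded_cents": 0}}


@dataclass(frozen=True)
class Change:
    field: str
    before: int
    after: int


def apply_refund(order_id, amount_cents, dry_run=True):
    """Return the diff a refund would produce; write it only when dry_run is False."""
    order = ORDERS[order_id]
    remaining = order["total_cents"] - order["refunded_cents"]
    # Accept only whole-cent amounts between 1 and what is left to refund.
    if amount_cents not in range(1, remaining + 1):
        raise ValueError(f"amount must be between 1 and {remaining} cents")
    after = order["refunded_cents"] + amount_cents
    diff = [Change("refunded_cents", order["refunded_cents"], after)]
    if not dry_run:
        order["refunded_cents"] = after
    return diff
```

&lt;p&gt;Because the dry run and the real call return the same diff shape, a reviewer can approve exactly what will be applied.&lt;/p&gt;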

&lt;p&gt;Interfaces are idempotent to the degree possible. Networks fail, agents retry, and tool calls time out while the downstream system is still working. If retrying an invocation of &lt;code&gt;create_refund_request&lt;/code&gt; creates a second refund, or a second ticket, or a second production deploy, then the interface is not yet ready for an agent.&lt;/p&gt;

&lt;p&gt;Every interface has contract tests that don’t involve a model. This matters. If every single correctness check has to run an LLM, we have built a slot machine and only looked at the CI badge. The tool’s schema, how it validates, what a dry run looks like, how permissions fail, and what audit records are generated should all be tested by normal software tests. Save the model evals for when there’s a model involved.&lt;/p&gt;
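
&lt;p&gt;Here is a small sketch of what model-free contract tests might look like. The tool &lt;code&gt;draft_response&lt;/code&gt;, the &lt;code&gt;tickets:draft&lt;/code&gt; scope, and the &lt;code&gt;tkt_&lt;/code&gt; ID prefix are hypothetical; the tests are plain functions that pytest would collect, or that can be called directly.&lt;/p&gt;

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Typed failure the caller can branch on."""


@dataclass(frozen=True)
class DraftResult:
    ticket_id: str
    body: str
    sent: bool  # always False: this verb drafts, it never sends


def draft_response(ticket_id, body, caller_scopes):
    if "tickets:draft" not in caller_scopes:
        raise PermissionDenied("missing scope tickets:draft")
    if not ticket_id.startswith("tkt_") or not body.strip():
        raise ValueError("ticket_id must start with tkt_ and body must be non-empty")
    return DraftResult(ticket_id=ticket_id, body=body.strip(), sent=False)


# Contract tests: no model involved, just the tool's promises.
def test_draft_never_sends():
    assert draft_response("tkt_1", "hello", {"tickets:draft"}).sent is False


def test_permission_failure_is_typed():
    try:
        draft_response("tkt_1", "hello", set())
    except PermissionDenied:
        return
    raise AssertionError("expected PermissionDenied")


def test_bad_input_is_rejected():
    for ticket_id, body in [("42", "hello"), ("tkt_1", "   ")]:
        try:
            draft_response(ticket_id, body, {"tickets:draft"})
        except ValueError:
            continue
        raise AssertionError("expected ValueError")
```

&lt;p&gt;None of these checks needs an LLM in the loop, so they run in ordinary CI and fail deterministically.&lt;/p&gt;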

&lt;p&gt;The interface leaves evidence. Not vibes, though it could strive for better ones. Tangible records of who acted, through which agent, under which policy, against which object, with what proposed change, and with what final result. Here I’m talking about connecting observability to governance without devolving into another dashboard cult.&lt;/p&gt;
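
&lt;p&gt;The fields in that list map directly onto a record type. The sketch below is one possible shape, not a standard: &lt;code&gt;AuditRecord&lt;/code&gt;, its field names, and &lt;code&gt;to_log_line&lt;/code&gt; are hypothetical, and the timestamp field is an addition beyond the list above.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    actor: str             # who acted
    agent: str             # through which agent
    policy: str            # under which policy
    target: str            # against which object
    proposed_change: dict  # what the agent asked for
    result: str            # what actually happened
    # Assumption: a UTC timestamp, which the list above does not mention.
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def to_log_line(record):
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

&lt;p&gt;One JSON line per tool call is enough for a reviewer to reconstruct what happened to an object without replaying the conversation.&lt;/p&gt;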

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jgx17fgudljd1qbdpuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jgx17fgudljd1qbdpuh.png" alt="Matrix listing the properties of an agent-operable handle: narrow verb, typed input, dry run, idempotency, typed failure, audit record, and human pause." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A useful handle is a contract the agent cannot creatively reinterpret.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://x.com/GoogleCloudTech/status/2050334450697863535" rel="noopener noreferrer"&gt;Google Cloud conversation with Harrison Chase framed harness engineering as the path from demo to production&lt;/a&gt;. I think that is right, and I think the next practical step is interface engineering. The harness made sense once it had an interface for composing sane things.&lt;/p&gt;

&lt;p&gt;This is why abstractions on top of LangChain are useful too. Start with a basic agent primitive, then a graph, and finally a Deep Agent that can even use browser subagents and human interruption. Every level of abstraction still ultimately bottoms out at a tool call, which either corresponds to a clean domain operation or a tangled mess of code that happens to work on the backend.&lt;/p&gt;

&lt;p&gt;In practice, &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;Multi-Agent Orchestration in LangGraph&lt;/a&gt; is only half the story. The other half is whether the interface lets the worker do anything worth trusting.&lt;/p&gt;

&lt;p&gt;It’s getting said out loud in the community now: &lt;a href="https://x.com/i/status/2050545264927093004" rel="noopener noreferrer"&gt;“Stop building MCP servers. Build CLIs that agents can use”&lt;/a&gt;. I don’t care whether the end result is a CLI, OpenAPI endpoint, MCP tool, database management procedure, internal command bus, or some other boring thing, as long as it is observable, testable, and readable by others.&lt;/p&gt;

&lt;p&gt;Interesting new projects are emerging around this idea too. &lt;a href="https://github.com/millionco/agent-install" rel="noopener noreferrer"&gt;agent-install&lt;/a&gt; treats agent capabilities as installable surfaces across coding agents. &lt;a href="https://github.com/DesmondSanctity/loadam" rel="noopener noreferrer"&gt;loadam&lt;/a&gt; turns OpenAPI specs into tests, MCP output, and drift reports. &lt;a href="https://www.freecodecamp.org/news/how-to-build-a-multi-agent-ai-system-with-langgraph-mcp-and-a2a-full-book/" rel="noopener noreferrer"&gt;freeCodeCamp’s LangGraph, MCP, and A2A guide&lt;/a&gt; also illustrates the progress from single-agent demos to more structured systems with protocols between them.&lt;/p&gt;

&lt;p&gt;Good. Just keep the distinction clear between what the protocol diagram shows and what the system can actually do.&lt;/p&gt;

&lt;p&gt;The work is deciding what actions the agent can take within Salesforce, Jira, GitHub, Postgres, SAP, Stripe, and the lingering internal admin app that is totally going to get replaced tomorrow. Deleting broad verbs is the new favorite hobby. Adding dry runs is straightforward. Making failures typed is tedious. Writing tests for contracts before a single model sees the tool is boring.&lt;/p&gt;

&lt;p&gt;Boring is the point.&lt;/p&gt;

&lt;p&gt;Stop Eager-Loading MCP Tools Into the Context Window. A giant pile of tools is not capability. It is usually confusion with a larger token bill. Agents need fewer, sharper handles to their tools, and tool catalogs should feel more like a well-designed command line than a junk drawer with JSON schemas bolted on.&lt;/p&gt;

&lt;p&gt;Agent-operable interfaces should be treated as part of product architecture, not as a sweep-up of integration odds and ends that product teams no longer want. Enterprise teams should own the verbs the same way they own the database schema. Version them. Deprecate them. Test them and document the failure modes. Require review for dangerous actions. Make the interface boring enough that the agent has no creative wiggle room around the important bits.&lt;/p&gt;

&lt;p&gt;MCP will help distribute interfaces. Harnesses will help compose them. Models will get better at calling them.&lt;/p&gt;

&lt;p&gt;Companies will not win by having the most MCP-capable servers. They will win by having the cleanest handles in their systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Customer Service Bot Is Slow Because It's Single-Threaded</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:24 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</link>
      <guid>https://dev.to/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</guid>
      <description>&lt;p&gt;Consider a typical enterprise support agent. A customer asks a complex compliance question and the agent dutifully queries the knowledge base, then searches the web, then checks policy docs. Sequential. Three LLM calls back to back. &lt;em&gt;That's ~12 seconds of wall time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Users start abandoning chat around 8.&lt;/p&gt;

&lt;p&gt;Fan out those three research calls in parallel, same calls, same models, same prompts, and &lt;em&gt;wall time drops to ~6.5 seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post covers the parallel sub-agent pattern using LangGraph and LangSmith. I'll show the code, but more importantly, I'll show you the failure modes because the pattern is simple and the bugs are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Math
&lt;/h2&gt;

&lt;p&gt;You have an agent that needs to hit three sources: internal KB, web search, and policy documents. Each research call takes 2–4 seconds. Sequentially:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research KB&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Web&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Policy&lt;/td&gt;
&lt;td&gt;~2.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~12s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In parallel, the three research steps overlap:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research (all three, parallel)&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~6.5s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A roughly 45% reduction from a structural change, not a prompt improvement. Every additional sub-agent you add sequentially costs another 2–4 seconds. In parallel, an extra branch adds no wall time unless it becomes the slowest one.&lt;/p&gt;
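
&lt;p&gt;The arithmetic behind the two tables, as a small sketch. The latencies are the approximate figures from the tables above; the variable names are just for illustration:&lt;/p&gt;

```python
# Approximate per-step latencies in seconds, taken from the tables above.
classify = 1.0
synthesize = 2.0
research = {"kb": 3.0, "web": 3.5, "policy": 2.5}

# Sequential: every research call adds its full latency.
sequential = classify + sum(research.values()) + synthesize

# Parallel: the research stage takes as long as its slowest branch.
parallel = classify + max(research.values()) + synthesize

reduction = 1 - parallel / sequential

print(f"sequential={sequential}s parallel={parallel}s reduction={reduction:.1%}")
```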

&lt;h2&gt;
  
  
  The Parallel Agents Architecture
&lt;/h2&gt;

&lt;p&gt;We're building a research assistant that fans out to three parallel sub-agents, aggregates results, and synthesizes a response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     ┌→ [Research: KB]     ─┐
[Classify Query] ────┼→ [Research: Web]    ─┼→ [Synthesize] → END
                     └→ [Research: Policy] ─┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LangGraph executes parallel branches in a superstep: all three branches run concurrently, and state updates are transactional. The fan-in edge waits for all branches before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Send API:&lt;/strong&gt; LangGraph has a &lt;code&gt;Send&lt;/code&gt; API for dynamic map-reduce where branch count is unknown at build time. Don't reach for it here. &lt;code&gt;Send&lt;/code&gt; is designed for running the same node N times with different inputs. For a fixed set of specialist agents, static edges or conditional routing are simpler, preserve graph structure, and keep every branch visible at compile time via &lt;code&gt;graph.get_graph().draw_mermaid()&lt;/code&gt;. In practice, you'll rarely need &lt;code&gt;Send&lt;/code&gt;. Start with static fan-out, graduate to conditional, reach for &lt;code&gt;Send&lt;/code&gt; as a last resort.&lt;/p&gt;

&lt;h2&gt;
  
  
  State: The One Thing You'll Get Wrong
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Annotated[list, operator.add]&lt;/code&gt; reducer tells LangGraph to &lt;strong&gt;concatenate&lt;/strong&gt; results from parallel branches instead of overwriting them. Without it, parallel branches race to write the results field. The last branch to finish wins, and you silently lose the other two. This is one of the most common bugs in parallel agent systems. The synthesizer produces suspiciously narrow responses, coverage evals fail intermittently, and you spend two days blaming the prompt before realizing you're only getting one source's data.&lt;/p&gt;
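
&lt;p&gt;To make the difference concrete, here is a plain-Python model of the two merge behaviors described above. It is a simplification for illustration, not LangGraph's actual channel implementation, and the &lt;code&gt;merge&lt;/code&gt; helper is hypothetical:&lt;/p&gt;

```python
import operator


def merge(updates, reducer=None):
    """Fold branch updates into one value, with or without a reducer."""
    value = []
    for update in updates:
        # With a reducer the update is combined with the current value;
        # without one, the latest update simply replaces it.
        value = reducer(value, update) if reducer else update
    return value


# One update per parallel branch, in the order the branches finished.
branch_updates = [
    [{"source": "knowledge_base"}],
    [{"source": "web_search"}],
    [{"source": "policy"}],
]

overwritten = merge(branch_updates)
concatenated = merge(branch_updates, operator.add)

print(len(overwritten), len(concatenated))
```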

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State, a sub-agent factory, and three agent instances. The &lt;code&gt;@traceable&lt;/code&gt; decorator ensures each agent appears as a distinct span in LangSmith — this will be the single most important debugging decision you make.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research_results: Annotated[list[dict], operator.add]
    final_response: str


def make_agent(name: str, focus: str):
    """Factory that builds a traceable research sub-agent."""

    @traceable(name=name, run_type="chain")
    def node(state: State) -&amp;gt; dict:
        response = llm.invoke([
            SystemMessage(content=f"You are the {name} agent. Focus on {focus}. "
                                  "Return a concise summary. Cite your source type."),
            HumanMessage(content=f"Research query: {state['question']}"),
        ])
        return {"research_results": [{"source": name, "content": response.content}]}

    return node


kb_agent = make_agent("knowledge_base", "internal knowledge base searches.")
web_agent = make_agent("web_search", "recent news and industry trends.")
policy_agent = make_agent("policy", "compliance, legal, and regulatory frameworks.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The synthesizer merges sub-agent outputs into one customer-facing response. The key constraint, worth knowing before you ship, is that policy information takes precedence. Without this, the synthesizer will cheerfully soften restrictions to sound more helpful.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="Synthesizer", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    context = "\n\n".join(
        f"[{r['source']}]: {r['content']}" for r in state["research_results"]
    )
    response = llm.invoke([
        SystemMessage(
            content="Synthesize the following research into a clear, actionable "
                    "response. When policy information conflicts with or constrains "
                    "other responses, the policy statement takes precedence. "
                    "Never soften or omit policy restrictions."
        ),
        HumanMessage(
            content=f"Customer question: {state['question']}\n\n"
                    f"Research findings:\n{context}"
        ),
    ])
    return {"final_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;

&lt;p&gt;Fifteen lines of wiring. &lt;code&gt;RetryPolicy&lt;/code&gt; goes on every research node so a provider 429 doesn't kill the entire pipeline. When a checkpointer is configured, successful branches are checkpointed and won't re-execute.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy

builder = StateGraph(State)

builder.add_node("kb", kb_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("web", web_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("policy", policy_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "kb")
builder.add_edge(START, "web")
builder.add_edge(START, "policy")
builder.add_edge(["kb", "web", "policy"], "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Conditional Routing: The Upgrade
&lt;/h2&gt;

&lt;p&gt;Sometimes hitting every source is wasteful. A simple "what's our refund policy?" doesn't need web search. Conditional fan-out lets you route based on the question using structured output, no regex parsing, no brittle string matching:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections.abc import Sequence

from pydantic import BaseModel, Field


class RoutingPlan(BaseModel):
    agents: list[str] = Field(
        description="Agents to activate: kb, web, policy"
    )

structured_llm = llm.with_structured_output(RoutingPlan)


def classify_and_route(state: State) -&amp;gt; Sequence[str]:
    plan = structured_llm.invoke([
        SystemMessage(content="Decide which research agents to invoke. "
                              "Available: kb, web, policy. When in doubt, include the agent."),
        HumanMessage(content=state["question"]),
    ])
    return plan.agents or ["kb"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tradeoff is real. Conditional routing saves latency on simple queries, but your routing logic becomes a new failure point. And with conditional fan-out, use individual edges from each node to &lt;code&gt;synthesize&lt;/code&gt;, not the list-style fan-in, or LangGraph waits forever for branches that were never dispatched.&lt;/p&gt;
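
&lt;p&gt;One way to shrink that new failure point is to validate the routing plan before dispatching. The sketch below assumes the three node names used in this post; the &lt;code&gt;sanitize_plan&lt;/code&gt; helper is hypothetical and would wrap &lt;code&gt;plan.agents&lt;/code&gt; before it is returned:&lt;/p&gt;

```python
KNOWN_AGENTS = ("kb", "web", "policy")


def sanitize_plan(agents, fallback=("kb",)):
    """Keep only known agent names, drop duplicates, and fall back if nothing is left."""
    cleaned = []
    for name in agents:
        key = str(name).strip().lower()
        if key in KNOWN_AGENTS and key not in cleaned:
            cleaned.append(key)
    # An empty plan would dispatch nothing, so route to the fallback instead.
    return cleaned or list(fallback)


print(sanitize_plan(["web", "WEB ", "billing"]))
print(sanitize_plan([]))
```

&lt;p&gt;Unknown names are dropped rather than raised here, which keeps the request alive. Whether to log or alert on them instead is a design choice left open.&lt;/p&gt;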

&lt;h2&gt;
  
  
  Production Failures in Concurrent Execution
&lt;/h2&gt;

&lt;p&gt;These are the failure modes that surface once parallel agents hit real traffic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Clobbering.&lt;/strong&gt; Synthesizer references only one source. Intermittent. Cause: missing &lt;code&gt;operator.add&lt;/code&gt; reducer. Parallel branches overwrite instead of appending. There's no warning, the graph runs fine, it just loses data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizer Contradicted the Policy Agent.&lt;/strong&gt; Say a customer asks about returning an opened product. The policy agent correctly stated the 30-day &lt;em&gt;unopened-only&lt;/em&gt; return policy. The KB agent mentioned "hassle-free returns." The synthesizer merged these into "You can return the product within 30 days, hassle-free," omitting the unopened requirement. LangSmith traces showed the policy agent's output was correct; the synthesizer span revealed where the information was lost. Fix: the policy-takes-precedence constraint in the synthesizer prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hung Branch Blocking Fan-In.&lt;/strong&gt; Response times spike from ~6s to 30s+. The fan-in waits for ALL branches. Your p50 is fine; your p99 is determined by the slowest branch on its worst day. Fix: async timeouts per branch, returning partial results (&lt;code&gt;{"source": "web_search", "content": "Timed out"}&lt;/code&gt;) rather than blocking the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator Under-Dispatched.&lt;/strong&gt; A significant fraction of multi-domain queries will be only partially routed. Over-dispatching (an agent returning empty results) is cheap. Under-dispatching is a customer getting an incomplete answer. Fix: explicit multi-domain examples in the routing prompt and a &lt;code&gt;"when in doubt, include the agent"&lt;/code&gt; instruction.&lt;/li&gt;
&lt;/ol&gt;
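
&lt;p&gt;The per-branch timeout from the third failure mode can be sketched with &lt;code&gt;asyncio.wait_for&lt;/code&gt;. This assumes the research call is a coroutine; &lt;code&gt;with_timeout&lt;/code&gt; and the two stand-in search functions are hypothetical, and how the wrapper plugs into a graph node is left out:&lt;/p&gt;

```python
import asyncio


async def with_timeout(source, coro, seconds):
    """Await a branch, but return a partial result if it exceeds the timeout."""
    try:
        content = await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # Report the timeout as data so the fan-in is not blocked.
        content = "Timed out"
    return {"research_results": [{"source": source, "content": content}]}


async def slow_search():
    # Stand-in for a hung provider call.
    await asyncio.sleep(0.2)
    return "late answer"


async def fast_search():
    await asyncio.sleep(0)
    return "kb answer"


async def demo():
    return await asyncio.gather(
        with_timeout("web_search", slow_search(), 0.05),
        with_timeout("knowledge_base", fast_search(), 0.05),
    )


results = asyncio.run(demo())
```

&lt;p&gt;The synthesizer then sees an explicit "Timed out" entry for the slow source and can say so, instead of the whole request stalling.&lt;/p&gt;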

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Parallel agents are hard to debug without tracing. &lt;code&gt;@traceable&lt;/code&gt; on every sub-agent gives you per-branch spans in LangSmith. Tag production traces with metadata for filtering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={"customer_tier": "enterprise", "channel": "chat"},
    tags=["production", "v2"],
):
    result = graph.invoke({"question": "How does GDPR affect our data pipeline?"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first thing to check when latency spikes: is one branch consistently slower? LangSmith makes that a 10-second investigation instead of an hour of log-grepping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Shipping without evals is negligence. Three evaluators catch the most common regressions: deterministic coverage, structural fan-out validation, and LLM-as-judge for overall quality.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="research-agent-evals",
    description="Parallel research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is our refund policy for enterprise clients?"},
        {"question": "How does GDPR affect our data pipeline architecture?"},
        {"question": "What competitors launched AI features last quarter?"},
    ],
    outputs=[
        {"must_mention": ["refund", "enterprise", "policy"]},
        {"must_mention": ["GDPR", "data", "compliance"]},
        {"must_mention": ["competitor", "AI", "feature"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
Customer query: {inputs[question]}
AI response: {outputs[final_response]}

Rate 0.0-1.0 on completeness, accuracy, and tone.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the synthesizer actually address the question?"""
    text = outputs.get("final_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def source_diversity(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Is the fan-out actually working, or did it silently degrade?"""
    results = outputs.get("research_results", [])
    sources = {r["source"] for r in results if isinstance(r, dict)}
    return {"key": "source_diversity", "score": min(len(sources) / 2.0, 1.0)}


def target(inputs: dict) -&amp;gt; dict:
    return graph.invoke({"question": inputs["question"]})


results = evaluate(
    target,
    data="research-agent-evals",
    evaluators=[quality_judge, coverage, source_diversity],
    experiment_prefix="parallel-research-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;source_diversity&lt;/code&gt; is the only automated check here that the output of more than one branch actually survives the fan-in. Without it, state clobbering can ship to production and sit there for weeks. Run this eval on every PR that touches agent code.&lt;/p&gt;
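
&lt;p&gt;Because the evaluator is a pure function, it can be exercised without a model or a dataset. The snippet repeats &lt;code&gt;source_diversity&lt;/code&gt; from above so it runs on its own; the two sample outputs are made up for illustration:&lt;/p&gt;

```python
def source_diversity(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Same evaluator as above: score how many distinct sources reached the output."""
    results = outputs.get("research_results", [])
    sources = {r["source"] for r in results if isinstance(r, dict)}
    return {"key": "source_diversity", "score": min(len(sources) / 2.0, 1.0)}


# Made-up output in which only one branch's result survived.
clobbered = {"research_results": [{"source": "policy", "content": "..."}]}

# Made-up output in which all three branches reported.
healthy = {
    "research_results": [
        {"source": "knowledge_base", "content": "..."},
        {"source": "web_search", "content": "..."},
        {"source": "policy", "content": "..."},
    ]
}

clobbered_score = source_diversity({}, clobbered, {})["score"]
healthy_score = source_diversity({}, healthy, {})["score"]
print(clobbered_score, healthy_score)
```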

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use parallel sub-agents when:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries regularly span 2+ domains in a single message&lt;/li&gt;
&lt;li&gt;You need per-domain traceability for debugging and compliance&lt;/li&gt;
&lt;li&gt;Sub-agents have different tool sets or retrieval sources&lt;/li&gt;
&lt;li&gt;You're iterating on prompts and need isolated regression testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries are single-domain (a FAQ bot doesn't need orchestration)&lt;/li&gt;
&lt;li&gt;Latency budget is extremely tight (routing adds one LLM call)&lt;/li&gt;
&lt;li&gt;You have fewer than 3 distinct knowledge domains&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Parallel sub-agents aren't architecturally complex. The pattern is a fan-out, a fan-in, and a reducer. The code is about 15 lines of graph wiring. The production hardening is everything else.&lt;/p&gt;

&lt;p&gt;Start with static fan-out. Add conditional routing when you have data showing which sources matter for which queries. Write the &lt;code&gt;source_diversity&lt;/code&gt; eval before you write the second prompt. And put &lt;code&gt;operator.add&lt;/code&gt; on your list fields. You'll thank me later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/01-parallel-sub-agents/" rel="noopener noreferrer"&gt;Parrallel Agents Github Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/quickstart" rel="noopener noreferrer"&gt;LangGraph Quickstart (State, Reducers, Graph Construction)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith Observbaility &amp;amp; Tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation" rel="noopener noreferrer"&gt;LangSmith Evaluation Framework&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded" rel="noopener noreferrer"&gt;https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your AI Just Emailed a Customer Without Permission</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:21 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</link>
      <guid>https://dev.to/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</guid>
      <description>&lt;p&gt;In a customer complaint handler for a fintech company you have drafted responses, checked tone, and verified responses to match company policy. Automated from end to end. Then, the agent sends a $4,200 refund approval to a customer who'd asked about a fee schedule. The LLM hallucinates the complaint, writes up a professional apology with a specific dollar amount, and fires it off before anyone on the team even knows.&lt;/p&gt;

&lt;p&gt;Better prompts won’t help because the problem isn't what the model says, it's that nothing stops it from saying it.&lt;/p&gt;

&lt;p&gt;To fix this you need an approval gate. Somewhere in the agent’s graph where execution... stops. State gets written to disk and a human looks at the draft. Only after they say "yeah, send it" does anything go out the door. LangGraph has a built-in primitive for this called &lt;code&gt;interrupt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's walk through the full pattern here. The code is straightforward but state management can trip you up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost argument (if you need one)
&lt;/h2&gt;

&lt;p&gt;If you're already sold on why AI shouldn't email customers unsupervised, skip this, but if you need to convince your PM, here's some napkin math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Without Gate&lt;/th&gt;
&lt;th&gt;With Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Messages sent/day&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate (wrong tone/info)&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;td&gt;~0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bad messages/day&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg cost per bad message&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily risk&lt;/td&gt;
&lt;td&gt;$3,000&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What we’re building
&lt;/h2&gt;

&lt;p&gt;A customer complaint response pipeline. Complaint comes in, AI drafts a response, a human approves or edits, system sends the final version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Intake] → [Draft Response] → [INTERRUPT: Human Review] → [Send Response] → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;interrupt&lt;/code&gt; is where execution pauses. All the graph state (draft, original complaint, metadata, etc.) gets checkpointed. It could be hours or days before someone reviews it, and when they do, the graph picks up right where it stopped.&lt;/p&gt;

&lt;p&gt;Even in serverless environments, &lt;code&gt;interrupt&lt;/code&gt; is resilient, as long as the checkpointer writes to durable storage. The Python process can crash. The server can restart. You resume with the same &lt;code&gt;thread_id&lt;/code&gt; and LangGraph reloads everything from the checkpointer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The state schema
&lt;/h2&gt;

&lt;p&gt;Whatever the reviewer needs to see has to be in state before the interrupt fires.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    complaint: str
    customer_id: str
    draft_response: str
    review_decision: str
    reviewer_notes: str
    final_response: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The nodes
&lt;/h2&gt;

&lt;p&gt;Let’s build three nodes: draft, review, and send. All with &lt;code&gt;@traceable&lt;/code&gt;, because six months from now when someone asks "who approved sending that email to the VP of procurement at our biggest account," you want a trace showing what the AI wrote vs. what a person changed.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="draft_response", run_type="chain")
def draft_response(state: State) -&amp;gt; dict:
    response = llm.invoke([
        SystemMessage(
            content="You are a customer service agent. Draft a professional, "
                    "empathetic response to the following complaint. Be specific "
                    "about next steps. Do NOT promise refunds or credits unless "
                    "the complaint clearly warrants one. Keep it under 150 words."
        ),
        HumanMessage(
            content=f"Customer ID: {state['customer_id']}\n\n"
                    f"Complaint: {state['complaint']}"
        ),
    ])
    return {"draft_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The review node is where &lt;code&gt;interrupt()&lt;/code&gt; does its work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import interrupt

@traceable(name="human_review", run_type="chain")
def human_review(state: State) -&amp;gt; dict:
    decision = interrupt({
        "draft": state["draft_response"],
        "customer_id": state["customer_id"],
        "complaint": state["complaint"],
        "instructions": "Review the draft. Respond with a JSON object: "
                        '{"action": "approve" | "edit" | "reject", '
                        '"edited_response": "...", "notes": "..."}'
    })
    return {
        "review_decision": decision["action"],
        "reviewer_notes": decision.get("notes", ""),
        "final_response": decision.get("edited_response", state["draft_response"])
            if decision["action"] != "reject" else "",
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The dict you pass to &lt;code&gt;interrupt()&lt;/code&gt; is the payload. It shows up in the &lt;code&gt;__interrupt__&lt;/code&gt; field of the graph's return value, which is what your UI or Slack bot reads to build the review screen. When someone resumes with &lt;code&gt;Command(resume={"action": "approve"})&lt;/code&gt;, that dict becomes what &lt;code&gt;interrupt()&lt;/code&gt; returns. On resume, LangGraph re-runs the node from its first line, and this time &lt;code&gt;interrupt()&lt;/code&gt; returns the resume value instead of pausing, so keep any code that precedes the &lt;code&gt;interrupt()&lt;/code&gt; call free of side effects. It looks like a normal function call but there's a checkpoint boundary hiding inside it.&lt;/p&gt;
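
&lt;p&gt;Since the resume value comes from a UI or a Slack bot, it is worth validating before the node trusts it. A minimal sketch, assuming the three actions named in the instructions string above; &lt;code&gt;normalize_decision&lt;/code&gt; is a hypothetical helper that the review node could call on whatever &lt;code&gt;interrupt()&lt;/code&gt; returns:&lt;/p&gt;

```python
ALLOWED_ACTIONS = ("approve", "edit", "reject")


def normalize_decision(decision, draft):
    """Validate a reviewer's resume payload and return the fields the review node stores."""
    if not isinstance(decision, dict):
        raise TypeError("decision must be a dict")
    action = decision.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    edited = decision.get("edited_response", "")
    if action == "edit" and not edited.strip():
        raise ValueError("an edit needs a non-empty edited_response")
    if action == "reject":
        final = ""
    elif action == "edit":
        final = edited
    else:
        # Approve sends the draft exactly as the reviewer saw it.
        final = draft
    return {
        "review_decision": action,
        "reviewer_notes": decision.get("notes", ""),
        "final_response": final,
    }


approved = normalize_decision({"action": "approve"}, draft="Draft text")
edited = normalize_decision(
    {"action": "edit", "edited_response": "Fixed text"}, draft="Draft text"
)
```

&lt;p&gt;One deliberate difference from the node above: an approval here ignores any stray &lt;code&gt;edited_response&lt;/code&gt;, so only an explicit edit can change what goes out.&lt;/p&gt;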

&lt;p&gt;Send node. Don't send if it was rejected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="send_response", run_type="chain")
def send_response(state: State) -&amp;gt; dict:
    if state["review_decision"] == "reject":
        return {"final_response": "[REJECTED] " + state["reviewer_notes"]}
    return {"final_response": state["final_response"]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Wiring it up
&lt;/h2&gt;

&lt;p&gt;The checkpointer makes interrupts durable. Use &lt;code&gt;InMemorySaver&lt;/code&gt; for dev and &lt;code&gt;PostgresSaver&lt;/code&gt; for prod. If you forget the checkpointer, &lt;code&gt;interrupt()&lt;/code&gt; throws a &lt;code&gt;RuntimeError&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("draft", draft_response)
builder.add_node("review", human_review)
builder.add_node("send", send_response)

builder.add_edge(START, "draft")
builder.add_edge("draft", "review")
builder.add_edge("review", "send")
builder.add_edge("send", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The full interrupt/resume cycle
&lt;/h2&gt;

&lt;p&gt;Two &lt;code&gt;invoke&lt;/code&gt; calls. The first one runs until the interrupt and stops; the second one picks up where it left off.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import Command

config = {"configurable": {"thread_id": "complaint-1234"}}

# Phase 1: Run until the interrupt
result = graph.invoke(
    {
        "complaint": "I was charged twice for my subscription last month. "
                     "Order #A-9912. I want a refund immediately.",
        "customer_id": "cust_8837",
    },
    config=config,
)

# The graph paused. Extract the interrupt payload.
interrupt_data = result["__interrupt__"][0].value
print(f"Draft for review: {interrupt_data['draft']}")
print(f"Customer: {interrupt_data['customer_id']}")

# Phase 2: Human reviews and edits the draft (could be minutes or days later)
final_result = graph.invoke(
    Command(resume={
        "action": "edit",
        "edited_response": "We've identified the duplicate charge on Order #A-9912. "
                           "A refund of $29.99 has been initiated and will appear "
                           "in 3-5 business days. We apologize for the inconvenience.",
        "notes": "Verified duplicate charge in billing system. Approved refund.",
    }),
    config=config,  # Same thread_id — this is how LangGraph finds the checkpoint
)

print(f"Final response: {final_result['final_response']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That &lt;code&gt;thread_id&lt;/code&gt; in the config matters more than anything else here. It's the key into the checkpointer. Without a &lt;code&gt;thread_id&lt;/code&gt; you can't resume. Treat it as a primary key and map it to something stable in your system: ticket ID, conversation ID, etc.&lt;/p&gt;
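
&lt;p&gt;A minimal sketch of that mapping, assuming tickets are identified by a string ID. The &lt;code&gt;thread_config&lt;/code&gt; helper and the &lt;code&gt;ticket-&lt;/code&gt; prefix are hypothetical names for illustration; only the &lt;code&gt;{"configurable": {"thread_id": ...}}&lt;/code&gt; shape comes from the code above.&lt;/p&gt;

```python
def thread_config(ticket_id: str) -> dict:
    """Build a LangGraph config whose thread_id is derived from a stable ticket ID.

    Hypothetical helper: the "ticket-" prefix is an assumption for illustration.
    """
    if not ticket_id or not ticket_id.strip():
        # Refuse to invent an ID: a made-up thread_id could never be resumed.
        raise ValueError("ticket_id is required to build a resumable thread_id")
    return {"configurable": {"thread_id": f"ticket-{ticket_id.strip()}"}}


# The first invoke and the later resume both derive the same config.
first_config = thread_config("TICKET-4821")
resume_config = thread_config("TICKET-4821")
print(first_config == resume_config)
```

&lt;p&gt;Because the ID is derived rather than generated, any service that knows the ticket can rebuild the config without sharing extra state.&lt;/p&gt;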

&lt;h2&gt;
  
  
  Adding risk-based routing
&lt;/h2&gt;

&lt;p&gt;The basic version sends everything through human review. Start there, but eventually reviewers get tired of approving "thanks for contacting us, we're looking into it" all day, and you'll want to auto-approve the low-risk stuff.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pydantic import BaseModel, Field


class RiskAssessment(BaseModel):
    risk_level: str = Field(description="low, medium, or high")
    reason: str = Field(description="Why this risk level was assigned")


risk_llm = llm.with_structured_output(RiskAssessment)


@traceable(name="assess_risk", run_type="chain")
def assess_risk(state: State) -&amp;gt; dict:
    assessment = risk_llm.invoke([
        SystemMessage(
            content="Assess the risk level of this customer service response. "
                    "high = involves money, legal, account changes, or could "
                    "be interpreted as a binding commitment. "
                    "medium = emotional topic, could escalate. "
                    "low = simple acknowledgment, FAQ, status update."
        ),
        HumanMessage(
            content=f"Complaint: {state['complaint']}\n\n"
                    f"Draft response: {state['draft_response']}"
        ),
    ])
    return {"review_decision": assessment.risk_level}


def route_by_risk(state: State) -&amp;gt; str:
    if state["review_decision"] == "low":
        return "send"
    return "review"


builder_v2 = StateGraph(State)

builder_v2.add_node("draft", draft_response)
builder_v2.add_node("assess", assess_risk)
builder_v2.add_node("review", human_review)
builder_v2.add_node("send", send_response)

builder_v2.add_edge(START, "draft")
builder_v2.add_edge("draft", "assess")
builder_v2.add_conditional_edges("assess", route_by_risk, {"send": "send", "review": "review"})
builder_v2.add_edge("review", "send")
builder_v2.add_edge("send", END)

graph_v2 = builder_v2.compile(checkpointer=InMemorySaver())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fair warning: you've now introduced a second LLM call as a gate, and that gate can be wrong in both directions. Under-classify risk and messages go out without review. Over-classify and reviewers are right back to rubber-stamping everything. Run the classifier in logging-only mode for a couple of weeks first (route everything through review, but record what the classifier would have done and use long-term memory to tune the classifier). Then start skipping reviews on low-risk messages after you trust the data.&lt;/p&gt;
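
&lt;p&gt;One way to sketch the logging-only phase is a routing function that records the classifier's decision but always enforces review. The function name, the &lt;code&gt;shadow_mode&lt;/code&gt; parameter, and the logger name are assumptions for illustration, not part of the companion code.&lt;/p&gt;

```python
import logging

logger = logging.getLogger("risk_shadow")  # hypothetical logger name


def route_by_risk_shadow(state: dict, shadow_mode: bool = True) -> str:
    """Pick the next node, logging what the classifier alone would have done."""
    classifier_route = "send" if state["review_decision"] == "low" else "review"
    if shadow_mode:
        # Logging-only phase: record the would-be decision, enforce human review.
        logger.info("classifier=%s enforced=review", classifier_route)
        return "review"
    return classifier_route


print(route_by_risk_shadow({"review_decision": "low"}))                     # shadow phase
print(route_by_risk_shadow({"review_decision": "low"}, shadow_mode=False))  # after rollout
```

&lt;p&gt;With its default arguments this should drop in where &lt;code&gt;route_by_risk&lt;/code&gt; is passed to &lt;code&gt;add_conditional_edges&lt;/code&gt;. Comparing the logged decisions against what reviewers actually did gives you the data to decide when to turn shadow mode off.&lt;/p&gt;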

&lt;h2&gt;
  
  
  The bugs
&lt;/h2&gt;

&lt;p&gt;The demo works great... but...&lt;/p&gt;

&lt;h3&gt;
  
  
  Lost thread_id
&lt;/h3&gt;

&lt;p&gt;Someone approves a draft in Slack. The integration pulls out the approval decision but constructs a &lt;em&gt;new&lt;/em&gt; thread_id instead of looking up the one stored with the interrupt payload. Now &lt;code&gt;Command(resume=...)&lt;/code&gt; creates a fresh graph where the input is an approval decision, not the complaint. &lt;/p&gt;

&lt;p&gt;This happens a lot. Store the thread_id alongside the interrupt payload when you surface it to reviewers. Put it in a database. Put it in the Slack message metadata. Do not lose it.&lt;/p&gt;
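
&lt;p&gt;A rough sketch of the database option, using an in-memory SQLite table as a stand-in for your real store. The table name, column names, and both helper functions are hypothetical.&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption for this sketch).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pending_reviews (review_id TEXT PRIMARY KEY, thread_id TEXT NOT NULL)"
)


def record_pending_review(review_id: str, thread_id: str) -> None:
    """Save the thread_id at the moment the draft is surfaced to reviewers."""
    db.execute(
        "INSERT INTO pending_reviews (review_id, thread_id) VALUES (?, ?)",
        (review_id, thread_id),
    )
    db.commit()


def thread_id_for(review_id: str) -> str:
    """Look up the original thread_id; never construct a new one on resume."""
    row = db.execute(
        "SELECT thread_id FROM pending_reviews WHERE review_id = ?", (review_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no pending review {review_id!r}")
    return row[0]


record_pending_review("slack-msg-001", "complaint-1234")
config = {"configurable": {"thread_id": thread_id_for("slack-msg-001")}}
```

&lt;p&gt;Failing loudly on an unknown review ID is deliberate: an error is easier to debug than a resume against a thread that never existed.&lt;/p&gt;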

&lt;h3&gt;
  
  
  Stale state
&lt;/h3&gt;

&lt;p&gt;Reviewer opens the draft at 11:30. Goes to lunch. Comes back at 1pm and hits approve. In the meantime, the customer sent two more messages and someone on the support team already replied manually. The approved draft is now responding to a conversation that moved on.&lt;/p&gt;

&lt;p&gt;LangGraph has no idea. It resumes from the checkpoint, which is frozen in time. Fix this by putting a &lt;code&gt;created_at&lt;/code&gt; timestamp in the interrupt payload and checking it against the customer record's &lt;code&gt;last_updated_at&lt;/code&gt; on resume. If anything changed, re-draft.&lt;/p&gt;
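
&lt;p&gt;The comparison itself is small. A sketch, assuming both timestamps are timezone-aware datetimes; &lt;code&gt;is_stale&lt;/code&gt; and the &lt;code&gt;"redraft"&lt;/code&gt; / &lt;code&gt;"resume"&lt;/code&gt; labels are hypothetical names.&lt;/p&gt;

```python
from datetime import datetime, timezone


def is_stale(draft_created_at: datetime, last_updated_at: datetime) -> bool:
    """True when the customer record changed after the draft was created."""
    return last_updated_at > draft_created_at


# created_at travels in the interrupt payload; last_updated_at is read on resume.
created_at = datetime(2026, 1, 5, 11, 30, tzinfo=timezone.utc)
last_updated_at = datetime(2026, 1, 5, 12, 15, tzinfo=timezone.utc)

next_step = "redraft" if is_stale(created_at, last_updated_at) else "resume"
print(next_step)
```

&lt;p&gt;Run the check before calling &lt;code&gt;Command(resume=...)&lt;/code&gt;, so a stale approval never reaches the send node.&lt;/p&gt;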

&lt;h3&gt;
  
  
  Double resume
&lt;/h3&gt;

&lt;p&gt;Shared review queue. Two reviewers see the same pending draft. Both click approve. Depending on the checkpointer implementation, the second resume is either a no-op or an error, but by then the send logic already fired on the first one. Maybe that's fine. Maybe you just sent duplicate emails.&lt;/p&gt;

&lt;p&gt;Make the resume idempotent: check whether the thread already has a &lt;code&gt;review_decision&lt;/code&gt; before doing anything with the resume.&lt;/p&gt;
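
&lt;p&gt;A sketch of the guard, assuming a single process: the first approval claims the thread and later ones are ignored. The in-process set is a simplification; with several workers you would likely need a unique constraint or an atomic update in a shared store. All names here are hypothetical.&lt;/p&gt;

```python
import threading

_claimed = set()               # thread_ids whose review was already resumed
_claim_lock = threading.Lock()


def claim_review(thread_id: str) -> bool:
    """Return True for the first caller only; later callers get False."""
    with _claim_lock:
        if thread_id in _claimed:
            return False
        _claimed.add(thread_id)
        return True


def handle_approval(thread_id: str, sent: list) -> str:
    """Resume at most once per thread, however many reviewers click approve."""
    if not claim_review(thread_id):
        return "already handled"
    sent.append(thread_id)  # stand-in for graph.invoke(Command(resume=...), config=...)
    return "resumed"


sent = []
print(handle_approval("complaint-1234", sent))
print(handle_approval("complaint-1234", sent))
```

&lt;p&gt;The claim happens before the resume, so the second reviewer gets a clear "already handled" answer and nothing is sent twice.&lt;/p&gt;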

&lt;h3&gt;
  
  
  Interrupt reordering
&lt;/h3&gt;

&lt;p&gt;Two &lt;code&gt;interrupt()&lt;/code&gt; calls in one node (say, one for policy review and one for tone). LangGraph matches resume values to interrupts by position, not by name. There are no names. Refactor and swap the order, and the policy answer goes to the tone check and vice versa.&lt;/p&gt;

&lt;p&gt;Don't put multiple interrupts in one node; use separate nodes instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing across the gap
&lt;/h2&gt;

&lt;p&gt;Interrupt-based workflows leave a gap in the LangSmith timeline where the human review happened. The draft trace ends, then hours later the resume trace starts, and nothing connects them unless you're deliberate about it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

ticket_id = "TICKET-4821"
config = {"configurable": {"thread_id": ticket_id}}

# Phase 1: Draft
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "draft"},
    tags=["production", "complaint-handler", "phase-1"],
):
    result = graph.invoke(
        {
            "complaint": "Your app crashed and I lost 3 hours of work.",
            "customer_id": "cust_2291",
        },
        config=config,
    )

# ... time passes, human reviews ...

# Phase 2: Resume
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "resume", "reviewer": "jane@company.com"},
    tags=["production", "complaint-handler", "phase-2"],
):
    final = graph.invoke(
        Command(resume={"action": "approve", "notes": "Looks good."}),
        config=config,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Put the ticket ID in the metadata for both phases. Now you can filter in LangSmith and see the full lifecycle of a single complaint even though draft and resume were separate invocations. The &lt;code&gt;reviewer&lt;/code&gt; field in phase 2 is your audit trail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;You need to know if drafts are any good before a human ever sees them.&lt;/p&gt;

&lt;p&gt;Dataset setup and evaluators live in &lt;code&gt;evals.py&lt;/code&gt; in the companion repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

from complaint_handler import graph

ls_client = Client()

DATASET_NAME = "complaint-handler-evals"

if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Human-in-the-loop complaint handler evaluation dataset",
    )
    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {
                "complaint": "Charged twice for order #A-1234. Want a refund.",
                "customer_id": "cust_001",
            },
            {
                "complaint": "App crashes every time I open the settings page.",
                "customer_id": "cust_002",
            },
            {
                "complaint": "Your CEO's tweet was offensive. Cancelling my account.",
                "customer_id": "cust_003",
            },
        ],
        outputs=[
            {
                "must_mention": ["refund", "order", "A-1234"],
                "risk": "high",
            },
            {
                "must_mention": ["crash", "settings", "investigating"],
                "risk": "medium",
            },
            {
                "must_mention": ["feedback", "understand", "account"],
                "risk": "high",
            },
        ],
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three evaluators. LLM judge for draft quality, keyword coverage, and a check for unauthorized promises:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DRAFT_QUALITY_PROMPT = """\
Customer complaint: {inputs}
AI draft response: {outputs}

Rate 0.0-1.0 on empathy, accuracy, and professionalism.
Deduct points if the draft promises specific remedies (refunds, credits)
without explicit authorization.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

draft_judge = create_llm_as_judge(
    prompt=DRAFT_QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="draft_quality",
    continuous=True,
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft actually address the complaint specifics?"""
    text = outputs.get("draft_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def no_unauthorized_promises(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft promise refunds or credits without authorization?"""
    text = outputs.get("draft_response", "").lower()
    dangerous_phrases = ["refund has been", "credit has been", "we will refund",
                         "we will credit", "compensation of"]
    violations = sum(1 for p in dangerous_phrases if p in text)
    return {"key": "no_unauthorized_promises", "score": 1.0 if violations == 0 else 0.0}


def target(inputs: dict) -&amp;gt; dict:
    """Run the graph until the interrupt (draft phase only)."""
    config = {"configurable": {"thread_id": f"eval-{inputs['customer_id']}"}}
    result = graph.invoke(inputs, config=config)
    return {"draft_response": result.get("draft_response", "")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;no_unauthorized_promises&lt;/code&gt; catches the failure mode from the top of this post. If the draft says "a refund has been initiated" when nobody authorized a refund, it scores zero. Run this eval every time you change the system prompt.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == "__main__":
    results = evaluate(
        target,
        data=DATASET_NAME,
        evaluators=[draft_judge, coverage, no_unauthorized_promises],
        experiment_prefix="complaint-handler-v1",
        max_concurrency=4,
    )
    print("\nEvaluation complete. Check LangSmith for results.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Human In The Loop
&lt;/h2&gt;

&lt;p&gt;If AI is writing things that go to customers, you need a gate. The same goes for processing refunds, updating account records, and anything else you can't undo with a quick "sorry about that" email. Regulated industries need the gate plus an audit trail of who approved what.&lt;/p&gt;

&lt;p&gt;You don't need this for internal stuff: summarizing meeting notes, running analysis for a dashboard, generating reports that a human reads.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The two function calls: &lt;code&gt;interrupt()&lt;/code&gt; and &lt;code&gt;Command(resume=...)&lt;/code&gt;. Pause execution, persist state, resume later.&lt;/p&gt;

&lt;p&gt;Most of the work is everything around those two calls. Thread IDs getting lost, the world changing during the review gap, two reviewers approving the same draft, traces that need to connect across a timeline gap of hours or days.&lt;/p&gt;

&lt;p&gt;Start by routing every response through review. Reviewers will complain. Good. Measure which categories they rubber-stamp, run your evals, and only then start auto-approving the boring stuff.  &lt;/p&gt;

&lt;p&gt;Technical References&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/02-human-in-the-loop/tree/9e328bdd3770541a764134efa7f87d53de2dad6b" rel="noopener noreferrer"&gt;Human in the Loop GitHub Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/interrupts" rel="noopener noreferrer"&gt;Interrupts (Human-in-the-loop / pause &amp;amp; resume)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/persistence" rel="noopener noreferrer"&gt;Persistence (Thread IDs &amp;amp; Checkpointers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/overview" rel="noopener noreferrer"&gt;LangGraph Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation-quickstart?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LangSmith Evaluation Quickstart&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission" rel="noopener noreferrer"&gt;https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming Agent State with LangGraph</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:26 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/streaming-agent-state-with-langgraph-10kg</link>
      <guid>https://dev.to/focused_dot_io/streaming-agent-state-with-langgraph-10kg</guid>
      <description>&lt;p&gt;Your research agent takes 9 seconds to answer a question. It fans out to three sources, synthesizes results, returns a polished answer. The user sees a blank screen for all nine of those seconds. By second 5 they've refreshed the page, doubled your API costs, and still seen nothing.&lt;/p&gt;

&lt;p&gt;Streaming fixes this. Show the user what the agent is doing while it's doing it: "Searching knowledge base...", "Found 3 results...", "Synthesizing..." and then stream the final answer token by token. Same 9 seconds, but the user sees progress from millisecond 200.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Perception Math
&lt;/h2&gt;

&lt;p&gt;Identical work, different user experience:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Wall time&lt;/th&gt;
&lt;th&gt;Time to first byte&lt;/th&gt;
&lt;th&gt;Perceived wait&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invoke() (no streaming)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode="updates")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode=["updates", "custom", "messages"])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Can see what it’s doing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;A multi-step research agent that streams three types of events to the UI: node-level progress updates, custom status messages from inside nodes, and token-by-token LLM output for the final synthesis.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          ┌─ stream: "Searching KB..."
[Intake] → [Research KB]  ┤
                          └─ stream: {results: 3}
                                    ↓
                          ┌─ stream: "Analyzing results..."
         → [Synthesize]  ┤
                          └─ stream: tokens... t-o-k-e-n-b-y-t-o-k-e-n
                                    ↓
                                     → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three stream modes run simultaneously: &lt;code&gt;updates&lt;/code&gt; for graph state changes, &lt;code&gt;custom&lt;/code&gt; for application-specific progress events, and &lt;code&gt;messages&lt;/code&gt; for LLM token streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Modes
&lt;/h2&gt;

&lt;p&gt;LangGraph exposes five stream modes. You'll use three in practice:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;What it streams&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;values&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full state after each superstep&lt;/td&gt;
&lt;td&gt;Debugging, state inspection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;updates&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;State delta from each node&lt;/td&gt;
&lt;td&gt;Production UIs: lightweight, shows which node ran&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;messages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM tokens + metadata&lt;/td&gt;
&lt;td&gt;Chat UIs: token-by-token output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;custom&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Arbitrary data from &lt;code&gt;get_stream_writer()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Progress bars, status messages, structured events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;debug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Everything, including internal execution details&lt;/td&gt;
&lt;td&gt;Development only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In production, use &lt;code&gt;["updates", "custom", "messages"]&lt;/code&gt;. &lt;code&gt;values&lt;/code&gt; sends the entire state on every step. &lt;code&gt;debug&lt;/code&gt; is for development.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State and two nodes: a research step that emits custom progress events, and a synthesizer that streams its LLM response token by token.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.config import get_stream_writer
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research: str
    answer: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The research node uses &lt;code&gt;get_stream_writer()&lt;/code&gt; to push status updates to the client. These show up in the &lt;code&gt;custom&lt;/code&gt; stream mode:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="research", run_type="chain")
def research(state: State) -&amp;gt; dict:
    writer = get_stream_writer()

    writer({"step": "research", "status": "starting", "message": "Searching knowledge base..."})

    response = llm.invoke([
        SystemMessage(
            content="You are a research assistant. Search for relevant information "
                    "about the user's question. Return a concise summary of findings."
        ),
        HumanMessage(content=state["question"]),
    ])

    writer({"step": "research", "status": "complete", "message": "Research complete."})

    return {"research": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

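&lt;p&gt;The synthesizer uses the LLM normally. LangGraph automatically streams its tokens when &lt;code&gt;messages&lt;/code&gt; mode is active. The same applies to the research node's LLM call, so filter on the &lt;code&gt;langgraph_node&lt;/code&gt; key of the chunk metadata if you only want the synthesizer's tokens:&lt;/p&gt;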

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="synthesize", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    writer = get_stream_writer()
    writer({"step": "synthesize", "status": "starting", "message": "Synthesizing answer..."})

    response = llm.invoke([
        SystemMessage(
            content="Synthesize the research into a clear, actionable answer. "
                    "Be concise but thorough."
        ),
        HumanMessage(
            content=f"Question: {state['question']}\n\nResearch:\n{state['research']}"
        ),
    ])

    writer({"step": "synthesize", "status": "complete", "message": "Done."})
    return {"answer": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("research", research)
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "research")
builder.add_edge("research", "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Multi-mode Streaming
&lt;/h2&gt;

&lt;p&gt;A single &lt;code&gt;.stream()&lt;/code&gt; call can emit node updates, custom progress events, and LLM tokens simultaneously:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for mode, chunk in graph.stream(
    {"question": "What are the key differences between REST and GraphQL for mobile APIs?"},
    stream_mode=["updates", "custom", "messages"],
):
    if mode == "updates":
        # Node completed — chunk is the state delta
        node_name = list(chunk.keys())[0]
        print(f"[node] {node_name} completed")

    elif mode == "custom":
        # Custom progress event from get_stream_writer()
        print(f"[status] {chunk.get('message', chunk)}")

    elif mode == "messages":
        # LLM token — chunk is a tuple of (message_chunk, metadata)
        message_chunk, metadata = chunk
        if hasattr(message_chunk, "content") and message_chunk.content:
            print(message_chunk.content, end="", flush=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the output shape changes with multi-mode. Single mode (&lt;code&gt;stream_mode="updates"&lt;/code&gt;) yields chunks directly. Multi-mode (&lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt;) yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples. Code that works with single mode breaks with multi-mode because the unpacking is different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Async streaming
&lt;/h2&gt;

&lt;p&gt;For production APIs, use &lt;code&gt;astream&lt;/code&gt; with &lt;code&gt;async for&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from langsmith import traceable


@traceable(name="stream_research", run_type="chain")
async def stream_research(question: str):
    chunks = []
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                chunks.append(message_chunk.content)
                yield {"type": "token", "content": message_chunk.content}
        elif mode == "custom":
            yield {"type": "status", "content": chunk}
        elif mode == "updates":
            yield {"type": "node_update", "content": chunk}


async def main():
    async for event in stream_research("How do vector databases work?"):
        if event["type"] == "token":
            print(event["content"], end="", flush=True)
        else:
            print(f"\n[{event['type']}] {event['content']}")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  FastAPI + SSE
&lt;/h2&gt;

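&lt;p&gt;The standard production pattern is a FastAPI endpoint that converts graph streams to Server-Sent Events (SSE). SSE is one-directional (server to client) and works over plain HTTP/1.1. The browser's &lt;code&gt;EventSource&lt;/code&gt; client also reconnects automatically, but &lt;code&gt;EventSource&lt;/code&gt; only issues GET requests, so a POST endpoint like the one below has to be consumed with &lt;code&gt;fetch&lt;/code&gt; and needs its own reconnect handling:&lt;/p&gt;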
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langsmith import traceable

app = FastAPI()


@traceable(name="sse_research_stream", run_type="chain")
async def generate_sse(question: str):
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                data = json.dumps({"type": "token", "content": message_chunk.content})
                yield f"data: {data}\n\n"
        elif mode == "custom":
            data = json.dumps({"type": "status", "content": chunk})
            yield f"data: {data}\n\n"
        elif mode == "updates":
            node_name = list(chunk.keys())[0] if chunk else "unknown"
            data = json.dumps({"type": "node_complete", "node": node_name})
            yield f"data: {data}\n\n"

    yield "data: [DONE]\n\n"


@app.post("/research/stream")
async def stream_endpoint(payload: dict):
    return StreamingResponse(
        generate_sse(payload["question"]),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt; in the response headers and &lt;code&gt;proxy_buffering off&lt;/code&gt; in your nginx config. Without these, nginx buffers the entire response before sending it to the client and your streaming pipeline becomes a regular HTTP response.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bugs
&lt;/h2&gt;

&lt;p&gt;These break under load.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reverse proxy buffering
&lt;/h3&gt;

&lt;p&gt;You deploy behind nginx or a cloud load balancer. SSE events arrive at the client in one big batch after the stream completes. Cause: proxy buffering is on by default. Set the &lt;code&gt;X-Accel-Buffering&lt;/code&gt; header, disable &lt;code&gt;proxy_buffering&lt;/code&gt; in nginx, and check your cloud provider's load balancer settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Message chunk ordering
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;messages&lt;/code&gt; mode, you receive &lt;code&gt;AIMessageChunk&lt;/code&gt; objects. The &lt;code&gt;content&lt;/code&gt; field is usually a string, except when the model returns tool calls, in which case it's a list of content blocks. Concatenating &lt;code&gt;.content&lt;/code&gt; naively produces garbled output. Check &lt;code&gt;isinstance(message_chunk.content, str)&lt;/code&gt; before concatenating and handle tool-call chunks separately.&lt;/p&gt;
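
&lt;p&gt;A small helper can make that check explicit. This is a sketch: the block shape (&lt;code&gt;{"type": "text", "text": ...}&lt;/code&gt;) is an assumption modeled on Anthropic-style content blocks, and &lt;code&gt;chunk_text&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
def chunk_text(content) -> str:
    """Extract display text from chunk content that may be a str or a list of blocks."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content or []:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
        # Other block types (for example tool-call blocks) carry no display text.
    return "".join(parts)


print(chunk_text("Hello"))
print(chunk_text([{"type": "text", "text": "Hi"}, {"type": "tool_use", "name": "search"}]))
```

&lt;p&gt;Tool-call blocks are skipped here rather than rendered; route them to whatever handles tool calls in your UI.&lt;/p&gt;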

&lt;h3&gt;
  
  
  Backpressure on slow clients
&lt;/h3&gt;

&lt;p&gt;Your agent streams tokens faster than the client can consume them (mobile on 3G, overloaded browser tab). The server-side buffer grows until memory pressure kills the process. Use bounded async queues or configure your ASGI server's per-connection send buffer limits.&lt;/p&gt;
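
&lt;p&gt;A bounded queue is the simplest version of this. The sketch below uses plain &lt;code&gt;asyncio&lt;/code&gt; with a simulated slow client; the queue size, the &lt;code&gt;None&lt;/code&gt; sentinel, and the function names are illustrative choices, not anything LangGraph or FastAPI prescribes.&lt;/p&gt;

```python
import asyncio


async def produce(queue: asyncio.Queue, tokens: list) -> None:
    for token in tokens:
        await queue.put(token)  # suspends while the queue is full
    await queue.put(None)       # sentinel: stream finished


async def consume(queue: asyncio.Queue, received: list, delay: float) -> None:
    while True:
        token = await queue.get()
        if token is None:
            return
        await asyncio.sleep(delay)  # simulates a slow client
        received.append(token)


async def run_demo() -> list:
    queue = asyncio.Queue(maxsize=8)  # at most 8 tokens buffered for this connection
    received = []
    tokens = [f"tok{i}" for i in range(50)]
    await asyncio.gather(produce(queue, tokens), consume(queue, received, 0.001))
    return received


received = asyncio.run(run_demo())
print(len(received))
```

&lt;p&gt;The trade-off: when the queue is full the producer waits, so a slow client slows its own stream down instead of growing server memory. Whether to wait, drop status events, or disconnect is a product decision.&lt;/p&gt;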

&lt;h3&gt;
  
  
  Mixed single/multi mode unpacking
&lt;/h3&gt;

&lt;p&gt;A developer switches from &lt;code&gt;stream_mode="updates"&lt;/code&gt; to &lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt; and doesn't update the unpacking code. The &lt;code&gt;for chunk in graph.stream(...)&lt;/code&gt; loop now yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples, but the code still treats each item as a dict. Calls like &lt;code&gt;chunk.keys()&lt;/code&gt; fail loudly, but membership checks such as &lt;code&gt;"synthesize" in chunk&lt;/code&gt; just return &lt;code&gt;False&lt;/code&gt;: no error, just wrong data flowing through. Always use multi-mode from the start, even if you only need one mode today.&lt;/p&gt;
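
&lt;p&gt;If existing code has to support both shapes during a migration, a thin adapter can normalize them. This is a sketch with a hypothetical &lt;code&gt;iter_events&lt;/code&gt; helper; the fake lists stand in for what &lt;code&gt;graph.stream(...)&lt;/code&gt; yields in each mode.&lt;/p&gt;

```python
def iter_events(stream, stream_mode):
    """Yield (mode, chunk) pairs whether stream_mode was a string or a list."""
    if isinstance(stream_mode, str):
        # Single mode: the stream yields bare chunks, so attach the mode name.
        for chunk in stream:
            yield stream_mode, chunk
    else:
        # Multi-mode: the stream already yields (mode, chunk) tuples.
        for mode, chunk in stream:
            yield mode, chunk


# Fake streams standing in for graph.stream(...) output in each shape.
single = [{"research": {"research": "notes"}}]
multi = [("custom", {"message": "Searching..."}), ("updates", {"research": {"research": "notes"}})]

print(list(iter_events(single, "updates")))
print(list(iter_events(multi, ["updates", "custom"])))
```

&lt;p&gt;Downstream code then always unpacks &lt;code&gt;mode, chunk&lt;/code&gt; and never needs to know which form the caller picked.&lt;/p&gt;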

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Stream-based workflows produce many small events. Tag your traces so you can measure stream performance in &lt;a href="https://www.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={
        "stream_mode": "multi",
        "client_type": "web",
        "session_id": "sess_12345",
    },
    tags=["production", "streaming", "v1"],
):
    for mode, chunk in graph.stream(
        {"question": "Explain vector similarity search"},
        stream_mode=["updates", "custom", "messages"],
    ):
        pass  # process chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The LangSmith trace shows per-node timings. Use this to find nodes that are slow to emit their first token (high time-to-first-byte) vs. nodes that produce tokens slowly (low throughput).&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Streaming doesn't change what the agent produces; it changes how the output is delivered. Evals verify that streamed output matches what &lt;code&gt;invoke()&lt;/code&gt; would return, and that custom events are emitted correctly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

DATASET_NAME = "streaming-agent-evals"

# Guard creation so re-running the script doesn't fail on an existing dataset.
if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Streaming research agent evaluation dataset",
    )

    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {"question": "What are the tradeoffs between REST and GraphQL?"},
            {"question": "How do vector databases enable semantic search?"},
            {"question": "What is retrieval-augmented generation?"},
        ],
        outputs=[
            {"must_mention": ["REST", "GraphQL", "tradeoff"]},
            {"must_mention": ["vector", "embedding", "similarity"]},
            {"must_mention": ["retrieval", "generation", "context"]},
        ],
    )


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
User question: {inputs[question]}
Agent response: {outputs[answer]}

Rate 0.0-1.0 on completeness, accuracy, and clarity.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the response address the key topics?"""
    text = outputs.get("answer", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def stream_completeness(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Does streaming produce the same output as invoke?"""
    streamed = outputs.get("answer", "")
    invoked_result = graph.invoke({"question": inputs["question"]})
    invoked = invoked_result.get("answer", "")
    # Exact match is too strict — LLM outputs vary. Check key content overlap.
    streamed_words = set(streamed.lower().split())
    invoked_words = set(invoked.lower().split())
    if not invoked_words:
        return {"key": "stream_completeness", "score": 1.0}
    overlap = len(streamed_words &amp;amp; invoked_words) / len(invoked_words)
    return {"key": "stream_completeness", "score": min(overlap, 1.0)}


def custom_events_emitted(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Were custom status events emitted during streaming?"""
    events = outputs.get("custom_events", [])
    expected_steps = {"research", "synthesize"}
    seen_steps = {e.get("step") for e in events if isinstance(e, dict)}
    coverage_score = len(seen_steps &amp;amp; expected_steps) / len(expected_steps)
    return {"key": "custom_events", "score": coverage_score}


def target(inputs: dict) -&amp;gt; dict:
    custom_events = []
    answer_chunks = []
    for mode, chunk in graph.stream(
        {"question": inputs["question"]},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "custom":
            custom_events.append(chunk)
        elif mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                answer_chunks.append(message_chunk.content)
        # "updates" chunks are ignored here; the answer is assembled from message chunks.

    return {
        "answer": "".join(answer_chunks),
        "custom_events": custom_events,
    }


results = evaluate(
    target,
    data="streaming-agent-evals",
    evaluators=[quality_judge, coverage, stream_completeness, custom_events_emitted],
    experiment_prefix="streaming-agent-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;stream_completeness&lt;/code&gt; verifies that the streaming path produces equivalent output to &lt;code&gt;invoke()&lt;/code&gt;. This catches bugs where stream chunking drops content, like an SSE serializer silently truncating chunks that exceed a size limit.&lt;/p&gt;
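
&lt;p&gt;To see what the overlap score reports when a stream drops content, here is a minimal, self-contained sketch of the same word-overlap calculation. The helper name &lt;code&gt;word_overlap&lt;/code&gt; and the sample strings are illustrative assumptions, not part of the agent code.&lt;/p&gt;

```python
def word_overlap(streamed: str, invoked: str) -> float:
    """Fraction of the invoked answer's unique words that also appear in the streamed answer."""
    streamed_words = set(streamed.lower().split())
    invoked_words = set(invoked.lower().split())
    if not invoked_words:
        return 1.0  # nothing to compare against, mirroring the evaluator's default
    return len(streamed_words.intersection(invoked_words)) / len(invoked_words)


# Hypothetical sample: the stream lost the tail of the answer.
invoked = "rest is simple while graphql is flexible"
truncated = "rest is simple"

full_score = word_overlap(invoked, invoked)
truncated_score = word_overlap(truncated, invoked)
print(full_score, truncated_score)
```

&lt;p&gt;Because the score is computed on unique words, it flags dropped content but not reordered or repeated chunks, so treat it as a coarse check rather than proof of equivalence.&lt;/p&gt;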

&lt;h2&gt;
  
  
  When to Stream
&lt;/h2&gt;

&lt;p&gt;Use streaming for any user-facing agent interaction that takes longer than 2 seconds, multi-step agents where progress indicators reduce perceived latency, and chat interfaces where token-by-token display is expected.&lt;/p&gt;

&lt;p&gt;Skip it for background jobs with no user waiting, when latency is already under a second, and when the output is structured data rather than natural language.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Three modes in production: &lt;code&gt;updates&lt;/code&gt; for node transitions, &lt;code&gt;custom&lt;/code&gt; for progress events via &lt;code&gt;get_stream_writer()&lt;/code&gt;, and &lt;code&gt;messages&lt;/code&gt; for token streaming. Combine them with &lt;code&gt;stream_mode=["updates", "custom", "messages"]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Deploy behind FastAPI + SSE with &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt;. Watch for reverse proxy buffering, backpressure on slow clients, and the single-to-multi mode unpacking change.&lt;/p&gt;

&lt;p&gt;Technical References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/03-streaming-agents" rel="noopener noreferrer"&gt;Streaming Agent State GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/streaming" rel="noopener noreferrer"&gt;LangGraph Streaming (Python)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langchain/streaming/overview" rel="noopener noreferrer"&gt;LangChain Streaming Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/add-metadata-tags" rel="noopener noreferrer"&gt;LangSmith Tracing Metadata &amp;amp; Tags&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/streaming-agent-state-with-langgraph" rel="noopener noreferrer"&gt;https://focused.io/lab/streaming-agent-state-with-langgraph&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Driving Value with LangSmith Insights</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:24 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/driving-value-with-langsmith-insights-5bp</link>
      <guid>https://dev.to/focused_dot_io/driving-value-with-langsmith-insights-5bp</guid>
      <description>&lt;p&gt;Imagine you have a deployed agentic system in production. Everything is going well, users are interacting with the product, and there are no critical issues. But what comes next? How can we monitor our system to understand what needs to be improved, fixed, or built next?&lt;/p&gt;

&lt;p&gt;The first requirement is great observability, and LangSmith is a strong tool for this.&lt;/p&gt;

&lt;p&gt;We can use it to monitor all of our production runs, detect errors and understand how the model behaves across different interactions.&lt;/p&gt;

&lt;p&gt;In October 2025, LangChain released a new feature: &lt;a href="https://www.blog.langchain.com/insights-agent-multiturn-evals-langsmith/" rel="noopener noreferrer"&gt;&lt;strong&gt;Insights Agent&lt;/strong&gt;&lt;/a&gt;. This feature allows an agent to analyze your LangSmith traces and surface usage patterns, common behaviors, and recurring error modes automatically. Instead of manually digging through logs, you can let an agent do the analysis for you. If you want to read more about it, here's a &lt;a href="https://docs.langchain.com/langsmith/insights?ref=blog.langchain.com&amp;amp;ajs_aid=d4bdd020-281f-4f7e-86e5-726ef5abdfe6&amp;amp;ajs_uid=141a090a-530d-4b83-96d1-d2b713439671&amp;amp;_gl=1*a05q53*_gcl_au*MTgzNDAzODM2OS4xNzY2MjUyNTM4*_ga*MTAyNTcyODQ0NS4xNzYzMzg1Mjg0*_ga_47WX3HKKY2*czE3NjYyNTI1MzgkbzkyJGcwJHQxNzY2MjUyNTM4JGo2MCRsMCRoMA.." rel="noopener noreferrer"&gt;link to the docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to run the Insights Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are going to walk through a short demo of how to use this exciting new tool with a simple chatbot graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Plus or Enterprise LangSmith plan&lt;/li&gt;
&lt;li&gt;A tracing project with a good amount of traces to analyze&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first thing we need to do is go to our LangSmith project. Once there, we are going to see multiple tabs at the top of the screen. Click on the one that says “Insights”.&lt;/p&gt;

&lt;p&gt;If this is our first time running Insights, we are going to see an empty page and a “Create Insight” button. We can go ahead and click it.&lt;/p&gt;

&lt;p&gt;Now, we are presented with two ways to run the Insights Agent: auto or manual. For the sake of simplicity, let’s start with the “auto” mode.&lt;/p&gt;

&lt;p&gt;We need to answer the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“What does the agent in this tracing project do?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“What would you like to learn about this agent?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“How are traces in this tracing project structured? Are there specific input/output keys to pay attention to?”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This information will be used in our agent prompt, and will help tailor the output to our needs.&lt;/p&gt;

&lt;p&gt;We can also choose whether to use OpenAI or Anthropic as our provider. Note that you will need an API key for whichever provider you choose.&lt;/p&gt;

&lt;p&gt;After we click on “Run Job”, we are going to see a message saying the agent has started running in the background and that we will have our results in a few minutes. If we navigate to the Insights tab, we will see the run in progress, along with the results as they come in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to understand and use the results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this example, we are going to be using a chatbot that answers questions about restaurants and helps with making reservations.&lt;/p&gt;

&lt;p&gt;The first part of the output is a summary of the findings. It answers the question we were asked earlier: what we wanted to learn about this agent. In this case, we wanted to understand what customers were asking the chatbot in order to identify user patterns.&lt;/p&gt;

&lt;p&gt;We can see that in this example, 57% of the questions being asked to our chatbot are about feature discovery, 29% are about operating hours, and only 14% are about making reservations.&lt;/p&gt;

&lt;p&gt;This kind of result is interesting because it helps us understand what customers actually need. Maybe we initially assumed that most questions would be about making reservations, but this data doesn’t support that. &lt;strong&gt;LangSmith Insights is critical because it grounds our product decisions in real user behavior, helping us invest engineering effort where it delivers the most value.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we click on the “Hide Findings” button, we can do a deep dive into the traces, broken down by category.&lt;/p&gt;

&lt;p&gt;If we click on any of the categories we can see all runs within that category and navigate to the trace we are interested in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using evaluation + Insights to get the highest impact on value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we are comfortable with the categories of our generated insights, we can build evaluation datasets that mirror those categories. This way, we can understand how well our agent is answering questions across categories.&lt;/p&gt;

&lt;p&gt;Why do insights change this process? Imagine we run our evaluations and we discover the agent is only answering 40% of questions around reservations correctly. But insights reveal that reservation questions are actually the least common user queries. That context lowers the overall criticality of the issue and helps us prioritize fixes more intelligently.&lt;/p&gt;
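
&lt;p&gt;One way to make that prioritization concrete is to weight each category's failure rate by its share of traffic. The sketch below uses the traffic shares from the Insights summary above and the 40% reservation accuracy from this example; the accuracy figures for the other two categories, and the category key names, are made-up placeholders.&lt;/p&gt;

```python
# Share of production traffic per category, from the Insights summary.
traffic_share = {
    "feature_discovery": 0.57,
    "operating_hours": 0.29,
    "reservations": 0.14,
}

# Accuracy per category from evaluation runs.
# Only the reservations figure comes from the example; the others are hypothetical.
accuracy = {
    "feature_discovery": 0.80,
    "operating_hours": 0.90,
    "reservations": 0.40,
}

# Expected share of all user questions that fail, attributed to each category.
failure_impact = {
    category: round(share * (1.0 - accuracy[category]), 4)
    for category, share in traffic_share.items()
}

# Highest impact first.
ranked = sorted(failure_impact.items(), key=lambda item: item[1], reverse=True)
for category, impact in ranked:
    print(f"{category}: {impact:.1%} of all questions fail here")
```

&lt;p&gt;With these placeholder numbers, the 60% failure rate on reservations touches about 8.4% of all questions, which is less than the roughly 11.4% coming from feature discovery even though that category scores much higher on accuracy.&lt;/p&gt;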

&lt;p&gt;Insights add context to the analysis, but they don’t override business requirements. This is only an example: Depending on the use case, a low-frequency category like reservations may still demand zero errors if the business impact is high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have gone through a simple example to illustrate the power of this tool. But as we’ve seen, we can ask the agent virtually any question we want. For example, we could ask, &lt;em&gt;“What types of questions is my agent hallucinating on or answering incorrectly?”&lt;/em&gt; and the agent will find all traces that match those criteria. This is extremely flexible and powerful.&lt;/p&gt;

&lt;p&gt;LangSmith is still king when it comes to building and observing production-grade AI applications, and this kind of feature is the reason why. I encourage you to try it out and continue to create amazing applications with it!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/driving-value-with-langsmith-insights" rel="noopener noreferrer"&gt;https://focused.io/lab/driving-value-with-langsmith-insights&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Most Teams Don't Have a Data Flywheel</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:56:27 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</link>
      <guid>https://dev.to/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</guid>
      <description>&lt;p&gt;&lt;em&gt;LangChain shows how the loop works. Here's why it stalls in production and what it actually takes to make it compound.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of &lt;a href="https://focused.io" rel="noopener noreferrer"&gt;Focused&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LangChain has been pushing a clear idea: production data should make your agents better.&lt;/p&gt;

&lt;p&gt;The loop looks like this: production traces capture real behavior, those traces become datasets, evaluators score performance, feedback improves those evaluators, and improvements get deployed back into the system. Over time, the system compounds.&lt;/p&gt;

&lt;p&gt;That is the data flywheel.&lt;/p&gt;

&lt;p&gt;And it is directionally right.&lt;/p&gt;

&lt;p&gt;But most teams building agents today are not seeing that compounding effect. The loop exists on paper. In practice, it stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Data Flywheel Actually Is
&lt;/h2&gt;

&lt;p&gt;In the LangChain ecosystem, especially with LangSmith, the flywheel connects three things: observability, evaluation, and iteration.&lt;/p&gt;

&lt;p&gt;Production traces become the source of truth. Failures are turned into datasets. Datasets become regression tests. Evaluators score performance at scale. Feedback improves those evaluators over time.&lt;/p&gt;

&lt;p&gt;The goal is simple: every production interaction should become an improvement signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Breaks
&lt;/h2&gt;

&lt;p&gt;The issue is not the idea. The issue is that most teams never fully implement the system required to make it work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Traces are collected, but nothing happens.&lt;/strong&gt; Teams instrument their agents. They capture inputs, outputs, tool calls, and intermediate steps. And then it stops there. The missing step is turning traces into something actionable — structured datasets, labeled failures, repeatable test cases. Without that, you are not building a flywheel. You are just logging behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. There is no real evaluation layer.&lt;/strong&gt; This is where most teams stall. They review outputs manually. They rely on intuition. They make changes based on what "looks better." There is no automated evaluation, no regression testing, no baseline performance. So when something changes, there is no way to know if it improved or regressed. If you cannot measure it, the loop does not spin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evaluators are not trusted.&lt;/strong&gt; Even when teams introduce evaluation, it often breaks down. LLM-as-a-judge systems can scale evaluation, but only if they are clearly defined, calibrated against human feedback, and continuously refined. Without that, evaluator output becomes noisy. And noisy signals lead to random changes. If you do not trust your evaluation layer, you cannot rely on your flywheel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The loop never actually closes.&lt;/strong&gt; Even when failures are identified, prompts get tweaked ad hoc, changes are not versioned, and fixes are not tested against past failures. So nothing compounds. A real loop looks like this: a failure is captured, the failure becomes a dataset, the dataset is evaluated, a change is applied, and the change is tested against that dataset. If you skip any step, the loop breaks.&lt;/p&gt;
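
&lt;p&gt;The last step of that loop, testing a change against past failures, can be sketched as a simple regression gate. Everything here is a hypothetical stand-in: &lt;code&gt;FailureCase&lt;/code&gt;, the two agent functions, and the substring check take the place of a real dataset, agent, and evaluator.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FailureCase:
    """One production failure captured as a repeatable test case."""
    question: str
    must_mention: str


def pass_rate(agent: Callable[[str], str], cases: list[FailureCase]) -> float:
    """Fraction of past failures the agent now handles."""
    if not cases:
        return 1.0
    passed = sum(1 for c in cases if c.must_mention.lower() in agent(c.question).lower())
    return passed / len(cases)


# Hypothetical dataset built from captured failures.
past_failures = [
    FailureCase("When do you close on Sunday?", "9pm"),
    FailureCase("Can I book a table for six?", "reservation"),
]


def baseline_agent(question: str) -> str:
    # Stand-in for the agent version that produced the failures.
    return "Sorry, I am not sure."


def candidate_agent(question: str) -> str:
    # Stand-in for the agent after a prompt or logic change.
    if "close" in question.lower():
        return "We close at 9pm on Sunday."
    return "Yes, I can make a reservation for six."


baseline = pass_rate(baseline_agent, past_failures)
candidate = pass_rate(candidate_agent, past_failures)
ship = candidate > baseline  # only a measured improvement ships
print(baseline, candidate, ship)
```

&lt;p&gt;The point of the gate is that the comparison runs against the same stored failures every time, so a fix that quietly breaks an older case shows up as a lower pass rate instead of a surprise in production.&lt;/p&gt;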

&lt;p&gt;&lt;strong&gt;5. There is no real production pressure.&lt;/strong&gt; This is the quiet failure that kills most flywheels. If your agent is not embedded in a real system, you do not get meaningful traffic, you do not see real edge cases, and you do not generate useful data. Internal demos do not create real signals. Without real usage, the flywheel has nothing to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Real Data Flywheel Looks Like
&lt;/h2&gt;

&lt;p&gt;At a system level, this is not a concept. It is a pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrumentation.&lt;/strong&gt; Every step of the agent is observable — inputs, decisions, state transitions, outputs. Using structured systems like LangGraph makes this consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset creation.&lt;/strong&gt; Production traces are turned into labeled examples, categorized failures, and reusable datasets. This is where the loop actually begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation.&lt;/strong&gt; You define what "good" looks like and measure it — correctness, tool selection, completion quality. Evaluations run continuously, not just during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration.&lt;/strong&gt; Evaluators improve over time. Human feedback corrects them, agreement is measured, alignment increases. This step is critical and often skipped.&lt;/p&gt;
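
&lt;p&gt;As a rough illustration of measuring agreement, the sketch below compares an LLM judge's pass/fail labels with human labels on the same traces. Cohen's kappa is one common choice and is an assumption here, not a metric this article prescribes; the labels are hypothetical.&lt;/p&gt;

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of examples where the judge matches the human label."""
    matches = sum(1 for h, j in zip(human, judge) if h == j)
    return matches / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail labels (1 = pass, 0 = fail) on the same ten traces.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

raw = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement={raw:.2f}, kappa={kappa:.2f}")
```

&lt;p&gt;In this sample the judge matches the human on 8 of 10 traces, but kappa comes out near 0.58 because some of that agreement would happen by chance. Tracking a chance-corrected number over time gives a steadier signal of whether calibration is improving.&lt;/p&gt;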

&lt;p&gt;&lt;strong&gt;Iteration and deployment.&lt;/strong&gt; Changes are applied intentionally — to prompts, graph structure, and tool logic. Then tested against historical failures before being deployed. Only validated improvements ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift Most Teams Need to Make
&lt;/h2&gt;

&lt;p&gt;The data flywheel is often described like a product feature. That is the problem.&lt;/p&gt;

&lt;p&gt;It is not something you turn on. It is an engineering system that connects observability, evaluation, feedback, and deployment into a continuous loop. Without that system, you do not have a flywheel. You have logs and intuition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Most teams do not have a data flywheel. They have a growing pile of traces and a sense that things might be improving.&lt;/p&gt;

&lt;p&gt;The teams that actually get better over time treat this differently. They build the system that makes improvement inevitable.&lt;/p&gt;

&lt;p&gt;If your agent only records what happened, it will stall. If your system learns from what happened, it compounds.&lt;/p&gt;

&lt;p&gt;That is the difference.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langchain</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
