DEV Community: Focused

AI Agent Infrastructure Is Splitting at the State Layer | Focused Labs

Austin Vance — Thu, 16 Jul 2026 21:40:53 +0000

The agent stack is boring in exactly the place buyers keep looking: the loop.

The agent loop did not disappear. It is still in plain sight, but the rest of the system now lives elsewhere. A layer has emerged to manage state: the durable record of the work an AI agent does to service customer requests. This is the crux of enterprise work. The enterprise configures a system (AI or not) to create and maintain such records, and it decides who can read them, how long to keep them, and what may be done with them. AI agent infrastructure now splits at the state layer. A model plus a prompt-centered agent no longer carries the infrastructure argument for AI agents.

The state layer is now the dividing line between an agent that is a product and an agent that is a parlor trick that happens to have access to Salesforce.

The loop is the wrong buying surface

The harder questions live one layer down.

Where does the state live after the tab has been closed? What writes the checkpoint? What happens when a tool call is successfully completed but the model fails before a summary can be generated? Where do traces, policy decisions and permission grants enter the run? What can an enterprise export when the vendor contract expires? Who can delete memory? Who can subpoena memory? Who is responsible for the record when two agents are working with the same customer and writing to the same file?

The durable boundary is where the agent becomes operating infrastructure.

State turns agent behavior into infrastructure.

The database is becoming the backend

On Thursday, LangChain and MongoDB announced a partnership that frames agent systems as requiring retrieval, persistent memory, operational data access, observability, and reliable deployment. The launch itself is a set of tools. The proof that the partnership matters sits in the “more about” line: "Agents need more than a model and a prompt". A tool-centered agent, or a prompt-centered builder, is not a product.

That sentence is doing more work than the launch language around it.

The official LangGraph documentation notes that durable memory should use a database-backed checkpointer, with examples for Postgres, MongoDB, Redis, and Oracle. The LangSmith deployment documentation exposes the state-layer switch directly: set LS_DEFAULT_CHECKPOINTER_BACKEND to mongo and provide LS_MONGODB_URI. That line belongs in the platform review, right next to storage, deployment, and data residency.
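A platform review can make that switch explicit with a small preflight check. The sketch below is stdlib-only and hypothetical: the two variable names and the value `mongo` come from the passage above, while the helper name `checkpointer_settings` and the handling of an unset variable are assumptions for illustration, not documented LangSmith behavior.

```python
from typing import Mapping, Optional


def checkpointer_settings(env: Mapping[str, str]) -> dict:
    """Summarize the state-layer settings a platform review should record.

    Hypothetical helper: only the variable names and the value "mongo"
    come from the text; everything else is an illustrative assumption.
    """
    backend: Optional[str] = env.get("LS_DEFAULT_CHECKPOINTER_BACKEND")
    if backend is None:
        # Assumption: an unset variable means "whatever the platform defaults to".
        # This sketch does not claim to know what that default is.
        return {"backend": "platform-default", "uri_configured": False}
    if backend == "mongo":
        if not env.get("LS_MONGODB_URI"):
            raise ValueError("LS_MONGODB_URI is required when the backend is mongo")
        # Report only that a URI exists; the URI itself may embed credentials.
        return {"backend": "mongo", "uri_configured": True}
    return {"backend": backend, "uri_configured": False}
```

Calling `checkpointer_settings(os.environ)` at deploy time would give reviewers one recorded answer to "where do checkpoints live?" before any agent runs.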

Boring. Also the point.

State is where lock-in turns real

O'Reilly's 2026 AI agent architecture diagram defines six layers between the LLM and an AI agent running real workloads. Memory / persistent state is a first-level primitive in that architecture, sitting above the vector database. The diagram matches what the market has learned about the framework layer: it is easy to swap out when the application is well bounded. State is much harder to swap out.

TGVP outlined a similar argument: stateful services create moats because memory graphs, identity stores, policy logic, and audit history compound over time. I dislike most moat arguments for infrastructure, such as the claim that custom UIs or a “connector catalog” create a barrier to entry, when a catalog can be replaced by a single adapter. This one has merit, because memory graphs, identity stores, policy logic, and audit histories really do compound over time.

I first wrote this out as part of my vendor evaluation criteria. The first item was model neutrality, which seems to be what people want to hear. But durable memory should use a database-backed checkpointer, and the implementation details of that checkpointer (storage, deployment, data residency) should be open to review, followed by the memory format, the trace or log schema, the policy and decision portions of the state, and the audit export. Open model routing on top of closed state is a nicer cage with better lighting.

Governance follows the state

The state layer also pulls governance out of slideware.

Agent governance sounds abstract when it lives only in policy documents. Once the platform is viewed as a series of layers, the team starts asking concrete questions about a given run. Who was the principal that started the run? Which task narrowed the set of permissions granted to that principal? Which tool call crossed the boundary of what that principal was allowed to do under those permissions? Which policy allowed that tool call? Which checkpoint recorded the run? And which trace proves that the run happened in that order?

Research into the security of agent-based systems is only beginning to explore the gap between what an agent is granted permission to do and what an MCP server will actually allow it to do. AgentBound looked at 296 popular MCP servers and found that automatically generated permission manifests worked without modification 80.9% of the time, with 0.6 ms average enforcement overhead. MCP tool access is heading toward explicit manifests and runtime enforcement of access rights, and away from the current model of trust by default with servers running as host processes.

Identity work also dovetails nicely with this analysis. A 2026 AI identity report defines AI identity as a continuous relationship between what an agent declares and what it is observed to do. This relationship is established through the declaration, the observation, and the confidence in that observation. Thus the same architecture that tracks the state of a productive AI also establishes and tracks the identity of that AI. Declaration without observation is a credential. Observation without a durable store of state is a log. Confidence in the system’s understanding of the relationship between an agent’s declared identity and its observed behavior requires that the system have a durable store of state and be able to compare what has happened to what the system expected to happen.

Task-scoped access control, which we have covered before, belongs here too. A writeup of TrueFoundry’s TBAC (Task-Based Access Control) explains how identity-centric access models strain under agents, because each task may need a different permission set. Permission needs can change quickly from task to task, and that churn is particularly exposed to prompt-injection attacks. In short, TrueFoundry argues that task-based controls should bundle minimal permissions for the duration of the work. That is a great architecture, even if the vendor packaging leaves a bit to be desired.
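The idea of bundling minimal permissions for the duration of a task can be sketched in a few lines. This is an illustration of the general pattern, not TrueFoundry's implementation: `Principal`, `task_scope`, `call_tool`, and the permission strings are all hypothetical names.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Principal:
    name: str
    current_task: Optional[str] = None
    # Permissions active right now; empty outside of a task.
    active: Set[str] = field(default_factory=set)


@contextmanager
def task_scope(principal: Principal, task: str, permissions: Set[str]):
    """Grant a minimal permission bundle for the duration of one task."""
    previous = (principal.current_task, set(principal.active))
    principal.current_task, principal.active = task, set(permissions)
    try:
        yield principal
    finally:
        # The grant ends with the task, even if the task raised.
        principal.current_task, principal.active = previous


def call_tool(principal: Principal, tool: str) -> str:
    if tool not in principal.active:
        raise PermissionError(
            f"{principal.name} may not call {tool} during {principal.current_task}"
        )
    return f"{tool}: ok"


agent = Principal("support-agent")
with task_scope(agent, "refund-lookup", {"crm.read"}):
    allowed = call_tool(agent, "crm.read")
    try:
        call_tool(agent, "crm.write")  # outside the task's bundle
        denied = False
    except PermissionError:
        denied = True
```

Because the grant is restored in `finally`, an injected instruction that arrives after the task ends finds no standing permissions to abuse.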

We have looked at the authorization side before through tool-call security and agent principals (e.g. running an agent as a user). From a state-layer perspective, these all tie together: authorization, identity, memory, checkpoints, traces and audit records. The record of an agent doing work.

The stateful loop is the product

A live agent run has a rhythm.

Load memory. Call tools and enforce policy on each call. Write a trace for each tool call. Record the audit event. Update the run state. Later, resume the run from the evidence collected along the way. Each live run leaves a history the platform can inspect. That history is worth far more than a run driven only by prompts the model may or may not answer correctly.
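That rhythm can be sketched with nothing but the standard library. The example below is a toy under stated assumptions: SQLite stands in for the durable backend, and the table layout, the `ALLOWED_TOOLS` policy, the tool functions, and the `fail_at` crash switch are all hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for a durable database
db.executescript("""
CREATE TABLE checkpoint (run_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT);
CREATE TABLE trace (run_id TEXT, step INTEGER, tool TEXT, result TEXT);
CREATE TABLE audit (run_id TEXT, step INTEGER, decision TEXT);
""")

ALLOWED_TOOLS = {"lookup_order", "issue_refund"}  # hypothetical policy
calls = []  # records which tools actually executed


def lookup_order(state):
    calls.append("lookup_order")
    return "order found"


def issue_refund(state):
    calls.append("issue_refund")
    return "refund issued"


TOOLS = {"lookup_order": lookup_order, "issue_refund": issue_refund}
PLAN = ["lookup_order", "issue_refund"]


def run(run_id, plan, tools, fail_at=None):
    # Load memory: resume from the last checkpoint if one exists.
    row = db.execute(
        "SELECT next_step, state FROM checkpoint WHERE run_id = ?", (run_id,)
    ).fetchone()
    next_step, state = (row[0], json.loads(row[1])) if row else (0, {})
    for step in range(next_step, len(plan)):
        tool = plan[step]
        if step == fail_at:
            raise RuntimeError("simulated crash before this step")
        # Enforce policy and record the audit event for the decision.
        decision = "allow" if tool in ALLOWED_TOOLS else "deny"
        db.execute("INSERT INTO audit VALUES (?, ?, ?)", (run_id, step, decision))
        if decision == "deny":
            db.commit()
            raise PermissionError(tool)
        # Call the tool, write the trace, update the run state.
        result = tools[tool](state)
        state[tool] = result
        db.execute("INSERT INTO trace VALUES (?, ?, ?, ?)", (run_id, step, tool, result))
        db.execute(
            "INSERT OR REPLACE INTO checkpoint VALUES (?, ?, ?)",
            (run_id, step + 1, json.dumps(state)),
        )
        db.commit()
    return state


try:
    run("run-1", PLAN, TOOLS, fail_at=1)  # first attempt dies after step 0
except RuntimeError:
    pass
final_state = run("run-1", PLAN, TOOLS)   # resumes at step 1
```

The second call does not repeat `lookup_order`: the checkpoint row tells it where to resume, and the trace and audit tables hold the evidence of what already happened.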

State compounds.

The work does not have to be “unlearned” when an agent run is abandoned for whatever reason. A new run of work can start from the exact point at which the previous run left off.

However, all of this work would be for naught if the evidence from past runs of an agent were not properly monitored. This is why agent monitoring is infrastructure. The stateful agent loop that we discussed earlier includes a section for “write trace and audit records” for a reason. This is the means by which all of the side effects of a run of an agent are recorded, so that they can be inspected by a human (or automated system) later.

It is therefore no surprise that agent platforms able to handle real workloads are focused on platform work.

The model may still be the expensive line item. The state layer is where operational trust accumulates.

Platform evaluation moves to the state layer

The buying checklist should change.

Where does the platform store checkpoints for previous runs? What is the format of its memory? How does it join thread ID, run ID, trace ID, and audit ID? Does it record the permission decision for a tool call alongside the record of that call, or does that have to be rebuilt from elsewhere? Can the customer export state from the platform surfaces, or only by asking support? (An export is not automatically complete: what comes out may not match what the platform actually stored.) Can the system delete one user's memory without affecting other users stored in the same index? How does it partition state by tenant, environment, region, and retention rule?
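Several of those questions reduce to what a single stored record looks like. The sketch below shows one possible shape, assuming an in-memory list as a stand-in for the state layer; the class names, field names, and sample values are hypothetical and do not describe any vendor's schema.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ToolCallRecord:
    tenant: str
    user_id: str
    thread_id: str
    run_id: str
    trace_id: str
    audit_id: str
    tool: str
    permission_decision: str  # stored with the call, not rebuilt later


class StateStore:
    """In-memory stand-in for a platform's state layer (illustrative only)."""

    def __init__(self) -> None:
        self._records: List[ToolCallRecord] = []

    def write(self, record: ToolCallRecord) -> None:
        self._records.append(record)

    def export(self, tenant: str) -> List[dict]:
        # Export is scoped to one tenant and returns plain dicts.
        return [asdict(r) for r in self._records if r.tenant == tenant]

    def delete_user(self, tenant: str, user_id: str) -> int:
        # Remove one user's records without touching anyone else's.
        before = len(self._records)
        self._records = [
            r for r in self._records
            if not (r.tenant == tenant and r.user_id == user_id)
        ]
        return before - len(self._records)


store = StateStore()
store.write(ToolCallRecord("acme", "u1", "t1", "r1", "tr1", "a1", "crm.read", "allow"))
store.write(ToolCallRecord("acme", "u2", "t2", "r2", "tr2", "a2", "crm.write", "deny"))
store.write(ToolCallRecord("globex", "u1", "t3", "r3", "tr3", "a3", "crm.read", "allow"))
deleted = store.delete_user("acme", "u1")
remaining = store.export("acme")
```

A platform that cannot answer the checklist with something this direct is usually reconstructing the join from logs after the fact.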

Then ask the uncomfortable version: what gets lost during migration?

Prompt history and agent builder configuration are one thing. Checkpoints, memory, traces, approvals, audit records, and policy history are another. If the vendor can’t migrate all of that to the customer's new backend then the vendor is in control of the live behavior of the agent. Yes, the agent runs on the customer's cloud. Yes, the customer pays for the model and has a model key that the customer uses to sign into the agent. But the memory, the record of the agent’s governance, the record of the agent’s work, that lives somewhere else.

Own that layer or the agent's memory, governance and operating history belong to someone else's backend.

Agentic Workflows Should Get Less Agentic | Focused Labs

Austin Vance — Wed, 15 Jul 2026 03:32:07 +0000

Agentic workflows are supposed to get boring.

The first pass can be exploratory and pretty expensive. The tenth pass should be boring and not have to rediscover the solution through a completely new set of tools. That workflow has earned a promotion.

That is the useful part of the new Progressive Crystallization paper. It names a lifecycle agent teams are going to rediscover the painful way: agents explore, traces prove repeated behavior, tests turn that behavior into workflow code, and telemetry from those runs decides when the workflow gets demoted back to the agent layer.

The paper presents data from a cloud-network operations system handling real incidents, in which deterministic executions went from 0% to 45% over eight months. Over the same period, per-incident agent cost fell by more than 70% while the number of incidents doubled. The model did not simply get cheaper. Solved work stopped being agent work.

Exploration belongs upstream

We continue to see agentic workflows get treated as a fixed architecture choice: build an agent, give it tools, add approvals, add observability, and hope the loop gets better.

The better shape is a lifecycle.

The Progressive Crystallization taxonomy organizes the forms execution can take across that lifecycle. Type 3 execution is 'agent-orchestrated': an agent investigates a problem with bounded autonomy, checkpoints reads, and routes writes through human approval. Type 2 execution is 'hybrid': a workflow owns the steps, but the model interprets or classifies information at scoped points in the workflow. Type 1 execution is 'deterministic': typed API calls, conditionals, and similar constructs form a rigid, non-negotiable execution path. There is no place in a Type 1 workflow for model invocation at runtime.
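The taxonomy is easy to turn into a counter. The sketch below is an assumption-laden illustration: the classification rule (workflow ownership plus a count of runtime model calls) and the sample data are mine, not the paper's method.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class ExecutionType(Enum):
    DETERMINISTIC = 1        # Type 1: no model invocation at runtime
    HYBRID = 2               # Type 2: workflow-owned, model at scoped points
    AGENT_ORCHESTRATED = 3   # Type 3: agent investigates with bounded autonomy


def classify(workflow_owned: bool, model_calls: int) -> ExecutionType:
    # Assumed rule for illustration only.
    if workflow_owned and model_calls == 0:
        return ExecutionType.DETERMINISTIC
    if workflow_owned:
        return ExecutionType.HYBRID
    return ExecutionType.AGENT_ORCHESTRATED


def type_mix(executions: Iterable[ExecutionType]) -> Dict[ExecutionType, float]:
    """Fraction of executions per type."""
    counts = Counter(executions)
    total = sum(counts.values())
    return {t: counts.get(t, 0) / total for t in ExecutionType}


# Hypothetical sample: 9 deterministic, 5 hybrid, 6 agent-orchestrated runs.
sample = (
    [classify(True, 0)] * 9
    + [classify(True, 2)] * 5
    + [classify(False, 7)] * 6
)
mix = type_mix(sample)
```

Tracking `mix` over time is what makes the maturity curve visible: the Type 1 share should rise as solved paths are promoted.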

The maturity curve moves agency out of repeated execution and leaves it for novelty.

This is how enterprise teams build out agentic workflows. First, a novel incident gets explored with tools and state. Then, as that incident repeats, the team wants a workflow that implements that path of work. Then, as that workflow starts to get weird on the edge cases, the system needs a path back to exploration. And that's what we've been saying all along about LangGraph as the foundation for enterprise agents.

The compile step is the economic boundary

First we "reason" once through a problem. Then we execute that same problem solution over and over again as repeated incidents of the same thing. The Progressive Crystallization paper documents a real-world crystallization workflow, which turns out to be a straightforward incarnation of this fundamental software pattern.

The Compiled AI paper describes a class of live AI systems in which an LLM is used once to generate an artifact, the artifact is validated, and the validated code then runs repeatedly with zero execution tokens. On function-calling, the authors report 96% task completion with zero execution tokens, a break-even point of 17 transactions for the compile step, and a 57x reduction in tokens at 1,000 transactions.
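The break-even and reduction figures hang together under a very simple cost model. The sketch below is my simplification, not the paper's accounting: it assumes a fixed one-time compile cost, a fixed per-transaction cost for the uncompiled path, and zero execution tokens after compilation. The token counts are hypothetical and chosen only so that break-even lands at 17.

```python
def break_even(compile_tokens: float, tokens_per_call: float) -> float:
    """Transactions after which compiling once beats inferring every time."""
    return compile_tokens / tokens_per_call


def token_reduction(compile_tokens: float, tokens_per_call: float, n: int) -> float:
    """Ratio of uncompiled to compiled token spend after n transactions."""
    uncompiled = tokens_per_call * n
    compiled = compile_tokens  # assumption: zero execution tokens per run
    return uncompiled / compiled


# Hypothetical numbers: 1,000 tokens per agentic call, 17,000 to compile.
n_star = break_even(17_000, 1_000)
ratio_at_1000 = token_reduction(17_000, 1_000, 1_000)
```

Under these assumptions the ratio at 1,000 transactions is simply 1,000 divided by the break-even point, which lands in the same neighborhood as the reported 57x. The paper's own accounting may include terms this toy model leaves out.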

If a workflow runs repeatedly, keeping it agentic is a tax paid on every run. The tax is collected as runtime inference, so the counter is to reduce the number of runtime inference steps, and in particular the execution tokens each step requires. Model routing, prompt caching, and smaller models per step all point at the same idea: a process repeated often enough becomes worth compiling into a function that can be called directly. That is the basic move behind Compiled AI. We previously laid out the cost argument in AI Agent Cost Is a Runtime Signal at the level of an entire agent, whereas this tax is incurred by individual workflows within an agent.

A promoted workflow puts a slightly different contract on the trace. The trace serves as a candidate specification, which must ultimately pass acceptance tests. The steps must have idempotent side effects (keys, upserts, read-before-write). Human approvals that appeared in the trace as chat turns must be encoded in the workflow as explicit states. Failure modes must be typed exits.
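A minimal sketch of that contract, assuming a hypothetical credit-issuing step: the write is an upsert under an idempotency key, approval is an explicit state rather than a chat turn, and every outcome is a typed exit. `Ledger`, `Exit`, and `apply_credit` are illustrative names only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Exit(Enum):
    DONE = "done"
    AWAITING_APPROVAL = "awaiting_approval"
    REJECTED = "rejected"


@dataclass
class Ledger:
    rows: Dict[str, int] = field(default_factory=dict)
    writes: int = 0  # counts writes that actually changed the ledger

    def upsert(self, key: str, value: int) -> None:
        # Read-before-write: repeating the same call changes nothing.
        if self.rows.get(key) != value:
            self.rows[key] = value
            self.writes += 1


def apply_credit(ledger: Ledger, incident_id: str, amount: int,
                 approved: Optional[bool]) -> Exit:
    # Approval is an explicit workflow state: None means "not decided yet".
    if approved is None:
        return Exit.AWAITING_APPROVAL
    if not approved:
        return Exit.REJECTED
    ledger.upsert(f"credit:{incident_id}", amount)  # idempotency key
    return Exit.DONE


ledger = Ledger()
pending = apply_credit(ledger, "inc-1", 50, approved=None)
first = apply_credit(ledger, "inc-1", 50, approved=True)
replayed = apply_credit(ledger, "inc-1", 50, approved=True)  # safe to repeat
```

Replaying the approved step leaves one ledger row and one effective write, which is the property that makes a promoted path safe to retry.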

LangGraph keeps the seam visible

So where does the seam go between exploration and execution?

LangGraph is especially useful in this situation because it does not force a team to decide between using a free-form agent loop, and building a full-fledged workflow engine. Instead, it lets a team use existing Python control flow and wrap API calls and other execution logic with persistence primitives, such as memory and checkpointing. Two primitives in the Functional API, @entrypoint and @task, are plain, and they end up powerful. An entrypoint defines a workflow, while a task defines work within a given workflow.

This plainness matters for two reasons. First, repeated execution of a particular path through the agent is just repeated execution of a function, so that path can be written as Python code, complete with a model call in the middle if that still makes sense for the application. Second, the task API lets a team write one task for database records and another task for approval. The write can carry an idempotency key, and the approval can carry its own. This way the workflow keeps its approval steps and still gets the benefit of persisted runtime state, without packing all of it into a single, tricky prompt.

LangGraph's replay model also has consequences for changing a workflow that already has runs in flight. A resumed workflow starts again from the beginning of the entrypoint: the @entrypoint body re-runs, and results from @task calls completed before the last checkpoint are loaded from saved state instead of being recomputed. The backward compatibility docs make the implication blunt: adding, removing, or reordering task calls before the last resume point can replay incorrect cached values. Nondeterministic code outside a task can also change behavior during replay.
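The reordering hazard is easier to see in a toy model. The sketch below does not use LangGraph and does not reproduce its checkpoint format: it assumes, purely for illustration, that saved task results are replayed by call position. `Replay`, the workflow functions, and the fetch helpers are hypothetical.

```python
class Replay:
    """Toy replay cache: saved task results are matched by call position."""

    def __init__(self, saved=None):
        self.saved = list(saved or [])
        self.recorded = []

    def task(self, fn):
        position = len(self.recorded)
        if position < len(self.saved):
            result = self.saved[position]  # replayed; fn is not called
        else:
            result = fn()                  # first execution
        self.recorded.append(result)
        return result


def fetch_ticket():
    return "ticket-7"


def fetch_customer():
    return "customer-3"


def workflow_v1(r):
    ticket = r.task(fetch_ticket)
    customer = r.task(fetch_customer)
    return {"ticket": ticket, "customer": customer}


def workflow_v2(r):
    # Same tasks, reordered before the resume point.
    customer = r.task(fetch_customer)
    ticket = r.task(fetch_ticket)
    return {"ticket": ticket, "customer": customer}


original = Replay()
workflow_v1(original)

resumed_unchanged = workflow_v1(Replay(original.recorded))
resumed_reordered = workflow_v2(Replay(original.recorded))
```

The unchanged workflow resumes correctly. The reordered one silently swaps the two values, with no exception raised, which is why a promoted workflow needs versioning rather than in-place edits.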

Just like compiled code, a crystallized workflow is something that can be run again and again. It can have a version number. It can have a window of time in which to complete (a "drain window"). It can have a set of rules for migrating to a newer version in the middle of a run (a "migration plan"). And a path that has been promoted out of an agent loop is now something that behaves like API calls against a persisted record of the execution of that path. So it can be edited as a whole as a single "implementation", as a single program, rather than as a series of individual edits to individual prompts.

The pattern for live agentic systems, as outlined in the LangChain documentation on agent frameworks, is to build the system from a combination of workflows and agents, and to let workflows be the simplest, cheapest, fastest, and most reliable way to complete a task that does not require agency. The mix of workflows and agents in a system will therefore change over time.

Promotion is an evidence gate

The fastest way to ruin this workflow is to promote it based on vibes.

A model claiming that a path is repeatable is largely meaningless. A corpus of traces showing the same cases going through the same path is significantly more valuable. A proposed workflow that passes tests derived from the candidate specification is more valuable still. A promoted path with no human overrides, bounded tool permissions, idempotent side effects, and clean rollback semantics is starting to look like durable software.

Approval queues belong in the runtime for agentic AI workflows. Once the path is known, agent orchestration belongs in code. Evals should sit close to where changes happen, and traces should sit close to the decision. The promotion gate, meaning the gate to live use of a particular path for a given case, is then just another release gate for a service.

The checklist that determines whether a node can be safely replayed (persistent state, re-executable work, idempotent writes) is boring, but it has to be there before repeated agent paths are promoted to durable function calls. Until then, they remain fragile in-the-loop workarounds with no persistent contract, and hence no way to track, test, or reproduce them for regression testing.
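An evidence gate of this kind can be expressed as a function that returns the reasons a path is not ready. The sketch below is hypothetical throughout: the field names, the `min_traces` threshold of 10, and the wording of the blockers are illustrative choices, not values from any cited source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromotionEvidence:
    matching_traces: int           # traces of the same case taking the same path
    acceptance_tests_passed: bool  # candidate specification passed its tests
    human_overrides: int           # times a human had to correct the path
    side_effects_idempotent: bool
    rollback_defined: bool


def promotion_blockers(e: PromotionEvidence, min_traces: int = 10) -> List[str]:
    """Return every reason the path should stay in the agent loop."""
    blockers = []
    if e.matching_traces < min_traces:
        blockers.append("not enough matching traces")
    if not e.acceptance_tests_passed:
        blockers.append("acceptance tests have not passed")
    if e.human_overrides > 0:
        blockers.append("humans are still overriding this path")
    if not e.side_effects_idempotent:
        blockers.append("side effects are not idempotent")
    if not e.rollback_defined:
        blockers.append("no rollback semantics")
    return blockers


vibes = PromotionEvidence(2, False, 1, False, False)
ready = PromotionEvidence(40, True, 0, True, True)
```

An empty list means the path can go through the normal release process. Anything else is a concrete work item, which is the difference between a gate and a vibe.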

Observability decides demotion

Promotion gets the money conversation. Demotion keeps the system honest.

A deterministic workflow can decay. The API that a task uses can change. Firmware can be updated to add a field. A support policy can change. Workflows can continue to pass syntactic validation while slowly drifting away from the real work being done. The promoted system needs a way to notice this kind of decay without incurring the cost of the agent on every run. Traces become the control plane.

OpenTelemetry's GenAI observability post describes a single trace for an end-to-end workflow of invoking an agent, having a chat with a model, and executing a tool. Honeycomb's fast AI feedback loop writeup shows how similar telemetry can be used to drive SLOs, alerts, and dashboards, and can even be queried with Honeycomb's native query language.

This is the same reason Agent Traces Rewrite the Harness matters. Traces of agentic workflows running under real traffic become eval cases, harness patches, and release gates, and they provide solid proof that fixes to solved problems actually stay fixed.

Observability turns agent traces into promotion evidence and drift into a demotion trigger.
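A demotion trigger can be as small as a rolling failure rate over recent runs of a promoted workflow. The sketch below is one assumed design: the window size, the threshold, and the definition of a failed run (for example a semantic check failing or a human override) are hypothetical parameters a team would tune from its own telemetry.

```python
from collections import deque


class DriftMonitor:
    """Rolling failure rate for one promoted workflow (illustrative only)."""

    def __init__(self, window: int = 20, max_failure_rate: float = 0.2) -> None:
        self.outcomes = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate

    def record(self, ok: bool) -> bool:
        """Record one run; return True when the workflow should be demoted."""
        self.outcomes.append(ok)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) > self.max_failure_rate


monitor = DriftMonitor(window=5, max_failure_rate=0.2)
signals = [monitor.record(ok) for ok in [True, True, True, True, True, False, False]]
```

With a window of five, one failure sits exactly at the threshold and does not trip the monitor, while a second failure does. The check costs no model tokens, so the promoted system can watch for decay on every run.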

The number of repeated incidents where the platform is still burning full agent loops, even after it has seen the pattern before, belongs in front of leadership as the waste signal.

Spend agency on novelty

The goal is not to remove agents from agentic workflows. That would be as counterproductive as keeping every solved path inside an agent loop forever.

Agents are valuable because there is always going to be uncertainty and changing reality: new requests, new incidents, misbehaving APIs, and policy exceptions between systems. The agent does all of its work at the boundary of the known process.

The operating mistake is letting that boundary stay frozen.

Workflows in live systems should decrease in agency over time. They spend fewer tokens to execute solved work. They save receipts for the execution of risky work. And they use the model for the part of the workflow where judgment still matters.

The Type 1:2:3 mix is the scoreboard. If every execution stayed Type 3 forever then the platform would be charging the business for amnesia.

Multi-Agent Systems Break at the Collaboration Plane | Focused Labs

Austin Vance — Sun, 12 Jul 2026 18:08:49 +0000

Multi-agent systems break where agents coordinate.

This investigation begins much as the last one did: with a run that looks workable and a seemingly adequate summary for the supervisor from three individual specialists. The trace looks busy but largely as intended, with a nice stack of model calls as specialists work individually to develop focused answers to well-specified questions from the supervisor. Then a human asks the question that changes everything: who knew what and when, what did each specialist find and when, and which finding along the way actually drove the investigation in a particular direction?

Silence.

That silence is the collaboration plane missing from the architecture.

A common architecture for multi-agent systems has a Supervisor issuing requests to Workers, each of which performs a focused task. The results are returned to the Supervisor, which synthesizes the output from all the workers into a final answer for the Human. LangChain describes this as a subagent architecture, where the main agent calls subagents as tools, keeps track of the conversation memory, and uses stateless Worker calls as a way to isolate context and run in parallel (LangChain multi-agent subagents).

Coordination is where things start producing strange results.

The collaboration plane is the operating surface of a system used for work or investigation. It stores the information a cooperative system needs to function: the claims made by participants, the current work of every agent, the findings so far, summaries created for other users, and event queues that coordinate a team of workers across turns of a conversation. Without preserved evidence of those human-visible parts of the collaboration, the system quickly devolves into vibes tracked with logs rather than cooperative problem solving.

The supervisor pattern tops out

AWS Prescriptive Guidance makes the distinction that matters. Workflow agents run through a centralized coordinator. Multi-agent collaboration uses decentralized or role-based peers that negotiate, share information, adapt, and communicate through shared memory, messaging queues, or prompt chains (AWS Prescriptive Guidance on multi-agent collaboration).

That collaboration contract looks different. Tracking a workflow means caring about the point where the coordinator delegated a task. Tracking a cooperative investigation means caring about the point where each participant made a locally reasonable move given the evidence available to that participant.

I have recently come across a clean architecture for collaboration over investigation in Honeycomb's Canvas. In their architecture each investigation is modeled as an AWS AgentCore Runtime session. Instead of each user on the investigation having an isolated chat session with a single model in that session, all the agents on the investigation can read and write to a collaboration plane, and each user has an LLM session inside that investigation session (Honeycomb's shared-canvas architecture).

The collaboration boundary belongs to the investigation, not to each user's chat.

This surface is straightforward to manage. Every person can follow a lead. Their agents can follow completely different leads and still work together, because each user has an LLM session and the agents for all users run within the collaboration plane for that investigation.

That is the boundary I care about.

The idea we have long pursued is making agents as manageable as apps, by considering them runtime products made of model calls, tooling, sandboxes, policy, traces, evals, credentials, and audit receipts (enterprise AI agents are runtime products). Multi-agent systems add another critical boundary inside that space: the shared state surface that exists across agents collaborating to complete work.

Shared memory is too blunt

A shared memory bucket hides the actual boundary.

Memory is for facts and information. Collaboration state is for work in progress. Facts can help an agent recover from confusion. Work in progress decides whether the rest of the team gets confused with it.

When an investigation is shared with other people, the collaboration plane for that investigation contains five buckets: who claimed what hypothesis, what the current actions of the agents and humans are, what has been found, a summary of what each peer found and told their human, and a set of private event queues to coordinate the next moves of the agents and humans (Honeycomb's collaboration-plane details).

The shared state is small, boring, and decisive.

Claims prevent others from duplicating work that has already begun. Current activity shows which agents are active within the investigation and provides situational awareness, so no agent has to re-read the entire conversation history for a topic. Findings allow negative results to travel across an investigation. A hypothesis that has been ruled out by one agent should shrink the search space for other agents investigating similar issues. Peer summaries allow humans to coordinate without forcing each human to read the entire conversation that another human had with an agent. An event queue coordinates turns within a collaboration and prevents work from becoming polling.
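The five buckets fit in one small data structure. The sketch below is an illustration of the idea, not Honeycomb's schema: the class names, field names, the rule that a recorded finding releases its claim, and the event strings are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    hypothesis: str
    ruled_out: bool
    evidence: str
    author: str


@dataclass
class CollaborationPlane:
    claims: Dict[str, str] = field(default_factory=dict)              # hypothesis -> owner
    activity: Dict[str, str] = field(default_factory=dict)            # participant -> current action
    findings: List[Finding] = field(default_factory=list)
    peer_summaries: Dict[str, str] = field(default_factory=dict)      # participant -> summary
    event_queues: Dict[str, List[str]] = field(default_factory=dict)  # participant -> pending events

    def join(self, participant: str) -> None:
        self.event_queues.setdefault(participant, [])

    def claim(self, hypothesis: str, owner: str) -> bool:
        """Claim a hypothesis unless it is already owned or ruled out."""
        ruled_out = any(
            f.hypothesis == hypothesis and f.ruled_out for f in self.findings
        )
        if ruled_out or hypothesis in self.claims:
            return False
        self.claims[hypothesis] = owner
        self.activity[owner] = f"investigating: {hypothesis}"
        return True

    def record_finding(self, finding: Finding) -> None:
        self.findings.append(finding)
        self.claims.pop(finding.hypothesis, None)  # assumed: a finding releases the claim
        # Push the result to everyone else instead of making them poll.
        for participant, queue in self.event_queues.items():
            if participant != finding.author:
                queue.append(f"finding: {finding.hypothesis}")


plane = CollaborationPlane()
plane.join("agent-a")
plane.join("agent-b")
first_claim = plane.claim("bad deploy", "agent-a")
duplicate_claim = plane.claim("bad deploy", "agent-b")
plane.record_finding(Finding("bad deploy", True, "errors predate the deploy", "agent-a"))
reclaim_after_ruled_out = plane.claim("bad deploy", "agent-b")
```

The second agent is turned away twice: once because the work is already claimed, and once because the negative result has shrunk the search space.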

Tiny data model. Large operational consequence.

The shape is also familiar. Incident teams create a shared record out of the conversations that occur during an incident. Each person claims a thread of work, pins evidence, rules things in or out, and posts a note when context changes. I wrote about the same pattern in AI incident management breaks without a shared record, because the work is shared evidence under time pressure.

Agents become manageable only when their shared working space becomes the critical boundary that gets managed.

The runtime session is an architecture choice

AWS AgentCore Runtime makes the boundary concrete. The Bedrock AgentCore Runtime docs describe a session as a dedicated microVM with isolated CPU, memory, and filesystem for that session. Runtime sessions of different users are independent. The same runtime supports MCP and A2A communication, long running workloads up to eight hours, filesystem state across stop and restart, and traces for reasoning steps, tool invocations, and model interactions (Amazon Bedrock AgentCore Runtime).

Honeycomb took that approach for Canvas and mapped investigations to runtime sessions, because investigations are the durable units of analysis. User chats are transient. Agents come and go, stall and try again, then hand back context. The runtime reflects the nature of the work.

Generic multi-agent orchestration platforms are easy to demo but difficult to deploy for real work. Counting agents is easy. The state contract is harder: what agents assert to exist, what agents can modify, what gets recorded in trace attributes, what evals check, and what humans have to follow in order to retrace the steps of the agents to a particular conclusion.

No team_of_agents.final_answer field will save that design.

Collaboration bugs do not throw exceptions

MLflow's post about multi-agent observability names three failures that become apparent when multiple agents interact in a shared environment: cascading errors, shared memory pollution, and waiting loops (MLflow on multi-agent observability).

A stale finding gets treated as a fact. A specialist can time out while the parent agent waits. A negative result may never reach another agent, so the system spends time and money re-running the same series of tool calls that the first agent already ran. A peer summary can say something misleading, like "payment errors started after the deploy," based on a finding that actually ruled out the deploy in the first place.

The trace has to expose the collaboration plane, not only the LLM calls around it.

This means that instead of simply including the text of a message in a span as the event's description, the trace needs span attributes that track coordination objects: claim.created, claim.released, finding.superseded, peer_summary.read, event_queue.delivered, hypothesis.invalidated. The attribute names are less important than the discipline. The trace can now show which agent changed the shared state of the investigation, what evidence an agent attached to a finding, who read a peer summary, what events were delivered to an event queue, and what hypothesis an agent invalidated.
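The discipline can be sketched without any tracing SDK. In the example below the event names are the ones listed above, while the `CoordinationTrace` class, the attribute keys, and the sample IDs are hypothetical. In a real deployment these would be attributes or events on spans in whatever tracing system the team already uses.

```python
from typing import Any, Dict, List

COORDINATION_EVENTS = {
    "claim.created",
    "claim.released",
    "finding.superseded",
    "peer_summary.read",
    "event_queue.delivered",
    "hypothesis.invalidated",
}


class CoordinationTrace:
    """Plain-dict stand-in for span events on a collaboration plane."""

    def __init__(self, investigation_id: str) -> None:
        self.investigation_id = investigation_id
        self.events: List[Dict[str, Any]] = []

    def emit(self, name: str, agent: str, **attributes: Any) -> Dict[str, Any]:
        if name not in COORDINATION_EVENTS:
            raise ValueError(f"unknown coordination event: {name}")
        event = {
            "name": name,
            "investigation_id": self.investigation_id,
            "agent": agent,
            "seq": len(self.events),  # ordering within the investigation
            **attributes,
        }
        self.events.append(event)
        return event

    def by_agent(self, agent: str) -> List[str]:
        return [e["name"] for e in self.events if e["agent"] == agent]


trace = CoordinationTrace("inv-1")
trace.emit("claim.created", "agent-a", claim_id="c1", hypothesis="bad deploy")
trace.emit("hypothesis.invalidated", "agent-a", claim_id="c1",
           evidence="errors predate the deploy")
trace.emit("peer_summary.read", "agent-b", summary_author="agent-a")
```

Rejecting unknown event names is a deliberate choice in this sketch: a closed vocabulary keeps the trace queryable instead of letting it drift back into free text.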

A while back, Focused made the same point about observability for agentic systems that I am making today about trace semantics: agentic systems need an observable collaboration surface, not a messy tangle of disconnected logs (observability and agentic systems).

The useful version joins them.

The eval should grade the handoff

Final-answer evals are too narrow for cooperative systems. They can report that a correct answer was produced by a multi-agent system, but they will not report the bad coordination path that produced it. The same bad path will, in the end, produce a wrong answer that no one will be able to explain.

I investigated how to measure behavioral structure for cooperative work. The arXiv paper on Entropy-Based Observability for AI Agent Behavior develops trace-derived signals that give insight into agent behavior: action entropy, trajectory entropy, tool entropy, information gain, and outcome entropy. Success, reward, latency, and cost fail to capture behavior that matters. I think these metrics would be useful for getting a handle on the collaboration plane, where such behavioral structure already exists and is now observable (Entropy-Based Observability for AI Agent Behavior).

A collaboration eval should ask boring questions.

Did the agent check active claims for other agents before starting to work on a finding? Was there enough evidence added to a finding to aid agents who may later refer to that finding? Were agents treating summaries from other agents as hints or as facts that led them down a false path? Were stale findings properly superseded by new findings? Did the trace show the key event that caused the second agent to change its work plan?
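Some of those questions can be graded mechanically from coordination events. The sketch below checks three of them under assumptions of my own: events are plain dicts, and the event names `claims.checked`, `finding.recorded`, and `finding.stale` are hypothetical additions alongside `claim.created` and `finding.superseded`.

```python
from typing import Dict, List


def grade_handoff(events: List[dict]) -> Dict[str, bool]:
    """Grade coordination behavior from an ordered list of events."""
    agents_that_checked = set()
    claims_checked_first = True
    findings_have_evidence = True
    stale, superseded = set(), set()

    for event in events:
        name = event["name"]
        if name == "claims.checked":
            agents_that_checked.add(event["agent"])
        elif name == "claim.created":
            # An agent should look at active claims before starting work.
            if event["agent"] not in agents_that_checked:
                claims_checked_first = False
        elif name == "finding.recorded":
            if not event.get("evidence"):
                findings_have_evidence = False
        elif name == "finding.stale":
            stale.add(event["finding_id"])
        elif name == "finding.superseded":
            superseded.add(event["finding_id"])

    return {
        "claims_checked_first": claims_checked_first,
        "findings_have_evidence": findings_have_evidence,
        "stale_findings_superseded": stale <= superseded,
    }


good_run = [
    {"name": "claims.checked", "agent": "agent-a"},
    {"name": "claim.created", "agent": "agent-a"},
    {"name": "finding.recorded", "agent": "agent-a", "finding_id": "f1",
     "evidence": "errors predate the deploy"},
    {"name": "finding.stale", "agent": "agent-b", "finding_id": "f1"},
    {"name": "finding.superseded", "agent": "agent-b", "finding_id": "f1"},
]
bad_run = [
    {"name": "claim.created", "agent": "agent-b"},
    {"name": "finding.recorded", "agent": "agent-b", "finding_id": "f2", "evidence": ""},
    {"name": "finding.stale", "agent": "agent-b", "finding_id": "f2"},
]
good_grade = grade_handoff(good_run)
bad_grade = grade_handoff(bad_run)
```

Whether agents treated peer summaries as hints or as facts is harder to grade this way and probably needs a judged eval rather than a rule.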

Those checks belong next to the regular success metrics. They are the difference between "the system answered correctly" and "the system coordinated in a way I would trust again."
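One of those boring questions can be sketched as a check over trace events. The event shape and the `claims.checked` and `work.started` type names are hypothetical; a real trace would need its own mapping.

```python
from typing import TypedDict

class Event(TypedDict):
    agent: str
    type: str
    finding: str

def checked_claims_before_work(events: list[Event], agent: str, finding: str) -> bool:
    """True if the agent read active claims on the finding before starting work on it."""
    saw_check = False
    for e in events:
        if e["agent"] != agent or e["finding"] != finding:
            continue
        if e["type"] == "claims.checked":
            saw_check = True
        elif e["type"] == "work.started":
            return saw_check
    return False  # work never started, so there is nothing to credit

trace: list[Event] = [
    {"agent": "a", "type": "claims.checked", "finding": "f1"},
    {"agent": "a", "type": "work.started", "finding": "f1"},
    {"agent": "b", "type": "work.started", "finding": "f1"},
]

report = {
    agent: checked_claims_before_work(trace, agent, "f1") for agent in ("a", "b")
}
```

Here agent `a` passes and agent `b` fails, even if both eventually produce a correct answer.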

This is where conversation IDs earn their keep. A single conversation ID across agents and tools lets the trace preserve the user-facing path. Cooperative investigations also require investigation IDs, claim IDs, finding IDs, and queue-event IDs for the trace to be of any use for the work being done (AI agent observability runs on conversation IDs). The ID model must mirror the collaboration model. Correlating calls is not enough to explain the work done by the system.

Own the collaboration plane

Multi-agent architecture should start with the shared record.

First, determine the durable unit: a case, incident, investigation, branch, or task. Then decide which collaboration objects reside within that record: claims, findings, active work, peer summaries, evidence, event queues, approvals, rejected hypotheses, and rollback notes. Agent handoffs, for example, can be implemented as routing, but that would miss the point of runtime state. Agent handoffs turn routing into runtime state, and LangChain handoffs persist state across turns by updating a state variable and changing behavior based on that state (LangChain multi-agent handoffs).
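To make the difference between routing and runtime state concrete, here is a small sketch in plain Python. It is not LangChain's API; `InvestigationState`, `hand_off`, and `next_agent` are hypothetical names used only to show a handoff that persists as state.

```python
from dataclasses import dataclass, field

@dataclass
class InvestigationState:
    active_agent: str = "triage"
    # Each entry is (from_agent, to_agent, reason).
    history: list[tuple[str, str, str]] = field(default_factory=list)

def hand_off(state: InvestigationState, to_agent: str, reason: str) -> None:
    """Record the handoff in state so it survives across turns."""
    state.history.append((state.active_agent, to_agent, reason))
    state.active_agent = to_agent

def next_agent(state: InvestigationState) -> str:
    """Routing reads the state instead of deciding from scratch on each turn."""
    return state.active_agent

state = InvestigationState()
hand_off(state, "billing", "customer asked about an invoice")
```

Because the handoff is stored, the next turn goes to `billing` without re-deciding, and the history explains why.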

What is the shared surface of truth while agents work together?

If the durable unit of work is the transcript, then the system is simply a chat app with extra workers. If the durable unit of work is a vector store, then the system can store facts about work, but it cannot do the work. If the durable unit of work is a traceable collaboration plane with explicit objects, owners, mutations, and eval checks, then the system has a shot at real work.

Buyers will specify multi agent AI, multi agent architecture, and multi-agent collaboration because those phrases sound like capability requirements. I would interpret the ask as a collaboration-plane contract which the AI agents execute.

Show the objects. Show the trace. Show the eval. Show how two agents chase a single hypothesis for a while. Show how one agent can go down a wrong path and write a finding that is bad for subsequent work. Show how a human can find that finding and correct it. Show how another agent can then change direction as a result of the new finding.

That is the system.

The agent count is trivia.

Enterprise AI Agents Are Runtime Products | Focused Labs

Austin Vance — Fri, 10 Jul 2026 17:53:03 +0000

Enterprise AI agents are runtime products. Teams get tricked into thinking they bought a ‘clever model wrapper’ (a Deep Agents agent), but the real work goes on in code, network egress, credentials, traces, evals, deployment revisions, and audit receipts.

First, as we’ve already discussed, enterprise AI agents have a control plane now. The next step is to treat the agent as a product with a runtime boundary, with that control plane running inside it. It has an API. It has a release path. It has a rollback story. It has an owner who can explain what happened after a regulated workflow goes sideways.

LangChain and NVIDIA’s NemoClaw announcement for the Deep Agents Blueprint actually names the surface area, where Deep Agents Code, Nemotron 3 Ultra, and OpenShell combine into an agent system of models, harness, evals, and runtime work (LangChain and NVIDIA launch the NemoClaw Deep Agents Blueprint).

An agent becomes a runtime product when the loop, policy, credentials, and receipts sit inside one owned boundary.

Sensitive code makes the boundary obvious

Coding agents are useful for testing the permission boundaries of an agent. A coding agent reads files and writes files, runs shell commands, installs packages, runs tests, etc. The repo contains secrets-adjacent assumptions even when no secret string is actually stored there. As such, a coding agent can rapidly grant itself permission to do all sorts of damage.

This is why we found the NemoClaw Deep Agents Code post much more interesting than the open models announcement. Its permission set is exposed. We wrote yesterday that AI agent security happens at the tool call, and the post spends real detail on the runtime that enforces it. Coding agents run dcode inside an OpenShell sandbox by default. Network egress is denied by default and can be approved per request. Credentials never enter the sandbox. Each run of a coding agent can be snapshotted into a per-session audit log.

This gets at the operating question of the runtime-product frame: who owns the product that enforces the security boundary. When a network request is denied, the denial needs a home. When a package install is approved, the approval needs a release record. When a runtime product edits a migration script, the trace, eval, and diff need to prove that the resulting script is safe enough to merge.
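A toy version of "the denial needs a home" might look like the following. This is not OpenShell's interface; `EgressPolicy` and its methods are hypothetical, and the sketch only shows default-deny egress where approvals and denials land in the same log.

```python
from dataclasses import dataclass, field

@dataclass
class EgressPolicy:
    approved_hosts: set[str] = field(default_factory=set)
    audit_log: list[dict[str, str]] = field(default_factory=list)

    def approve(self, host: str, approver: str) -> None:
        """Record who approved the host, then allow it."""
        self.approved_hosts.add(host)
        self.audit_log.append({"event": "approved", "host": host, "by": approver})

    def check(self, host: str, session_id: str) -> bool:
        """Deny unless the host was approved; log the outcome either way."""
        allowed = host in self.approved_hosts
        self.audit_log.append({
            "event": "allowed" if allowed else "denied",
            "host": host,
            "session": session_id,
        })
        return allowed

policy = EgressPolicy()
first = policy.check("pypi.org", session_id="s-1")   # denied by default
policy.approve("pypi.org", approver="platform-oncall")
second = policy.check("pypi.org", session_id="s-1")  # allowed after approval
```

The log ends up with three entries in order: the denial, the approval, and the allowed request.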

A regulated team does not buy an ‘agent’. They buy a product surface on which an ‘agent’ can operate without breaching policy. Boring (which is the point).

The model wrapper is the small part

The enterprise stack is moving toward productized runtime agents because work around the model has become more important than the model itself. LangChain’s Enterprise Agentic AI Platform Built with NVIDIA combines LangGraph, Deep Agents, NVIDIA NIM, NeMo Agent Toolkit, OpenShell, LangSmith observability and NeMo Guardrails for live systems.

Here the crucial point is that an enterprise has a strong incentive to lock in the harness layer because that is where all the dependencies of the model are. Indeed, an enterprise can change the model route (i.e. switch from one model to another) as cost, latency, data boundaries, and so on change. But it cannot easily change the operating memory of an agent that is running live.

That is the same argument behind the agent harness becoming the lock-in layer. NemoClaw Deep Agents Code, announced by NVIDIA in partnership with LangChain, is benchmarked against the LangChain eval suite with Nemotron 3 Ultra achieving an aggregate score of 0.86 at a cost of $4.48 per 100,000 tokens to run compared to $43.48 for the next closest model in the suite of evaluations (LangChain and NVIDIA launch the NemoClaw Deep Agents Blueprint). The useful work of the agent is above the model, and the product owner can route coding, retrieval, review, and incident work through model routes based on quality and cost recorded by the harness.

Similarly, once the model determines the best course of action, the runtime determines what moves are possible (i.e., which tools to call and with what permission), what pauses are warranted, what narrower credentials are required to perform a move, and what audit receipts are left behind.

Schneider shows the operating model

Schneider Electric’s LLMOps with LangSmith case study is one of the cleaner enterprise proofs of how to treat LLMOps as more than a dashboard project. In the case study, Schneider details how it uses LangSmith to support 160,000 employees across 107 countries at a company with 40 billion euros in annual revenue. Its AI Hub of 350 experts has developed 60-plus agents that cover energy, assets, and developer productivity.

The case study is interesting because observability is treated as product infrastructure. Schneider Electric uses one LangSmith workspace per AI product across development, QA, pre-live, and live environments (Schneider Electric LLMOps with LangSmith). In the live environment, all traces are stored in a workspace that developers can revisit for offline evaluation in their development datasets. This is what I call a correct approach to observability. A workspace per environment is what the org chart would want, but that would break the learning loop.

That same case study goes on to explain that “One Jo” supports 160,000 employees in 107 countries around the world (Schneider Electric LLMOps with LangSmith). Every conversation on that platform is traced. Live traces are reused to feed regression datasets that test new models and prompts for employee automation work.

The enterprise loop is boring on purpose: trace, annotate, test, revise, deploy, repeat.

This brings us to deployment. The agents discussed above require streaming, long-term memory, human-in-the-loop, and background processing, so LangSmith Deployment was the natural choice (Schneider Electric LLMOps with LangSmith). Here too, the setup departs from the typical ‘one runtime per enterprise’ approach that AI tools follow. LangSmith Deployment allows each AI product to run on its own runtime stack, so the product owner can reason about data residency, latency, eval gates, and rollback inside one product boundary.

That is where agentic AI implementation turns into change control. Even a prompt update or a new model route can touch sandbox images, a credential broker, tool grants, evaluators, and the other tools used by the product. Each such update is a product change, and the release record has to list it with the same seriousness as any service touching customer data.

Observability is part of the product, not a sidecar

Honeycomb’s work around its hosted MCP server is relevant to the second half of the runtime. Honeycomb has made its MCP server generally available, with support for BubbleUp, heatmaps, and histograms (Honeycomb MCP GA). The team there found that CSV output, rather than JSON, saved something like 40% of tokens on tabular tool output. It is a tiny but relevant detail, because a runtime product has to make the cost of the next tool call easy to evaluate.

Honeycomb’s support workflow keeps humans in the loop as well. Canvas and Honeycomb MCP, running on top of Slack, Linear, code views, docs, and data context, typically let support complete an investigation and reach the correct root cause before handing off for escalation (How Support Uses Honeycomb to Debug Honeycomb). The root cause identified by support is not buried under misinformation, detours, and misdirection. Support remains in control throughout the process, with the chain of evidence preserved.

Stored traces pay off only if the runtime can turn them into harness changes: eval cases, regression datasets, SME annotations, and release evidence.
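The CSV-versus-JSON effect is easy to see on a toy table. The sketch below compares character counts, which are only a proxy for tokens; real savings depend on the tokenizer and the data, and the rows here are invented for illustration.

```python
import csv
import io
import json

# Hypothetical tabular tool output.
rows = [
    {"service": "checkout", "p95_ms": 412, "error_rate": 0.021},
    {"service": "search", "p95_ms": 187, "error_rate": 0.004},
    {"service": "billing", "p95_ms": 655, "error_rate": 0.013},
]

# JSON repeats every key on every row.
as_json = json.dumps(rows)

# CSV states the column names once, in the header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
as_csv = buffer.getvalue()

# Fraction of characters saved by CSV relative to JSON.
savings = 1 - len(as_csv) / len(as_json)
```

The gap widens as the row count grows, because the per-row key overhead in JSON is constant while the CSV header is paid once.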

Runtime governance is path governance

Governance is path-dependent. One read is allowed. One analysis step is allowed. One outbound message is allowed. Put those together with customer data and the runtime has a different problem to solve.

The paper Runtime Governance for AI Agents: Policies on Paths argues that the behavior of AI agents is non-deterministic and path-dependent, and it makes the execution path the central object of governance. It describes a policy function that takes the identity of the agent, the partial path the agent has already traversed, the proposed next action, and the organizational state, and returns the probability that the next action violates policy. Prompt rules shape what an agent tends to do. Static access control restricts the set of actions an agent can perform. Neither restricts the sequence of reads, analysis steps, and messages that precedes the next action.
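The shape of such a policy function can be sketched in a few lines. The signature mirrors the four inputs described above; the rule itself, the `Action` type, the `external_sharing_approved` flag, and the 0.9 score are all invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "read", "analyze", "send"
    target: str

def violation_probability(
    agent_id: str,
    path: list[Action],
    proposed: Action,
    org_state: dict[str, bool],
) -> float:
    """Toy path policy: each step is fine alone, but sending externally
    after reading customer records is scored as a likely violation.
    agent_id is accepted to mirror the described inputs; this rule ignores it."""
    read_customer_data = any(
        a.kind == "read" and a.target == "customer_records" for a in path
    )
    if proposed.kind == "send" and proposed.target == "external":
        approved = org_state.get("external_sharing_approved", False)
        if read_customer_data and not approved:
            return 0.9
    return 0.0

path = [Action("read", "customer_records"), Action("analyze", "customer_records")]
send = Action("send", "external")
risky = violation_probability("agent-1", path, send, org_state={})
harmless = violation_probability("agent-1", [], send, org_state={})
```

The same outbound message scores differently depending on what came before it, which is exactly what per-call access control cannot express.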

Runtime governance cares about the path, not the morality of each isolated tool call.

This is why approval queues as runtime state matter. There is no value in approvals being made after the fact and written down in Slack as theater. A durable interrupt in the runtime can record the path, the approval reason, the approver, the edited payload, the final decision, and the audit receipts along the way.

A July 5 Internet-Draft proposes a security evaluation benchmark for AI agents. The Draft is individual and has no formal IETF status. Still, agent security evaluation is becoming its own benchmark category.

Own the product boundary

The product owner of an AI agent for an enterprise is close to the API contract owner for that product. The owner might sit in platform or product. Fine. Writing the prompts is not enough to own the product.

This is why I like to speak of “runtime products”. There is something concrete to this term. It applies cleanly during architecture review and goes through the operating surfaces that already exist for normal software: ownership, release, data, policy, evidence and operations.

Using “agent” as a label for this work is too weak. Using “runtime product” is harder to duck and weave around. It frames the key architectural point: the operating surface is already familiar to the organization. Ownership, release, data, policy, evidence and operations.

This is the correct benchmark for evaluating an AI agent for use as an enterprise AI agent: Does it act?

Agent Traces Rewrite the Harness | Focused Labs

Austin Vance — Thu, 09 Jul 2026 17:48:47 +0000

The basic function of traces is to make the agent harness better. Until then, they are simply expensive evidence. Stacks of model calls and tool calls and retries, of rejected actions and user fixes, of the dollar amount spent in tokens. Someone has to look at them.

We have seen companies spend millions to store traces, call it “observability”, and then watch the same incident reappear with prettier pictures (e.g. a trace with annotations). The agent still loops, the underlying tool schema is still wrong, the prompt still forces a fixed-size answer that the model forgot three steps ago, and the evaluation still focuses on the final answer instead of the long, winding path that produced it.

The issue is the missing improvement loop.

LangChain put the current version of this plainly: traces are becoming the currency of long-horizon agent improvement. That should make every engineering leader a little uncomfortable. A live agent is already generating the dataset. The question is whether the organization has a way to turn that dataset into labels, eval cases, harness patches, release gates, and new deployments.

That loop is the new agent harness work.

A trace is a work order

Agent observability used to mean tracing and debugging model-based agents. That is table stakes for engineering and operations in general. Observability for agents is really about understanding how an agent works day to day, and how it arrives at a particular decision or answer.

LangSmith’s observability model splits the world into projects, traces, runs, and threads. A failed trace in a project is useful to the engineering team working on that project. To debug a long-horizon agent, however, that trace must open up to show run-level activity, i.e. the individual calls that together make up the trace. For multi-turn conversations, the thread that ties those traces together must be openable too.

A user correction is a label. A rejected tool action is a label. A repeated request is a label. A long reasoning path leading to a trivial step is a label. A tool call that succeeds but produces a vague answer is a label. Live traces are expensive evidence, and the traces that could teach a team how to improve a long-horizon agent are an expensive dataset. Cost is a signal too: a trace in which heavy spend arrived at a place with no improvement is worth studying. That study belongs in harness review rather than in a finance meeting, because coding-agent spend is already recorded in the trace. It can produce changes to tool descriptions, middleware, routing, evaluators, and so on.

The trace matters when it changes the next harness release.

LangChain is moving the loop into harness engineering

The latest innovations in the field focus on improving the model (because apparently that’s been neglected). In the Nemotron 3 Ultra playbook, LangChain kept the model fixed and changed only the harness: the system prompts, tool descriptions, and even the middleware around individual model calls and tool calls. Generation settings are left at vendor defaults. Yet even with a fixed model, bigger gains are to be had from changing the harness.

That is the part enterprises own.

One more important point. The model vendor can fine-tune the model to perform better on a task. And that is great for the vendor. But the live system has a local contract, with local tool semantics and local policy boundaries. The local approval points, local data access rules, and local latency and cost budgets all get encoded in the agent harness, and the harness is what live evidence tests.

The harness playbook starts with evals, the same material a team would use to judge a prompt edit or a fine-tune. Harness changes need a learning signal from real traffic too, and that signal comes from traces of agent behavior running through the agent’s live harness. Better Harness describes live traces as a high-throughput source of eval material. The additional work is to tag them, keep holdouts, put them through human review, and sample representative traces, as with any other source of eval material.

And this is a perfectly reasonable workflow:

go through traces of agent executions and cluster the failure cases together
select a representative subset of those traces to run through the harness with the updated logic
set holdout traces aside (if there are enough of them) to verify that nothing regressed
put all of them through human review to confirm the failure was interpreted correctly
tag the subset the team tests with, so it is clear which failure cases a given harness change resolved
watch the new traces the system produces after the improved harness has gone live
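The grouping and holdout steps can be sketched as follows. This assumes each trace already carries a failure label, which stands in for real clustering; the field names, the 25% holdout fraction, and the sample data are hypothetical.

```python
import random
from collections import defaultdict

Trace = dict[str, str]

def split_by_failure(
    traces: list[Trace], holdout_fraction: float = 0.25, seed: int = 0
) -> tuple[list[Trace], list[Trace]]:
    """Group traces by failure label, then split each group into a test set
    and a holdout set so every failure mode appears in both when possible."""
    rng = random.Random(seed)
    clusters: dict[str, list[Trace]] = defaultdict(list)
    for t in traces:
        clusters[t["failure"]].append(t)

    test_set: list[Trace] = []
    holdout: list[Trace] = []
    for group in clusters.values():
        rng.shuffle(group)
        n_holdout = int(len(group) * holdout_fraction)
        holdout.extend(group[:n_holdout])
        test_set.extend(group[n_holdout:])
    return test_set, holdout

traces = (
    [{"id": f"loop-{i}", "failure": "looping"} for i in range(8)]
    + [{"id": f"tool-{i}", "failure": "tool_misuse"} for i in range(4)]
)
test_set, holdout = split_by_failure(traces)
```

Splitting inside each group, rather than across the whole pool, keeps a rare failure mode from landing entirely on one side.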

Trace mining produces the missing evaluation material

Opening one trace is trivial. But to get value out of traces, a team needs to mine a large number of them.

Fireworks and LangChain did this kind of trace mining when they fine-tuned a Qwen-based judge as a perceived-error classifier, using chat-langchain and Fleet data. They report that the fine-tuned judge was 10 to 100 times cheaper at scale while matching or beating frontier models on the trace-judging task.

That work produces artifacts a reviewer can inspect:

labels that explain what went wrong
issue clusters that group repeated failure modes
representative examples for offline evals
holdout sets that protect generalization
monitors that reopen a problem when it returns to the foreground
evidence for deciding whether a failure opened by an agent belongs to the harness or to the model

The LangSmith Engine also maps traces to a structured development workflow through its product loop. The docs say it works from live traces to surface recurring issues, diagnose root cause, propose fixes, deploy evaluators, create dataset examples, and reopen issues if they resurface. This is analogous to agent failures opening tickets for a human reviewer to inspect and debug. Trace review done this way surfaces named work with owners and evidence, which can also be used to set up regression checks on the harness as well as the model.

Later that month, Focused argued that AI agent evaluation ends too early when it is restricted to pre-release experiments. It follows naturally that AI agent accuracy is an observability problem: evidence from live systems is required to prove that the system worked.

The agent harness changes after the trace

A trace review is “focused” when it ends with a name for the update surface that will fix things. A note that says “Model hallucinated” is essentially useless: it says “we gave up here” and does nothing to help improve things.

Tool misuse points to the tool contract: the documentation, the schema, or the return format of the tool is not correct. Look closely at the field that is wrongly included in the schema, and at why the blob returned by the tool cannot be used as is in the dashboard.

Looping points to middleware. A cap. A checklist. State-aware interventions after the second search for the same thing. Do not bury a request to stop looping in the seventh paragraph of the system prompt and call that governance.
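A cap of that kind is a few lines of middleware. The sketch below is a hypothetical wrapper, not any framework's API: it counts identical tool calls and blocks the call once the same arguments repeat past a limit.

```python
from collections import Counter
from typing import Callable

class RepeatCallGuard:
    """Blocks a tool call once the same (tool, argument) pair repeats too often."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self.seen: Counter[tuple[str, str]] = Counter()

    def call(self, tool: Callable[[str], str], name: str, arg: str) -> str:
        key = (name, arg)
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            # Return a message the model can read instead of running the tool again.
            return f"blocked: {name}({arg!r}) already ran {self.max_repeats} times"
        return tool(arg)

def search(query: str) -> str:
    # Stand-in for a real search tool.
    return f"results for {query}"

guard = RepeatCallGuard(max_repeats=2)
outputs = [guard.call(search, "search", "refund policy") for _ in range(3)]
```

The first two calls go through and the third is blocked. The intervention lives in code that the model cannot forget.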

A bad reasoning budget points to how teams route work to models. An inexpensive model can do classifier work. A more expensive model can do multi-step repair work. A model that is reading the same file over and over again deserves a harness intervention before the high bill becomes a personality trait of the model.

Perceived error is a human judgment that something went wrong: users correct the AI agent, reject its output, or rephrase the same request over and over until they get the desired response. Those traces should feed the next round of evals for the agent.

Domain gaps and issues that cannot be reproduced may call for fine-tuning the model, but only after the problem has been understood from the traces. Fine-tuning before resolving the root cause of a failure in a live system creates fossilized mistakes, about as pleasant to trip over as a well-placed concrete boot.

The failure type decides which part of the agent system changes.
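That mapping is small enough to write down. The sketch below encodes the five failure types discussed above as a lookup table; the enum names and the wording of each surface are my own shorthand, not a standard taxonomy.

```python
from enum import Enum

class Failure(Enum):
    TOOL_MISUSE = "tool_misuse"
    LOOPING = "looping"
    REASONING_BUDGET = "reasoning_budget"
    PERCEIVED_ERROR = "perceived_error"
    DOMAIN_GAP = "domain_gap"

# Which part of the agent system a review should point at for each failure type.
UPDATE_SURFACE = {
    Failure.TOOL_MISUSE: "tool contract: description, schema, return format",
    Failure.LOOPING: "middleware: caps, checklists, state-aware interventions",
    Failure.REASONING_BUDGET: "model routing",
    Failure.PERCEIVED_ERROR: "eval dataset",
    Failure.DOMAIN_GAP: "fine-tuning, after the root cause is understood",
}

def triage(label: str) -> str:
    """Return the update surface for a review label, or flag it for a human."""
    try:
        return UPDATE_SURFACE[Failure(label)]
    except ValueError:
        return "unclassified: needs human review"

looping_surface = triage("looping")
unknown_surface = triage("model hallucinated")
```

A label like "model hallucinated" falls through to human review because it names no surface to change.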

Day one can be boring. The trace can create a record that an engineer can trust. Which traces contributed? Which label fired? Which eval case was added? Which holdout protected the change? Which harness patch shipped? Which release gate passed? Which new traces prove the failure stayed fixed?

Those are receipts, not vibes.

AI agent lifecycle management starts here

Similar views on the role of live traces in AI agent lifecycle design come from Honeycomb in its write-up on AWS AgentCore. Canvas on AgentCore is a full-fledged application in which session, deployment version, runtime state, online and offline evaluation, tool usage, and so on are all tracked for agent-based applications. Honeycomb says to instrument the agent from day one.

But keeping those signals connected after release is the much harder problem, and that problem is AI agent lifecycle management.

The trace-to-harness loop gives that lifecycle a spine.

Unless an organization has a real AI agent lifecycle management practice in place, all the work can just become folklore: “I remember we had to fix this bad trace once...”, “Okay, I added this one docstring...”, “Oh yeah, we replaced that model for that edge case...”. A week or two later nobody is able to say whether any of that actually fixed the particular failure being investigated, moved its expression to a slightly different case, or just masked it with a newer and cleaner-sounding incorrect answer.

The improved system creates a useful organizational structure around the agent and around improvements to it, which is of greater value to the buyer than an agent that works in a workshop (as all agents do by now). The structure is a simple linear path: from trace evidence, through evals, to harness updates that pass release gates, until new traces prove whether the changes worked. Conversation IDs make the trace coherent across the path, but coherence is only the start. The more valuable structure is one that shortens that path.

The winning team will have the shortest path from trace evidence to a harness release that safely changes what the agent does next time.

AI Agent Security Happens at the Tool Call | Focused Labs

Austin Vance — Wed, 08 Jul 2026 23:45:08 +0000

Where does a team fail when it treats AI security as plumbing, the kind nobody wants to investigate, debug, or test because that would slow the work down? At the tool call.

The dangerous questions have to be answered at every tool call: who is acting, which resource is canonical, what grant applies, which capability is being invoked, where the output can flow next, and what receipt exists if the runtime says no.

Connectivity to such resources is becoming easier to set up. The recent MCP SDK betas for the 2026-07-28 spec release candidate make that point plainly. So security needs to move from the setup of connection to a resource to control of execution within that resource.

The HCP paper, From Tool Connection to Execution Control, defines eight runtime invariants that MCP-style systems need to satisfy: metadata non-authority, grant-backed approval, canonical resources, principal binding, scoped capability invocation, source-and-target data-flow authorization, deny-path audit, and explicit protocol state. At first glance these read like abstract requirements. In reality, each one is a minimum viable security invariant for an agent that is updating CRM records, creating a customer, creating a sales process, or reporting on data in a CRM. The runtime has to resolve the canonical resource, bind the principal to a grant, invoke only the scoped capability, authorize data flow, and make every denial auditable.

The approval prompt fires too early

A consent prompt tells you one thing: whether a human approved this action. But approving a proposed action, meaning the model asking to do something, and approving an action after resources, policy, and tool state have resolved are two different things.

In reality, tools are not isolated containers that an agent calls once. Agents discover tools, read the tools’ metadata, copy data from one tool’s output into another’s input, run through loops of previously learned actions, and re-run previously successful code with slightly modified parameters. Governance therefore has to sit in two places: before the tool call (we wrote about governance before the tool call) and after the call has been approved (MCP security starts after tool approval). The second is narrower than it sounds. Once a call has been approved, the runtime is responsible for governing its execution, which means binding the call to the particular principal, grant, resources, and capabilities that were approved in the dialog.

Now on to the HCP benchmark. In the paper’s controlled benchmark, the naive MCP-like baseline allows all ten modeled attacks, the practice-informed mitigation strategy allows six, and HCP blocks all ten while providing complete audit evidence. The paper itself is not something to run as a product tomorrow, but it gives a practical perspective on what prompts, metadata warnings, and approval dialogs are missing when the runtime does not own the execution objects for the calls that the agent makes.

The security decision belongs in the runtime path of the tool call.

Tool metadata cannot carry authority

Tool metadata is descriptive metadata. The model uses the tool’s metadata to choose an action. But the tool’s metadata is not authoritative for that action.

It is easy to think about tool metadata only as setup for the action the model will execute. But what if the tool’s metadata contained instructions for the model? What if a resource identifier shown to the user as a friendly name mapped to a different provider object in the runtime? The runtime has to ask the provider for the canonical resource behind that identifier, and then compare it against the grants that principal holds for the requested capability. That is what the runtime invariants in the HCP paper require of MCP-type systems.

This is where the MCP integration pattern turns into a security model. An updateRecord CRM tool is not just another tool, even if there is already a tool with that name in the MCP workspace. It is a write tool for customer data, sales processes, reporting, and permissions. Therefore, integration of such a tool into MCP infrastructure is runtime work, which is why CRM integration becomes runtime work. It is exactly the same work a user would do by pressing Save in a system of record. The only difference is that instead of a human, it is a model-guided loop, running through an integration layer.

So the simple way to look at this is that the runtime must reject the call if any of the following is true: the principal is missing, the grant is not found, the provider cannot canonicalize the resource, the capability is outside the scope of the grant, or the approval was for a different resolved object. Put this in a policy broker. And write down the reasons for denial.
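Those five rejection conditions translate almost directly into a broker function. This is a sketch under stated assumptions: the `Grant` and `Decision` types, the reason codes, and the resource id format are hypothetical, and the canonical resource is assumed to have been resolved by the provider before the call (`None` means it could not be).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Grant:
    principal_id: str
    resource_id: str              # canonical id, as resolved by the provider
    capabilities: frozenset[str]

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason_code: str

def decide(
    principal_id: str,
    grants: list[Grant],
    canonical_resource: Optional[str],
    capability: str,
    approved_resource: str,
) -> Decision:
    """Reject the call on the first failed condition and say why."""
    if not principal_id:
        return Decision(False, "principal_missing")
    if canonical_resource is None:
        return Decision(False, "resource_not_canonicalized")
    grant = next(
        (g for g in grants
         if g.principal_id == principal_id and g.resource_id == canonical_resource),
        None,
    )
    if grant is None:
        return Decision(False, "grant_not_found")
    if capability not in grant.capabilities:
        return Decision(False, "capability_out_of_scope")
    if approved_resource != canonical_resource:
        return Decision(False, "approval_resource_mismatch")
    return Decision(True, "ok")

grants = [Grant("user-1", "crm:account/42", frozenset({"record.read"}))]
read_ok = decide("user-1", grants, "crm:account/42", "record.read", "crm:account/42")
write_denied = decide("user-1", grants, "crm:account/42", "record.update", "crm:account/42")
```

Canonicalization is checked before the grant lookup because the lookup needs the canonical id. Every path returns a reason code, so the deny path is as auditable as the allow path.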

Output handles are part of the security model

The scariest tool call is rarely the first one. It is the second one, after the agent has learned something.

When tool output is used as input for other tools, the data can be treated as handles with owners, data classes, and checks on downstream pipes. The HCP paper covers the forms such data can take, including search results, document excerpts, transcript snippets, ticket bodies, and rows in databases. Kept inside the runtime, the data can stay scoped to particular actors and capabilities instead of becoming free-floating text that the agent can hand to other tools.

Take Anthropic’s example of code execution with MCP. In it, a workflow pulls a transcript from a Google Drive account and copies that transcript into a record in a user’s Salesforce account. Passing everything through the model would take about 150,000 tokens; by keeping the intermediate results in the execution environment, Anthropic reduced that to about 2,000 tokens. Great for them, and it highlights the added security obligations of code execution.

But, if we keep the transcript in the runtime, we can then check things like: can this principal read this? Can this capability write this to Salesforce? Can this output flow to email, Slack, or external webhooks? Should the next call in the workflow see the raw text, a filtered projection of the text, or nothing at all? These are the kinds of things that belong inside AI agent authorization.

This is also where audit stops being compliance theater. Audit records for denied pipe operations should include a receipt with source handle, target capability, principal, policy version, and reason code.
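A handle-based pipe check with a receipt could look like this. The receipt fields follow the list above; the `Handle` type, the `ALLOWED_FLOWS` table, the capability names, and the policy version string are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handle:
    handle_id: str
    owner: str
    data_class: str  # e.g. "customer_transcript"

# Hypothetical flow policy: which data classes may reach which capabilities.
ALLOWED_FLOWS = {
    ("customer_transcript", "salesforce.update_record"),
}

POLICY_VERSION = "example-v1"  # placeholder value

def authorize_pipe(handle: Handle, target_capability: str, principal: str) -> dict[str, str]:
    """Decide whether a handle may flow into a capability and return a receipt."""
    if principal != handle.owner:
        reason = "principal_not_owner"
    elif (handle.data_class, target_capability) not in ALLOWED_FLOWS:
        reason = "flow_not_allowed"
    else:
        reason = "ok"
    return {
        "source_handle": handle.handle_id,
        "target_capability": target_capability,
        "principal": principal,
        "policy_version": POLICY_VERSION,
        "decision": "allow" if reason == "ok" else "deny",
        "reason_code": reason,
    }

transcript = Handle("h-1", owner="user-1", data_class="customer_transcript")
allowed = authorize_pipe(transcript, "salesforce.update_record", "user-1")
denied = authorize_pipe(transcript, "email.send", "user-1")
```

The transcript never has to appear in the model context for this check to run; the decision is made on the handle's metadata.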

Code execution makes the runtime more important

Code execution is going to win because less context and fewer round trips are worth the workflow tradeoff. The CE-MCP paper, From Tool Orchestration to Code Execution, formalizes the jump from context-coupled execution in MCP to context-decoupled code execution in MCP. The paper also introduces sixteen different attack classes that can be launched through five different execution phases of code execution.

Similarly, LangChain defines the sandbox-as-tool pattern with the agent and its state outside of the sandbox, API keys outside of the sandbox, and the execution state of the sandbox separate from the agent state. This structure is nice as it provides a natural home for the platform to manage the credentials, the policy, the trace context, and other runtime state for the generated code, model, and harness, rather than having it all in a generated code file in a harness.

Code execution saves context, but it moves the trust boundary into the runtime.

Another AI tooling fallacy: code execution is simply another loop around the model, with potentially cheaper rounds of computation. This misses the mark fundamentally. Execution has become a privileged substrate within the AI-powered end-to-end workflow, and one that cannot be governed by simple prompt-level rules. It requires a runtime policy broker to adjudicate the required authorizations along the way.

Enterprise security teams will ask for receipts

Once agents are running across employee accounts inside a company, there is real security work in running an MCP runtime. A buyer has to confirm that the runtime has implemented the functions listed in Arcade’s guide to MCP gateways, runtimes, and registries: OAuth lifecycle management, credential storage in a vault, multi-user support with respective authentication and authorization, permission intersection, auditing, policy management, enforcement, and observability.

Zuplo’s 2026 protocol-stack post separates the MCP used for agent-to-tool communication from the A2A used for agent-to-agent communication. Such a split makes sense because while MCP allows an agent to communicate with tools, A2A enables one agent to directly communicate with another agent. Gateways for cross-agent communication will enable functions such as token binding, request validation, capability filtering, and full request logging.

For auditable decisions, we would write one detailed record per decision that includes principal_id, agent_id, grant_id, resource_id, capability_id, input_handle_ids, output_handle_id, policy_version, decision, reason_code, and trace_id. The record covers both the allow path and the deny path for an invocation, and it ties back to the conversation or workflow that triggered the invocation. We have made the same argument from the observability side: a conversation-level audit trail gives incident response a spine instead of a pile of disconnected spans.
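A minimal sketch of that record in TypeScript, using the field names listed above. The value types, the in-memory `receipts` array, and the `writeReceipt` and `explain` helpers are assumptions for illustration, not any vendor's schema or API.

```typescript
// Field names come from the list above; the value types are assumptions.
interface DecisionReceipt {
  principal_id: string;
  agent_id: string;
  grant_id: string | null; // null when no grant matched (a deny path)
  resource_id: string;
  capability_id: string;
  input_handle_ids: string[];
  output_handle_id: string | null; // null when the call was denied
  policy_version: string;
  decision: "allow" | "deny";
  reason_code: string;
  trace_id: string;
}

// Hypothetical in-memory sink; a real system would use an append-only store.
const receipts: DecisionReceipt[] = [];

function writeReceipt(receipt: DecisionReceipt): DecisionReceipt {
  // Denials travel the same path as approvals, but never carry an output.
  if (receipt.decision === "deny" && receipt.output_handle_id !== null) {
    throw new Error("a denied call must not produce an output handle");
  }
  receipts.push(receipt);
  return receipt;
}

// Explain a decision from the record alone, without reading the transcript.
function explain(traceId: string, capabilityId: string): string[] {
  return receipts
    .filter((r) => r.trace_id === traceId && r.capability_id === capabilityId)
    .map((r) => `${r.decision}: ${r.reason_code} (policy ${r.policy_version})`);
}

// Illustrative values only.
writeReceipt({
  principal_id: "user:alice",
  agent_id: "agent:support",
  grant_id: null,
  resource_id: "salesforce:opportunity/123",
  capability_id: "salesforce.write",
  input_handle_ids: ["handle:transcript-1"],
  output_handle_id: null,
  policy_version: "v7",
  decision: "deny",
  reason_code: "missing_grant",
  trace_id: "trace-1",
});

const explanation = explain("trace-1", "salesforce.write");
```

The point of the `explain` helper is that the answer comes from the receipt store, so the allow path and the deny path can be reviewed with the same query.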

This is where we start to put the AI agent security practices to work. If an agent is sending emails, the receipt should name the user principal, canonical mailbox, send grant, attachment handles, and policy version. If an agent created a Salesforce opportunity, then the receipt should list the same information against the opportunity as the canonical resource in that system. The same would hold for creating a new pull request.

No magic here. Just a dull and thankless job of documenting model-driven side effects.

Own the tool call

MCP turned tool access from something unusual into normal usage. Now every meaningful tool call has to be treated as an event in three classes: authorization (who is allowed to make this call), data flow (what goes into the tool and which handles identify its output), and audit (a record of every such call).

The first part of the test is not hard: different tool calls require different policy. A read-only weather API is different from a payment API. A chart generated by a model is different from an external email. The runtime should treat those calls differently, and so should the agent built on top of it.
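One way to picture per-class policy is a small lookup the runtime consults before any call goes out. This is a sketch under stated assumptions: the class names, requirement levels, and reason codes are hypothetical, not taken from MCP or any gateway product.

```typescript
// Hypothetical tool classes and requirement levels, for illustration only.
type ToolClass = "read_only" | "internal_write" | "external_send" | "payment";
type Requirement = "none" | "grant" | "grant_and_approval";

const policyByClass: Record<ToolClass, Requirement> = {
  read_only: "none", // e.g. a weather lookup
  internal_write: "grant", // e.g. a chart saved to the workspace
  external_send: "grant_and_approval", // e.g. an external email
  payment: "grant_and_approval",
};

interface ToolCall {
  tool: string;
  toolClass: ToolClass;
  hasGrant: boolean;
  approved: boolean;
}

interface Verdict {
  allowed: boolean;
  reason_code: string;
}

// The runtime, not the model, decides; every branch yields a reason code.
function authorize(call: ToolCall): Verdict {
  const requirement = policyByClass[call.toolClass];
  if (requirement === "none") {
    return { allowed: true, reason_code: "no_grant_required" };
  }
  if (!call.hasGrant) {
    return { allowed: false, reason_code: "missing_grant" };
  }
  if (requirement === "grant_and_approval" && !call.approved) {
    return { allowed: false, reason_code: "approval_required" };
  }
  return { allowed: true, reason_code: "grant_ok" };
}

const weather = authorize({ tool: "weather.get", toolClass: "read_only", hasGrant: false, approved: false });
const payment = authorize({ tool: "payments.charge", toolClass: "payment", hasGrant: true, approved: false });
```

Here the weather call passes with no grant, while the payment call is denied with `approval_required` even though a grant exists.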

The test I care about is this: can an operator explain why a particular tool call was permitted or denied by the AI agent or program without relying on the transcript of the conversation as the source of truth?

If the answer exists only in the model’s transcript, then the agent security program is based on manners, no matter whether the system uses MCP, OAuth, a nicely organized catalog of approved tools, or a tidy approval dialog. The real boundary is the tool call itself. Either the runtime owns that boundary, or it inherits every vagary of the model, the tool metadata, the sandbox in which the generated code runs, and the APIs that the runtime, the model, and the generated code call in turn.

Own the tool call. Write the receipt. Make denial a first-class path.

Coding Agent Spend Belongs in the Trace | Focused Labs

Austin Vance — Sun, 05 Jul 2026 23:26:05 +0000

Coding-agent spend gets weird the second it leaves one developer's laptop.

Take a single feature that passes through Claude Code, Codex, Cursor, Copilot, OpenCode, and a custom Deep Agents harness before it is finally merged in a pull request. Each tool can be perfectly reasonable to buy on its own, and each puts a local usage screen in front of the team. Then the invoice arrives, and the engineering lead has to answer a different question: what did the work actually cost?

That question cannot be answered from a vendor bill. It has to be answered from the run.

A July post on LangChain’s blog outlines the shape of the problem. A single feature can pass through several coding agents, such as Claude Code, Cursor, or Copilot, and each agent logs its activity in its own format. As a consequence, it is very difficult to tell what a feature cost across the workflow. LangChain says the first step is a normalized trace: root sessions, turns, tool calls, and the corresponding metadata, all in one shape. That trace can then be filtered by session_id, thread_id, model, provider, or the name of an individual coding tool. Generating the trace is the hardest part of the problem.

The unit is the trace.

The invoice arrives after the damage

Finance sees the invoice after the run is over. Engineering sees a loop that is still costing money.

The waste that coding agents produce is behavioral waste: the same patterns of suboptimal behavior that other agents produce. An agent keeps retrying the same failing test. It uses the most expensive model available to generate a lint fix. It includes a monolithic repository summary in every turn. It calls slow tools that return vague error messages, calls them again, and eventually another agent is brought in to debug the mess. The bill tells the finance department how much was spent. The trace reveals the precise sequence of tool calls that produced that amount.

Agent spend is a runtime signal. Coding agents make that signal less abstract. Spend gets tracked against a repository, branch, commit, pull request, developer, team, model, provider, tool, and session. That abstract finance problem scattered across product dashboards becomes an engineering problem once the fields coalesce into a single session trace.

Engineering problems can be fixed.

Many steps from the classic cloud cost playbook, as laid out by the FinOps Foundation, still apply: cost per token, volatile pricing for compute and memory (GPU instances in particular), quotas on consumption, tags that group costs by application or team, and real-time metrics that track to business outcomes. The difference is that coding-agent spend runs in loops and programs, not in spreadsheets. A quota only fires when spend hits a line. A trace shows what led to that line in the first place: bad retry logic, too much context, a missed cache hit, poor model routing.

Vendor dashboards answer local questions

Tool dashboards are useful until they become the only record.

Claude Code has local usage statistics at /usage, and Anthropic's cost guidance covers team spend limits, context compaction, model selection, MCP overhead reduction, hooks, skills, and subagent delegation. Fine operator surface for Claude Code. It still does not tell the team how much a PR cost after Claude Code, Codex, Copilot review context, and Cursor all touched it.

LangSmith's Codex tracing plugin logs agent turns, model metadata, token usage, tool calls, and subagent threads. OpenCode tracing logs session root runs, assistant turns, nested tool calls, tool errors, time, attachments, subagent activity, token usage, and thread or session ID metadata for root and child sessions. Copilot Chat can export OpenTelemetry spans: the invoke_agent, chat, and execute_tool spans carry token usage, model data, and subagent context that is propagated through tool execution.

More importantly, the larger ecosystem is starting to generate trace-like evidence, even though the output of individual tools still differs from one tool to the next. What a dashboard cannot provide is a normalized log of an entire coding session, covering every tool involved and every step in between.

The trace is the cost unit because it joins behavior, identity, and spend.

LangSmith’s coding-agent metadata contract lists the shared fields by name instead of relying on observability to discover them. In addition to global identity fields like ls_agent_kind, ls_integration, ls_agent_runtime, thread_id, and ls_trace_schema_version, the contract lists run types: root, llm, tool, subagent, and interrupted. It also lists repo fields the runtime may expose: repo, branch, commit, working_directory, provider, model, tool, and subagent.

That contract is boring in exactly the right way. Boring fields make cost queryable. Boring fields let an engineering manager ask which repo burned spend last night, which agent runtime did it, which model got selected, which tool failed, and which team owns the pattern.

The same thing is true for spend. Agent observability runs on a stable conversation or session ID because that ID lets traces, tools, queues, APIs, evals, and incidents line up. Cost should line up on that same spine, not show up as line items with no causality.

Cost without behavior is accounting

A cost report that cannot point at behavior is just accounting with better charts.

One report I would really like to see in detail is every way a coding-agent session can go over budget. It could be a simple list of causes, with a fix for each. The number attached to each cause is the smoke, and the session trace shows where the extra cost occurred.

LLM cost management is a behavior problem, not just a finance problem. Cut the irrelevant context. Route simple work to cheaper models. Cache what stays constant. Decide which tool fits a particular task. Fix the docs that cause an agent to fail. Trim the subagent path that reads and re-reads the same files. A finance system approves the budget. A trace shows how the approved budget was spent.

For clients of the agent platform, it matters to tie agent spend to engineering outcomes. A spend dashboard is only so useful if no one can connect the spend to outcomes: did the agent make the team faster, did it create useful PRs, did it take review work off the team’s plate, did incidents go up or down, and can we stop those runaway-until-breakfast sessions next time? Building a run record, not just a screenshot, is the only way to get there.

Gateway policy closes the loop

Visibility without a control boundary turns into monthly regret.

LangChain’s internal LLM Gateway rollout is an interesting example because the company tied spend to live control. Their writeup describes how one heavy coding-agent user could have spent thousands of dollars per week before anybody noticed. By putting coding-agent calls through Gateway, they set monthly, weekly, daily, and hourly budgets, then tied spend to traces, users, keys, agents, model calls, and failure modes. That is the correct sequence: see the run, fix the behavior, cap the boundary.

The Gateway docs describe the architecture: the LLM Gateway sits between agents or clients and model providers. It holds the secrets for the various providers, authenticates the caller with a LangSmith API key, evaluates spend and redaction policies for that caller, proxies the request to the upstream model provider, and traces the response back to LangSmith. Spend policies are scoped by organization, workspace, API key, or user, and can be limited to a monthly, weekly, daily, or hourly period. When a request is blocked because it would exceed budget, the Gateway returns a 402 and records the spend-policy violation as trace metadata for that request.

A critical product decision here was to make policy violations trace events rather than finance events that merely got blocked. The block is part of the same evidence stream as the prompt, the model call, the tool results, the repo metadata, and the agent runtime.

Cost control works when the trace feeds the policy boundary.

This is where spending authority moves into the runtime. A coding agent is spending shared budget on behalf of a person, team, repository, and task. Policy has to sit near that action: approvals, budget caps, provider credentials, redaction, trace metadata, and issue creation.

Setting a monthly limit is not enough. Hourly and daily limits catch a single command left running overnight. User and API-key limits catch misuse through integration paths that a team did not think to budget for. Workspace limits stop experiments in live environments from going out of control. Organization-level caps put a roof on all of it. Each of these limits describes a different way to define ownership.
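The layering can be sketched as a check that evaluates every limit matching the caller and blocks if any one would be exceeded. This is an illustration under assumptions, not the Gateway's implementation: the `SpendLimit` shape, the `checkBudget` function, and the `spentUsd` lookup are hypothetical. Only the scopes, the periods, and the 402 status come from the text above.

```typescript
type Scope = "organization" | "workspace" | "api_key" | "user";
type Period = "monthly" | "weekly" | "daily" | "hourly";

interface SpendLimit {
  scope: Scope;
  scopeId: string;
  period: Period;
  limitUsd: number;
}

// One identifier per scope, so a limit can be matched to the caller.
interface Caller {
  organization: string;
  workspace: string;
  api_key: string;
  user: string;
}

// spentUsd is a hypothetical lookup: spend so far in the limit's current window.
function checkBudget(
  caller: Caller,
  costUsd: number,
  limits: SpendLimit[],
  spentUsd: (limit: SpendLimit) => number,
): { status: 200 | 402; violated: SpendLimit[] } {
  const violated = limits.filter(
    (l) => caller[l.scope] === l.scopeId && spentUsd(l) + costUsd > l.limitUsd,
  );
  return { status: violated.length > 0 ? 402 : 200, violated };
}

// Illustrative values only.
const caller: Caller = { organization: "acme", workspace: "prod", api_key: "key-1", user: "alice" };
const limits: SpendLimit[] = [
  { scope: "user", scopeId: "alice", period: "hourly", limitUsd: 5 },
  { scope: "organization", scopeId: "acme", period: "monthly", limitUsd: 1000 },
];
const spentSoFar = new Map<string, number>([
  ["user:alice:hourly", 4.5],
  ["organization:acme:monthly", 200],
]);
const windowKey = (l: SpendLimit) => `${l.scope}:${l.scopeId}:${l.period}`;

const blocked = checkBudget(caller, 1.0, limits, (l) => spentSoFar.get(windowKey(l)) ?? 0);
```

In this example the organization is far below its monthly cap, but the request is still blocked because the user's hourly limit would be exceeded. The `violated` list is what would be attached to the trace as metadata.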

The first version can be simple

The first version does not require a grand platform migration.

The trace should contain enough information to correlate the money spent with the work the engineers did. The fields that join spend to work are session ID, user, team, repo, branch, commit, PR, agent runtime, integration, model, provider, tool name, token usage, cost, status, and error. Subagent IDs can be added once delegation starts to surface in traces. Environment and service tags can be added when coding-agent work is tied to runbooks or to work done by infrastructure teams. A field has to be in the trace before a dashboard can use it.
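A minimal sketch of such a record and one query over it, assuming a flat shape with one record per run. The property names mirror the fields listed above, but the `AgentRunRecord` type, the `costBy` helper, and every sample value are hypothetical, not a LangSmith schema.

```typescript
// Hypothetical flat record; one per run in a normalized trace.
interface AgentRunRecord {
  session_id: string;
  user: string;
  team: string;
  repo: string;
  branch: string;
  commit: string;
  pr: string | null; // null until the work is attached to a pull request
  agent_runtime: string;
  integration: string;
  model: string;
  provider: string;
  tool_name: string | null;
  token_usage: number;
  cost_usd: number;
  status: "ok" | "error";
  error: string | null;
}

// Sum cost by any field, so "cost per PR" and "cost per model" are one query.
function costBy(records: AgentRunRecord[], key: keyof AgentRunRecord): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    const group = String(r[key] ?? "unattributed");
    totals.set(group, (totals.get(group) ?? 0) + r.cost_usd);
  }
  return totals;
}

// Illustrative values only.
const base: AgentRunRecord = {
  session_id: "s1", user: "alice", team: "platform", repo: "api",
  branch: "feat/x", commit: "abc123", pr: "PR-1",
  agent_runtime: "claude-code", integration: "cli",
  model: "model-a", provider: "provider-a", tool_name: null,
  token_usage: 1000, cost_usd: 0.5, status: "ok", error: null,
};
const records: AgentRunRecord[] = [
  base,
  { ...base, agent_runtime: "codex", model: "model-b", cost_usd: 0.25 },
  { ...base, session_id: "s2", pr: null, cost_usd: 1.0 },
];

const costPerPr = costBy(records, "pr");
const costPerModel = costBy(records, "model");
```

Runs without a PR land in an `unattributed` bucket instead of disappearing, which keeps the gap between the invoice and the attributed total visible.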

Then add three questions to the weekly engineering review.

Which coding-agent sessions were expensive and useful?

Which sessions were expensive and stupid?

Which stupid session can be prevented next week?

Not glamorous, but useful to know. This review will find out whether the Context Packs and repo documentation are up to par, and where repo documentation is missing. It will find out whether the agent instructions prevent tool spam and whether the model defaults are current. It will also find out whether people are using coding agents for work that should really be scripted instead. And then there are the useful expensive sessions. Deep code archaeology costs money, and a multi-hour migration agent may cost a lot, but a clear outcome and a traceable run can still be worth it.

The control plane conversation continues. As coding agents become shared engineering infrastructure, registry, identity, policy, monitoring, cost, approvals, and retirement start to live together in the same operating estate we covered in Enterprise AI Agents Have a Control Plane Now.

Cost is part of the execution record

The cost of coding agents is not a side channel. It is part of the execution record for how software changed.

The uncomfortable truth here is that agent spend is someone’s engineering responsibility. Not in the performative sense of saying “use fewer tokens”, but in the normal operating sense of owning the fields, the trace shape, the gateway policy, the retry loops, and the review to confirm that very expensive work was worth it.

This is the operating model I trust: it does not shame developers for using agents, it does not imply that the cheapest model is the best. Spend should be managed the same way that good teams manage latency, errors, deploy risk, and incident noise. The signal should follow the execution path.

The invoice still matters. It just shows up too late to be the source of truth.

Agent Orchestration Belongs in Code | Focused Labs

Austin Vance — Sun, 05 Jul 2026 23:26:03 +0000

Agent orchestration is moving from prompt choreography into code.

LangChain’s new dynamic subagents have a really cool feature under the subagent headline. A root model can now write a tiny program, run it in an interpreter, and have that program work step by step through a narrow task() bridge. For example, when a prompt asks a model to cover every file, the model can write a loop that enumerates those files instead of pretending the request itself creates coverage.

Here is the piece the ‘ai agent orchestration’ conversation kept missing. We tell the model to plan, delegate, verify, retry, and synthesize, then ask it to remember the entire flow as tool results get streamed back into the context window. That is hope. Hope is not a strategy, and it is expensive.

LangChain’s dynamic subagents documentation explains the “cleaner contract”. In this model, the interpreter code dispatches the subagents configured for the task at hand: in series, in branches, or in parallel batches. The model is still deciding what work to do, and the code is there for coverage.

The long-standing harness problem in real AI agents is showing up again here. The framework conversation has value only when it graduates to deployment, tracing, evals, and ownership. The runtime shift LangChain previewed at Interrupt is already showing up in the orchestration layer.

The old pattern samples work

For small action chains the normal agent loop is perfectly fine. The model computes a course of action, invokes a tool, reads the tool’s output, invokes the next tool in the chain, and so on. For very small tasks this way of thinking is perfectly adequate.

Batch work breaks the rhythm.

A security review of a route directory in a filesystem is not a vibes exercise. The agent discovers the files, dispatches the review for each of the files found, keeps track of line numbers for code reviews, removes duplicates from the findings, verifies the severity of the findings etc. until a full report is generated. Turn-by-turn delegation as described above requires the model to keep a lot of bookkeeping in memory. The model can dispatch subagents, but the number of times this is needed, the retry shape, etc. is all in text.

LangChain’s launch post for dynamic subagents uses the example of one subagent per page of a 300-page document. The important line in that example is “Promise.all”. Once the subagent orchestration is written as code, coverage is no longer a prompt request.

The loop makes coverage structural instead of aspirational.

Inside the interpreter, the shape is boring in the best way:


// JSON Schema that every reviewer subagent must return.
const issuesSchema = {
  type: "object",
  properties: {
    issues: {
      type: "array",
      items: {
        type: "object",
        properties: {
          file: { type: "string" },
          line: { type: "number" },
          severity: { type: "string", enum: ["high", "medium", "low"] },
          description: { type: "string" },
        },
        required: ["file", "severity", "description"],
      },
    },
  },
  required: ["issues"],
};

// Enumerate the route files through the allowlisted glob tool.
const files = (await tools.glob({ pattern: "src/routes/**/*.ts" }))
  .split("\n")
  .filter(Boolean);

// Fan out one reviewer subagent per file, so coverage comes from the loop.
const reviews = await Promise.all(
  files.map((file) =>
    task({
      description: `Review ${file} for auth issues. Cite line numbers and severity.`,
      subagentType: "reviewer",
      responseSchema: issuesSchema,
    }),
  ),
);

// Flatten the per-file results and keep only the high-severity findings.
const highRisk = reviews
  .flatMap((review) => review.issues)
  .filter((issue) => issue.severity === "high");

highRisk;

The code follows the current Deep Agents interpreter docs: programmatic tool calling, where applicable and after allowlisting, is exposed under the tools.* namespace, and dynamic subagents are exposed through task() when configured.

Note what changes in the architecture: the context window is no longer the temporary store for intermediate values. The interpreter holds the working set of values of interest, and the model is presented with the relevant result.

RLMs make the direction obvious

Recursive Language Models (RLMs) make the same move. Long inputs are treated as an external environment that is loaded into a REPL (read-eval-print loop). In that REPL, the model writes code that examines parts of the input, decomposes the task into sub-tasks, and recursively calls language models on short snippets. The authors report handling inputs up to two orders of magnitude beyond model context windows, with significantly better results than direct model calls and other approaches to long inputs.

LangChain's RLMs in Deep Agents post translates that to agent infrastructure. Deep Agents keeps the working set in interpreter variables, selects context slices, dispatches subagents with task(), and synthesizes the objects those subagents return. The post also reports an OOLONG proof of concept where the REPL-backed agent performed better at 128k context than the plain agent on long-context aggregation work.

This is the correct mental model for ‘agentic ai architecture’. The model shouldn’t be sitting in front of a huge pile of context trying to work out what to do next. It should have a work queue, a loop, and typed return values. The model should apply its judgment to decomposing a problem into constituent components and synthesizing those components back into a solution, not to tracking each individual item in a sequence and recalling them in order.

Context and work become data that the harness can partition, route, and verify.

Claude Code is heading down a similar path. Anthropic says dynamic workflows let Claude write orchestration scripts that run parallel subagents, save progress, check work, and return one coordinated answer. Similar direction, different harness. Agent orchestration is becoming executable.

The security question moves to the bridge

A different set of fears comes into play for teams worried about the agent writing code. They imagine the full complexity of a sandboxed shell: package manager, file system, network, and all the newfound chaos a bored engineer can unleash on a Friday. That is one set of fears.

Interpreters are smaller.

QuickJS interpreter code has no host filesystem, network, shell, package manager, or even clock by default. It can compute, hold variables, and print to console. The two explicit bridges are programmatic tool calling through an allowlist and subagent dispatch via task().

The interpreter is useful because the bridge is narrow.

LangChain’s post on running untrusted agent code without a sandbox describes the boundary. QuickJS runs in WebAssembly. The host code exposes only the capabilities that we intend for the untrusted code to use. The WebAssembly project describes Wasm as a memory-safe, sandboxed execution environment. QuickJS-ng describes QuickJS as a small embeddable JavaScript engine. Small here means less surface area, which is what I care about. If the core JavaScript engine has less surface area, then there are fewer strange ways for odd privileges to reach the orchestration code.

Full sandboxes are still relevant. LangChain’s sandbox guidance lists the features that matter for coding, data analysis, and browser work, where an agent usually has many dependencies and often needs to run for a long time: an isolated filesystem, restricted outbound network connections, resource limits, controlled reuse of prior sandbox state, and full kernel-level isolation. In short, for such work the agent needs a whole computer.

An interpreter is not a general-purpose sandbox but rather a tool to fulfill a specific set of tasks such as loops, conditional statements, filtering, aggregation, and fanout within a given capability. For tasks where an agent needs to actually run on a machine, a full sandbox is more appropriate. For tasks where an agent needs to run orchestration, an interpreter is more suitable.

Meta's Agents Rule of Two details the risk: until prompt injection is solved, an agent should not exceed two of untrusted inputs, sensitive or private data, and external state change or communication in a session. Interpreter bridges give the harness a place to enforce that rule: define which tools exist, which subagents run, where approval happens, and what data crosses back.
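A harness could enforce the rule when it decides which tools to expose in a session. The sketch below assumes each tool is tagged with the three properties. The `ToolDescriptor` shape and the function names are hypothetical, and real tagging would need more care than three booleans.

```typescript
// Hypothetical descriptor: each tool declares which properties it brings.
interface ToolDescriptor {
  name: string;
  readsUntrustedInput: boolean;
  readsSensitiveData: boolean;
  changesExternalState: boolean;
}

// Count how many of the three properties the exposed tool set puts in play.
function propertiesInPlay(tools: ToolDescriptor[]): number {
  const flags = [
    tools.some((t) => t.readsUntrustedInput),
    tools.some((t) => t.readsSensitiveData),
    tools.some((t) => t.changesExternalState),
  ];
  return flags.filter(Boolean).length;
}

// A session satisfies the rule when at most two properties are present.
function satisfiesRuleOfTwo(tools: ToolDescriptor[]): boolean {
  return propertiesInPlay(tools) <= 2;
}

// Illustrative tools only.
const webFetch: ToolDescriptor = {
  name: "web_fetch", readsUntrustedInput: true, readsSensitiveData: false, changesExternalState: false,
};
const crmRead: ToolDescriptor = {
  name: "crm_read", readsUntrustedInput: false, readsSensitiveData: true, changesExternalState: false,
};
const sendEmail: ToolDescriptor = {
  name: "send_email", readsUntrustedInput: false, readsSensitiveData: false, changesExternalState: true,
};

const researchSession = satisfiesRuleOfTwo([webFetch, crmRead]); // true: two properties
const fullSession = satisfiesRuleOfTwo([webFetch, crmRead, sendEmail]); // false: all three
```

When the check fails, the harness has options at the bridge: drop a tool from the allowlist, or require human approval before the third property comes into play.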

This is where the harness owns the orchestration boundary. The model proposes a program. The harness decides what the program can touch.

Executable orchestration has receipts

Programmatic orchestration will create a whole new set of script bugs. Simple scripts can still pass the wrong schema. More complex scripts can spawn too many subagents, or filter away the signal in the jobs submitted. A stale interpreter variable can survive across turns because state persistence was configured that way.

Good. Those are inspectable failures.

LangChain’s interpreter persistence has three modes: “thread”, “turn”, and “call”. In “thread” mode, middleware stores the state of the interpreter between turns. In “call” mode, every “eval” starts from scratch. As with tool allowlists, task() visibility, concurrency limits, schema contracts, and approval gates, the trade-offs here need to be discussed in code review, not locked away in prompt folklore.

This is a warning from the dynamic subagents docs and it should go into the checklist: task() dispatches from within an already-running eval call. Parent-agent interrupt_on approval workflows are not enforced per task dispatch. That means gate the eval tool when approval has to happen before the orchestration runs.

That line belongs in the operating model.

There is a huge benefit to keeping a record of the script that was run, the subagents that were spawned, the tool-bridge calls, the results that were filtered, the approvals that were issued, and the final synthesized result. Without this record, a fanout-based system is a beautiful machine with no receipts for what it did. Agent UI is runtime infrastructure for the simple reason that products need typed handles for tools, approvals, subagents, errors, and workflow state. With programmatic orchestration, the issue becomes even more pronounced: the events generated by subagents are the only way to see what actually happened during the run of the main program.

Observability follows the same trajectory. As orchestration moves to code, traces must reveal the program boundary as well as the model spans inside it. Teams with observable agentic systems see execution as a ‘movie’, where each frame is visible as an operational event. They can see the exact point at which an interpreter call started a worker, what that worker returned, the exact point at which a bridge call crossed a capability boundary, and where synthesis is dropping information. This is how an operational dashboard moves from ‘theater’ to ‘operational evidence’.

Own the loop

The classic approach to orchestrating many agents was always about choosing the right supervisor pattern: supervisor, swarm, handoff, or router. So much beautiful vocabulary that almost never gets used in practice.

The more important operating question is where the loop lives.

By writing a simple loop, we can in effect give the team (a) coverage, (b) a schema to work with, (c) specific replay points of interest, (d) approval gates for specific pieces of output, and (e) traces of exactly what the model was doing. However, the model still has a real purpose: deciding (1) how to decompose the problem, (2) which specialists to bring to bear, (3) which facts are salient, and (4) how to pull it all together into a coherent answer. The harnessing code then implements these decisions in the form of a program of finite scope.

That’s where the real value is in dynamic subagents, RLMs, and Claude Code workflows: executable orchestration within a runtime that knows what code to run. Bigger prompts and more sophisticated supervisor labels don’t cut it here.

Own that boundary. Trace it. Review it. Gate it.

Let the model write the loop. Don’t let the loop float around in the prompt.

AI Agent Observability Runs on Conversation IDs | Focused Labs

Austin Vance — Fri, 03 Jul 2026 17:15:36 +0000

Agent observability gets weirdly polite at the exact moment it should get nosy. It records the model call, stores the prompt, counts tokens, and then loses interest right when the agent starts touching software.

That makes the trace look clean and the incident feel impossible.

The failure rarely lives inside the LLM span. Honeycomb's Agent Timeline instrumentation guide says a GenAI span can be any work the agent caused: model calls, tool calls, handoffs, downstream services, database queries, and background jobs. That is the right unit of observation. The agent is runtime software making choices and causing side effects.

The conversation ID is where that reality starts to show up.

The trace breaks where the work starts

LLM-call logging feels useful because it gives teams an artifact. The prompt was this. The response was that. The model returned tool calls. Fine.

Then the tool calls a service. The service writes a row. A queue worker picks up the job. A downstream API retries. A database query times out after the agent has already answered the user. The model trace stays green because the trace lost the work.

The older observability mindset matters even more with agents. Logs answer the question a developer guessed in advance. Traces let the team follow the shape of the system after the weird thing has already happened. Agent systems add a worse version of the same problem because the execution path is partly selected at runtime.

A token count will not explain why the refund tool wrote to the wrong account. A prompt transcript will not explain why a sub-agent retried the same API call through two paths. A model latency chart will not explain why a background reconciliation job created the user-visible failure.

The agent caused work. The trace has to follow the work.

The useful trace is the work the agent caused, not the model call it made.

Conversation ID is the live unit

Honeycomb's docs make the minimum viable shape pretty plain. Agent spans need gen_ai.conversation.id, gen_ai.agent.name, and gen_ai.operation.name so the timeline can group spans into a session, attribute work to an agent, and distinguish operations like chat, execute_tool, invoke_agent, and invoke_workflow inside the Agent Timeline.
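A minimal sketch of checking for that shape before a span is exported. The three attribute names come from the docs quoted above; the helper name and the plain-dict span are assumptions made for illustration, not part of any SDK.

```python
# Attribute names from the text above; everything else is hypothetical.
REQUIRED_AGENT_ATTRIBUTES = (
    "gen_ai.conversation.id",
    "gen_ai.agent.name",
    "gen_ai.operation.name",
)


def missing_agent_attributes(span_attributes: dict) -> list[str]:
    """Return the required attribute names that are absent or empty."""
    return [
        name
        for name in REQUIRED_AGENT_ATTRIBUTES
        if not span_attributes.get(name)
    ]


span = {
    "gen_ai.agent.name": "billing_agent",
    "gen_ai.operation.name": "execute_tool",
}
print(missing_agent_attributes(span))  # ['gen_ai.conversation.id']
```

A check like this could run in a unit test or a staging exporter, so a span that cannot be grouped into a session gets noticed before an incident does the noticing.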

The phrase reads like metadata, but it decides whether the team has a trace or a receipt.

The trace ID describes one distributed execution. The conversation ID describes the user-facing unit of work across traces, services, and turns. A support agent classifies a request, fetches account state, hands off to a billing agent, and waits for a queue worker before it sends the final response. Those steps can land in different traces. The customer experienced one conversation. The system created a pile of spans.

Without gen_ai.conversation.id, the pile stays a pile.

There is one easy mistake here: do not fake the identifier at the leaf. The OpenTelemetry GenAI agent-span conventions say gen_ai.conversation.id should be populated only when a real identifier is readily available, and should not fall back to a new UUID, trace ID, or hash of request content when no conversation identifier exists. That guidance matters. If every service invents its own conversation ID, the attribute becomes confetti with better naming.

Mint the conversation or session ID at the boundary where the product understands the interaction. Pass it inward. Attach it to agent spans, tool spans, HTTP client spans, queue jobs, DB spans, and eval events. If a downstream service cannot carry it, that service is where the trace goes blind.
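One way to sketch "mint at the boundary, pass it inward" in plain Python is a context variable that inner code reads but never invents. The function names and the `conv-123` value are hypothetical, and a real system would attach these attributes through its tracing SDK rather than a dict.

```python
import contextvars

# Hypothetical holder for the ID minted at the product boundary.
_conversation_id = contextvars.ContextVar("conversation_id", default=None)


def start_conversation(conversation_id: str) -> None:
    """Call once at the product boundary with the real conversation ID."""
    _conversation_id.set(conversation_id)


def span_attributes(agent_name: str, operation: str) -> dict:
    """Build span attributes; leave the conversation ID off if none exists."""
    attrs = {
        "gen_ai.agent.name": agent_name,
        "gen_ai.operation.name": operation,
    }
    conversation_id = _conversation_id.get()
    if conversation_id is not None:
        # No fallback to a fresh UUID, trace ID, or content hash here.
        attrs["gen_ai.conversation.id"] = conversation_id
    return attrs


start_conversation("conv-123")  # assumed to come from the product's session store
print(span_attributes("billing_agent", "execute_tool"))
```

The design choice is the missing `else` branch: a leaf that cannot see a real ID emits a span without one, which makes the blind spot visible instead of papering over it.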

We have written before about trace context crossing the tool boundary. Conversation IDs add the user-facing thread across those technical boundaries. Trace context keeps parent-child relationships intact. Conversation context lets the team ask, "show me everything this agent conversation caused." Both have to survive the tool call.

Span ancestry is not paperwork

LangSmith's OpenTelemetry docs include a nasty little failure mode: a child span whose parent never reaches LangSmith can be accepted with a 200, buffered, and then dropped later if the parent never arrives because the OTEL endpoint processes spans asynchronously. That is exactly the kind of observability bug that looks fine in CI and ruins an incident review.

The system said yes. The evidence vanished.

This is why partial agent tracing is dangerous. A team can have beautiful traces for the orchestrator and still lose the tool execution. Another team can export the tool service and lose the parent agent span. Sampling can keep the cheap part and drop the causal root. A vendor console can show green ingestion while the useful run never materializes.
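A rough sketch of catching that failure before the backend does: scan a batch of spans for children whose parent is not in the batch. The dict shape (`span_id`, `parent_id`) is an assumption for illustration; a real check would also need to account for parents exported in a separate batch.

```python
def find_orphan_spans(spans: list[dict]) -> list[str]:
    """Return IDs of spans whose parent_id points at a span not in the batch."""
    known_ids = {span["span_id"] for span in spans}
    return [
        span["span_id"]
        for span in spans
        if span.get("parent_id") is not None
        and span["parent_id"] not in known_ids
    ]


batch = [
    {"span_id": "root", "parent_id": None, "name": "invoke_agent"},
    {"span_id": "tool-1", "parent_id": "root", "name": "execute_tool"},
    # The queue worker's span was never exported, so this child is orphaned.
    {"span_id": "db-1", "parent_id": "worker-9", "name": "db.query"},
]
print(find_orphan_spans(batch))  # ['db-1']
```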

Agent observability has to be treated as an export contract.

Every service that participates in the agent path needs to agree on ancestry, conversation ID, sampling behavior, redaction rules, and ownership. That includes boring services. Especially boring services. Billing APIs, CRM updates, search indexes, authorization checks, queue workers, and database writes are where the agent becomes real software. The model span is just the planner with a transcript.

This is why agent monitoring as infrastructure is the right operating model. The app team cannot sprinkle tracing on the agent wrapper and call it done. The platform has to make propagation easy, collectors safe, sampling legible, and missing-span failures visible.

Agent names decide accountability

Multi-agent systems make the naming problem uglier.

Honeycomb's docs warn that each agent needs a unique gen_ai.agent.name, and that sub-agents should not inherit the parent agent's name because duplicate or missing names make investigation impossible inside a grouped agent timeline. That sounds fussy until the first live handoff fails.

The billing agent invoked the policy agent. The policy agent called a CRM tool. The CRM tool wrote a field that changed the next user message. If every span says support_agent, the team has a transcript and no ownership record. If every sub-agent gets a stable name, invoke_agent becomes a runtime transfer with evidence attached.
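The inherited-name problem can be caught mechanically. The sketch below flags `invoke_agent` spans whose agent name is missing or identical to the parent span's name; the span dicts and the function name are hypothetical, and the attribute keys are the ones named in the text.

```python
def inherited_agent_names(spans: list[dict]) -> list[str]:
    """Return IDs of invoke_agent spans that reuse or lack an agent name."""
    by_id = {span["span_id"]: span for span in spans}
    flagged = []
    for span in spans:
        if span.get("gen_ai.operation.name") != "invoke_agent":
            continue
        parent = by_id.get(span.get("parent_id"))
        if parent is None:
            continue  # root agent span, nothing to inherit from
        name = span.get("gen_ai.agent.name")
        if not name or name == parent.get("gen_ai.agent.name"):
            flagged.append(span["span_id"])
    return flagged


spans = [
    {"span_id": "a", "parent_id": None,
     "gen_ai.operation.name": "invoke_agent", "gen_ai.agent.name": "support_agent"},
    {"span_id": "b", "parent_id": "a",
     "gen_ai.operation.name": "invoke_agent", "gen_ai.agent.name": "support_agent"},
    {"span_id": "c", "parent_id": "a",
     "gen_ai.operation.name": "invoke_agent", "gen_ai.agent.name": "policy_agent"},
]
print(inherited_agent_names(spans))  # ['b']
```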

That is the same boundary we hit in agent handoffs as runtime state. Handoffs are not vibes. They are ownership transfers. The receiving agent needs state. The sending agent needs a receipt. The trace needs to show who acted, who delegated, who observed the result, and which name owns the failure.

Agent names are also release artifacts. A renamed agent can break dashboards. A reused name can hide a new implementation behind an old label. A temporary experiment name can leak into live traffic and turn a week of traces into archeology. Name the agent like a service. Version the behavior somewhere else.

Prompts are data with a governance boundary

A useful agent trace wants the thing security teams hate storing: prompts, responses, tool arguments, tool outputs, retrieved context, and evaluation notes.

That tension will not go away. It has to be designed.

Honeycomb recommends storing full prompts, chat history, and completions as span events because they may be large or contain sensitive data, while the docs call out that an OpenTelemetry Collector can filter or redact them before ingestion when events carry prompt and completion content. LangSmith documents the same architectural move: route application traces through an OpenTelemetry collector, apply transform rules that redact sensitive span attributes, and forward sanitized traces to LangSmith through a gateway-style redaction path.

That collector draws the line between operational evidence and data sprawl.

Agent observability still has a data-governance boundary.

The safe pattern is boring: keep low-cardinality routing attributes on spans, put sensitive prompt and completion payloads in events or controlled attributes, redact at the collector, and make the redaction policy testable. The team should know which fields are kept, which fields are dropped, and which fields are transformed before they hit a vendor or shared backend.
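"Testable" can be taken literally. The sketch below models the redaction rule as a pure function over a span dict so it can be asserted on in CI. In a real deployment the rule would live in the collector's own configuration; the key names in `SENSITIVE_KEYS` and the span shape here are assumptions.

```python
import copy

# Hypothetical event attribute keys that carry prompt and completion payloads.
SENSITIVE_KEYS = {"gen_ai.prompt", "gen_ai.completion", "tool.arguments"}


def redact_span(span: dict) -> dict:
    """Return a copy of the span with sensitive event payloads masked."""
    redacted = copy.deepcopy(span)
    for event in redacted.get("events", []):
        attributes = event.get("attributes", {})
        for key in attributes:
            if key in SENSITIVE_KEYS:
                attributes[key] = "[REDACTED]"
    return redacted


span = {
    "name": "chat",
    "attributes": {
        "gen_ai.conversation.id": "conv-123",
        "gen_ai.agent.name": "support_agent",
    },
    "events": [
        {"name": "prompt", "attributes": {"gen_ai.prompt": "customer message text"}},
    ],
}
clean = redact_span(span)
print(clean["events"][0]["attributes"])  # {'gen_ai.prompt': '[REDACTED]'}
```

Routing attributes survive, payloads do not, and the original span is left untouched so the test can compare before and after.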

There is another boundary hiding here. LangSmith's distributed tracing guide warns that langsmith-trace and baggage headers should be accepted only from trusted internal services, not public callers, because they are consumed as trusted tracing context and can influence how runs are recorded if a gateway passes them through. Good. Trace context is not harmless just because it is metadata.

A public request can carry lies. Internal propagation can carry evidence. The gateway has to know the difference.
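A small sketch of that gateway rule, assuming the gateway already knows whether a caller is a trusted internal service. The two header names are the ones mentioned above; the function name and the trust flag are hypothetical.

```python
# Header names from the text; matching is case-insensitive.
UNTRUSTED_TRACE_HEADERS = {"langsmith-trace", "baggage"}


def strip_trace_headers(headers: dict, from_trusted_service: bool) -> dict:
    """Drop tracing context headers unless the caller is a trusted service."""
    if from_trusted_service:
        return dict(headers)
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in UNTRUSTED_TRACE_HEADERS
    }


incoming = {
    "Content-Type": "application/json",
    "langsmith-trace": "abc",
    "Baggage": "k=v",
}
print(strip_trace_headers(incoming, from_trusted_service=False))
# {'Content-Type': 'application/json'}
```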

The trace should feed control

The point of agent observability is not a prettier incident screenshot.

Candidly's LangSmith writeup is useful because it shows traces turning into a control surface. Their Cait financial-planning agent moved from post-hoc conversation evaluation toward turn-level state inference. The labeling pipeline reached 92.3% agreement with a human-labeled LangSmith dataset, and trace-derived features predicted resolved versus abandoned conversations with 0.90 AUC in the Candidly case study.

That is the loop worth copying.

Live traces should become regression cases, eval datasets, policy changes, routing fixes, and release gates. A conversation ID makes that possible because it lets the team collect the entire causal path, not just the final answer. The trace shows which tool ran, which state changed, which sub-agent owned the handoff, which prompt version ran, and which downstream span failed. The eval can score the behavior against the actual evidence.

That is also why AI agent evaluation has to steer the harness. The trace cannot stop at telemetry collection. If a live conversation exposes a broken handoff, the fix should not be a screenshot in Slack. It should become a test case and a control.

The best traces become software.

Instrument the boring path first

The first pass does not need a grand observability platform. It needs a clean propagation contract.

Start with the product conversation or session boundary. Create or retrieve the real conversation ID there. Pass it through the agent runtime. Attach it to the root invoke_agent span. Carry it into LLM calls, tool execution, service calls, queues, and database work. Add gen_ai.agent.name at every agent boundary and use stable names for sub-agents. Set gen_ai.operation.name with the closest standard value instead of inventing labels for basic operations.

Then break the system on purpose.

Force a tool timeout. Drop a parent span in a staging environment. Send a request through a queue. Trigger a sub-agent handoff. Verify the timeline still tells the story. Check that the collector redacts the fields it claims to redact. Confirm public trace headers get stripped at the edge. Find the first span where the conversation ID disappears and fix that before buying another dashboard.

This work feels unglamorous because it is. Also because it is the part that decides whether the next incident review has evidence.

AI agent observability follows what the agent caused. The conversation ID is the thread. Pull it through the stack, or accept that live systems will keep handing back neat little model traces while the actual failure happens somewhere else.

Documentation Drift Breaks Coding Agents | Focused Labs

Austin Vance — Thu, 02 Jul 2026 23:11:04 +0000

Documentation drift used to waste human developers' time. Now it can cause coding agents to take wrong actions, and even to ship the wrong change.

Previously boring documentation issues take on new importance because they can now decide whether a coding agent ships the wrong change.

Software documentation tools used to sit beside the delivery process. They helped onboarding, audits, support, architecture reviews, and the occasional brave soul trying to understand why one module still talks SOAP. Coding agents move that documentation into the execution path. The words in AGENTS.md, CLAUDE.md, repo wikis, rubrics, runbooks, and generated summaries become operating instructions.

LangChain is making this explicit with OpenWiki, an open source agent and CLI for generating and maintaining repo documentation for coding agents. Once agents act on the repo using that documentation, the documentation becomes the substrate they run on.

Docs are part of the runtime now

A human can always work around even the worst documentation. Slow, but true. A senior engineer remembers how the billing path was moved around last quarter, checks blame to find who wrote the offending doc page, searches Slack for the conversation, and changes the docs. Easy peasy.

A coding agent's failures, on the other hand, are insidious because they become commits that carry the pretense of finished work. A bad or drifted assumption in a document, for example that the billing path still lives in the old service, gets serialized into a diff, and it looks productive!

Code documentation AI has become an operational boundary. Yesterday's prompt-cache argument was about how context order became runtime policy. This is the next layer down: the actual repo knowledge being cached, retrieved, summarized, and obeyed.

Hot files are typically cursed. But because agents follow AGENTS.md and CLAUDE.md as instructions, those files can be maintained as part of the overall repo knowledge rather than by hand. The project README says OpenWiki creates or updates an openwiki/ directory, appends agent instructions, supports openwiki --init and openwiki --update, and includes a GitHub Action template for documentation updates.

Small hot file. Retrieved cold knowledge. Maintenance loop.

The hot file should point to maintained repo knowledge, not carry the entire codebase story itself.

The hot file should contain the rules that apply to every run: commands to build, test, and run code; where services and packages are stored; ownership bounds of a service; security constraints to be applied; where a service might be routed to. Every agent run pays for everything that has been added to that file, so it should not become a repository for all the architecture decisions that were made along the way.

The giant prompt dump is the lazy fix

Of course, the obvious thing to do with all that knowledge is to include the entire wiki in the prompt. After all, more context must be better, right? But that way lies madness. And not just because of all the noise. It's also that the knowledge in the wiki is likely to be stale in places, and the model will simply latch onto the first relevant-looking paragraph it finds.

Chroma's context-rot work holds the complexity of a task constant while increasing the length of the input, and it finds that LLM performance gets less reliable as the context grows across 18 tested models. The key point for this essay is that it is not enough for information to be in context. How that information is presented matters.

The same holds in large codebases. A team can write down a huge amount of Markdown, but what matters is which context is actually hot and being retrieved, which is being ignored, and which has been proven wrong by a trace or a test.

The Codified Context paper makes the same argument in a different setting. The author builds a 108,000-line C# system with a three-tier context architecture: hot-memory constitution, specialized agents, and a cold-memory knowledge base. The work makes clear that documentation read by agents is load-bearing infrastructure that the agents rely on for correct results.

This moves software documentation out of the "vibes" category and into something a team can operate. A wiki page that feeds a coding agent is much like a config file and should have similar characteristics: it should have owners, it should change in review, it should be easy to diff, and it should have a way to fail.

So far, I've framed this as a question of document quality: are the docs searchable, pleasant to read, versioned, and easy for their authors to maintain? Now that agents read these docs while executing work against the repo, we are asking a sharper question about the system the docs live inside. Can that system serve up the context an agent needs, with all the properties of good knowledge: provenance, scope, freshness, and feedback from the work it was used for?

Drift is the failure mode

Documentation drift has been around for a long time. The repository changes. The documentation for the codebase does not get updated in time. A new engineer shows up and has to spend a day or two figuring out why a piece of code does not work, and then the team updates the relevant wiki page or documentation for that feature during their next big clean up.

With agents, a single stale architecture note can steer a pull request toward the wrong thing. A stale runbook can send an incident assistant to the wrong dashboard. A coding agent that runs a suite of tests the team no longer runs, and may no longer even keep on disk, can "successfully" pass tests and push bugs into live systems. Code review for an implementation change has to be able to find these context bugs today.

This is why agent-facing docs have to be in the same loop as agent evaluation. Once we've established that agent evaluation has to keep running after release, we also have to establish that the agent-facing documentation is in that same continuous loop of work.

LangChain's Pendo case study describes Novus, Pendo's product, connecting product analytics, user behavior, session replay, code context, and LangSmith traces to code fixes. Pendo says Novus reached a 90%+ success rate on PM-reviewed evals and moved to live use in days.

This also means that the evidence collected during evaluation needs to update the documentation the agent used to produce the work in the first place. If an engineer or reviewer repeatedly corrects the same thing, for example an agent that keeps using the wrong subsystem for a task or the wrong convention for something, then that correction belongs in the agent-facing documentation for the repo. Similarly, if a trace shows a chain of tool calls detouring through an old API, the wiki should describe the current migration boundary for that API.

Documentation drift has to be caught by the same delivery loop that catches agent behavior.

When an update to agent-facing repository documentation changes what the agent implements, that update should be a pull request.

Faster code makes stale context more expensive

AI-assisted development is producing much more output. Mixpanel reported 50% more pull requests with the same engineering team after AI entered the workflow. The writeup also describes teams connecting MCP and AI coding agents to observability data so agents can inspect traces and see evidence from live changes.

At that volume, documentation drift compounds quickly. Every pull request, every new work item, and every generated change can be built on the wrong premise of stale documentation. Reviewers lose time. Rework loops multiply. CI proves the wrong thing out. The whole system starts sounding busy while being wrong.

This is the legacy-codebase problem applied to generated work, now with a new execution engine. The hard part of a large system is not writing new code but understanding local knowledge: where the bodies are buried, which abstractions are mere formality and can be ignored, which tests actually signal real failure, which naming conventions are law, which service boundaries are political. We've already written about legacy codebase knowledge as a team practice and discussed a variety of approaches. The rest of this piece focuses on the new execution engine.

For years we have advocated for teams to write down their legacy codebase knowledge and share it with others on the team. What we failed to recognize, however, was that code written by agents would have its own local context that could be as hard to understand as existing code and which, indeed, would have a lifecycle, would have ownership, could be monitored with telemetry and tested through runtime QA.

Documentation that can change an agent's code output should be treated as code. That is a useful operating rule: the documentation gets the same scrutiny as code, lives in version control, and can be rolled back.

When changes to agent-facing documentation can affect the code an AI coding agent generates, those changes should be handled like code changes: written, reviewed, tracked in a repository, traced back to fixes, and released to live systems. It has to be possible to reverse them when the agent gets worse.

Own the agent-facing docs

Ownership is the boring part. Also the part that decides whether this works.

An agent-facing documentation system needs a maintainer model. The platform team can own the mechanism: index, retrieval, templates, tracing, update automation. Product teams should own domain facts. Security should own permission boundaries and forbidden patterns. Test owners should own the commands that prove done. Architecture owners should own migration notes and subsystem maps.

I have described agentic AI implementation running through change control before: because AI is implemented as change to a system, that change must run through records, gates, a rollback owner, and evidence, just as human change does. Agent-facing documentation deserves the same change control as the rest of the context the agent uses to generate output.

The loop can be simple:

A code change lands.
The docs job checks whether agent-facing context still matches the touched subsystem.
The coding agent opens a documentation update PR when drift appears.
Evals run against representative agent tasks.
Reviewers approve the context change together with the code change, with a trace or a failed task attached as evidence.
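The second step of that loop, the docs job, can start very small. The sketch below assumes each agent-facing doc declares which path prefixes it describes; the mapping, the file names under openwiki/, and the function name are all hypothetical, not something OpenWiki provides.

```python
def docs_needing_review(
    doc_scopes: dict[str, list[str]], changed_files: list[str]
) -> list[str]:
    """Return docs whose declared path prefixes cover any changed file."""
    flagged = []
    for doc, prefixes in doc_scopes.items():
        normalized = [prefix.rstrip("/") + "/" for prefix in prefixes]
        if any(
            changed.startswith(prefix)
            for changed in changed_files
            for prefix in normalized
        ):
            flagged.append(doc)
    return sorted(flagged)


# Hypothetical mapping from agent-facing docs to the code they describe.
doc_scopes = {
    "openwiki/billing.md": ["services/billing"],
    "openwiki/auth.md": ["services/auth"],
}
changed_files = ["services/billing/invoice.py", "README.md"]
print(docs_needing_review(doc_scopes, changed_files))  # ['openwiki/billing.md']
```

This only says which docs deserve a second look after a change lands. Whether the doc has actually drifted is still a question for the agent, the evals, and the reviewer.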

Agent-readable docs are a separate layer of context infrastructure. They need to be retrievable, citable, testable, and repairable by the humans who own the system.

Documentation drift used to be a tax. With coding agents it becomes a runtime defect, and the team has to own it.

Your RAG Pipeline Hallucinates Because It Never Checks Its Own Work

Austin Vance — Wed, 01 Jul 2026 23:06:29 +0000

Your team ships a documentation chatbot. It retrieves chunks, stuffs them into a prompt, and generates an answer. Demo day goes great. Then a customer asks "what's the rate limit for the batch API?" and the bot confidently answers "10,000 requests per minute" — citing a doc about a completely different API. Nobody catches it because the answer sounds plausible.

This is the core failure mode of naive RAG: the retriever returns something, the generator uses it, and nobody checks whether the retrieved context actually answers the question. The fix isn't better embeddings or bigger context windows. The fix is a pipeline that grades its own retrieval, rewrites the query when results are poor, and refuses to generate when the context doesn't support an answer.

This post builds a corrective RAG pipeline using LangGraph. Retrieve, grade, rewrite if needed, generate with citations. The architecture adds ~1.5 seconds of latency on the retry path but drops hallucinated citations from ~18% to under 3% in our evals. That's not a prompt trick — it's structural.

The Latency Math

Naive RAG is fast because it skips the hard parts:

Step	Latency
Embed query	~0.1s
Vector search	~0.05s
Generate answer	~2.5s
Total	~2.7s

Corrective RAG adds grading and optional rewriting:

Step	Latency
Embed query	~0.1s
Vector search	~0.05s
Grade relevance	~0.8s
Generate answer	~2.5s
Total (good retrieval)	~3.5s
Rewrite query + re-retrieve + re-grade	~1.5s
Total (with retry)	~5.0s

The retry path costs 1.5 seconds extra. But it only fires when retrieval quality is low — roughly 15-25% of queries in a typical technical docs corpus. The alternative is hallucinating an answer 100% of those times. That math is easy.
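To make that math concrete, here is a back-of-the-envelope sketch using the per-step numbers from the tables above. It assumes at most one retry per query, which is a simplification; the names are illustrative.

```python
# Per-step latencies in seconds, taken from the tables above.
GOOD_PATH_SECONDS = 0.1 + 0.05 + 0.8 + 2.5  # embed + search + grade + generate
RETRY_EXTRA_SECONDS = 1.5  # rewrite + re-retrieve + re-grade


def expected_latency(retry_rate: float) -> float:
    """Average latency if a fraction `retry_rate` of queries retries once."""
    return GOOD_PATH_SECONDS + retry_rate * RETRY_EXTRA_SECONDS


for rate in (0.15, 0.25):
    print(f"retry rate {rate:.0%}: about {expected_latency(rate):.2f}s on average")
```

Averaged over all queries, the retry path adds a few tenths of a second, not the full 1.5 seconds, because most queries never take it.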

The Corrective RAG Pipeline Architecture


                              ┌──────────────┐
                              │   Retrieve    │
                              └──────┬───────┘
                                     │
                              ┌──────▼───────┐
                          ┌───│ Grade Docs    │───┐
                          │   └──────────────┘   │
                      relevant              not relevant
                          │                      │
                          │               ┌──────▼───────┐
                          │               │ Rewrite Query │
                          │               └──────┬───────┘
                          │                      │
                          │               ┌──────▼───────┐
                          │               │ Re-Retrieve   │──→ (back to Grade)
                          │               └──────────────┘
                          │
                   ┌──────▼───────┐
                   │   Generate    │
                   └──────┬───────┘
                          │
                          END

The key insight: grading is a gate, not a filter. If the retrieved documents don't answer the question, the pipeline doesn't generate a worse answer — it rewrites the query and tries again. After a configurable number of retries, it generates with a "low confidence" flag rather than hallucinating.
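The gate itself can be sketched as a small routing function over the pipeline state. The field names match the state tracked in this post; the function name and the returned labels `"generate"` and `"rewrite"` are assumptions for illustration, and wiring them into a graph is a separate step.

```python
def route_after_grade(state: dict) -> str:
    """Decide the next step after grading: generate, or rewrite and retry."""
    if state["relevance_score"] >= 1.0:
        return "generate"
    if state["rewrite_count"] < state["max_rewrites"]:
        return "rewrite"
    # Retries exhausted: generate anyway and let the generator flag
    # the answer as low confidence.
    return "generate"


print(route_after_grade({"relevance_score": 0.0, "rewrite_count": 0, "max_rewrites": 2}))
# rewrite
print(route_after_grade({"relevance_score": 0.0, "rewrite_count": 2, "max_rewrites": 2}))
# generate
```

Keeping the decision in a pure function makes the retry budget easy to unit test without calling a model.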

State: Track Everything

RAG state needs more than question and answer. You need to track retrieval quality, rewrite count, and the documents themselves — because your evals will need all of it.


from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)

class RAGState(TypedDict):
    question: str
    rewritten_query: str
    documents: list[dict]
    relevance_score: float
    rewrite_count: int
    max_rewrites: int
    answer: str
    citations: list[dict]
    confidence: str

Document Ingestion

Before the pipeline runs, documents need to be split, embedded, and indexed. This is the part most tutorials skip, and where most production RAG systems break.


from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

DOCS = [
    Document(
        page_content=(
            "The Batch API allows you to send up to 50,000 requests in a single batch. "
            "Each batch is processed asynchronously and results are available within 24 hours. "
            "Rate limits for the Batch API are separate from real-time API limits. "
            "Maximum batch size is 100MB per file."
        ),
        metadata={"source": "api-docs/batch-api.md", "section": "overview"},
    ),
    Document(
        page_content=(
            "Real-time API rate limits depend on your tier. Tier 1: 500 RPM, 30,000 TPM. "
            "Tier 2: 2,000 RPM, 120,000 TPM. Tier 3: 5,000 RPM, 600,000 TPM. "
            "Rate limit errors return HTTP 429. Implement exponential backoff."
        ),
        metadata={"source": "api-docs/rate-limits.md", "section": "tiers"},
    ),
    Document(
        page_content=(
            "Authentication uses API keys passed via the Authorization header. "
            "Keys are scoped to organizations. Rotate keys every 90 days. "
            "Never commit API keys to version control."
        ),
        metadata={"source": "api-docs/authentication.md", "section": "keys"},
    ),
    Document(
        page_content=(
            "The streaming endpoint supports Server-Sent Events (SSE). "
            "Connect to /v1/stream with your API key. Events include 'message.delta', "
            "'message.complete', and 'error'. Connection timeout is 5 minutes of inactivity."
        ),
        metadata={"source": "api-docs/streaming.md", "section": "sse"},
    ),
    Document(
        page_content=(
            "Error codes: 400 Bad Request (malformed JSON), 401 Unauthorized (invalid key), "
            "403 Forbidden (insufficient permissions), 404 Not Found (invalid endpoint), "
            "429 Too Many Requests (rate limited), 500 Internal Server Error (retry with backoff)."
        ),
        metadata={"source": "api-docs/errors.md", "section": "codes"},
    ),
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True,
)
splits = text_splitter.split_documents(DOCS)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = InMemoryVectorStore.from_documents(splits, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

Chunk size matters more than you think. Too small (100 chars) and you lose context — the retriever returns sentence fragments that the grader can't evaluate. Too large (2000+ chars) and you dilute relevance — a chunk about "rate limits AND authentication AND billing" matches everything and answers nothing. 500 characters with 50-char overlap is a reasonable starting point for technical docs. Measure it with your evals, not your intuition.
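To see what chunk size and overlap mean mechanically, here is a deliberately naive fixed-window chunker. It is not how RecursiveCharacterTextSplitter works, since that splitter prefers natural separators; this is only a sketch of the two parameters, with a hypothetical function name.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "".join(str(i % 10) for i in range(1200))  # 1,200 characters of filler
chunks = chunk_text(text, chunk_size=500, overlap=50)
print([len(chunk) for chunk in chunks])  # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True
```

The overlap is what lets a sentence that straddles a boundary appear whole in at least one chunk, at the cost of indexing some text twice.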

Node 1: Retrieve

The retriever converts the query into an embedding, searches the vector store, and returns the top-k documents. Simple, and the node most people over-engineer.


@traceable(name="retrieve", run_type="retriever")
def retrieve_node(state: RAGState) -> dict:
    query = state.get("rewritten_query") or state["question"]
    results = retriever.invoke(query)
    documents = [
        {
            "content": doc.page_content,
            "metadata": doc.metadata,
        }
        for doc in results
    ]
    return {"documents": documents}

Notice: we use rewritten_query if it exists, otherwise fall back to the original question. This is how the retry loop works — the rewrite node updates rewritten_query, and the retriever picks it up on the next pass.

Node 2: Grade Relevance

This is the node that makes corrective RAG work. Use structured output to get a binary relevance score from the LLM:


from pydantic import BaseModel, Field

class RelevanceGrade(BaseModel):
    """Grade the relevance of retrieved documents to the user question."""
    relevant: bool = Field(
        description="Are the documents relevant to answering the question?"
    )
    reasoning: str = Field(
        description="Brief explanation of the relevance assessment."
    )

grading_llm = llm.with_structured_output(RelevanceGrade)

@traceable(name="grade_relevance", run_type="chain")
def grade_node(state: RAGState) -> dict:
    docs_text = "\n\n".join(
        f"[Source: {d['metadata'].get('source', 'unknown')}]\n{d['content']}"
        for d in state["documents"]
    )

    grade = grading_llm.invoke([
        {
            "role": "system",
            "content": (
                "You are a relevance grader. Given a user question and retrieved documents, "
                "determine if the documents contain information that can answer the question. "
                "Be strict: if the documents are tangentially related but don't actually "
                "answer the question, mark as not relevant."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Question: {state['question']}\n\n"
                f"Retrieved documents:\n{docs_text}"
            ),
        },
    ])

    return {
        "relevance_score": 1.0 if grade.relevant else 0.0,
    }

Why structured output instead of a prompt that says "respond yes or no"? Because free-text parsing is brittle. The LLM might say "Yes, mostly relevant" or "Somewhat" or "The documents partially address..." and now you're writing regex. with_structured_output forces a clean boolean — no parsing, no ambiguity, no silent failures when the LLM gets creative with its formatting.

Node 3: Rewrite Query

When retrieval quality is low, rewrite the query to be more specific. The LLM sees the original question and the failed documents, so it can identify what's missing:


@traceable(name="rewrite_query", run_type="chain")
def rewrite_node(state: RAGState) -> dict:
    failed_docs = "\n".join(
        d["content"][:200] for d in state["documents"]
    )

    response = llm.invoke([
        {
            "role": "system",
            "content": (
                "You are a query rewriter. The original query returned irrelevant documents. "
                "Rewrite the query to be more specific and targeted. "
                "Focus on the key technical terms and concepts the user is asking about. "
                "Return ONLY the rewritten query, nothing else."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Original question: {state['question']}\n\n"
                f"Documents returned (not relevant):\n{failed_docs}\n\n"
                "Rewrite the query to find more relevant documents:"
            ),
        },
    ])

    return {
        "rewritten_query": response.content,
        "rewrite_count": state["rewrite_count"] + 1,
    }

Node 4: Generate with Citations

The generator doesn't just produce an answer — it maps every claim to a source document. This makes hallucination detectable:


@traceable(name="generate", run_type="chain")
def generate_node(state: RAGState) -> dict:
    docs_text = "\n\n".join(
        f"[{i+1}] (Source: {d['metadata'].get('source', 'unknown')})\n{d['content']}"
        for i, d in enumerate(state["documents"])
    )

    is_low_confidence = state.get("relevance_score", 0) < 1.0

    system_prompt = (
        "You are a technical documentation assistant. Answer the user's question "
        "using ONLY the provided documents. For every claim, cite the source using "
        "[1], [2], etc. If the documents don't contain enough information to fully "
        "answer the question, say so explicitly — do not make up information."
    )

    if is_low_confidence:
        system_prompt += (
            "\n\nWARNING: The retrieved documents may not be fully relevant to the question. "
            "Be especially careful to only state what the documents support. "
            "Prefix your answer with 'Note: This answer is based on limited context.'"
        )

    response = llm.invoke([
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": (
                f"Question: {state['question']}\n\n"
                f"Documents:\n{docs_text}"
            ),
        },
    ])

    citations = [
        {"index": i + 1, "source": d["metadata"].get("source", "unknown")}
        for i, d in enumerate(state["documents"])
    ]

    confidence = "low" if is_low_confidence else "high"

    return {
        "answer": response.content,
        "citations": citations,
        "confidence": confidence,
    }

Graph Assembly

The routing logic is where the corrective pattern lives. Grade → decide → rewrite or generate:


from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy

def route_after_grading(state: RAGState) -> str:
    if state.get("relevance_score", 0) >= 1.0:
        return "generate"
    if state["rewrite_count"] >= state["max_rewrites"]:
        return "generate"
    return "rewrite"

builder = StateGraph(RAGState)

builder.add_node("retrieve", retrieve_node, retry=RetryPolicy(max_attempts=3))
builder.add_node("grade", grade_node, retry=RetryPolicy(max_attempts=3))
builder.add_node("rewrite", rewrite_node, retry=RetryPolicy(max_attempts=3))
builder.add_node("generate", generate_node, retry=RetryPolicy(max_attempts=3))

builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "grade")
builder.add_conditional_edges(
    "grade",
    route_after_grading,
    {"generate": "generate", "rewrite": "rewrite"},
)
builder.add_edge("rewrite", "retrieve")
builder.add_edge("generate", END)

graph = builder.compile()

The rewrite → retrieve → grade cycle is the corrective loop. max_rewrites caps it at 2 by default — enough to refine a vague query, not enough to burn through your API budget on a genuinely unanswerable question.
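The bound on the loop can be checked without running the graph. The sketch below copies the routing rule into a plain function and simulates a question whose documents never grade as relevant. `count_retrievals` is a hypothetical helper written for this illustration, not part of the pipeline.

```python
def route_after_grading(state: dict) -> str:
    # Same rule as the graph's router
    if state.get("relevance_score", 0) >= 1.0:
        return "generate"
    if state["rewrite_count"] >= state["max_rewrites"]:
        return "generate"
    return "rewrite"

def count_retrievals(max_rewrites: int) -> int:
    """Count retrieval passes for a question that never gets relevant documents."""
    state = {"relevance_score": 0.0, "rewrite_count": 0, "max_rewrites": max_rewrites}
    retrievals = 1  # the initial retrieve -> grade pass
    while route_after_grading(state) == "rewrite":
        state["rewrite_count"] += 1  # what rewrite_node does
        retrievals += 1              # rewrite -> retrieve -> grade again
    return retrievals

worst_case = count_retrievals(max_rewrites=2)
```

With max_rewrites set to 2, the worst case is three retrieval passes and three grading calls before the graph falls through to generation with low confidence.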

Running the Pipeline


from langsmith import tracing_context

with tracing_context(
    metadata={"pipeline": "corrective-rag", "version": "v1"},
    tags=["production", "rag-v1"],
):
    result = graph.invoke({
        "question": "What are the rate limits for the batch API?",
        "rewritten_query": "",
        "documents": [],
        "relevance_score": 0.0,
        "rewrite_count": 0,
        "max_rewrites": 2,
        "answer": "",
        "citations": [],
        "confidence": "",
    })

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']}")
print(f"Rewrites: {result['rewrite_count']}")
print(f"Citations: {result['citations']}")

Production RAG Pipeline Failures

These are the failure modes that separate a demo from a production system.

1. Retrieval Drift. The query is "batch API rate limits" and the retriever returns documents about real-time API rate limits. The content is about rate limits, so naive RAG generates a confident answer, but about the wrong API. The grading node catches this because "Tier 1: 500 RPM" doesn't answer a question about batch processing. Fix: the relevance grader needs to be strict about whether documents answer this specific question, not just whether they're in the same topic area.

2. Hallucinated Citations. The generator cites "[1]" but the claim it's supporting doesn't appear in document [1]. The citation exists, the source exists, but the mapping between claim and source is fabricated. This is almost impossible to catch without a dedicated faithfulness eval. Fix: the citations field in state makes this auditable. Your eval checks whether each cited claim is actually supported by the cited document.

3. Context Window Overflow. You retrieve 10 documents at k=10, each is 500 chars. That's 5,000 chars of context before the question and system prompt. Sounds fine. Then a user asks a compound question, the rewriter expands it, and on the second retrieval pass you're stuffing 10,000 chars of context. The generator starts ignoring the later documents. Fix: cap total context length, not just k. And put the most relevant documents first — LLMs have a recency and primacy bias even in their context window.

4. Rewrite Loop Hallucination. The rewriter makes the query more specific but also less accurate. "What's the rate limit?" becomes "What is the maximum requests per second for the enterprise streaming endpoint?" — a question that's more specific but targets a concept that doesn't exist in your docs. The second retrieval is even worse. Fix: the rewriter should see the failed documents so it knows what the corpus actually contains. Our implementation passes failed_docs to the rewriter for exactly this reason.

5. Embedding Staleness. Your docs update weekly. Your embeddings update monthly. For three weeks out of every four, the retriever is searching a stale index. New features return zero results; deprecated features return confident but wrong answers. Fix: re-index on every doc update. If that's too expensive, at minimum track the embedding timestamp and surface a "stale index" warning when results are older than your update threshold.
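Two of the fixes above are small enough to sketch in plain Python. This is an illustration under assumptions, not the pipeline's code: `cap_context` and `dangling_citations` are hypothetical helpers, documents are assumed to arrive already ordered by relevance, and the sample answer text is made up. The citation check is only structural. It finds markers that point at nothing and does not replace a faithfulness eval.

```python
import re

def cap_context(documents: list[dict], max_chars: int) -> list[dict]:
    """Keep documents in their given order until the character budget is spent."""
    kept, used = [], 0
    for doc in documents:
        size = len(doc["content"])
        if used + size > max_chars:
            break  # stop at the first document that would exceed the budget
        kept.append(doc)
        used += size
    return kept

def dangling_citations(answer: str, citations: list[dict]) -> set[int]:
    """Return markers like [4] that do not match any entry in the citations list."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    known = {c["index"] for c in citations}
    return cited - known

# Three 500-char documents against a 1,200-char budget
docs = [{"content": "x" * 500} for _ in range(3)]
kept = cap_context(docs, max_chars=1200)

# Made-up answer that cites a source index the pipeline never returned
answer = "Batch jobs are asynchronous [1]. Retries are free [4]."
citations = [
    {"index": 1, "source": "api-docs/batch-api.md"},
    {"index": 2, "source": "api-docs/errors.md"},
]
missing = dangling_citations(answer, citations)
```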

Observability

The @traceable decorator on every node gives you per-step visibility in LangSmith. For RAG specifically, you want to see the retrieval→grading→generation flow in one trace:


from langsmith import tracing_context

with tracing_context(
    metadata={
        "rag_version": "corrective-v1",
        "corpus_size": len(splits),
        "embedding_model": "text-embedding-3-small",
    },
    tags=["production", "corrective-rag"],
):
    result = graph.invoke({
        "question": "How do I authenticate with the API?",
        "rewritten_query": "",
        "documents": [],
        "relevance_score": 0.0,
        "rewrite_count": 0,
        "max_rewrites": 2,
        "answer": "",
        "citations": [],
        "confidence": "",
    })

In the LangSmith trace, you can see:

- The retriever span showing which documents were returned and their similarity scores

- The grading span showing the structured relevance assessment

- Whether the rewrite loop fired (and how many times)

- The generator span showing the final prompt with context

- Total token usage and latency per node

The first thing to check when answer quality degrades: are rewrites spiking? If rewrite_count averages above 0.5 across queries, your retrieval quality has drifted — probably stale embeddings or a corpus change that shifted the embedding space.
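A minimal sketch of that check, assuming you can collect the final state of each run as a dict with a `rewrite_count` key. The helper names are hypothetical, the sample counts are made up, and the 0.5 threshold is the figure quoted above.

```python
from statistics import mean

REWRITE_ALERT_THRESHOLD = 0.5  # average rewrites per query, from the text above

def rewrite_rate(results: list[dict]) -> float:
    """Average number of rewrites per query across a batch of pipeline results."""
    if not results:
        return 0.0
    return mean(r.get("rewrite_count", 0) for r in results)

def rewrites_spiking(results: list[dict], threshold: float = REWRITE_ALERT_THRESHOLD) -> bool:
    return rewrite_rate(results) > threshold

# Made-up batches for illustration
healthy = [{"rewrite_count": c} for c in (0, 0, 0, 1)]
drifting = [{"rewrite_count": c} for c in (0, 0, 1, 2)]
```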

Evals for Retrieval Augmented Generation

RAG evals need three axes: retrieval quality, answer faithfulness, and answer relevance. Skipping any one of them leaves a blind spot that will eventually become a production incident.


from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="corrective-rag-evals",
    description="Corrective RAG pipeline evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What are the rate limits for the batch API?"},
        {"question": "How do I authenticate API requests?"},
        {"question": "What error code means I've been rate limited?"},
        {"question": "What is the connection timeout for streaming?"},
    ],
    outputs=[
        {"expected_topics": ["batch", "50,000", "asynchronous"], "expected_source": "api-docs/batch-api.md"},
        {"expected_topics": ["API key", "Authorization header", "rotate"], "expected_source": "api-docs/authentication.md"},
        {"expected_topics": ["429", "rate limit"], "expected_source": "api-docs/errors.md"},
        {"expected_topics": ["5 minutes", "SSE", "timeout"], "expected_source": "api-docs/streaming.md"},
    ],
)

FAITHFULNESS_PROMPT = """\
Question: {inputs[question]}
Retrieved documents: {outputs[documents]}
Generated answer: {outputs[answer]}

Rate 0.0-1.0 on faithfulness: Is every claim in the answer supported by the retrieved documents?
A score of 1.0 means no hallucinated information. A score of 0.0 means the answer fabricates claims.

Return ONLY: {{"score": <float>, "reasoning": "<explanation>"}}"""

RELEVANCE_PROMPT = """\
Question: {inputs[question]}
Generated answer: {outputs[answer]}

Rate 0.0-1.0 on relevance: Does the answer actually address the user's question?
A score of 1.0 means the answer directly and completely addresses the question.
A score of 0.0 means the answer is off-topic or doesn't address the question.

Return ONLY: {{"score": <float>, "reasoning": "<explanation>"}}"""

faithfulness_judge = create_llm_as_judge(
    prompt=FAITHFULNESS_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="faithfulness",
)

relevance_judge = create_llm_as_judge(
    prompt=RELEVANCE_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="relevance",
)

def retrieval_precision(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Did the retriever find documents from the expected source?"""
    expected_source = reference_outputs.get("expected_source", "")
    retrieved_sources = [
        d.get("metadata", {}).get("source", "")
        for d in outputs.get("documents", [])
    ]
    hit = any(expected_source in s for s in retrieved_sources)
    return {"key": "retrieval_precision", "score": 1.0 if hit else 0.0}

def topic_coverage(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Does the answer mention the key topics?"""
    answer = outputs.get("answer", "").lower()
    topics = reference_outputs.get("expected_topics", [])
    hits = sum(1 for t in topics if t.lower() in answer)
    return {
        "key": "topic_coverage",
        "score": hits / len(topics) if topics else 1.0,
    }

def rewrite_efficiency(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """How many rewrites were needed? Fewer is better."""
    count = outputs.get("rewrite_count", 0)
    if count == 0:
        score = 1.0
    elif count == 1:
        score = 0.7
    else:
        score = 0.4
    return {"key": "rewrite_efficiency", "score": score}

def target(inputs: dict) -> dict:
    return graph.invoke({
        "question": inputs["question"],
        "rewritten_query": "",
        "documents": [],
        "relevance_score": 0.0,
        "rewrite_count": 0,
        "max_rewrites": 2,
        "answer": "",
        "citations": [],
        "confidence": "",
    })

results = evaluate(
    target,
    data="corrective-rag-evals",
    evaluators=[
        faithfulness_judge,
        relevance_judge,
        retrieval_precision,
        topic_coverage,
        rewrite_efficiency,
    ],
    experiment_prefix="corrective-rag-v1",
    max_concurrency=4,
)

retrieval_precision is the canary. If it drops below 0.8, your embeddings are stale or your chunk size is wrong, so fix the retrieval layer before touching prompts. faithfulness catches hallucinated citations. relevance catches answers that are grounded in the documents but don't address the question. topic_coverage catches incomplete answers. rewrite_efficiency monitors whether the corrective loop is firing too often. Run all five on every PR that changes retrieval, prompts, or document processing.
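One way to turn that threshold into a PR gate is sketched below. It works on plain per-example score dicts rather than on the object returned by evaluate, so how you extract scores from an experiment is left out. The helper names, the sample scores, and the faithfulness threshold are all assumptions. Only the 0.8 retrieval_precision floor comes from the text.

```python
def mean_scores(rows: list[dict]) -> dict:
    """Average each metric across per-example score rows."""
    collected: dict[str, list[float]] = {}
    for row in rows:
        for key, score in row.items():
            collected.setdefault(key, []).append(score)
    return {key: sum(scores) / len(scores) for key, scores in collected.items()}

def failing_metrics(rows: list[dict], thresholds: dict) -> list[str]:
    """Return the metrics whose average falls below the configured minimum."""
    averages = mean_scores(rows)
    return sorted(
        key for key, minimum in thresholds.items()
        if averages.get(key, 0.0) < minimum
    )

THRESHOLDS = {
    "retrieval_precision": 0.8,  # from the text
    "faithfulness": 0.85,        # hypothetical floor, tune for your corpus
}

# Made-up scores for four examples
rows = [
    {"retrieval_precision": 1.0, "faithfulness": 0.9},
    {"retrieval_precision": 0.0, "faithfulness": 0.9},
    {"retrieval_precision": 1.0, "faithfulness": 1.0},
    {"retrieval_precision": 1.0, "faithfulness": 0.8},
]
failures = failing_metrics(rows, THRESHOLDS)
```

In CI, a non-empty `failures` list would fail the build and name the layer to look at first.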

When to Use This

- Your corpus has overlapping topics (rate limits for different APIs, config for different services)

- Wrong answers are worse than slow answers (legal, medical, financial docs)

- You need auditability — every answer traced back to source documents

- Your retrieval quality varies (some queries match perfectly, others return tangential results)

When Not to Use This

- Your corpus is small and well-differentiated (a 10-page FAQ)

- Latency budget is under 3 seconds

- You're building a search UI, not a Q&A system (show results, don't generate answers)

- Retrieval quality is consistently high (>95% of queries return relevant docs)

The Bottom Line

Naive RAG is a demo. Corrective RAG is a system. The difference is a grading node, a rewrite loop, and a confidence flag — maybe 40 lines of extra code and 1.5 seconds of extra latency on the retry path.

The architecture is: retrieve, grade, rewrite if bad, generate with citations. The grading node is the key — without it, you're trusting the retriever to be right every time, and it won't be. The citations field makes hallucinations detectable. The rewrite loop makes bad queries fixable. The confidence flag tells the caller when to trust the answer and when to escalate.

Ship the faithfulness eval before you ship the pipeline. If your retrieval precision drops below 0.8, fix your embeddings before you fix your prompts. And set max_rewrites to 2 — if two query rewrites can't find relevant context, a third won't either.

Austin Vance is the CEO of Focused, where we build AI-powered software that solves complex problems in production. If you're building RAG pipelines and want to talk retrieval strategies, reach out.

Technical References:

Corrective RAG pipeline GitHub Repo

Build a custom RAG agent with LangGraph

Evaluate a RAG application with LangSmith

Build a semantic search engine with LangChain

Agent Prompt Caches Are a Runtime Boundary | Focused Labs

Austin Vance — Wed, 01 Jul 2026 23:06:14 +0000

Prompt caching turns context order into runtime architecture.

Moving a timestamp, session id, memory write, or even a tool schema to the front of an agent’s prompt can single-handedly destroy the cache economics of every long-running task. The failure is quiet: the model still answers, the trace looks normal, and the agent works more or less as before. The higher bill and slower first token are barely noticeable at first and only become apparent later, once they have compounded across expensive agents.

Fresh work from LangChain names an architecture decision that has been hiding in plain sight for everyone using Deep Agents. The numbers are blunt: properly implemented prompt caching, as opposed to a simple cache of the last N tokens, can reduce inference token cost by 41 to 80%. The more interesting number comes from Deep Agents eval trajectories in LangChain’s harness: with provider-aware caching turned on, the average token-cost reduction lands between 49 and 80%.

A prompt cache is a test for stable context.

The cache boundary lives inside the prompt

Provider prompt caches work off of prefixes. The reusable part of the prompt must occur in the same place, in the same order, with the exact same text, above the provider’s minimum token threshold. Cache hits for OpenAI occur on exact prefix matches, starting at 1,024 tokens, with stable prompt content first and the request-specific part last. OpenAI also exposes prompt_cache_key to help with routing locality for requests with reusable prefixes.
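The prefix rule is easy to see with a toy comparison. The sketch below uses whitespace-split words as a stand-in for provider tokens, which is an assumption for illustration only, and the prompt text and `shared_prefix_len` helper are made up. It measures how much of two requests matches from the start, first with the stable content leading and then with a request id prepended.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading items that are identical in both sequences."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count

STABLE = "You are a support agent. Follow the refund policy."

# Stable content first, request-specific content last
req_a = f"{STABLE} User: where is my order?".split()
req_b = f"{STABLE} User: cancel my plan".split()
stable_first = shared_prefix_len(req_a, req_b)

# Same content, but a per-request id was prepended
req_c = f"request_id=a1 {STABLE} User: where is my order?".split()
req_d = f"request_id=b2 {STABLE} User: cancel my plan".split()
volatile_first = shared_prefix_len(req_c, req_d)
```

Nothing about the instructions changed between the two pairs. Only the position of the volatile field did, and the reusable prefix dropped to zero.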

That turns prompt assembly into architecture.

The stable prefix is the start of the prompt. It holds the ‘boring’ parts that used to be called prompt boilerplate: system prompt, developer prompt, tool prompt, output contract, examples, skills, static policy, project memory, and other material reused across turns. The dynamic suffix holds the current input plus relevant context: retrieved snippets, current memory writes, tool output, timestamps, request metadata, auth-scoped resource hints, and whatever was generated in the current turn.

Having a graph, an eval set, and a model router does not automatically mean there is an agentic AI architecture. The system leaks money as soon as stable context is interleaved with volatile context and the mix is resent on every turn.

Individually, small changes to the prompt don’t seem to do much harm, but together they ruin cache hits. A new middleware prepends a request_id to the system prompt. A user profile summary moves to the top of the prompt because that is what a “personalized” AI is supposed to look like. A retrieval component appends fresh citations to the top-level instruction. A tool registry is sorted by last-use time. A skill loader inlines the complete selected skill before the stable part of the prompt. Each modification is defensible by itself, but in total they turn the prompt cache into a rumor.

This is why context as product architecture keeps coming back. Seen as architecture, context is more than text to an agent: it is ordered state with associated cost and latency, and it also carries permissions and failure modes.
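The tool-registry case shows how little it takes. Below is a minimal sketch, assuming tool definitions are plain dicts with a `name` key. The tool names and both hash helpers are hypothetical. Serializing tools in a canonical order keeps the prefix byte-identical even when the registry reorders them by last use.

```python
import hashlib
import json

def naive_prefix_hash(tools: list[dict]) -> str:
    """Hash the tool block exactly as the registry happens to order it."""
    return hashlib.sha256(json.dumps(tools).encode("utf-8")).hexdigest()

def canonical_prefix_hash(tools: list[dict]) -> str:
    """Hash the tool block after sorting tools by name and keys alphabetically."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

tools_before = [
    {"name": "search_docs", "description": "Search the docs", "parameters": {"type": "object"}},
    {"name": "create_ticket", "description": "Open a ticket", "parameters": {"type": "object"}},
]
# Same tools, reordered the way a last-use sort might leave them
tools_after = list(reversed(tools_before))

naive_changed = naive_prefix_hash(tools_before) != naive_prefix_hash(tools_after)
canonical_stable = canonical_prefix_hash(tools_before) == canonical_prefix_hash(tools_after)
```

A hash like this is also a cheap regression test: if it changes between deploys without a deliberate tool change, something volatile has leaked into the prefix.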

Cache hits survive when the runtime keeps stable context ahead of volatile work.

The harness owns the shape

The provider can offer the cache. The agent harness owns the runtime boundary because it decides what the provider sees.

In LangChain’s Deep Agents stack, the relevant line is create_deep_agent. The docs say caching is enabled by default for Anthropic models and Bedrock Claude or Nova models, and is applied to static parts of the system prompt that repeat on every turn.

The interesting part of the default stack is that provider-specific prompt caching happens after the patch/custom middleware and before the memory middleware. The Deep Agents customization docs note that memory is placed later so that memory updates do not accidentally invalidate the cache prefix. This is the crux of the argument against treating caching as an SDK setting buried near the model call. Cache policy is a runtime boundary, just like the rest of the system.

This is also where lazy-loading tool schemas into context makes sense for running agents. Tool selection in an MCP-powered runtime is critical for performance and control. Dumping every tool schema into the prompt buys nothing and churns the cache whenever the tool list changes. Static tool definitions should be placed in the prefix deliberately so they earn cache hits, not end up there by accident because the entire MCP registry was added to the prompt.

Skills are runtime-governed artifacts. They can be durable operating knowledge, cacheable prefix content, or raw unreviewed payloads. Once skills affect cache hit rates or provider bills, they fall outside a narrow security view and become a runtime concern.

Provider differences are runtime inputs

The biggest problem is that each provider exposes a different caching model.

Anthropic supports explicit cache breakpoints, applied to tools, system, and messages in that order. Cached content has a default TTL of 5 minutes, which can be extended to 1 hour for content that lives longer. Anthropic’s caching documentation also details the pricing for cache writes and reads, as well as the hard limit of 4 breakpoints per request. That limit matters when an agent has tool definitions, a system prompt, examples, and conversation state all competing for breakpoints.

OpenAI applies prompt caching automatically on supported models. The cache rewards a runtime that puts the stable parts of the input first and the variable parts last, so the runtime controls it through input layout and cache-key routing. Where requests share prefixes, the cache key should be scoped to the point at which those prefixes diverge, to maximize hits. The OpenAI caching docs also carry an important detail: traffic that is too broad for a single cache key can overflow locality and decrease the hit rate.

Gemini splits caching between implicit caching on newer models such as Gemini 2.5 and explicit cached-content objects for large context that is reused. With explicit caching, the runtime decides that a large document, a long audio or video transcript, or a similarly long set of written instructions should be cached as a named artifact for repeated use, rather than relying on implicit prefix matches. Google's docs show cachedContent objects with a TTL, similar in spirit to the checkpoint-style caches other providers offer.

Bedrock’s caching also works through cache checkpoints, with model-specific minimum token counts, supported fields, TTLs, and checkpoint limits. Amazon's docs are a reminder that enterprise deployment details eventually surface: model family, endpoint type, region, and checkpoint limits all affect caching behavior.
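As a rough illustration of the breakpoint model, the sketch below builds a request payload as a plain dict and counts its cache markers before anything is sent. The `cache_control` field with type `ephemeral` and the optional `ttl` value follow my reading of Anthropic's prompt caching docs, so verify them against the current API reference. The tool, the prompt text, and the `count_breakpoints` helper are hypothetical.

```python
MAX_BREAKPOINTS = 4  # the hard limit quoted above

def count_breakpoints(node) -> int:
    """Recursively count cache_control markers in a request payload."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        nested = sum(
            count_breakpoints(value)
            for key, value in node.items()
            if key != "cache_control"
        )
        return own + nested
    if isinstance(node, list):
        return sum(count_breakpoints(item) for item in node)
    return 0

request = {
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    # Breakpoint 1: marks the end of the (static) tool definitions
    "tools": [
        {
            "name": "lookup_order",  # hypothetical tool
            "description": "Look up an order by id.",
            "input_schema": {"type": "object", "properties": {}},
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Breakpoint 2: long-lived system prompt with the longer TTL
    "system": [
        {
            "type": "text",
            "text": "You are a support agent. Follow the refund policy.",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    # Volatile content stays after the breakpoints
    "messages": [{"role": "user", "content": "Where is my order?"}],
}

used = count_breakpoints(request)
within_limit = used <= MAX_BREAKPOINTS
```

Counting markers in the harness, before the request leaves the process, is one way to make the breakpoint budget a reviewed decision rather than a runtime error.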

A generic model interface is enough to hide how a model is invoked. It cannot hide the cache details that matter: one provider rewards exact prefix layout, another requires explicit breakpoints on tools, system, or messages, another stores a large piece of context as cached content instead of matching a prefix, and another uses cache checkpoints with rules for field support, TTL, and checkpoint count. Those details become routing keys, object IDs, and provider policy in the runtime.

Model routing is an architecture boundary because routing a task from OpenAI to Anthropic or from Gemini to Bedrock changes the cache controls available to the agent runtime. Treat the providers as strings and the cache boundary has no owner.

Provider APIs do not expose the same cache knobs, so a generic wrapper is not enough.
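One way a runtime might give the cache boundary an owner is a per-provider policy table instead of a bare provider string. The sketch below is hypothetical throughout: the `CachePolicy` type, its fields, and the table entries are a summary of the provider descriptions in this section, not any SDK's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CachePolicy:
    mechanism: str       # how the provider decides what is cached
    runtime_handle: str  # what the runtime has to track for that provider

CACHE_POLICIES = {
    "openai": CachePolicy("exact prefix match", "prompt_cache_key"),
    "anthropic": CachePolicy("explicit breakpoints", "breakpoint placement and TTL"),
    "gemini": CachePolicy("implicit prefix or cached-content object", "cachedContent id and TTL"),
    "bedrock": CachePolicy("cache checkpoints", "checkpoint placement and TTL"),
}

def policy_for(provider: str) -> CachePolicy:
    """Fail loudly when a model route has no cache policy attached."""
    try:
        return CACHE_POLICIES[provider]
    except KeyError:
        raise ValueError(f"no cache policy registered for provider {provider!r}") from None

openai_policy = policy_for("openai")
```

The useful property is the failure: routing a task to a provider with no registered policy raises instead of silently running uncached.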

Cache hits belong in traces

Prompt caching has an observability shape.

The cache information is not complicated. Every serious runtime should record the cache policy applied to each model call: the stable-prefix hash, the prompt sections included in the cacheable prefix, the number of cache breakpoints, the provider’s cache key or cached-content id, the cache write and read tokens, the TTL, and the reason the cache was skipped when it was. This information should be visible in the same trace as model output, tool usage, cost, latency, retries, and eval data. The runtime should record it explicitly rather than leave it buried inside a provider SDK or an internal component. LangChain's prompt-caching model docs frame caching across implicit provider caching, provider-level controls, and middleware. Cache reporting in a runtime should follow that model rather than being flattened into input tokens and output tokens.

Cost is similarly observable. Agent cost is a runtime signal when the runtime can connect a bill spike to a changed tool schema, memory injector, model route, or prompt template. Without that connection, cost review is finance cleanup: a monthly bill where the team can only guess which agent got expensive.

An unexplained prompt-cache miss is a well-behaved runtime bug: it quietly makes the agent slower and more expensive while the dashboards stay green.
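The fields listed above fit in a small record attached to each model call. This is a sketch under assumptions: `CacheTrace`, its field names, and the sample values are hypothetical, and how the record reaches your tracing backend is left out.

```python
import hashlib
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class CacheTrace:
    stable_prefix_hash: str            # hash of the cacheable prefix text
    prefix_sections: list[str]         # prompt sections included in the prefix
    breakpoints: int                   # number of cache breakpoints used
    provider_cache_ref: Optional[str]  # cache key or cached-content id
    cache_write_tokens: int
    cache_read_tokens: int
    ttl_seconds: Optional[int]
    skip_reason: Optional[str] = None  # set when caching was not applied

    def hit_ratio(self, input_tokens: int) -> float:
        """Share of input tokens that were served from the cache."""
        return self.cache_read_tokens / input_tokens if input_tokens else 0.0

prefix_text = "system prompt + tool schemas"  # placeholder for the real prefix
trace = CacheTrace(
    stable_prefix_hash=hashlib.sha256(prefix_text.encode("utf-8")).hexdigest()[:12],
    prefix_sections=["system", "tools"],
    breakpoints=2,
    provider_cache_ref="support-agent-v3",  # made-up cache key
    cache_write_tokens=0,
    cache_read_tokens=1500,
    ttl_seconds=300,
)
ratio = trace.hit_ratio(input_tokens=2000)
record = asdict(trace)  # plain dict, ready to attach as trace metadata
```

Keeping `skip_reason` as a first-class field matters most: a miss with a recorded reason is a decision, and a miss without one is the silent bug this section describes.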

The runtime rule is simple

Place the stable context first and the volatile context last. Make each provider’s cache controls explicit in the corresponding middleware. Record cache behavior in traces. Change cache policy the way you change runtime parameters, because a cache policy change is a runtime change.

The hard part is ownership.

Everyone has a reasonable request: product wants personalization, security wants policy coverage, platform wants to abstract away provider-specific details, finance wants the caching discount, and engineering wants the prompt to keep working after memory, tools, skills, and retrieval have all changed. The runtime has to satisfy all of these requests and still produce a context layout that survives live-system changes.

That is agentic AI architecture now. The graph matters. The tools matter. The evals matter. The cache boundary also matters because it is where cost, latency, context, and provider behavior meet.

The teams that own the cache boundary will make long-running agents cheaper without making them stupid. The others will learn the hard way that discounts don’t last when architecture is accidental and prompt caching is treated like a provider toggle.