DEV Community: James O'Connor

Step-level agent evals exist now. Most teams still grade the finish line.

James O'Connor — Fri, 24 Jul 2026 08:02:38 +0000

The thesis: an agent that fails at step 2 of 7 and an agent that fails at step 7 of 7 get the same score from an outcome eval, and they should not. The first one picked the wrong tool while holding the right context; the second one did everything right and hit a flaky API. Those are different bugs with different fixes, and the outcome eval hands back one number that cannot tell them apart. Through 2024 and most of 2025 the tooling mostly could not see the difference. As of July 2026 it can, across at least six frameworks, and the interesting divide is no longer "can you evaluate agents" but which step-level questions each tool answers deterministically versus by asking another model.

In June I argued that the trajectory, not the final answer, is the unit of agent evaluation. This is the follow-up: the question has moved from whether the tooling can see the path to which step-level questions it answers deterministically. I spent this week reading the current source and docs of six frameworks to map that divide. Everything below is as of July 2026; all six move fast, so treat version-dependent claims as dated the day you read them. Tools in alphabetical order throughout, because the ranking depends on your stack anyway.

The four questions a step-level eval can answer

When people say "evaluate the trajectory," they mean at least four separable checks, and conflating them is how teams buy the wrong tool.

1. Tool choice. Given the state at step i, was this the right tool to call at all.
2. Argument correctness. Right tool, but were the parameters right. (In my experience this is where the production failures actually live: the tool choice is right and one parameter is subtly wrong, a date filter scoped a day too wide, an ID passed where a name belongs.)
3. Path quality. Right calls, wrong shape: loops, backtracking, redundant steps, order violations.
4. Task completion. Did the whole trajectory achieve the goal.

Questions 1 and 4 are judgment calls, and most frameworks route them to an LLM judge. Questions 2 and 3 are checkable by code against a reference or a rule, and the frameworks that treat them that way give you something you can gate CI on without inheriting a judge's variance.
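
To make "checkable by code" concrete, here is a minimal sketch of questions 2 and 3 as plain functions. The trajectory shape (a list of tool-name and argument pairs) and the function names are assumptions for illustration, not any framework's API.

```python
from typing import Any

# Hypothetical trajectory shape: one (tool_name, args) pair per step.
Step = tuple[str, dict[str, Any]]


def args_match(actual: dict[str, Any], expected: dict[str, Any],
               ignore: frozenset[str] = frozenset()) -> bool:
    """Question 2: exact argument equality, skipping volatile keys."""
    def strip(d: dict[str, Any]) -> dict[str, Any]:
        return {k: v for k, v in d.items() if k not in ignore}
    return strip(actual) == strip(expected)


def in_order(actual: list[Step], expected_tools: list[str]) -> bool:
    """Question 3: expected tools appear in this order; extra steps are allowed."""
    names = iter([name for name, _ in actual])
    # `tool in names` consumes the iterator, so each match must come later
    # in the trajectory than the previous one.
    return all(tool in names for tool in expected_tools)
```

Both return the same answer on every run, which is the property that makes them safe to gate CI on. Neither says anything about whether the tool was the right one to pick in the first place.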

What each framework actually checks

Arize Phoenix (Elastic License 2.0). Phoenix splits single-step tool use into three prebuilt LLM-judge evaluators: ToolSelectionEvaluator ("was the correct tool selected"), ToolInvocationEvaluator (arguments and formatting), and ToolResponseHandlingEvaluator (did the agent use the result properly). All three are judge-based classifications. Whole-path evaluation is a cookbook recipe rather than a shipped metric: their docs walk you through writing a code evaluator for path convergence yourself. The tracing underneath is OpenTelemetry-based with a wide set of agent-framework auto-instrumentors, and since June 2026 evals can run as pytest tests in CI. Worth knowing: the license is Elastic 2.0, source-available rather than OSI open source, which matters if your legal team reads licenses closely.

DeepEval (Apache-2.0). The broadest agentic metric menu of the six. Five metrics consume a full execution trace (TaskCompletion, StepEfficiency, PlanAdherence, PlanQuality, plus AgentLoopDetection), and the split between judge and code is explicit in the source: TaskCompletion, StepEfficiency and the plan metrics are LLM judges over the serialized trace, while AgentLoopDetection and ToolPermission are documented as fully deterministic, no API key required. ToolCorrectness is the interesting hybrid: deterministic matching of called tools against expected tools (exact, ordered, or set), with argument matching optional, but hand it the available-tools list and it quietly adds a judged selection score. If you gate CI on it, know which mode you configured. Ships as a pytest plugin, which makes the CI story the most conventional of the six.

Future AGI (Apache-2.0). The eval library takes the opposite bet from Phoenix: agent metrics as deterministic heuristics rather than judges. The trajectory set (task_completion, step_efficiency, tool_selection_accuracy, trajectory_score, plus goal_progress, action_safety and reasoning_quality) scores a structured trajectory input with keyword and sequence heuristics, no model call. The function-calling set parses calls with an AST and checks name match, parameter validation, and either exact match or a weighted accuracy score, including parallel calls. The tradeoff reads both ways: deterministic scoring is fast, free, and identical on every run; the cost is that a keyword-overlap notion of "task completed" is bluntly literal in a way a judge is not. The platform around it spans tracing, simulation, and a gateway, a scope comparable to Phoenix, Langfuse, and LangSmith.

Langfuse (MIT core, ee folder for enterprise plumbing). Langfuse's angle is that the trace is the eval target: evaluators, judge or code, attach to any observation in a multi-step trace, and tool calls arrive in the evaluator's context with names, arguments, and counts. The 23 managed judge templates are quality-flavored (hallucination, correctness, plus Ragas partner metrics like Goal Accuracy and Topic Adherence); there is no off-the-shelf tool-call correctness metric, so step grading means writing a sandboxed code evaluator or a custom judge against the tool-call fields. The agent graph view for traces is in beta. Notably, the eval features live in the MIT tree, and self-hosted deployments get unlimited judge evaluators; the code-evaluator runtime needs an explicit dispatcher config when self-hosting.

LangSmith + agentevals/openevals (platform proprietary; evaluator libraries MIT). The two open libraries are the most complete deterministic trajectory matchers of the group: strict, unordered, subset, and superset match modes over message-plus-tool-call trajectories, with per-tool argument comparators you can override down to a custom equality function. LangGraph users additionally get graph-trajectory matching at the node level. Judge variants exist for the same shapes when you have no reference trajectory. The libraries run standalone under pytest or Vitest with response caching for CI; the LangSmith platform itself, per its own FAQ, is proprietary software, so the split to understand is MIT evaluators, closed dashboard.

Promptfoo (MIT). The change that surprised me most this year: Promptfoo now ships a deterministic trajectory assertion family that runs against traced execution rather than final output. trajectory:tool-used, trajectory:tool-args-match (partial or exact, with ignore-lists for volatile arguments), trajectory:tool-sequence (in-order or exact), and trajectory:step-count, plus a judged trajectory:goal-success and a tool-call F1 scorer. The catch is honest and structural: these require trace data, so your agent must emit OpenTelemetry spans to Promptfoo's receiver before any of it works. Pair that with its declarative configs and CI exit codes and it covers questions 2 and 3 with plain YAML.

The pattern, stated plainly

Lay the six side by side and the divide is clean. Argument correctness and path shape, the checkable questions, are deterministic where they are best developed: agentevals' match modes and comparators, Promptfoo's trajectory family, DeepEval's tool matching and loop detection, Future AGI's AST-based function-calling checks. Tool choice and task completion, the judgment questions, are LLM judges nearly everywhere they are prebuilt, with two exceptions taking the deterministic bet at the cost of bluntness (Future AGI's heuristics) or narrow scope (DeepEval's loop detector).

So the selection question for a team is which of the four questions matches your actual failure mode, and whether you want it answered by code or by a judge. If your agents fail on arguments and paths, the deterministic matchers gate cleanly in CI. If they fail on judgment, on picking the wrong tool while every call is well-formed, you are buying a judge somewhere, and its run-to-run variance comes with it into any blocking check.

One more cost that applies across the board: every step-level anything requires the steps to exist. Traced spans, structured trajectories, instrumented tool calls. The instrumentation tax comes before the first metric fires, whichever framework collects it. Budget for that first; it is most of the adoption work.

Where I'd push back on this

The steelman against step-level evaluation: outcomes are what users experience, and a trajectory metric can flag an "inefficient" path that a model chose for good reasons the metric cannot see. Grading the path risks optimizing agents into brittle choreography, matching the reference trajectory instead of solving the task. That objection lands, and the honest answer is that step-level checks earn their keep as diagnostics and regression tripwires, not as the definition of success. Gate on outcomes plus the deterministic invariants you truly require (no unauthorized tools, no argument corruption, no loops), and use the rest of the trajectory data to explain failures rather than to score them.
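
A rough sketch of what gating on invariants could look like, covering two of the three named above (unauthorized tools and loops). The step shape, the allow-list, and the repeat threshold are all assumptions; arguments are assumed to be JSON-serializable.

```python
import json
from collections import Counter
from typing import Any

Step = tuple[str, dict[str, Any]]  # hypothetical (tool_name, args) shape


def invariant_violations(steps: list[Step], allowed: set[str],
                         max_identical_calls: int = 2) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    violations = [f"unauthorized tool: {name}"
                  for name, _ in steps if name not in allowed]
    # Serialize args so identical calls can be counted as hashable keys.
    counts = Counter((name, json.dumps(args, sort_keys=True))
                     for name, args in steps)
    violations += [f"possible loop: {name} called {n}x with identical args"
                   for (name, _), n in counts.items()
                   if n > max_identical_calls]
    return violations
```

A CI check would fail the run when the list is non-empty and print its contents, leaving every other property of the path as diagnostic data rather than a score.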

Where I hold the line: the claim that outcome evals alone are enough. An outcome eval on a seven-step agent is a test with one assertion at the end of the program. I have never seen a team accept that coverage for ordinary code, and I have not heard a good argument for why agents should be graded that way.

Disclosure: capability descriptions above come from each project's public repositories and documentation, read the week of July 20, 2026. All six ship changes weekly; verify against current docs before deciding.

Our agents reported success on tool calls that had already failed. Here's the pattern.

James O'Connor — Wed, 22 Jul 2026 05:21:25 +0000

I spend most of my time on the seam between a model and the tools it calls. Over the last year, the failure mode that has cost me the most debugging hours is a tool that breaks quietly while the model reports back as if the call had gone fine.

I want to be careful about the claim, because it is easy to overstate. The model is not "lying," and it is (usually) not hallucinating in the way people mean when they say that word. It is reading a tool result that does not clearly say this failed, and it fills the gap the way it fills every gap: by continuing. The fix is almost never in the prompt. It lives in how the tool boundary encodes failure.

Here is the number that made me stop treating this as anecdotal. Over one 30-day window (roughly 41,000 tool invocations across two agents in production), 1,142 calls returned something other than a clean success: a timeout, an error body, a partial write. In 331 of those, about 29%, the model continued the plan as if the call had succeeded. That 29% is the figure I care about, because every one of those is a silent wrong answer with no exception in the logs to catch it.

Four cases, same shape each time.

Case 1: The timeout that came back as an empty string

Our retrieval tool had a 5-second budget. When it blew past that, the wrapper caught the timeout and returned "" (an empty string) rather than raising. The intent was defensive. The effect was that the model read an empty result as the search ran and found nothing, and it confidently told the user there were no matching records.

There were matching records. The search never completed.

The tell (in hindsight) is that "found nothing" and "did not run" collapsed into the same token stream. A human on-call would notice a 5-second gap and a suspiciously empty payload. The model has no clock and no baseline, so it cannot notice either.

Case 2: The 200 that carried an error body

This one is almost a genre. An upstream API returned HTTP 200 with a body of {"status": "error", "message": "rate limited, retry after 30s"}. Our tool wrapper checked the HTTP status code (200, so "success"), serialized the body, and handed it back. The model saw a JSON object with fields in it, treated the fields as data, and reasoned over message as though it were a result.

I would flag the general rule here, because it burned us more than once: transport-level success (the request completed) and application-level success (the thing you asked for happened) are different questions, and a lot of wrappers only answer the first one. If your success check is response.ok, you are trusting the upstream service to never return an error inside a 200. In my experience that trust is misplaced roughly as often as you would expect, which is to say: often enough to matter, rarely enough that you forget about it between incidents.
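
A minimal sketch of a check that answers both questions. The "status" equals "error" convention mirrors the body in Case 2; other upstreams signal failure differently, so the application-level test here is an assumption to adapt per service.

```python
from typing import Any


def application_ok(http_status: int, body: Any) -> bool:
    """True only if transport AND application both report success."""
    # Transport level: did the request complete with a 2xx?
    if not 200 <= http_status < 300:
        return False
    # Application level: did the upstream say, inside the body, that it failed?
    if isinstance(body, dict) and body.get("status") == "error":
        return False
    return True
```

With this in the wrapper, the rate-limited 200 from Case 2 is classified as a failure before the model ever sees its message field.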

Case 3: The partial write that reported as whole

A multi-step tool plan wrote three records: two committed, the third failed on a constraint violation. The orchestration layer returned a single top-level success: true because it read one sub-call's status instead of aggregating all three, and that sub-call was one of the two that committed. The model summarized the operation as complete. Downstream, one record was missing, and nothing in the transcript hinted at it.

Partial failure is the case I think teams underinvest in, because it does not look like a failure from either end. The tool did not throw. The model did not confabulate. The plan just had a hole in the middle that neither side was responsible for noticing. (I will concede this one is as much an orchestration bug as a model-boundary bug. But the model happily papered over it, and that is the part I can actually harden.)

Case 4: The exception stringified into the result field

The one I find hardest to defend against. A tool caught its own exception and returned {"result": repr(e)}. So the result field, the field the model is trained to read as the answer, now contained the text of the exception. The model, gamely, tried to use it. In one trace it read KeyError('user_id') and reported to the user that their ID was KeyError.

In a test that reads as a bug. In front of a customer it reads as us not knowing our own data.

The shared cause, and the thing that actually fixed it

The four cases look different (a timeout, a 200, a partial commit, a swallowed exception), but they share one property: failure was encoded in-band, in the same channel and often the same field as a real result. The model had no structural way to tell "here is your answer" from "here is why you have no answer."

What moved the 29% was not a better system prompt. It was forcing every tool to return an explicit status envelope, and putting the failure signal somewhere the payload can never live:

{
  "ok": false,
  "error_kind": "timeout",   // timeout | upstream_error | partial | exception | not_found
  "retriable": true,
  "partial_results": null,   // present only when error_kind == "partial"
  "data": null,              // populated ONLY when ok == true
  "message": "retrieval exceeded 5s budget; no results fetched"
}

Two rules make the envelope earn its keep. First, data is populated only when ok is true, so the model cannot accidentally read an error message as an answer (Case 4 dies here). Second, not_found is its own error_kind, distinct from timeout, so "ran and found nothing" and "never ran" stop collapsing into the same thing (Case 1 dies here). Cases 2 and 3 need the wrapper to actually check application-level status and to aggregate every sub-call, which the envelope does not do for you, but it at least gives them a place to report the truth once you do.

After we rolled this out across the tool layer, the silent-continue rate over the next comparable window dropped from 29% to about 4% (roughly 46 of 1,180 non-success calls). I do not think 4% is a floor anyone should be proud of, and I have not fully chased down what is left. But going from "one in three quiet failures sails through" to "one in twenty-five" changed which bugs made it to users.

Where I'd push back on this

If I were reviewing my own argument, here is the strongest version of the counter-case, because I think it is partly right.

The steelman: envelopes push complexity onto every tool author, and they invite a false sense of safety. You can hand the model a perfectly structured ok: false and it can still barrel ahead and ignore it, because nothing forces it to branch on that field. The status envelope makes the failure legible, not respected. And a determined team could get most of the same benefit with strict typing and a retry layer that never lets a malformed result reach the model at all, no envelope required. That is a fair hit. The envelope is a convention, and a convention only holds while every tool author keeps honoring it.

My concession: yes. On our own numbers, the residual 4% is mostly the model reading a clean ok: false and continuing anyway, which is exactly the failure the envelope was supposed to prevent. So the envelope solved the encoding problem (the model can no longer confuse an error for data) and only dented the compliance problem (the model still sometimes ignores a well-formed error). Those are two different problems and I conflated them for longer than I would like to admit.

Objections I'd accept: that this is really a contract-design problem, not a model problem; that a hard retry-or-halt layer outside the model is stronger than any envelope the model can choose to ignore; that my 4% is under-investigated and might be papering over a subclass I have not named yet.
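
For the second of those objections, a hard retry-or-halt layer might look roughly like this. It assumes tools return the envelope fields shown earlier; guarded_call and ToolFailure are hypothetical names, and backoff between retries is omitted.

```python
from typing import Any, Callable

Envelope = dict[str, Any]


class ToolFailure(Exception):
    """Raised when a tool fails and the plan must not continue."""


def guarded_call(tool: Callable[[], Envelope], max_retries: int = 2) -> Any:
    """Retry retriable failures, then halt; only clean data reaches the model."""
    last: Envelope = {}
    for _ in range(max_retries + 1):
        last = tool()
        if last.get("ok"):
            return last["data"]
        if not last.get("retriable"):
            break  # retrying a non-retriable failure only burns calls
    raise ToolFailure(f"{last.get('error_kind')}: {last.get('message')}")
```

The branch on ok happens in ordinary code, so the model never gets the chance to ignore it. The orchestrator decides what a ToolFailure means for the plan.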

Objections I wouldn't accept: that better prompting ("if a tool fails, stop and report it") fixes this. We tried the prompt-only version first. It moved the 29% by a couple of points and drifted back within a week. If the failure is not structurally distinguishable from a success, no amount of instruction makes the model reliably see a difference that is not encoded in what it reads.

The short version I would give a teammate: assume every tool will fail silently at least once, and design the boundary so that a silent failure is impossible to read as a success. You still have to get the model to branch on the signal. That is a separate problem, and it only starts once the signal is unambiguous in what the model reads.

The tool-call number I published has expired. The claim it supported has not.

James O'Connor — Mon, 20 Jul 2026 00:24:57 +0000

Function-calling robustness is a model selection problem masquerading as a prompt problem. I proved that with a percentage, which was the mistake.

TL;DR. Two months ago I posted a number in a comment thread: same tool description, 1,000 enterprise queries, one model called the tool correctly 96% of the time and another managed 88%. People liked it. Someone quoted it back at me in a design review last week to justify a model choice, and the quote was doing work the number could not support any more. The claim underneath it is structural and still true: if your agent ignores a tool, the cause is more often the model you picked than the words you wrote. The evidence I gave for it has a shelf life of about one release cycle, and I published it with no expiry date on it. This is what I should have shipped instead: the harness, the metric definition, and the decision rule. Your numbers, not mine.

The design review

We were picking a model for a new extraction flow. Document in, structured record out, one tool call to fetch the customer's contract template before the extraction runs. Ordinary work.

Halfway through, an engineer on the team pulled up my comment. He had found it while searching for exactly this decision, which is how these things go. "O'Connor got 96 versus 88 on this, we should just use the same model."

He was quoting me correctly. That is the part that bothered me.

The test he was quoting ran in late 2025. It compared two models that are now two releases back on both sides. It used our tool descriptions, our schema depth, our document mix, and a definition of "called the tool correctly" that I never wrote down anywhere he could read. He had none of that context. He had a percentage, my name attached to it, and a decision to make on a Tuesday.

I have been on the other side of this. I have cited someone's benchmark from a blog post because it was the only number I could find and the meeting was in ten minutes. Everyone does it. The number is not lying to you, exactly. It is answering a question that was asked somewhere else, about something else, a while ago.

So this post is partly a correction and partly an apology to anyone who has quoted that comment since May.

What the number actually was

I want to be precise about it, because the imprecision is the whole point.

Late last year we shipped a contract-extraction agent. It had one tool worth arguing about: fetch_template(customer_id, doc_type). The agent was supposed to call it before extracting, so that extraction ran against the right schema. When it skipped the call, extraction ran against a generic schema and quietly produced worse output. Not an error. Just worse.

We were seeing skips. The team's instinct, and mine at first, was that the tool description was bad. So we did what everyone does. We rewrote it. We made it more imperative. We added "ALWAYS call this first." We moved it up in the tool list. We added an example. Each change bought a couple of points and cost a week.

After about three weeks of this I got annoyed enough to run the comparison properly. Same tool description, byte for byte. Same 1,000 queries sampled from real traffic. Two models. The gap was eight points, and it was larger than everything the prompt rewrites had bought us put together.

We hard-routed that flow to the model that won and stopped iterating on the prompt. That was the right call and I would make it again.

Then I wrote "96% versus 88%" in a comment box and walked away from it.

Three things that broke the number

1. The models moved, and they did not move in parallel

This is the obvious one and it is still worse than people expect.

The intuition most teams carry is that model releases lift all boats, so a gap measured last year roughly holds this year, just at higher absolute numbers. That intuition is wrong in a specific way: tool-calling behaviour is shaped by post-training choices that are not on the same schedule as general capability. A vendor can ship a model that reasons better and calls your tool less often, because they retuned how eagerly it reaches for tools, or changed how it handles a tool whose description sounds optional.

I have watched a model get better at the task and worse at the tool call in the same release. If you are only tracking the end metric, that shows up as noise. If you are tracking the tool call separately, it shows up as a decision.

Which direction any given release moved is a question about this month, and if I answer it here that answer rots too. That is not me dodging. It is the actual finding.

2. It was my workload, not yours

The eight-point gap was measured on one tool, in one schema, in one document domain, with one tool-list length.

Every one of those is load-bearing:

Tool count. One tool is a different problem from fourteen. Selection pressure between similar-sounding tools is a separate failure mode from whether the model reaches for a tool at all, and models that are good at one are not automatically good at the other.
Schema depth. A flat two-field schema and a nested schema with optional sub-objects do not fail the same way. (I have written about optional fields before. They remain the sharpest edge in this whole area.)
Description ambiguity. Our description was decent. If yours is genuinely ambiguous, you have a prompt problem sitting on top of your model problem, and my number tells you nothing about the ratio.
Domain. Contracts. If you are routing support tickets, the retrieval-shaped context is different enough that I would not transfer the number across.

So the honest scope of "96 versus 88" is: for this tool, in this schema, on this traffic, at that time. Anyone outside that scope is reading a number that was never about them.

3. The metric was underspecified, which is my fault

Here is the one that embarrasses me.

"Called the tool correctly" is at least four different metrics:

1. Did it call the tool at all, when it should have?
2. Did it call the right tool, when several were plausible?
3. Did it pass arguments that validate against the schema?
4. Did it pass arguments that were semantically right, meaning valid and also correct?

My 96 and 88 were mostly metric 1, with a bit of 3 folded in because a call that failed validation got counted as a miss. I did not say that. Someone reading the comment could reasonably assume I meant 4, which is the one they actually care about and the one where the gap between models is different again.

That gap between 3 and 4 is where most production pain lives. A tool call can validate perfectly and be wrong. My number does not speak to that at all, and it was quoted at me as if it did.

What I should have published

Not a percentage. A harness that produces yours.

Here is the shape we use now. It is deliberately small. The point is not that it is clever, it is that it takes under an hour to point at your own traffic, and after that you never have to cite a stranger's blog post in a design review again.

from enum import Enum
from typing import Any, Callable, Sequence

from pydantic import BaseModel, ValidationError


class Outcome(str, Enum):
    """The metrics people collapse into 'it worked'. Keep them apart."""

    NO_CALL = "no_call"                  # metric 1: should have called, didn't
    SPURIOUS_CALL = "spurious_call"      # should have called nothing, called anyway
    WRONG_TOOL = "wrong_tool"            # metric 2: called something else
    INVALID_ARGS = "invalid_args"        # metric 3: args failed schema validation
    VALID_BUT_WRONG = "valid_but_wrong"  # metric 4: schema-valid, semantically wrong
    CORRECT = "correct"


class Case(BaseModel):
    """One row of your traffic, with the answer you'd have wanted."""

    query: str
    expected_tool: str | None            # None = the model correctly calls nothing
    expected_args: dict[str, Any] | None


class Result(BaseModel):
    case: Case
    model: str
    outcome: Outcome


def grade(
    case: Case,
    tool_name: str | None,
    raw_args: dict[str, Any] | None,
    arg_schema: type[BaseModel],
    semantic_check: Callable[[Case, BaseModel], bool],
) -> Outcome:
    if tool_name is None:
        return Outcome.CORRECT if case.expected_tool is None else Outcome.NO_CALL
    if case.expected_tool is None:
        # It reached when it should have sat still. That is not tool *selection*.
        return Outcome.SPURIOUS_CALL
    if tool_name != case.expected_tool:
        return Outcome.WRONG_TOOL
    try:
        parsed = arg_schema.model_validate(raw_args or {})
    except ValidationError:
        return Outcome.INVALID_ARGS
    # The step almost everyone skips. Schema-valid is not the same as right.
    return Outcome.CORRECT if semantic_check(case, parsed) else Outcome.VALID_BUT_WRONG
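
grade leaves semantic_check to the caller. As a rough illustration for fetch_template, a check could compare the parsed arguments against known ground truth. The lookup table, the customer IDs, and the use of duck-typed objects (so the snippet runs without pydantic) are all assumptions.

```python
from typing import Any

# Hypothetical ground truth: which doc types each customer actually has.
CUSTOMER_DOC_TYPES: dict[str, set[str]] = {
    "cust_001": {"msa", "nda"},
    "cust_002": {"sow"},
}


def semantic_check(case: Any, parsed: Any) -> bool:
    """Right customer for this case, and a doc_type that customer really has.

    `case` needs an `expected_args` dict; `parsed` needs `customer_id` and
    `doc_type` attributes, as a validated argument model would provide.
    """
    expected_customer = case.expected_args["customer_id"]
    return (
        parsed.customer_id == expected_customer
        and parsed.doc_type in CUSTOMER_DOC_TYPES.get(expected_customer, set())
    )
```

This version accepts any doc_type the customer has on file. A stricter check would also compare against the expected doc_type, at the cost of more labeling per case.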

And the runner, which is the boring part that matters:

def run_matrix(
    cases: Sequence[Case],
    models: Sequence[str],
    call_model: Callable[[str, str], tuple[str | None, dict[str, Any] | None]],
    arg_schema: type[BaseModel],
    semantic_check: Callable[[Case, BaseModel], bool],
    repeats: int = 3,
) -> list[Result]:
    """Same cases, same tool description, every model. repeats>1 because
    these are sampled, not deterministic, and a 2-point 'gap' across a
    single pass is usually just temperature."""
    out: list[Result] = []
    for model in models:
        for case in cases:
            for _ in range(repeats):
                tool_name, raw_args = call_model(model, case.query)
                out.append(
                    Result(
                        case=case,
                        model=model,
                        outcome=grade(
                            case, tool_name, raw_args, arg_schema, semantic_check
                        ),
                    )
                )
    return out

Three things about this that are not obvious until you have run it wrong once.

repeats defaults to 3 and should probably be higher. These calls are sampled. Run 200 cases once against two models, see 94 and 91, and you have learned almost nothing. I have watched a "gap" evaporate on the second pass. If the difference you are chasing is inside the run-to-run spread, you have not found a difference, you have found the noise floor.

semantic_check is yours and nobody can write it for you. For fetch_template ours is roughly "does this customer_id belong to the customer the document is about, and is doc_type one this customer actually has." It is not glamorous. It is also the only part that measures the thing you care about, which is why every generic harness stops at validation and leaves the interesting metric on the floor.

Keep the tool description byte-identical across models. The first time we ran this we had per-model prompt tweaks left over from the three weeks of iteration, so we were comparing prompt-plus-model against prompt-plus-model and calling it a model comparison. That run told us nothing and we nearly shipped a decision off it.
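
The first of those three points, the noise floor, can be turned into a crude check. This sketch assumes you have already computed one rate per full pass for each model (run_matrix as written does not store the repeat index, so that grouping is left to you). The rule itself is a rule of thumb, not a significance test.

```python
from statistics import mean


def gap_exceeds_noise(passes_a: list[float], passes_b: list[float]) -> bool:
    """Each list holds one model's rate (e.g. share of CORRECT) per full pass.

    The gap between mean rates must be larger than the widest run-to-run
    spread either model shows on its own.
    """
    gap = abs(mean(passes_a) - mean(passes_b))
    spread = max(max(passes_a) - min(passes_a),
                 max(passes_b) - min(passes_b))
    return gap > spread
```

Three passes at 0.94, 0.91, 0.93 against 0.91, 0.94, 0.92 return False: the models swap places between passes, so there is no gap to act on.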

The decision rule

The harness gives you a distribution across six outcomes per model. What you do with it:
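
Getting from the flat result list to that distribution is a short tabulation. The sketch takes plain (model, outcome) string pairs so it stands alone; with the harness above you would feed it (r.model, r.outcome.value) for each result.

```python
from collections import Counter, defaultdict


def outcome_shares(results: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Share of each outcome bucket per model, from (model, outcome) pairs."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for model, outcome in results:
        counts[model][outcome] += 1
    return {
        model: {o: n / sum(c.values()) for o, n in c.items()}
        for model, c in counts.items()
    }
```

Reading the buckets side by side, rather than a single correct share, is what lets the rules below apply.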

Mostly NO_CALL, and it varies a lot by model. Model selection. Stop editing the prompt. This was us. If the spread across models is wider than the spread you can buy with prompt changes, you have your answer, and the prompt work you were about to do is a tax you are choosing to pay.

Mostly NO_CALL, and every model does it equally. Now it is a prompt problem, and specifically a description problem. The models agree with each other and they all disagree with you. That is information about your description, not about them.

Mostly SPURIOUS_CALL. The model reaches when it should sit still. Same axis as NO_CALL pointing the other way, and it moves between releases for the same reason. Same rule: check the spread across models before you rewrite the description.

Mostly WRONG_TOOL. Usually a naming and boundary problem, and it usually does not go away by switching models. Two tools that sound alike will confuse a better model too. I have written about naming before and I still think it is underrated relative to how cheap it is to fix.

Mostly INVALID_ARGS. Constrained decoding or a validation-retry loop, and bound the retries. A retry loop with no ceiling turns one bad request into a bill.

Mostly VALID_BUT_WRONG. The hard one. No model switch saves you. This is a precheck and eval problem, and it is where I would spend the time if the other four buckets are clean.

The rule I would have written into that comment if a comment box made you write rules instead of numbers: run the matrix before you touch the prompt, because the matrix is an afternoon and the prompt iteration is a quarter.

Why I am not publishing new numbers

I re-ran this on current models before writing this. I am not going to print what I got.

Not coyness. Two reasons.

The first is that it would recreate this exact post in six months. Someone would quote it in a design review in January, the models would have moved twice, and the number would be doing the same unearned work.

The second is that my re-run is still my workload. Same tool, same schema, same domain. Publishing it dressed up as a general finding is precisely the error I am describing, and doing it knowingly would be worse than doing it by accident in a comment box.

What I will say is about the method, not a leaderboard, and you can check it against your own run instead of taking it from me. Nothing forces a model's buckets to move together. A model can reach for the tool more often and be worse at argument semantics, and averaged into a single "correctness" percentage those two cancel. That is what my original number did. It collapsed buckets that were telling different stories into one figure that hid the decision instead of informing it. Which is a fourth thing broken about it, and arguably the worst one.

Jason Liu has been saying a version of this for years in the Instructor docs and around them: the interesting work is in the schema and the validation, not in the sentence you write above it. I read that, agreed with it, and then went and published a sentence-level percentage anyway. Anthropic's "Building Effective Agents" makes a nearby point about tool documentation deserving the same care as an API you hand to a junior engineer. Both hold up better than my comment did.

Where I'd push back on this

The steelman is real, and it deserves a proper hearing.

Published numbers are how anyone starts. If every practitioner refused to publish measurements because they might be misused, nobody would have a prior, and every team would rediscover from zero that model choice matters here. My 96 versus 88 did do work: it made people run their own tests. That is a real contribution, and a purist "measure it yourself" position is a bit rich coming from someone who benefited from other people's benchmarks for years.

I will concede that. What I would not concede is publishing it naked. A number with its scope, its metric definition, its date, and its models attached is a contribution. The same number in a comment box with none of that is a liability with my name on it.

And "just run it yourself" is not free. An afternoon of engineering time is an afternoon you do not have, and 200 labeled cases with a real semantic_check is more like two days than one afternoon if you are honest about the labeling. For a team of three shipping on a deadline, taking a stranger's number and moving on is a defensible call. I have made it. The thing I would ask is that you know you are making it, and that you write down which model and which month the number came from, so that when it stops being true you can find out.

The strongest objection is that the claim itself might not survive. "Function-calling robustness is a model selection problem" is a statement about a period in which models differ a lot on this axis. If tool-calling reliability commoditizes and every frontier model lands within a point of the others, my claim becomes a historical note and the prompt people were right all along, just early. I do not think that has happened as of July 2026, because I keep measuring gaps that are wider than my prompt work can close. But I hold the claim more loosely than I hold the method. The method survives either way. If the gap goes to zero, the matrix tells you that too, and then you go and fix your description with a clear conscience.

The number expires. The harness does not.

Parallel tool calls corrupted our shared state four times. Here's the pattern.

James O'Connor — Thu, 16 Jul 2026 01:11:44 +0000

TL;DR. If your LLM agent uses tool calling, the model can return several tool calls in a single turn, and most agent loops execute those calls at the same time. For about a year I treated the bugs that fell out of this as prompt problems: tighter schemas, better tool descriptions, more few-shot examples. They were not prompt problems. They were concurrency problems: several handlers doing read-modify-write against shared state with no coordination, the same race databases have handled for decades, just triggered by a new caller. Below are four production incidents, the mechanism behind each, and the specific guard that fixed it. The through-line: parallel tool calling amounts to concurrent writes against shared state, and nothing in the framework will tell you that.

Here is the setup, because the wording in the docs hides it.

When you send tools to the OpenAI or Anthropic API, the model can reply with more than one tool call in one assistant turn. OpenAI calls this parallel function calling and returns them as a tool_calls array (function-calling guide: platform.openai.com/docs/guides/function-calling). Anthropic documents the same behavior for tool use (docs.anthropic.com/en/docs/build-with-claude/tool-use). The part that matters happens next, and it happens in your code, not theirs. The model does not run anything. Your executor takes that array and runs the handlers, and in every agent loop I have read or written, the default is to run them concurrently (an asyncio.gather, a thread pool, a task group). The model emitted them together because, from where it sits, the calls look independent. It has no idea that call A and call B both land on account 4021.

That gap (independent to the model, contended at the data layer) is the whole article. What follows is four times it cost us something real, in the order we hit them.

Notice how the framing pushes you the wrong way. Every doc I have read sells parallel tool calling as a latency win: the model can ask for the weather in three cities at once, so why make it wait. True, and for read-only calls that is the entire story. But the same feature, pointed at tools that write, is unsynchronized concurrent access to shared state, and none of the guides say that part out loud. You opt into a distributed-systems problem by flipping what reads like a performance setting, and that is the part worth being suspicious about before you ship.

1. The double write (lost update)

What we saw. A support agent could grant account credits. One afternoon, 14 accounts ended the day exactly one credit short of what our own logs said we had granted. Not random amounts. Each was missing precisely one apply_credit out of the two the model had issued in the same turn.

The mechanism. The model, trying to be thorough, split a goodwill gesture into two apply_credit calls of 2500 cents each (it does this more than you would think, especially when it reasons about two separate reasons to compensate someone). Both calls hit the same handler at the same time. The handler read the balance, added to it, and wrote it back:

def apply_credit(account_id: str, amount_cents: int) -> dict:
    balance = db.get_balance(account_id)        # both siblings read 5000
    new_balance = balance + amount_cents        # both compute 7500 from 5000
    db.set_balance(account_id, new_balance)     # last write wins, one credit vanishes
    return {"balance": new_balance}

Both siblings read 5000. Both computed 7500. Both wrote 7500. The account landed at 7500 instead of 10000, and one 2500-cent credit evaporated with no error anywhere. Martin Kleppmann spends a good chunk of the transactions chapter in Designing Data-Intensive Applications on exactly this read-modify-write race (he calls it the lost update, and it predates LLMs by decades). The only new thing here is what pressed the button.

The guard. Serialize the read-modify-write per account, and dedupe on the call id so a retried delivery does not double-apply:

import threading

_account_locks: dict[str, threading.Lock] = {}
_registry_lock = threading.Lock()

def _lock_for(account_id: str) -> threading.Lock:
    with _registry_lock:
        return _account_locks.setdefault(account_id, threading.Lock())

def apply_credit(account_id: str, amount_cents: int, tool_call_id: str) -> dict:
    with _lock_for(account_id):
        if db.already_processed(tool_call_id):      # edge case: this exact call ran already
            return {"status": "duplicate_ignored", "balance": db.get_balance(account_id)}
        balance = db.get_balance(account_id)        # read
        new_balance = balance + amount_cents        # modify
        db.set_balance(account_id, new_balance)     # write
        db.mark_processed(tool_call_id)
        return {"status": "ok", "balance": new_balance}

In a single process that is enough. Across processes (you will get there) an in-memory lock is theater, so push the atomicity into the store instead: UPDATE accounts SET balance_cents = balance_cents + %(amt)s WHERE id = %(id)s is atomic and needs no read in app code. (If your store cannot do that, a row lock via SELECT ... FOR UPDATE around the same block gets you the same guarantee.)
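A minimal sketch of the store-level version, using an in-memory SQLite database so it runs anywhere. The table layout and the `acct_4021` id are assumptions made up for the example, and SQLite takes `?` placeholders where the statement above uses the `%(name)s` style of a Postgres driver.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL)"
)
conn.execute("INSERT INTO accounts VALUES ('acct_4021', 5000)")
conn.commit()

def apply_credit_atomic(conn: sqlite3.Connection, account_id: str, amount_cents: int) -> int:
    # One statement: the store performs the read-modify-write, so there is no
    # window in app code where two siblings can both read the old balance.
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount_cents, account_id),
        )
        if cur.rowcount != 1:
            raise KeyError(f"unknown account: {account_id}")
    row = conn.execute(
        "SELECT balance_cents FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]
```

Applying the two 2500-cent credits from the incident through this function lands the account at 10000. The example runs them one after another, so it shows the shape of the statement, not a proof of isolation under load.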

The reason we caught this one at all is a nightly reconciliation that summed the credit ledger against stored balances and alerted on any drift. If you take one operational thing from this piece, take that: a cheap after-the-fact check that recomputes state from an append-only log will surface every one of these four bugs, usually before a customer does. We had that check for balances. We did not have it for refunds, which is exactly why case 3 ran unnoticed for four days.
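The reconciliation idea can be sketched in a few lines. This is a guess at the shape, not the check we ran: `ledger` is assumed to be an append-only list of (account_id, amount_cents) entries, `balances` the stored balances, and `opening` the balances at the start of the window.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

def find_drift(
    ledger: Iterable[Tuple[str, int]],
    balances: Dict[str, int],
    opening: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Return {account_id: stored_minus_expected} for every account that disagrees."""
    expected = defaultdict(int, opening or {})
    for account_id, amount_cents in ledger:
        expected[account_id] += amount_cents      # recompute state from the log
    drift = {}
    for account_id in set(expected) | set(balances):
        diff = balances.get(account_id, 0) - expected[account_id]
        if diff != 0:
            drift[account_id] = diff              # negative: the store is short
    return drift
```

Fed the case 1 incident (opening balance 5000, two 2500-cent ledger entries, stored balance 7500), it reports the account as 2500 cents short, which is the lost credit.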

2. Read-after-write staleness inside one turn

What we saw. A shopping agent quoted checkout totals that were low. Not always. Roughly 1 in 200 sessions, the number the agent read back to the user was missing the item it had just added. Users noticed before we did, which is its own kind of embarrassing.

The mechanism. The model emitted add_to_cart(item) and get_cart_total() in the same turn. Because the model emits every call in a turn before it sees any result, no call in that turn can depend on another call's output as far as the model is concerned. The handlers had a dependency anyway: the total depends on the add. Run concurrently, get_cart_total frequently won the race and read the cart before the add had committed. The model then reported the stale total in perfect good faith. (This is the quiet one, because nothing throws. You just serve a wrong number with total confidence.)

The guard. Do not run a read against state that a sibling call is mutating. The cheap version: partition the tool_calls array into mutating and read-only, run the mutations first (serialized, per case 1), then run the reads. The model never needs to know. You are simply declining to run its calls truly in parallel when they touch the same state.

I will be honest that this one resisted a clean fix for a while, because "which tools touch the same state" is not something the framework knows about your handlers. We ended up tagging each tool as reads, mutates, or pure and letting the executor schedule on those tags. Boring. It held.

If you want the shape of it: the executor groups a turn's calls by tag, awaits the mutates group to completion in a fixed order, then dispatches the reads group against the settled state. Two extra lines of scheduling, and the total the model reads back is always the total after the add, never a snapshot from halfway through.
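As a rough sketch of that scheduling, assuming async handlers and a hand-written tag registry (the `TOOL_TAGS` dict, the tuple layout of a call, and the choice to treat untagged tools as mutating are all assumptions of this example):

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

# Hypothetical registry: every tool is tagged by hand as reads, mutates, or pure.
TOOL_TAGS = {"add_to_cart": "mutates", "get_cart_total": "reads"}

Call = Tuple[str, str, Dict[str, Any]]  # (call_id, tool_name, args)

async def run_turn(
    tool_calls: List[Call],
    handlers: Dict[str, Callable[..., Awaitable[Any]]],
    tags: Dict[str, str] = TOOL_TAGS,
) -> Dict[str, Any]:
    def is_mutating(name: str) -> bool:
        return tags.get(name, "mutates") == "mutates"   # unknown tools: assume the worst

    results: Dict[str, Any] = {}
    mutating = [c for c in tool_calls if is_mutating(c[1])]
    others = [c for c in tool_calls if not is_mutating(c[1])]

    # Mutations run one at a time, in the order the model emitted them.
    for call_id, name, args in mutating:
        results[call_id] = await handlers[name](**args)

    # Reads and pure tools fan out concurrently against the settled state.
    outputs = await asyncio.gather(*(handlers[name](**args) for _, name, args in others))
    for (call_id, _, _), output in zip(others, outputs):
        results[call_id] = output
    return results
```

With this in place, a turn that emits get_cart_total before add_to_cart still reads the total after the add, because the read is held until every mutation in the turn has finished.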

3. Idempotency-key collision (my own fix caused this one)

What we saw. After I shipped the case 1 guard, refunds started going missing. Over one week, 9 customers who were owed two separate refunds got one. My monitoring for case 1 stayed green the whole time, which is why it took four days to find.

The mechanism. For the case 1 dedupe I had gotten clever and keyed idempotency on the semantics of the operation, hash(account_id, amount_cents, "refund"), so a retried delivery of the same refund would not double-pay. That is correct for retries. It is wrong for two legitimately identical calls in one turn. When a customer was owed two 2500-cent refunds (two separate orders that happened to be the same price), the model issued two identical refund calls, my hash collapsed them to one key, and the second was dropped as a "duplicate." The guard I had written to stop a data race was itself dropping legitimate refunds.

The guard. Split the two questions that "idempotency key" smears together. To dedupe accidental re-delivery (a framework retry, a double-dispatch), use the tool_call.id the API already gives you: it is unique per call in the response, so two distinct refunds get two distinct ids and both go through. To dedupe an intended effect (the user should only ever be refunded once for order X), key on the business fact that makes it unique, the order id, never the amount. If you cannot name the specific field that makes an operation unique, then the key you are hashing is not really identifying the operation. It is identifying a set of arguments that can happen to match, so it will sometimes drop calls that are genuinely distinct.
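A small in-memory sketch of keeping the two questions apart. The class and field names are invented for illustration, and in a real deployment both lookups would be unique constraints in the store, not Python containers.

```python
class RefundLedger:
    """Toy, single-process illustration; not safe across processes."""

    def __init__(self) -> None:
        self._seen_call_ids = set()   # answers: was this exact call delivered before?
        self._refunded_orders = {}    # answers: was this order already refunded?
        self.paid_cents = 0

    def refund(self, tool_call_id: str, order_id: str, amount_cents: int) -> dict:
        # Question 1: accidental re-delivery of the same call.
        if tool_call_id in self._seen_call_ids:
            return {"status": "duplicate_delivery", "order_id": order_id}
        self._seen_call_ids.add(tool_call_id)

        # Question 2: the intended effect, keyed on the business fact (the order),
        # never on the amount.
        if order_id in self._refunded_orders:
            return {"status": "already_refunded", "order_id": order_id}

        self._refunded_orders[order_id] = amount_cents
        self.paid_cents += amount_cents
        return {"status": "ok", "order_id": order_id}
```

Two 2500-cent refunds for two different orders carry different call ids and different order ids, so both go through, while a re-delivered call or a second refund for the same order is stopped.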

(Transaction-id collisions are the same bug from the other side: two parallel handlers each run txn_id = max(existing) + 1, both read the same max, both claim the same id. Let the store mint ids, or reuse the tool_call.id, instead of a read-then-increment in app code.)

4. Partial failure across a parallel group

What we saw. A travel agent could, in one turn, emit book_hotel, book_flight, and charge_card. Three cards got charged for hotels that were never booked before we caught it. The model received the three tool results (two ok, one error), apologized to the user, and moved on. The money stayed moved.

The mechanism. Parallel tool calls have no transaction boundary around them. Your executor runs three handlers, one fails after the others have already committed side effects, and there is no rollback because there was never a transaction. The model is not a coordinator. It sees a mixed bag of results after the fact and does whatever its next-token instincts suggest, which is usually to say sorry, not to issue a compensating refund.

The guard. Two options, and I have shipped both. First, do not model a multi-step transaction as a set of independent parallel tools at all. Collapse the atomic unit into one tool (book_trip) that owns its own transaction or saga internally, so the model makes one call and your code owns the all-or-nothing. Second, when you genuinely cannot collapse it, disable parallelism for that toolset so the model has to sequence the work and you can stop after the first failure:

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=messages,
    tools=tools,
    parallel_tool_calls=False,   # model returns at most one tool call per turn
)

Anthropic exposes the same switch through tool_choice with disable_parallel_tool_use: true (as of mid-2026). Neither switch is a real distributed transaction. They just stop the model from opening N side effects you have no clean way to close.

If the work is irreducibly multi-step and refuses to live in one tool, be honest that you are now writing a saga. Each step needs a compensating action (refund_card to undo charge_card), and something durable has to drive those compensations when a later step fails, because the model will not. That is far more machinery than most agents carry, which is itself the strongest argument for collapsing the unit into a single tool and keeping the transaction inside your own code, where a database can actually enforce all-or-nothing.
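To show how much a saga asks of you, here is a deliberately small sketch. It is an assumption-heavy toy: steps are plain callables, compensations are assumed to succeed, and the list of completed steps lives in memory, so it is not durable. A crash halfway through would lose track of what needs undoing, which is exactly the part a real implementation has to persist.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, compensate)

def run_saga(steps: List[Step]) -> dict:
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Undo every step that already committed, newest first.
            compensated = []
            for done_name, undo in reversed(completed):
                undo()
                compensated.append(done_name)
            return {
                "status": "rolled_back",
                "failed_step": name,
                "error": str(exc),
                "compensated": compensated,
            }
        completed.append((name, compensate))
    return {"status": "ok", "completed": [name for name, _ in completed]}
```

Wrapped inside a single book_trip tool, the model sees one result, and a failed hotel booking triggers the card refund in your code instead of an apology in the transcript.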

When this does not bite you

I am not telling you to put a mutex on every tool. Most of an agent's tools are read-only, and for those, parallel calls are pure upside (fan out ten retrievals, good). This class of bug needs three things to line up at once, and if any one is missing you can ignore me:

Two or more calls in the same turn that mutate state. Purely read-only toolsets are safe, so fan them out freely.
The calls touch the same key. Two writes to different accounts do not contend at all. The danger is specifically same-row, same-turn.
Your app code does the read-modify-write. If the mutation is a single atomic statement in a store with real isolation, the database is already doing the coordination and your handler has nothing left to protect.

There is also a lucky case worth naming: naturally idempotent, set-to-value mutations (set_status("shipped")) survive a lost-update race because the last writer lands on the value you wanted anyway. They do not survive read-after-write, though. A sibling can still read the pre-write status and mislead the model. So "we only ever set values, never increment" buys you case 1, not case 2.

And the honest one: at low tool-call fan-out you may never see any of this. If your agent emits more than one mutating call per turn maybe once a week, the race is real but rare, and rare races present as "flaky, could not reproduce, closed." That does not mean the bug is absent. It means the bug is currently cheap enough to ignore, until it is not (ours was a refund, so the day it stopped being cheap arrived with a finance ticket attached).

Where I'd push back on this

Steelman the other side, because it has a case. Most production agents are read-heavy, the model emits conflicting parallel mutations rarely, and bolting locks onto handlers adds latency and a brand new failure mode (deadlocks, an unbounded lock registry, a lock ordering you now have to reason about). You could argue I am importing distributed-systems ceremony into what is, most days, a chatbot that calls a search API twice. That is fair, and I have over-built this before.

Here is the concession. The default I actually ship now is not "lock everything." It is duller than that. Read-only and pure tools run in parallel, untouched. State-mutating tools run serialized per turn, in a defined order, and only the ones that need atomicity get a store-level guard. For genuinely transactional multi-step work, I collapse it into one tool instead of trusting a parallel group to behave. That is slower on the rare turn with two mutations, it is correct, and I stopped losing refunds. If your workload is all reads, you pay none of this cost and you should not adopt any of it.

Objections I'd accept. "Push atomicity into the datastore, not app-level locks." Yes, wherever the store supports it. The in-process lock is a stopgap for single-process deployments, and I said as much. "This is just CS 101 concurrency, there is nothing LLM-specific here." Correct, and that is the point. The mechanism is 40 years old. What is new is that a model now introduces the concurrency invisibly, out of natural language, with nothing annotating that two of its calls collide.

Objections I wouldn't accept. "Just prompt the model not to emit conflicting calls in the same turn." No. You can lower the rate with prompting, but reducing the probability of a data race does not remove it, and a rarer race is harder to reproduce and harder to debug on call than a frequent one. "Then turn parallel tool calls off globally and forget it." That buys correctness and throws away the real latency win on read-heavy fan-out, which is most turns. Scope the disable to mutating toolsets: correctness where you have shared state, parallelism where you do not, and a tag on each tool that says which is which.
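One way to scope the switch, sketched under assumptions: `MUTATING_TOOLS` is a hypothetical hand-maintained tag set, the tool dicts use the chat-completions function-tool layout, and the returned dict is meant to be splatted into the same create call shown in case 4.

```python
from typing import Any, Dict, List

# Hypothetical tag set: the tools in your registry that write shared state.
MUTATING_TOOLS = {"apply_credit", "refund", "charge_card"}

def request_kwargs(tools: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Allow parallel tool calls only when no mutating tool is on offer this turn."""
    names = {tool["function"]["name"] for tool in tools}
    allow_parallel = names.isdisjoint(MUTATING_TOOLS)
    return {"tools": tools, "parallel_tool_calls": allow_parallel}
```

A request that offers only retrieval tools keeps the latency win. A request that offers any tagged mutating tool gets parallelism turned off for that request only.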

The frameworks will get here eventually (some are adding tool-level concurrency hints already). Until they do, it is safer to assume any two mutating calls in the same turn can contend on the same row, because in our workload they did a few times a week.

Your agent's tool call passed validation. That tells you nothing about whether it was the right call.

James O'Connor — Fri, 10 Jul 2026 08:58:46 +0000

Schema validation is a property of one output's shape. Whether the agent picked the right tool with the right arguments is a property of the decision, and a valid decision and a wrong decision are the same shape.

We had a support agent that could issue refunds, escalate to a human, or reply with an article. Every tool call it made was schema-valid: the refund calls had a well-formed amount and a currency, the escalate calls had a priority enum and a reason string, all of it passed pydantic on the way out. The dashboard for malformed tool calls sat at zero for weeks. Then finance asked why we had refunded a customer who had only asked how to change their email address. The call was perfect. Amount was a valid number, currency was a valid enum, the whole thing serialized clean. It was also completely wrong, and nothing we had was built to notice, because everything we had was checking shape.

Here is the position I have landed on, and it is the same one I keep landing on with agents: reliability is a property of the contract you can check, not the model you hope behaves. Validation checks that the output is well-formed. It does not check that the output was the correct thing to do, and for a tool-using agent the correctness of the decision is the entire game. Below is the argument in cases, the assertion I now write first, and where I think the line actually is.

Case 1: validation answers "is this well-formed," never "was this right"

A validator is a function of one value's structure. refund(amount=39.99, currency="USD") either matches the schema or it does not, and this one does. The validator has no access to the thing that would tell you it is wrong, which is the input state: the customer asked about their email, there was no order in the conversation, no payment to reverse. All of that context lived in the transcript, and the schema check never looks at the transcript. It looks at the object.

So the failure mode is specific and it is nasty: a wrong action and a right action are structurally identical. Both pass. There is no exception, no validation error, no retry that fires, nothing for a downstream guard to catch, because at the type level refund when you meant reply is indistinguishable from refund when you meant refund. The signal that would have caught it (this tool was the wrong tool for this state) is a fact about the relationship between the input and the call, and a validator only ever sees one side of that relationship.

This is why "add stricter schemas" does not fix it and sometimes makes it worse. Tighter enums and required fields raise your confidence that the output is well-formed, which is exactly the confidence you should not have, because well-formed was never the property in question.

Case 2: to catch a wrong decision you assert on behavior, against an expected action

The check that actually works is a different kind of check. Instead of "does this output match a schema," it is "given this input, did the agent take the action I expected." That is an evaluation, not a validation: it is a function of the input and the output together, graded against a labeled expectation. In practice, for the cases you understand, it looks like an ordinary test.

import pytest
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Your agent under test: takes a conversation, returns the tool call it chose.
# This is a stand-in so the file runs green; swap in your real agent.
def agent_decides(conversation: str) -> ToolCall:
    text = conversation.lower()
    if "refund" in text:
        return ToolCall("refund", {"amount": 9.99, "currency": "USD"})
    if "charged twice" in text or "double" in text:
        return ToolCall("escalate", {"priority": "high", "reason": "billing"})
    return ToolCall("reply", {"article_id": "kb_change_email"})

# Each case pairs an input with the ACTION we expect, not a shape.
CASES = [
    ("How do I change my email address?",          "reply"),
    ("I was charged twice for order 8842, fix it.", "escalate"),
    ("Cancel my subscription and refund this month","refund"),
]

@pytest.mark.parametrize("conversation, expected_tool", CASES)
def test_agent_picks_the_right_tool(conversation, expected_tool):
    call = agent_decides(conversation)
    # The assertion is about the decision, given the input. A schema check
    # would have passed all three of these even if the tool were wrong.
    assert call.name == expected_tool, (
        f"on {conversation!r} the agent chose {call.name}, expected {expected_tool}"
    )

def test_refund_only_fires_with_a_real_charge():
    call = agent_decides("How do I change my email address?")
    # The specific bug that cost us: a valid refund with no charge in context.
    assert call.name != "refund", "refunded with no payment in the conversation"

def test_refund_amount_is_justified_by_the_input():
    # Argument-level correctness: not "is amount a float" but "is it the RIGHT
    # amount for this input." The monthly plan is 9.99, so a refund of the month
    # should be 9.99, and any other value is a valid-but-wrong argument.
    call = agent_decides("Cancel my subscription and refund this month")
    assert call.name == "refund"
    assert call.args["amount"] == 9.99, "refunded an amount the input never justified"

The important shift is what the assertion is a function of. assert call.name == expected_tool cannot be satisfied by making the JSON cleaner. It can only be satisfied by the agent making the right call on that input, which is the property you actually care about. Argument-level correctness is the same idea one level down: not "is amount a float" but "is amount the amount that this input justifies," which again you can only judge against the input and a label.

You will not enumerate every input this way, and that is fine. A few dozen labeled cases covering the decisions that hurt when they are wrong (the actions that move money, send mail, touch a human) is already the difference between catching this class of bug in CI and hearing about it from finance.

Case 3: single-turn asserts do not survive a multi-turn agent, so you drive sessions

The parametrized test above works because each case is one input and one expected action. Real agents are not one input. They are a session: the customer says something, the agent calls a tool, the tool returns, the agent reads that and decides again, and the wrong-tool decision often shows up three turns in, only after a particular tool result comes back. You cannot express that as a static input string. You have to actually run the agent through a realistic multi-turn conversation and grade the calls it makes along the way, which means you need something that can play the other side of the conversation and replay the tool results.

That is a heavier piece of infrastructure than a pytest file, and it is the point where you start looking at what is out there rather than building it yourself. The tools I weighed for this, and what each one is actually for (repos linked so you can check my read rather than trust it):

promptfoo (github.com/promptfoo/promptfoo) is a declarative eval and red-team runner you point at your model from CI, assertion-first, config-driven. It is an eval tool, not a tracer.
DeepEval (github.com/confident-ai/deepeval) is pytest-style assertions for LLM output with a metric library, which is the shape Case 2 above is reaching for. Also eval-focused.
RAGAS (github.com/explodinggradients/ragas) is a narrower set of RAG-specific metrics, faithfulness and context precision and the like. Narrow on purpose.
Langfuse (github.com/langfuse/langfuse) is open-source tracing with an eval layer on top, so it is more than an eval tool: it also captures the multi-turn traces you would want to replay.
Future AGI (github.com/future-agi/future-agi) is an open-source end-to-end platform: eval, tracing, simulation and a gateway in one place, so like Langfuse and Phoenix it is more than an eval tool. Its lean is bundling multi-turn and voice simulation with the eval layer. At pure tracing, Langfuse and Phoenix are more mature.
Arize Phoenix (github.com/Arize-ai/phoenix) is open-source tracing plus eval as well, the same two-surface shape as Langfuse.
Braintrust (braintrust.dev) is a commercial eval and logging platform, strong on the experiment-tracking side, closed source.
LangSmith (smith.langchain.com) is the LangChain team's eval and tracing product, closed source, and tightest if you already run on the LangChain runtime.

All of that is as of July 2026, and these tools move fast enough that you should trust the repo over my one-line summary of it. The reason I list the whole spread rather than name a winner is that the choice is dominated by what you already run: if you are already tracing in one of these, grading the tool calls where the traces already live beats bolting on a second system. The category that matters for this bug is the one that can drive a multi-turn session and assert on the actions inside it, and several of these do that. Pick on integration cost, not on the feature grid.
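To make "drive a session and assert on the actions inside it" concrete, here is a toy harness. It is not how any of the tools above work internally. Everything in it is an assumption for illustration: `agent_step` is a stand-in for your agent that returns a (tool_name, args) pair, tool results are canned per tool name, and the agent makes exactly one call per user turn.

```python
from typing import Any, Callable, Dict, List, Tuple

Transcript = List[Tuple[str, Any]]          # ("user", text) or ("tool", result)
ToolCallPair = Tuple[str, Dict[str, Any]]   # (tool_name, args)

def drive_session(
    agent_step: Callable[[Transcript], ToolCallPair],  # hypothetical agent entry point
    user_turns: List[str],
    canned_results: Dict[str, Any],
) -> List[ToolCallPair]:
    """Play the user side, replay canned tool results, and collect every call made."""
    transcript: Transcript = []
    calls: List[ToolCallPair] = []
    for message in user_turns:
        transcript.append(("user", message))
        call = agent_step(transcript)
        calls.append(call)
        # Replay the scripted result so the next decision sees it.
        transcript.append(("tool", canned_results.get(call[0])))
    return calls
```

The grading is then an ordinary assertion over the returned list, for example that no refund call appears in any session where the replayed order lookup showed no charge.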

My working rules on this, hedges attached:

Validate shape and evaluate decisions. They are different checks with different inputs, and passing one tells you nothing about the other.
Write the expected-action assertion first for the calls that hurt when wrong (money, mail, humans). A few dozen labeled cases beat a perfect schema.
Assert on the argument's correctness given the input, not just its type. "Is this the right amount" not "is this a float."
For anything multi-turn, drive a real session and grade the calls in it. A static input string cannot express a bug that only appears after turn three.
Grade where your traces already are. The integration you already run beats the tool that scores marginally better in isolation.

Where I'd push back on this

The honest counter is that "evaluate every decision" is a lot of machinery, and most agents do not need a simulation platform to be fine. If your agent has three tools and one of them is dangerous, you do not need a multi-turn harness, you need one blunt assertion that the dangerous tool never fires without its precondition, and you can write that in ten lines and be done. Reaching for a whole eval-and-simulation stack for a three-tool agent is the same over-engineering as reaching for a discriminated union when a boolean would do. I have watched a team spend a sprint standing up agent simulation for a bot that would have been fully covered by four assert call.name != "refund" style guards. The infrastructure was real work and it protected a decision surface that was not actually that wide.
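The blunt version might look like the sketch below. The `context` dict and its `charge_ids` field are hypothetical: the assumption is that your own code fills them from the payment system, not from anything the model said.

```python
from typing import Any, Dict, Tuple

def guard_dangerous_call(
    call_name: str, args: Dict[str, Any], context: Dict[str, Any]
) -> Tuple[str, Dict[str, Any]]:
    """Refuse the one irreversible tool unless its precondition holds."""
    # Precondition for a refund: at least one real charge tied to this conversation.
    if call_name == "refund" and not context.get("charge_ids"):
        raise PermissionError("refund blocked: no charge on record for this conversation")
    return call_name, args
```

Run in the executor before the handler, it stops the email-change refund from the opening story, and the same function called from a test doubles as the assertion.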

So I will grant the boundary. If your action space is small and only one or two actions are irreversible, guard those directly and skip the rest. The place I do not move is any agent where the tool it picks is a genuine decision over a wide input space, especially a multi-turn one that acts on the world. There, "the JSON was valid" is not evidence the agent did the right thing, it is evidence you checked the one property that was never in doubt. In a system you have only ever validated, the valid-but-wrong call is the common case, and it stays invisible until something in your pipeline grades the decision and not just the shape.

The agent retried the tool call. The customer got charged twice.

James O'Connor — Fri, 10 Jul 2026 08:51:30 +0000

An agent will re-issue a tool call for reasons the tool never sees, and if that tool moves money, sends mail, or creates a record, the retry runs the side effect again. Prompting the model to be careful does not close this, because the model is not the layer that decides to retry. The tool contract is where it has to be closed.

The first time this bit us, a customer got charged twice for one order and we spent an afternoon reconstructing why. The trace showed two identical charge_card calls, four seconds apart, same amount, same card. The first call had actually succeeded on the payment side, but the response was slow, our client hit its timeout, and the framework's auto-retry fired a second call. From the model's point of view it made one decision. From the payment processor's point of view it received two charges. Over the following week, before we fixed it, we found 9 duplicate side effects across charges and confirmation emails. Not one of them was a model reasoning error. Every one was a retry the tool had no way to recognize as a retry.

Here is the position I have landed on. Any tool that mutates state has to be idempotent at the tool boundary, and you cannot delegate that to the model, because the model does not know when it is being retried, and the layer that retried it does not know whether the first attempt already took effect. I will walk through where the retries come from, the wrapper I now put on every mutating tool, and the two refinements that decide whether it holds.

Case 1: retries come from three places, and the tool sees none of them

It helps to name where the duplicate call actually originates, because the fix has to sit below all three.

There is the client-timeout retry: the call succeeded server-side, but the response was slow, so the caller gave up and re-sent. The side effect already happened; the caller does not know it. There is the model re-emission: the tool returned something ambiguous (a partial result, an error that was actually a success), and the model, reading the transcript, decides to call the tool again to be sure. And there is the framework auto-retry: many agent runtimes retry a tool call on exception by default, and a timeout is an exception even when the work completed.

The common thread is that in all three, the second call is byte-for-byte a legitimate call. Nothing about it looks wrong. You cannot filter it out by validating arguments, because the arguments are valid. The only thing that distinguishes it from a real second action is that it is the same logical action as one you already performed, and the tool has no memory of that unless you give it one.
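To make the client-timeout case concrete, here is a minimal simulation. Everything in it is hypothetical: `ledger` stands in for the payment processor's record, and `call_with_retry` stands in for a framework auto-retry, not any particular runtime's API.

```python
# Hypothetical simulation: the side effect lands, the response is lost,
# and a generic retry helper re-sends the identical call.
ledger = []  # stands in for the payment processor's record of charges


def charge_card(customer_id: str, cents: int) -> dict:
    ledger.append((customer_id, cents))       # the charge goes through...
    if len(ledger) == 1:
        raise TimeoutError("response lost")   # ...but the first caller never hears back
    return {"charged": cents, "customer": customer_id}


def call_with_retry(fn, *args, attempts: int = 2):
    """Stand-in for a framework auto-retry: re-send on any exception."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn(*args)
        except Exception as exc:
            last_error = exc
    raise last_error


result = call_with_retry(charge_card, "cust_17", 4200)
# The caller sees one successful result; the ledger holds two identical charges.
```

Both entries in `ledger` have the same arguments, so argument validation cannot tell them apart.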

Case 2: require an idempotency key per logical action, and dedupe on it (the load-bearing fix)

The mechanism that actually works is the one banks and payment APIs have used for years: every mutating call carries an idempotency key that names the logical action, and the server records the outcome under that key. A second call with the same key does not re-execute, it returns the stored result of the first. The key is not the arguments. It is a stable id for "this specific intended action," generated once, reused across every retry of that action.

Here is the wrapper I put on mutating tools. It stores the result of the first successful execution under the key and short-circuits any duplicate to that stored result instead of running the side effect again.

python
import functools, threading, time

class _Store:
    """Toy store. In production this is Redis/Postgres with a TTL and a real lock."""
    def __init__(self):
        self._data = {}          # key -> (status, result, expires_at)
        self._lock = threading.Lock()

    def begin(self, key, ttl):
        # Returns ("hit", result) if done, ("inflight", None) if a first call is still running,
        # else ("new", None) after reserving the key.
        now = time.time()
        with self._lock:
            row = self._data.get(key)
            if row and row[2] > now:
                if row[0] == "done":
                    return "hit", row[1]
                return "inflight", None      # a duplicate arrived before the first finished
            self._data[key] = ("inflight", None, now + ttl)
            return "new", None

    def finish(self, key, result, ttl):
        with self._lock:
            self._data[key] = ("done", result, time.time() + ttl)

_store = _Store()

def idempotent(ttl=3600):
    """Wrap a mutating tool. Requires an `idempotency_key` kwarg naming the action."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, idempotency_key: str, **kwargs):
            if not idempotency_key:
                raise ValueError(f"{fn.__name__} is mutating and needs an idempotency_key")
            state, prior = _store.begin(idempotency_key, ttl)
            if state == "hit":
                return prior                 # duplicate: return first result, do NOT re-run
            if state == "inflight":
                raise RuntimeError(f"action {idempotency_key} already in progress")
            result = fn(*args, **kwargs)     # the side effect runs exactly once
            _store.finish(idempotency_key, result, ttl)
            return result
        return wrapper
    return deco

@idempotent(ttl=3600)
def charge_card(customer_id: str, cents: int) -> dict:
    # the real side effect: hits the payment processor exactly once per key
    return {"charged": cents, "customer": customer_id}

# Same key across retries -> one charge, second call returns the stored result.
key = "order-8842-charge"
a = charge_card("cust_17", 4200, idempotency_key=key)
b = charge_card("cust_17", 4200, idempotency_key=key)   # no second charge
assert a == b

The inflight state matters as much as the done state. If two retries race (the timeout fires while the first call is still running), you do not want both to sail past the check and both hit the processor. Reserving the key before executing, and rejecting a second call that arrives while the first is still in flight, is what closes that window. In production the store is Redis or a row with a unique constraint, and the reservation is a real atomic operation, but the shape is exactly this.
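As a sketch of the "row with a unique constraint" variant, here is the same reservation done with the standard library's sqlite3 module. The table name, columns, and the `reserve`/`finish` helpers are assumptions for illustration. An in-memory SQLite database stands in for whatever store you actually run, and TTL handling is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency (key TEXT PRIMARY KEY, status TEXT NOT NULL, result TEXT)"
)


def reserve(key: str) -> bool:
    """Return True if this caller won the reservation, False if the key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency (key, status) VALUES (?, 'inflight')", (key,)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the PRIMARY KEY rejected the insert: someone got there first


def finish(key: str, result: str) -> None:
    """Record the outcome so later duplicates can be answered from the row."""
    with conn:
        conn.execute(
            "UPDATE idempotency SET status = 'done', result = ? WHERE key = ?",
            (result, key),
        )


first = reserve("order-8842-charge")    # wins the reservation
second = reserve("order-8842-charge")   # loses it, so the side effect must not run
finish("order-8842-charge", '{"charged": 4200}')
row = conn.execute(
    "SELECT status, result FROM idempotency WHERE key = ?", ("order-8842-charge",)
).fetchone()
```

The insert is the atomic step. The database's uniqueness check decides which caller won, not an if-statement in the application.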

Case 3: the key has to be stable, and reads should stay retryable

Two refinements decide whether this holds up.

First, who generates the key. The safest source is the caller that has the stable notion of the action: your orchestration layer minting one id per logical action (per order, per message, per ticket) and threading it through every retry of that action. You can let the model pass a client-generated action id, but treat it as advisory, because a model asked to "reuse the same key on a retry" will sometimes generate a fresh one, and a fresh key defeats the entire mechanism. If you have no stable id to lean on, a short-TTL dedupe cache keyed on the tuple of (tool_name, normalized_args) is a serviceable fallback: it catches the identical-call retry inside a small window. It is weaker, because two genuinely-distinct identical actions (the same customer legitimately buying the same item twice in a minute) will collide and the second gets silently dropped, so the TTL has to be tuned to your real duplicate-vs-distinct timing.
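The fallback cache described above could look roughly like this. `ArgsDedupeCache` and `send_email` are hypothetical names, the 30-second TTL is an arbitrary placeholder, and the class is not thread-safe. The `now` parameter exists only so the timing is easy to follow in an example.

```python
import hashlib
import json
import time


class ArgsDedupeCache:
    """Dedupe on (tool_name, normalized_args) inside a short TTL window."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._seen = {}  # fingerprint -> (result, expires_at)

    @staticmethod
    def fingerprint(tool_name: str, args: dict) -> str:
        # sort_keys normalizes key order so equivalent calls hash identically
        payload = json.dumps([tool_name, args], sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool_name: str, fn, args: dict, now: float = None):
        now = time.time() if now is None else now
        fp = self.fingerprint(tool_name, args)
        hit = self._seen.get(fp)
        if hit and hit[1] > now:
            return hit[0]              # treated as a retry: side effect skipped
        result = fn(**args)
        self._seen[fp] = (result, now + self.ttl)
        return result


sent = []


def send_email(to: str, subject: str) -> dict:
    sent.append((to, subject))
    return {"sent_to": to, "n": len(sent)}


cache = ArgsDedupeCache(ttl_seconds=30)
r1 = cache.call("send_email", send_email, {"to": "a@example.com", "subject": "Receipt"}, now=100.0)
# Same call four seconds later, keys in a different order: deduped.
r2 = cache.call("send_email", send_email, {"subject": "Receipt", "to": "a@example.com"}, now=104.0)
# Same call after the window has passed: runs again.
r3 = cache.call("send_email", send_email, {"to": "a@example.com", "subject": "Receipt"}, now=200.0)
```

The second call is exactly the collision risk: the cache cannot know whether it was a retry or a second, intended email, which is why the window has to be tuned to real traffic.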

Second, do not idempotency-gate everything. Reads are naturally safe to retry (fetching a balance twice costs nothing and hides no bug), and wrapping them adds latency and a cache that can go stale. Split the surface at the wrapper layer: mark writes as unsafe and require a key, leave reads retryable and un-keyed. The wrapper should refuse to run a mutating tool without a key, and should not demand one from a read.

The rules I hold myself to on this, with the hedges attached:

Every mutating tool carries an idempotency key naming the logical action, and the server dedupes on it. This is the load-bearing fix.
Generate the key in the orchestration layer, not the model. Treat a model-supplied key as advisory, never as a guarantee.
Reserve the key before executing, so racing retries cannot both run the side effect.
A (tool, normalized-args) dedupe cache with a short TTL is a fallback when you have no stable id, with the caveat that it drops genuinely-distinct identical actions.
Separate safe (read) from unsafe (write) at the wrapper. Only writes need keys. Do not gate reads.

Where I'd push back on this

The honest counter is that idempotency infrastructure is not free and it introduces its own failure mode: a stale or wrong dedupe entry that suppresses a call that should have run. If your TTL is too long and the store returns a cached result for an action the caller genuinely wanted to repeat, you have now silently dropped a real charge or a real email, and that bug is harder to see than the double-charge, because nothing errored. Add a flaky key store to the picture (the reservation write fails, or the done write is lost after the side effect ran) and you can get the worst of both, a side effect that happened with no record that it did. So this is not "sprinkle a decorator and stop thinking." It is a small distributed-systems problem, and the store's correctness is now part of your tool's correctness.

There is a boundary I will grant. If a tool is naturally idempotent already, either a pure read, or a write whose target already dedupes (an upsert keyed on a natural id, a set-to-value rather than an increment), then adding a key layer on top is redundant machinery that buys you nothing and adds a cache to keep honest. Push the idempotency down to the resource when the resource can carry it. The case I hold firm on is any non-idempotent write with an external side effect: charging a card, sending a message, incrementing a counter, creating a ticket. There, the model and the framework will retry for reasons neither surfaces to the tool, and "the model usually does not double-call" is not a correctness argument, it is a hope. The retry is going to happen. The only question is whether the second one runs the side effect, and that is decided at the tool boundary, not in the prompt.

A strict schema gives the model nowhere to say I do not know, and that costs you the errors you most need to see.

James O'Connor — Wed, 08 Jul 2026 01:20:56 +0000

When you force a model into required fields and tight enums, you have not removed hallucination. You have removed the model's ability to admit it, so it fills the blank with something plausible instead.

We tightened a classification schema last spring and watched our error rate look better while our incidents got worse. The schema went from loose (a free-text answer field) to strict (a required category enum with eleven allowed values, no nulls). Validation-failure rate dropped to near zero, which read like a win on the dashboard. What actually happened is that inputs the model could not classify used to come back empty and get routed to a human, and now they came back as a confidently-typed category that happened to be wrong. We measured it on a held-out set of genuinely-ambiguous tickets: roughly 1 in 6 of them got a crisp, schema-valid, incorrect label, where before they would have been flagged as unclassifiable. We had not reduced the errors. We had hidden them behind a green checkmark.

The claim in one line. A strict schema with no representable "I cannot answer" outcome does not make the model more reliable, it makes the model's uncertainty invisible, and an invisible fabrication is more expensive than a visible refusal. Below is the argument in cases, and the schema shape I now reach for first.

Case 1: the abstain-shaped input is the one that hurts you

Most inputs are answerable and the schema is fine. The failure lives in the tail: the malformed record, the question the context does not cover, the enum that has no member for what the model is actually looking at. For those, a strict schema offers exactly one path. Pick a value. The model is not choosing to lie, it is doing the only thing the contract permits, which is to emit the most probable allowed token given an input it has no real answer for.

The tell is that these fabrications are indistinguishable from confident correct answers at the type level. A wrong-but-required category and a right category are the same shape (both pass), so no downstream validator, no retry loop, nothing catches it. The signal you needed (this input was unanswerable) existed for one moment inside the model and your schema threw it away.

Case 2: a discriminated union makes abstention a first-class outcome (my default)

The fix that has held up best is to stop pretending every response is an answer. Model the response as a tagged union of two shapes: an answer, or an explicit abstention that carries the reason it could not answer. In pydantic v2 this is a discriminated union on a literal kind field, and the discriminator is what lets you branch without guessing.

python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class Answer(BaseModel):
    kind: Literal["answer"] = "answer"
    category: Literal["billing", "bug", "feature_request", "account", "other"]
    confidence: float = Field(ge=0.0, le=1.0)

class Abstain(BaseModel):
    kind: Literal["abstain"] = "abstain"
    reason: Literal[
        "insufficient_context",
        "ambiguous_between_categories",
        "input_malformed",
        "out_of_scope",
    ]
    note: str = Field(default="", max_length=280)

# Field(discriminator="kind") is what makes this a discriminated union;
# a bare Union would fall back to pydantic's default union matching.
Result = Annotated[Union[Answer, Abstain], Field(discriminator="kind")]
_adapter = TypeAdapter(Result)  # discriminates on kind

def classify(raw_json: str) -> Result:
    return _adapter.validate_json(raw_json)

# The caller cannot ignore the abstain arm, because the two shapes are different.
# model_output, route_to_human and apply_label are placeholders for your own code.

res = classify(model_output)
if res.kind == "abstain":
    route_to_human(reason=res.reason, note=res.note)  # a clean, typed refusal
else:
    apply_label(res.category, res.confidence)

The property that matters: Answer and Abstain are structurally different, so the caller has to handle both arms. There is no way to accidentally treat an abstention as an answer, because it does not have a category field to read. The reason for abstaining is itself a constrained enum, so "I could not answer" arrives with a machine-actionable cause attached, not as a shrug.

Case 3: Optional fields are the weak version, because they lose the why

The reflexive fix, once you see this, is to make everything optional. Let category be str | None, let the model return null when it is stuck. This is better than a forced enum, but it is the weak form, and it fails for one specific reason: None tells you a field is missing, it does not tell you why it is missing.

python
from pydantic import BaseModel

# Weak fix: null-as-abstain. Works, but throws away the reason.

class LossyResult(BaseModel):
    category: str | None  # None could mean:
                          # - genuinely ambiguous input
                          # - context did not cover it
                          # - the model malfunctioned
                          # - a real answer of "none of the above"

Now every null looks the same at the call site. "This ticket is ambiguous between two categories" and "the input was garbage" and "this is legitimately none-of-the-above" all collapse into one None, and you have to reconstruct the cause from logs you probably did not keep. Worse, None is often a valid answer to some questions (no middle name, no discount applied), so you have overloaded one token to mean both "the answer is nothing" and "I have no answer." The discriminated union keeps those separate on purpose. If you truly cannot afford a union, the next best thing is a pair of fields, an abstained: bool plus a reason, so the caller has an explicit flag to check rather than inferring intent from a null. And a sentinel enum member (a NOT_ENOUGH_INFORMATION value inside the category enum itself) is the cheapest retrofit of all when you cannot change the response shape, though it muddies the enum's meaning by mixing a control signal in with real categories.
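A sketch of the flag-plus-reason fallback, written with a standard-library dataclass so it runs without dependencies; the same checks would map onto a validator in whatever schema library you use. `FlaggedResult` and `ALLOWED_REASONS` are hypothetical names, and the reason values are borrowed from the union example.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_REASONS = {
    "insufficient_context",
    "ambiguous_between_categories",
    "input_malformed",
    "out_of_scope",
}


@dataclass(frozen=True)
class FlaggedResult:
    abstained: bool
    category: Optional[str] = None
    reason: Optional[str] = None

    def __post_init__(self):
        # Enforce that the flag, not a null, carries the intent.
        if self.abstained:
            if self.reason not in ALLOWED_REASONS:
                raise ValueError("an abstention needs a reason from ALLOWED_REASONS")
            if self.category is not None:
                raise ValueError("an abstention must not carry a category")
        else:
            if self.category is None:
                raise ValueError("an answer needs a category")
            if self.reason is not None:
                raise ValueError("an answer must not carry an abstain reason")


ok_answer = FlaggedResult(abstained=False, category="billing")
ok_abstain = FlaggedResult(abstained=True, reason="input_malformed")

try:
    FlaggedResult(abstained=False)  # a bare "no category" is rejected, not passed through
    rejected = False
except ValueError:
    rejected = True
```

The caller branches on `abstained` and never has to guess what an empty `category` meant.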

My working rules, stated plainly:

Every strict output schema needs a representable "I cannot answer," or you are converting refusals into silent fabrications.
Prefer a discriminated union of answer-or-abstain, so the two outcomes are different shapes the caller must branch on.
If a union is too heavy, use an explicit abstained flag plus a reason the caller checks, not a bare None.
A reason for abstaining should be a constrained enum, so downstream code can route on it (retry, human, drop) without parsing prose.
"Just make it optional" is the weakest fix, because a null loses the reason and collides with legitimately-empty answers.

Where I'd push back on this

The honest counter is that giving a model an abstain arm is giving it an escape hatch, and escape hatches get overused. A model that can say "insufficient_context" will sometimes say it on inputs that were perfectly answerable, because abstaining is easy and being right is hard. If your abstain rate creeps to 30 percent, you have not built reliability, you have built a very confident way to route everything to a human, and now the humans are the bottleneck you were trying to remove. That failure is real and I have seen it: the union made refusal cheap, so the model took the cheap path more than it should have. You manage it by measuring the abstain rate as its own metric, sampling abstentions for ones that should have been answers, and tightening the prompt (or the examples) when the model is hiding behind the escape hatch instead of using it.
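Tracking the abstain rate as its own metric can be as small as the sketch below. It assumes results are logged as dicts with a `kind` field and, for abstentions, a `reason`. The function name, the sample size, and the fixed seed are all illustrative choices.

```python
import random


def abstain_report(results: list, sample_size: int = 5, seed: int = 0) -> dict:
    """Summarize how often the model abstained and pull a sample for human review."""
    abstained = [r for r in results if r["kind"] == "abstain"]
    rate = len(abstained) / len(results) if results else 0.0
    by_reason = {}
    for r in abstained:
        by_reason[r["reason"]] = by_reason.get(r["reason"], 0) + 1
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    sample = rng.sample(abstained, min(sample_size, len(abstained)))
    return {"abstain_rate": rate, "by_reason": by_reason, "review_sample": sample}


logged = (
    [{"kind": "answer", "category": "billing"}] * 7
    + [{"kind": "abstain", "reason": "insufficient_context"}] * 2
    + [{"kind": "abstain", "reason": "out_of_scope"}]
)
report = abstain_report(logged, sample_size=2)
```

The review sample is the part that matters: a person reads those abstentions and marks the ones that should have been answers.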

So I will concede the boundary. If your task is genuinely closed-world (a finite set of inputs you fully control, where every input has a correct answer by construction), then there is nothing to abstain about, and the abstain arm is dead code that only invites the model to misfire. A strict schema with no escape hatch is the right call there, and adding a union is over-engineering a problem you do not have. Where I do not move is anything open-world: user free text, scraped documents, anything where the input can be malformed or out of scope. There, the inputs that break your schema are exactly the inputs you most need to know broke it, and a schema that forces a confident answer on an unanswerable input is not being strict, it is being wrong on purpose and calling it valid.

The Partial JSON Looked Done. It Wasn't. Here's What Streaming Structured Output Actually Requires.

James O'Connor — Mon, 06 Jul 2026 00:40:29 +0000

Last quarter we shipped a contract extraction feature that streamed its output field by field into a review UI, so a paralegal could start reading the extracted renewal date and counterparty name before the model finished the whole document. Nice idea. Faster perceived latency, better demo, the kind of thing that gets a thumbs up in a design review.

Then a reviewer flagged a contract where the counterparty name in the UI read "Meridian Hold" and the final, settled value was "Meridian Holdings Group LLC." Nobody had touched anything. The UI had simply displayed the string as it existed at a moment mid-stream, before the model had finished writing it, and the user had already glanced at it, nodded, and moved to the next tab. By the time the full value arrived, the case for that field was already closed in the reviewer's head.

That's the part that stings about streaming structured output. It doesn't usually fail loud. It fails by being plausible at exactly the wrong instant.

The naive approach is to accumulate the streamed text and try json.loads on the buffer every time a new chunk lands, catching the exception when it's not valid yet. That works, technically, right up until it doesn't, and the failure mode is worse than a crash: a buffer that parses gets treated as a buffer that's finished, so "not done yet" quietly turns into "done."

Concretely, here's what breaks:

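1. Strings that parse as valid but aren't finished. Strict json.loads rejects the raw buffer while a string is still open, but as soon as anything closes it for you (a partial-JSON repair step, a lenient parser that auto-completes quotes and braces), you get {"counterparty": "Meridian Hold"}, and json.loads doesn't fail on that. It's completely valid JSON. It's also not the value you want. A string field only becomes trustworthy at the character where its closing quote lands, and nothing before that tells you where that is, because the model doesn't announce "I'm two tokens from done."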

2. Arrays that look closed but aren't. With tool-calling providers that stream token-by-token, we've seen an array of extracted clauses close its bracket, then reopen because the provider's underlying generation retried a truncated chunk server-side (this is provider-side behavior we can't fully control or always observe, and it showed up more under load, though I'd want a longer sample before I called that a hard rule rather than a pattern we noticed on high-traffic days). If your parser saw the first closing bracket and moved on, you've committed to a value that got silently superseded.

3. Parallel tool calls interleaving. When a model issues more than one tool call in the same turn, providers don't always guarantee the chunks for call A finish before chunks for call B start arriving. We had a case where two extract_clause calls were in flight and a naive buffer-per-response (rather than buffer-per-call) approach spliced fragments from both into one JSON blob that parsed successfully and meant nothing.

None of these are edge cases in the sense of being rare. In our sample of roughly 400 flagged extraction sessions over about six weeks (anecdotal, drawn from our own error queue, not an industry number), something in this family, a field displayed before it was complete, showed up in just under 3% of streamed sessions. Low frequency, high cost, because the ones that go wrong are the ones a human trusted.
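The first item in that list is easy to reproduce with the standard library. In this sketch the "repair" step is a deliberately crude stand-in (appending a quote and a brace) for what a lenient partial parser does. It is an assumption for illustration, not a description of any specific library.

```python
import json

partial = '{"counterparty": "Meridian Hold'   # buffer as it exists mid-stream

try:
    json.loads(partial)
    strict_parsed = True
except json.JSONDecodeError:
    strict_parsed = False   # strict parsing refuses the still-open string

# Hypothetical repair step: close whatever is open so the buffer parses.
repaired = partial + '"}'
value = json.loads(repaired)["counterparty"]
# `value` is now a well-formed string that is simply not the final one.
```

Nothing in `value` tells a consumer that it came from an unfinished buffer.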

What "schema-aware" actually means here

The fix isn't a smarter JSON parser. Full-document partial-JSON parsers exist and are useful for a different problem (rendering a tree view of an in-progress object), but they answer the wrong question. The question isn't "can I parse this yet." It's "is this specific field's value done being written."

That reframes the problem as a cursor per field, not a parser over the whole buffer. You track, for each key you care about, whether you've seen its terminating character. For strings, that's an unescaped closing quote. You do not surface the field to any downstream consumer, UI or otherwise, until that condition is met. Everything before that point stays internal state, not output.

Here's a minimal version of that, enough to show the shape (not a production parser, we handle nested objects and arrays with a similar but longer state machine):

from typing import Optional


class FieldExtractor:
    """Extracts only fields whose values are structurally complete
    from a growing buffer of streamed JSON text. Never calls
    json.loads on the partial buffer.

    A field counts as 'safe to surface' only once we've seen the
    terminating character for its type: here, a closing quote for
    strings (not preceded by an odd number of backslashes).
    """

    def __init__(self, expected_keys: set):
        self.expected_keys = expected_keys
        self.raw = ""
        self.emitted = {}

    def feed(self, chunk: str) -> dict:
        self.raw += chunk
        newly_closed = {}
        for key in self.expected_keys:
            if key in self.emitted:
                continue
            value = self._extract_closed_string_value(key)
            if value is not None:
                self.emitted[key] = value
                newly_closed[key] = value
        return newly_closed

    def _extract_closed_string_value(self, key: str) -> Optional[str]:
        marker = f'"{key}"'
        start = self.raw.find(marker)
        if start == -1:
            return None
        after_colon = self.raw.find(":", start + len(marker))
        if after_colon == -1:
            return None
        i = after_colon + 1
        while i < len(self.raw) and self.raw[i] in " \t\n":
            i += 1
        if i >= len(self.raw) or self.raw[i] != '"':
            return None  # not a string field, or value hasn't started
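        j = i + 1
        while j < len(self.raw):
            if self.raw[j] == '"':
                # Count the backslashes directly before this quote. An even
                # count (including zero) means the quote itself is unescaped.
                k = j - 1
                while k > i and self.raw[k] == "\\":
                    k -= 1
                if (j - 1 - k) % 2 == 0:
                    return self.raw[i + 1:j]  # closing quote found, value is safe
            j += 1
        return None  # still being written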


if __name__ == "__main__":
    extractor = FieldExtractor(expected_keys={"vendor_name", "renewal_date"})
    stream_chunks = [
        '{"vendor_name": "Acme Ind',
        'ustries, Inc.", "renewal_date": "2027-0',
        '3-01"}',
    ]
    for chunk in stream_chunks:
        closed = extractor.feed(chunk)
        for k, v in closed.items():
            print(f"safe to show: {k} = {v!r}")

Run it and the first chunk produces nothing (the vendor name string hasn't closed), the second chunk produces vendor_name, and the third produces renewal_date. Nothing gets displayed until its own value is structurally complete, independent of whatever else is still being written elsewhere in the object.

The version we actually run in production extends this with a stack for nested objects and arrays (so a clause list only emits an item once that item's closing brace is seen, not when the array itself closes), and a separate buffer keyed by tool-call ID so parallel calls can't splice into each other. Pydantic still validates the finished object at the end, same as before. This layer sits earlier and answers a narrower question: not "is this valid," but "is this done."
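The buffer-per-call idea can be sketched in a few lines. This assumes each streamed fragment arrives tagged with the ID of the tool call it belongs to. The `(call_id, fragment)` tuple shape and the `PerCallBuffers` name are illustrative, not any provider's actual event format.

```python
import json
from collections import defaultdict


class PerCallBuffers:
    """Keep one growing text buffer per tool-call ID."""

    def __init__(self):
        self._buffers = defaultdict(str)

    def feed(self, call_id: str, fragment: str) -> str:
        self._buffers[call_id] += fragment
        return self._buffers[call_id]

    def text(self, call_id: str) -> str:
        return self._buffers[call_id]


# Two parallel calls whose fragments arrive interleaved.
interleaved = [
    ("call_a", '{"clause": "Term'),
    ("call_b", '{"clause": "Renew'),
    ("call_a", 'ination"}'),
    ("call_b", 'al"}'),
]

buffers = PerCallBuffers()
single_buffer = ""  # what a buffer-per-response approach would accumulate
for call_id, fragment in interleaved:
    buffers.feed(call_id, fragment)
    single_buffer += fragment

clause_a = json.loads(buffers.text("call_a"))
clause_b = json.loads(buffers.text("call_b"))
```

`single_buffer` ends up as a splice of both calls, while each per-call buffer holds exactly one call's JSON.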

The part that's easy to miss

The instinct once you've built something like this is to treat it as purely a UI nicety, debounce the flicker, smooth the experience. But the actual failure we hit wasn't a UI glitch. It was an epistemic one: a human formed a belief about a fact ("counterparty is Meridian Hold, whatever that is") from data that hadn't finished existing yet. The fix isn't cosmetic. It's a correctness boundary between "data that exists" and "data that is still in the process of becoming data," and once you see it that way, treating it as a parsing convenience undersells what's actually broken.

Where I'd push back on this

The strongest objection to all of the above is that schema-aware incremental parsing adds real complexity, a stateful cursor per field, careful handling of escape sequences, tool-call ID bookkeeping, for a problem that a simpler mitigation might solve just as well: don't stream field values into a UI at all, stream only a coarse progress indicator ("extracting... 60%"), and reveal the full object once json.loads succeeds on the complete response. That removes the entire failure class in one move, and for a lot of products that's the right call, especially early on, when the team building the UI doesn't want to own a state machine.

I'd accept that objection for a lot of use cases. Where I wouldn't accept it is anywhere the whole point of streaming was the perceived-latency win, which was our actual reason for building this. If the product requirement is "show the paralegal something before the model finishes," you've already decided you need partial data, and the choice is really between partial data that's honest about its own completeness and partial data that isn't. Once you frame it that way, the extra state tracking looks less like gold-plating and more like the minimum bar for showing someone a fact you're asking them to trust.

The other pushback worth taking seriously: none of this catches a value that's structurally complete but semantically wrong, a well-formed string that just happens to be the wrong string. That's a different failure class with a different fix, and conflating the two is its own mistake.

Structured output broke on us three times. The third time taught us operator-ready.

James O'Connor — Fri, 03 Jul 2026 00:32:22 +0000

Structured output broke on us three times. The third time taught us what "operator-ready" means.

Last quarter we shipped a contract-extraction agent to an enterprise legal team. Schema validation passing at 97%. Human reviewers satisfied with the output quality in testing. Rollout went smoothly.

Then it broke. Three times. In three completely different ways.

The first two failures we fixed with better prompts and stricter schemas. The third one taught us something the first two hadn't: that "operator-ready" is not a technical checklist. It's a claim about your agent's behavior under conditions you didn't design it for.

Failure one: the validation paradox

Week two. A lease agreement came through with a renewal clause formatted as a table instead of prose. Our extractor looked for renewal terms in a specific JSON path. The table format populated the schema differently. Validation passed. The extracted renewal date was off by two years.

The fix was obvious in retrospect: add a canonical-format normalization step before extraction. But the lesson was sharper than that.

Schema validation tells you the shape of the output, not whether the content is correct. A JSON object with the right keys and the right types can still contain wrong values. Our 97% validation success rate was measuring the wrong thing. It was measuring structure conformance, not content accuracy.

After this failure, we separated validation into two signals: schema validity (does the object have the required fields) and field confidence (do we have evidence the content is correct). We started logging both. An output is trusted only when both signals are above threshold.
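A minimal sketch of that two-signal gate follows. The field names, the 0.8 threshold, and the idea that a per-field confidence score is already available from somewhere upstream are all assumptions; the post does not say how the confidence is produced.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("counterparty", "renewal_date")  # hypothetical schema
CONFIDENCE_THRESHOLD = 0.8                           # assumed cutoff


@dataclass
class TrustDecision:
    schema_valid: bool
    min_field_confidence: float
    trusted: bool


def assess(output: dict, field_confidence: dict) -> TrustDecision:
    """Trust an output only when structure AND per-field evidence both pass."""
    schema_valid = all(output.get(f) not in (None, "") for f in REQUIRED_FIELDS)
    min_conf = min(field_confidence.get(f, 0.0) for f in REQUIRED_FIELDS)
    trusted = schema_valid and min_conf >= CONFIDENCE_THRESHOLD
    return TrustDecision(schema_valid, min_conf, trusted)


# Shape is fine, but the evidence for renewal_date is weak.
decision = assess(
    {"counterparty": "Acme Industries, Inc.", "renewal_date": "2029-03-01"},
    {"counterparty": 0.97, "renewal_date": 0.35},
)
```

Logging both fields of the decision, not only `trusted`, is what lets you see validation passing while confidence fails.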

Failure two: the retry loop that lies

Month one. A particular clause type appeared in a contract format our test set didn't cover. The extractor failed schema validation on the first attempt. Our retry logic kicked in, filled missing fields with model-inferred defaults, and passed validation on the third try.

The output looked right. The content was wrong. The inferred defaults were plausible values that did not match the actual contract.

No alert fired. No human review was triggered. The error surfaced three weeks later when the legal team flagged a discrepancy in a signed agreement.

This is the retry paradox: the retry loop is supposed to handle uncertainty, but in practice it converts "the model doesn't know" into "the model confidently guessed." The schema never sees the difference.

The fix: when an attempt fails validation because the content is missing (not because the format is off), the correct behavior is a human-review flag, not a default fill. "I cannot extract this clause with confidence" is a better output than a wrong value that passes validation.

We changed the retry logic to distinguish format failures (retry and reformat) from content failures (flag for review). The human-review rate went up. The silent error rate went to zero.
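Roughly, the split looks like the sketch below. The required field names, the `NeedsHumanReview` exception, and the `generate(attempt)` callable standing in for the model call are all hypothetical. The rule it illustrates is that only unparseable output is retried, and missing content is never filled in.

```python
import json

REQUIRED = ("clause_type", "clause_text")  # hypothetical required fields


class NeedsHumanReview(Exception):
    """Raised instead of filling in a value the model could not extract."""


def classify_failure(raw: str):
    """Return None if usable, 'format' if unparseable, 'content' if fields are missing."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return "format"
    if not isinstance(obj, dict):
        return "format"
    if any(obj.get(k) in (None, "") for k in REQUIRED):
        return "content"
    return None


def extract_with_retry(generate, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        raw = generate(attempt)
        failure = classify_failure(raw)
        if failure is None:
            return json.loads(raw)
        if failure == "content":
            # Retrying would only invite a guess; stop and flag it.
            raise NeedsHumanReview("required content missing from extraction")
    raise NeedsHumanReview("output never became well-formed")


# Format failure: the second attempt is well-formed, so the retry is legitimate.
recovered = extract_with_retry(
    lambda n: "not json" if n == 0
    else '{"clause_type": "renewal", "clause_text": "Renews annually."}'
)

# Content failure: valid JSON with an empty field goes straight to review.
try:
    extract_with_retry(lambda n: '{"clause_type": "renewal", "clause_text": ""}')
    flagged = False
except NeedsHumanReview:
    flagged = True
```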

Failure three: the operator's data

This one took longer to understand.

Six weeks in, a new batch of contracts arrived from a subsidiary the legal team had recently acquired. Different contract structure, different clause naming conventions, different language patterns. Our extraction accuracy dropped from 94% on the training-corpus contracts to 61% on the acquired subsidiary's contracts.

We had not seen a single document from that subsidiary during development. Neither had our test suite.

This is the distribution shift problem. And it is the actual definition of not-operator-ready.

Production-ready means your agent handles the inputs you tested it on. Operator-ready means your agent handles the inputs the operator is actually going to give it. Those are not the same set.

The fix was not a better model or a better prompt. It was a process change: before any operator handoff, run the agent on a sample of the operator's own documents, measure accuracy on that corpus specifically, and establish a baseline before you commit to SLA numbers.

We now require 50 documents from the operator's corpus as part of the pre-handoff checklist. Not synthetic. Not ours. Theirs. If the accuracy on those 50 documents is not close to the accuracy on our training corpus, the handoff gets delayed until we understand why.
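One way to make "close to" concrete is to put an interval around the 50-document accuracy, because a sample that small is noisy. The sketch below uses a Wilson score interval. The 5-point tolerance and the decision rule (delay only when even the optimistic end of the interval falls short) are assumptions, not the rule from our checklist.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Approximate 95% Wilson score interval for a proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def handoff_ready(correct: int, total: int, baseline: float, max_gap: float = 0.05) -> bool:
    """Assumed rule: delay when even the upper bound sits below baseline minus the gap."""
    _, high = wilson_interval(correct, total)
    return high >= baseline - max_gap


low, high = wilson_interval(31, 50)                    # 31 of 50 correct, about 62%
subsidiary_ok = handoff_ready(31, 50, baseline=0.94)   # far below a 94% baseline
similar_ok = handoff_ready(47, 50, baseline=0.94)      # in line with the baseline
```

With 50 documents the interval spans roughly a quarter of the scale, which is why the check looks at the bound and not the point estimate.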

What these three failures have in common

All three were invisible to our eval suite. All three were visible with the right diagnostic.

The pattern: our eval was measuring our best case (our data, our test set, our format assumptions). Operator-ready means measuring the operator's case. Those are different measurement problems.

The three things we added to our pre-handoff process:

Field-level confidence scoring on every output (not just schema validity)
Content-failure-vs-format-failure separation in retry logic (fail loudly, not silently)
Operator corpus sampling before go-live (50 documents from their actual data, reviewed manually)

None of these are in the standard "production-ready" checklist. They're in the operator-ready checklist.

Where I'd push back on this

The common response to these failures is "just add more training data" or "fine-tune on the operator's corpus." That's the right long-term fix. It's not the short-term answer.

Fine-tuning takes weeks and requires labeling budget. An operator pilot that's already started does not have that runway. The faster path is: understand the distribution shift before you commit to accuracy numbers, not after you've already missed them.

There's also a steelman for the current "validation is enough" approach: for low-stakes use cases with structured, predictable inputs, schema validation really is sufficient. If every contract you're extracting is from the same template, format conformance and content accuracy are highly correlated.

The problem is that enterprise operators rarely have one template. The legal team that deployed our extractor manages contracts from 14 different counterparties, each with their own conventions. Validation-only was always going to break.

The concession I'll make: this is a data problem as much as an engineering problem. The teams that invest in building labeled corpora per operator will have substantially better outcomes than the teams that treat operator-ready as a single deployment decision. We didn't invest in that early enough. The second and third failures were partly the cost of that.

Operator-ready is not a state you reach. It's a process you run.

Evaluating an AI agent is not evaluating an LLM call:

James O'Connor — Sun, 28 Jun 2026 22:58:50 +0000

I compared six tools for evaluating AI agents: LangSmith, Galileo, Arize Phoenix, Braintrust, Future AGI, and Langfuse. My thesis, up front so you can argue with it early: the mistake that wastes the most time is grading the agent's final answer like it is a single LLM call. An agent has a trajectory (which tools it called, in what order, how it recovered), and a correctly reasoned final answer and a right-by-luck final answer look identical until you score the path. Here is the rundown as of June 2026.

The final answer is not the unit of evaluation

An LLM-call eval grades one output. An agent eval has to grade a sequence: did it call the right tool, with the right arguments, in a sensible order, and recover when a call failed. Two runs can produce the same final answer, one by reasoning correctly and one by luck, and only trajectory-level scoring tells them apart. If your agent eval only looks at the final response, you are testing a chatbot, not an agent.

The six, by how deep they score

LangSmith. The LangChain-native pick. Agent traces plus eval, automatic if you are on LangChain or LangGraph. Deep on traces, proprietary and coupled to that stack.

Galileo. The agent-focused eval pick. Built around agentic workflows with metrics aimed at tool use and task completion, managed.

Arize Phoenix. The open-source OTel pick. Span-level agent traces plus eval, self-hostable, good if you want trajectory visibility without a license.

Braintrust. The polished-SaaS pick. Strong eval and observability UI for agents, proprietary, no self-host.

Future AGI. The simulate-then-score pick. Their Simulation runs synthetic voice or text personas through your agent before prod, and agentic_eval scores the multi-turn trajectory, tool calls, stepwise reasoning, and the full conversation, not just the final output (github.com/future-agi, as of June 2026). The draw for me was running a synthetic-persona session through the agent like an integration test and then scoring the path it took, not only where it ended up. It is one option among several here, not the answer.

Langfuse. The open-source observability pick. Agent traces plus eval, self-hostable, framework-agnostic; the eval layer is lighter than the eval-specialist tools.

I am not crowning one. LangSmith if you live in LangChain, Phoenix or Langfuse for self-hosted OTel traces, Galileo or Braintrust for managed agent metrics, the simulate-then-score approach if you want to generate the sessions, not just observe them.

What I actually score on a trajectory

Tool-selection-correct (right tool for the step), tool-args-valid, recovery (did it handle a failed call gracefully), and only then final-answer-correct. The first three catch the agent-specific failures the final-answer score hides. The agent that reached a fine answer through three wrong tool calls is a latent incident, not a pass.
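A minimal sketch of how those four scores could be computed over a recorded trajectory. Everything named here (Step, score_trajectory, allowed_tools, arg_checks) is hypothetical and not taken from any of the six tools. It assumes you can say which tools are acceptable at each step, and that argument checks are keyed by tool name so they stay path-independent.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    tool: str               # tool the agent actually called
    args: dict[str, Any]    # arguments it passed
    ok: bool = True         # False if the tool call itself failed


def score_trajectory(
    steps: list[Step],
    allowed_tools: list[set[str]],                        # acceptable tools per step index
    arg_checks: dict[str, Callable[[dict[str, Any]], bool]],  # per-tool argument validators
    final_correct: bool,
) -> dict[str, Any]:
    n = len(steps)
    # Tool selection: was the called tool one of the acceptable ones at that step?
    tool_hits = sum(
        i < len(allowed_tools) and s.tool in allowed_tools[i]
        for i, s in enumerate(steps)
    )
    # Argument validity: checked per tool, independent of where the call sits in the path.
    arg_hits = sum(
        s.tool in arg_checks and bool(arg_checks[s.tool](s.args)) for s in steps
    )
    # Recovery (a crude proxy): every failed call is followed by a later successful call.
    recovered = all(
        any(later.ok for later in steps[i + 1:])
        for i, s in enumerate(steps)
        if not s.ok
    )
    return {
        "tool_selection_correct": tool_hits / n if n else 0.0,
        "tool_args_valid": arg_hits / n if n else 0.0,
        "recovery": recovered,
        "final_answer_correct": final_correct,
    }
```

Reporting the four scores separately, instead of folding them into one number, is the point: a run can score 1.0 on tool selection and still show the one bad argument that the final answer hides.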

Objections I'd accept / wouldn't

Accept: "single-turn metrics still matter." They do. They grade each response, and you want them. They just miss the cross-turn failures (state, tool ordering) that are the whole reason you built an agent rather than a chatbot, so they are necessary and not sufficient.

Wouldn't accept: "trajectory scoring is overkill, ship on final-answer accuracy." That is the position that produces the right-by-luck pass. The agent that stumbles to a correct answer through three wrong tool calls will fail differently next week, and your final-answer metric will not have warned you.

Where I'd push back on this

Steelmanning against myself: trajectory scoring assumes I know what the right path looks like, and for open-ended agents there is often more than one valid path to a good answer. A lot of what I call "wrong trajectory" might be "a reasonable path I did not anticipate," and if I over-fit my eval to one golden path I will punish agents for being creative in ways that are actually fine. The concession: I do not have a clean way to score "took a reasonable path I did not anticipate" without hand-labeling every trajectory. What I hold onto is narrower than full-path matching: tool-args-valid and graceful-recovery are path-independent, they are correct or not regardless of which route the agent took, so I trust those two even when I cannot agree on the one true path. If you have a way to score path-reasonableness without hand-labeling everything, that is the comment I want.

Pydantic passed. Types matched. The downstream system still got garbage.

James O'Connor — Thu, 25 Jun 2026 07:01:37 +0000

I want to walk through three production failures on the same contract-extraction agent, because they looked unrelated at the time and turned out to be the same problem wearing different clothes. My claim, stated up front so you can disagree with it early: schema validation tells you the grammar is correct and nothing about whether the meaning is. Those are two different jobs, and most teams (mine included, for a while) only build the first one.

Case 1: valid JSON, wrong semantics

The extractor used Claude 3.5 Sonnet with Pydantic schemas. A termination_clauses field accepted list[str]. Validation passed every time. The trouble was the model returned paraphrases, not verbatim clause text, and the downstream tool did exact-string matching against a database. Paraphrases never matched.

Pydantic had no way to catch this. The schema said list[str]. Strings arrived. Valid. The fix was a second-pass semantic check (a model call with a rubric asking, in effect, "are these strings verbatim from the source?"). Success on that field moved from 61% to 94%.

Lesson: structured-output validation is syntax validation. Semantic validation is a separate layer (and you have to build it on purpose).
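The fix described above was a model call with a rubric. As a cheaper first pass, a deterministic check can catch the obvious paraphrases before any model is involved. This is a sketch of that swapped-in technique, not the second-pass check from the post; the function names are hypothetical, and it assumes that collapsing whitespace is an acceptable normalization for your source documents.

```python
import re


def _collapse_ws(text: str) -> str:
    # Collapse runs of whitespace so line wraps in the source do not cause false alarms.
    return re.sub(r"\s+", " ", text).strip()


def non_verbatim(clauses: list[str], source: str) -> list[str]:
    """Return the extracted clauses that do not appear verbatim in the source text."""
    haystack = _collapse_ws(source)
    flagged = []
    for clause in clauses:
        needle = _collapse_ws(clause)
        # An empty string is a substring of everything, so treat it as a failure.
        if not needle or needle not in haystack:
            flagged.append(clause)
    return flagged
```

Anything this flags is definitely not verbatim. Anything it passes may still be the wrong clause, which is the part a rubric-based check would still have to cover.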

Case 2: the retry cost spike

Retry logic via tenacity. One customer's documents carried a dual-signatory clause with an optional co-signer. The schema expected co_signer: Optional[str]; the model kept returning nested objects instead. Each retry was about $0.04, and on the worst documents that compounded past $2 each (more than 50 retries on a single document) before anything escalated.

Two changes: cap retries at 5 with escalation to human review, and audit any new document type before it hits production.

Lesson: unlimited retry logic on validation failures is a latent billing incident (it just hasn't billed you yet).
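A minimal sketch of a capped retry loop with escalation, written with the standard library so it runs as-is and does not use tenacity. The names (extract_with_cap, ExtractionResult) are hypothetical, the sketch reads "cap at 5" as five attempts in total, and it assumes the validator raises ValueError on a bad payload; swap in whatever exception your validator raises.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

MAX_ATTEMPTS = 5         # cap from the post, read here as total attempts
COST_PER_ATTEMPT = 0.04  # approximate per-call cost from the post, in USD


@dataclass
class ExtractionResult:
    value: Optional[dict[str, Any]]  # validated payload, or None if escalated
    attempts: int
    escalated: bool                  # True means: route to human review
    est_cost: float


def extract_with_cap(
    call_model: Callable[[], dict[str, Any]],
    validate: Callable[[dict[str, Any]], dict[str, Any]],
    max_attempts: int = MAX_ATTEMPTS,
) -> ExtractionResult:
    for attempt in range(1, max_attempts + 1):
        raw = call_model()
        try:
            value = validate(raw)
        except ValueError:
            continue  # validation failed; retry until the cap is hit
        return ExtractionResult(value, attempt, False, attempt * COST_PER_ATTEMPT)
    # Cap reached: stop spending and hand the document to a human.
    return ExtractionResult(None, max_attempts, True, max_attempts * COST_PER_ATTEMPT)
```

With the cap, the worst case per document is bounded at about $0.20 under the post's per-call cost, and the failure shows up as an escalation, not as a line on the bill.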

Case 3: the model-switch regression

We moved from GPT-4o to GPT-4.5. Success on party_obligations (a field that needs three-level nesting for conditional logic) fell from 91% to 73%. The newer model handled ambiguous cases with flatter structures. Valid JSON, wrong nesting, Pydantic waved it through, downstream broke quietly.

The fix was shadow evaluation after any upgrade: run old and new models against the same production documents, and flag any field where agreement drops below 95% before shipping.

Lesson: model upgrades are schema-compatibility events (treat them like a dependency bump, not a free swap).
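The shadow evaluation above comes down to a per-field agreement rate between the two models' outputs on the same documents. A minimal sketch, assuming each output is a plain dict per document, that the two lists are aligned by document, and that exact equality is the right notion of agreement (for nested fields it means the nesting must match too). The function names are hypothetical.

```python
from typing import Any


def field_agreement(
    old_outputs: list[dict[str, Any]],
    new_outputs: list[dict[str, Any]],
) -> dict[str, float]:
    """Fraction of documents on which old and new model agree, per field."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("outputs must be aligned: one old and one new per document")
    n = len(old_outputs)
    fields = set()
    for doc in old_outputs + new_outputs:
        fields.update(doc.keys())
    return {
        f: sum(o.get(f) == nw.get(f) for o, nw in zip(old_outputs, new_outputs)) / n
        for f in fields
    } if n else {}


def regressions(
    old_outputs: list[dict[str, Any]],
    new_outputs: list[dict[str, Any]],
    threshold: float = 0.95,  # agreement floor from the post
) -> dict[str, float]:
    """Fields whose agreement rate falls below the threshold."""
    rates = field_agreement(old_outputs, new_outputs)
    return {f: r for f, r in rates.items() if r < threshold}
```

Agreement is not accuracy: a field where both models are wrong in the same way passes this check. It flags where behavior changed, which is what a dependency bump needs.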

The common thread

None of these surfaced as a Pydantic error. The schema was valid each time. The real failures were semantic drift, an uncontrolled retry loop, and a model-specific regression. In every case the grammar was fine and the meaning was not, which is precisely the thing type validation cannot see.

What the stack looks like now: Pydantic for syntax, a lightweight evaluator for semantics, DeepEval's correctness metric for the text fields, retries capped, an escalation field on every extraction schema so failure modes are a design-time decision, and a shadow-eval checklist of 200 production documents on any model change.

Objections I'd accept / wouldn't

Accept: "stricter schemas would have caught some of this." Partly true. Enums, discriminated unions, and constrained types genuinely shrink the semantic-validation surface when your domain is stable and bounded. If that's you, lean on them.

Wouldn't accept: "so you don't need eval around structured output." Three production failures, two of them customer escalations, disagree. Stricter types reduce the surface; they do not remove it, and they get brittle the moment a new document shape arrives.

Where I'd push back on this

If I'm steelmanning the opposite of my own thesis: maybe the honest read is that I under-specified my schemas and called it a semantics problem to feel better about it. A verbatim-quote field could have been a constrained type backed by a span reference into the source, not a free str. A lot of what I'm calling "semantic validation" is really "validation I was too lazy to encode structurally."
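To make the span-reference idea concrete, here is a minimal sketch of a quote type that stores offsets into the source and never stores free text, so the quote is verbatim by construction. SourceSpan and locate are hypothetical names, it uses a plain dataclass instead of a Pydantic type to stay self-contained, and it assumes the source text is available wherever the span is resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    start: int  # inclusive character offset into the source
    end: int    # exclusive character offset into the source

    def resolve(self, source: str) -> str:
        # The quote is whatever the source says at these offsets, nothing else.
        if not (0 <= self.start < self.end <= len(source)):
            raise ValueError(f"span {self.start}:{self.end} is outside the source")
        return source[self.start:self.end]


def locate(quote: str, source: str) -> SourceSpan:
    """Turn a model-returned string into a span, or fail loudly if it is not verbatim."""
    start = source.find(quote) if quote else -1
    if start == -1:
        raise ValueError("quote does not occur verbatim in the source")
    return SourceSpan(start, start + len(quote))
```

A paraphrase cannot survive locate, so the failure moves from a silent downstream mismatch to an error at extraction time.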

So here's the concession. If you have shipped high-volume extraction without a semantic eval layer and held accuracy above 92% for more than six months, I'd genuinely like to see the schema design, because either you bounded the domain harder than I did, or you encoded meaning into types better than I did. The part I won't give up: somewhere, a field has to assert meaning, and if it isn't your schema doing it, it has to be something downstream of the schema.

I put 6 LLM guardrail tools inline and measured what they cost me. Here is the latency-vs-recall tradeoff.

James O'Connor — Thu, 18 Jun 2026 05:50:48 +0000

An input guardrail runs on every request. Too slow and you rip it out; fast but blind and you get owned. That tradeoff, not the feature list, is the whole decision.

TL;DR: I ran six guardrail and prompt-injection tools inline on a production agent for a few weeks (Lakera Guard, Llama Guard, NeMo Guardrails, Guardrails AI, Future AGI's fi.evals scanners, and ProtectAI's LLM Guard). The deciding axis was not which one detects the most attack types; it was which one was fast enough to run on every request without anyone noticing, while still catching the injections that mattered. Here is the rundown as of June 2026.

An input guardrail sits on the hot path, so latency is the spec

A guardrail that inspects every prompt before it reaches the model adds to every request. Anything over about 50ms inline and users feel it; over about 200ms and someone disables it during an incident. So the real spec is narrow: catch the attack classes you care about (jailbreak, injection, PII or secret leak) inside a latency budget you can afford on the hot path. A 99 percent recall guardrail that adds 400ms is worse in practice than a 95 percent one at 10ms, because the slow one gets turned off.

The six, on the latency-vs-recall axis

Lakera Guard: the commercial-API pick. Strong prompt-injection detection, hosted, low effort to integrate. The tradeoff is a network hop per call (latency plus a third party in your request path) and per-call cost.
Llama Guard: Meta's open LLM-based safeguard model. Flexible policy taxonomy, runs on your own infra. It is an LLM, so it is the heaviest of these on latency unless you serve it carefully.
NeMo Guardrails: NVIDIA's open-source programmable rails (you write flows in Colang). Powerful for conversational and topical boundaries; more of a framework than a drop-in scanner, with the setup cost to match.
Future AGI fi.evals scanners: the inline-speed pick, from their Apache-2.0 ai-evaluation SDK (github.com/future-agi). Local scanners for jailbreak, code injection, PII, and secrets that block in under 10ms and tell you what tripped via result.blocked_by, as of June 2026. The draw was the latency: it runs on the hot path with no network hop, and the managed tier adds model-backed ensemble guardrails on top. Worth saying plainly: these cover attack and safety classes, not business-rule semantic checks.
Guardrails AI: the open-source validation-framework pick. A library of validators (structure, PII, toxicity) you compose; some are fast, some call a model, so your latency depends on which you switch on.
ProtectAI LLM Guard: open-source scanners for input and output (prompt injection, secrets, toxicity). Similar shape to a scanner pipeline; benchmark it against your own latency budget.

I am not crowning one. For lowest-effort hosted detection it was Lakera; for policy flexibility on your own infra, Llama Guard or NeMo; for inline speed with no network hop, the local-scanner approach. They sit at different points on the same curve.

What I gate on, and what I only log

Hard-gate (block the request) on the cheap, high-precision classes: secret and API-key leaks, obvious jailbreak strings, code injection. Log-and-alert (do not block) on the fuzzy classes where a false positive is worse than a miss, because blocking a legitimate user is its own incident. The split is by "how bad is a false positive here," the same logic as eval gating.
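The gate-versus-log split can be sketched as two scanner tables with different consequences. This does not use the API of any of the six tools. The regexes are illustrative placeholders rather than production-grade detectors, and which class lands in which table is an assumption you would set from your own false-positive costs.

```python
import re
from dataclasses import dataclass

# High-precision classes: a match blocks the request.
HARD_GATE = {
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "jailbreak_string": re.compile(r"ignore (?:all )?previous instructions", re.I),
}

# Fuzzy classes: a match is logged and alerted on, never blocked.
LOG_ONLY = {
    "roleplay_framing": re.compile(r"\bpretend (?:you are|to be)\b", re.I),
}


@dataclass
class Decision:
    blocked: bool
    tripped: list[str]  # hard-gate scanners that matched
    alerts: list[str]   # log-only scanners that matched


def check(prompt: str) -> Decision:
    tripped = [name for name, pat in HARD_GATE.items() if pat.search(prompt)]
    alerts = [name for name, pat in LOG_ONLY.items() if pat.search(prompt)]
    # Only hard-gate matches block; fuzzy matches ride along for later review.
    return Decision(blocked=bool(tripped), tripped=tripped, alerts=alerts)
```

Moving a class between the two tables is then a one-line config change, which keeps the "how bad is a false positive here" decision visible instead of buried in scanner code.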

FAQ

Inline or async? The cheap deterministic scanners go inline on the hot path; the heavy model-based ones run async or on a sample, unless you can afford the latency.
Do these catch business-logic abuse? No. They catch attack and safety classes (injection, PII, secrets). "The agent did something it should not for THIS user" is a semantic and authorization check you still have to write.
One tool or several? Usually a fast local scanner inline plus a heavier model-based check async. Different tools for different points on the curve.

Open question

Every one of these catches the attack classes you name in advance. The injection that gets through is the one shaped like a class you did not configure, and injection is adversarial, so the attack distribution shifts under you. I do not have a clean way to catch the novel injection that matches no configured scanner. If you have one, that is the comment I want.