<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nikhil Pareek</title>
    <description>The latest articles on DEV Community by Nikhil Pareek (@nikhil_pareek_13).</description>
    <link>https://dev.to/nikhil_pareek_13</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3963981%2Fd14e1668-ae02-4235-aeb8-31565aa3492c.jpeg</url>
      <title>DEV Community: Nikhil Pareek</title>
      <link>https://dev.to/nikhil_pareek_13</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nikhil_pareek_13"/>
    <language>en</language>
    <item>
      <title>Function-calling eval was a 2024 problem. Tool-using agents are the 2026 one.</title>
      <dc:creator>Nikhil Pareek</dc:creator>
      <pubDate>Wed, 03 Jun 2026 06:37:49 +0000</pubDate>
      <link>https://dev.to/nikhil_pareek_13/tool-call-accuracy-is-lying-to-you-a-four-layer-eval-stack-for-agents-523p</link>
      <guid>https://dev.to/nikhil_pareek_13/tool-call-accuracy-is-lying-to-you-a-four-layer-eval-stack-for-agents-523p</guid>
      <description>&lt;p&gt;Here's a trace that reset how I think about evaluating tool-calling agents.&lt;/p&gt;

&lt;p&gt;An agent tries to book a flight. It calls &lt;code&gt;search_flights&lt;/code&gt; with &lt;code&gt;departure_date="next Friday"&lt;/code&gt;. The endpoint expected an ISO date, so it returns a &lt;code&gt;400&lt;/code&gt;. The agent retries the same string four times, then apologizes to the user and gives up.&lt;/p&gt;

&lt;p&gt;Now the part that actually bothered me. &lt;strong&gt;Tool selection was correct.&lt;/strong&gt; The model picked the right function out of a registry of 28. My tool-selection accuracy logged a clean &lt;code&gt;1.0&lt;/code&gt;. The aggregate task-completion metric logged a &lt;code&gt;0&lt;/code&gt;. And neither number told me which of three things broke:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the argument was wrong,&lt;/li&gt;
&lt;li&gt;the model never read the &lt;code&gt;400&lt;/code&gt; body, or&lt;/li&gt;
&lt;li&gt;the retry policy looped on the same input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My eval wasn't wrong. It was asking the wrong question.&lt;/p&gt;

&lt;h2&gt;What "tool-call accuracy" actually grades&lt;/h2&gt;

&lt;p&gt;If the only thing you measure is &lt;em&gt;did the agent call the right tool&lt;/em&gt;, you're testing intent, not execution. Tool selection is necessary, not sufficient. It passes the moment the right function name shows up in the trace, completely blind to whether the arguments were garbage, whether the model read what came back, or whether it recovered from the &lt;code&gt;400&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's the gap. The metric checks that the agent &lt;em&gt;started&lt;/em&gt; the right way. Production needs to know whether it &lt;em&gt;finished&lt;/em&gt; the right way.&lt;/p&gt;

&lt;h2&gt;The reframe: it's four eval problems, not one&lt;/h2&gt;

&lt;p&gt;The thing I had to internalize is that tool-calling eval is four problems stacked, each with its own root cause:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tool selection:&lt;/strong&gt; right tool, or correctly &lt;em&gt;no&lt;/em&gt; tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argument extraction:&lt;/strong&gt; schema-valid and semantically correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result utilization:&lt;/strong&gt; did it actually use what the tool returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error recovery:&lt;/strong&gt; did it retry, fall back, or escalate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Score them separately and "the agent failed" collapses into "the argument extractor regressed on date strings on the flight-booking path." One bisect instead of three days.&lt;/p&gt;
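&lt;p&gt;As a rough illustration of why separate scores shorten the bisect, here is a minimal sketch that takes one score per layer and names the first layer that failed. The layer keys, the &lt;code&gt;first_failing_layer&lt;/code&gt; helper, the 0.5 threshold, and the example scores are all assumptions made up for this sketch, not part of any SDK.&lt;/p&gt;

```python
from typing import Optional

# Hypothetical layer names, ordered the way the stack is listed above.
LAYERS = (
    "tool_selection",
    "argument_extraction",
    "result_utilization",
    "error_recovery",
)


def first_failing_layer(scores: dict, threshold: float = 0.5) -> Optional[str]:
    """Return the first layer scoring below the threshold, or None if all pass."""
    for layer in LAYERS:
        score = scores.get(layer)  # None means "not applicable for this trace"
        if score is not None and threshold > score:
            return layer
    return None


# Illustrative scores for the flight-booking trace: the right tool was picked,
# the date argument was wrong, no payload came back, and the retries looped.
flight_trace = {
    "tool_selection": 1.0,
    "argument_extraction": 0.0,
    "result_utilization": None,
    "error_recovery": 0.0,
}

print(first_failing_layer(flight_trace))  # argument_extraction
```

&lt;p&gt;An aggregate task-completion score would report the same trace as a bare &lt;code&gt;0&lt;/code&gt;; the per-layer view points at the argument extractor first.&lt;/p&gt;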

&lt;h2&gt;What I rebuilt&lt;/h2&gt;

&lt;h3&gt;Layer 1: Tool selection (with the bucket everyone drops)&lt;/h3&gt;

&lt;p&gt;Score an exact match on the tool name per case, then aggregate as per-tool F1, so a 28-tool registry doesn't hide a regression on one rare endpoint behind a strong global mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi.evals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_name_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;predicted_tool&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ground_truth_tool&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The piece almost every post skips is the &lt;strong&gt;irrelevance bucket&lt;/strong&gt;: test cases where the gold answer is "no tool call" (a greeting, a clarification, a factual question the model can answer from its own knowledge). Without those, you can't catch the regression where a prompt revision makes the model bolder about calling &lt;code&gt;search&lt;/code&gt; on every input. BFCL added the bucket for exactly this reason; build it into your private set the same way.&lt;/p&gt;
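&lt;p&gt;One way to fold the irrelevance bucket into the same metric is to treat "no tool call" as just another label and compute F1 per label. The sketch below is a plain-Python illustration, not the SDK's implementation; the &lt;code&gt;NO_TOOL&lt;/code&gt; sentinel, the &lt;code&gt;per_tool_f1&lt;/code&gt; helper, and the &lt;code&gt;cancel_booking&lt;/code&gt; tool name are hypothetical.&lt;/p&gt;

```python
from collections import Counter

NO_TOOL = "__no_tool__"  # hypothetical sentinel for "the gold answer is no tool call"


def per_tool_f1(gold, predicted):
    """Per-label F1 over tool names, with NO_TOOL counted as its own label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label was a false positive
            fn[g] += 1  # the gold label was missed
    scores = {}
    for tool in set(gold) | set(predicted):
        denom = 2 * tp[tool] + fp[tool] + fn[tool]
        scores[tool] = 2 * tp[tool] / denom if denom else 0.0
    return scores


# Second case: the user only said hello, but the agent called search_flights.
gold = ["search_flights", NO_TOOL, NO_TOOL, "cancel_booking"]
predicted = ["search_flights", "search_flights", NO_TOOL, "cancel_booking"]

scores = per_tool_f1(gold, predicted)
for tool in sorted(scores):
    print(tool, round(scores[tool], 3))
```

&lt;p&gt;Here the over-eager call drags both &lt;code&gt;search_flights&lt;/code&gt; and the no-tool label down to about 0.667, while plain accuracy over the four cases would still read 0.75 and say nothing about which label moved.&lt;/p&gt;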

&lt;h3&gt;Layer 2: Argument extraction&lt;/h3&gt;

&lt;p&gt;Schema validation runs first and is deterministic. Pydantic on the model output is the cheapest possible gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SearchFlightsArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;departure_airport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^[A-Z]{3}$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;arrival_airport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^[A-Z]{3}$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;departure_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^\d{4}-\d{2}-\d{2}$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cabin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^(economy|premium|business|first)$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But schema-valid isn't correct. &lt;code&gt;departure_date="2026-01-01"&lt;/code&gt; validates fine and is still wrong if the user said "next Friday." That semantic class needs an LLM judge scoring whether the argument captured the user's intent. &lt;code&gt;customer_id="me"&lt;/code&gt; returning someone else's account is the failure that schema validation will never see.&lt;/p&gt;
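&lt;p&gt;For the narrow sub-class of relative dates, a deterministic check can run alongside the judge. The sketch below assumes "next Friday" means the first Friday strictly after the reference date (one reading of an ambiguous phrase) and that the eval case pins the date the user spoke on; &lt;code&gt;next_weekday&lt;/code&gt; and &lt;code&gt;date_argument_matches&lt;/code&gt; are hypothetical helpers, not part of any library.&lt;/p&gt;

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbering: Monday is 0


def next_weekday(reference: date, weekday: int) -> date:
    """First occurrence of `weekday` strictly after `reference`."""
    days_ahead = (weekday - reference.weekday() - 1) % 7 + 1
    return reference + timedelta(days=days_ahead)


def date_argument_matches(arg: str, reference: date, weekday: int) -> bool:
    """True only if `arg` is an ISO date equal to the resolved weekday."""
    try:
        parsed = date.fromisoformat(arg)
    except ValueError:
        return False  # not an ISO date at all, e.g. the raw phrase "next Friday"
    return parsed == next_weekday(reference, weekday)


reference = date(2026, 6, 3)  # a Wednesday, pinned by the test case

print(date_argument_matches("2026-06-05", reference, FRIDAY))   # True
print(date_argument_matches("2026-01-01", reference, FRIDAY))   # False
print(date_argument_matches("next Friday", reference, FRIDAY))  # False
```

&lt;p&gt;This only covers phrases you can resolve mechanically. Cases like &lt;code&gt;customer_id="me"&lt;/code&gt; still need the judge or a lookup against the session's identity.&lt;/p&gt;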

&lt;h3&gt;Layer 3: Result utilization (the layer most posts skip entirely)&lt;/h3&gt;

&lt;p&gt;The tool returned. Does the agent use the payload? Three patterns kept showing up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It paraphrases with a number flipped:&lt;/strong&gt; tool returns &lt;code&gt;amount_cents: 4500&lt;/code&gt;, agent says "your refund of $54.00 is processing."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It substitutes prior model knowledge:&lt;/strong&gt; &lt;code&gt;get_account_balance&lt;/code&gt; returns &lt;code&gt;12_400&lt;/code&gt;, model answers from a remembered "$200 threshold" instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It uses the result on turn 1, then drifts off it by turn 3:&lt;/strong&gt; quotes the right itinerary, then invents a contradicting baggage policy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rubric is Groundedness, except you point the context slot at the tool's return payload instead of a retrieved corpus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi.evals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Evaluator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi.evals.templates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groundedness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ContextAdherence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkAttribution&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi.testcases&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestCase&lt;/span&gt;

&lt;span class="c1"&gt;# Assumption: Evaluator() picks up its credentials from the environment;&lt;/span&gt;
&lt;span class="c1"&gt;# `ex`, `result`, and `tool_call` come from your own eval harness.&lt;/span&gt;
&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Evaluator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# The context slot holds the tool's return payload, not a retrieved corpus.&lt;/span&gt;
&lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;eval_templates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Groundedness&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;ContextAdherence&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;ChunkAttribution&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
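&lt;p&gt;The flipped-number pattern is narrow enough that a deterministic check can run before any judge. The sketch below is a stdlib-only illustration that pulls dollar amounts out of the agent's reply and compares them to the payload; it is a cheap complement to the Groundedness rubric, not a replacement. The &lt;code&gt;dollar_amounts_in_cents&lt;/code&gt; and &lt;code&gt;ungrounded_amounts&lt;/code&gt; helpers are hypothetical, and the regex assumes US-style amounts such as &lt;code&gt;$1,234.50&lt;/code&gt;.&lt;/p&gt;

```python
import re
from decimal import Decimal

# Assumption: amounts look like $45, $45.00, or $1,234.50.
DOLLAR_PATTERN = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d{2})?)")


def dollar_amounts_in_cents(text: str) -> set:
    """Every dollar amount mentioned in `text`, converted to integer cents."""
    amounts = set()
    for match in DOLLAR_PATTERN.findall(text):
        amounts.add(int(Decimal(match.replace(",", "")) * 100))
    return amounts


def ungrounded_amounts(response: str, payload: dict, field: str = "amount_cents") -> set:
    """Amounts the agent stated that do not match the tool payload."""
    return dollar_amounts_in_cents(response) - {payload[field]}


payload = {"amount_cents": 4500}

print(ungrounded_amounts("Your refund of $54.00 is processing.", payload))  # {5400}
print(ungrounded_amounts("Your refund of $45.00 is processing.", payload))  # set()
```

&lt;p&gt;A non-empty set is a hard failure that needs no judge call. The other two patterns (substituted prior knowledge, drift by turn 3) are where the LLM rubric earns its cost.&lt;/p&gt;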



&lt;h3&gt;Layer 4: Error recovery&lt;/h3&gt;

&lt;p&gt;When the tool 4xx-es or times out, the agent's next move is the eval surface. Did it read the error and correct, or resend the same broken string? Did it fall back when the primary was down? Did it stop at a sane retry cap (3 is a common cap; 6 usually means the loop guard is missing)? This is trajectory-level, not per-call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi.evals.metrics.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrajectoryScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentTrajectoryInput&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi.evals.metrics.agents.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentStep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskDefinition&lt;/span&gt;

&lt;span class="n"&gt;trajectory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentTrajectoryInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;trajectory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AgentStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_used&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent_steps&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TaskDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expected_goal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;available_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;registered_tools&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrajectoryScore&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;compute_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trajectory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
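&lt;p&gt;Independently of the trajectory score, the "resent the same broken string" case can be flagged with a few lines of plain Python. This is a sketch under assumptions: each step is a dict with hypothetical &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;args&lt;/code&gt;, and &lt;code&gt;error&lt;/code&gt; keys (with &lt;code&gt;error&lt;/code&gt; set to &lt;code&gt;None&lt;/code&gt; on success), and &lt;code&gt;max_identical_failed_calls&lt;/code&gt; is a made-up helper, not an SDK function.&lt;/p&gt;

```python
import json


def max_identical_failed_calls(steps) -> int:
    """Longest run of consecutive failed calls with the same tool and arguments."""
    longest = 0
    run = 0
    previous = None
    for step in steps:
        if step.get("error") is None:
            run, previous = 0, None  # a successful call ends the streak
            continue
        signature = (step["tool"], json.dumps(step["args"], sort_keys=True))
        run = run + 1 if signature == previous else 1
        previous = signature
        longest = max(longest, run)
    return longest


# The flight-booking trace: one failed call plus four identical retries.
looping_trace = [
    {"tool": "search_flights", "args": {"departure_date": "next Friday"}, "error": "400"}
] * 5

# A recovered trace: the agent read the 400 and corrected the argument.
recovered_trace = [
    {"tool": "search_flights", "args": {"departure_date": "next Friday"}, "error": "400"},
    {"tool": "search_flights", "args": {"departure_date": "2026-06-05"}, "error": None},
]

print(max_identical_failed_calls(looping_trace))    # 5
print(max_identical_failed_calls(recovered_trace))  # 1
```

&lt;p&gt;Any value above 1 means the agent resent an input it had already been told was wrong, which is a signal that the error body never made it into the next decision.&lt;/p&gt;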



&lt;h2&gt;The math that makes all of this non-optional&lt;/h2&gt;

&lt;p&gt;End-to-end success on a &lt;em&gt;k&lt;/em&gt;-step agent is roughly the product of per-step success rates.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;95% per step over 8 steps lands near &lt;strong&gt;66%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;99% per step over 8 steps lands near &lt;strong&gt;92%&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
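&lt;p&gt;The two figures above fall straight out of the product rule. A minimal sketch, assuming every step succeeds independently with the same probability (real agents won't match this exactly); both function names are made up for the example.&lt;/p&gt;

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step rate needed to reach `target` end-to-end over `steps` steps."""
    return target ** (1 / steps)


print(round(end_to_end_success(0.95, 8), 3))  # 0.663
print(round(end_to_end_success(0.99, 8), 3))  # 0.923
print(round(required_per_step(0.90, 8), 3))   # 0.987
```

&lt;p&gt;Read the last line backwards: to ship a 90% end-to-end rate on an 8-step task, each step has to clear roughly 98.7%.&lt;/p&gt;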

&lt;p&gt;Roughly one in three sessions ending wrong while every per-step metric sits at 95% isn't a hypothetical. It's the default math, and it's the most common reason teams ship agents that pass eval and tank in production.&lt;/p&gt;

&lt;p&gt;The fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Score the trajectory as a unit (per-step rubric is the gate, trajectory metric is the truth).&lt;/li&gt;
&lt;li&gt;Treat anything longer than five steps as suspect and decompose it.&lt;/li&gt;
&lt;li&gt;Reserve a &lt;code&gt;pass^k&lt;/code&gt; consistency slice: run 30 hard cases &lt;em&gt;k&lt;/em&gt; times each and report the fraction that succeed on all &lt;em&gt;k&lt;/em&gt; runs. When it moves, the planner regressed, not the tools.&lt;/li&gt;
&lt;/ul&gt;
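&lt;p&gt;The &lt;code&gt;pass^k&lt;/code&gt; slice, as defined in the list above, is simple to compute once each case has &lt;em&gt;k&lt;/em&gt; recorded outcomes. The sketch below uses that direct definition (a case counts only if every rollout passed); the &lt;code&gt;pass_hat_k&lt;/code&gt; name, the case IDs, and the outcomes are invented for illustration.&lt;/p&gt;

```python
def pass_hat_k(results: dict) -> float:
    """Fraction of cases whose rollouts all succeeded.

    `results` maps a case id to a list of k booleans, one per rollout.
    """
    if not results:
        raise ValueError("need at least one case")
    passed_all = sum(1 for runs in results.values() if all(runs))
    return passed_all / len(results)


# Three hypothetical hard cases, k = 4 rollouts each.
results = {
    "rebook_after_cancel": [True, True, True, True],
    "multi_city_refund": [True, False, True, True],
    "ambiguous_date": [True, True, True, True],
}

print(round(pass_hat_k(results), 3))  # 0.667
```

&lt;p&gt;The average pass rate across these twelve rollouts is about 92%, yet only two of three cases are reliable. That gap is what the slice is there to surface.&lt;/p&gt;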

&lt;h2&gt;What I still use public benchmarks for&lt;/h2&gt;

&lt;p&gt;I didn't throw out BFCL or τ-bench, I just stopped pretending they gate production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BFCL&lt;/strong&gt; tells you whether the underlying model can call tools at all (AST, executable, irrelevance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;τ-bench&lt;/strong&gt; tells you about multi-turn reliability. Even GPT-4o lands below 25% at &lt;code&gt;pass^8&lt;/code&gt; on retail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are a model-selection floor. Neither knows anything about your registry, your schemas, your error codes, or your business policy. The private eval set, stratified by tool, argument-edge-case, and error code, with failing production traces promoted in weekly, is the one that gates the ship.&lt;/p&gt;

&lt;h2&gt;What I'd do differently&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Score per-layer from day one&lt;/strong&gt;, not aggregate task-completion. One rubric per layer costs more than a single aggregate score, but when CI fails, the failing layer name &lt;em&gt;is&lt;/em&gt; the root cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat groundedness-on-tool-output as noisier than on a retrieved corpus.&lt;/strong&gt; Payloads are JSON, so the rubric has to reason over fields rather than prose. Pin a small human-labelled calibration set and re-tune monthly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the &lt;code&gt;pass^k&lt;/code&gt; slice on release candidates, not every PR.&lt;/strong&gt; 30 cases × 8 rollouts is 240 agent runs. Worth it at the right cadence, painful as a per-commit gate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running tool-calling agents in production on aggregate task-completion alone, you're flying with one eye closed.&lt;/p&gt;

&lt;h2&gt;Curious about your setup&lt;/h2&gt;

&lt;p&gt;Anyone else been bitten by the green-everywhere-but-broken trace? Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you score arguments semantically, or stop at schema validation?&lt;/li&gt;
&lt;li&gt;Result utilization: are you grounding against the tool payload, or only the retrieved corpus?&lt;/li&gt;
&lt;li&gt;How much do you trust LLM-as-judge for grounding on live production traffic?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop a comment, I read all of them. The four-layer stack runs on an open-source eval SDK too, so if you want to get started, say the word and I'll share the link.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
