Nikhil Pareek

Posted on Jun 3

Function-calling eval was a 2024 problem. Tool-using agents are the 2026 one.

#ai #llm #agents #testing

Here's a trace that reset how I think about evaluating tool-calling agents.

An agent tries to book a flight. It calls search_flights with departure_date="next Friday". The endpoint expected an ISO date, so it returns a 400. The agent retries the same string four times, then apologizes to the user and gives up.

Now the part that actually bothered me. Tool selection was correct. The model picked the right function out of a registry of 28. My tool-selection accuracy logged a clean 1.0. The aggregate task-completion logged a 0. And neither number told me which of three things broke:

the argument was wrong,
the model never read the 400 body, or
the retry policy looped on the same input.

My eval wasn't wrong. It was asking the wrong question.

What "tool-call accuracy" actually grades

If the only thing you measure is did the agent call the right tool, you're testing intent, not execution. Tool selection is necessary, not sufficient. It passes the moment the right function name shows up in the trace, completely blind to whether the arguments were garbage, whether the model read what came back, or whether it recovered from the 400.

That's the gap. The metric checks that the agent started the right way. Production needs to know whether it finished the right way.

The reframe: it's four eval problems, not one

The thing I had to internalize is that tool-calling eval is four problems stacked, each with its own root cause:

Tool selection: right tool, or correctly no tool
Argument extraction: schema-valid and semantically correct
Result utilization: did it actually use what the tool returned
Error recovery: did it retry, fall back, or escalate

Score them separately and "the agent failed" collapses into "the argument extractor regressed on date strings on the flight-booking path." One bisect instead of three days.
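Here's a minimal sketch of what "score them separately" can look like in a CI report. The `LayerScores` record, the layer names, and the 0.5 threshold are all hypothetical, not part of any SDK. The idea is only that the first failing layer in pipeline order is the likeliest root cause.

```python
from dataclasses import dataclass
from typing import Optional

# Layer names in pipeline order (hypothetical labels for this sketch).
LAYERS = ("tool_selection", "argument_extraction",
          "result_utilization", "error_recovery")


@dataclass
class LayerScores:
    tool_selection: float
    argument_extraction: float
    result_utilization: float
    error_recovery: float


def first_failing_layer(scores: LayerScores, threshold: float = 0.5) -> Optional[str]:
    """Return the earliest layer scoring below the threshold, or None if all pass."""
    for name in LAYERS:
        if getattr(scores, name) < threshold:
            return name
    return None


# Illustrative scores for the flight-booking trace: right tool, bad argument.
flight_trace = LayerScores(tool_selection=1.0, argument_extraction=0.0,
                           result_utilization=0.0, error_recovery=0.0)
print(first_failing_layer(flight_trace))
```

The aggregate task-completion score for that trace is still 0, but the report now names the layer to bisect.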

What I rebuilt

Layer 1: Tool selection (with the bucket everyone drops)

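Per-tool F1 on the tool name rather than one global accuracy, so a 28-tool registry can't hide a regression on one rare endpoint behind a strong mean. The per-case check is an exact match on the function name, aggregated per tool afterwards: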

from fi.evals import evaluate

result = evaluate("function_name_match",
    output={"function_name": predicted_tool},
    expected={"function_name": ground_truth_tool})

The piece almost every post skips is the irrelevance bucket: test cases where the gold answer is "no tool call" (a greeting, a clarification, an in-model factual question). Without those, you can't catch the regression where a prompt revision makes the model bolder about calling search on every input. BFCL added the bucket for exactly this reason; build it into your private set the same way.
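A rough sketch of how the irrelevance bucket can fold into per-tool F1, using only the standard library. The `NO_TOOL` label and the tiny dataset are made up for illustration. The point is that "no tool call" gets scored as just another label, so over-calling shows up as a low F1 on that bucket.

```python
from collections import Counter

NO_TOOL = "<no_tool>"  # hypothetical label for cases whose gold answer is "no call"


def per_tool_f1(gold, predicted):
    """Per-label F1 over tool names, treating NO_TOOL as one more label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong for this case
            fn[g] += 1  # gold label was missed for this case
    scores = {}
    for label in set(gold) | set(predicted):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# One greeting gets an unnecessary search_flights call.
gold = ["search_flights", "search_flights", NO_TOOL, NO_TOOL, "cancel_booking"]
pred = ["search_flights", "search_flights", "search_flights", NO_TOOL, "cancel_booking"]

scores = per_tool_f1(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)
for label, value in sorted(scores.items()):
    print(f"{label}: {value:.2f}")
```

Plain accuracy on this toy set is 4 out of 5. The per-label view shows where the miss landed: the no-tool bucket drops to about 0.67 while `cancel_booking` stays at 1.0.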

Layer 2: Argument extraction

Schema validation runs first and is deterministic. Pydantic on the model output is the cheapest possible gate:

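from pydantic import BaseModel, Field, ValidationError

class SearchFlightsArgs(BaseModel):
    departure_airport: str = Field(pattern=r"^[A-Z]{3}$")
    arrival_airport: str = Field(pattern=r"^[A-Z]{3}$")
    departure_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")  # ISO 8601 date
    cabin: str = Field(pattern=r"^(economy|premium|business|first)$")

# tool_args is the raw argument dict the model emitted for search_flights
try:
    SearchFlightsArgs.model_validate(tool_args)
except ValidationError as err:
    print(err)  # departure_date="next Friday" fails the pattern here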

But schema-valid isn't correct. departure_date="2026-01-01" validates fine and is still wrong if the user said "next Friday." That semantic class needs an LLM judge scoring whether the argument captured the user's intent. customer_id="me" returning someone else's account is the failure that schema validation will never see.
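One more deterministic check is worth running before the judge. A date regex like the one above only checks the shape, so it also accepts strings such as "2026-13-45". Here's a small standard-library sketch that adds a calendar check; the helper name `is_real_iso_date` is my own, not from any library.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_real_iso_date(value: str) -> bool:
    """True only for YYYY-MM-DD strings that name a date that exists."""
    if not ISO_DATE.fullmatch(value):
        return False  # wrong shape, e.g. "next Friday"
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, impossible date, e.g. month 13
    return True


for candidate in ("2026-01-01", "next Friday", "2026-13-45", "2026-02-30"):
    print(candidate, is_real_iso_date(candidate))
```

This still says nothing about whether "2026-01-01" is the date the user meant. It only keeps the cheap failures out of the judge's queue.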

Layer 3: Result utilization (the layer most posts skip entirely)

The tool returned. Does the agent use the payload? Three patterns kept showing up:

It paraphrases with a number flipped: tool returns amount_cents: 4500, agent says "your refund of $54.00 is processing."
It substitutes prior model knowledge: get_account_balance returns 12_400, model answers from a remembered "$200 threshold" instead.
It uses the result on turn 1, then drifts off it by turn 3: quotes the right itinerary, then invents a contradicting baggage policy.

The rubric is Groundedness, except you point the context slot at the tool's return payload instead of a retrieved corpus:

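import json

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, ChunkAttribution
from fi.testcases import TestCase

# Assumes SDK credentials are already configured; check the SDK docs for constructor arguments.
evaluator = Evaluator()

# ex is the eval case, result the agent's reply, tool_call the recorded call for this turn
tc = TestCase(input=ex.user_message, output=result.response,
              context=json.dumps(tool_call.result))
scores = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextAdherence(), ChunkAttribution()],
    inputs=tc)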
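The flipped-number pattern is also cheap to catch deterministically before any judge runs. Below is a rough sketch, assuming the payload exposes amounts as integer cents (as in the `amount_cents: 4500` example) and the agent writes them as `$`-prefixed figures. The helper names are hypothetical.

```python
import re
from decimal import Decimal

# Matches "$54.00", "$45", "$1,200.50"; a deliberately narrow pattern for this sketch.
DOLLAR_AMOUNT = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")


def dollar_amounts_in_cents(text: str) -> set[int]:
    """Every $-prefixed amount in the text, converted to integer cents."""
    amounts = set()
    for raw in DOLLAR_AMOUNT.findall(text):
        amounts.add(int(Decimal(raw.replace(",", "")) * 100))
    return amounts


def amounts_grounded(response: str, payload_cents: set[int]) -> bool:
    """True if every dollar amount the agent states appears in the tool payload."""
    return dollar_amounts_in_cents(response) <= payload_cents


payload = {"amount_cents": 4500}
allowed = {payload["amount_cents"]}
print(amounts_grounded("your refund of $54.00 is processing.", allowed))
print(amounts_grounded("your refund of $45.00 is processing.", allowed))
```

A response that states no amounts passes this check vacuously, so it works as a pre-filter in front of the groundedness rubric, not a replacement for it.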

Layer 4: Error recovery

When the tool 4xx-es or times out, the agent's next move is the eval surface. Did it read the error and correct the input, or resend the same broken string? Did it fall back when the primary was down? Did it stop at a sane retry cap (3 is a common floor; 6 usually means the loop guard is missing)? This is trajectory-level, not per-call:

from fi.evals.metrics.agents import TrajectoryScore, AgentTrajectoryInput
from fi.evals.metrics.agents.types import AgentStep, TaskDefinition

trajectory = AgentTrajectoryInput(
    trajectory=[AgentStep(action=s.action, tool_used=s.tool,
                          tool_args=s.args, tool_result=s.result,
                          error=s.error) for s in agent_steps],
    task=TaskDefinition(goal=expected_goal, description=user_request),
    available_tools=[t.name for t in registered_tools],
    final_result=agent_response)
score = TrajectoryScore().compute_one(trajectory)
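The "resend the same broken string" case can also be flagged without a judge. Here's a minimal sketch over a hypothetical `Step` record (not the SDK's `AgentStep`) that measures the longest run of consecutive failed calls with identical tool and arguments.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Step:  # hypothetical trace record for this sketch
    tool: str
    args: dict[str, Any]
    error: Optional[str] = None


def longest_identical_failure_run(steps: list[Step]) -> int:
    """Longest run of consecutive failed calls repeating the same tool and args."""
    longest = run = 0
    previous = None
    for step in steps:
        key = (step.tool, json.dumps(step.args, sort_keys=True))
        if step.error is None:
            run, previous = 0, None  # a success breaks the run
        elif key == previous:
            run += 1                 # same broken input sent again
        else:
            run, previous = 1, key   # a new failing input starts a run
        longest = max(longest, run)
    return longest


# The flight-booking trace: one call plus four retries of the same string.
looping = [Step("search_flights", {"departure_date": "next Friday"}, "400")] * 5
print(longest_identical_failure_run(looping))
```

A run of 1 means the agent changed something after the failure. Anything above the retry cap is a loop-guard failure regardless of what the final answer said.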

The math that makes all of this non-optional

End-to-end success on a k-step agent is roughly the product of the per-step success rates, assuming step failures are independent.

95% per step over 8 steps lands near 66%.
99% per step over 8 steps lands near 92%.
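The arithmetic is short enough to check directly. This sketch assumes every step has the same success rate and that failures are independent, which real agents only approximate.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return per_step ** steps


for p in (0.95, 0.99):
    print(f"{p:.0%} per step over 8 steps -> {end_to_end_success(p, 8):.1%}")
# 95% per step over 8 steps -> 66.3%
# 99% per step over 8 steps -> 92.3%
```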

A third of sessions ending structurally wrong while every individual step scores green isn't a hypothetical. It's the default math, and it's the most common reason teams ship agents that pass eval and tank in production.

The fixes:

Score the trajectory as a unit (per-step rubric is the gate, trajectory metric is the truth).
Treat anything longer than five steps as suspect and decompose it.
Reserve a pass^k consistency slice: 30 hard cases run k times, the fraction that succeed on all k. When it moves, the planner regressed, not the tools.
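The pass^k slice is simple to compute once the rollouts are logged. Here's a minimal sketch following the definition above: the fraction of cases whose k rollouts all succeed. The case IDs and results are made up for illustration.

```python
def pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Fraction of cases whose first k rollouts all succeeded."""
    if not results:
        raise ValueError("no cases to score")
    for case_id, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"{case_id} has {len(runs)} rollouts, need {k}")
    passed = sum(all(runs[:k]) for runs in results.values())
    return passed / len(results)


# Three hypothetical hard cases, three rollouts each.
rollouts = {
    "refund_partial": [True, True, True],
    "rebook_after_cancel": [True, False, True],
    "multi_city": [True, True, True],
}
print(f"pass^1 = {pass_hat_k(rollouts, 1):.2f}")
print(f"pass^3 = {pass_hat_k(rollouts, 3):.2f}")
```

A single flaky rollout is invisible at k=1 and costs the whole case at k=3, which is exactly why the slice is sensitive to planner regressions.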

What I still use public benchmarks for

I didn't throw out BFCL or τ-bench, I just stopped pretending they gate production.

BFCL tells you whether the underlying model can call tools at all (AST, executable, irrelevance).
τ-bench tells you about multi-turn reliability. Even GPT-4o lands below 25% at pass^8 on retail.

Both are a model-selection floor. Neither knows anything about your registry, your schemas, your error codes, or your business policy. The private eval set, stratified by tool, argument-edge-case, and error code, with failing production traces promoted in weekly, is the one that gates the ship.

What I'd do differently

Score per-layer from day one, not aggregate task-completion. Five rubrics per case costs more, but when CI fails, the failing layer name is the root cause.
Treat groundedness-on-tool-output as noisier than on a retrieved corpus. Payloads are JSON, so the rubric has to reason over fields rather than prose. Pin a small human-labelled calibration set and re-tune monthly.
Run the pass^k slice on release candidates, not every PR. 30 cases × 8 rollouts is 240 agent runs. Worth it at the right cadence, painful as a per-commit gate.

If you're running tool-calling agents in production on aggregate task-completion alone, you're flying with one eye closed.

Curious about your setup

Anyone else been bitten by the green-everywhere-but-broken trace? Specifically:

Do you score arguments semantically, or stop at schema validation?
Result utilization: are you grounding against the tool payload, or only the retrieved corpus?
How much do you trust LLM-as-judge for grounding on live production traffic?

Drop a comment, I read all of them. The four-layer stack runs on an open-source eval SDK too, so if you want to get started, say the word and I'll share the link.
