Token-level eval harness for tool-calling agents: what we wired up

#machinelearning #llm #mlops #devops

TL;DR: We replaced our "did the agent finish the task" pass/fail eval with a token-level harness that scores tool selection, argument shape, recovery behavior, and trajectory length separately. We went from a single 72% pass rate to four signals that actually tell us what broke. Bifrost sits in front as the provider switch so the same eval runs against four models without rewriting the harness.

At Nexus Labs we run agent automation for enterprise workflows. Twelve people on the team, around 40 tool definitions across the production agents, mix of GPT-4.1, Claude Sonnet 4.6, and a fine-tuned Qwen3 32B we serve ourselves on vLLM.

Last quarter our eval suite told us the new agent build was "72% passing." Shipped it. Two customers reported the agent was silently picking the wrong tool and confabulating success. Pass rate didn't catch it because the final assistant message looked fine.

So we rebuilt the harness.

The four signals

End-to-end pass/fail is one number that hides everything. We split it.

| Signal | What it measures | Failure mode it catches |
| --- | --- | --- |
| Tool selection accuracy | Did the agent pick the right tool at step N | Picks `search_db` when it should call `query_api` |
| Argument F1 | Token-level F1 on tool arguments vs gold | Right tool, wrong filter or off-by-one date |
| Recovery rate | After a tool returns an error, does the next step make sense | Loops the same failing call three times |
| Trajectory length delta | Steps taken vs minimum needed | Wanders for 11 steps on a 3-step task |

None of these are novel on their own. The point is having all four on every run, per-model, per-tool. When our 72% number dropped to 68% on the new build, the breakdown showed argument F1 collapsed on date-range tools while selection stayed flat. That's a tokenizer regression on the fine-tune, not a reasoning regression. Different fix.
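As a rough sketch of how two of these signals can be broken down per tool: the code below assumes every predicted step has already been aligned to a gold step, which is the hard part and is not shown. The `Step` record and both function names are illustrative, not the harness's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Step:
    """One aligned step: what the agent called vs what gold says (hypothetical shape)."""
    predicted_tool: str
    gold_tool: str


def selection_accuracy_by_tool(steps: list[Step]) -> dict[str, float]:
    """Fraction of steps where the agent picked the gold tool, grouped by gold tool."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for step in steps:
        totals[step.gold_tool] += 1
        if step.predicted_tool == step.gold_tool:
            hits[step.gold_tool] += 1
    return {tool: hits[tool] / totals[tool] for tool in totals}


def trajectory_length_delta(steps_taken: int, min_steps: int) -> int:
    """Extra steps beyond the minimum; 0 means the agent took the shortest path."""
    return steps_taken - min_steps
```

Grouping by gold tool rather than predicted tool keeps the denominator stable across models, so a model that over-calls one tool does not inflate that tool's score.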

The eval loop

We needed to run the same suite against four models without writing four clients. Bifrost handles that. One OpenAI-compatible endpoint, swap the model string.

eval_targets:
  - name: gpt-4-1
    model: openai/gpt-4.1
  - name: sonnet-4-6
    model: anthropic/claude-sonnet-4-6
  - name: qwen3-internal
    model: ollama/qwen3-32b-tools-v4
  - name: cerebras-llama
    model: cerebras/llama-3.3-70b

gateway:
  base_url: http://bifrost:8080/v1
  headers:
    x-bf-virtual-key: ${EVAL_VK}

The virtual key matters. We give the eval harness its own budget cap through Bifrost's governance so a runaway nightly run can't burn $4K on Anthropic before anyone notices. Last month a nightly run did run away: it hit the $200 cap and Bifrost dropped the rest of its requests. We got an email at 3am instead of finding out in a Slack thread the next morning.

Semantic caching is off for eval runs, for the obvious reason that cached responses defeat the point. Bifrost lets you disable it per-request via a header; see the Bifrost docs.
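To make the "one endpoint, swap the model string" idea concrete, here is a minimal sketch that builds the same chat-completions request for each target using only the standard library. It assumes the gateway exposes the usual OpenAI-style `/chat/completions` path under the configured base URL. The function names are hypothetical, and the cache-control header is left out because its name is not given here.

```python
import json
import urllib.request

# Mirrors the eval_targets config above: display name -> gateway model string.
EVAL_TARGETS = {
    "gpt-4-1": "openai/gpt-4.1",
    "sonnet-4-6": "anthropic/claude-sonnet-4-6",
    "qwen3-internal": "ollama/qwen3-32b-tools-v4",
    "cerebras-llama": "cerebras/llama-3.3-70b",
}
BASE_URL = "http://bifrost:8080/v1"


def build_request(model: str, messages: list, tools: list, virtual_key: str):
    """Build (but do not send) one OpenAI-style chat request for a given model."""
    payload = {"model": model, "messages": messages}
    if tools:  # only attach tool definitions when there are any
        payload["tools"] = tools
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",  # assumed OpenAI-compatible path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-bf-virtual-key": virtual_key,  # the eval harness's own budgeted key
        },
        method="POST",
    )


def requests_for_case(messages: list, tools: list, virtual_key: str) -> dict:
    """Same eval case, one request per target; only the model string differs."""
    return {
        name: build_request(model, messages, tools, virtual_key)
        for name, model in EVAL_TARGETS.items()
    }
```

Each request can then be sent with `urllib.request.urlopen(req)` or swapped for whatever HTTP client the harness already uses; the point is that nothing provider-specific appears outside the model string.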

Argument F1, in code

The non-obvious signal is argument F1. Most harnesses do exact-match on the JSON, which is brittle: "2026-05-26" and "May 26, 2026" denote the same date and drive the same API call, but exact-match scores zero.

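def arg_f1(predicted: dict, gold: dict) -> float:
    """Token-level F1 between predicted and gold tool arguments."""
    # tokenize_args returns a set of normalized tokens (described below).
    pred_tokens = tokenize_args(predicted)
    gold_tokens = tokenize_args(gold)
    # A tool with no arguments, called with none, is a perfect match.
    if not pred_tokens and not gold_tokens:
        return 1.0
    if not pred_tokens or not gold_tokens:
        return 0.0
    tp = len(pred_tokens & gold_tokens)  # tokens present in both sets
    if tp == 0:
        return 0.0  # also avoids dividing by zero below
    precision = tp / len(pred_tokens)
    recall = tp / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)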

`tokenize_args` flattens nested JSON into a set of tokens and normalizes dates, IDs, and known enums. It's 80 lines. The function above shows only the unweighted form; in the full harness we diff against gold per key and weight required keys higher than optional ones.
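The real `tokenize_args` is not reproduced here, so the following is a much smaller stand-in under stated assumptions: tokens are `path=value` strings, the date formats tried are an arbitrary short list, strings are case-folded as a crude enum normalization, and required keys get a fixed weight of 2. `weighted_arg_f1` and its parameters are hypothetical names.

```python
import json
from datetime import datetime

# Date formats tried in order; this list is an assumption for the sketch.
_DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y", "%m/%d/%Y")


def normalize_scalar(value) -> str:
    """Canonical string for a leaf value; recognizable dates become ISO 8601."""
    if isinstance(value, str):
        text = value.strip()
        for fmt in _DATE_FORMATS:
            try:
                return datetime.strptime(text, fmt).date().isoformat()
            except ValueError:
                continue
        return text.lower()
    return json.dumps(value)  # numbers, booleans, null


def tokenize_args(args, prefix: str = "") -> set[str]:
    """Flatten nested JSON into a set of 'path=value' tokens."""
    tokens: set[str] = set()
    if isinstance(args, dict):
        for key, value in args.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            tokens |= tokenize_args(value, path)
    elif isinstance(args, list):
        for index, value in enumerate(args):
            tokens |= tokenize_args(value, f"{prefix}[{index}]")
    else:
        tokens.add(f"{prefix}={normalize_scalar(args)}")
    return tokens


def weighted_arg_f1(predicted: dict, gold: dict, required: set[str],
                    required_weight: float = 2.0) -> float:
    """F1 where tokens under a required top-level key count more."""
    pred, ref = tokenize_args(predicted), tokenize_args(gold)
    if not pred and not ref:
        return 1.0

    def weight(token: str) -> float:
        top_key = token.split("=", 1)[0].split(".", 1)[0].split("[", 1)[0]
        return required_weight if top_key in required else 1.0

    tp = sum(weight(t) for t in pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / sum(weight(t) for t in pred)
    recall = tp / sum(weight(t) for t in ref)
    return 2 * precision * recall / (precision + recall)
```

With this normalization, `{"start": "May 26, 2026"}` and `{"start": "2026-05-26"}` produce the same token, so a date written in a different format no longer scores zero.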

This caught the Qwen regression. Selection accuracy held at 91% while argument F1 dropped from 0.84 to 0.61 in one fine-tune iteration. It turned out the tokenizer was splitting ISO dates differently after we added a new SFT batch.

Why Bifrost vs LiteLLM or Portkey

Honest comparison. We tried all three.

| | Bifrost | LiteLLM | Portkey |
| --- | --- | --- | --- |
| Provider count | 23+ | More (50+) | ~40 |
| Self-hosted free tier | Yes | Yes | Limited |
| Built-in virtual keys with budget caps | Yes | Plugin/proxy config | Yes |
| Native Prometheus metrics | Yes | Via callback | Hosted-first |
| Latency overhead (our measurement, p50) | ~1ms | ~3-4ms | n/a (hosted) |

LiteLLM has more providers and a larger community. If you need a niche provider that's the safer bet. Portkey's hosted UX is more polished if you don't want to run anything. We picked Bifrost because the Prometheus integration is native (we already run Prometheus + Grafana) and the overhead was the lowest in our test. Your tradeoffs may differ.

Trade-offs and limitations

Token-level argument F1 needs gold labels. We hand-labeled 1,200 trajectories. That's not free. If your agent universe is huge and changing weekly, this approach gets expensive.

Recovery rate is the noisiest signal. It needs a judge model to score whether the next step "makes sense" given the error, and judge models disagree with humans about 8% of the time in our spot checks. We use it as a trend indicator, not a gate.
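One way to structure the recovery score, sketched under assumptions: the judge is passed in as a plain callable so any judge model can sit behind it, and an identical retry of a failed call is counted as a non-recovery without asking the judge. That second rule is an assumption made here to catch the "loops the same failing call" pattern cheaply; it would mis-score a legitimate retry after a transient error. `ToolStep`, `is_blind_retry`, and `recovery_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolStep:
    """One tool call in a trajectory (hypothetical shape)."""
    tool: str
    args: dict
    error: Optional[str] = None  # None means the call succeeded


def is_blind_retry(failed: ToolStep, nxt: ToolStep) -> bool:
    """True when the next step repeats the failed call with identical arguments."""
    return nxt.tool == failed.tool and nxt.args == failed.args


def recovery_rate(
    steps: list[ToolStep],
    judge: Callable[[ToolStep, ToolStep], bool],
) -> Optional[float]:
    """Share of failed steps whose following step is judged sensible.

    Returns None when no failed step has a successor, so callers can tell
    "nothing to score" apart from a rate of 0.0.
    """
    scored = 0
    recovered = 0
    # zip stops one short, so a failure on the final step is not scored.
    for failed, nxt in zip(steps, steps[1:]):
        if failed.error is None:
            continue
        scored += 1
        if not is_blind_retry(failed, nxt) and judge(failed, nxt):
            recovered += 1
    return recovered / scored if scored else None
```

Keeping the judge behind a callable also makes it easy to replay stored human labels through the same function when spot-checking judge agreement.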

Adding a gateway adds a hop: ~1ms per request in our setup. If your eval is running 50K trajectories overnight, that adds up: 50,000 × 1ms is about 50 seconds for each model call a trajectory makes, if requests run sequentially. We accept it because the centralized rate limiting and budget caps are worth more than the millisecond.

Bifrost's MCP gateway is enterprise-only. We use the open-source build, so for MCP tool routing we still wire that ourselves outside the gateway.

Top comments (1)

Harjot Singh • May 31

Token-level eval for tool-calling is a sharper lens than most teams use - for tool-calling agents the failure isn't usually "bad prose," it's "called the wrong tool / right tool wrong args / hallucinated a tool that doesn't exist," and end-to-end pass/fail hides which of those happened. Evaluating at the tool-call level (did it select correctly, were the arguments valid against the schema, did it handle the result) gives you a debuggable signal instead of a vague "the agent failed." That granularity is exactly what makes the eval actionable.

The thing this unlocks: once you can score tool-call correctness, you can route confidently - know which model is best at tool selection vs argument construction and assign accordingly, instead of guessing. Tool-call evals are what make multi-model agent systems tunable rather than hopeful. That's precisely the kind of signal I rely on in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - verifying tool calls at the boundary is how a build stays correct AND lets me route steps to the cheapest model that passes. Really sharp infra. What's your hardest tool-call failure to catch - wrong-tool, bad-args, or the subtle "right call, wrong interpretation of the result"? That last one's brutal to eval.