DEV Community

Marcus Chen
Marcus Chen

Posted on

Token-level eval harness for tool-calling agents: what we wired up

TL;DR: We replaced our "did the agent finish the task" pass/fail eval with a token-level harness that scores tool selection, argument shape, and recovery behavior separately. Pass rate went from a single 73% number to four signals that actually tell us what broke. Bifrost sits in front as the provider switch so the same eval runs against four models without rewriting the harness.

At Nexus Labs we run agent automation for enterprise workflows. Twelve people on the team, around 40 tool definitions across the production agents, mix of GPT-4.1, Claude Sonnet 4.6, and a fine-tuned Qwen3 32B we serve ourselves on vLLM.

Last quarter our eval suite told us the new agent build was "72% passing." Shipped it. Two customers reported the agent was silently picking the wrong tool and confabulating success. Pass rate didn't catch it because the final assistant message looked fine.

So we rebuilt the harness.

The four signals

End-to-end pass/fail is one number that hides everything. We split it.

Signal What it measures Failure mode it catches
Tool selection accuracy Did the agent pick the right tool at step N Picks search_db when it should call query_api
Argument F1 Token-level F1 on tool arguments vs gold Right tool, wrong filter or off-by-one date
Recovery rate After a tool returns an error, does the next step make sense Loops the same failing call three times
Trajectory length delta Steps taken vs minimum needed Wanders for 11 steps on a 3-step task

None of these are novel on their own. The point is having all four on every run, per-model, per-tool. When our 72% number dropped to 68% on the new build, the breakdown showed argument F1 collapsed on date-range tools while selection stayed flat. That's a tokenizer regression on the fine-tune, not a reasoning regression. Different fix.

The eval loop

We needed to run the same suite against four models without writing four clients. Bifrost handles that. One OpenAI-compatible endpoint, swap the model string.

eval_targets:
  - name: gpt-4-1
    model: openai/gpt-4.1
  - name: sonnet-4-6
    model: anthropic/claude-sonnet-4-6
  - name: qwen3-internal
    model: ollama/qwen3-32b-tools-v4
  - name: cerebras-llama
    model: cerebras/llama-3.3-70b

gateway:
  base_url: http://bifrost:8080/v1
  headers:
    x-bf-virtual-key: ${EVAL_VK}
Enter fullscreen mode Exit fullscreen mode

The virtual key matters. We give the eval harness its own budget cap through Bifrost's governance so a runaway nightly run can't burn $4K on Anthropic before anyone notices. Last month it did exactly that, capped at $200, dropped the rest of the requests. Email at 3am instead of a Slack thread the next morning.

Semantic caching off for eval runs. Obvious reason: cached responses defeat the point. Bifrost lets you disable it per-request via header, docs here.

Argument F1, in code

The non-obvious signal is argument F1. Most harnesses do exact-match on the JSON, which is brittle ("2026-05-26" vs "May 26, 2026" both call the right API but exact-match scores zero).

def arg_f1(predicted: dict, gold: dict) -> float:
    pred_tokens = tokenize_args(predicted)
    gold_tokens = tokenize_args(gold)
    if not pred_tokens or not gold_tokens:
        return 0.0
    tp = len(pred_tokens & gold_tokens)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
Enter fullscreen mode Exit fullscreen mode

tokenize_args flattens nested JSON and normalizes dates, IDs, and known enums. It's 80 lines. We diff against gold per-key and weight required keys higher than optional ones.

This caught the Qwen regression. Selection accuracy was 91%, argument F1 dropped from 0.84 to 0.61 in one fine-tune iteration. Turned out the tokenizer was splitting ISO dates differently after we added a new SFT batch.

Why Bifrost vs LiteLLM or Portkey

Honest comparison. We tried all three.

Bifrost LiteLLM Portkey
Provider count 23+ More (50+) ~40
Self-hosted free tier Yes Yes Limited
Built-in virtual keys with budget caps Yes Plugin/proxy config Yes
Native Prometheus metrics Yes Via callback Hosted-first
Latency overhead (our measurement, p50) ~1ms ~3-4ms n/a (hosted)

LiteLLM has more providers and a larger community. If you need a niche provider that's the safer bet. Portkey's hosted UX is more polished if you don't want to run anything. We picked Bifrost because the Prometheus integration is native (we already run Prometheus + Grafana) and the overhead was the lowest in our test. Your tradeoffs may differ.

Trade-offs and Limitations

Token-level argument F1 needs gold labels. We hand-labeled 1,200 trajectories. That's not free. If your agent universe is huge and changing weekly, this approach gets expensive.

Recovery rate is the noisiest signal. It needs a judge model to score whether the next step "makes sense" given the error, and judge models disagree with humans about 8% of the time in our spot checks. We use it as a trend indicator, not a gate.

Adding a gateway adds a hop. ~1ms in our setup, but if your eval is running 50K trajectories overnight, that's still real wall-clock time. We accept it because the centralized rate limiting and budget caps are worth more than the millisecond.

Bifrost's MCP gateway is enterprise-only. We use the open-source build, so for MCP tool routing we still wire that ourselves outside the gateway.

Further Reading

The model is the easy part. The harness that tells you which model regressed, and why, is the actual work.

Top comments (0)