Marcus Chen

We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage.

TL;DR: We pulled 41,000 production agent traces at Nexus Labs to build a fine-tuning dataset. After a manual audit of 1,200 of them, ~48% were unusable: tool calls that "succeeded" but returned wrong data, retries masking provider failures, and silent fallbacks that changed which model answered. Putting Bifrost in front of the agent fleet fixed the trace problem more than any sampling strategy we tried.

We run an enterprise agent product. Sales-ops automations mostly. Each user task ends up as a chain of 8-40 tool calls across a planner model, a worker model, and roughly 12 internal tools.

For the last quarter my team has been building a fine-tune dataset from real traces. The plan was straightforward. Pull successful task completions. Filter by user thumbs-up. Use the trace as the training signal.

It did not work.

What "successful" actually meant in our traces

The first audit pass was 1,200 traces, two engineers, three weeks. We tagged each trace as "clean", "noisy", or "corrupted".

| Category  | % of traces | What it meant |
|-----------|-------------|---------------|
| Clean     | 52%         | Tool calls returned correct data, model picked the right next step |
| Noisy     | 31%         | Right answer eventually, but with hidden retries, fallback to a different model, or stale cache hits |
| Corrupted | 17%         | Trace claimed success, output was wrong. User had not noticed yet. |

The noisy category is the one that broke me. We had been treating these as gold-standard data. Take one trace: the planner called crm_lookup, got a 500, retried twice, then succeeded on a fallback Anthropic key while the original trace span still pointed at OpenAI gpt-4o. The training pair we would have generated: "given this user input, output this tool call sequence." But the sequence was the result of three providers and two model versions stitched together. No reproducibility.

Worse: nothing in our trace told us which model actually produced the final answer. We had a model field. It logged whichever provider was configured at request start.
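As a rough illustration of the gap (not the schema from the actual pipeline), a span can carry both the model that was requested and the model that actually answered. Every field name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelCallSpan:
    """One model call in a trace. All field names are hypothetical."""

    requested_model: str  # what the caller asked for at request start
    served_model: str     # what actually produced the response
    attempts: List[str] = field(default_factory=list)  # every model tried, in order

    @property
    def fell_back(self) -> bool:
        # True when the response came from a different model than requested.
        return self.requested_model != self.served_model


span = ModelCallSpan(
    requested_model="openai/gpt-4o",
    served_model="anthropic/claude-sonnet-4-6",
    attempts=["openai/gpt-4o", "openai/gpt-4o", "anthropic/claude-sonnet-4-6"],
)
print(span.fell_back)  # True
```

A single `model` field filled in at request start can only ever hold `requested_model`, which is why the old traces could not answer the question.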

Why we ended up putting a gateway in front of everything

We tried two things first. Both partial fixes.

The first was logging at the application layer. Wrap every provider call, log model, latency, retry count, fallback path. This works until you have several services calling several clients, each with its own retry policy. Our Python service used the official openai client. Our Go service used a hand-rolled HTTP client. The TypeScript planner used the Vercel AI SDK. Three different definitions of "retry".
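For a sense of what the application-layer approach involves, here is a minimal sketch of such a wrapper. It is an illustration only: `call_with_record`, `CallRecord`, and the retry policy (a fixed number of attempts per model, then move to the next model) are assumptions, not the code from any of the services described.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class CallRecord:
    served_model: str         # model that actually returned a result
    failed_attempts: int      # attempts that raised before one succeeded
    fallback_path: List[str]  # models tried, in order
    latency_s: float          # wall-clock time across all attempts


def call_with_record(
    models: List[str],
    call: Callable[[str], T],
    max_attempts_per_model: int = 2,
) -> Tuple[T, CallRecord]:
    """Try each model in order, recording failures and the fallback path."""
    start = time.monotonic()
    failed = 0
    path: List[str] = []
    last_error: Optional[Exception] = None
    for model in models:
        path.append(model)
        for _ in range(max_attempts_per_model):
            try:
                result = call(model)
            except Exception as exc:  # a real wrapper would catch narrower errors
                last_error = exc
                failed += 1
                continue
            elapsed = time.monotonic() - start
            return result, CallRecord(model, failed, path, elapsed)
    raise RuntimeError(f"all models failed: {path}") from last_error


def flaky_provider(model: str) -> str:
    # Stand-in for a real SDK call: the first model always fails.
    if model == "openai/gpt-4o":
        raise TimeoutError("simulated provider failure")
    return f"answer from {model}"


result, record = call_with_record(
    ["openai/gpt-4o", "anthropic/claude-sonnet-4-6"], flaky_provider
)
```

The wrapper itself is small. The hard part is that every service needs an equivalent one with the same meaning for each field, in every language.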

The second was forcing all traffic through LiteLLM. It got us to a unified call surface but the observability was thin for our needs, and the failover behaviour was harder to reason about under load. Not a knock on LiteLLM, it just was not the shape we wanted.

We migrated the fleet behind Bifrost about five months ago. Two reasons specific to our problem:

  1. The Automatic Fallbacks config makes the fallback chain a first-class object. When a request fails over from Anthropic to Bedrock, that is in the response metadata. Not in three different log lines you have to join.
  2. Native Prometheus metrics (observability docs) meant bifrost_requests_total is tagged by the actual provider that served the request, not the one we asked for.

Here is a chunk of the config that mattered for trace cleanup:

providers:
  openai:
    keys:
      - value: env.OPENAI_API_KEY_1
        weight: 0.7
      - value: env.OPENAI_API_KEY_2
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_API_KEY

fallbacks:
  - model: openai/gpt-4o
    fallback_to:
      - anthropic/claude-sonnet-4-6
      - openai/gpt-4o-mini

logging:
  include_fallback_chain: true
  include_provider_actual: true

The two include_* flags meant every trace span we emitted downstream had a deterministic answer to "who served this token". Our corrupted-trace rate on the next 5,000 sampled traces dropped from 17% to under 3%.

What the audit actually changed about our fine-tuning

We stopped using user thumbs-up as the primary filter. Thumbs-up correlates with "user got what they wanted eventually", not "the model made the right call". Now the filter is:

  • Single-provider, single-model trace (no fallback fired)
  • No retry on any tool call
  • Tool call result schemas validated post-hoc against a recorded ground truth
  • Span timing within 1.5x median for that task class

That filter throws away about 71% of our raw traces. Painful. But the 29% that survives is data we can actually train on.
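The four criteria can be sketched as a single predicate over per-trace summaries. This is a minimal sketch under stated assumptions: the `TraceSummary` fields are hypothetical, "span timing" is read as total trace duration, and the median is computed per task class over the raw traces passed in.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class TraceSummary:
    """Per-trace summary. All field names are hypothetical."""

    task_class: str
    models_served: List[str]  # provider/model strings that served calls
    fallback_fired: bool
    tool_retries: int
    tool_results_valid: bool  # outcome of the post-hoc validation pass
    duration_s: float


def select_training_traces(
    traces: List[TraceSummary], max_slowdown: float = 1.5
) -> List[TraceSummary]:
    # Median duration per task class, computed over all input traces.
    durations: Dict[str, List[float]] = {}
    for t in traces:
        durations.setdefault(t.task_class, []).append(t.duration_s)
    medians = {cls: median(ds) for cls, ds in durations.items()}

    def keep(t: TraceSummary) -> bool:
        return (
            len(set(t.models_served)) == 1  # single provider, single model
            and not t.fallback_fired        # no fallback fired
            and t.tool_retries == 0         # no retry on any tool call
            and t.tool_results_valid        # results matched ground truth
            and t.duration_s <= max_slowdown * medians[t.task_class]
        )

    return [t for t in traces if keep(t)]


gpt = ["openai/gpt-4o"]
mixed = ["openai/gpt-4o", "anthropic/claude-sonnet-4-6"]
traces = [
    TraceSummary("lead_enrich", gpt, False, 0, True, 10.0),   # kept
    TraceSummary("lead_enrich", gpt, False, 1, True, 12.0),   # dropped: retry
    TraceSummary("lead_enrich", gpt, False, 0, True, 40.0),   # dropped: too slow
    TraceSummary("lead_enrich", mixed, True, 0, True, 11.0),  # dropped: fallback
]
kept = select_training_traces(traces)
```

Note that thumbs-up does not appear anywhere in the predicate. Every condition is something the trace itself can prove.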

Trade-offs and limitations

Honest take on what this did not solve.

  • Bifrost is not a debugger. It tells you which provider served the request and whether a fallback fired. It does not tell you whether the tool result was correct. We still need the post-hoc schema validation pass.
  • Semantic caching (docs) made the corruption worse before it got better. Cache hits looked like fresh model calls in our old logging. We had to explicitly tag cached responses in the trace pipeline. Once tagged, fine, but the default was confusing.
  • LiteLLM has a larger provider list in the long tail. If you need niche providers, check both before committing.
  • Portkey's prompt management UI is nicer. We do prompt management elsewhere so it did not matter for us. If you want one tool for both, Portkey is worth a look.
  • The MCP gateway feature (docs) is interesting but we have not put it in production. Cannot vouch for it yet.
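On the caching point above, explicit tagging can be as small as the sketch below. The `cache_hit` metadata key and the `cache_status` span field are hypothetical names chosen for illustration, not documented Bifrost fields. The one deliberate choice is that a missing signal becomes "unknown" rather than silently counting as a fresh model call.

```python
from typing import Any, Dict


def tag_cache_status(
    span: Dict[str, Any], response_meta: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of the span with an explicit cache_status field."""
    tagged = dict(span)
    hit = response_meta.get("cache_hit")  # hypothetical key; None if absent
    if hit is None:
        tagged["cache_status"] = "unknown"
    else:
        tagged["cache_status"] = "hit" if hit else "miss"
    return tagged


span = {"tool": "crm_lookup", "model": "openai/gpt-4o"}
tagged_hit = tag_cache_status(span, {"cache_hit": True})
tagged_unknown = tag_cache_status(span, {})
```

A downstream filter can then require `cache_status == "miss"` and treat "unknown" as not trainable.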

The model is the easy part. The infrastructure around the trace is where your eval dataset lives or dies.

Further Reading

  • Bifrost retries and fallbacks docs
  • Bifrost observability defaults
  • LiteLLM proxy docs, for an honest comparison
  • Anthropic's tool use guide — the trace structure section is the relevant one
  • OpenTelemetry GenAI semantic conventions — what we wish our old logging had matched
