TL;DR: Our eval sets went stale because a human wrote the test cases by hand once and never updated them. We moved the capture point into the gateway. A Bifrost custom plugin logs every production request and response, and we curate a weekly regression set from real traffic instead of inventing inputs at our desks.
I lead the fine-tuning and eval team at Nexus Labs. Six people. We ship enterprise agent automation, and the model is the easy part. The hard part is knowing whether last week's change made anything better or quietly broke a customer's workflow.
For a year our regression suite was 120 hand-written cases. Someone on the team sat down, imagined what a user might ask, and froze it. By month three those inputs looked nothing like what real agents were sending. We were grading ourselves on a test we wrote, not the one production was running.
Why the gateway is the right capture point
We route every model call through Bifrost, an open-source AI gateway in front of OpenAI, Anthropic, and our self-hosted vLLM endpoints. It already sees the full request and response for gpt-4o-mini and our fine-tuned Qwen2.5-7B. That's the natural seam to tap.
Capturing inside the application means touching every service. Capturing at the gateway means one place. Bifrost ships a custom plugin system, a middleware layer with a pre-hook and a post-hook around each call, documented under Custom Plugins. We wrote a plugin that copies request and response into a Postgres table with the model id, latency, and token counts attached.
Here's the shape of it. Go, because Bifrost is Go.
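func (p *EvalCapturePlugin) PostHook(
	ctx context.Context,
	req *schemas.BifrostRequest,
	res *schemas.BifrostResponse,
) (*schemas.BifrostResponse, error) {
	// guard the index below: a failed call can come back with no choices
	if res == nil || len(res.Choices) == 0 {
		return res, nil
	}
	// sample 5% of traffic (1 in 20 by request-ID hash), skip anything flagged PII
	if hash(req.ID)%20 == 0 && !req.Meta.Sensitive {
		sample := EvalSample{
			Model:   req.Model,
			Input:   req.Input,
			Output:  res.Choices[0].Message,
			Latency: res.ExtraFields.Latency,
			Tokens:  res.Usage.TotalTokens,
		}
		// non-blocking send: if the queue is full, drop the sample
		// rather than stall the request
		select {
		case p.queue <- sample:
		default:
		}
	}
	return res, nil
}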
The queue drains to Postgres on a separate goroutine so we don't add latency to the request path. We sample 5%, which at our volume is roughly 9,000 captured calls a day.
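The drain side can be sketched roughly like this. It is a minimal illustration, not our production code: `Sink`, `memSink`, `drain`, the batch size, and the flush interval are all hypothetical names and values, and a real `Sink` would wrap a Postgres batch insert instead of an in-memory slice.

```go
package main

import (
	"fmt"
	"time"
)

// EvalSample mirrors the captured fields, trimmed for the sketch.
type EvalSample struct {
	Model  string
	Input  string
	Output string
}

// Sink is a hypothetical interface; a real one would wrap a Postgres batch insert.
// Implementations must not retain the slice, because drain reuses it.
type Sink interface {
	WriteBatch(batch []EvalSample) error
}

// memSink copies batches into memory so the sketch runs without a database.
type memSink struct{ batches [][]EvalSample }

func (m *memSink) WriteBatch(b []EvalSample) error {
	m.batches = append(m.batches, append([]EvalSample(nil), b...))
	return nil
}

// drain reads samples until the queue is closed, flushing when the batch is
// full or when the ticker fires, so writes never run on the request path.
func drain(queue <-chan EvalSample, sink Sink, batchSize int, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	batch := make([]EvalSample, 0, batchSize)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		if err := sink.WriteBatch(batch); err != nil {
			fmt.Println("eval capture: dropping batch:", err)
		}
		batch = batch[:0]
	}
	for {
		select {
		case s, ok := <-queue:
			if !ok {
				flush() // queue closed: write what is left and exit
				return
			}
			batch = append(batch, s)
			if len(batch) >= batchSize {
				flush()
			}
		case <-ticker.C:
			flush() // time-based flush keeps quiet periods from holding rows back
		}
	}
}

func main() {
	queue := make(chan EvalSample, 1024)
	sink := &memSink{}
	done := make(chan struct{})
	go func() { drain(queue, sink, 2, time.Second); close(done) }()
	for i := 0; i < 5; i++ {
		queue <- EvalSample{Model: "gpt-4o-mini", Input: fmt.Sprint("q", i)}
	}
	close(queue)
	<-done
	fmt.Println("batches written:", len(sink.batches))
}
```

Batching by size and by time is a design choice in this sketch: it keeps the number of writes down during bursts while still bounding how long a captured row waits before it lands in storage.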
Curating, not dumping
Raw traffic is not an eval set. It's a pile. Most of it is repetitive and low-signal.
Each week we pull the captured rows and cluster them by embedding. We keep one representative per cluster, drop near-duplicates, and oversample the tail where the agent hit a tool-call error or returned an empty completion. That tail is where regressions hide. The result is around 400 cases a week, which we human-review down to maybe 250 before it joins the frozen suite.
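To make the "one representative, drop near-duplicates, keep the tail" idea concrete, here is a simplified sketch. It swaps real clustering for a greedy pass over cosine similarity, and it keeps every failed row as a crude stand-in for oversampling. The `Row` type, the `curate` function, and the 0.95 threshold are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"math"
)

// Row is a hypothetical captured row with a precomputed embedding.
type Row struct {
	ID        int
	Embedding []float64
	Failed    bool // tool-call error or empty completion
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// curate keeps every failed row. For the rest, it keeps a row only if it is
// not within threshold cosine similarity of a representative already kept.
func curate(rows []Row, threshold float64) []Row {
	var kept, reps []Row
	for _, r := range rows {
		if r.Failed {
			kept = append(kept, r) // the failure tail is always kept
			continue
		}
		dup := false
		for _, rep := range reps {
			if cosine(r.Embedding, rep.Embedding) >= threshold {
				dup = true
				break
			}
		}
		if !dup {
			reps = append(reps, r)
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	rows := []Row{
		{ID: 1, Embedding: []float64{1, 0}},
		{ID: 2, Embedding: []float64{0.99, 0.05}}, // near-duplicate of row 1
		{ID: 3, Embedding: []float64{0, 1}},
		{ID: 4, Embedding: []float64{1, 0.01}, Failed: true},
	}
	for _, r := range curate(rows, 0.95) {
		fmt.Println("kept row", r.ID)
	}
}
```

The greedy pass compares each row against every representative kept so far, so it gets slow on large weeks. A proper clustering step or an approximate nearest-neighbor index would be the natural upgrade.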
The frozen suite is now 1,900 cases and growing from real inputs. Last month it caught a 6-point drop in tool-call accuracy on our Qwen fine-tune that the old hand-written set sailed straight past, because no human had thought to write a case with three nested function calls.
You also get the metrics for free. Bifrost emits native Prometheus metrics per model, so we already had latency and token distributions to weight the sampling toward expensive calls.
Bifrost vs LiteLLM vs Portkey
We looked at three gateways for this. Honest read:
| Capability | Bifrost | LiteLLM | Portkey |
|---|---|---|---|
| Custom logging hook | Go plugin, in-process | Python callback / custom logger | Hosted logs + feedback API |
| Self-hosted, full data control | Yes | Yes | Self-host on paid tier |
| Language | Go | Python | Managed service |
| Overhead at high QPS | Low | Higher under load | Network hop to their edge |
| Setup friction | Write Go, compile | pip install, edit config | Fastest, UI out of the box |
LiteLLM was the obvious pick for an ML team. It's Python, so our existing data tooling drops right in, and its provider list is larger. If you want a callback in 10 minutes, it wins. The catch for us was throughput: under our agent burst traffic we hit limits with LiteLLM that Bifrost's Go path handled without tuning.
Portkey has the most polished logging UI and a real feedback API we didn't have to build. The tradeoff is that the data lives in their system unless you're on the self-hosted plan, and a customer contract ruled that out for us. If you want a managed dashboard and don't have a data-residency clause, Portkey is a reasonable call.
We picked Bifrost because the capture runs in-process with no extra network hop, the plugin is the same binary as the gateway, and the logging plus plugin surface gave us both metrics and raw payloads in one place.
Trade-offs and limitations
Writing a plugin in Go is more work than a Python callback. If your team doesn't read Go, that's a real cost, and LiteLLM is the saner choice.
Sampling is lossy. At 5% we miss rare inputs, and we've had to bump specific routes to 100% capture when a customer reported a bug we couldn't reproduce.
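A per-route override is one way to express that bump. The sketch below assumes a deterministic hash of the request ID and a map of route-specific rates. `Sampler`, `bucket`, and the route names are hypothetical, and FNV-1a is just one convenient hash, not necessarily what a given plugin uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Sampler decides whether to capture a request. All names are hypothetical.
type Sampler struct {
	DefaultRate float64            // e.g. 0.05 for 5%
	RouteRates  map[string]float64 // per-route overrides, e.g. 1.0 for full capture
}

// bucket maps a request ID to a stable value in [0, 10000).
func bucket(requestID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(requestID))
	return h.Sum32() % 10000
}

// Keep reports whether the request falls inside the route's sample rate.
// The same request ID always gets the same answer for a given rate.
func (s Sampler) Keep(route, requestID string) bool {
	rate := s.DefaultRate
	if r, ok := s.RouteRates[route]; ok {
		rate = r
	}
	return float64(bucket(requestID)) < rate*10000
}

func main() {
	s := Sampler{
		DefaultRate: 0.05,
		RouteRates:  map[string]float64{"/agents/invoice": 1.0}, // route under investigation
	}
	kept := 0
	const n = 100000
	for i := 0; i < n; i++ {
		if s.Keep("/agents/default", fmt.Sprint("req-", i)) {
			kept++
		}
	}
	fmt.Printf("default route kept %d of %d\n", kept, n)
	fmt.Println("invoice route kept:", s.Keep("/agents/invoice", "req-42"))
}
```

Hashing the request ID instead of calling a random number generator means a retried or replayed request gets the same sampling decision, which makes missing rows easier to explain after the fact.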
PII is on you. The gateway sees everything, so the plugin has to redact before it writes, and we still run a scrubbing pass before any human looks at a row. Getting this wrong is worse than a stale eval set.
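As a rough illustration of redacting before the write, the sketch below masks email addresses and long digit runs with regular expressions. The patterns, the `redact` function, and the placeholder tokens are assumptions, and they are nowhere near complete PII coverage. Names, addresses, and free-text identifiers need a dedicated scrubber.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// emailRe matches common email address shapes (illustrative, not exhaustive).
	emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	// digitsRe matches standalone runs of 9 to 16 digits, such as card or account numbers.
	digitsRe = regexp.MustCompile(`\b\d{9,16}\b`)
)

// redact replaces matched spans with placeholder tokens before a row is stored.
func redact(s string) string {
	s = emailRe.ReplaceAllString(s, "[EMAIL]")
	return digitsRe.ReplaceAllString(s, "[NUMBER]")
}

func main() {
	in := "mail jane.doe@example.com about card 4111111111111111"
	fmt.Println(redact(in))
}
```

Running redaction inside the plugin and then scrubbing again before human review is deliberate redundancy: a pattern list like this will miss things, so it should never be the only layer.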
And capturing traffic doesn't grade it. You still need a scoring harness and labels. The gateway gives you the inputs and outputs. The judgment is the part you can't outsource to infrastructure.