Shadow-testing a fine-tuned 8B against gpt-4o-mini through Bifrost

#machinelearning #llm #mlops #devops

TL;DR: We fine-tuned Llama 3.1 8B for invoice line-item extraction. Before flipping production over, we mirrored 14 days of live traffic to both the fine-tune and gpt-4o-mini through Bifrost, then diffed the outputs offline. The 8B won on accuracy by 3.2 points and cut per-call cost by 71%. The interesting bug: 4% of "wins" were the fine-tune hallucinating a field that gpt-4o-mini correctly left null.

Our team at Nexus Labs ships an agent that pulls structured fields out of supplier invoices. The previous version hit gpt-4o-mini for every call. Bill was getting unfun.

I'm not a fan of swapping production models based on benchmark numbers. MT-Bench scores tell you very little about whether your specific eight-field extraction prompt works on the long tail of malformed PDFs that your customers actually send. So we shadow-tested.

The setup

We needed three things wired together:

1. Mirror live production traffic to a second model without affecting the primary response
2. Log both responses with a shared request ID
3. Replay an offline judge over the diffs

We were already running Bifrost in front of OpenAI for spend visibility. Turns out the load balancing config lets you weight providers across a single virtual model name, and the per-request log includes the full input and output payload. That covered the first two.

A trimmed slice of the config we used:

providers:
  primary_extractor:
    weight: 1.0
    model: openai/gpt-4o-mini
  shadow_extractor:
    weight: 1.0
    model: vllm/llama-3.1-8b-extract-v4
    shadow: true

The shadow: true flag is not built into Bifrost; we implemented it as a custom plugin. The Bifrost README documents the plugin architecture but does not ship a built-in mirror mode. Our plugin sends the shadow request asynchronously and keeps its response off the client path. Both log records share a trace ID, so downstream comparison is a join, not a search.

What we found in 14 days

Fourteen days, 218,400 production requests, mirrored to both targets. The numbers:

Metric	gpt-4o-mini	Fine-tuned 8B
Field-level accuracy (judge)	94.1%	97.3%
Latency p50	480ms	190ms
Latency p99	1.8s	410ms
Cost per 1k requests	$0.42	$0.12
Hallucinated field rate	0.3%	1.1%

The accuracy win is real. The cost win is real. The latency win is mostly because we run the 8B on a single H100 with vLLM continuous batching and there is no network egress.

The hallucination rate is the part that almost killed the migration. The fine-tune confidently filled in vendor_tax_id on 1.1% of invoices where the field genuinely did not exist. On those invoices gpt-4o-mini returned null. Our judge initially scored the hallucinations as correct because the format was valid. That's a separate post.
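For anyone who wants to measure this on their own logs, a rough sketch of the calculation follows. The record layout (trace_id plus a fields dict) and the function names are assumptions for illustration, not our actual log schema. The denominator is also an assumption: this version divides by the invoices where the ground truth says the field is absent.

```python
def index_by_trace(records):
    """Map trace_id -> extracted fields so two sources can be joined by key."""
    return {r["trace_id"]: r["fields"] for r in records}


def hallucinated_field_rate(outputs, truth, field):
    """Share of invoices where `field` is absent in the ground truth
    but the model returned a non-null value for it."""
    out = index_by_trace(outputs)
    gold = index_by_trace(truth)

    # Join on trace ID: only score invoices present in both sources.
    shared = out.keys() & gold.keys()
    absent = [t for t in shared if gold[t].get(field) is None]
    if not absent:
        return 0.0

    hallucinated = sum(1 for t in absent if out[t].get(field) is not None)
    return hallucinated / len(absent)
```

Running the same function over each model's records gives two rates computed on identical invoices, so the comparison is apples to apples.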

What the judge missed

We were using gpt-4o as the offline judge. It graded outputs against the ground-truth JSON. The grader rewarded any non-null field that matched the schema, which meant a plausible-sounding made-up tax ID got partial credit.

We swapped to a stricter judge that compared field-by-field against a held-out human-labeled set of 2,400 invoices. The fine-tune still won, but the margin shrank to 1.8 points. Worth the migration. Not worth the marketing pitch our PM wanted.
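The difference between the two grading rules is easier to see in code. This is a deterministic toy version, assuming a hypothetical three-field subset of the schema. Our original judge was an LLM, so lenient_field_accuracy only imitates its blind spot and is not its real logic.

```python
FIELDS = ("vendor_name", "vendor_tax_id", "total")  # hypothetical subset of the schema


def strict_field_accuracy(pred, gold, fields=FIELDS):
    """A field counts only if it equals the label; null must match null."""
    correct = sum(1 for f in fields if pred.get(f) == gold.get(f))
    return correct / len(fields)


def lenient_field_accuracy(pred, gold, fields=FIELDS):
    """Imitates the flawed grader: a well-formed non-null value gets credit
    even when the label says the field does not exist."""
    correct = 0
    for f in fields:
        value = pred.get(f)
        if value == gold.get(f):
            correct += 1
        elif gold.get(f) is None and isinstance(value, str) and value:
            correct += 1  # invented but schema-valid value slips through
    return correct / len(fields)


gold = {"vendor_name": "Acme GmbH", "vendor_tax_id": None, "total": "100.00"}
pred = {"vendor_name": "Acme GmbH", "vendor_tax_id": "DE123456789", "total": "100.00"}
```

On this pair the lenient rule gives full marks while the strict rule docks the invented tax ID, which is the effect that inflated the fine-tune's margin.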

Why Bifrost vs LiteLLM or Portkey

I've used all three. Honest comparison:

LiteLLM is fine if all you want is the proxy layer. Easier to drop into a Python script. The plugin story is weaker, so you'd be writing more glue for the mirror behavior we needed.
Portkey has nicer observability dashboards out of the box, and its guardrails feature is more mature than what Bifrost ships today. If your priority is policy enforcement on user-facing chat traffic, look there first.
Bifrost won for us because the Go core handles the request volume without GIL-related weirdness, the plugin hooks let us implement the shadow flag without forking, and the virtual keys model already matched how we track team budgets. The semantic caching feature was not relevant here. Extraction prompts are too input-specific.

I'd switch tomorrow if Portkey shipped a documented mirror primitive and a Go core. They haven't.

Trade-offs and limitations

The shadow approach doubles your inference calls during the test window, so you pay for both models on every request. We ran for 14 days, which felt long. Five days would have been enough to cover the input distribution, but extraction traffic has weekly seasonality (Mondays look different), so we wanted two full cycles.

vLLM on a single H100 fits our throughput. If your shadow target is a 70B model you'd need cluster routing, and Bifrost's clustering is enterprise-only. The README is explicit about that. Plan accordingly.

The judge problem cost us a week of confusion. Run your judge against a small human-labeled set first. If it agrees with the human labels less than 90% of the time, the judge is the bottleneck, not the model.
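A quick way to run that check, sketched under the assumption that both the judge and the human labelers emit a pass/fail verdict per (trace ID, field) pair. The function names and the data layout are illustrative only.

```python
def judge_agreement(judge_verdicts, human_verdicts):
    """Fraction of shared (trace_id, field) items where judge and human agree."""
    shared = judge_verdicts.keys() & human_verdicts.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    agree = sum(1 for key in shared if judge_verdicts[key] == human_verdicts[key])
    return agree / len(shared)


def judge_is_usable(judge_verdicts, human_verdicts, threshold=0.90):
    """Apply the 90% rule of thumb from the text."""
    return judge_agreement(judge_verdicts, human_verdicts) >= threshold
```

If judge_is_usable returns False, fix the judge prompt or rubric before reading anything into the model comparison.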

One last thing. Shadow traffic with the same trace ID means your APM tool sees double the spans. Filter those out at the collector or your dashboards lie.