Provider drift broke our regression evals. We pinned versions through Bifrost.

#mlops #llm #machinelearning #sre

TL;DR: Our nightly agent regression suite dropped 4 points on a tool-calling metric with zero code or prompt changes. The cause was a provider silently rotating the model behind a floating alias. We moved eval traffic through Bifrost, pinned exact model strings per provider, and added per-model Prometheus latency metrics so the next drift shows up as a graph instead of a Slack mystery.

I lead the fine-tuning and eval team at Nexus Labs. Series B, enterprise agent automation. We run a nightly suite of about 2,400 adversarial test cases against whatever models our agents call in production. The suite is the contract. If it moves, something changed.

On a Tuesday in April it moved. Tool-call accuracy went from 0.91 to 0.87 overnight. No deploy. No prompt edit. Git was clean.

The model under you is not stable

We were calling a floating alias on a hosted provider. The kind that maps to "the current version" and gets repointed when the vendor ships an update. Our eval harness recorded the alias string, not the resolved version. So the harness thought it was testing the same thing two nights running. It wasn't.

That is the part people skip. You can pin your seed, your temperature, your prompt template, your sampling params. The weights still move under you. A 4-point swing on a contract metric is the difference between shipping and not, and we spent a day and a half bisecting our own code for a bug that lived in someone else's deploy.

The fix is boring. Pin the exact version. Make the gateway enforce it. Alert when the resolved model string changes.
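
A minimal sketch of that last step, assuming the harness stores, for each run, a mapping from the model string it requested to the `model` field returned in the response body (OpenAI-compatible responses carry one). The function name, variable names, and model strings here are hypothetical; they are not taken from our harness or from Bifrost.

```python
from typing import Dict, List, Tuple


def resolved_model_changes(
    previous: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (requested, old_resolved, new_resolved) for every requested
    model whose resolved string differs between two runs."""
    changes = []
    for requested, new_resolved in sorted(current.items()):
        old_resolved = previous.get(requested)
        # Only flag models seen in both runs; a brand-new model is not drift.
        if old_resolved is not None and old_resolved != new_resolved:
            changes.append((requested, old_resolved, new_resolved))
    return changes


# Hypothetical values: the alias string is identical on both nights,
# but the version it resolved to is not.
monday = {"example-model-latest": "example-model-2024-01-01"}
tuesday = {"example-model-latest": "example-model-2024-03-01"}

for requested, old, new in resolved_model_changes(monday, tuesday):
    print(f"DRIFT: {requested} resolved to {new}, previously {old}")
```

Comparing against the previous night's record rather than a hard-coded value keeps the check useful even for models you have not pinned yet.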

Why a gateway and not just a constant in our config

We already had model names in a config file. The problem is enforcement and visibility, not storage. We needed three things at the call layer: the exact model string sent on every request, a metric tagged by resolved model, and failover so a provider 500 mid-suite does not kill a 90-minute run.

We put Bifrost (https://github.com/maximhq/bifrost) in front. It is a Go gateway, OpenAI-compatible, so our eval client changed by one base URL. Provider and model become an explicit provider/model string in the request, with no floating aliases unless we opt into one.

# bifrost config -- explicit versions, no floating aliases
providers:
  openai:
    keys:
      - value: env.OPENAI_KEY
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY

# eval client now sends fully-qualified model strings:
#   anthropic/claude-sonnet-4-6
#   openai/gpt-4o-mini-2024-07-18
# a dated string cannot be silently rotated under us

The request side stays explicit:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini-2024-07-18",
    "messages": [{"role": "user", "content": "..."}]
  }'
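
The same rule can also be enforced on the client before a request leaves the harness. This is a sketch, assuming a hand-maintained allowlist; `PINNED_MODELS` and `build_payload` are hypothetical names, and the two model strings are the ones from the config comments above.

```python
import json
from typing import Dict, List

# Allowlist of fully-qualified provider/model strings the eval suite may send.
PINNED_MODELS = frozenset({
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-4o-mini-2024-07-18",
})


def build_payload(model: str, messages: List[Dict[str, str]]) -> str:
    """Build a chat-completions request body, refusing unpinned models."""
    if model not in PINNED_MODELS:
        raise ValueError(f"model {model!r} is not in the pinned allowlist")
    return json.dumps({"model": model, "messages": messages})


body = build_payload(
    "openai/gpt-4o-mini-2024-07-18",
    [{"role": "user", "content": "..."}],
)
print(body)
```

An allowlist is stricter than pattern-matching on a date suffix: a new model has to be added on purpose, which is the point.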

Native Prometheus metrics gave us latency and request counts labeled by model. When the dated string stops resolving because the vendor retired it, the suite fails loud on a 4xx instead of quietly testing a substitute. That is the behavior I want. Fail visible, not silent.

Failover that does not corrupt the eval

A subtler trap: automatic failover is great for production and dangerous for evals. If provider A times out and Bifrost retries on provider B, your eval row now reflects a different model than the column header says. So we scope it. Production keys get fallbacks. The eval virtual key gets retries on the same model only, no cross-provider fallback. Same gateway, two policies.
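
The eval-side policy, sketched at the client level to make the idea concrete. This is not Bifrost's configuration or API; `call_same_model_only` and the fake `flaky_send` are hypothetical, and a real harness would catch narrower exception types.

```python
import time
from typing import Callable, Dict, List


def call_same_model_only(
    send: Callable[[str], Dict],
    model: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> Dict:
    """Retry send(model) on failure, never substituting another model."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return send(model)
        except Exception as exc:  # broad on purpose for the sketch
            last_error = exc
            # Linear backoff; zero by default so the demo runs instantly.
            time.sleep(backoff_seconds * (attempt + 1))
    raise RuntimeError(
        f"{model} failed after {max_attempts} attempts"
    ) from last_error


# Fake sender that times out twice, then succeeds on the third call.
calls: List[str] = []


def flaky_send(model: str) -> Dict:
    calls.append(model)
    if len(calls) < 3:
        raise TimeoutError("simulated provider timeout")
    return {"model": model, "content": "ok"}


result = call_same_model_only(flaky_send, "openai/gpt-4o-mini-2024-07-18")
print(result["model"], len(calls))
```

If every attempt fails, the row fails. It is never filled in by a different model.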

That distinction matters more than the drift fix itself. A gateway that "just works" by silently routing around failures is exactly the kind of thing that poisoned our data in the first place.

Honest comparison

We looked at LiteLLM and Portkey before landing here.

Concern	Bifrost	LiteLLM	Portkey
OpenAI-compatible single API	Yes	Yes	Yes
Self-host, no vendor cloud	Yes (Go binary/Docker)	Yes (Python)	Gateway OSS; control plane leans hosted
Per-model Prometheus metrics	Native	Via callbacks/config	Hosted dashboards
Maturity / ecosystem	Newer, fewer integrations	Largest provider list, most battle-tested	Polished hosted UX
Config surface	Web UI + JSON	Python-config heavy	Hosted-first

LiteLLM has the wider provider coverage and far more Stack Overflow answers when something breaks at 2am. If you live in Python and want the longest integration list, it is the safe pick. Portkey's hosted observability is genuinely nicer out of the box than wiring your own Grafana. Bifrost won for us because it is a single Go process we run ourselves, the OpenAI-compatible surface meant a one-line client change, and the Prometheus labels were exactly the cardinality we wanted without a callback plugin. Different teams, different answer.

Trade-offs and limitations

A gateway does not detect drift on its own. It records the exact model string; you still have to alert on changes and pin dated versions. If you keep calling floating aliases through Bifrost, you have added a hop and solved nothing.

It's another process in the path. For our eval traffic that is fine, sub-millisecond overhead against multi-second LLM calls. For ultra-latency-sensitive serving you would benchmark it yourself.

And it is younger software. We hit one config-reload quirk early. LiteLLM's longer track record is a real argument if you cannot afford to debug a gateway.

Dated model strings also age out. When a provider retires gpt-4o-mini-2024-07-18, our suite breaks loudly and we re-baseline on purpose. That is the point, but it is maintenance, not magic.

The model is the easy part. The thing that moved under us was the infrastructure around it, and the only defense is making every change observable.