Benchmarking 5 LLM providers on one eval set, no SDK per vendor

#machinelearning #llm #mlops #devops

TL;DR: We run a 1,200-case eval suite for enterprise agent automation at Nexus Labs. Comparing models across OpenAI, Anthropic, Bedrock, Vertex, and Groq used to mean five client libraries and five sets of retry logic. We put Bifrost in front of all of them and now the harness talks to one OpenAI-compatible endpoint. Here's what that bought us, and where it didn't help.

The problem was never the models

Our eval set is 1,200 cases. Tool-call traces, multi-turn agent transcripts, graded against a rubric. The grading is hard. The models are not the hard part.

The hard part was the plumbing around the models. Every time we wanted to score a new candidate, the code branched. openai.ChatCompletion for GPT-4o. anthropic.messages for Claude. The boto3 Bedrock runtime for the Llama variants. Vertex had its own auth dance with service accounts. Groq was OpenAI-shaped but pointed at a different base URL with its own rate limits.

Five providers. Five SDKs. Five retry policies, all written slightly differently by whoever added that provider. When a Vertex call 429'd at case 800 of a run, the harness handled it differently than when Anthropic did. So our results carried noise that had nothing to do with model quality.

One endpoint, five backends

Bifrost is a gateway. It speaks one OpenAI-compatible API and routes to 23+ providers behind it. We run it self-hosted, Docker, on a single box next to the eval workers.

docker run -p 8080:8080 maximhq/bifrost

The harness stopped importing vendor SDKs. It points the OpenAI client at localhost:8080/v1 and changes the model string. That's the whole switch.

import openai

# The provider keys live in the gateway config, so the client-side key is a placeholder.
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# eval_set and record() come from our harness and are not shown here.
for model in ["openai/gpt-4o", "anthropic/claude-sonnet-4-6",
              "bedrock/llama-3.3-70b", "vertex/gemini-2.0",
              "groq/llama-3.3-70b"]:
    for case in eval_set:           # 1,200 cases
        resp = client.chat.completions.create(
            model=model, messages=case.messages)
        record(case, model, resp)

The retry and fallback logic moved out of our Python and into the gateway config. One policy, applied to every provider, as described in Bifrost's retries and fallbacks documentation. When Vertex 429s now, it gets the same backoff as every other provider. The eval results stopped carrying our plumbing's personality.
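For readers who want to see what "one policy" means in practice, here is a minimal sketch of a provider-agnostic backoff in Python. It is not Bifrost's implementation or config format; `RetryableError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the base delay, cap, and retry count are assumed values.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error type: raised for any status worth retrying (e.g. 429)."""

    def __init__(self, status):
        super().__init__(f"retryable status {status}")
        self.status = status


def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random.random):
    """Yield one delay per retry: exponential growth with full jitter, capped."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retry(send, max_retries=5, sleep=time.sleep, rng=random.random):
    """Call send() until it succeeds or retries run out.

    The schedule never looks at which provider is behind send(), which is the point.
    """
    delays = backoff_delays(max_retries=max_retries, rng=rng)
    while True:
        try:
            return send()
        except RetryableError:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)
```

The property that matters for evals is that nothing in the schedule depends on the provider, so a 429 at case 800 costs the same wait no matter who returned it.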

What it changed, with numbers

A full sweep is 1,200 cases × 5 models = 6,000 calls. Before, roughly one sweep in three failed partway, usually on a provider-specific timeout we hadn't normalized. We babysat it.

After, the gateway's load balancing across multiple API keys per provider cut our 429 retries enough that a clean sweep became the default. We added a second OpenAI key and a second Anthropic key to the config; Bifrost distributes across them. No code change.
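The idea behind per-key distribution is easy to sketch. This is a plain round-robin in Python, assuming equal weights per key; Bifrost's actual strategy may differ (weighted, for instance), and `KeyPool` and the key names are placeholders.

```python
import itertools
from collections import Counter


class KeyPool:
    """Round-robin over API keys per provider (illustrative, not Bifrost's algorithm)."""

    def __init__(self, keys_by_provider):
        self._cycles = {
            provider: itertools.cycle(keys)
            for provider, keys in keys_by_provider.items()
        }

    def next_key(self, provider):
        """Return the next key for this provider, wrapping around at the end."""
        return next(self._cycles[provider])


pool = KeyPool({
    "openai": ["openai-key-a", "openai-key-b"],          # placeholder key names
    "anthropic": ["anthropic-key-a", "anthropic-key-b"],
})

# One model's worth of cases: each OpenAI key sees half the traffic.
usage = Counter(pool.next_key("openai") for _ in range(1200))
```

With two keys, each one carries 600 of the 1,200 calls for a model, so each key's request rate against the provider's limit is halved.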

Semantic caching helped less than I expected on first runs and more than I expected on re-runs. Roughly 18% of our cases are near-duplicate prompts (same system prompt, tiny input deltas). Semantic caching served those on repeat runs. On a re-run after a rubric tweak, that shaved real wall-clock and provider spend. On a first run, zero benefit. Obvious in hindsight.
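A back-of-envelope for what that 18% is worth on a re-run. This is an upper bound, assuming every near-duplicate case hits the cache and nothing else does; it is arithmetic on the numbers above, not a measured hit rate.

```python
CASES = 1200
MODELS = 5
NEAR_DUPLICATE_SHARE = 0.18  # "roughly 18%" of cases

calls_per_sweep = CASES * MODELS                        # 6,000
cacheable_per_model = round(CASES * NEAR_DUPLICATE_SHARE)  # 216
max_cached_calls = cacheable_per_model * MODELS         # 1,080
min_provider_calls = calls_per_sweep - max_cached_calls  # 4,920

print(f"{max_cached_calls} of {calls_per_sweep} calls could come from cache")
```

So at best about 1,080 of the 6,000 calls skip the provider on a re-run, and none do on a first run.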

How it stacks up against the alternatives

We looked at LiteLLM and Portkey first. Both are good. The honest comparison:

Concern	Bifrost	LiteLLM	Portkey
Deployment	Self-host, Go binary	Self-host, Python proxy	Managed SaaS (self-host limited)
OpenAI-compatible API	Yes	Yes	Yes
Provider breadth	23+	Largest list	Broad
Per-key load balancing	Built in	Built in	Built in
Semantic caching	Built in	Via config	Strong, managed
Observability UI	Prometheus + web UI	Thinner UI	Best-in-class dashboards
Governance / virtual keys	Yes	Basic	Yes

If you want the widest provider list and you're Python-native already, LiteLLM is the pragmatic pick. If you want a managed dashboard and don't want to run infrastructure, Portkey's observability is ahead of what we self-host. We picked Bifrost because the Go gateway's overhead was low enough to not show up in our latency numbers, and we wanted the whole thing inside our network with no per-call data leaving. Virtual keys also let us split eval spend by team, which Stripe-brain me appreciates.

Trade-offs and limitations

It's another process to run. If your eval suite only hits one provider, a gateway is pure overhead. Don't add it for a single backend.

Semantic caching can lie to you in evals. A cached response means you're scoring an old generation, not the current model. We disable caching on any run that's measuring model quality and only enable it for re-scoring runs where the generations are fixed and the rubric changed. If you forget that, you'll report a model improvement that's actually a cache hit.
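One way to make that rule hard to forget is to have the run config refuse the bad combination. A minimal sketch in Python; `RunKind` and `RunConfig` are hypothetical harness types, and how the cache flag actually reaches the gateway is deliberately left out because it depends on your setup.

```python
from dataclasses import dataclass
from enum import Enum


class RunKind(Enum):
    MODEL_QUALITY = "model_quality"  # fresh generations are what is being measured
    RESCORE = "rescore"              # generations fixed, only the rubric changed


@dataclass(frozen=True)
class RunConfig:
    kind: RunKind
    cache_enabled: bool

    def __post_init__(self):
        # Fail at construction time, before any call is made or any score is recorded.
        if self.kind is RunKind.MODEL_QUALITY and self.cache_enabled:
            raise ValueError("caching must be off when measuring model quality")


rescore = RunConfig(kind=RunKind.RESCORE, cache_enabled=True)  # allowed
```

Constructing `RunConfig(kind=RunKind.MODEL_QUALITY, cache_enabled=True)` raises `ValueError`, so the mistake surfaces before the sweep starts instead of in the report.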

The unified API smooths over provider differences you sometimes want to see. Bedrock's stop-reason semantics aren't identical to OpenAI's. The gateway normalizes the response shape, so a few provider-specific fields we used to inspect now need digging. For most eval work that's fine. For debugging a weird truncation, I still drop to the raw provider once in a while.
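To make the normalization concrete, here is a sketch of the kind of mapping a unified API implies. The left-hand values are stop reasons as I understand them from Anthropic's Messages API and Bedrock's Converse API; the mapping itself is my assumption, not what Bifrost does internally, and `normalize_stop_reason` is a hypothetical helper.

```python
# Assumed mapping for illustration; check it against what your gateway actually returns.
TO_OPENAI_FINISH_REASON = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def normalize_stop_reason(raw):
    """Return (finish_reason, raw) so the provider's original value is never lost.

    Anything not in the table is reported as "unknown" rather than guessed.
    """
    return TO_OPENAI_FINISH_REASON.get(raw, "unknown"), raw


finish_reason, raw_reason = normalize_stop_reason("max_tokens")
```

Note that `end_turn` and `stop_sequence` both collapse to `stop`, which is exactly the kind of detail that disappears in a normalized response. Keeping the raw value next to the normalized one in the eval record is cheap insurance for the day a truncation looks odd.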

And it's young. The community is smaller than LiteLLM's. When I hit an edge case with Vertex auth, there were fewer GitHub issues to crib from. We filed one.

What I'd tell my past self

The model swap was never the work. The work was making five providers behave like one so the comparison meant something. A gateway is the boring infrastructure answer, and boring is what you want under an eval harness.

Run the sweep clean. Hold the plumbing constant. Then argue about the models.