Build the eval set before you swap the model.

The pattern I keep seeing on teams chasing AI cost reductions: someone swaps a workload from GPT-4o to DeepSeek-V3, eyeballs a handful of outputs, calls it good, ships it. The cost graph drops the next day. Three weeks later a customer surfaces a regression — the cheaper model hallucinates a date format 6% more often, breaks the downstream invoice generator, and the rollback erases most of the savings plus a week of engineering time.

The fix isn't "don't swap models." Swapping is mostly the right move — DeepSeek-V3 at $0.07/$0.28 per million tokens vs GPT-4o at $2.50/$10 is too much money to leave on the table when the workload tolerates it.

The fix is: build the eval set before the swap, not after.

What a useful eval set looks like

You don't need fancy infrastructure for this. Five steps:

1. Pull 100-300 real prompts from production logs

Cover the long tail of inputs, not just the happy path. Include the weird ones:

  • The customer ticket in Spanish when your system was designed for English
  • The PR diff with binary files mixed in
  • The malformed JSON the user pasted instead of describing the issue in words
  • The prompt that hit a 30-second timeout last Tuesday

These are the inputs where models actually differ. A 50-prompt happy-path eval will tell you both models are 99% accurate, and you'll learn nothing.
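A minimal sketch of the sampling step, assuming your logs land as JSONL with an "id" and a "prompt" field per line (adjust to whatever your logging actually emits):

import json
import random

# Assumption: one JSON object per line, each with an "id" and "prompt" field.
# Swap in whatever shape your production logs actually have.
with open("prompt_logs.jsonl") as f:
    logged = [json.loads(line) for line in f]

# Random sample for breadth; the weird long-tail cases usually still
# have to be hand-picked and added on top.
random.seed(42)
sampled = random.sample(logged, k=min(250, len(logged)))

eval_set = {entry["id"]: entry["prompt"] for entry in sampled}

with open("eval_set.json", "w") as f:
    json.dump(eval_set, f, indent=2)

Random sampling gets you coverage; it won't find last Tuesday's timeout prompt for you. Add those by hand.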

2. Get the current model's outputs on those prompts

Save them with a timestamp. This is your baseline. Don't skip this — you'll need it for the comparison and you can't reconstruct it later if the model gets deprecated.

import json

from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://your-gateway/v1")

baseline_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    baseline_outputs[prompt_id] = resp.choices[0].message.content

with open("baseline_gpt4o.json", "w") as f:
    json.dump(baseline_outputs, f)

3. Get the candidate model's outputs on the same prompts

Same code, different model name. If your application is wired through an OpenAI-compatible gateway, this is one config change:

candidate_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="deepseek-chat",  # the only change
        messages=[{"role": "user", "content": prompt}],
    )
    candidate_outputs[prompt_id] = resp.choices[0].message.content

The cost of running 300 prompts through DeepSeek-V3 is roughly $0.20. Don't optimize this step.

4. Compare programmatically where you can, human-review the rest

For structured outputs (JSON, tool calls, field extraction), programmatic comparison covers most ground:

  • Schema validity: does the output parse?
  • Field match: do the extracted fields match the baseline?
  • Edit distance for short strings
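Here's a sketch of those checks for a JSON-extraction workload. The field list and the similarity scoring are assumptions for illustration, not a fixed recipe:

import json
from difflib import SequenceMatcher

REQUIRED_FIELDS = ["invoice_id", "date", "total"]  # assumed schema, for illustration

def compare_structured(baseline_raw, candidate_raw):
    """Return pass/fail checks for one prompt's pair of outputs."""
    result = {"parses": False, "fields_match": False, "similarity": 0.0}
    try:
        baseline = json.loads(baseline_raw)
        candidate = json.loads(candidate_raw)
    except json.JSONDecodeError:
        return result  # schema validity: one of the outputs doesn't even parse
    result["parses"] = True
    # Field match: do the extracted fields agree with the baseline?
    result["fields_match"] = all(
        candidate.get(f) == baseline.get(f) for f in REQUIRED_FIELDS
    )
    # Edit-distance-style similarity for a short free-text field.
    result["similarity"] = SequenceMatcher(
        None, str(baseline.get("notes", "")), str(candidate.get("notes", ""))
    ).ratio()
    return result

results = {
    pid: compare_structured(baseline_outputs[pid], candidate_outputs[pid])
    for pid in baseline_outputs
}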

For free-form outputs (summaries, explanations, agent responses), human review is the bottleneck. Three minutes per prompt × 300 prompts = 15 hours, which sounds bad but is a one-time cost for a decision that affects every production call going forward.

Use an LLM-as-judge to triage: have a stronger model (Claude 3.5, GPT-4o) rate each candidate output against the baseline as better / equivalent / worse / different-but-acceptable. Then human-review only the worse and different buckets. That cuts human time by ~70% in my experience.
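A rough version of that triage, reusing the same OpenAI-compatible client. The judge prompt and the choice of judge model are illustrative:

JUDGE_PROMPT = """You are comparing two answers to the same prompt.
Prompt: {prompt}
Answer A (current model): {baseline}
Answer B (candidate model): {candidate}
Reply with exactly one word: better, equivalent, worse, or different
(different = not equivalent to A, but still acceptable)."""

verdicts = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="gpt-4o",  # any stronger model you trust as the judge
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                prompt=prompt,
                baseline=baseline_outputs[prompt_id],
                candidate=candidate_outputs[prompt_id],
            ),
        }],
    )
    verdicts[prompt_id] = resp.choices[0].message.content.strip().lower()

# Humans only look at the buckets that matter.
needs_review = [pid for pid, v in verdicts.items() if v in ("worse", "different")]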

5. Set a threshold before you ship

"Candidate model has to match baseline on at least 95% of evals to ship" is a reasonable default. The exact number depends on the workload:

  • Safety-critical (legal, medical, financial): 99%+
  • User-facing high-stakes (customer-facing summaries): 97%+
  • Internal tooling (Slack summaries, dev tools): 92%+
  • Background tasks (data cleanup, tagging): 85%+

Pick the threshold before you see the numbers. Picking after is how you talk yourself into shipping a model that's slightly worse on the dimension you care about.
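The gate itself is a few lines once the verdicts exist. The 0.95 here is the default above, not a law, and ideally you run it after the human pass has corrected the judge's worse/different calls:

THRESHOLD = 0.95  # decided before looking at the numbers

# Count verdicts that are acceptable (better, equivalent, or
# different-but-acceptable) after human review of the flagged buckets.
acceptable = sum(
    1 for v in verdicts.values() if v in ("better", "equivalent", "different")
)
pass_rate = acceptable / len(verdicts)

print(f"pass rate: {pass_rate:.1%} (threshold {THRESHOLD:.0%})")
if pass_rate < THRESHOLD:
    raise SystemExit("Candidate model does not clear the bar; keep the baseline.")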

The architectural prerequisite

This whole loop only works cheaply if swapping the model for the eval is a config change, not an integration project. Wire your application code through the OpenAI Python SDK with a configurable base_url and let a gateway handle the provider-specific bits.

client = OpenAI(
    api_key="th-...",
    base_url="https://your-gateway/v1",
)

# Same client, different model per call
client.chat.completions.create(model="gpt-4o", ...)
client.chat.completions.create(model="deepseek-chat", ...)
client.chat.completions.create(model="claude-3-5-sonnet", ...)

I use TokenHub for the gateway — 40+ models behind one API key, route per call. LiteLLM self-hosted gets you the same shape if you'd rather run it yourself.

Without that wiring, every eval is a custom integration project, which is why most teams don't run evals.

TL;DR

  1. Pull 100-300 real prompts from logs (include weird ones)
  2. Run baseline model, save outputs
  3. Run candidate model, save outputs
  4. Compare (programmatic for structured, LLM-judge + human for free-form)
  5. Threshold before shipping

The whole exercise takes a day. It saves you the rollback story.
