Every few weeks, a Reddit post blows up claiming some new open-source model is a "legit replacement" for whatever proprietary API you're paying for. You get excited, spin it up, ask it a few questions, and... it seems great? But then you swap it into your production pipeline and everything falls apart.
I've been through this cycle at least four times now. Here's what I've learned about properly evaluating open-source models before you commit to the migration.
## The Real Problem: Vibes-Based Evaluation
The core issue isn't that open-source models are bad — they've gotten shockingly good. The problem is how most developers evaluate them. They run a few cherry-picked prompts, compare outputs side-by-side, and declare victory.
This is vibes-based evaluation, and it will burn you.
Here's why: LLMs have wildly different performance profiles across tasks. A model might crush it on code generation but completely fall apart on structured JSON output. It might write beautiful prose but hallucinate API parameters that don't exist. Your Reddit-inspired test of "write me a Python script" tells you almost nothing about how it'll handle your actual workload.
## Step 1: Define What You Actually Need
Before you touch a single model, write down the specific capabilities your application depends on. Be ruthlessly specific.
```python
# eval_config.py - Define your evaluation dimensions
EVAL_DIMENSIONS = {
    "structured_output": {
        "description": "Returns valid JSON matching a schema",
        "weight": 0.3,  # 30% of our usage is structured extraction
    },
    "code_generation": {
        "description": "Generates working code in Python/TypeScript",
        "weight": 0.25,
    },
    "instruction_following": {
        "description": "Follows complex multi-step instructions precisely",
        "weight": 0.25,
    },
    "long_context": {
        "description": "Maintains coherence over 8k+ token inputs",
        "weight": 0.2,
    },
}
```
The weights matter. If 80% of your API calls are structured data extraction, a model that's 10% better at creative writing but 5% worse at JSON output is a net loss for you.
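To make that trade-off concrete, the weights fold your per-dimension scores (each normalized to 0–1) into a single comparable number. Here's a minimal sketch; the weights repeat the config above, and the score values are hypothetical, purely for illustration:

```python
# Weights from the eval config above, trimmed to what this sketch needs.
EVAL_DIMENSIONS = {
    "structured_output": {"weight": 0.3},
    "code_generation": {"weight": 0.25},
    "instruction_following": {"weight": 0.25},
    "long_context": {"weight": 0.2},
}

def weighted_score(scores: dict) -> float:
    """Weighted average of per-dimension scores, normalized by total weight."""
    total_weight = sum(d["weight"] for d in EVAL_DIMENSIONS.values())
    return sum(
        EVAL_DIMENSIONS[dim]["weight"] * score
        for dim, score in scores.items()
        if dim in EVAL_DIMENSIONS
    ) / total_weight

# Hypothetical numbers: the candidate is 5% worse at your dominant task,
# identical everywhere else -- and loses overall.
candidate = weighted_score({
    "structured_output": 0.85,
    "code_generation": 0.90,
    "instruction_following": 0.88,
    "long_context": 0.80,
})
baseline = weighted_score({
    "structured_output": 0.90,
    "code_generation": 0.90,
    "instruction_following": 0.88,
    "long_context": 0.80,
})
```

A single number makes the final comparison table trivial to read, and the normalization means you can add or drop dimensions later without rebalancing everything by hand.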
## Step 2: Build a Real Eval Suite From Your Production Data
Grab actual inputs from your production logs. Not synthetic benchmarks — your real prompts, with your real edge cases.
```python
import json
import random
from collections import defaultdict
from pathlib import Path

def build_eval_dataset(log_dir: str, sample_size: int = 200):
    """Sample real production prompts for evaluation."""
    samples = []
    log_path = Path(log_dir)
    for log_file in log_path.glob("*.jsonl"):
        with open(log_file) as f:
            for line in f:
                entry = json.loads(line)
                samples.append({
                    "input": entry["prompt"],
                    "expected": entry["response"],  # baseline from current model
                    "category": entry.get("task_type", "general"),
                    "tokens_in": entry.get("input_tokens", 0),
                })

    # Stratified sampling across categories
    by_category = defaultdict(list)
    for s in samples:
        by_category[s["category"]].append(s)

    eval_set = []
    per_category = sample_size // len(by_category)
    for cat, items in by_category.items():
        eval_set.extend(random.sample(items, min(per_category, len(items))))
    return eval_set
```
Two hundred samples is a decent starting point. Fewer than fifty and your results are noise; more than a thousand and you're wasting compute, unless you have a very diverse workload.
## Step 3: Run the Models Head-to-Head
Here's where most people go wrong — they test the new model in isolation. You need an A/B comparison against your current setup with automated scoring.
```python
import time
from openai import OpenAI  # works with any OpenAI-compatible endpoint

def run_comparison(eval_set, model_configs):
    """Run each eval sample against all model configs."""
    # One client per endpoint, reused across all samples
    clients = {
        name: OpenAI(
            base_url=config["base_url"],
            api_key=config.get("api_key", "not-needed"),
        )
        for name, config in model_configs.items()
    }
    results = []
    for sample in eval_set:
        row = {"input": sample["input"], "category": sample["category"]}
        for name, config in model_configs.items():
            start = time.perf_counter()
            try:
                resp = clients[name].chat.completions.create(
                    model=config["model"],
                    messages=[{"role": "user", "content": sample["input"]}],
                    temperature=0.0,  # deterministic for fair comparison
                    max_tokens=config.get("max_tokens", 2048),
                )
                elapsed = time.perf_counter() - start
                row[f"{name}_output"] = resp.choices[0].message.content
                row[f"{name}_latency"] = elapsed
                row[f"{name}_tokens"] = resp.usage.completion_tokens
                row[f"{name}_error"] = None
            except Exception as e:
                row[f"{name}_output"] = None
                row[f"{name}_latency"] = None
                row[f"{name}_error"] = str(e)
        results.append(row)
    return results

# Example config: local model via vLLM/Ollama vs. your current API
model_configs = {
    "current_api": {
        "base_url": "https://api.your-provider.com/v1",
        "api_key": "sk-...",
        "model": "your-current-model",
    },
    "local_candidate": {
        "base_url": "http://localhost:8000/v1",  # vLLM or similar
        "model": "moonshot/kimi-k2",
        "api_key": "not-needed",
    },
}
```
Notice I'm using `temperature=0.0` for both. This isn't how you'd run in production, but it eliminates randomness from your comparison. You want to measure the model, not the dice rolls.
## Step 4: Score What Matters (Not What's Easy)
Here's the scoring framework I use. It's not fancy, but it catches the things that actually break production systems.
```python
import json
import re

def score_structured_output(output: str, expected_schema: dict) -> float:
    """Check if output is valid JSON matching expected structure."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        # Try extracting JSON from markdown code blocks
        match = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", output or "", re.DOTALL)
        if match:
            try:
                parsed = json.loads(match.group(1))
            except json.JSONDecodeError:
                return 0.0
        else:
            return 0.0

    # Check required keys exist
    required_keys = set(expected_schema.get("required", []))
    present_keys = set(parsed.keys())
    key_score = len(required_keys & present_keys) / len(required_keys) if required_keys else 1.0
    return key_score

def score_latency(candidate_ms: float, baseline_ms: float) -> float:
    """Score latency relative to baseline. 1.0 = same speed, >1.0 = faster."""
    if candidate_ms is None or candidate_ms == 0:
        return 0.0
    return baseline_ms / candidate_ms
```
The JSON extraction fallback is critical. Some models wrap their JSON output in markdown code fences even when you ask them not to. If you don't handle that, you'll score them as failures when the actual content is fine.
## Step 5: The Numbers That Actually Matter
After running your eval, focus on these metrics:
- **Failure rate:** How often does the model produce unusable output? A 2% failure rate on structured output might sound fine until you realize that's hundreds of retries per day at scale.
- **P95 latency:** Average latency is meaningless. Your users feel the slow requests, not the fast ones.
- **Cost per token:** For local models, calculate your actual hardware cost divided by throughput. GPU hours aren't free just because you own the hardware.
- **Consistency:** Run the same eval three times. If scores vary more than 5%, the model's behavior is too unstable for production use.
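These metrics fall straight out of the comparison rows. A minimal sketch, assuming rows shaped like the ones `run_comparison` above produces (the model name prefix is whatever key you used in `model_configs`):

```python
import statistics

def summarize(results: list, name: str) -> dict:
    """Failure rate and p95 latency for one model's columns in the results."""
    errors = sum(1 for r in results if r.get(f"{name}_error") is not None)
    latencies = sorted(
        r[f"{name}_latency"] for r in results
        if r.get(f"{name}_latency") is not None
    )
    # p95 via nearest-rank index into the sorted latencies
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    return {
        "failure_rate": errors / len(results),
        "p95_latency": p95,
    }

def is_stable(run_scores: list, tolerance: float = 0.05) -> bool:
    """Consistency check: given overall scores from repeated eval runs,
    flag the model as unstable if the spread exceeds ~5% of the mean."""
    spread = max(run_scores) - min(run_scores)
    return spread / statistics.mean(run_scores) <= tolerance
```

The nearest-rank p95 is crude but fine at this sample size; the point is to look at the tail, not the mean.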
## The Gotchas Nobody Mentions
**Quantization changes everything.** That open-source model everyone's raving about? They tested the full fp16 weights. You're running a 4-bit quantized version because you don't have 8 A100s. The quality difference can be dramatic, especially for structured output and math.

**System prompts behave differently.** A prompt that works perfectly with one model might completely confuse another. Budget time for prompt adaptation — it's not just a drop-in swap.

**Context window claims are optimistic.** A model might technically support 128k tokens, but performance often degrades significantly past a certain threshold. Test with your actual context lengths, not the number on the spec sheet.
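One cheap way to check degradation is a needle-in-a-haystack probe at the lengths you actually send. A rough sketch — the filler text and buried fact are placeholders, and `ask_model` stands in for whatever client call you already have:

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog. "  # placeholder padding
NEEDLE = "The vault code is 4912."  # hypothetical fact to recover

def build_probe(target_words: int) -> str:
    """Bury a known fact at a random depth inside ~target_words of filler."""
    words = (FILLER * (target_words // 9 + 1)).split()[:target_words]
    depth = random.randint(0, len(words))
    words.insert(depth, NEEDLE)
    return " ".join(words) + "\n\nWhat is the vault code? Answer with the number only."

def recall_at_length(ask_model, target_words: int, trials: int = 5) -> float:
    """Fraction of trials where the model recovers the buried fact."""
    hits = sum("4912" in ask_model(build_probe(target_words)) for _ in range(trials))
    return hits / trials

# Sweep your real context lengths, not the spec-sheet maximum:
# for n in (1_000, 4_000, 16_000, 64_000):
#     print(n, recall_at_length(ask_model, n))
```

Synthetic recall isn't a substitute for real long-context tasks, but a model that drops the needle at 16k words isn't going to handle your 100-page document summaries either.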
## My Honest Take
Open-source models have genuinely closed the gap for many use cases. For straightforward code generation, summarization, and general Q&A, you can absolutely find open-source options that perform comparably to top-tier APIs. The Mixture-of-Experts architecture in particular has been a game-changer for making large models practical to run.
But "performs comparably on benchmarks" and "works as a drop-in replacement in my production system" are very different claims. Do the work. Run the evals. Test with your actual data.
The thirty minutes you spend building a proper eval suite will save you from the three-day production incident when your new model starts returning XML instead of JSON at 2 AM.
## Prevention: Make This Repeatable
Whatever you build for this evaluation, commit it to your repo. The next time someone posts on Reddit about the hot new model, you should be able to run your eval suite in under an hour and have a definitive answer for your specific use case.
That's the real superpower here — not picking the right model once, but having the infrastructure to evaluate any model quickly and confidently.