DEV Community

Daniel Dong

How to A/B Test AI Models on Your Real User Queries

Not sure which AI model is best for your use case?

Don't trust benchmarks. Test on your actual user queries.

Here's how to A/B test 14+ models in 30 minutes.


Why A/B Test?

Benchmarks can mislead. A model that scores 90% on MMLU might still struggle with:

  • Your specific domain (legal? medical? code?)
  • Your language (English? Chinese? mixed?)
  • Your task type (summarization? reasoning? extraction?)

Solution: Test on real data.


The Setup (5 minutes)

Step 1: Export 50-100 real user queries from your app.

Step 2: Get an AIBridge API key (3M free tokens, no credit card).

Step 3: Write a simple test script:

import time

from openai import OpenAI

client = OpenAI(
    api_key="mb_your_key",
    base_url="https://aibridge-api.com/v1"
)

# Your test queries
test_queries = [
    "Summarize this article...",
    "Write a Python function to...",
    "Translate this to Chinese...",
    # ... 50-100 real queries
]

# Models to test
models_to_test = [
    "deepseek-v4-pro",
    "qwen3-235b-a22b",
    "glm-4-plus",
    "deepseek-v4-flash"
]


def score_output(output, query):
    # Placeholder so the script runs: replace with your own scoring criteria.
    # Returns 1.0 for any non-empty answer, 0.0 otherwise.
    return 1.0 if output and output.strip() else 0.0


# Run A/B test
results = {}
for model in models_to_test:
    scores = []
    for query in test_queries[:10]:  # Test subset first
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}]
        )
        latency = time.perf_counter() - start  # Wall-clock seconds for the full request
        output = response.choices[0].message.content

        # Score it (your criteria here)
        score = score_output(output, query)
        # total_tokens is a usage count; multiply by the model's price to get cost
        scores.append({"score": score, "latency": latency, "tokens": response.usage.total_tokens})

    results[model] = scores

# Compare
for model, scores in results.items():
    avg_score = sum(s["score"] for s in scores) / len(scores)
    avg_latency = sum(s["latency"] for s in scores) / len(scores)
    avg_tokens = sum(s["tokens"] for s in scores) / len(scores)
    print(f"{model}: Score={avg_score:.2f}, Latency={avg_latency:.2f}s, Tokens={avg_tokens:.0f}")
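Step 1 leaves the export format open. Here is a minimal sketch for loading the queries, assuming they were exported as a JSON Lines file with one object per line and a `query` field. The file name, the field name, and the `load_queries` helper are all hypothetical, so adjust them to whatever your app actually exports.

```python
import json
import random
from pathlib import Path


def load_queries(path, sample_size=100, seed=0):
    """Read queries from a JSONL file and return a reproducible random sample.

    Assumes one JSON object per line with a "query" key (hypothetical format).
    """
    queries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("query", "").strip()
        if text:
            queries.append(text)

    # Drop exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(queries))

    # A fixed seed keeps the sample identical across runs,
    # so every model is tested on the same queries.
    if len(unique) <= sample_size:
        return unique
    return random.Random(seed).sample(unique, sample_size)
```

With a file in that shape, `test_queries = load_queries("user_queries.jsonl")` would replace the hard-coded list in the script.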

What to Measure

| Metric | Why It Matters |
| --- | --- |
| Accuracy | Does the output match your quality bar? |
| Latency | How fast does it respond? |
| Cost | What's the $/1M tokens? |
| Consistency | Does it fail on edge cases? |
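The four metrics above can be computed from per-query records. This is a rough sketch, assuming each record holds a score between 0 and 1, wall-clock latency in seconds, and a token count. The `summarize` helper, the 0.5 pass threshold, and the price argument are illustrative assumptions, not AIBridge values.

```python
import math
import statistics


def summarize(records, price_per_million_tokens, pass_threshold=0.5):
    """Aggregate per-query records into accuracy, latency, cost, and consistency figures.

    Each record is a dict with "score" (0..1), "latency" (seconds), and "tokens" (int).
    """
    scores = [r["score"] for r in records]
    latencies = sorted(r["latency"] for r in records)
    total_tokens = sum(r["tokens"] for r in records)

    # Nearest-rank 95th percentile: the latency that 95% of requests stay at or under.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)

    return {
        "mean_score": statistics.mean(scores),
        "p95_latency_s": latencies[p95_index],
        # Single blended rate, as in a "$/1M tokens" price.
        "estimated_cost_usd": total_tokens / 1_000_000 * price_per_million_tokens,
        # Share of queries scoring below the threshold, a simple consistency signal.
        "failure_rate": sum(s < pass_threshold for s in scores) / len(scores),
    }
```

Looking at p95 latency and failure rate alongside the averages helps surface the edge cases that a mean alone can hide. Providers often price input and output tokens differently, so treat the cost figure as an estimate.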

Pro tip: Create a scoring rubric for your specific use case. What's "good" for you might be different from benchmarks.
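One way to turn a rubric into the `score_output` function the test script expects is a list of weighted checks. This is only a sketch: the checks, the weights, and the `RUBRIC` name below are made-up examples for a code-generation use case, so swap in criteria that match your own quality bar.

```python
def is_nonempty(output, query):
    """Check that the model returned any text at all."""
    return bool(output and output.strip())


def has_code(output, query):
    """Check that the answer contains something that looks like a Python definition."""
    return "def " in output or "class " in output


def is_concise(output, query):
    """Check that the answer stays under an arbitrary 2,000-character budget."""
    return len(output) <= 2000


# (check function, weight) pairs; the weights are illustrative.
RUBRIC = [
    (is_nonempty, 1.0),
    (has_code, 2.0),
    (is_concise, 1.0),
]


def score_output(output, query):
    """Return the weighted fraction of rubric checks that pass, between 0.0 and 1.0."""
    output = output or ""  # treat a missing response as empty text
    total = sum(weight for _, weight in RUBRIC)
    earned = sum(weight for check, weight in RUBRIC if check(output, query))
    return earned / total
```

Simple string checks like these are cheap and repeatable, but they only approximate quality. For tasks such as summarization you may need human review or reference answers instead.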

Results (What I Found)

Testing on 100 real code generation queries:

| Model | Accuracy | Avg Latency | Cost / 1M tokens |
| --- | --- | --- | --- |
| deepseek-v4-pro | 94% | 3.8s | $2.00 |
| deepseek-coder | 92% | 2.1s | $0.14 |
| qwen3-235b-a22b | 90% | 2.5s | $1.00 |
| glm-4-plus | 88% | 3.2s | $1.50 |

Winner for my use case: deepseek-coder, with 92% accuracy at roughly 1/14th the cost of deepseek-v4-pro.
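The cost ratio can be checked from the table directly. Here is a small sketch using only the numbers reported above. The `reported` dictionary layout is just for illustration.

```python
# Accuracy and cost per 1M tokens, copied from the results table.
reported = {
    "deepseek-v4-pro": {"accuracy": 0.94, "cost": 2.00},
    "deepseek-coder": {"accuracy": 0.92, "cost": 0.14},
    "qwen3-235b-a22b": {"accuracy": 0.90, "cost": 1.00},
    "glm-4-plus": {"accuracy": 0.88, "cost": 1.50},
}

baseline_name = "deepseek-v4-pro"
baseline = reported[baseline_name]

# How many times cheaper each model is than the baseline,
# and how many accuracy points it gives up in exchange.
ratios = {}
for name, row in reported.items():
    if name == baseline_name:
        continue
    cost_ratio = baseline["cost"] / row["cost"]
    accuracy_gap = (baseline["accuracy"] - row["accuracy"]) * 100
    ratios[name] = cost_ratio
    print(f"{name}: {cost_ratio:.1f}x cheaper, {accuracy_gap:.0f} points behind on accuracy")
```

For deepseek-coder the ratio is $2.00 / $0.14, or about 14.3x, which is where the 1/14th figure comes from.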

The AIBridge Advantage

With direct APIs, A/B testing means:
❌ 4 different API clients
❌ 4 billing dashboards
❌ Rewriting code for each model

With AIBridge:
✅ One client, one loop
✅ One billing dashboard
✅ Change model= and rerun

Try it: https://aibridge-api.com

5M free tokens. Test all 14 models. 🧪

