How to A/B Test AI Models on Your Real User Queries

#ai #api #llm #programming

Not sure which AI model is best for your use case?

Don't trust benchmarks. Test on your actual user queries.

Here's how to A/B test 14+ models in 30 minutes.

Why A/B Test?

Benchmarks lie. A model that's "90% accurate" on MMLU might suck at:

Your specific domain (legal? medical? code?)
Your language (English? Chinese? mixed?)
Your task type (summarization? reasoning? extraction?)

Solution: Test on real data.

The Setup (5 minutes)

Step 1: Export 50-100 real user queries from your app.

Step 2: Get an AIBridge API key (3M free tokens, no credit card).

Step 3: Write a simple test script:

import time

from openai import OpenAI

client = OpenAI(
    api_key="mb_your_key",
    base_url="https://aibridge-api.com/v1"
)

# Your test queries
test_queries = [
    "Summarize this article...",
    "Write a Python function to...",
    "Translate this to Chinese...",
    # ... 50-100 real queries
]

# Models to test
models_to_test = [
    "deepseek-v4-pro",
    "qwen3-235b-a22b",
    "glm-4-plus",
    "deepseek-v4-flash"
]

# Run A/B test
results = {}
for model in models_to_test:
    scores = []
    for query in test_queries[:10]:  # Test a subset first
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}]
        )
        latency = time.perf_counter() - start  # Wall-clock seconds for this request
        output = response.choices[0].message.content

        # Score it (your criteria here)
        score = score_output(output, query)  # Your scoring function; define it before this loop
        # total_tokens is a token count, not dollars; multiply by the model's price to get cost
        scores.append({"score": score, "latency": latency, "tokens": response.usage.total_tokens})

    results[model] = scores

# Compare
for model, scores in results.items():
    avg_score = sum(s["score"] for s in scores) / len(scores)
    avg_latency = sum(s["latency"] for s in scores) / len(scores)
    avg_tokens = sum(s["tokens"] for s in scores) / len(scores)
    print(f"{model}: Score={avg_score:.2f}, Latency={avg_latency:.2f}s, Tokens={avg_tokens:.0f}")
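
The script calls score_output but leaves it up to you. As a starting point, here is a minimal sketch of what it could look like. Everything in it is an assumption for illustration: the three checks, the 4000-character limit, and the refusal phrases are placeholders, and the crude word-overlap check only makes sense for English-to-English tasks (it would unfairly penalize a translation query). Swap in checks that match your own quality bar.

```python
import re


def score_output(output: str, query: str) -> float:
    """Return a score in [0, 1] as the fraction of rubric checks that pass.

    Hypothetical rubric: replace these checks with your own criteria.
    """
    if not output or not output.strip():
        return 0.0  # An empty answer fails outright

    text = output.lower()
    # Content words (longer than 3 characters) from the query
    query_terms = {w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) > 3}
    output_terms = set(re.findall(r"[a-z0-9]+", text))

    checks = [
        # Check 1: the answer is not excessively long
        len(output) <= 4000,
        # Check 2: the answer is not a refusal
        not re.search(r"\b(i cannot|i can't|as an ai)\b", text),
        # Check 3: the answer shares at least one content word with the query
        not query_terms or bool(query_terms & output_terms),
    ]
    return sum(checks) / len(checks)
```

Define this above the test loop so the script can call it. For anything beyond a smoke test, you will probably want task-specific checks (unit tests for generated code, reference answers for extraction) or a human review pass.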

What to Measure

Metric	Why It Matters
Accuracy	Does the output match your quality bar?
Latency	How fast does it respond?
Cost	What's the $/1M tokens?
Consistency	Does it fail on edge cases?

Pro tip: Create a scoring rubric for your specific use case. What's "good" for you might be different from benchmarks.
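
To turn the four metrics above into numbers, you can summarize each model's per-query records. The sketch below is one way to do it, under stated assumptions: each record is a dict with "score" (0 to 1), "latency" (seconds), and "tokens" (total tokens used); the function name summarize, the fail_below threshold, and the single blended price per 1M tokens are all hypothetical. Real pricing often differs for input and output tokens, so treat the cost figure as an estimate.

```python
import math
import statistics


def summarize(records, price_per_million, fail_below=0.5):
    """Summarize one model's results: accuracy, consistency, latency, cost."""
    scores = [r["score"] for r in records]
    latencies = sorted(r["latency"] for r in records)
    total_tokens = sum(r["tokens"] for r in records)

    # Nearest-rank 95th percentile: tail latency says more than the average
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)

    return {
        "mean_score": statistics.mean(scores),
        # Consistency: share of queries that scored below the threshold
        "failure_rate": sum(s < fail_below for s in scores) / len(scores),
        "p95_latency": latencies[p95_index],
        # Estimated spend for this run at a single blended token price
        "cost_usd": total_tokens / 1_000_000 * price_per_million,
    }


# Usage with two made-up records
example = [
    {"score": 1.0, "latency": 1.0, "tokens": 500},
    {"score": 0.0, "latency": 3.0, "tokens": 1500},
]
print(summarize(example, price_per_million=2.00))
```

A model with a high mean score but a high failure rate is the one that bites you on edge cases, so look at both.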

Results (What I Found)

Testing on 100 real code generation queries:

Model	Accuracy	Avg Latency	Cost / 1M tokens
deepseek-v4-pro	94%	3.8s	$2.00
deepseek-coder	92%	2.1s	$0.14
qwen3-235b-a22b	90%	2.5s	$1.00
glm-4-plus	88%	3.2s	$1.50

Winner for my use case: deepseek-coder — 92% accuracy at 1/14th the cost of V4 Pro.
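
The "1/14th" figure is just the price ratio: $2.00 / $0.14 ≈ 14.3. A quick way to run the same comparison on your own numbers is sketched below. The accuracy and price values are copied from the table above; the function name compare_to_baseline is made up for this example.

```python
# (accuracy in %, price in $ per 1M tokens), copied from the results table
table = {
    "deepseek-v4-pro": (94, 2.00),
    "deepseek-coder": (92, 0.14),
    "qwen3-235b-a22b": (90, 1.00),
    "glm-4-plus": (88, 1.50),
}


def compare_to_baseline(table, baseline):
    """For each model, report accuracy change and price ratio vs. the baseline."""
    base_acc, base_price = table[baseline]
    return {
        model: {
            "accuracy_delta": acc - base_acc,       # percentage points
            "times_cheaper": base_price / price,    # baseline price / model price
        }
        for model, (acc, price) in table.items()
    }


comparison = compare_to_baseline(table, "deepseek-v4-pro")
for model, row in comparison.items():
    print(f"{model}: {row['accuracy_delta']:+d} pts, {row['times_cheaper']:.1f}x cheaper")
```

Whether giving up 2 points of accuracy is worth a roughly 14x price cut depends on what a wrong answer costs you.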

The AIBridge Advantage

With direct APIs, A/B testing means:
❌ 4 different API clients
❌ 4 billing dashboards
❌ Rewriting code for each model

With AIBridge:
✅ One client, one loop
✅ One billing dashboard
✅ Change model= and rerun

Try it: https://aibridge-api.com

Free tokens, no credit card. Test all 14 models. 🧪
