No multiple accounts. No juggling billing dashboards. No vendor lock-in. Just one endpoint and 5 lines of model names.
Every developer who builds with LLMs eventually hits the same wall: which model is actually best for my use case?
You open 5 tabs. You log into 5 different platforms. You compare outputs manually. Then next week a new model drops and you do it all over again.
There's a better way. Here's how to A/B test GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek V3, and Qwen 2.5 — all through a single API key, with zero account-switching.
Why Compare Models in the First Place?
Before we write code, let's be clear about why this matters.
Different models have different strengths:
| Model | Best For | Weakness |
|---|---|---|
| GPT-4o | General reasoning, code generation | Cost at scale |
| Claude 3.5 Sonnet | Long-form writing, nuanced analysis | Speed on simple tasks |
| Gemini 2.0 | Multimodal, factual retrieval | Instruction-following quirks |
| DeepSeek V3 | Cost-efficient coding, math | Creative writing |
| Qwen 2.5 | Multilingual (CN/EN), structured output | English nuance vs Claude/GPT |
A model that's brilliant at writing marketing copy might be mediocre at SQL generation. The only way to know is to test systematically.
The Setup
We'll use the OpenAI Python SDK — but instead of pointing at api.openai.com, we'll point at a unified API gateway. One key, five models, same interface.
pip install openai
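The key and endpoint do not have to live in source code. Here is a minimal sketch that reads them from the environment instead. The variable names `LLM_GATEWAY_API_KEY` and `LLM_GATEWAY_BASE_URL` and the helper `gateway_settings` are hypothetical; use whatever names your provider or deployment expects.

```python
import os


def gateway_settings(env=os.environ):
    """Read gateway credentials from environment variables.

    LLM_GATEWAY_API_KEY and LLM_GATEWAY_BASE_URL are assumed names chosen
    for this sketch, not names defined by any provider.
    """
    api_key = env.get("LLM_GATEWAY_API_KEY")
    if not api_key:
        raise RuntimeError("Set LLM_GATEWAY_API_KEY before running the comparison")
    # Fall back to the placeholder endpoint used throughout this tutorial.
    base_url = env.get("LLM_GATEWAY_BASE_URL", "https://api.yourprovider.com/v1")
    return {"api_key": api_key, "base_url": base_url}
```

The returned dictionary can be unpacked straight into the SDK constructor, as in `OpenAI(**gateway_settings())`, which keeps the key out of version control.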
Now the core script:
from openai import OpenAI
import time
import json

# One client. One API key. All models.
client = OpenAI(
    api_key="sk-your-api-key",  # Your API key
    base_url="https://api.yourprovider.com/v1"  # Unified endpoint
)

# The five models we're comparing
models = [
    "gpt-4o",
    "claude-3-5-sonnet",
    "gemini-2.0-pro",
    "deepseek-v3",
    "qwen-2.5-max"
]

# A test prompt that exercises reasoning, creativity, and structure
prompt = """
You are evaluating a startup pitch. Score it from 1-10 on these dimensions:
1. Problem clarity
2. Market size
3. Solution uniqueness
Pitch: "An AI-powered kitchen assistant that scans your fridge,
suggests recipes based on available ingredients, and auto-orders
missing items via grocery delivery APIs."
Return your response as a valid JSON object with keys:
problem_clarity, market_size, solution_uniqueness, total_score, and reasoning.
"""

results = {}

for model in models:
    print(f"Testing {model}...")
    start = time.time()
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,  # Low temp for consistency
            max_tokens=500
        )
        elapsed = time.time() - start
        output = response.choices[0].message.content
        tokens = response.usage.total_tokens
        results[model] = {
            "latency_seconds": round(elapsed, 2),
            "total_tokens": tokens,
            "output": output,
            "finish_reason": response.choices[0].finish_reason
        }
        print(f"  Done in {elapsed:.2f}s, {tokens} tokens")
    except Exception as e:
        results[model] = {"error": str(e)}
        print(f"  Error: {e}")

# Save results
with open("llm_comparison.json", "w") as f:
    json.dump(results, f, indent=2)

print("\nComparison saved to llm_comparison.json")
Save the script as compare_llms.py and run it:
python compare_llms.py
What You'll See
Here's a sample output from a real run:
Testing gpt-4o...
  Done in 1.83s, 312 tokens
Testing claude-3-5-sonnet...
  Done in 2.41s, 287 tokens
Testing gemini-2.0-pro...
  Done in 1.52s, 298 tokens
Testing deepseek-v3...
  Done in 0.91s, 334 tokens
Testing qwen-2.5-max...
  Done in 1.27s, 305 tokens

Comparison saved to llm_comparison.json
The JSON results let you compare not just speed, but also how each model thinks:
{
  "gpt-4o": {
    "output": "{\"problem_clarity\": 8, \"market_size\": 7, ...}",
    "latency_seconds": 1.83,
    "total_tokens": 312
  },
  "claude-3-5-sonnet": {
    "output": "{\"problem_clarity\": 7, ... \"reasoning\": \"The pitch has strong clarity...\"}",
    "latency_seconds": 2.41,
    "total_tokens": 287
  }
}
Now you can analyze:
- Which model followed the JSON instruction most strictly? (GPT-4o and Qwen tend to nail structured output.)
- Which gave the most nuanced reasoning? (Claude usually wins here.)
- Which was fastest? (DeepSeek often has the lowest latency, especially for users hosted in Asia.)
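The first question can be answered mechanically rather than by eye. The sketch below checks each saved output for valid JSON and for the five keys the prompt asked for. The helper names `strip_fence`, `parse_model_json`, and `check_compliance` are made up for this example, and the fence-stripping step assumes that some models wrap their JSON in a Markdown code fence, which you may or may not see in your own runs.

```python
import json

REQUIRED_KEYS = {
    "problem_clarity", "market_size", "solution_uniqueness",
    "total_score", "reasoning",
}
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def strip_fence(text):
    """Remove a surrounding Markdown code fence, if there is one."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        first_newline = cleaned.find("\n")
        if first_newline == -1:
            return cleaned
        # Drop the opening fence line (it may carry a language tag) and the closing fence.
        cleaned = cleaned[first_newline + 1 : -len(FENCE)].strip()
    return cleaned


def parse_model_json(text):
    """Return the parsed JSON object, or None if the output is not a JSON object."""
    try:
        parsed = json.loads(strip_fence(text))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def check_compliance(results):
    """Report, per model, whether the output parsed and which keys are missing."""
    report = {}
    for model, entry in results.items():
        parsed = None if "error" in entry else parse_model_json(entry["output"])
        if parsed is None:
            report[model] = {"valid_json": False, "missing_keys": sorted(REQUIRED_KEYS)}
        else:
            report[model] = {
                "valid_json": True,
                "missing_keys": sorted(REQUIRED_KEYS - parsed.keys()),
            }
    return report
```

To use it on a real run, load `llm_comparison.json` with `json.load` and pass the resulting dictionary to `check_compliance`.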
Going Further: Batch Testing Multiple Prompts
One prompt is a start. Real evaluation needs variety. Here's a batch version:
test_prompts = [
    # Reasoning
    "Explain the CAP theorem to a 12-year-old.",
    # Code generation
    "Write a Python function to find the longest palindrome in a string.",
    # Creative writing
    "Write the opening paragraph of a sci-fi novel set in Hong Kong, 2150.",
    # Data extraction
    """Extract all company names and funding amounts from this text:
    'Acme Corp raised $50M in Series B. BetaTech secured $12M seed funding.'""",
    # Translation
    """Translate to English: '人工智能正在重塑每一個行業,
    但開發者不應該被鎖定在單一供應商。'"""
]

for i, prompt in enumerate(test_prompts):
    print(f"\n{'='*60}")
    print(f"PROMPT {i+1}: {prompt[:80]}...")
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            max_tokens=300
        )
        print(f"  [{model}] {response.choices[0].message.content[:120]}...")
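Printing truncated outputs is fine for a first look, but patterns are easier to spot once every call is recorded. Here is one possible sketch that stores a row per call, keeps going when a model fails, and reports the median latency per model. `call_model` is a hypothetical callable taking `(model, prompt)` and returning the text; in this tutorial's setup it would wrap `client.chat.completions.create`.

```python
import statistics
import time


def run_batch(prompts, models, call_model):
    """Run every prompt against every model and record one row per call.

    call_model is an assumed helper (model, prompt) -> str supplied by the caller.
    """
    rows = []
    for prompt_index, prompt in enumerate(prompts):
        for model in models:
            start = time.perf_counter()
            try:
                output, error = call_model(model, prompt), None
            except Exception as exc:  # one failing model should not end the batch
                output, error = None, str(exc)
            rows.append({
                "prompt_index": prompt_index,
                "model": model,
                "latency_seconds": time.perf_counter() - start,
                "output": output,
                "error": error,
            })
    return rows


def median_latency_by_model(rows):
    """Median latency per model, ignoring failed calls."""
    by_model = {}
    for row in rows:
        if row["error"] is None:
            by_model.setdefault(row["model"], []).append(row["latency_seconds"])
    return {model: statistics.median(values) for model, values in by_model.items()}
```

The median is used rather than the mean so that a single slow call does not dominate the comparison.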
How to Read the Results
After running a batch comparison, you're looking for patterns:
| Signal | What It Tells You |
|---|---|
| Consistent JSON output | Good for production APIs that parse LLM responses |
| Lower latency on DeepSeek | Consider for real-time apps (chat, autocomplete) |
| Claude's longer reasoning | Use when quality > speed (content generation, analysis) |
| Qwen excels at Chinese | Multilingual products should test this specifically |
| Gemini's factual accuracy | RAG pipelines, knowledge-base queries |
The key insight: no single model wins everything. The right model depends on your specific task, budget, and latency requirements.
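That trade-off can be written down explicitly. As a rough sketch, suppose you have graded each model's output yourself and recorded its latency and cost per call. The function below picks the highest-quality model that fits a latency and cost limit. The field names, the `pick_model` helper, and every number in `measurements` are invented for illustration; substitute your own measurements.

```python
def pick_model(measurements, max_latency_seconds, max_cost_per_call):
    """Return the highest-quality model that fits both limits, or None."""
    eligible = [
        m for m in measurements
        if m["latency_seconds"] <= max_latency_seconds
        and m["cost_per_call"] <= max_cost_per_call
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda m: m["quality"])["model"]


# Placeholder measurements: every value here is made up for this example.
measurements = [
    {"model": "model-a", "quality": 9.0, "latency_seconds": 2.4, "cost_per_call": 0.012},
    {"model": "model-b", "quality": 8.0, "latency_seconds": 0.9, "cost_per_call": 0.002},
    {"model": "model-c", "quality": 7.0, "latency_seconds": 1.3, "cost_per_call": 0.004},
]

# A real-time feature with a tight latency limit rules out the slowest model.
realtime_choice = pick_model(measurements, max_latency_seconds=1.5, max_cost_per_call=0.02)
```

Tightening either limit changes the answer, which is the point: the same measurements can justify different models for different features.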
What This Means for Your Architecture
Once you've identified which model performs best for each task type, you can build a model router:
def route_to_best_model(task_type: str):
    router = {
        "code_generation": "deepseek-v3",  # Fast, cheap, accurate for code
        "content_writing": "claude-3-5-sonnet",  # Nuanced long-form
        "multilingual": "qwen-2.5-max",  # Strong CN/EN performance
        "reasoning": "gpt-4o",  # General-purpose reasoning
        "fast_chat": "gemini-2.0-pro",  # Low latency conversational
    }
    return router.get(task_type, "gpt-4o")  # Default fallback

# Now your app automatically picks the best model per task
user_query = "Write a function that removes duplicates from a list."  # Example input
model = route_to_best_model("code_generation")
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": user_query}]
)
This is the real power of a unified API: not just access to many models, but intelligent routing between them. You get the best of all worlds (Claude's writing, DeepSeek's speed, GPT's reasoning) without changing your integration.
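Because every model sits behind the same interface, a router can also fall back to another model when the preferred one fails. The following is a sketch under assumptions, not a production design: `call_model` is a hypothetical callable `(model, query) -> str` that would wrap the chat completion call, and `routes` maps each task type to an ordered list of model names of your choosing.

```python
def complete_with_fallback(task_type, user_query, call_model, routes, default_chain):
    """Try each model in the task's chain in order; return (model, output).

    call_model is an assumed helper supplied by the caller. Raises
    RuntimeError if every model in the chain fails.
    """
    chain = routes.get(task_type, default_chain)
    errors = {}
    for model in chain:
        try:
            return model, call_model(model, user_query)
        except Exception as exc:  # remember the failure and try the next model
            errors[model] = str(exc)
    raise RuntimeError(f"All models failed for task '{task_type}': {errors}")


# Example chains built from the model names used in this tutorial.
routes = {
    "code_generation": ["deepseek-v3", "gpt-4o"],
    "content_writing": ["claude-3-5-sonnet", "gpt-4o"],
}
default_chain = ["gpt-4o"]
```

Returning the model name alongside the output makes it easy to log which model actually served each request.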
Key Takeaways
- Test, don't guess. The "best" model depends on your exact use case. Run comparisons.
- One API key is all you need. The code in this tutorial uses a single endpoint with no model-switching overhead.
- Build a router. Once you know which model excels at what, automate the selection.
- Keep comparing. New models and versions ship often. Re-run your benchmarks regularly.
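Re-running is more useful when each run is compared against the last one. Here is a minimal sketch, assuming both runs were saved in the same shape as `llm_comparison.json` above. The function name `find_latency_regressions` and the 25% tolerance are arbitrary choices for illustration.

```python
def find_latency_regressions(previous, current, tolerance=0.25):
    """Flag models whose latency grew by more than `tolerance` (a fraction).

    Models that are new, or that errored in either run, are skipped.
    """
    regressions = {}
    for model, new_entry in current.items():
        old_entry = previous.get(model)
        if not old_entry or "error" in old_entry or "error" in new_entry:
            continue
        old = old_entry["latency_seconds"]
        new = new_entry["latency_seconds"]
        if old > 0 and (new - old) / old > tolerance:
            regressions[model] = {"previous": old, "current": new}
    return regressions
```

A single run is noisy, so treat a flagged model as a prompt to re-test rather than as proof that it got slower.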
Try It Yourself
Grab a unified API key and run the comparison in under 5 minutes. Most providers offer free credits to start.
Questions? Drop a comment below. I'm especially curious: which model won for your specific use case?
This tutorial uses a unified AI API gateway — one endpoint for 40+ models including GPT-4o, Claude, Gemini, DeepSeek, and Qwen. Built by itapi.ai.