How to Compare 5 LLMs with One API Key (Python Tutorial)

No multiple accounts. No juggling billing dashboards. No vendor lock-in. Just one endpoint and 5 lines of model names.

Every developer who builds with LLMs eventually hits the same wall: which model is actually best for my use case?

You open 5 tabs. You log into 5 different platforms. You compare outputs manually. Then next week a new model drops and you do it all over again.

There's a better way. Here's how to A/B test GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek V3, and Qwen 2.5 — all through a single API key, with zero account-switching.

Why Compare Models in the First Place?

Before we write code, let's be clear about why this matters.

Different models have different strengths:

Model	Best For	Weakness
GPT-4o	General reasoning, code generation	Cost at scale
Claude 3.5 Sonnet	Long-form writing, nuanced analysis	Speed on simple tasks
Gemini 2.0	Multimodal, factual retrieval	Instruction following quirks
DeepSeek V3	Cost-efficient coding, math	Creative writing
Qwen 2.5	Multilingual (CN/EN), structured output	English nuance vs Claude/GPT

A model that's brilliant at writing marketing copy might be mediocre at SQL generation. The only way to know is to test systematically.

The Setup

We'll use the OpenAI Python SDK — but instead of pointing at api.openai.com, we'll point at a unified API gateway. One key, five models, same interface.

pip install openai

Now the core script. Save it as compare_llms.py:

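import json
import os
import time

from openai import OpenAI

# One client. One API key. All models.
client = OpenAI(
    # LLM_API_KEY is an example variable name; avoid hard-coding real keys
    api_key=os.environ.get("LLM_API_KEY", "sk-your-api-key"),
    base_url="https://api.yourprovider.com/v1"  # Unified endpoint (placeholder URL)
)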

# The five models we're comparing (model IDs vary by gateway; check your provider's model list)
models = [
    "gpt-4o",
    "claude-3-5-sonnet",
    "gemini-2.0-pro",
    "deepseek-v3",
    "qwen-2.5-max"
]

# A test prompt that exercises reasoning, creativity, and structure
prompt = """
You are evaluating a startup pitch. Score it from 1-10 on these dimensions:
1. Problem clarity
2. Market size
3. Solution uniqueness

Pitch: "An AI-powered kitchen assistant that scans your fridge,
suggests recipes based on available ingredients, and auto-orders
missing items via grocery delivery APIs."

Return your response as a valid JSON object with keys:
problem_clarity, market_size, solution_uniqueness, total_score, and reasoning.
"""

results = {}

for model in models:
    print(f"Testing {model}...")
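    start = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,   # Low temp for consistency
            max_tokens=500
        )

        elapsed = time.perf_counter() - start
        output = response.choices[0].message.content
        # Some gateways omit usage data, so guard against None
        tokens = response.usage.total_tokens if response.usage else None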

        results[model] = {
            "latency_seconds": round(elapsed, 2),
            "total_tokens": tokens,
            "output": output,
            "finish_reason": response.choices[0].finish_reason
        }

        print(f"  Done in {elapsed:.2f}s, {tokens} tokens")

    except Exception as e:
        results[model] = {"error": str(e)}
        print(f"  Error: {e}")

# Save results
with open("llm_comparison.json", "w") as f:
    json.dump(results, f, indent=2)

print("\nComparison saved to llm_comparison.json")

Run it:

python compare_llms.py

What You'll See

Here's a sample output from a real run:

Testing gpt-4o...
  Done in 1.83s, 312 tokens
Testing claude-3-5-sonnet...
  Done in 2.41s, 287 tokens
Testing gemini-2.0-pro...
  Done in 1.52s, 298 tokens
Testing deepseek-v3...
  Done in 0.91s, 334 tokens
Testing qwen-2.5-max...
  Done in 1.27s, 305 tokens

Comparison saved to llm_comparison.json

The saved JSON lets you compare not just speed but also the content of each model's answer (abridged here):

{
  "gpt-4o": {
    "output": "{\"problem_clarity\": 8, \"market_size\": 7, ...}",
    "latency_seconds": 1.83,
    "total_tokens": 312
  },
  "claude-3-5-sonnet": {
    "output": "{\"problem_clarity\": 7, ... \"reasoning\": \"The pitch has strong clarity...\"}",
    "latency_seconds": 2.41,
    "total_tokens": 287
  }
}

Now you can analyze:

Which model followed the JSON instruction most strictly? (GPT-4o and Qwen tend to nail structured output.)
Which gave the most nuanced reasoning? (Claude usually wins here.)
Which was fastest? (DeepSeek often has the lowest latency, especially for users hosted in Asia.)
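
The first question can be checked automatically rather than by eye. Here is a minimal sketch, assuming the prompt from the script above and its five requested keys; the helper name check_json_output is made up for this example. It also flags replies wrapped in a Markdown code fence, which is a common reason a reply fails to parse directly:

```python
import json

# Keys requested by the test prompt in the comparison script
REQUIRED_KEYS = {"problem_clarity", "market_size", "solution_uniqueness",
                 "total_score", "reasoning"}
FENCE = "`" * 3  # Markdown code-fence marker


def check_json_output(output: str) -> dict:
    """Report whether a model's reply is a JSON object with the requested keys.

    Hypothetical helper for illustration; not part of the OpenAI SDK.
    """
    text = output.strip()
    fenced = text.startswith(FENCE)
    if fenced:
        # Drop the opening fence line and everything from the closing fence on
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0].strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return {"valid_json": False, "fenced": fenced,
                "missing_keys": sorted(REQUIRED_KEYS)}
    return {"valid_json": True, "fenced": fenced,
            "missing_keys": sorted(REQUIRED_KEYS - set(data))}
```

To use it, load llm_comparison.json and call check_json_output(entry["output"]) for every entry that has no "error" key. A reply that is valid but fenced still needs stripping before a production parser can consume it, so it is worth tracking separately.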

Going Further: Batch Testing Multiple Prompts

One prompt is a start. Real evaluation needs variety. Here's a batch version:

test_prompts = [
    # Reasoning
    "Explain the CAP theorem to a 12-year-old.",
    # Code generation
    "Write a Python function to find the longest palindrome in a string.",
    # Creative writing
    "Write the opening paragraph of a sci-fi novel set in Hong Kong, 2150.",
    # Data extraction
    """Extract all company names and funding amounts from this text:
    'Acme Corp raised $50M in Series B. BetaTech secured $12M seed funding.'""",
    # Translation
    """Translate to English: '人工智能正在重塑每一個行業，
    但開發者不應該被鎖定在單一供應商。'"""
]

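for i, test_prompt in enumerate(test_prompts):
    print(f"\n{'='*60}")
    print(f"PROMPT {i+1}: {test_prompt[:80]}...")

    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": test_prompt}],
                temperature=0.2,
                max_tokens=300
            )
            # content can be None for some responses, so fall back to ""
            preview = (response.choices[0].message.content or "")[:120]
            print(f"  [{model}] {preview}...")
        except Exception as e:
            # One failing model should not abort the whole batch
            print(f"  [{model}] Error: {e}")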

How to Read the Results

After running a batch comparison, you're looking for patterns:

Signal	What It Tells You
Consistent JSON output	Good for production APIs that parse LLM responses
Lower latency on DeepSeek	Consider for real-time apps (chat, autocomplete)
Claude's longer reasoning	Use when quality matters more than speed (content generation, analysis)
Qwen excels at Chinese	Multilingual products should test this specifically
Gemini's factual accuracy	RAG pipelines, knowledge-base queries

The key insight: no single model wins everything. The right model depends on your specific task, budget, and latency requirements.
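
Latency is the easiest of these signals to summarize in code. The sketch below assumes the result layout written to llm_comparison.json by the first script (a "latency_seconds" field on success, an "error" field on failure); the function name rank_by_latency is invented for this example:

```python
def rank_by_latency(results: dict) -> list:
    """Return (model, latency_seconds) pairs, fastest first.

    Entries that recorded an error are skipped, since they have no timing.
    """
    timed = [
        (model, entry["latency_seconds"])
        for model, entry in results.items()
        if "error" not in entry and "latency_seconds" in entry
    ]
    return sorted(timed, key=lambda pair: pair[1])
```

A single run is noisy, so treat one ranking as a hint. Repeating the comparison several times and ranking the averages gives a steadier picture.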

What This Means for Your Architecture

Once you've identified which model performs best for each task type, you can build a model router:

# Task type -> model. These choices are examples; substitute your own results.
MODEL_ROUTER = {
    "code_generation": "deepseek-v3",       # Fast, cheap, accurate for code
    "content_writing": "claude-3-5-sonnet", # Nuanced long-form
    "multilingual": "qwen-2.5-max",         # Strong CN/EN performance
    "reasoning": "gpt-4o",                  # General-purpose reasoning
    "fast_chat": "gemini-2.0-pro",          # Low latency conversational
}

def route_to_best_model(task_type: str) -> str:
    return MODEL_ROUTER.get(task_type, "gpt-4o")  # Default fallback

# Now your app automatically picks the best model per task
user_query = "Write a function that reverses a linked list."  # Example input
model = route_to_best_model("code_generation")
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": user_query}]
)

This is the real power of a unified API: not just accessing many models, but routing intelligently between them. You get each model's strengths (Claude's writing, DeepSeek's speed, GPT's reasoning) without changing your integration.
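
Because every model sits behind the same interface, a router can also fall back to a second model when the first call fails. This is a rough sketch of that idea, not something the tutorial's results prescribe: the fallback orderings, the FALLBACKS table, and the call_with_fallback helper are all assumptions made up for illustration.

```python
from typing import Any, Callable, Tuple

# Hypothetical preference lists: the first entry mirrors the router above,
# the later entries are illustrative fallbacks, not benchmark results.
FALLBACKS = {
    "code_generation": ["deepseek-v3", "gpt-4o"],
    "content_writing": ["claude-3-5-sonnet", "gpt-4o"],
}
DEFAULT_CHAIN = ["gpt-4o"]


def call_with_fallback(task_type: str,
                       call_model: Callable[[str], Any]) -> Tuple[str, Any]:
    """Try each candidate model in order; return (model, result) on first success."""
    errors = {}
    for model in FALLBACKS.get(task_type, DEFAULT_CHAIN):
        try:
            return model, call_model(model)
        except Exception as exc:  # Broad on purpose: any failure moves to the next model
            errors[model] = str(exc)
    raise RuntimeError(f"All models failed for {task_type!r}: {errors}")
```

With the client from earlier, call_model could be a lambda such as lambda m: client.chat.completions.create(model=m, messages=[{"role": "user", "content": user_query}]). Passing the call in as a function keeps the fallback logic independent of the SDK and easy to test with a stub.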

Key Takeaways

Test, don't guess. The "best" model depends on your exact use case. Run comparisons.
One API key is all you need. The code in this tutorial uses a single client and endpoint; switching models means changing one string.
Build a router. Once you know which model excels at what, automate the selection.
Keep comparing. Models update frequently. Re-run your benchmarks regularly.