Xidao

Posted on May 15

I Tested 6 LLM Models on the Same 50 Production Prompts — Here’s What Actually Varies

#ai #llm #webdev #programming

When you're building an app that calls an LLM API, the model benchmarks on the leaderboard don't tell you what you actually need to know. You need to know: will this model follow my JSON schema reliably? How fast does the first token arrive under load? What happens when I throw an edge case at it?

I spent two weeks testing 6 models on 50 real production prompts — the kind your app actually sends, not the kind that win MMLU scores. Here's what I found, complete with code, cost breakdowns, and the failure modes nobody warns you about.

Why I Built My Own Benchmark

Public benchmarks are useful for researchers. They're almost useless for engineers choosing a model for production.

Here's why: benchmarks test models in isolation, with carefully curated prompts, evaluated by other LLMs or human graders. Your production environment is different. Your prompts are wrapped in system messages. Your inputs are messy user text. Your outputs need to parse into specific schemas. Your latency budget is 2 seconds, not 20.

After the third time a "top-ranked" model failed to return valid JSON for our extraction pipeline, I decided to stop trusting leaderboards and start testing with our actual prompts. Here's exactly how I did it, and what I found.

The Setup

I used the same prompt templates, the same system messages, and the same output schemas across all models. The only variable was the model endpoint. Everything went through a single API gateway so I could track latency, token usage, and cost uniformly.

The models tested:

GPT-4o (OpenAI) — the default choice for many teams
Claude 3.5 Sonnet (Anthropic) — strong on instruction following
Gemini 1.5 Pro (Google) — long context, competitive pricing
DeepSeek V3 (DeepSeek) — open-weight, ultra-low cost
Qwen 2.5 72B (Alibaba) — strong multilingual, open-weight
Mistral Large (Mistral) — European alternative, good code generation

The 50 prompts fell into 5 categories:

Structured extraction (10 prompts) — parse user messages into JSON objects
Code generation (10 prompts) — write Python/JS functions from descriptions
Summarization (10 prompts) — condense long documents with specific constraints
Multi-turn reasoning (10 prompts) — maintain context across 5+ turns
Edge cases (10 prompts) — ambiguous instructions, conflicting constraints, malformed input

What I Measured (And Why)

Not accuracy scores. In production, "accuracy" is a moving target that depends on your prompt engineering, your fine-tuning, and your definition of "correct." Instead, I measured the things that actually cause you to wake up at 3 AM:

Time to first token (TTFT) — how long before streaming starts. This is what users feel.
Tokens per second — streaming speed. Determines how fast a long response renders.
Format adherence — did the output parse as valid JSON when requested? This is binary: your parser works or it doesn't.
Instruction following — did it do what I asked, the way I asked? Not "did it give a good answer" but "did it follow the constraints."
Failure mode — when it failed, how did it fail? Gracefully (with an explanation) or catastrophically (with garbage)?
Cost per 1K output tokens — at current API pricing, what does this actually cost?

Results: Structured Extraction

This is where you feel the difference between models most painfully. You ask for JSON, and some models give you JSON. Others give you JSON wrapped in a markdown code block. Others give you JSON with a trailing comma. Others give you a helpful paragraph explaining what JSON is.

Model	Valid JSON Rate	Avg TTFT	Tokens/sec	Cost/1K output
GPT-4o	94%	320ms	85	$0.005
Claude 3.5 Sonnet	97%	280ms	78	$0.0075
Gemini 1.5 Pro	88%	410ms	72	$0.005
DeepSeek V3	91%	190ms	142	$0.0003
Qwen 2.5 72B	85%	250ms	95	$0.0004
Mistral Large	90%	300ms	88	$0.004

Claude was the most reliable for JSON output. DeepSeek was the fastest and cheapest but occasionally added extra fields that weren't in the schema. Gemini had a habit of wrapping JSON in markdown code blocks even when I explicitly said "return only JSON, no markdown."

Here's a concrete example. The prompt was:

Extract the following user message into JSON with these fields:
name (string), age (integer), location (string), occupation (string), 
years_experience (integer), skills (array of strings).

Message: "I'm Sarah Chen, 28, based in Berlin. Been doing backend 
development for about 5 years, mainly Python and Go. Recently started 
picking up Rust too."

Expected output:

{
  "name": "Sarah Chen",
  "age": 28,
  "location": "Berlin",
  "occupation": "backend developer",
  "years_experience": 5,
  "skills": ["Python", "Go", "Rust"]
}

What actually happened across models:

GPT-4o: Correct JSON, but sometimes wrapped in code blocks (7/10 times bare JSON, 3/10 wrapped)
Claude 3.5 Sonnet: Bare JSON every time. Consistently the cleanest output.
Gemini 1.5 Pro: 6/10 bare JSON, 4/10 wrapped in code blocks. Also added an extra field once.
DeepSeek V3: Bare JSON, but 2/10 times added a source or extra field not in the schema.
Qwen 2.5 72B: Bare JSON 5/10 times, wrapped 3/10, and 2/10 times returned an array instead of an object.
Mistral Large: Bare JSON most of the time, but once returned age as string instead of integer.

The practical takeaway: If your app parses structured output, test your actual prompt with the actual model before shipping. The difference between 85% and 97% valid JSON means your parser breaks on 15% of requests vs 3% — that's the difference between "works" and "support tickets."

And the cost difference is staggering. DeepSeek at $0.0003/1K tokens vs Claude at $0.0075/1K tokens is a 25x gap. For a pipeline processing 10M tokens/day, that's $3/day vs $75/day — $1,095/year vs $27,375/year. The question is whether Claude's 6% higher reliability is worth that premium for your use case.

Results: Code Generation

Code generation showed the widest variance in quality, but the variance was task-dependent:

Simple utility functions (sort a list, parse a date, write a regex): All 6 models performed nearly identically. The differences were cosmetic — variable naming, comment style, error handling approach. If your code generation use case is autocomplete or simple function writing, the model choice barely matters.

Complex multi-file refactoring: Claude and GPT-4o were clearly better at maintaining consistency across multiple code blocks. When I asked them to refactor a Python class into 3 separate files with proper imports, both models got the import paths right and maintained the class interface. DeepSeek sometimes hallucinated import paths — from utils.helpers import validate_input when no such module existed. Qwen would sometimes forget to import dependencies it used in the code.

Framework-specific code (React components, FastAPI routes, SQLAlchemy models): Qwen and Mistral were more likely to use outdated API patterns. I asked for a FastAPI endpoint with Pydantic v2 models, and both Qwen and Mistral wrote Pydantic v1 syntax. GPT-4o and Claude had the most current knowledge.

Performance characteristics:

Model	Code Gen Speed	Lines/min	Hallucinated Imports
GPT-4o	85 tokens/sec	~45	0/10
Claude 3.5 Sonnet	78 tokens/sec	~42	0/10
Gemini 1.5 Pro	72 tokens/sec	~38	1/10
DeepSeek V3	142 tokens/sec	~78	3/10
Qwen 2.5 72B	95 tokens/sec	~50	2/10
Mistral Large	88 tokens/sec	~46	1/10

One surprising finding: DeepSeek V3 was the fastest at generating code (about 2x the tokens/second of Claude), and for simple tasks the quality was indistinguishable from GPT-4o. For a code completion use case where speed matters more than complex reasoning, DeepSeek is a strong choice. The hallucinated imports are annoying but catchable with a simple linter.

Results: Summarization

Summarization was the most consistent category across models. All 6 produced reasonable summaries of technical documents. The differences were in the details:

Length control: Claude was best at hitting a target word count. When I said "summarize in 150 words," Claude averaged 155 words. Gemini tended to produce longer summaries — averaging 210 words when asked for 150. GPT-4o was in the middle at 175 words.

Key point extraction: I tested this by summarizing a 3,000-word technical document with 5 clearly important points and 3 minor details. GPT-4o and Claude consistently identified all 5 major points. DeepSeek sometimes prioritized less relevant details — it would mention the author's affiliation but miss the core finding.

Hallucination in summaries: This was the most concerning finding. When the source document didn't contain information the summary "needed" to be complete, most models would fabricate plausible details. For example, summarizing a paper that mentioned "tests were conducted in 2024" — two models added "across 500 participants" even though the paper never stated the sample size. Claude was least likely to do this. DeepSeek and Qwen were most likely.

The cost angle: For summarization, DeepSeek's quality was close enough to GPT-4o for most use cases, at 1/17th the cost. If you're building a news aggregator or document summarization tool, DeepSeek is the obvious choice — the occasional missed detail is worth the 94% cost savings.

Results: Multi-Turn Reasoning

This is where things got interesting. I ran conversations with 5+ turns where each turn built on previous context. This simulates a real chat application or a multi-step agent workflow.

Context retention: GPT-4o and Claude maintained context best across long conversations. In a 7-turn conversation about database migration, both models correctly referenced a constraint mentioned in turn 2 when answering in turn 7. Gemini occasionally "forgot" details from earlier turns, especially when the conversation topic shifted slightly.

Contradiction handling: When I introduced a contradictory instruction in turn 4 (contradicting something from turn 2), the models handled it very differently:

Claude: Flagged the contradiction explicitly. "You mentioned X earlier, but now you're asking for Y, which conflicts. Which would you prefer?"
GPT-4o: Silently followed the newer instruction. Didn't mention the conflict.
DeepSeek: Tried to reconcile both instructions, producing a confused output that partially satisfied neither.
Gemini: Followed the newer instruction, like GPT-4o, but added a brief note about the change.
Qwen: Asked for clarification, similar to Claude.
Mistral: Followed the newer instruction without noting the conflict.

Token cost in multi-turn: This is where costs compound. A 7-turn conversation with an average of 500 tokens per turn (including system message re-sends) costs:

GPT-4o: ~$0.175
Claude 3.5 Sonnet: ~$0.2625
DeepSeek V3: ~$0.0105
Qwen 2.5 72B: ~$0.014
Mistral Large: ~$0.14
Gemini 1.5 Pro: ~$0.175

That's a 25x cost difference between DeepSeek and Claude for the same conversation. If your chat app has 10,000 daily active users each averaging 5 conversations/day, the monthly bill is:

GPT-4o: ~$26,250
Claude: ~$39,375
DeepSeek: ~$1,575

Results: Edge Cases

This was the most revealing category. Edge cases tell you how a model fails, which matters more than how it succeeds:

Ambiguous instructions: I gave prompts like "Write a function to handle the user data" without specifying input/output format, error handling, or which data fields. Claude asked for clarification most often (7/10 times). GPT-4o made assumptions and proceeded (8/10 times). DeepSeek and Qwen sometimes just picked one interpretation without noting the ambiguity (6/10 and 5/10 times respectively).

Conflicting constraints: "Write a Python function that's both maximally readable and maximally performant" — a classic tension. Claude and GPT-4o handled this best — they'd note the trade-off and offer a balanced approach with a comment about the tension. Mistral would silently optimize for performance, producing less readable code. DeepSeek would sometimes satisfy neither constraint well.

Malformed input: All models handled typos and broken formatting reasonably well. The real test was adversarial prompts — injection attempts, attempts to override system prompts. Claude was most resistant to prompt injection. GPT-4o was generally resistant but had some edge cases where creative framing could bypass the system prompt. DeepSeek was more susceptible to prompt injection in my testing — a well-crafted "ignore previous instructions" prompt worked about 30% of the time.

Truncation handling: When I set a very low max_tokens limit (50 tokens) for a task that needed 200, the models behaved differently:

Claude: Stopped cleanly mid-sentence, easy to detect and retry.
GPT-4o: Sometimes tried to compress the answer, producing abbreviated but still useful output.
DeepSeek: Would sometimes produce truncated JSON (missing closing braces), breaking parsers.
Gemini: Similar to GPT-4o, tried to compress.
Qwen: Stopped cleanly but sometimes at awkward points (mid-word).
Mistral: Clean stops, similar to Claude.

The Cost Reality (Full Breakdown)

Let's talk money, because this is what actually determines which model you use in production.

For a typical production workload of 1M tokens/day (mix of input and output):

Model	Monthly Cost	Relative
GPT-4o	~$5,000	1x
Claude 3.5 Sonnet	~$7,500	1.5x
Gemini 1.5 Pro	~$5,000	1x
DeepSeek V3	~$300	0.06x
Qwen 2.5 72B	~$400	0.08x
Mistral Large	~$4,000	0.8x

That 15-25x cost difference between DeepSeek and the premium models is real. The question is whether the quality difference justifies the cost for your use case.

Here's my framework for making the decision:

Is the output user-facing? (e.g., chatbot responses, generated content) -> Use GPT-4o or Claude. Quality matters, users notice.
Is the output machine-consumed? (e.g., extraction, classification, routing) -> Use DeepSeek or Qwen. Cost matters more than polish.
Is latency critical? (e.g., real-time autocomplete, live chat) -> Use DeepSeek. 2x faster TTFT and 2x faster streaming.
Is the task safety-critical? (e.g., medical, legal, financial) -> Use Claude. Best instruction following and least hallucination.
Is the volume high? (e.g., >100K requests/day) -> Use DeepSeek. The cost savings compound fast.

What I Actually Ship With

After testing, I don't use one model for everything. My production setup:

Structured extraction / JSON parsing: Claude 3.5 Sonnet (highest reliability, worth the premium for parsing)
Code generation (simple): DeepSeek V3 (fastest, cheapest, good enough for autocomplete)
Code generation (complex): GPT-4o or Claude (better at multi-file consistency)
Summarization: DeepSeek V3 (quality is close enough, cost is 17x lower)
Multi-turn conversations: GPT-4o (best context retention, users notice dropped context)
Edge case / adversarial inputs: Claude (most robust against injection)

The routing logic adds complexity, but it saves about 60% compared to using GPT-4o for everything, with no measurable quality loss for end users. The key insight is that different tasks have different quality requirements, and the cheapest model that meets the requirement for each task is the right model.

How to Run This Test Yourself

If you want to test these models with your own prompts, here's the exact setup I used:

import openai
import time
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestResult:
    model: str
    ttft: float
    total_time: float
    tokens_per_sec: float
    response: str
    valid_json: Optional[bool] = None
    cost_estimate: float = 0.0

# All models behind a single gateway
client = openai.OpenAI(
    base_url="https://api.xidao.online/v1",
    api_key="your-api-key"
)

MODELS = [
    "gpt-4o",
    "claude-3-5-sonnet-20241022",
    "gemini-1.5-pro",
    "deepseek-chat",
    "qwen-2.5-72b-instruct",
    "mistral-large-latest"
]

def test_model(model: str, prompt: str, schema: dict = None) -> TestResult:
    messages = [{"role": "user", "content": prompt}]
    if schema:
        messages.insert(0, {
            "role": "system",
            "content": f"Respond only with valid JSON: {json.dumps(schema)}"
        })

    start = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        max_tokens=1000
    )

    first_token_time = None
    tokens = []
    for chunk in response:
        if first_token_time is None:
            first_token_time = time.time() - start
        if chunk.choices[0].delta.content:
            tokens.append(chunk.choices[0].delta.content)

    full_response = "".join(tokens)
    total_time = time.time() - start

    result = TestResult(
        model=model,
        ttft=first_token_time or 0,
        total_time=total_time,
        tokens_per_sec=len(full_response.split()) / total_time if total_time > 0 else 0,
        response=full_response
    )

    if schema:
        try:
            parsed = json.loads(full_response.strip())
            result.valid_json = True
        except json.JSONDecodeError:
            result.valid_json = False

    return result

def run_extraction_test():
    prompt = "Extract into JSON: Sarah Chen, 28, Berlin, backend dev, 5 years, Python/Go/Rust"
    schema = {
        "name": "string", "age": "integer", "location": "string",
        "occupation": "string", "years_experience": "integer",
        "skills": "array of strings"
    }

    for model in MODELS:
        result = test_model(model, prompt, schema)
        print(f"{model}: TTFT={result.ttft:.2f}s, "
              f"JSON={result.valid_json}, "
              f"speed={result.tokens_per_sec:.0f} tok/s")

if __name__ == "__main__":
    run_extraction_test()

The key insight: running the same prompt across multiple models with a single API gateway makes comparison trivial. You don't need a benchmark framework — you need your actual prompts and a JSON parser.

Lessons Learned (The Hard Way)

After running these tests, here are the things I wish I'd known before:

1. Test with YOUR prompts, not generic benchmarks. Our extraction prompts had specific quirks (nested objects, optional fields, arrays of enums) that triggered different failure modes in different models. A generic "JSON generation" benchmark wouldn't have caught these.

2. The cheapest model is often good enough. I was using Claude for everything before this test. Switching extraction and summarization to DeepSeek saved us ~$6,000/month with no noticeable quality drop for those use cases.

3. Speed matters more than you think. DeepSeek's 142 tokens/sec vs Claude's 78 tokens/sec means a 500-token response renders in 3.5 seconds vs 6.4 seconds. Users notice. In our A/B tests, faster streaming reduced abandonment by 12%.

4. The failure mode matters more than the success mode. A model that fails gracefully (with an error message) is better than one that fails silently (with garbage output). Claude's explicit contradiction flagging saved us from a data corruption bug.

5. Multi-model routing is worth the complexity. The routing logic took about 2 days to implement. The cost savings paid for that engineering time in the first week. If you're processing >100K API calls/month, the ROI is obvious.

What's Your Experience?

I'm curious what patterns others have found. Are you seeing different trade-offs? Have you found a model that's surprisingly good for a specific task? Or one that's surprisingly bad despite its benchmark scores?

Specific questions I'd love answers to:

How do these models compare on function calling reliability? I didn't test that.
Has anyone tested Claude 3.5 Haiku vs GPT-4o-mini for high-volume extraction?
What's your experience with Gemini 1.5 Flash for summarization? Is it good enough?
Are there specific prompt engineering tricks that close the gap between cheap and expensive models?

Drop your findings in the comments — especially if you've tested on tasks I didn't cover here (translation, image description, function calling, etc.). The more real-world data points we have, the better decisions we can all make.

If you want to reproduce this test with your own prompts, the code above works with any OpenAI-compatible API endpoint. I used XiDao as my gateway because it lets me route to all 6 models through a single endpoint with unified billing, but you can adapt it to any setup.

DEV Community