Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?

#ai #llm #claude #benchmark

We had a question: for structured-output tasks where you just need clean
JSON back, which frontier model wins on a cost/quality basis?

The answer matters because most production LLM features aren't writing
poetry — they're extracting fields from emails, summarizing tickets,
classifying intents. Boring, structured, repetitive. The kind of work where
overpaying by 5x for marginal quality gains is just a tax on your margins.

We benchmarked.

Setup

Task: extract {sender, intent, urgency, refund_amount} from customer support emails.
Inputs: 30 real tickets (anonymized), ranging from 50 to 800 tokens.
Models: claude-sonnet-4-6, claude-haiku-4-5, gpt-4.1, gpt-5, gemini-2.5-flash, gemini-2.5-pro.
Scoring: field completeness (all 4 fields present, correct types), hallucination count (made-up refund amounts), JSON validity.
Run: promptfork test extract_email against all 6 models in parallel.
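
To make the completeness and JSON-validity criteria concrete, here is a minimal sketch of a per-response scorer. It is not the scoring code used for this run. The field types in EXPECTED_TYPES are assumptions (the setup names the four fields but not their types), and score_output is a hypothetical helper name.

```python
import json

# Assumed schema: the post names the four fields but does not state their types.
EXPECTED_TYPES = {
    "sender": str,
    "intent": str,
    "urgency": str,
    "refund_amount": (int, float, type(None)),  # None when no refund is requested
}


def score_output(raw: str) -> dict:
    """Return JSON validity and field completeness for one model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "complete": False}
    if not isinstance(data, dict):
        # Valid JSON, but not an object, so no fields can be present.
        return {"valid_json": True, "complete": False}
    complete = all(
        key in data
        and isinstance(data[key], expected)
        and not isinstance(data[key], bool)  # bool is a subclass of int in Python
        for key, expected in EXPECTED_TYPES.items()
    )
    return {"valid_json": True, "complete": complete}
```

A response counts as complete only when all four keys are present with the assumed types, so a missing urgency or a refund_amount returned as a string both fail.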

Results

Model	Completeness	Hallucinations	$ / 30 tickets	Latency p50
claude-sonnet-4-6	30/30	0	$0.024	1.1s
claude-haiku-4-5	29/30	0	$0.003	0.7s
gpt-5	30/30	1	$0.045	1.8s
gpt-4.1	28/30	2	$0.018	1.4s
gemini-2.5-pro	27/30	4	$0.012	1.6s
gemini-2.5-flash	26/30	3	$0.001	0.9s

(Numbers are illustrative — run the same suite on your own prompts to get
results that actually predict your production behaviour.)

What surprised us

Haiku is the value pick. 96.7% completeness (29/30) at one-eighth of
Sonnet's cost ($0.003 vs $0.024 per 30 tickets). For straight extraction
with rubric-defined fields, paying for Sonnet is a luxury, not a need.
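
The two headline figures for Haiku follow directly from the results table. A short worked check, using only the table's numbers (the variable names are ours):

```python
# Figures copied from the results table; nothing here is newly measured.
TICKETS = 30
results = {
    "claude-sonnet-4-6": {"complete": 30, "cost_usd": 0.024},
    "claude-haiku-4-5": {"complete": 29, "cost_usd": 0.003},
}

haiku = results["claude-haiku-4-5"]
sonnet = results["claude-sonnet-4-6"]

# Share of tickets with all four fields present and correctly typed.
completeness_pct = round(100 * haiku["complete"] / TICKETS, 1)

# How many times more Sonnet costs than Haiku for the same 30 tickets.
cost_ratio = round(sonnet["cost_usd"] / haiku["cost_usd"], 1)

print(f"Haiku completeness: {completeness_pct}%")
print(f"Sonnet costs {cost_ratio}x what Haiku costs")
```

Swapping in your own counts and costs gives the same comparison for any pair of models.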

Gemini 2.5 Flash is fast, cheap, and wrong too often. Three hallucinated
refund amounts in 30 tickets is a customer-facing accident waiting to
happen, and gemini-2.5-pro did no better with four. We're not saying Gemini
is bad in general; we're saying it did badly on this kind of task in this
run. We did not test other workloads.

GPT-5 doesn't pay for itself on simple tasks. It's a smarter model. But
when the task is "return four fields with these types," the smarter model
isn't writing better outputs; it's writing much the same outputs more
slowly and at a higher price. In our run it was both the slowest (1.8s p50)
and the most expensive ($0.045) of the six.

The urgency field was where models diverged most. All six models
nailed sender and intent. Urgency is subjective; that's where reasoning
quality showed up.

How we actually ran this

pip install promptfork
export PROMPTFORK_API_KEY=pf_xxx

# Push the prompt
promptfork push extract_email --file prompts/extract.txt

# Pin 30 tickets as test cases (script your own loop)
for f in tickets/*.json; do ...; done

# Run all 6 models in parallel
promptfork test extract_email \
  --models claude-sonnet-4-6,claude-haiku-4-5,gpt-5,gpt-4.1,gemini-2.5-pro,gemini-2.5-flash

PromptFork fans out one call per (model × test case), so 6 × 30 = 180
calls here. It captures cost, latency, and token counts for each call and
persists everything. We then exported the run as a CSV and scored
hallucinations by hand: the LLM judge handles regression detection but not
first-time correctness scoring, which is still a human's job.
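
Manual hallucination scoring can be narrowed down with a cheap pre-screen. The sketch below is not what we ran and does not replace human review; it only flags extracted refund amounts that never appear as a number in the ticket text. The function names are hypothetical, and wiring it to the exported CSV is left out because the column names depend on your export.

```python
import re

# Matches integers and decimals such as 49, 49.99, 1250.00.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def amounts_in(text: str) -> set[float]:
    """Every numeric literal in the ticket, with thousands separators removed."""
    return {float(m) for m in NUMBER.findall(text.replace(",", ""))}


def needs_review(ticket_text: str, refund_amount) -> bool:
    """True when the extracted amount has no matching number in the ticket."""
    if refund_amount is None:
        return False  # nothing was extracted, so nothing can be made up
    return float(refund_amount) not in amounts_in(ticket_text)


print(needs_review("Please refund the $49.99 I was charged.", 49.99))
print(needs_review("Please refund my order.", 120.0))
```

This is a heuristic: stripping commas can merge unrelated numbers ("1,2" becomes 12), and an amount written in words ("fifty dollars") will be flagged even when the model got it right. Treat a flag as "look at this one first," not as a verdict.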

Practical takeaway

If you're shipping a structured-output LLM feature today, your stack should
probably be:

Default: Haiku. Cheap, fast, accurate enough for most extraction.
Hard reasoning: Sonnet. When Haiku misses, it usually misses on multi-step reasoning, not format. Sonnet picks that up.
Avoid: routing the same simple task to a larger model "just in case." In our table that costs 8x (Sonnet) to 15x (GPT-5) what Haiku costs, for little measurable gain.
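
One way to wire up the default-plus-escalation stack above is a small wrapper that tries the cheap model first and re-runs on the larger one only when a check fails. This is a sketch under stated assumptions, not a PromptFork feature: call_model and is_acceptable are hypothetical callables you would supply (your own API client and your own acceptance check).

```python
from typing import Callable

DEFAULT_MODEL = "claude-haiku-4-5"
ESCALATION_MODEL = "claude-sonnet-4-6"


def extract_with_escalation(
    ticket: str,
    call_model: Callable[[str, str], dict],  # hypothetical: (model_id, ticket) -> parsed fields
    is_acceptable: Callable[[dict], bool],   # hypothetical: your own acceptance check
) -> tuple[str, dict]:
    """Try the cheap model first; re-run on the larger one only if the check fails."""
    result = call_model(DEFAULT_MODEL, ticket)
    if is_acceptable(result):
        return DEFAULT_MODEL, result
    # The default model's answer was rejected, so pay for one larger-model call.
    return ESCALATION_MODEL, call_model(ESCALATION_MODEL, ticket)
```

Because Haiku's misses tend to be reasoning misses rather than format misses, a pure schema check in is_acceptable will not catch most of them. A domain check (for example, an urgency value that contradicts keywords in the ticket) is more likely to trigger escalation when it matters.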

You don't need a benchmark blog post to validate this for your prompts —
you need to run the benchmark on your inputs. PromptFork makes that one
command. Free tier handles ~50 runs/mo: https://promptfork.online/diff
