Authors & Volunteers: Dr. Mattia Salvarani (UNIMORE), Prof. Carlos Hernández (University of Cambridge), Dr. Aisha Rahman (University of Toronto), Prof. Luca Moretti (ETH Zürich)
GitHub Repo: Click Here
With little new happening in the AI world over the past two days, we took the opportunity to run a focused study comparing model quality head-to-head. We evaluated GPT‑5 and Claude 4 Sonnet across 200 diverse prompts spanning reasoning, coding, analysis, knowledge, writing, and safety-critical scenarios, all run through our Cubent VS Code Extension. The study measured task success, factual precision, reasoning quality, helpfulness, conciseness, safety/refusal correctness, hallucination rate, and latency. To reflect real-world user experience, we also report p50/p90/p95 latency and cost-normalized quality.
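To make the latency and cost metrics concrete, here is a minimal sketch of how per-prompt results could be aggregated into the p50/p90/p95 latencies and a cost-normalized quality figure. The record fields and the quality-per-dollar definition are illustrative assumptions, not taken from the study's actual pipeline.

```python
import numpy as np

# Hypothetical per-prompt records: latency in seconds, quality score in [0, 1],
# and API cost in USD. Field names are illustrative, not from the actual study.
records = [
    {"latency_s": 5.8, "quality": 0.90, "cost_usd": 0.012},
    {"latency_s": 6.9, "quality": 0.85, "cost_usd": 0.015},
    {"latency_s": 4.7, "quality": 0.88, "cost_usd": 0.010},
]

latencies = np.array([r["latency_s"] for r in records])
qualities = np.array([r["quality"] for r in records])
costs = np.array([r["cost_usd"] for r in records])

# Latency percentiles reported in the study (p50 / p90 / p95).
p50, p90, p95 = np.percentile(latencies, [50, 90, 95])

# One plausible definition of cost-normalized quality: mean quality per dollar spent.
cost_normalized_quality = qualities.mean() / costs.mean()

print(f"p50={p50:.1f}s  p90={p90:.1f}s  p95={p95:.1f}s")
print(f"cost-normalized quality: {cost_normalized_quality:.1f} quality/USD")
```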
Key findings:
Speed: Claude 4 Sonnet is consistently faster (median 5.1 s) than GPT‑5 (median 6.4 s); Sonnet also shows lower p95 latency.
Precision: On fact-heavy tasks, Sonnet is slightly more precise (93.2% vs. 91.4% factual precision) and exhibits a lower hallucination rate.
Overall Quality: GPT‑5 achieves higher task success overall (86% vs. 84%), particularly on multi-step reasoning and code generation/debugging.
Safety & Refusals: Sonnet shows a marginal edge in refusal correctness (96% vs. 94%), while both models maintain high safety compliance.
Domain Trends: Sonnet is faster and a touch more precise on editing, summarization, and short-form Q&A; GPT‑5 leads on complex reasoning, code synthesis, data analysis, and multilingual tasks.
All results include bootstrapped 95% confidence intervals and effect sizes. We release the prompt taxonomy, rubric, and annotation protocol for reproducibility.
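For readers who want to sanity-check the statistics, the sketch below shows one standard way to compute a percentile-bootstrap 95% confidence interval on the difference in task-success rates, plus Cohen's h as an effect size for two proportions. The data here is synthetic and the function is an assumption about the general approach, not the study's exact analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in mean success rates (a - b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=a.size, replace=True).mean()
                    - rng.choice(b, size=b.size, replace=True).mean())
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return a.mean() - b.mean(), (lo, hi)

# Illustrative binary task-success outcomes (1 = success), not the real data.
gpt5_success = rng.binomial(1, 0.86, size=200)
sonnet_success = rng.binomial(1, 0.84, size=200)

diff, (lo, hi) = bootstrap_diff_ci(gpt5_success, sonnet_success)

# Cohen's h: a simple effect size for comparing two proportions.
p1, p2 = gpt5_success.mean(), sonnet_success.mean()
cohens_h = 2 * (np.arcsin(np.sqrt(p1)) - np.arcsin(np.sqrt(p2)))

print(f"Δ success = {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], Cohen's h = {cohens_h:.2f}")
```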