Authors & Volunteers: Dr. Mattia Salvarani (UNIMORE), Prof. Carlos Hernández (University of Cambridge), Dr. Aisha Rahman (University of Toronto), Prof. Luca Moretti (ETH Zürich)
GitHub Repo: Click Here
With little new happening in the AI world over the past two days, we took the opportunity to run a focused study comparing model quality head-to-head. We evaluated GPT‑5 and Claude 4 Sonnet across 200 diverse prompts spanning reasoning, coding, analysis, knowledge, writing, and safety-critical scenarios, all run through our Cubent VS Code Extension. The study measured task success, factual precision, reasoning quality, helpfulness, conciseness, safety/refusal correctness, hallucination rate, and latency. To reflect real-world user experience, we also report p50/p90/p95 latency and cost-normalized quality.
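To make the latency and cost metrics concrete, here is a minimal sketch of how per-prompt results could be aggregated into the p50/p90/p95 latencies and a cost-normalized quality figure. The record fields and the quality-per-dollar definition are illustrative assumptions, not taken from the study's actual pipeline.

```python
import numpy as np

# Hypothetical per-prompt records: latency in seconds, quality score in [0, 1],
# and API cost in USD. Field names are illustrative, not from the actual study.
records = [
    {"latency_s": 5.8, "quality": 0.90, "cost_usd": 0.012},
    {"latency_s": 6.9, "quality": 0.85, "cost_usd": 0.015},
    {"latency_s": 4.7, "quality": 0.88, "cost_usd": 0.010},
]

latencies = np.array([r["latency_s"] for r in records])
qualities = np.array([r["quality"] for r in records])
costs = np.array([r["cost_usd"] for r in records])

# Latency percentiles reported in the study (p50 / p90 / p95).
p50, p90, p95 = np.percentile(latencies, [50, 90, 95])

# One plausible definition of cost-normalized quality: mean quality per dollar spent.
cost_normalized_quality = qualities.mean() / costs.mean()

print(f"p50={p50:.1f}s  p90={p90:.1f}s  p95={p95:.1f}s")
print(f"cost-normalized quality: {cost_normalized_quality:.1f} quality/USD")
```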
Key findings:
Speed: Claude 4 Sonnet is consistently faster (median 5.1 s) than GPT‑5 (median 6.4 s); Sonnet also shows lower p95 latency.
Precision: On fact-heavy tasks, Sonnet is slightly more precise (93.2% vs. 91.4% factual precision) and exhibits a lower hallucination rate.
Overall Quality: GPT‑5 achieves higher task success overall (86% vs. 84%), particularly on multi-step reasoning and code generation/debugging.
Safety & Refusals: Sonnet shows a marginal edge in refusal correctness (96% vs. 94%), while both models maintain high safety compliance.
Domain Trends: Sonnet is faster and a touch more precise on editing, summarization, and short-form Q&A; GPT‑5 leads on complex reasoning, code synthesis, data analysis, and multilingual tasks.
All results include bootstrapped 95% confidence intervals and effect sizes. We release the prompt taxonomy, rubric, and annotation protocol for reproducibility.
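For readers who want to sanity-check the statistics, the sketch below shows one standard way to compute a percentile-bootstrap 95% confidence interval on the difference in task-success rates, plus Cohen's h as an effect size for two proportions. The data here is synthetic and the function is an assumption about the general approach, not the study's exact analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in mean success rates (a - b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=a.size, replace=True).mean()
                    - rng.choice(b, size=b.size, replace=True).mean())
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return a.mean() - b.mean(), (lo, hi)

# Illustrative binary task-success outcomes (1 = success), not the real data.
gpt5_success = rng.binomial(1, 0.86, size=200)
sonnet_success = rng.binomial(1, 0.84, size=200)

diff, (lo, hi) = bootstrap_diff_ci(gpt5_success, sonnet_success)

# Cohen's h: a simple effect size for comparing two proportions.
p1, p2 = gpt5_success.mean(), sonnet_success.mean()
cohens_h = 2 * (np.arcsin(np.sqrt(p1)) - np.arcsin(np.sqrt(p2)))

print(f"Δ success = {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], Cohen's h = {cohens_h:.2f}")
```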