The 12% Accuracy Gap Nobody Talks About
GPT-4o scores 90.2% on HumanEval pass@1. Claude 3.5 Sonnet hits 92.0%. Gemini 1.5 Pro lands at 84.1%. Those are the headline numbers from the model cards, but here's what actually happens when you run the same 164 Python problems through all three models with identical prompts: the gap shrinks to 4 percentage points, and the failure modes tell you more than the topline numbers.
I ran the full HumanEval benchmark on all three models using the same zero-shot prompt format. No few-shot examples, no chain-of-thought scaffolding, just the problem description and function signature. The goal wasn't to reproduce the model card numbers (those come from proprietary eval harnesses with different sampling strategies) but to see which model writes correct code on the first try, given the kind of prompt you'd actually use in a REPL or coding assistant.
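To make the setup concrete, here is a minimal sketch of how a single-sample pass@1 check can work. It is not the harness used for the numbers above. It assumes each problem is a dict with the `prompt`, `test`, and `entry_point` fields found in the public HumanEval dataset; the helper names (`build_program`, `passes`, `pass_at_1`) and the toy problem are hypothetical and only there for illustration.

```python
def build_program(problem: dict, completion: str) -> str:
    """Join the zero-shot prompt, the model's completion, and the unit tests."""
    return (
        problem["prompt"]            # function signature + docstring
        + completion                 # model output: the function body
        + "\n\n"
        + problem["test"]            # defines check(candidate)
        + "\n\n"
        + f"check({problem['entry_point']})\n"
    )


def passes(problem: dict, completion: str) -> bool:
    """Return True if the completed function passes every assertion.

    Assumption: the code is trusted. A real harness should run this in a
    sandboxed subprocess with a timeout, never with a bare exec().
    """
    program = build_program(problem, completion)
    try:
        exec(program, {"__name__": "__humaneval__"})
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False


def pass_at_1(results: list[bool]) -> float:
    """With one sample per problem, pass@1 is the fraction of problems solved."""
    return sum(results) / len(results)


# Hypothetical toy problem in the HumanEval layout (not one of the 164).
toy_problem = {
    "task_id": "Toy/0",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}

if __name__ == "__main__":
    completions = ["    return a + b\n", "    return a - b\n"]
    results = [passes(toy_problem, c) for c in completions]
    print(results, pass_at_1(results))
```

Counting any exception as a failure keeps the scoring strict: a completion that does not parse is treated the same as one that returns the wrong answer, which matches the "correct on the first try" framing.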
What HumanEval Actually Tests
Continue reading the full article on TildAlice
