Shmulik Cohen

Posted on • Originally published at shmulc.substack.com

HumanEval - The Most Inhuman Benchmark For LLM Code Generation

How OpenAI Evaluated Its Model's Coding Capabilities

AI is Everywhere, But Code is the Real Test

Large Language Models (LLMs) are everywhere. They help us write, summarize, translate, and even generate images. But if there’s one domain where they matter the most, it’s coding. Why? Because code is a power multiplier.

Code transforms effort in dramatic ways. A single efficient function can automate hours of manual work, while the right algorithm can scale a business to serve millions of users. Unlike creative text where quality is subjective, code’s success can be measured by its functionality, performance, and reliability. This makes coding capabilities especially valuable in AI systems, as they directly translate to measurable productivity gains and technical leverage that compounds over time.

This makes coding one of the most valuable skills in the AI era. LLMs that can generate correct, efficient, and maintainable code aren’t just convenient — they are transformative. But this also means we can’t afford to judge them based on vibes. Code execution demands precision, and that’s why we must benchmark AI coding capabilities rigorously rather than trusting our intuition.

Once Upon A Time

In this post, we are going to talk about HumanEval, one of the first benchmarks for evaluating LLMs on code generation, from the house of OpenAI.

The benchmark was introduced in the paper **Evaluating Large Language Models Trained on Code** by Chen et al. in 2021. That is not long ago on the calendar, but it is millennia in the ML world (more than a year before ChatGPT!).

The paper focuses on three main topics:

  • HumanEval: A carefully crafted evaluation set that measures functional correctness for synthesizing programs from docstrings

  • Codex: OpenAI’s GPT language model finetuned on publicly available code from GitHub, and their journey to improve its performance on the benchmark

  • Potential: The potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

As the name of this post suggests, we will focus on the first part — the benchmark that would go on to become, for several years, the industry standard for measuring the coding abilities of LLMs.

The Code Evaluation Challenge

Evaluating LLM outputs presents challenges that simply didn’t exist in classic machine learning. In traditional ML, success metrics were clear-cut: you evaluated image classification by accuracy, recommendation systems by precision and recall, or regression models by mean squared error.

But with language models, evaluation becomes more complex. How do you objectively measure the quality of generated text? Is a summary “good enough”? Does a translation capture nuance and tone? These questions invite subjective judgment rather than binary assessment.

When it comes to code generation, we face similar evaluation hurdles. How do we systematically test if an AI can write good code? What does “good” even mean in this context?

Take a minute to think: What approach would you use to evaluate AI-generated code? How could we create a standardized benchmark that objectively measures coding capability? Feel free to share your thoughts in the comments below.

LeetCode offers an interesting parallel here. For those unfamiliar, LeetCode is a platform where programmers practice coding challenges ranging from simple algorithms to complex systems design. Each problem comes with:

  • A clear problem statement

  • Input/output examples

  • Hidden test cases that validate edge cases

  • An automated evaluation system that tests solution correctness and efficiency

Classic Question from Leetcode

This approach — providing a problem description and automatically testing the resulting code — creates an objective framework for assessment. The solution either works for all test cases or it doesn’t. Additionally, LeetCode evaluates performance metrics like execution time and memory usage, adding dimensions beyond mere correctness.

HumanEval takes inspiration from this model but focuses specifically on measuring an AI’s ability to turn natural language descriptions into functional code. Instead of testing a human programmer’s problem-solving skills, it tests whether an AI can parse a function description (docstring) and generate appropriate code that handles all requirements correctly.

The beauty of this approach is that it provides a clear, objective measure: either the generated code passes all test cases or it doesn’t. This binary outcome gives us a solid foundation for comparing different models’ coding capabilities.

The Dataset

The dataset consists of 164 hand-written Python programming problems that assess language comprehension, algorithms, and simple mathematics; some are comparable to simple software interview questions.

Each problem comes as a JSON object with the following fields:

  • task_id: Just a unique id to identify the problem.

  • prompt: The code snippet, usually contains the required imports, docstring and function definitions. The only input that the LLM will see.

  • entry_point: The name of the function to test; the evaluator calls it when running the tests against the generated code.

  • canonical_solution: The canonical solution for the problem.

  • test: One or more test suites for the problem.
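To make this concrete, here is an abbreviated sketch of the first problem (HumanEval/0), written as a Python dict; the real prompt, canonical solution, and test suite are longer than shown here:

```python
# Abbreviated sketch of HumanEval/0; the real prompt and test suite are longer.
problem = {
    "task_id": "HumanEval/0",
    "prompt": (
        "from typing import List\n"
        "\n"
        "\n"
        "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
        '    """Check if in given list of numbers, any two numbers are closer\n'
        "    to each other than the given threshold.\n"
        '    """\n'
    ),
    "entry_point": "has_close_elements",
    "canonical_solution": (
        "    for i, a in enumerate(numbers):\n"
        "        for j, b in enumerate(numbers):\n"
        "            if i != j and abs(a - b) < threshold:\n"
        "                return True\n"
        "    return False\n"
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n"
        "    assert candidate([1.0, 2.0, 3.0], 0.5) == False\n"
    ),
}

# The model only ever sees problem["prompt"]; everything else is held out.
```

Note how the canonical solution and the tests are just code strings: appending the solution to the prompt yields a complete, runnable function.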

The Benchmark

To run the benchmark and evaluate the generated code from the LLM, we need to pass the prompt (and only the prompt) to a function that will call the LLM. In the repo there is a simple example for it:

from human_eval.data import write_jsonl, read_problems

problems = read_problems()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

You just have to provide generate_one_completion to make it work.
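For illustration, here is a minimal stand-in for `generate_one_completion`; the body below is a placeholder you would replace with a call to your model of choice — the harness only requires that the function returns text that continues the prompt:

```python
def generate_one_completion(prompt: str) -> str:
    """Stand-in for a real model call.

    A real implementation would send `prompt` to an LLM and return only
    the continuation (the function body), stopping before the model
    starts a new function or class.
    """
    # Placeholder body so the pipeline runs end to end:
    return "    raise NotImplementedError\n"

# The evaluator concatenates prompt + completion into one program:
prompt = 'def add(a, b):\n    """Return a + b."""\n'
program = prompt + generate_one_completion(prompt)
```

Because the prompt ends mid-function, the completion must be indented as a function body for the concatenated program to parse.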

In the repo there are also functions that evaluate the results in samples.jsonl and produce a score. It is extremely dangerous to run the benchmark on your own machine outside of a robust security sandbox, and we will touch on that later.
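Under the hood, the evaluation roughly stitches the prompt, the generated completion, and the test suite into one program and executes it. Below is a simplified sketch of that idea (not the repo's actual code); the real harness runs each sample in a separate process with timeouts and OS-level guards, which is exactly why you should not run it unsandboxed:

```python
def run_one_sample(problem: dict, completion: str) -> bool:
    """Simplified functional-correctness check for one sample.

    WARNING: executes arbitrary code in-process. The real human_eval
    harness adds process isolation and timeouts; never run untrusted
    completions like this.
    """
    program = (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + "\ncheck(" + problem["entry_point"] + ")\n"
    )
    try:
        exec(program, {})
        return True   # all assertions in check() passed
    except Exception:
        return False  # crash or failed assertion
```

HumanEval's test suites define a `check(candidate)` function, so appending a call to `check(<entry_point>)` runs every assertion against the generated function.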

Understanding Pass@k Metric

HumanEval uses ‘functional correctness’ rather than text similarity to evaluate code generation. The pass@k (Kulal et al., 2020) metric measures the probability that at least one of the top k generated samples for a problem passes all unit tests.

The formula is simple:

  • n: Total samples generated

  • c: Number of correct samples

  • k: Number of top samples considered

pass@k = 1 - C(n-c, k) / C(n, k)
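The formula can be computed with a numerically stable product form that avoids huge binomial coefficients (this mirrors the unbiased estimator described in the paper):

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: samples that passed, k: budget.
    Uses a running product instead of binomial coefficients.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    result = 1.0
    for i in range(n - c + 1, n + 1):
        result *= 1.0 - k / i
    return 1.0 - result
```

For example, with n = 4 samples of which c = 2 pass, pass@1 comes out to 0.5, matching the intuition that a single draw succeeds half the time.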

This metric reflects real-world usage where developers might generate multiple solutions and pick one that works. It’s now standard in code generation benchmarks, including the Papers with Code leaderboards that track pass@10 and pass@100 scores.

By focusing on whether code actually runs correctly rather than how it looks, HumanEval provides a practical assessment of code generation capabilities.

Paper’s Results

In the original paper, the evaluated models scored quite low: GPT-3 solved 0% of the problems and GPT-J solved 11.4%. Codex, the model introduced in the paper, reached a higher score of 28.8%. With repeated sampling from the model, they were able to solve 70.2% of the problems.

Limitations and Problems With HumanEval

There are a few problems with HumanEval that we must address; some are specific to this benchmark, but most apply to LLM benchmarks in general. The main ones are:

  • Benchmark Leakage: One of the biggest problems with benchmarks is leakage: what if the LLM has already seen the canonical solution and simply repeats it instead of solving the problem? This keeps getting harder to rule out, since newer models are trained on scrapes of the entire web.

  • Tests Are Not Enough: HumanEval scores a solution on a single criterion: whether it passes the tests. There are two main problems with that. First, passing the tests doesn't necessarily mean the solution is correct; the LLM might ignore untested edge cases or handle only the few cases that make the tests pass. Second, correctness is not the only property of good code: good code should also be readable, maintainable, and efficient, and HumanEval ignores all of these completely.

  • Running Model-Generated Code: Running untrusted, model-generated code is extremely dangerous and strongly discouraged outside of a robust security sandbox. Some evaluators probably didn't follow this rule…

  • Passing Your Own Benchmark: There is an inherent conflict of interest in designing your own benchmark and then designing a model that performs really well on it.

Despite these major problems, HumanEval is an important milestone in the world of ML benchmarks and has been used to evaluate dozens of models since its release.

HumanEval Today

LLMs have continuously improved their performance on HumanEval over time; today the top models (like Claude 3.5 Sonnet) reach pass@1 scores above 90%. You can see the gradual improvement over the years in the graph below.

HumanEval is considered a relatively easy benchmark today, but it paved the way for newer and harder benchmarks — to name a few: MBPP, HumanEval+ (which adds many more tests per problem), and SWE-bench (which evaluates models on real GitHub issues).

And many more. Would you like to hear more about some of them, or about other benchmarks? Tell me in the comments!

My personal perspective

I learned about HumanEval and LLM benchmarks from Dr. Ofir Press in the NLP course at TAU. We then did a final project on generating novel programming exercises with Large Language Models.

We went through a lot of iterations, and in the end we used the HumanEval framework as a base to create more questions (in the same JSON format), achieving some cool results and crazy questions like this one:

HumanEval has its limitations, but it holds a very warm spot in my heart, and I wanted to share it with you. I hope you enjoyed this and learned something new!


