
HumanEval

TL;DR

A benchmark where the model writes a Python function from its docstring, scored by whether it passes hidden unit tests.

Level 1

HumanEval was published by OpenAI in 2021 alongside Codex. It contains 164 hand-crafted Python programming problems, each with a function signature, docstring, and several unit tests. The metric is pass@k: the percentage of problems where at least one of k generated solutions passes all tests. Most frontier models now score 90%+ on pass@1, so HumanEval is mostly saturated as a differentiator.
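Each problem pairs a signature and docstring with hidden tests. Here is a toy problem in that style; the task, the reference solution, and the `check` harness below are all hypothetical illustrations, not taken from the actual HumanEval set:

```python
# A toy problem in the HumanEval style (hypothetical example).
# The model sees only the signature and docstring; the tests are hidden.
def count_vowels(s: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in s, case-insensitive."""
    return sum(1 for ch in s.lower() if ch in "aeiou")


# Hidden unit tests take roughly this shape: a check function
# that asserts on the candidate's behavior.
def check(candidate):
    assert candidate("Hello") == 2
    assert candidate("xyz") == 0
    assert candidate("AEIOU") == 5


check(count_vowels)  # a candidate passes only if every assertion holds
```

A generated solution is scored pass/fail: one failed assertion means the problem counts as unsolved for that sample.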

Level 2

HumanEval's 164 problems span string manipulation, list operations, math, and simple algorithms. The test suite is held out, not shown to the model. Pass@1 is the strictest metric: one shot, and the solution must be correct. Pass@10 and pass@100 allow the model multiple attempts. HumanEval+ extends the suite with additional adversarial tests. Frontier models crossed 90% pass@1 in 2023; by 2026 the frontier is 98-99% and the benchmark is used mostly as a sanity check. Real-world coding capability is better measured by SWE-bench, Aider polyglot, or the LiveBench coding subset.

Level 3

Pass@k is estimated from n generated samples using the unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of correct samples. Temperature affects this significantly; most reports use temperatures of 0.2-0.8 and sample 20-100 solutions. Test leakage has been documented: HumanEval problems appear in public training data, limiting its signal. HumanEval+ (2023) adds mutation-based tests to harden the suite. Primary criticism: the one-function scope and academic-toy style don't predict agent or repo-scale coding performance.
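The estimator above can be written directly from the formula; a minimal sketch (function name and guard clause are our own):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: samples that passed all unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k
        # subset of the n samples must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 5 correct, pass@1 is 0.5 but pass@10 is 1.0:
print(pass_at_k(10, 5, 1))   # 0.5
print(pass_at_k(10, 5, 10))  # 1.0
```

The per-problem estimates are then averaged across all 164 problems to get the reported score. Computing the estimate this way, rather than naively running k fresh generations, reuses one pool of n samples for every k.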

Why this matters now

HumanEval is effectively solved. Announcements of new coding-capable models now skip HumanEval and lead with SWE-bench Verified.

The takeaway for you
If you are a Researcher
  • 164 Python problems, held-out unit tests, pass@k metric
  • Saturated at 90%+ pass@1 for frontier models
  • HumanEval+ and MBPP extend the format

If you are a Builder
  • Ignore HumanEval scores: they no longer differentiate
  • Use SWE-bench Verified, Aider polyglot, or LiveBench-coding instead
  • HumanEval 100% on a small model tells you nothing about agent performance

If you are an Investor
  • HumanEval saturation marked the end of the first coding-AI bench era
  • Watch SWE-bench as the next-gen coding signal
  • Any pitch deck still leading with HumanEval is out of date

If you are Curious / a Normie
  • The old-school AI coding test: 164 Python puzzles
  • Almost every AI scores close to perfect now
  • Look at SWE-bench instead for meaningful comparisons
Don't mix them up
HumanEval vs SWE-bench

HumanEval is 164 isolated function problems. SWE-bench is 500+ real GitHub issues in real repositories. Different difficulty, different skill.

HumanEval vs MBPP

MBPP (Mostly Basic Python Problems) has 974 crowd-sourced beginner Python tasks. HumanEval has 164 hand-crafted ones. Similar tier of difficulty.

Gecko's take

HumanEval is the old guard. Every serious coding-model comparison moved to SWE-bench Verified by 2025.

Canonical sources

OpenAI, published alongside the Codex paper in 2021.