HumanEval
A benchmark where the model writes a Python function from its docstring, scored by whether it passes hidden unit tests.
Basic
HumanEval was published by OpenAI in 2021 alongside Codex. It contains 164 hand-crafted Python programming problems, each with a function signature, docstring, and several unit tests. The metric is pass@k: the percentage of problems where at least one of k generated solutions passes all tests. Most frontier models now score 90%+ on pass@1, so HumanEval is mostly saturated as a differentiator.
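To make the format concrete, here is a sketch of a HumanEval-style task (paraphrased from the benchmark's first problem): the model is given only the signature and docstring, writes the body, and is scored by held-out unit tests.

```python
# The model sees the signature and docstring; everything below the
# docstring here is a model-generated candidate solution.
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )

# Held-out grading (simplified): the candidate passes only if every
# hidden assertion succeeds.
assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```

The real harness runs each candidate in a sandbox against the full hidden test suite; a single failing assertion counts the attempt as incorrect.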
Deep
HumanEval's 164 problems span string manipulation, list operations, math, and simple algorithms. The test suite is held out, not shown to the model. Pass@1 is the strictest metric: a single attempt that must pass every test. Pass@10 and pass@100 allow the model multiple attempts. HumanEval+ extends the suite with additional adversarial tests. Frontier models crossed 90% pass@1 in 2023; by 2026 the frontier is 98-99% and the benchmark is used mostly as a sanity check. Real-world coding capability is better measured by SWE-bench, Aider polyglot, or the LiveBench coding subset.
Expert
Pass@k is estimated from n generated samples using the unbiased estimator 1 − C(n−c, k) / C(n, k), where c is the number of correct samples. Temperature affects this significantly; most reports use temperature 0.2-0.8 and sample 20-100 solutions. Test leakage has been documented: HumanEval problems appear in public training data, limiting the benchmark's signal. HumanEval+ (2023) adds mutation-based tests to harden the suite. Primary criticism: the one-function scope and academic-toy style don't predict agent or repo-scale coding performance.
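The estimator above translates directly into a few lines of Python; this sketch mirrors the formula from the Codex paper, using the standard-library `math.comb` for the binomial coefficients.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n-c, k) / C(n, k): the probability that a random
    draw of k samples (without replacement) contains at least one
    correct solution.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples and c=1 correct, pass@1 is 1 − C(1,1)/C(2,1) = 0.5, matching the intuition that a single random draw succeeds half the time.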
HumanEval is effectively solved. Announcements of new coding-capable models now skip HumanEval and lead with SWE-bench Verified.
Depending on why you're here
- 164 Python problems, held-out unit tests, pass@k metric
- Saturated at 90%+ pass@1 for frontier models
- HumanEval+ and MBPP extend the format
- Ignore HumanEval scores: they no longer differentiate models
- Use SWE-bench Verified, Aider polyglot, or LiveBench-coding instead
- HumanEval 100% on a small model tells you nothing about agent performance
- HumanEval saturation marked the end of the first coding-AI benchmark era
- Watch SWE-bench as the next-gen coding signal
- Any pitch deck still leading with HumanEval is out of date
- The old-school AI coding test: 164 Python puzzles
- Almost every AI scores close to perfect now
- Look at SWE-bench instead for meaningful comparisons
Often confused with
HumanEval is 164 isolated function problems. SWE-bench is 500+ real GitHub issues in real repositories. Different difficulty, different skill.
MBPP (Mostly Basic Python Problems) has 974 crowd-sourced beginner Python tasks. HumanEval has 164 hand-crafted ones. Similar tier of difficulty.
HumanEval is the old guard. Every serious coding-model comparison moved to SWE-bench Verified by 2025.
Read the primary sources
- Codex / HumanEval paper (OpenAI, 2021) — arxiv.org
- HumanEval+: hardened test suite — github.com