
HumanEval

TL;DR

A benchmark where the model writes a Python function from its docstring, scored by whether it passes hidden unit tests.

Level 1

HumanEval was published by OpenAI in 2021 alongside Codex. It contains 164 hand-crafted Python programming problems, each with a function signature, docstring, and several unit tests. The metric is pass@k: the percentage of problems where at least one of k generated solutions passes all tests. Most frontier models now score 90%+ on pass@1, so HumanEval is mostly saturated as a differentiator.
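Each problem pairs a signature and docstring with hidden tests. Here is a toy problem in that style; the task, the reference solution, and the `check` harness below are all hypothetical illustrations, not taken from the actual HumanEval set:

```python
# A toy problem in the HumanEval style (hypothetical example).
# The model sees only the signature and docstring; the tests are hidden.
def count_vowels(s: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in s, case-insensitive."""
    return sum(1 for ch in s.lower() if ch in "aeiou")


# Hidden unit tests take roughly this shape: a check function
# that asserts on the candidate's behavior.
def check(candidate):
    assert candidate("Hello") == 2
    assert candidate("xyz") == 0
    assert candidate("AEIOU") == 5


check(count_vowels)  # a candidate passes only if every assertion holds
```

A generated solution is scored pass/fail: one failed assertion means the problem counts as unsolved for that sample.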

Level 2

HumanEval's 164 problems span string manipulation, list operations, math, and simple algorithms. The test suite is held out, not shown to the model. Pass@1 is the strictest metric: one shot, and the solution must be correct. Pass@10 and pass@100 allow the model multiple attempts. HumanEval+ extends the suite with additional adversarial tests. Frontier models crossed 90% pass@1 in 2023; by 2026 the frontier is 98-99% and the benchmark is used mostly as a sanity check. Real-world coding capability is better measured by SWE-bench, Aider polyglot, or the LiveBench coding subset.

Level 3

Pass@k is estimated from n generated samples using the unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of correct samples. Temperature affects this significantly; most reports use temperatures of 0.2-0.8 and sample 20-100 solutions. Test leakage has been documented: HumanEval problems appear in public training data, limiting its signal. HumanEval+ (2023) adds mutation-based tests to harden the suite. Primary criticism: the one-function scope and academic-toy style don't predict agent or repo-scale coding performance.
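The estimator above can be written directly from the formula; a minimal sketch (function name and guard clause are our own):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: samples that passed all unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k
        # subset of the n samples must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 5 correct, pass@1 is 0.5 but pass@10 is 1.0:
print(pass_at_k(10, 5, 1))   # 0.5
print(pass_at_k(10, 5, 10))  # 1.0
```

The per-problem estimates are then averaged across all 164 problems to get the reported score. Computing the estimate this way, rather than naively running k fresh generations, reuses one pool of n samples for every k.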

Why this matters now

HumanEval is effectively solved. Announcements of new coding-capable models now skip HumanEval and lead with SWE-bench Verified.

The takeaway for you
If you are a Researcher
  • 164 Python problems, held-out unit tests, pass@k metric
  • Saturated at 90%+ pass@1 for frontier models
  • HumanEval+ and MBPP extend the format

If you are a Builder
  • Ignore HumanEval scores: they no longer differentiate
  • Use SWE-bench Verified, Aider polyglot, or LiveBench-coding instead
  • HumanEval 100% on a small model tells you nothing about agent performance

If you are an Investor
  • HumanEval saturation marked the end of the first coding-AI bench era
  • Watch SWE-bench as the next-gen coding signal
  • Any pitch deck still leading with HumanEval is out of date

If you are Curious / a Normie
  • The old-school AI coding test: 164 Python puzzles
  • Almost every AI scores close to perfect now
  • Look at SWE-bench instead for meaningful comparisons
Don't mix them up
HumanEval vs SWE-bench

HumanEval is 164 isolated function problems. SWE-bench is 500+ real GitHub issues in real repositories. Different difficulty, different skill.

HumanEval vs MBPP

MBPP (Mostly Basic Python Problems) has 974 crowd-sourced beginner Python tasks. HumanEval has 164 hand-crafted ones. Similar tier of difficulty.

Gecko's take

HumanEval is the old guard. Every serious coding-model comparison moved to SWE-bench Verified by 2025.

Canonical sources

OpenAI, published alongside the Codex paper in 2021.