Benchmarks
How models are measured · SWE-bench, GPQA, MMLU.
Top 12 terms
SWE-bench · The real-world coding benchmark where AI resolves actual GitHub issues in open-source Python repos (a patch-and-test sketch follows this list).
HumanEval · A 164-problem Python benchmark where the model writes a function from its docstring and must pass its unit tests (an illustrative task follows this list).
LiveBench · A contamination-resistant benchmark that refreshes its tasks monthly to prevent models from memorizing the answers.
Chatbot Arena · Crowdsourced head-to-head AI model comparison: humans vote on anonymous outputs and Elo ratings rank the models (the rating update is sketched after this list).
MMLU-Pro · A harder version of MMLU with 10 answer choices per question, filtered noise, and more reasoning-heavy questions.
AI2 ARC (AI2 Reasoning Challenge) is a grade-school science question benchmark tracked by BenchGecko across every frontier and open-weight model.
BBH (BIG-Bench Hard) is a reasoning benchmark of 23 challenging BIG-Bench tasks, tracked by BenchGecko across every frontier and open-weight model.
GSM8K is a grade-school math word-problem benchmark tracked by BenchGecko across every frontier and open-weight model.
HellaSwag is a commonsense sentence-completion benchmark tracked by BenchGecko across every frontier and open-weight model.
LAMBADA is a long-range language-modeling benchmark (predict a passage's final word) tracked by BenchGecko across every frontier and open-weight model.
MMLU (Massive Multitask Language Understanding) is a 57-subject knowledge benchmark tracked by BenchGecko across every frontier and open-weight model.
GPQA Diamond is a graduate-level, "Google-proof" science question benchmark (the hardest GPQA subset) tracked by BenchGecko across every frontier and open-weight model.
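To make the SWE-bench mechanics concrete, here is a minimal sketch of the resolve-and-verify loop, assuming a checked-out repo, a model-generated patch file, and a known failing-test command. The function name, paths, and test selection are illustrative, not the official harness.

```python
import subprocess

def resolves_issue(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Illustrative SWE-bench-style check: apply a model-generated patch
    to a repo checkout, then rerun the tests tied to the GitHub issue."""
    # Apply the candidate patch (hypothetical file layout, not the real harness).
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False  # patch does not apply cleanly
    # The issue counts as resolved only if the previously failing tests now pass.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Example usage (repo, patch name, and test path are made up for illustration):
# resolves_issue("astropy", "model.patch", ["pytest", "astropy/io/tests", "-q"])
```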
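The HumanEval setup is easiest to see in miniature: the model receives a signature plus docstring and must produce a body that passes hidden unit tests (scored as pass@k over samples). The toy problem below is illustrative, not one of the actual 164 tasks.

```python
# Prompt given to the model: a function signature plus docstring.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i]."""
    # --- model-generated completion starts here ---
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Hidden unit tests then decide pass/fail for each sampled completion.
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_max([]) == []
```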
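For the arena-style ranking, a single vote nudges ratings via the standard Elo update shown below; the K-factor and starting ratings are illustrative, and published arena leaderboards may fit ratings statistically over all votes rather than updating them one at a time.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One online Elo update after a head-to-head vote.
    K controls how far a single vote moves the ratings (32 is illustrative)."""
    # Expected score of the winner under the Elo model.
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)  # winner scored 1, was expected to score `expected`
    return r_winner + delta, r_loser - delta

# An upset (lower-rated model wins) moves ratings more than an expected result:
print(elo_update(1000.0, 1200.0))  # roughly (1024.3, 1175.7)
```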