MMLU-PRO
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with MMLU-PRO
Pearson correlation computed across the models scored on both benchmarks. The closer r is to 1, the more strongly a model's score on one benchmark predicts its score on the other.
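As a rough illustration of how such a figure can be computed, the sketch below restricts two score tables to the models present in both and applies the standard Pearson formula. The function names, the dictionary layout, and the model names and scores are all hypothetical placeholders, not BenchGecko's actual pipeline or data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlate_benchmarks(scores_a, scores_b):
    """Return (r, n_shared) using only models scored on both benchmarks.

    scores_a and scores_b map model name -> score (hypothetical layout).
    """
    shared = sorted(scores_a.keys() & scores_b.keys())
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    return pearson_r(xs, ys), len(shared)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    benchmark_a = {"model-a": 10.0, "model-b": 25.0, "model-c": 40.0, "model-d": 52.0}
    benchmark_b = {"model-a": 12.0, "model-b": 30.0, "model-c": 41.0, "model-e": 60.0}
    r, n_shared = correlate_benchmarks(benchmark_a, benchmark_b)
    print(f"r = {r:.2f} across {n_shared} shared models")
```

Models that appear on only one benchmark are dropped before the calculation, which is why the page reports the number of shared models alongside each coefficient.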
Full rankings
73 models tested · sorted by score
Frequently asked
Pulled from the MMLU-PRO dataset · updated daily
What does MMLU-PRO measure?
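MMLU-PRO is a knowledge benchmark in the BenchGecko catalog. It is a harder successor to MMLU, built from multiple-choice questions across many academic and professional subjects. 73 AI models have been tested on it, with scores ranging from 0.3 to 52.6 out of 100.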
Which model leads on MMLU-PRO?
Qwen2-72B from Alibaba Qwen leads MMLU-PRO with a score of 52.6. The median score across 73 tested models is 22.8.
Is MMLU-PRO saturated?
No · the top score is 52.6 out of 100, so nearly half of the scale is still unclaimed and there is meaningful room for improvement on MMLU-PRO.
Does MMLU-PRO predict performance on other benchmarks?
Yes · MMLU-PRO scores have a Pearson correlation of 0.94 with BBH (HuggingFace) across 73 shared models, so models that do well on MMLU-PRO tend to do well on BBH (HuggingFace).
How often is MMLU-PRO data refreshed?
BenchGecko pulls updates daily, so new model scores on MMLU-PRO appear at the next daily refresh after they are published by Epoch AI or the model provider.