MMLU-PRO
The Frontier
Best score over time · one chart, every benchmark
Full rankings
73 models tested · sorted by score
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with MMLU-PRO
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
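As a minimal sketch of how such a figure could be computed, the snippet below restricts two score tables to the models present in both and then takes the Pearson r of the paired scores. The model names, the scores, and the helper names `pearson_r` and `correlate_benchmarks` are hypothetical placeholders for illustration. They are not BenchGecko data or its actual pipeline.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def correlate_benchmarks(scores_a, scores_b):
    """Return (r, n) using only models that have a score on both benchmarks."""
    shared = sorted(set(scores_a) & set(scores_b))
    r = pearson_r([scores_a[m] for m in shared], [scores_b[m] for m in shared])
    return r, len(shared)


# Hypothetical scores, keyed by a made-up model name.
mmlu_pro = {"model_a": 10.0, "model_b": 20.0, "model_c": 30.0, "model_d": 40.0}
bbh = {"model_a": 15.0, "model_b": 25.0, "model_c": 35.0, "model_e": 50.0}

r, n = correlate_benchmarks(mmlu_pro, bbh)
print(f"r = {r:.2f} across {n} shared models")
```

Models scored on only one of the two benchmarks (`model_d` and `model_e` here) are dropped before the calculation, so the reported model count is the size of the overlap, not of either table.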
Frequently asked
About MMLU-PRO
What does MMLU-PRO measure?
MMLU-PRO is a knowledge benchmark in the BenchGecko catalog. 73 AI models have been tested on it. Scores range from 0.3 to 52.6 out of 100.
Which model leads on MMLU-PRO?
Qwen2-72B from Alibaba Qwen leads MMLU-PRO with a score of 52.6. The median score across 73 tested models is 22.8.
Is MMLU-PRO saturated?
No · the top score is 52.6 out of 100 (53%). There is still meaningful room for improvement on MMLU-PRO.
Does MMLU-PRO predict performance on other benchmarks?
Yes · MMLU-PRO scores have a Pearson correlation of 0.94 with BBH (HuggingFace) scores across the 73 models tested on both. Models that do well on MMLU-PRO tend to do well on BBH (HuggingFace).
How often is MMLU-PRO data refreshed?
BenchGecko pulls updates daily. New model scores on MMLU-PRO appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 73
- Updated: 2025-07-24
Top on MMLU-PRO
- Qwen2-72B · 52.6
- Qwen2.5 32B Instruct · 51.9
- Qwen2.5 72B Instruct · 51.4
- Qwen2.5 72B Instruct Abliterated · 50.4
- Phi 4 · 48.6
More knowledge benchmarks
Same category · related evaluations