OpenCompass · MMLU-Pro
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with OpenCompass · MMLU-Pro
Pearson correlation (r), computed across the models scored on both benchmarks. Values closer to 1 mean scores on one benchmark track scores on the other more closely.
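As a rough illustration of how such a figure can be computed, the sketch below restricts two score tables to the models they share and then computes Pearson r. The function names (`pearson_r`, `correlate_shared`), the model labels, and the scores are hypothetical placeholders, not BenchGecko's code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def correlate_shared(bench_a, bench_b):
    """Correlate two {model: score} dicts over the models present in both."""
    shared = sorted(set(bench_a) & set(bench_b))
    r = pearson_r([bench_a[m] for m in shared], [bench_b[m] for m in shared])
    return r, len(shared)


# Hypothetical placeholder scores, for illustration only.
bench_a = {"model-a": 60.0, "model-b": 70.0, "model-c": 80.0, "model-d": 90.0}
bench_b = {"model-a": 30.0, "model-b": 50.0, "model-c": 70.0, "model-e": 10.0}

r, n_shared = correlate_shared(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Models scored on only one of the two benchmarks are dropped before the correlation is computed, which is why the number of shared models is worth reporting alongside r.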
Full rankings
32 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Qwen3.5 397B A17B | 87.6 |
| 2 | Kimi K2.5 | 86.2 |
| 3 | DeepSeek V3.2 | 85.8 |
| 4 | Gemini 2.5 Pro | 85.8 |
| 5 | DeepSeek V3.2 Speciale | 85.5 |
| 6 | | 85.2 |
| 7 | | 84.3 |
| 8 | | 84.0 |
| 9 | | 83.5 |
| 10 | | 83.5 |
| 11 | | 83.5 |
| 12 | | 83.1 |
| 13 | | 83.0 |
| 14 | | 83.0 |
| 15 | | 82.7 |
| 16 | | 82.0 |
| 17 | | 81.7 |
| 18 | | 81.6 |
| 19 | | 81.3 |
| 20 | | 81.0 |
| 21 | | 79.7 |
| 22 | | 79.5 |
| 23 | | 79.2 |
| 24 | | 78.0 |
| 25 | | 73.9 |
| 26 | | 73.8 |
| 27 | | 72.8 |
| 28 | | 72.8 |
| 29 | | 72.1 |
| 30 | | 70.8 |
| 31 | | 67.8 |
| 32 | | 63.0 |
Frequently asked
Pulled from the OpenCompass · MMLU-Pro dataset · updated daily
What does OpenCompass · MMLU-Pro measure?
OpenCompass · MMLU-Pro is a knowledge benchmark in the BenchGecko catalog. 32 AI models have been tested on it. Scores range from 63.0 to 87.6 out of 100.
Which model leads on OpenCompass · MMLU-Pro?
Qwen3.5 397B A17B from Alibaba Qwen leads OpenCompass · MMLU-Pro with a score of 87.6. The median score across 32 tested models is 81.8.
Is OpenCompass · MMLU-Pro saturated?
No · the top score is 87.6 out of 100, which leaves 12.4 points of headroom on OpenCompass · MMLU-Pro.
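The range, median, and headroom quoted in these answers can be rechecked from the 32 scores in the full rankings. The sketch below is one way to do that; the `summarize` helper is a hypothetical name, and the 100-point ceiling is taken from the "out of 100" wording on this page.

```python
from statistics import median

# Scores copied from the full rankings table on this page (32 rows).
scores = [
    87.6, 86.2, 85.8, 85.8, 85.5, 85.2, 84.3, 84.0,
    83.5, 83.5, 83.5, 83.1, 83.0, 83.0, 82.7, 82.0,
    81.7, 81.6, 81.3, 81.0, 79.7, 79.5, 79.2, 78.0,
    73.9, 73.8, 72.8, 72.8, 72.1, 70.8, 67.8, 63.0,
]


def summarize(values, ceiling=100.0):
    """Return count, min, max, median, and headroom below the ceiling."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "median": median(values),
        # Rounded to one decimal to match the precision of the scores.
        "headroom": round(ceiling - max(values), 1),
    }


summary = summarize(scores)
print(summary)
```

With an even number of models, the median is the midpoint of the 16th and 17th scores (82.0 and 81.7), so the exact value is 81.85; the 81.8 shown on the page presumably reflects how that midpoint is rounded for display.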
Does OpenCompass · MMLU-Pro predict performance on other benchmarks?
Yes · OpenCompass · MMLU-Pro scores have a Pearson correlation of 0.98 with GPQA Diamond across 10 shared models. Models that do well on OpenCompass · MMLU-Pro tend to do well on GPQA Diamond.
How often is OpenCompass · MMLU-Pro data refreshed?
BenchGecko pulls updates daily. New model scores on OpenCompass · MMLU-Pro appear as soon as they are published by Epoch AI or the model provider.
Top on OpenCompass · MMLU-Pro
Qwen3.5 397B A17B · 87.6
Kimi K2.5 · 86.2
DeepSeek V3.2 · 85.8
Gemini 2.5 Pro · 85.8
DeepSeek V3.2 Speciale · 85.5
More knowledge benchmarks
Same category · related evaluations