CMMLU
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with CMMLU
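Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship, meaning a model's CMMLU score is a better predictor of its score on the other benchmark.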
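As a rough illustration of the Pearson correlation described above, the sketch below restricts two score tables to the models they share and then computes r. The function names `shared_scores` and `pearson_r` and the dictionary layout (model name mapped to score) are assumptions made for this example; the page does not document how BenchGecko computes the figure.

```python
from math import sqrt


def shared_scores(bench_a, bench_b):
    """Return paired score lists for models present in both benchmarks.

    bench_a and bench_b are hypothetical dicts of model name -> score.
    """
    shared = sorted(bench_a.keys() & bench_b.keys())
    return [bench_a[m] for m in shared], [bench_b[m] for m in shared]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)
```

Only models scored on both benchmarks enter the calculation, so the number of shared models matters as much as the coefficient itself when reading the result.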
Full rankings
8 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Qwen2-72B | 89.7 |
| 2 | | 85.7 |
| 3 | | 71.0 |
| 4 | | 64.4 |
| 5 | | 58.7 |
| 6 | | 41.5 |
| 7 | | 39.8 |
| 8 | | 36.9 |
Frequently asked
Pulled from the CMMLU dataset · updated daily
What does CMMLU measure?
CMMLU is a knowledge benchmark in the BenchGecko catalog. 8 AI models have been tested on it. Scores range from 36.9 to 89.7 out of 100.
Which model leads on CMMLU?
Qwen2-72B from Alibaba Qwen leads CMMLU with a score of 89.7. The median score across 8 tested models is 61.5.
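The range and median quoted above can be checked against the eight scores in the rankings table. The sketch below is a minimal check, assuming the table values are the full set of inputs; the `summarize` helper is a hypothetical name introduced here. With an even number of models the median is the mean of the two middle scores (64.4 and 58.7), which comes to about 61.55 and agrees with the reported 61.5 up to rounding.

```python
from statistics import median

# Scores copied from the rankings table (8 models).
cmmlu_scores = [89.7, 85.7, 71.0, 64.4, 58.7, 41.5, 39.8, 36.9]


def summarize(scores):
    """Return count, top, lowest, and median for a list of scores."""
    ordered = sorted(scores, reverse=True)
    return {
        "count": len(ordered),
        "top": ordered[0],
        "lowest": ordered[-1],
        # For an even count, median() averages the two middle values.
        "median": median(ordered),
    }


summary = summarize(cmmlu_scores)
```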
Is CMMLU saturated?
No · the top score is 89.7 out of 100, which leaves about 10 points of headroom for improvement on CMMLU.
Does CMMLU predict performance on other benchmarks?
Yes · CMMLU scores correlate strongly with BBH (Pearson r = 0.98) across 5 shared models. Models that do well on CMMLU tend to do well on BBH, though five models is a small sample for a correlation estimate.
How often is CMMLU data refreshed?
BenchGecko pulls updates daily. New model scores on CMMLU appear as soon as they are published by Epoch AI or the model provider.