Benchmark · Knowledge

CMMLU

Updated: 2024-09-19
Models tested: 8
Top score: 89.7 (Qwen2-72B)
Median: 61.5 (min 36.9)
Top-5 spread: σ 12.0 (wide open)

Best score over time · one chart, every benchmark

[Chart: CMMLU frontier running max · 3 models. Y-axis: score, 0 to 100 (higher is better). X-axis: release date, Apr 24 to Sep 24. Source: benchgecko.ai/benchmark/cmmlu]
Only 3 models have been tested on CMMLU · not enough history to compute a frontier yet.
Pink dots = frontier records · 1 total

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 61.5
0–10: 0
10–20: 0
20–30: 0
30–40: 2
40–50: 1
50–60: 1
60–70: 1
70–80: 1
80–90: 2
90–100: 0
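The 10-point score buckets in the distribution above can be reproduced with a few lines of Python. This is only a sketch: the function name `bucket_counts` is hypothetical, the rule that a score of exactly 100 falls into the last bucket is an assumption, and the example scores are illustrative apart from the reported minimum (36.9) and top score (89.7).

```python
from collections import Counter


def bucket_counts(scores, width=10, top=100):
    """Count scores per fixed-width bucket on a 0..top scale.

    Assumption: a score equal to `top` is placed in the last bucket
    (e.g. 100 goes into 90-100) instead of opening a new one.
    """
    counts = Counter()
    for score in scores:
        if not 0 <= score <= top:
            raise ValueError(f"score out of range: {score}")
        # Lower edge of the bucket, clamped so `top` lands in the last bucket.
        lower = min(int(score // width) * width, top - width)
        counts[lower] += 1
    # Include empty buckets so the result always spans the full scale.
    return {f"{lo}-{lo + width}": counts.get(lo, 0) for lo in range(0, top, width)}


# 36.9 and 89.7 are the reported min and top score; the middle values are made up.
example_scores = [36.9, 55.0, 72.5, 89.7]
for bucket, count in bucket_counts(example_scores).items():
    print(bucket, count)
```

Returning every bucket, including empty ones, keeps the output aligned with a fixed 0 to 100 axis no matter how few models have been scored.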

Pearson r · original research

Correlation analysis

Benchmarks that track with CMMLU

Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
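As a rough illustration of that calculation, the sketch below computes Pearson r over the models that appear in both score sets. The helper names `shared_scores` and `pearson_r` are hypothetical, and the model names and scores are invented for the example; they are not BenchGecko data.

```python
from math import sqrt


def shared_scores(scores_a, scores_b):
    """Return paired score lists for models present in both dicts."""
    models = sorted(scores_a.keys() & scores_b.keys())
    return [scores_a[m] for m in models], [scores_b[m] for m in models]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented scores, for illustration only.
benchmark_a = {"model-a": 40.0, "model-b": 62.0, "model-c": 75.0, "model-d": 88.0}
benchmark_b = {"model-b": 55.0, "model-c": 70.0, "model-d": 81.0, "model-e": 30.0}

xs, ys = shared_scores(benchmark_a, benchmark_b)
print(f"shared models: {len(xs)}, r = {pearson_r(xs, ys):.2f}")
```

Only models scored on both benchmarks enter the calculation, so a coefficient based on a handful of shared models is a coarse estimate.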

8 models tested · sorted by score

Pulled from the CMMLU dataset · updated daily

What does CMMLU measure?

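CMMLU (Chinese Massive Multitask Language Understanding) is a knowledge benchmark in the BenchGecko catalog that tests models on Chinese-language multiple-choice questions across many subjects. 8 AI models have been tested on it, with scores ranging from 36.9 to 89.7 out of 100.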

Which model leads on CMMLU?

Qwen2-72B from Alibaba Qwen leads CMMLU with a score of 89.7. The median score across 8 tested models is 61.5.

Is CMMLU saturated?

No · the top score is 89.7 out of 100 (90%). There is still meaningful room for improvement on CMMLU.

Does CMMLU predict performance on other benchmarks?

Yes · CMMLU scores have a Pearson correlation of 0.98 with BBH scores across the 5 models tested on both. Models that do well on CMMLU tend to do well on BBH, though 5 shared models is a small sample.

How often is CMMLU data refreshed?

BenchGecko pulls updates daily, so new model scores on CMMLU appear within a day of being published by Epoch AI or the model provider.
