Benchmark · Knowledge

CMMLU

Updated 2024-09-19

Models tested: 8

Top score: 89.7 (Qwen2-72B)

Median: 61.5 (min 36.9)

Top-5 spread: σ 12.0 (wide open)
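
The summary figures above can be re-derived from the eight scores in the full rankings. The sketch below is a minimal check, not BenchGecko's actual pipeline. It assumes the "Top-5 spread" is the population standard deviation of the five highest scores, a choice that reproduces the displayed σ 12.0 but is not stated on the page. All variable names are illustrative.

```python
from statistics import median, pstdev

# Scores copied from the full rankings on this page.
scores = {
    "Qwen2-72B": 89.7,
    "Qwen2.5 72B Instruct": 85.7,
    "GPT-4 Turbo": 71.0,
    "Llama 3.1 70B Instruct": 64.4,
    "Qwen-14B": 58.7,
    "Falcon-180B": 41.5,
    "LLaMA-13B": 39.8,
    "Llama 3 70B Instruct": 36.9,
}

# Highest score first, matching the page's sort order.
ranked = sorted(scores.values(), reverse=True)

summary = {
    "models_tested": len(ranked),
    "top_model": max(scores, key=scores.get),
    "top_score": ranked[0],
    "min_score": ranked[-1],
    # With an even number of models, the median is the midpoint
    # of the two central scores.
    "median": median(ranked),
    # Assumption: spread = population standard deviation of the top five.
    "top5_spread": pstdev(ranked[:5]),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The midpoint of the two central scores (64.4 and 58.7) is 61.55, which the page displays as 61.5.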

The Frontier

Best score over time · one chart, every benchmark

Only 3 models have been tested on CMMLU · not enough history to compute a frontier yet.

Distribution

Where models cluster

Correlated benchmarks

Pearson r · original research

Correlation analysis

Benchmarks that track with CMMLU

Pearson correlation computed across the models that have a score on both benchmarks. The closer r is to 1, the more strongly scores on the two benchmarks rise together.

MMLU-PRO · Knowledge

BBH (HuggingFace) · Knowledge

MATH Level 5 · Knowledge
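
A minimal sketch of the correlation described above: restrict both benchmarks to the models they share, then compute Pearson r on the paired scores. The function names (`pearson_r`, `shared_model_correlation`) and the toy score dictionaries are hypothetical illustrations, not BenchGecko's code or real benchmark results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def shared_model_correlation(bench_a, bench_b):
    """Return (r, n) using only models that appear in both score dicts."""
    shared = sorted(set(bench_a) & set(bench_b))
    r = pearson_r([bench_a[m] for m in shared], [bench_b[m] for m in shared])
    return r, len(shared)


# Hypothetical placeholder scores, NOT real benchmark data.
toy_bench_a = {"model_a": 90.0, "model_b": 80.0, "model_c": 70.0,
               "model_d": 60.0, "model_e": 50.0, "model_f": 40.0}
toy_bench_b = {"model_a": 62.0, "model_b": 55.0, "model_c": 51.0,
               "model_d": 40.0, "model_e": 37.0, "model_g": 20.0}

r, n_shared = shared_model_correlation(toy_bench_a, toy_bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Because only shared models enter the calculation, the number of pairs can be much smaller than the number of models tested on either benchmark, so it is worth reporting alongside r.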

Full rankings

8 models tested · sorted by score

#	Model	Provider	Score	Price
1	Qwen2-72B	Alibaba Qwen	89.7	n/a
2	Qwen2.5 72B Instruct	Alibaba Qwen	85.7	$0.12
3	GPT-4 Turbo	OpenAI	71.0	$10.00
4	Llama 3.1 70B Instruct	Meta	64.4	$0.40
5	Qwen-14B	Alibaba Qwen	58.7	n/a
6	Falcon-180B	TII	41.5	n/a
7	LLaMA-13B	Meta	39.8	n/a
8	Llama 3 70B Instruct	Meta	36.9	$0.51

Frequently asked

Pulled from the CMMLU dataset · updated daily

What does CMMLU measure?

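CMMLU (Chinese Massive Multitask Language Understanding) is a knowledge benchmark in the BenchGecko catalog that tests models on multiple-choice questions in Chinese. 8 AI models have been tested on it. Scores range from 36.9 to 89.7 out of 100.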

Which model leads on CMMLU?

Qwen2-72B from Alibaba Qwen leads CMMLU with a score of 89.7. The median score across 8 tested models is 61.5.

Is CMMLU saturated?

No · the top score is 89.7 out of 100 (90%). There is still meaningful room for improvement on CMMLU.

Does CMMLU predict performance on other benchmarks?

Yes · CMMLU scores have a Pearson correlation of 0.98 with BBH across 5 shared models. Models that do well on CMMLU tend to do well on BBH.

How often is CMMLU data refreshed?

BenchGecko pulls updates daily. New model scores on CMMLU appear as soon as they are published by Epoch AI or the model provider.
