
MMLU-PRO

Updated 2025-07-24
Models tested
73
Top score
52.6
Qwen2-72B
Median
22.8
min 0.3
Top-5 spread
σ 1.4
settled

Best score over time · one chart, every benchmark

[Chart · best score over time on MMLU-PRO · 48 models · frontier running max · score (0–100) vs release date (Apr 24 – Jul 25) · benchgecko.ai/benchmark/hf-mmlu-pro]
Frontier on MMLU-PRO rose from 41.2 to 51.9 in 5 months · +10.7 points · latest leader Qwen2.5 32B Instruct from Alibaba.
Pink dots = frontier records · 3 total
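The frontier line in the chart is just a running maximum of the best score by release date: a model sets a frontier record only if it beats every earlier release. A minimal sketch in Python — the model names, dates, and scores below are illustrative, not actual BenchGecko data:

```python
from datetime import date

def frontier(runs):
    """Return the (release_date, model, score) records that set a new best
    score when runs are replayed in release order. `runs` may be unsorted."""
    best = float("-inf")
    records = []
    for released, model, score in sorted(runs):
        if score > best:
            best = score
            records.append((released, model, score))
    return records

# Illustrative entries, not actual BenchGecko data.
runs = [
    (date(2024, 4, 1), "model-a", 41.2),
    (date(2024, 8, 15), "model-b", 39.0),   # below the frontier, no record
    (date(2024, 12, 2), "model-c", 47.5),
    (date(2025, 7, 1), "model-d", 51.9),
]
for record in frontier(runs):
    print(record)
```

With these inputs, model-b never appears in the output because it released below the running maximum, which is exactly why a frontier chart can show fewer dots than models tested.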

Where models cluster

Score distribution (73 models, ten-point buckets) · 0–10: 20 · 10–20: 14 · 20–30: 10 · 30–40: 16 · 40–50: 9 · 50–60: 4 · 60–100: 0 · Median 22.8
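As a sketch, the bucket counts and median above can be reproduced by binning scores into ten-point ranges. The short score list below is illustrative, not the full 73-model set, and half-open buckets mean boundary values such as 20.0 may land differently than on the site:

```python
def bucket_counts(scores, width=10, top=100):
    """Count scores per [lo, lo + width) bucket; the top bucket includes `top`."""
    counts = {lo: 0 for lo in range(0, top, width)}
    for s in scores:
        lo = min(int(s // width) * width, top - width)
        counts[lo] += 1
    return counts

def median(scores):
    """Middle value of the sorted scores (mean of the two middle values if even)."""
    s = sorted(scores)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

# Illustrative subset of scores, not the full 73-model list.
scores = [52.6, 48.6, 22.8, 15.4, 8.0, 2.1, 0.3]
print(bucket_counts(scores))
print(median(scores))
```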

Pearson r · original research

Correlation analysis

Benchmarks that track with MMLU-PRO

Pearson correlation across models scored on both benchmarks. Values closer to 1 = more strongly predictive.
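A minimal sketch of the Pearson r computation over the models scored on both benchmarks — the paired score lists below are illustrative, not BenchGecko data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative paired scores for models evaluated on both benchmarks.
mmlu_pro = [52.6, 48.6, 41.2, 22.8, 8.0]
other_bench = [61.0, 55.2, 47.9, 30.1, 12.4]
print(round(pearson_r(mmlu_pro, other_bench), 3))
```

An r near 1 means a model's rank on one benchmark is a good linear predictor of its rank on the other; an r near 0 means the two benchmarks measure largely unrelated abilities.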

73 models tested · sorted by score

# · Model · Score
1 · Qwen2-72B (Alibaba Qwen) · 52.6
2 · Qwen2.5 32B Instruct (Alibaba) · 51.9
3 · Qwen2.5 72B Instruct (Alibaba Qwen) · 51.4
4 · Qwen2.5 72B Instruct Abliterated · 50.4
5 · Phi 4 (Microsoft) · 48.6
6 · Llama 3.3 70B Instruct (Meta) · 48.1
7 · Llama 3.1 70B Instruct (Meta) · 47.9
8 · Qwen2.5 14B Instruct (Alibaba) · 43.2
9 · R1 Distill Llama 70B (DeepSeek) · 41.6
10 · Hermes 3 70B Instruct (nousresearch) · 41.4
11 · Meta Llama 3 8B (Meta) · 41.2
12 · R1 Distill Qwen 32B (DeepSeek) · 41.0
13 · DeepSeek R1 Distill Qwen 14B (DeepSeek) · 40.7
14 · WizardLM-2 8x22B (Microsoft) · 40.0
15 · Dolphin 2.9.1 Yi 1.5 34b · 39.1
16 · Gemma 2 27B (Google DeepMind) · 38.4
17 · Qwen2.5 Coder 32B Instruct (Alibaba Qwen) · 37.9
18 · Yi 6B · 37.9
19 · Qwen2.5 7B Instruct (Alibaba Qwen) · 36.5
20 · GLM 4 32B (z-ai) · 34.9
21 · Qwen2 VL 7B Instruct (Alibaba) · 34.4
22 · Phi 3 Mini 4k Instruct (Microsoft) · 33.6
23 · Phi 3.5 Mini Instruct (Microsoft) · 32.9
24 · Qwen2.5 Coder 14B Instruct (Alibaba) · 32.7
25 · Phi 4 Mini Instruct (Microsoft) · 32.6
26 · Gemma 2 9B (Google DeepMind) · 31.9
27 · Qwen2 7B Instruct (Alibaba) · 31.6
28 · Magnum v4 72B (anthracite-org) · 31.4
29 · Llama 3.1 8B Instruct (Meta) · 30.9
30 · Meta Llama 3 8B Instruct (Meta) · 29.6
31 · Qwen2.5 Coder 7B Instruct (Alibaba Qwen) · 26.1
32 · Stable Beluga 2 · 25.9
33 · Llama 3.1 405B (Meta) · 25.7
34 · Qwen2.5 3B Instruct (Alibaba) · 25.1
35 · Llama 3.2 3B Instruct (Meta) · 24.4
36 · LLaMA-13B (Meta) · 23.1
37 · Hermes 2 Pro - Llama-3 8B (nousresearch) · 22.8
38 · Mistral 7B V0.1 (Mistral AI) · 22.4
39 · Gemma 2B (Google DeepMind) · 21.6
40 · Qwen2.5 1.5B Instruct (Alibaba) · 20.0
41 · Mistral 7B Instruct V0.2 (Mistral AI) · 19.1
42 · Phi 2 (Microsoft) · 18.1
43 · Llama 3 8B Instruct (Meta) · 17.8
44 · Gemma 2 2b It (Google DeepMind) · 17.2
45 · Qwen2 1.5B Instruct (Alibaba) · 16.7
46 · Llama 3.2 3B Instruct (free) (Meta) · 16.5
47 · Mistral 7B Instruct v0.1 (Mistral AI) · 15.7
48 · Falcon-180B (TII) · 15.4
49 · StarCoder 2 15B · 15.0
50 · DeepSeek R1 Distill Qwen 7B (DeepSeek) · 14.7
51 · Gemma 2 2b (Google DeepMind) · 13.5
52 · Vicuna 7b V1.5 · 12.7
53 · DeepSeek R1 Distill Llama 8B (DeepSeek) · 12.1
54 · Llama 2 7b Hf (Meta) · 9.6
55 · Llama 3.2 1B Instruct (Meta) · 8.2
56 · Qwen2 0.5B (Alibaba) · 8.0
57 · Qwen2.5 0.5B Instruct (Alibaba) · 8.0
58 · Phi-1.5 (Microsoft) · 7.7
59 · Llama 2 7b Chat Hf (Meta) · 7.6
60 · Qwen2 0.5B Instruct (Alibaba) · 5.9
61 · MPT-30B · 2.3
62 · QwQ 32B (Alibaba Qwen) · 2.2
63 · DeepSeek R1 Distill Qwen 1.5B (DeepSeek) · 2.1
64 · Distilgpt2 · 2.1
65 · Gpt2 Medium (OpenAI) · 2.0
66 · Gpt2 (OpenAI) · 1.8
67 · Gpt2 Large (OpenAI) · 1.6
68 · INTELLECT-1 · 1.3
69 · Pythia 160m (eleutherai) · 1.3
70 · SmolLM2 135M Instruct · 1.3
71 · TinyLlama 1.1B Chat V1.0 · 1.1
72 · SmolLM2 135M · 1.1
73 · Gpt Neo 125m (eleutherai) · 0.3

Pulled from the MMLU-PRO dataset · updated daily

What does MMLU-PRO measure?

MMLU-PRO is a knowledge benchmark in the BenchGecko catalog. It extends the original MMLU with harder, more reasoning-focused questions and ten answer choices per question instead of four. 73 AI models have been tested on it; scores range from 0.3 to 52.6 out of 100.

Which model leads on MMLU-PRO?

Qwen2-72B from Alibaba Qwen leads MMLU-PRO with a score of 52.6. The median score across 73 tested models is 22.8.

Is MMLU-PRO saturated?

No · the top score is 52.6 out of 100, and the median across tested models is 22.8. There is still meaningful room for improvement on MMLU-PRO.

Does MMLU-PRO predict performance on other benchmarks?

Yes · MMLU-PRO scores correlate 0.94 with BBH (HuggingFace) across 73 shared models. Models that do well on MMLU-PRO tend to do well on BBH too.

How often is MMLU-PRO data refreshed?

BenchGecko pulls updates daily. New model scores on MMLU-PRO appear as soon as they are published by Epoch AI or the model provider.

Same category · related evaluations