
MATH Level 5

Updated 2025-04-15
Models tested · 70
Top score · 62.5 (Qwen2.5 32B Instruct)
Median · 12.9 (min 0.1)
Top-5 spread · σ 2.5 (competitive)
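
The page does not say how the top-5 spread is computed. As a minimal sketch under that assumption, the population standard deviation of the five highest leaderboard scores reproduces the reported σ of 2.5, while the sample standard deviation gives about 2.8:

```python
from statistics import pstdev, stdev

# Five highest scores copied from the leaderboard on this page.
top5 = [62.5, 60.1, 59.8, 57.0, 55.3]

# Assumption: "σ" means the population standard deviation (divide by N).
population_spread = pstdev(top5)

# Alternative for comparison: sample standard deviation (divide by N - 1).
sample_spread = stdev(top5)

print(round(population_spread, 1))  # 2.5
print(round(sample_spread, 1))      # 2.8
```

A small spread means the leading models are close together, which is presumably what the "competitive" label refers to.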

Best score over time · one chart, every benchmark

[Chart: MATH Level 5 · 46 models · frontier running max. Y axis: score (0 to 100, higher is better). X axis: release date (Apr 24 to Apr 25). Source: benchgecko.ai/benchmark/hf-math-lvl5]
Frontier on MATH Level 5 rose from 27.6 to 62.5 in 4 months · +34.9 points · latest leader Qwen2.5 32B Instruct from Alibaba.
Pink dots = frontier records · 4 total
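
A "frontier running max" can be sketched as follows: walk the models in release order and keep only those that beat every earlier score. The page does not list release dates, so the model names, dates and scores below are hypothetical placeholders, and the function name `frontier_records` is illustrative only.

```python
from datetime import date

def frontier_records(results):
    """Return the (release_date, model, score) rows that set a new best score."""
    records = []
    best = float("-inf")
    # Process models in release order; a tie with the current best is not a record.
    for released, model, score in sorted(results, key=lambda row: row[0]):
        if score > best:
            best = score
            records.append((released, model, score))
    return records

# Hypothetical data for illustration, not taken from the page.
results = [
    (date(2024, 6, 1), "model-a", 20.0),
    (date(2024, 7, 1), "model-b", 15.0),   # below the frontier
    (date(2024, 8, 1), "model-c", 35.0),   # new record
    (date(2024, 9, 1), "model-d", 35.0),   # tie, not a record
    (date(2024, 10, 1), "model-e", 50.0),  # new record
]

records = frontier_records(results)
for released, model, score in records:
    print(released.isoformat(), model, score)

# Gain between the first and the latest frontier record.
gain = records[-1][2] - records[0][2]
print("frontier gain:", gain)
```

Applied to the real release dates, the same logic would yield the four frontier records and the +34.9 point gain quoted above.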

Where models cluster

[Chart: score distribution · models per score bucket (0 to 100) · median 12.9 · benchgecko.ai]
0–10 · 34
10–20 · 14
20–30 · 7
30–40 · 6
40–50 · 2
50–60 · 5
60–70 · 2
70–80 · 0
80–90 · 0
90–100 · 0
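
The bucket counts and the median can be recomputed from the leaderboard scores. This is a sketch that assumes each bucket is half-open (a score of exactly 20.0 falls in 20–30); that choice reproduces the counts shown in the chart. The helper name `bucket_counts` is illustrative.

```python
from statistics import median

# Scores copied from the 70-row leaderboard on this page, highest first.
scores = [
    62.5, 60.1, 59.8, 57.0, 55.3, 50.0, 50.0, 49.5, 48.3, 38.1,
    37.2, 36.8, 32.5, 31.1, 30.7, 27.6, 25.0, 23.9, 22.1, 22.0,
    21.0, 20.0, 19.9, 19.6, 19.6, 19.5, 18.7, 18.6, 17.7, 17.1,
    17.0, 16.9, 16.4, 16.1, 15.5, 10.3, 8.7, 8.4, 8.2, 7.4,
    7.2, 6.0, 5.1, 4.4, 3.9, 3.1, 3.0, 3.0, 3.0, 3.0,
    2.9, 2.8, 2.6, 2.3, 2.0, 1.9, 1.8, 1.7, 1.6, 1.5,
    1.4, 1.2, 1.2, 0.9, 0.8, 0.6, 0.6, 0.3, 0.2, 0.1,
]

def bucket_counts(values, width=10, top=100):
    """Count values per half-open bucket [k*width, (k+1)*width)."""
    counts = [0] * (top // width)
    for value in values:
        # A perfect score of `top` is clamped into the last bucket.
        index = min(int(value // width), len(counts) - 1)
        counts[index] += 1
    return counts

counts = bucket_counts(scores)
for i, n in enumerate(counts):
    print(f"{i * 10}-{i * 10 + 10}: {n}")

# With 70 scores the median is the mean of the 35th and 36th values.
print("median:", round(median(scores), 1))
```

Half of the tested models score below 12.9, and 34 of the 70 sit in the lowest bucket.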

Pearson r · original research

Correlation analysis

Benchmarks that track with MATH Level 5

Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
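
As a rough sketch of the calculation described above: restrict both benchmarks to the models they share, then compute Pearson r on the paired scores. The page does not publish the paired data, so the second benchmark, its scores and all names below (`pearson_r`, `bench_a`, `bench_b`) are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores keyed by model name, for illustration only.
bench_a = {"model-a": 62.0, "model-b": 48.0, "model-c": 20.0, "model-d": 3.0}
bench_b = {"model-a": 70.0, "model-b": 55.0, "model-c": 31.0, "model-e": 12.0}

# Only models scored on both benchmarks enter the correlation.
shared = sorted(set(bench_a) & set(bench_b))
r = pearson_r([bench_a[m] for m in shared], [bench_b[m] for m in shared])
print(len(shared), "shared models, r =", round(r, 2))
```

With few shared models the estimate is coarse, so a high r from a handful of points should be read with caution.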

70 models tested · sorted by score

# | Model | Provider | Score
1 | Qwen2.5 32B Instruct | Alibaba | 62.5
2 | Qwen2.5 72B Instruct Abliterated | - | 60.1
3 | Qwen2.5 72B Instruct | Alibaba Qwen | 59.8
4 | DeepSeek R1 Distill Qwen 14B | DeepSeek | 57.0
5 | Qwen2.5 14B Instruct | Alibaba | 55.3
6 | Phi 4 | Microsoft | 50.0
7 | Qwen2.5 7B Instruct | Alibaba Qwen | 50.0
8 | Qwen2.5 Coder 32B Instruct | Alibaba Qwen | 49.5
9 | Llama 3.3 70B Instruct | Meta | 48.3
10 | Llama 3.1 70B Instruct | Meta | 38.1
11 | Qwen2.5 Coder 7B Instruct | Alibaba Qwen | 37.2
12 | Qwen2.5 3B Instruct | Alibaba | 36.8
13 | Qwen2.5 Coder 14B Instruct | Alibaba | 32.5
14 | Qwen2-72B | Alibaba Qwen | 31.1
15 | R1 Distill Llama 70B | DeepSeek | 30.7
16 | Qwen2 7B Instruct | Alibaba | 27.6
17 | WizardLM-2 8x22B | Microsoft | 25.0
18 | Gemma 2 27B | Google DeepMind | 23.9
19 | Qwen2.5 1.5B Instruct | Alibaba | 22.1
20 | DeepSeek R1 Distill Llama 8B | DeepSeek | 22.0
21 | Hermes 3 70B Instruct | nousresearch | 21.0
22 | Magnum v4 72B | anthracite-org | 20.0
23 | Qwen2 VL 7B Instruct | Alibaba | 19.9
24 | Phi 3.5 Mini Instruct | Microsoft | 19.6
25 | DeepSeek R1 Distill Qwen 7B | DeepSeek | 19.6
26 | Gemma 2 9B | Google DeepMind | 19.5
27 | Dolphin 2.9.1 Yi 1.5 34b | - | 18.7
28 | Meta Llama 3 8B | Meta | 18.6
29 | Llama 3.2 3B Instruct | Meta | 17.7
30 | R1 Distill Qwen 32B | DeepSeek | 17.1
31 | Phi 4 Mini Instruct | Microsoft | 17.0
32 | DeepSeek R1 Distill Qwen 1.5B | DeepSeek | 16.9
33 | Phi 3 Mini 4k Instruct | Microsoft | 16.4
34 | QwQ 32B | Alibaba Qwen | 16.1
35 | Llama 3.1 8B Instruct | Meta | 15.5
36 | Qwen2.5 0.5B Instruct | Alibaba | 10.3
37 | Meta Llama 3 8B Instruct | Meta | 8.7
38 | Hermes 2 Pro - Llama-3 8B | nousresearch | 8.4
39 | Llama 3.2 1B Instruct | Meta | 8.2
40 | Gemma 2B | Google DeepMind | 7.4
41 | Qwen2 1.5B Instruct | Alibaba | 7.2
42 | StarCoder 2 15B | - | 6.0
43 | Yi 6B | - | 5.1
44 | Stable Beluga 2 | - | 4.4
45 | Llama 3 8B Instruct | Meta | 3.9
46 | LLaMA-13B | Meta | 3.1
47 | Gemma 2 2b | Google DeepMind | 3.0
48 | Mistral 7B Instruct V0.2 | Mistral AI | 3.0
49 | Mistral 7B V0.1 | Mistral AI | 3.0
50 | Phi 2 | Microsoft | 3.0
51 | Qwen2 0.5B Instruct | Alibaba | 2.9
52 | Falcon-180B | TII | 2.8
53 | Qwen2 0.5B | Alibaba | 2.6
54 | Mistral 7B Instruct v0.1 | Mistral AI | 2.3
55 | Llama 2 7b Chat Hf | Meta | 2.0
56 | Llama 3.2 3B Instruct (free) | Meta | 1.9
57 | Phi-1.5 | Microsoft | 1.8
58 | Llama 2 7b Hf | Meta | 1.7
59 | MPT-30B | - | 1.6
60 | TinyLlama 1.1B Chat V1.0 | - | 1.5
61 | Vicuna 7b V1.5 | - | 1.4
62 | Gpt2 Large | OpenAI | 1.2
63 | SmolLM2 135M | - | 1.2
64 | Pythia 160m | eleutherai | 0.9
65 | Gpt2 Medium | OpenAI | 0.8
66 | Distilgpt2 | - | 0.6
67 | Gpt Neo 125m | eleutherai | 0.6
68 | SmolLM2 135M Instruct | - | 0.3
69 | Gpt2 | OpenAI | 0.2
70 | Gemma 2 2b It | Google DeepMind | 0.1

Pulled from the MATH Level 5 dataset · updated daily

What does MATH Level 5 measure?

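MATH Level 5 covers the hardest difficulty tier (Level 5) of the MATH dataset of competition-style mathematics problems, so it measures multi-step mathematical problem solving. BenchGecko files it under knowledge benchmarks. 70 AI models have been tested on it, and scores range from 0.1 to 62.5 out of 100.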

Which model leads on MATH Level 5?

Qwen2.5 32B Instruct from Alibaba leads MATH Level 5 with a score of 62.5. The median score across 70 tested models is 12.9.

Is MATH Level 5 saturated?

No · the top score is 62.5 out of 100, which leaves 37.5 points of headroom, and the median is only 12.9. There is still meaningful room for improvement on MATH Level 5.

Does MATH Level 5 predict performance on other benchmarks?

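Yes, with a caveat · MATH Level 5 scores correlate 0.98 (Pearson r) with the catalog's separately listed "MATH level 5" entry across 9 shared models. Models that do well on one tend to do well on the other, but the two entries share a name and the sample is only 9 models, so this says little about how well MATH Level 5 predicts unrelated benchmarks.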

How often is MATH Level 5 data refreshed?

BenchGecko pulls updates daily. New model scores on MATH Level 5 appear as soon as they are published by Epoch AI or the model provider.
