Benchmark · Knowledge · Settled

MMLU

Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.

Updated 2025-04-15
Models tested: 67
Top score: 82.9 · DeepSeek V3
Median: 64.5 · min: 1.1
Top-5 spread: σ 0.8
Status: Settled
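The headline stats follow from the score column below. A minimal sketch, assuming "Top-5 spread" is the population standard deviation of the five best scores (an assumption; the site does not define it):

```python
from statistics import pstdev

# Top-5 MMLU scores from the leaderboard below.
top5 = [82.9, 82.0, 81.9, 81.7, 80.4]

# Assumed definition: population standard deviation of the top five.
spread = pstdev(top5)
print(round(spread, 1))  # → 0.8, matching the σ 0.8 shown above
```

The median of 64.5 is likewise just the 34th of the 67 sorted scores (Claude Instant's entry in the table).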

Best score over time · one chart, every benchmark

[Chart: MMLU · 19 models · frontier running max · score (0–100) vs. release date, Jun 24 – Apr 25 · benchgecko.ai/benchmark/mmlu]
Only 19 models have been tested on MMLU · not enough history to compute a frontier yet.
Pink dots = frontier records · 1 total
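The "frontier running max" the chart plots can be sketched as a running maximum over release dates: a model is a frontier record if it beats every earlier release. The dates below are illustrative placeholders, not the site's actual release data:

```python
from datetime import date

def frontier(points):
    """Return the running-max frontier: the (date, model, score)
    entries that set a new best score, scanned chronologically."""
    records, best = [], float("-inf")
    for when, model, score in sorted(points):
        if score > best:
            best = score
            records.append((when, model, score))
    return records

# Hypothetical subset of the leaderboard (dates are illustrative).
runs = [
    (date(2024, 6, 20), "Claude 3.5 Sonnet", 82.0),
    (date(2024, 7, 23), "Llama 3.1 405B", 79.3),
    (date(2024, 12, 6), "Llama 3.3 70B Instruct", 81.7),
    (date(2024, 12, 26), "DeepSeek V3", 82.9),
]
print([m for _, m, _ in frontier(runs)])
# → ['Claude 3.5 Sonnet', 'DeepSeek V3']
```

Only models that raise the running best appear as pink dots; later releases that score lower are plotted but never join the frontier.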

67 models tested · sorted by score

# · Model · Score
1 · DeepSeek V3 · 82.9
2 · Claude 3.5 Sonnet · 82.0
3 · GPT-4 (older v0314) · 81.9
4 · Llama 3.3 70B Instruct (free) · 81.7
5 · Qwen2.5 72B Instruct · 80.4
6 · Phi 4 · 79.7
7 · Claude 3 Opus · 79.5
8 · Llama 3.1 405B · 79.3
9 · GPT-4o (2024-08-06) · 79.1
10 · GPT-4o (2024-11-20) · 79.1
11 · GPT-4o (2024-05-13) · 78.9
12 · Gemini 1.5 Pro (Feb 2024) · 76.9
13 · GPT-4 Turbo · 76.5
14 · Qwen2-72B · 76.5
15 · GPT-4o-mini · 75.7
16 · GPT-4o-mini (2024-07-18) · 75.7
17 · Llama 3.2 90B · 73.7
18 · Llama 3.1 70B Instruct · 73.5
19 · Mistral Large 2407 · 73.3
20 · Gemini 2.0 Flash · 72.9
21 · Llama 3 70B Instruct · 72.4
22 · Qwen2.5 Coder 32B Instruct · 72.1
23 · Claude 2 · 71.3
24 · DeepSeek-V2 (MoE-236B, May 2024) · 71.2
25 · phi-3-medium 14B · 70.7
26 · Gemini 1.5 Flash (May 2024) · 70.5
27 · Mixtral 8x22B Instruct · 70.4
28 · Claude 3 Sonnet · 67.9
29 · Gemma 2 27B · 67.6
30 · phi-3-small 7.4B · 67.6
31 · Claude 3.5 Haiku · 65.7
32 · Claude 3 Haiku · 65.1
33 · Claude 2.1 · 64.7
34 · Claude Instant · 64.5
35 · Gemma 2 9B · 62.8
36 · Falcon-180B · 60.8
37 · Mixtral 8x7B Instruct · 60.8
38 · Gemini 1.0 Pro · 60.0
39 · Llama 3 8B Instruct · 58.4
40 · Mistral Large · 58.4
41 · phi-3-mini 3.8B · 58.4
42 · Stable Beluga 2 · 58.1
43 · Qwen2.5 Coder 7B Instruct · 57.3
44 · GPT-3.5 Turbo (older v0613) · 56.4
45 · Qwen-14B · 55.1
46 · StarCoder 2 15B · 52.1
47 · Yi 6B · 52.0
48 · Mistral 7B V0.1 · 50.0
49 · Nemotron-4 15B · 44.9
50 · Falcon 2 11B · 44.5
51 · Phi 2 · 44.5
52 · Llama 3.1 8B Instruct · 41.5
53 · Llama 2-13B · 40.8
54 · Baichuan 2-7B · 38.9
55 · Qwen2.5 Coder 1.5B Instruct · 38.1
56 · INTELLECT-1 · 33.2
57 · MPT-30B · 30.5
58 · LLaMA-13B · 30.3
59 · Baichuan1-7B · 23.1
60 · Gemma 2B · 23.1
61 · DeepSeek Coder 33B · 19.2
62 · Phi-1.5 · 16.8
63 · DeepSeek Coder 6.7B · 15.2
64 · XGen-7B · 15.1
65 · Cerebras-GPT-13B · 1.6
66 · Dolly 2.0-12b · 1.6
67 · DeepSeek Coder 1.3B · 1.1

Same category · related evaluations