MMLU
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
67 models tested · sorted by score
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with MMLU
Pearson correlation computed across models scored on both benchmarks. The closer r is to 1, the more strongly one benchmark predicts the other.
Frequently asked
About MMLU
What does MMLU measure?
MMLU (Massive Multitask Language Understanding) tests broad knowledge across 57 subjects spanning STEM, humanities, social sciences, and more. 67 AI models have been tested on it, with scores ranging from 1.1 to 82.9 out of 100.
Which model leads on MMLU?
DeepSeek V3 from DeepSeek leads MMLU with a score of 82.9. The median score across 67 tested models is 64.5.
Is MMLU saturated?
No · the top score is 82.9 out of 100 (83%). There is still meaningful room for improvement on MMLU.
Does MMLU predict performance on other benchmarks?
Yes · MMLU scores correlate 0.99 with ScienceQA across 5 shared models. Models that do well on MMLU tend to do well on ScienceQA.
How often is MMLU data refreshed?
BenchGecko pulls updates daily. New model scores on MMLU appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 67
- Updated: 2025-04-15
Top on MMLU
- DeepSeek V3 · 82.9
- Claude 3.5 Sonnet · 82.0
- GPT-4 (older v0314) · 81.9
- Llama 3.3 70B Instruct (free) · 81.7
- Qwen2.5 72B Instruct · 80.4

More knowledge benchmarks
Same category · related evaluations