MMLU
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
67 models tested · sorted by score
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with MMLU
Pearson correlation computed across models scored on both benchmarks. The closer r is to 1, the more strongly one benchmark predicts the other.
Frequently asked
About MMLU
What does MMLU measure?
MMLU (Massive Multitask Language Understanding) tests broad knowledge across 57 subjects spanning STEM, humanities, social sciences, and more. 67 AI models have been tested on it, with scores ranging from 1.1 to 82.9 out of 100.
Which model leads on MMLU?
DeepSeek V3 from DeepSeek leads MMLU with a score of 82.9. The median score across 67 tested models is 64.5.
Is MMLU saturated?
No · the top score is 82.9 out of 100 (83%). There is still meaningful room for improvement on MMLU.
Does MMLU predict performance on other benchmarks?
Yes · MMLU scores correlate 0.99 with ScienceQA across 5 shared models. Models that do well on MMLU tend to do well on ScienceQA.
How often is MMLU data refreshed?
BenchGecko pulls updates daily. New model scores on MMLU appear as soon as they are published by Epoch AI or the model provider.
- Category: Knowledge
- Max score: 100
- Models: 67
- Updated: 2025-04-15
Top on MMLU
- DeepSeek V3 · 82.9
- Claude 3.5 Sonnet · 82.0
- GPT-4 (older v0314) · 81.9
- Llama 3.3 70B Instruct (free) · 81.7
- Qwen2.5 72B Instruct · 80.4

More knowledge benchmarks
Same category · related evaluations