MMLU Pro
A harder, more discriminating version of MMLU with 10 answer choices instead of 4 and reasoning-heavy questions.
Basic
MMLU Pro was released in 2024 to address the saturation of the original MMLU benchmark. Frontier models had pushed MMLU scores above 90%, compressing the score range in which models could be differentiated. MMLU Pro uses 12,000+ questions across 14 subjects, with 10 answer choices per question instead of 4 and a heavier emphasis on multi-step reasoning. The top of the board sits around 78-82% in 2026.
Deep
MMLU Pro filters trivial and ambiguous questions from the original MMLU and adds new reasoning-heavy items. Each question has 10 options, cutting the lucky-guess baseline from 25% to 10%. Subject coverage: Math, Physics, Chemistry, Biology, Computer Science, Engineering, Economics, Business, Law, Psychology, Philosophy, History, Health, and Other. The "Math" subset is the hardest; top models score below 70%. Pro reports significantly larger gaps between frontier and mid-tier models than plain MMLU, making it the better discriminator in 2026.
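The effect of moving from 4 to 10 options can be made concrete with a chance-corrected accuracy calculation. This is a standard rescaling for multiple-choice tests, not something the MMLU Pro paper itself reports; the numbers below are illustrative:

```python
def random_guess_baseline(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing."""
    return 1.0 / num_choices

def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale raw accuracy so random guessing maps to 0.0
    and perfect accuracy maps to 1.0."""
    baseline = random_guess_baseline(num_choices)
    return (accuracy - baseline) / (1.0 - baseline)

# The same raw 55% means more on a 10-choice test than a 4-choice one.
print(random_guess_baseline(4))              # 0.25
print(random_guess_baseline(10))             # 0.1
print(round(chance_corrected(0.55, 4), 3))   # 0.4
print(round(chance_corrected(0.55, 10), 3))  # 0.5
```

In other words, a 10-choice format both lowers the floor and stretches the usable score range, which is exactly the compression problem MMLU Pro was built to fix.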
Expert
Construction methodology: Start with the full MMLU corpus. Filter with multiple expert annotators for quality (remove ambiguous, trivial, or label-noise items). Regenerate answer choices with distractor-generation methods that combine semantic similarity and factual plausibility. Validate that the 10-choice format preserves separability. Hard subset: questions where the majority of frontier models answered incorrectly or diverged, which concentrates discriminating power. Frontier correlation: MMLU Pro scores correlate at 0.85+ with Chatbot Arena rankings, better than plain MMLU.
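The rank-correlation claim can be checked with a plain Spearman computation. A minimal sketch, assuming no tied scores; the benchmark accuracies and Arena ratings below are invented for illustration, not real leaderboard data:

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation for tie-free data:
    Pearson correlation applied to the ranks."""
    def ranks(vals: list[float]) -> list[float]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2  # ranks 0..n-1 always average to this
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # rank variance is identical for both
    return cov / var

# Hypothetical MMLU Pro accuracies and Arena ratings for five models.
pro_scores = [0.82, 0.79, 0.74, 0.68, 0.61]
arena_elo = [1350, 1290, 1330, 1260, 1240]
print(spearman(pro_scores, arena_elo))  # 0.9
```

One rank swap among five models still yields 0.9, which is the shape of the claim: high but imperfect agreement between a static benchmark and human preference rankings.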
MMLU Pro has quietly replaced plain MMLU in most new-model launch charts; frontier models now cite MMLU Pro and MMLU-Pro Math alongside GPQA Diamond.
Depending on why you're here
- 14 subjects, 12K+ questions, 10 answer choices
- Filtered for label noise; higher quality than MMLU
- Math subset is the hardest; discriminates the top 5 models
- Use MMLU Pro scores instead of MMLU when choosing a general-purpose model
- Weight the subset relevant to your domain
- Don't use either benchmark for agent workload prediction; that's SWE-bench territory
- MMLU Pro still differentiates frontier labs; MMLU has saturated
- Math subset shows the biggest gap between frontier and mid-tier models
- Correlates well with Arena; a good proxy for consumer preference
- A harder general-knowledge test than the old one
- 10 answer choices instead of 4 means less guessing
- The big-brain test that replaced MMLU
Often confused with
MMLU has 4 answer choices and is saturated at 92%+. MMLU Pro has 10 choices, is reasoning-heavy, and tops out around 82%. Pro is the active benchmark.
If a 2026 model announcement cites plain MMLU instead of MMLU Pro, someone's hiding something.
Read the primary sources
- MMLU Pro paper (2024), arxiv.org