
MMLU Pro

A harder, more discriminating version of MMLU with 10 answer choices instead of 4 and reasoning-heavy questions.

TL;DR

MMLU Pro is a harder, more discriminating successor to MMLU: 12,000+ questions across 14 subjects, 10 answer choices per question instead of 4, and a heavy emphasis on multi-step reasoning.

Level 1

MMLU Pro was released in 2024 to address the saturation of the original MMLU benchmark. Frontier models had pushed MMLU scores above 90%, compressing the range in which models could be differentiated. MMLU Pro uses 12,000+ questions across 14 subjects, with 10 answer choices per question instead of 4 and a heavier emphasis on multi-step reasoning. The top of the board sits around 78-82% in 2026.

Level 2

MMLU Pro filters trivial and ambiguous questions out of the original MMLU and adds new reasoning-heavy items. Each question has 10 options, lowering the random-guess baseline from 25% to 10%. Subject coverage: Math, Physics, Chemistry, Biology, Computer Science, Engineering, Economics, Business, Law, Psychology, Philosophy, History, Health, and Other. The Math subset is the hardest · top models score below 70%. MMLU Pro shows significantly larger gaps between frontier and mid-tier models than plain MMLU, making it the better discriminator in 2026.
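The effect of moving from 4 to 10 options is easy to quantify. A minimal sketch: `guess_baseline` is the expected accuracy of uniform random guessing, and `guess_corrected` is a standard chance-correction formula (a hypothetical helper for illustration, not part of MMLU Pro's official scoring) estimating the fraction of questions a model actually knows if it guesses uniformly on the rest.

```python
# Expected score from pure guessing on an n-choice benchmark:
# 4 choices (MMLU) -> 25% baseline; 10 choices (MMLU Pro) -> 10%.

def guess_baseline(n_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / n_choices

def guess_corrected(observed: float, n_choices: int) -> float:
    """Chance-corrected accuracy: the fraction of questions the
    model actually 'knows', assuming uniform guesses on the rest.
    (Standard correction-for-guessing formula; illustrative only.)"""
    b = guess_baseline(n_choices)
    return max(0.0, (observed - b) / (1.0 - b))

print(guess_baseline(4))                       # 0.25
print(guess_baseline(10))                      # 0.1
print(round(guess_corrected(0.80, 10), 3))     # 0.778
```

On a 4-choice test, the same 80% observed score would chance-correct to only about 73%, which is why 10 options make raw accuracy a tighter estimate of real knowledge.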

Level 3

Construction methodology: Start with the full MMLU corpus. Filter with multiple expert annotators for quality, removing ambiguous, trivial, and label-noise items. Regenerate answer choices with distractor-generation methods that combine semantic similarity and factual plausibility. Validate that the 10-choice format preserves separability. Hard subset: questions where a majority of frontier models answered incorrectly or diverged · this concentrates the benchmark's discriminating power. Frontier correlation: MMLU Pro scores correlate at 0.85+ with Chatbot Arena rankings, better than plain MMLU.
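The hard-subset step described above can be sketched in a few lines. The model names and per-question results below are made up for illustration; the rule is the one the text states: keep a question when a majority of the panel got it wrong.

```python
# Hard-subset selection: keep questions that a majority of a
# panel of frontier models answered incorrectly.
# (Toy data; real construction uses actual model run results.)

results = {
    # question_id -> {model_name: answered_correctly}
    "q1": {"model_a": True,  "model_b": True,  "model_c": True},
    "q2": {"model_a": False, "model_b": False, "model_c": True},
    "q3": {"model_a": False, "model_b": False, "model_c": False},
}

def hard_subset(results: dict) -> list:
    """Question ids where the majority of models were incorrect."""
    hard = []
    for qid, per_model in results.items():
        wrong = sum(not ok for ok in per_model.values())
        if wrong > len(per_model) / 2:
            hard.append(qid)
    return hard

print(hard_subset(results))  # ['q2', 'q3']
```

Filtering this way concentrates exactly the items where frontier models still disagree, which is what keeps the benchmark discriminating at the top of the leaderboard.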

Why this matters now

MMLU Pro has quietly replaced plain MMLU in most new-model launch charts · frontier models now cite MMLU Pro and MMLU-Pro Math alongside GPQA Diamond.

The takeaway for you
If you are a Researcher
  • 14 subjects, 12K+ questions, 10 answer choices
  • Filtered for label noise · higher quality than MMLU
  • Math subset is the hardest · discriminates the top 5 models
If you are a Builder
  • Use MMLU Pro scores instead of MMLU when choosing a general-purpose model
  • Weight the subset relevant to your domain
  • Don't use either benchmark for agent workload prediction · that's SWE-bench territory
If you are an Investor
  • MMLU Pro still differentiates frontier labs · MMLU has saturated
  • Math subset shows the biggest gap between frontier and mid-tier
  • Correlates well with Arena · good proxy for consumer preference
If you are a Curious normie
  • A harder general-knowledge test than the old one
  • 10 answer choices instead of 4 means less guessing
  • The big-brain test that replaced MMLU
Don't mix them up
MMLU Pro vs MMLU

MMLU has 4 answer choices and is saturated at 92%+. MMLU Pro has 10 choices, is reasoning-heavy, and tops out around 82%. Pro is the active benchmark.

Gecko's take

If a 2026 model announcement cites plain MMLU instead of MMLU Pro, someone's hiding something.

Hardest subset: Mathematics · top models score 65-70% there vs 78-82% overall.
Canonical sources