MMLU Pro
A harder, more discriminating version of MMLU with 10 answer choices instead of 4 and reasoning-heavy questions.
Basic
MMLU Pro was released in 2024 to address the saturation of the original MMLU benchmark. Frontier models had pushed MMLU scores above 90%, compressing the score range in which models could be differentiated. MMLU Pro uses 12,000+ questions across 14 subjects, with 10 answer choices per question instead of 4 and a heavier emphasis on multi-step reasoning. The top of the board sits around 78-82% in 2026.
Deep
MMLU Pro filters trivial and ambiguous questions from the original MMLU and adds new reasoning-heavy items. Each question has 10 options, cutting the lucky-guess baseline from 25% to 10%. Subject coverage: Math, Physics, Chemistry, Biology, Computer Science, Engineering, Economics, Business, Law, Psychology, Philosophy, History, Health, and Other. The "Math" subset is the hardest; top models score below 70%. Pro reports significantly larger gaps between frontier and mid-tier models than plain MMLU, making it the better discriminator in 2026.
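The effect of moving from 4 to 10 options can be made concrete with a chance-corrected accuracy calculation. This is a standard rescaling for multiple-choice tests, not something the MMLU Pro paper itself reports; the numbers below are illustrative:

```python
def random_guess_baseline(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing."""
    return 1.0 / num_choices

def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale raw accuracy so random guessing maps to 0.0
    and perfect accuracy maps to 1.0."""
    baseline = random_guess_baseline(num_choices)
    return (accuracy - baseline) / (1.0 - baseline)

# The same raw 55% means more on a 10-choice test than a 4-choice one.
print(random_guess_baseline(4))              # 0.25
print(random_guess_baseline(10))             # 0.1
print(round(chance_corrected(0.55, 4), 3))   # 0.4
print(round(chance_corrected(0.55, 10), 3))  # 0.5
```

In other words, a 10-choice format both lowers the floor and stretches the usable score range, which is exactly the compression problem MMLU Pro was built to fix.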
Expert
Construction methodology: Start with the full MMLU corpus. Filter with multiple expert annotators for quality (remove ambiguous, trivial, or label-noise items). Regenerate answer choices with distractor-generation methods that combine semantic similarity and factual plausibility. Validate that the 10-choice format preserves separability. Hard subset: questions where the majority of frontier models answered incorrectly or diverged, which concentrates discriminating power. Frontier correlation: MMLU Pro scores correlate at 0.85+ with Chatbot Arena rankings, better than plain MMLU.
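The rank-correlation claim can be checked with a plain Spearman computation. A minimal sketch, assuming no tied scores; the benchmark accuracies and Arena ratings below are invented for illustration, not real leaderboard data:

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation for tie-free data:
    Pearson correlation applied to the ranks."""
    def ranks(vals: list[float]) -> list[float]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2  # ranks 0..n-1 always average to this
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # rank variance is identical for both
    return cov / var

# Hypothetical MMLU Pro accuracies and Arena ratings for five models.
pro_scores = [0.82, 0.79, 0.74, 0.68, 0.61]
arena_elo = [1350, 1290, 1330, 1260, 1240]
print(spearman(pro_scores, arena_elo))  # 0.9
```

One rank swap among five models still yields 0.9, which is the shape of the claim: high but imperfect agreement between a static benchmark and human preference rankings.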
MMLU Pro has quietly replaced plain MMLU in most new-model launch charts; frontier models now cite MMLU Pro and MMLU-Pro Math alongside GPQA Diamond.
Depending on why you're here
- 14 subjects, 12K+ questions, 10 answer choices
- Filtered for label noise; higher quality than MMLU
- Math subset is the hardest; discriminates the top 5 models
- Use MMLU Pro scores instead of MMLU when choosing a general-purpose model
- Weight the subset relevant to your domain
- Don't use either benchmark for agent workload prediction; that's SWE-bench territory
- MMLU Pro still differentiates frontier labs; MMLU has saturated
- Math subset shows the biggest gap between frontier and mid-tier models
- Correlates well with Arena; a good proxy for consumer preference
- A harder general-knowledge test than the old one
- 10 answer choices instead of 4 means less guessing
- The big-brain test that replaced MMLU
Often confused with
MMLU has 4 answer choices and is saturated at 92%+. MMLU Pro has 10 choices, is reasoning-heavy, and tops out around 82%. Pro is the active benchmark.
If a 2026 model announcement cites plain MMLU instead of MMLU Pro, someone's hiding something.
Read the primary sources
- MMLU Pro paper (2024), arxiv.org