Benchmark · Knowledge

HELM · MMLU-Pro

Updated 2026-01-21
Models tested: 34
Top score: 90.3 (Gemini 3 Pro)
Median: 78.0 (min 53.7)
Top-5 spread: σ 1.8 (settled)
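
The page does not say how the top-5 spread is computed. One plausible reading is the standard deviation of the five highest scores; the sketch below assumes that, and assumes the population form (`pstdev`) rather than the sample form. The function name `top_k_spread` and the scores are hypothetical, not BenchGecko data.

```python
from statistics import pstdev


def top_k_spread(scores, k=5):
    """Population standard deviation of the k highest scores (assumed definition)."""
    if len(scores) < k:
        raise ValueError("need at least k scores")
    top = sorted(scores, reverse=True)[:k]
    return pstdev(top)


# Hypothetical scores for illustration only.
example_scores = [90.0, 88.0, 86.0, 84.0, 82.0, 70.0, 55.0]
spread = top_k_spread(example_scores)
print(f"top-5 spread: {spread:.2f}")
```

A small spread means the leading models are bunched together, which is presumably what the "settled" label refers to.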

Best score over time · one chart, every benchmark

[Chart: HELM · MMLU-Pro · 30 models · frontier running max. Y-axis: score (0 to 100, higher is better). X-axis: release date, Jul 24 to Jan 26. Source: benchgecko.ai/benchmark/helm-mmlu-pro]
Frontier on HELM · MMLU-Pro rose from 60.3 to 86.3 in 11 months · +26.0 points · latest leader Gemini 2.5 Pro from Google DeepMind.
Pink dots = frontier records · 11 total
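
A frontier running max can be sketched as follows: walk the releases in date order and keep only the entries that beat every earlier score. This is a minimal illustration, not BenchGecko's code; `frontier_records`, the model names, dates, and scores are all hypothetical.

```python
from datetime import date


def frontier_records(releases):
    """Return the (release_date, model, score) entries that set a new best score."""
    records = []
    best = float("-inf")
    for release_date, model, score in sorted(releases, key=lambda r: r[0]):
        if score > best:  # ties do not count as a new record
            best = score
            records.append((release_date, model, score))
    return records


# Hypothetical releases for illustration only.
releases = [
    (date(2024, 7, 1), "model-a", 60.0),
    (date(2024, 12, 1), "model-b", 58.0),
    (date(2025, 4, 1), "model-c", 75.0),
    (date(2025, 9, 1), "model-d", 75.0),
    (date(2026, 1, 1), "model-e", 86.0),
]
records = frontier_records(releases)
gain = records[-1][2] - records[0][2]
print([model for _, model, _ in records], gain)
```

Each returned entry corresponds to one record dot on a chart like this; the gain is the difference between the last and first record.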

Where models cluster

Score distribution · models per score bucket (0 to 100) · median 78.0 · benchgecko.ai

Score bucket    Models
0–10            0
10–20           0
20–30           0
30–40           0
40–50           0
50–60           4
60–70           5
70–80           15
80–90           9
90–100          1
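
A distribution like this can be reproduced by counting scores in 10-point buckets. The sketch below assumes each bucket includes its lower edge and excludes its upper edge, with a score of exactly 100 placed in the last bucket; the page does not state its edge rule. `bucket_counts` is a hypothetical helper, and the sample list is illustrative rather than the full dataset.

```python
def bucket_counts(scores, width=10, top=100):
    """Count scores per bucket of the given width over the range [0, top]."""
    n_buckets = top // width
    counts = [0] * n_buckets
    for s in scores:
        if not 0 <= s <= top:
            raise ValueError(f"score out of range: {s}")
        # A score equal to `top` falls into the last bucket.
        idx = min(int(s // width), n_buckets - 1)
        counts[idx] += 1
    return {f"{i * width}-{(i + 1) * width}": c for i, c in enumerate(counts)}


# Illustrative sample only, not the 34-model dataset.
sample = [53.7, 78.0, 90.3, 100.0]
hist = bucket_counts(sample)
for label, count in hist.items():
    print(f"{label:>7}  {count}")
```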

Pearson r · original research

34 models tested · sorted by score

Pulled from the HELM · MMLU-Pro dataset · updated daily

What does HELM · MMLU-Pro measure?

HELM · MMLU-Pro is a knowledge benchmark in the BenchGecko catalog. 34 AI models have been tested on it. Scores range from 53.7 to 90.3 out of 100.

Which model leads on HELM · MMLU-Pro?

Gemini 3 Pro from Google DeepMind leads HELM · MMLU-Pro with a score of 90.3. The median score across 34 tested models is 78.0.

Is HELM · MMLU-Pro saturated?

No · the top score is 90.3 out of 100, which leaves 9.7 points of headroom on HELM · MMLU-Pro.

Does HELM · MMLU-Pro predict performance on other benchmarks?

Yes · HELM · MMLU-Pro scores have a Pearson correlation of r = 0.94 with HELM · GPQA scores across 34 shared models. Models that do well on HELM · MMLU-Pro tend to do well on HELM · GPQA.
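
For readers who want to check a figure like this themselves, the sketch below computes a Pearson correlation over paired scores from models tested on both benchmarks. It assumes the coefficient reported here is Pearson's r, as the page's "Pearson r" heading suggests. `pearson_r` is a hypothetical helper, and the paired scores are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical paired scores: one entry per model tested on both benchmarks.
benchmark_a = [60.0, 70.0, 80.0, 90.0]
benchmark_b = [30.0, 45.0, 50.0, 70.0]
print(f"r = {pearson_r(benchmark_a, benchmark_b):.2f}")
```

A high r describes the tested models as a group; it does not guarantee that any single model's score on one benchmark pins down its score on the other.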

How often is HELM · MMLU-Pro data refreshed?

BenchGecko pulls updates daily. New model scores on HELM · MMLU-Pro appear as soon as they are published by Epoch AI or the model provider.

Same category · related evaluations