Benchmark · Knowledge

OpenCompass · MMLU-Pro

Updated 2026-02-16
Models tested: 32
Top score: 87.6 (Qwen3.5 397B A17B)
Median: 81.8 (min 63.0)
Top-5 spread: σ 0.7 (settled)

Best score over time · one chart, every benchmark

[Chart: OpenCompass · MMLU-Pro · 32 models · frontier running max. Y-axis: score, 0 to 100, higher is better. X-axis: release date, Mar 25 to Feb 26. Source: benchgecko.ai/benchmark/oc-mmlu-pro]
Frontier on OpenCompass · MMLU-Pro rose from 67.8 to 87.6 in 11 months · +19.8 points · latest leader Qwen3.5 397B A17B from Alibaba Qwen.
Pink dots = frontier records · 7 total
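
A "frontier running max" chart like the one described above can be sketched as follows: walk the results in release-date order and keep only those that beat every earlier score. This is a minimal sketch, not BenchGecko's actual pipeline. The function name `frontier_records`, the tuple layout, and the model names and intermediate scores are all hypothetical placeholders; only the 67.8 and 87.6 endpoints come from the text above.

```python
from datetime import date

def frontier_records(results):
    """Return the results that set a new best score, in release-date order.

    `results` is an iterable of (release_date, model_name, score) tuples.
    """
    records = []
    best = float("-inf")
    for release, name, score in sorted(results, key=lambda r: r[0]):
        if score > best:  # strictly better than everything released earlier
            best = score
            records.append((release, name, score))
    return records

# Illustrative placeholder data, not actual leaderboard entries.
results = [
    (date(2025, 3, 1), "model-a", 67.8),
    (date(2025, 6, 1), "model-b", 66.0),   # below the frontier, not a record
    (date(2025, 8, 1), "model-c", 75.0),
    (date(2025, 11, 1), "model-d", 74.0),  # below the frontier, not a record
    (date(2026, 2, 1), "model-e", 87.6),
]

records = frontier_records(results)
gain = records[-1][2] - records[0][2]
print([name for _, name, _ in records], round(gain, 1))
```

Each tuple in `records` corresponds to one frontier dot, and `gain` is the difference between the last and first record, the same quantity as the "+19.8 points" figure.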

Where models cluster

[Chart: score distribution · models per 10-point score bucket, 0 to 100. 60–70: 2 models · 70–80: 10 models · 80–90: 20 models · all other buckets empty. Median: 81.8. Source: benchgecko.ai]
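
A score distribution of this kind can be reproduced by grouping scores into 10-point buckets. The sketch below assumes half-open buckets (a score of 70.0 falls in 70–80) and puts a perfect 100 in the last bucket; the page does not say how BenchGecko handles bucket edges. The function name `bucket_counts` and the sample scores are hypothetical.

```python
from collections import Counter

def bucket_counts(scores, width=10, top=100):
    """Count scores per [low, low + width) bucket on a 0..top scale."""
    counts = Counter()
    for s in scores:
        # Clamp so a score of exactly `top` lands in the last bucket.
        low = min(int(s // width) * width, top - width)
        counts[low] += 1
    # Include empty buckets so the full axis is represented.
    return {f"{low}–{low + width}": counts[low] for low in range(0, top, width)}

# Illustrative placeholder scores, not the actual 32 results.
scores = [63.0, 68.5, 74.2, 79.9, 81.8, 84.0, 87.6]
histogram = bucket_counts(scores)
print({bucket: n for bucket, n in histogram.items() if n})
```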

Pearson r · original research

Correlation analysis

Benchmarks that track with OpenCompass · MMLU-Pro

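Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship, meaning scores on one benchmark are more predictive of scores on the other.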
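
The correlation described above can be sketched in a few lines: restrict both benchmarks to the models they share, then compute Pearson r on the paired scores. This is an illustrative sketch under stated assumptions, not BenchGecko's implementation. The helper names `pearson_r` and `shared_correlation`, the dict-of-scores layout, and the sample values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)

def shared_correlation(scores_a, scores_b):
    """Return (r, n) over the models present in both {model: score} dicts."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    r = pearson_r([scores_a[m] for m in shared], [scores_b[m] for m in shared])
    return r, len(shared)

# Illustrative placeholder scores; only m2, m3 and m4 appear on both benchmarks.
bench_a = {"m1": 60.0, "m2": 70.0, "m3": 80.0, "m4": 90.0}
bench_b = {"m2": 35.0, "m3": 40.0, "m4": 45.0, "m5": 50.0}
r, n_shared = shared_correlation(bench_a, bench_b)
print(n_shared, round(r, 2))
```

The function returns the number of shared models alongside r because a correlation computed over a small overlap is less stable than one computed over many models.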

32 models tested · sorted by score

Pulled from the OpenCompass · MMLU-Pro dataset · updated daily

What does OpenCompass · MMLU-Pro measure?

OpenCompass · MMLU-Pro is a knowledge benchmark in the BenchGecko catalog. 32 AI models have been tested on it. Scores range from 63.0 to 87.6 out of 100.

Which model leads on OpenCompass · MMLU-Pro?

Qwen3.5 397B A17B from Alibaba Qwen leads OpenCompass · MMLU-Pro with a score of 87.6. The median score across 32 tested models is 81.8.

Is OpenCompass · MMLU-Pro saturated?

No · the top score is 87.6 out of 100 (88%), which leaves 12.4 points of headroom on OpenCompass · MMLU-Pro.

Does OpenCompass · MMLU-Pro predict performance on other benchmarks?

Yes · OpenCompass · MMLU-Pro scores correlate 0.98 with GPQA diamond across 10 shared models. Models that do well on OpenCompass · MMLU-Pro tend to do well on GPQA diamond.

How often is OpenCompass · MMLU-Pro data refreshed?

BenchGecko pulls updates daily. New model scores on OpenCompass · MMLU-Pro appear as soon as they are published by Epoch AI or the model provider.
