JMMLU

Updated 2025-01-20
Models tested: 11
Top score: 63.4 (DeepSeek R1 Distill Qwen 14B)
Median: 42.3 (min 24.2)
Top-5 spread: σ 6.9 (wide open)
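
The page does not say how the top-5 spread is computed. As a rough sketch, assuming σ is the population standard deviation of the five highest scores, it could be calculated as below. The function name `top_k_spread` is hypothetical, and in the sample list only 24.2, 42.3, and 63.4 come from this page; the other values are placeholders.

```python
from statistics import pstdev


def top_k_spread(scores, k=5):
    """Population standard deviation of the k highest scores.

    Assumption: the page's "Top-5 spread" is a population sigma over the
    top five scores. The page itself does not confirm this definition.
    """
    if len(scores) < k:
        raise ValueError(f"need at least {k} scores, got {len(scores)}")
    top = sorted(scores, reverse=True)[:k]
    return pstdev(top)


# Placeholder scores: only 24.2 (min), 42.3 (median), and 63.4 (top) are
# taken from the page; every other value is made up for illustration.
example_scores = [24.2, 27.0, 31.0, 35.0, 38.0, 42.3, 45.0, 48.0, 52.0, 57.0, 63.4]
print(round(top_k_spread(example_scores), 1))
```

A larger σ means the leading models are further apart, which is presumably what the "wide open" label signals.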

Best score over time · one chart, every benchmark

[Chart: JMMLU frontier running max, 9 models. Y axis: score, 0 to 100, higher is better. X axis: release date, Apr 24 to Jan 25. Source: benchgecko.ai/benchmark/jp-jmmlu]
The frontier on JMMLU rose from 44.7 to 63.4 in 9 months (+18.7 points). The latest leader is DeepSeek R1 Distill Qwen 14B from DeepSeek.
Frontier records: 4 total
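
A "frontier running max" can be read as the best score seen so far when models are ordered by release date, with a new frontier record each time that best score is beaten. The sketch below illustrates the idea under that reading. The function name `frontier_records`, the model names, the dates, and the intermediate scores are all hypothetical; only the endpoints 44.7 and 63.4 are taken from this page.

```python
from datetime import date


def frontier_records(results):
    """Return the entries that set a new best score, in release order.

    `results` is an iterable of (release_date, model_name, score) tuples.
    """
    records = []
    best = float("-inf")
    # Walk models in release order. Within the same date the higher score
    # is visited first, so only one record is kept per date.
    for released, model, score in sorted(results, key=lambda r: (r[0], -r[2])):
        if score > best:
            best = score
            records.append((released, model, score))
    return records


# Illustrative data only: names, dates, and the middle scores are invented.
sample = [
    (date(2024, 4, 1), "model-a", 44.7),
    (date(2024, 6, 1), "model-b", 38.0),  # below the frontier, not a record
    (date(2024, 9, 1), "model-c", 52.0),
    (date(2025, 1, 1), "model-d", 63.4),
]

records = frontier_records(sample)
for released, model, score in records:
    print(released.isoformat(), model, score)

# Total frontier gain between the first and the latest record.
print(round(records[-1][2] - records[0][2], 1))
```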

Where models cluster

[Histogram: score distribution in 10-point buckets from 0 to 100. Models per bucket: 20–30: 2 · 30–40: 3 · 40–50: 3 · 50–60: 2 · 60–70: 1 · all other buckets empty. Median: 42.3. Source: benchgecko.ai]
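
For readers who want to reproduce this kind of distribution view, here is a minimal sketch that groups scores into 10-point buckets, assuming half-open buckets such as [20, 30) with a score of exactly 100 placed in the last bucket. The function name `bucket_counts` is hypothetical, and in the sample list only 24.2, 42.3, and 63.4 come from this page; the remaining scores are placeholders.

```python
from collections import Counter
from statistics import median


def bucket_counts(scores, width=10, top=100):
    """Count scores per [start, start + width) bucket on a 0..top scale."""
    counts = Counter()
    for s in scores:
        if not 0 <= s <= top:
            raise ValueError(f"score {s} is outside 0..{top}")
        # A score equal to `top` is clamped into the final bucket.
        start = min(int(s // width) * width, top - width)
        counts[(start, start + width)] += 1
    return dict(sorted(counts.items()))


# Placeholder scores: only the min (24.2), median (42.3), and top (63.4)
# are taken from the page; the rest are invented for illustration.
example_scores = [24.2, 27.0, 31.0, 35.0, 38.0, 42.3, 45.0, 48.0, 52.0, 57.0, 63.4]

for (low, high), n in bucket_counts(example_scores).items():
    print(f"{low}-{high}: {n}")
print("median:", median(example_scores))
```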

Pearson r · original research

Correlation analysis

Benchmarks that track with JMMLU

Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship between the two sets of scores.
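
To make the calculation concrete, the sketch below computes Pearson r over only the models that appear in both score tables. It is an illustration, not BenchGecko's actual pipeline: the function name `pearson_r` and all model names and scores in the example are hypothetical.

```python
from math import sqrt


def pearson_r(scores_a, scores_b):
    """Pearson correlation over the models present in both dicts.

    `scores_a` and `scores_b` map model name -> score on one benchmark.
    """
    shared = sorted(scores_a.keys() & scores_b.keys())
    if len(shared) < 2:
        raise ValueError("need at least two shared models")

    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Covariance and variances, unnormalised (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores; model-d is skipped because it lacks a second score.
benchmark_a = {"model-a": 30.0, "model-b": 45.0, "model-c": 60.0, "model-d": 50.0}
benchmark_b = {"model-a": 35.0, "model-b": 48.0, "model-c": 55.0}
print(round(pearson_r(benchmark_a, benchmark_b), 2))
```

Restricting the calculation to shared models matters because a correlation based on only a handful of overlapping models can swing widely when a single score changes.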

11 models tested · sorted by score

Pulled from the JMMLU dataset · updated daily

What does JMMLU measure?

JMMLU is a Japanese-language knowledge benchmark in the BenchGecko catalog. Eleven AI models have been tested on it, with scores ranging from 24.2 to 63.4 out of 100.

Which model leads on JMMLU?

DeepSeek R1 Distill Qwen 14B from DeepSeek leads JMMLU with a score of 63.4. The median score across 11 tested models is 42.3.

Is JMMLU saturated?

No. The top score is 63.4 out of 100, which leaves 36.6 points of headroom on JMMLU.

Does JMMLU predict performance on other benchmarks?

Yes. JMMLU scores have a Pearson correlation of 0.89 with LLM-JP · Overall across 11 shared models, so models that do well on JMMLU tend to do well on LLM-JP · Overall.

How often is JMMLU data refreshed?

BenchGecko pulls updates daily, so new model scores on JMMLU appear within a day of being published by Epoch AI or the model provider.
