JMMLU
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with JMMLU
Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship, so one benchmark's scores are more predictive of the other's.
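As a rough illustration of the statistic described above, here is a minimal sketch of Pearson r computed over paired scores. The function name `pearson_r` and the sample lists are hypothetical and chosen only for illustration. They are not BenchGecko's code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four models on two benchmarks (not real data).
benchmark_a = [30.0, 40.0, 50.0, 60.0]
benchmark_b = [28.0, 43.0, 49.0, 66.0]
r = pearson_r(benchmark_a, benchmark_b)
print(f"r = {r:.2f}")
```

Only models scored on both benchmarks can contribute a pair, which is why the correlation is reported together with a shared-model count.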
Full rankings
11 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | DeepSeek R1 Distill Qwen 14B | 63.4 |
| 2 | | 56.5 |
| 3 | | 56.3 |
| 4 | | 46.7 |
| 5 | | 44.7 |
| 6 | | 42.3 |
| 7 | | 38.4 |
| 8 | | 37.8 |
| 9 | | 33.3 |
| 10 | | 28.6 |
| 11 | HF SmolLM2 135M Instruct | 24.2 |
Frequently asked
Pulled from the JMMLU dataset · updated daily
What does JMMLU measure?
JMMLU is a knowledge benchmark in the BenchGecko catalog. 11 AI models have been tested on it. Scores range from 24.2 to 63.4 out of 100.
Which model leads on JMMLU?
DeepSeek R1 Distill Qwen 14B from DeepSeek leads JMMLU with a score of 63.4. The median score across 11 tested models is 42.3.
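The summary figures quoted here can be checked directly against the rankings table. The following is a minimal sketch using the 11 scores listed on this page. The variable names are illustrative only.

```python
from statistics import median

# The 11 JMMLU scores from the rankings table, highest to lowest.
scores = [63.4, 56.5, 56.3, 46.7, 44.7, 42.3, 38.4, 37.8, 33.3, 28.6, 24.2]

top_score = max(scores)
lowest_score = min(scores)
median_score = median(scores)  # middle value of an odd-length list
headroom = round(100 - top_score, 1)  # points left before the 100-point ceiling

print(f"range: {lowest_score} to {top_score}")
print(f"median: {median_score}")
print(f"headroom: {headroom}")
```

With 11 models the median is simply the sixth-ranked score, so it matches row 6 of the table.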
Is JMMLU saturated?
No · the top score is 63.4 out of 100 (63%). There is still meaningful room for improvement on JMMLU.
Does JMMLU predict performance on other benchmarks?
Yes · JMMLU scores have a Pearson correlation of 0.89 with LLM-JP · Overall across the 11 models scored on both benchmarks. Models that do well on JMMLU tend to do well on LLM-JP · Overall.
How often is JMMLU data refreshed?
BenchGecko pulls updates daily. New model scores on JMMLU appear in the next daily update after they are published by Epoch AI or the model provider.