LLM-JP · Overall
The Frontier
Best score over time · one chart, every benchmark
Distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with LLM-JP · Overall
Pearson correlation computed across models scored on both benchmarks. Values closer to 1 indicate one benchmark strongly predicts the other.
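The correlation shown here can be reproduced with a plain Pearson r over the paired model scores. A minimal sketch (the helper name `pearson_r` and the example score lists are hypothetical, not BenchGecko's actual pipeline):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two standard-deviation denominators.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical paired scores for models evaluated on both benchmarks.
bench_a = [56.8, 49.6, 41.4, 38.3, 15.6]
bench_b = [61.0, 52.3, 44.8, 40.1, 20.2]
r = pearson_r(bench_a, bench_b)
```

An r near 0.90, as reported below for JCommonsenseQA, means a model's rank on one benchmark is a strong (though not perfect) predictor of its rank on the other.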
Full rankings
11 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | DeepSeek R1 Distill Qwen 14B | 56.8 |
| 2 | | 53.0 |
| 3 | | 51.7 |
| 4 | | 49.6 |
| 5 | | 48.9 |
| 6 | | 41.4 |
| 7 | | 40.5 |
| 8 | | 39.3 |
| 9 | | 38.3 |
| 10 | | 37.2 |
| 11 | HF SmolLM2 135M Instruct | 15.6 |
Frequently asked
Pulled from the LLM-JP · Overall dataset · updated daily
What does LLM-JP · Overall measure?
LLM-JP · Overall is a knowledge benchmark in the BenchGecko catalog. 11 AI models have been tested on it. Scores range from 15.6 to 56.8 out of 100.
Which model leads on LLM-JP · Overall?
DeepSeek R1 Distill Qwen 14B from DeepSeek leads LLM-JP · Overall with a score of 56.8. The median score across 11 tested models is 41.4.
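The median quoted here follows directly from the rankings table: with 11 scores, it is simply the 6th value when sorted. A quick check using the scores listed above:

```python
# The 11 scores from the full rankings table.
scores = [56.8, 53.0, 51.7, 49.6, 48.9, 41.4, 40.5, 39.3, 38.3, 37.2, 15.6]

# For an odd-length list, the median is the middle element after sorting.
median = sorted(scores)[len(scores) // 2]
```

The middle (6th of 11) sorted score is 41.4, matching the figure above.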
Is LLM-JP · Overall saturated?
No · the top score is 56.8 out of 100 (57%). There is still meaningful room for improvement on LLM-JP · Overall.
Does LLM-JP · Overall predict performance on other benchmarks?
Yes · LLM-JP · Overall scores correlate 0.90 with JCommonsenseQA across 11 shared models. Models that do well on LLM-JP · Overall tend to do well on JCommonsenseQA.
How often is LLM-JP · Overall data refreshed?
BenchGecko pulls updates daily. New model scores on LLM-JP · Overall appear as soon as they are published by Epoch AI or the model provider.
More knowledge benchmarks
Same category · related evaluations