Benchmark · Knowledge

LLM-JP · Overall

Updated 2025-01-20
Models tested: 11
Top score: 56.8 (DeepSeek R1 Distill Qwen 14B)
Median: 41.4 (min 15.6)
Top-5 spread: σ 2.8 (competitive)

Best score over time · one chart, every benchmark

[Chart: LLM-JP · Overall · 9 models · frontier running max. Y axis: score (0 to 100, higher is better). X axis: release date, Apr 24 to Jan 25. Source: benchgecko.ai/benchmark/jp-overall]
The frontier on LLM-JP · Overall rose from 48.9 to 56.8 in 9 months, a gain of roughly 8 points. The latest leader is DeepSeek R1 Distill Qwen 14B from DeepSeek.
Pink dots = frontier records · 5 total
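A "frontier running max" chart can be read as the sequence of models that each set a new best score at the time of their release. The sketch below shows one way to derive those records, assuming each result is a (release date, model, score) row. The function name `frontier_records`, the model names, the dates, and the intermediate scores are all hypothetical placeholders; only the first and last frontier scores (48.9 and 56.8) are taken from the page.

```python
from datetime import date

# Hypothetical rows: (release_date, model, score).
# Only 48.9 and 56.8 come from the page; everything else is made up.
results = [
    (date(2024, 4, 18), "model-a", 48.9),
    (date(2024, 6, 5), "model-b", 45.0),
    (date(2024, 9, 12), "model-c", 51.2),
    (date(2024, 11, 3), "model-d", 50.1),
    (date(2025, 1, 15), "model-e", 56.8),
]


def frontier_records(rows):
    """Return the rows that set a new best score, in release order."""
    records = []
    best = float("-inf")
    for released, model, score in sorted(rows, key=lambda r: r[0]):
        if score > best:  # strictly better than every earlier release
            best = score
            records.append((released, model, score))
    return records


records = frontier_records(results)
# Gain between the first and the latest frontier record.
gain = round(records[-1][2] - records[0][2], 1)

for released, model, score in records:
    print(released.isoformat(), model, score)
print("gain:", gain)
```

With these placeholder rows, three of the five models count as frontier records; the other two were released after a higher score already existed.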

Where models cluster

[Chart: score distribution, models per score bucket (0 to 100). 10–20: 1 · 30–40: 3 · 40–50: 4 · 50–60: 3 · all other buckets: 0. Median · 41.4. Source: benchgecko.ai]

Pearson r · original research

Correlation analysis

Benchmarks that track with LLM-JP · Overall

Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
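For readers who want to reproduce this kind of figure, here is a minimal sketch of the Pearson correlation coefficient over paired scores. It assumes both lists are aligned so that index i refers to the same model on each benchmark. The function name `pearson_r` and the sample scores are hypothetical and are not BenchGecko data or BenchGecko's actual implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five models on two benchmarks (not real data).
overall = [20.0, 35.0, 41.0, 50.0, 56.0]
other = [30.0, 52.0, 60.0, 71.0, 80.0]

r = pearson_r(overall, other)
print(f"r = {r:.2f}")
```

A value near 1 means models rank similarly on both benchmarks, a value near 0 means no linear relationship, and a value near -1 means the rankings tend to invert. With only a handful of shared models the estimate is noisy, so it is best read as a rough signal.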

11 models tested · sorted by score

Pulled from the LLM-JP · Overall dataset · updated daily

What does LLM-JP · Overall measure?

LLM-JP · Overall is a knowledge benchmark in the BenchGecko catalog. 11 AI models have been tested on it. Scores range from 15.6 to 56.8 out of 100.

Which model leads on LLM-JP · Overall?

DeepSeek R1 Distill Qwen 14B from DeepSeek leads LLM-JP · Overall with a score of 56.8. The median score across 11 tested models is 41.4.

Is LLM-JP · Overall saturated?

No · the top score is 56.8 out of 100 (57%). There is still meaningful room for improvement on LLM-JP · Overall.
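The percentage above is simply the top score divided by the maximum of the scale. A small sketch of that arithmetic follows, using the 56.8 and 100 quoted on this page; the function name `saturation` is hypothetical, and no particular saturation threshold is assumed.

```python
def saturation(top_score, max_score=100.0):
    """Return (fraction of the scale reached, remaining headroom in points)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return top_score / max_score, max_score - top_score


fraction, headroom = saturation(56.8)
print(f"{fraction:.0%} of the scale reached, {headroom:.1f} points of headroom")
```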

Does LLM-JP · Overall predict performance on other benchmarks?

Yes · LLM-JP · Overall scores correlate 0.90 with JCommonsenseQA across 11 shared models. Models that do well on LLM-JP · Overall tend to do well on JCommonsenseQA.

How often is LLM-JP · Overall data refreshed?

BenchGecko pulls updates daily. New model scores on LLM-JP · Overall appear as soon as they are published by Epoch AI or the model provider.
