LiveBench · Reasoning
The Frontier
[Chart: best score over time · one chart, every benchmark]
Distribution
[Chart: where models cluster]
Correlated benchmarks
Pearson r · original research
Benchmarks that track with LiveBench · Reasoning
Pearson correlation computed across models scored on both benchmarks. Values closer to 1 mean that scores on one benchmark more strongly predict scores on the other.
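As a rough illustration of the statistic described above, here is a minimal Python sketch of Pearson r for two lists of paired scores. The function name `pearson_r` and the sample scores are hypothetical and are not taken from BenchGecko or LiveBench. How the page actually computes its correlations is not stated, so treat this as one standard formulation, not the site's method.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of paired scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for four models tested on two benchmarks (illustration only)
reasoning = [80.0, 60.0, 40.0, 20.0]
other = [70.0, 55.0, 45.0, 10.0]
r = pearson_r(reasoning, other)
print(round(r, 2))
```

Only models scored on both benchmarks can be paired, which is why the correlation is reported over shared models.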
Full rankings
29 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.1-Codex-Max | 84.6 |
| 2 | GPT-5.1-Codex | 82.0 |
| 3 | GPT-5.2-Codex | 77.7 |
| 4 | Qwen3.6 Plus | 75.8 |
| 5 | MiniMax M2.7 | 74.8 |
| 6 | | 72.5 |
| 7 | | 69.7 |
| 8 | | 69.1 |
| 9 | | 64.7 |
| 10 | | 63.5 |
| 11 | | 62.1 |
| 12 | | 59.7 |
| 13 | | 59.4 |
| 14 | | 59.4 |
| 15 | | 59.3 |
| 16 | | 58.6 |
| 17 | | 58.4 |
| 18 | | 58.2 |
| 19 | | 56.1 |
| 20 | | 54.8 |
| 21 | | 45.5 |
| 22 | | 44.3 |
| 23 | | 39.2 |
| 24 | | 37.2 |
| 25 | | 35.5 |
| 26 | | 34.4 |
| 27 | | 27.7 |
| 28 | | 21.9 |
| 29 | | 17.4 |
Frequently asked
Pulled from the LiveBench · Reasoning dataset · updated daily
What does LiveBench · Reasoning measure?
LiveBench · Reasoning is a knowledge benchmark in the BenchGecko catalog. 29 AI models have been tested on it. Scores range from 17.4 to 84.6 out of 100.
Which model leads on LiveBench · Reasoning?
GPT-5.1-Codex-Max from OpenAI leads LiveBench · Reasoning with a score of 84.6. The median score across 29 tested models is 59.3.
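The median quoted above can be checked against the rankings table. The short Python sketch below copies the 29 scores from that table and uses the standard-library `statistics.median`. The variable names are illustrative only.

```python
import statistics

# The 29 scores from the "Full rankings" table, highest to lowest
scores = [
    84.6, 82.0, 77.7, 75.8, 74.8, 72.5, 69.7, 69.1, 64.7, 63.5,
    62.1, 59.7, 59.4, 59.4, 59.3, 58.6, 58.4, 58.2, 56.1, 54.8,
    45.5, 44.3, 39.2, 37.2, 35.5, 34.4, 27.7, 21.9, 17.4,
]

# With an odd count, the median is the middle (15th) value
median_score = statistics.median(scores)
print(len(scores), median_score, max(scores), min(scores))
```

With 29 entries, 14 models sit above the median and 14 sit below it.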
Is LiveBench · Reasoning saturated?
No · the top score is 84.6 out of 100 (about 85%), which leaves 15.4 points of headroom on LiveBench · Reasoning.
Does LiveBench · Reasoning predict performance on other benchmarks?
Yes · LiveBench · Reasoning scores have a Pearson correlation of 0.92 with LiveBench · Overall scores across 29 shared models, so models that do well on LiveBench · Reasoning tend to do well on LiveBench · Overall.
How often is LiveBench · Reasoning data refreshed?
BenchGecko pulls updates daily. New model scores on LiveBench · Reasoning appear as soon as they are published by Epoch AI or the model provider.
Top on LiveBench · Reasoning
GPT-5.1-Codex-Max · 84.6
GPT-5.1-Codex · 82.0
GPT-5.2-Codex · 77.7
Qwen3.6 Plus · 75.8
MiniMax M2.7 · 74.8
More knowledge benchmarks
Same category · related evaluations