Chess Puzzles
Chess Puzzles · tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating lookahead and pattern recognition abilities.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
24 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.4 Pro | 58.6 |
| 2 | Gemini 3.1 Pro Preview | 55.0 |
| 3 | GPT-5.2 | 49.0 |
| 4 | GPT-5.4 | 44.0 |
| 5 | Gemini 3 Flash Preview | 38.0 |
| 6 | | 37.0 |
| 7 | | 32.0 |
| 8 | | 31.0 |
| 9 | | 28.0 |
| 10 | | 26.0 |
| 11 | | 20.0 |
| 12 | | 20.0 |
| 13 | | 20.0 |
| 14 | | 17.0 |
| 15 | | 17.0 |
| 16 | | 14.0 |
| 17 | | 13.0 |
| 18 | | 12.0 |
| 19 | | 12.0 |
| 20 | | 12.0 |
| 21 | | 12.0 |
| 22 | | 10.0 |
| 23 | | 6.0 |
| 24 | | 4.0 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with Chess Puzzles
Pearson correlation across models scored on both benchmarks. Closer to 1 = strongly predictive.
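As a rough illustration of how a correlation "across models scored on both benchmarks" can be computed, the sketch below restricts two score tables to their shared models and applies the standard Pearson formula. The function name `pearson_on_shared`, the model names, and all scores are hypothetical placeholders, not BenchGecko data or code.

```python
import math


def pearson_on_shared(bench_a, bench_b):
    """Return (Pearson r, number of shared models) for two score tables.

    Each argument maps a model name to its score on one benchmark.
    Only models present in both tables are used.
    """
    shared = sorted(bench_a.keys() & bench_b.keys())
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")

    xs = [bench_a[m] for m in shared]
    ys = [bench_b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of centered products and squares (the 1/n factors cancel in r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores on one benchmark are constant")

    return cov / math.sqrt(var_x * var_y), n


# Hypothetical scores, for illustration only.
bench_a = {"model_a": 10.0, "model_b": 20.0, "model_c": 30.0, "model_d": 40.0}
bench_b = {"model_b": 25.0, "model_c": 35.0, "model_d": 45.0, "model_e": 5.0}

r, n_shared = pearson_on_shared(bench_a, bench_b)
print(f"r = {r:.2f} across {n_shared} shared models")
```

Reporting the number of shared models alongside r matters because a correlation computed over only a few models is far less stable than one computed over many.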
Frequently asked
About Chess Puzzles
What does Chess Puzzles measure?
Chess Puzzles tests strategic and tactical reasoning by having models solve chess puzzle positions, evaluating their lookahead and pattern recognition abilities. 24 AI models have been tested on it, with scores ranging from 4.0 to 58.6 out of 100.
Which model leads on Chess Puzzles?
GPT-5.4 Pro from OpenAI leads Chess Puzzles with a score of 58.6. The median score across 24 tested models is 20.0.
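The summary figures quoted in this FAQ can be checked against the rankings table. The snippet below is a minimal sketch that copies the 24 scores from the table on this page and recomputes the top score, the median, and the top score as a percentage of the maximum; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the rankings table on this page, ranks 1 through 24.
scores = [
    58.6, 55.0, 49.0, 44.0, 38.0, 37.0, 32.0, 31.0,
    28.0, 26.0, 20.0, 20.0, 20.0, 17.0, 17.0, 14.0,
    13.0, 12.0, 12.0, 12.0, 12.0, 10.0, 6.0, 4.0,
]
max_possible = 100  # "Max score" listed for this benchmark

top_score = max(scores)
median_score = median(scores)  # mean of the 12th and 13th values for 24 scores
top_pct = round(100 * top_score / max_possible)

print(f"models: {len(scores)}")
print(f"top: {top_score}, median: {median_score}, top as % of max: {top_pct}%")
```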
Is Chess Puzzles saturated?
No · the top score is 58.6 out of 100 (59%). There is still meaningful room for improvement on Chess Puzzles.
Does Chess Puzzles predict performance on other benchmarks?
Yes · Chess Puzzles scores have a Pearson correlation of 0.84 with VPCT scores across the 9 models tested on both benchmarks. Models that do well on Chess Puzzles tend to do well on VPCT.
How often is Chess Puzzles data refreshed?
BenchGecko pulls updates daily. New model scores on Chess Puzzles appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Max score: 100
- Models: 24
- Updated: 2026-03-05
Top on Chess Puzzles
- GPT-5.4 Pro · 58.6
- Gemini 3.1 Pro Preview · 55.0
- GPT-5.2 · 49.0
- GPT-5.4 · 44.0
- Gemini 3 Flash Preview · 38.0
More reasoning benchmarks
Same category · related evaluations