GSM8K
Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
32 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | GPT-4 (older v0314) | 92.0 |
| 2 | GPT-4o-mini | 91.3 |
| 3 | GPT-4o-mini (2024-07-18) | 91.3 |
| 4 | Qwen2.5 Coder 32B Instruct | 91.1 |
| 5 | GPT-4 Turbo | 90.0 |
| 6 | | 86.7 |
| 7 | | 86.7 |
| 8 | | 84.9 |
| 9 | | 84.2 |
| 10 | | 82.4 |
| 11 | | 82.4 |
| 12 | | 74.4 |
| 13 | Stable Beluga 2 | 69.6 |
| 14 | | 65.8 |
| 15 | | 61.3 |
| 16 | | 57.8 |
| 17 | StarCoder 2 15B | 57.7 |
| 18 | | 54.4 |
| 19 | | 54.4 |
| 20 | | 53.8 |
| 21 | Nemotron-4 15B | 46.0 |
| 22 | Yi 6B | 44.9 |
| 23 | INTELLECT-1 | 38.6 |
| 24 | | 36.9 |
| 25 | | 35.4 |
| 26 | MPT-30B | 34.4 |
| 27 | Baichuan 2-7B | 24.6 |
| 28 | | 21.3 |
| 29 | | 20.6 |
| 30 | | 17.7 |
| 31 | Baichuan1-7B | 9.2 |
| 32 | | 4.4 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with GSM8K
Pearson correlation across models scored on both benchmarks. Values closer to 1 indicate a stronger positive linear relationship.
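As a rough illustration of the calculation described above, the sketch below computes Pearson r over only those models that appear on both benchmarks. The function name `pearson_on_shared`, the dictionary layout, and the sample values are all hypothetical; the values are toy placeholders, not real benchmark scores, and this is not necessarily how BenchGecko computes its figures.

```python
from math import sqrt


def pearson_on_shared(a: dict[str, float], b: dict[str, float]) -> tuple[float, int]:
    """Return (Pearson r, number of shared models) for two score tables."""
    # Only models scored on BOTH benchmarks take part in the correlation.
    shared = sorted(a.keys() & b.keys())
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")

    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")

    return cov / sqrt(var_x * var_y), n


# Toy placeholder values (hypothetical, NOT real benchmark scores).
bench_a = {"model-a": 10.0, "model-b": 20.0, "model-c": 30.0, "model-d": 55.0}
bench_b = {"model-a": 1.0, "model-b": 2.0, "model-c": 3.0, "model-e": 9.0}

r, n = pearson_on_shared(bench_a, bench_b)
print(f"r = {r:.2f} across {n} shared models")
```

Returning the shared-model count alongside r keeps the sample size visible, which matters when only a handful of models overlap.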
Frequently asked
About GSM8K
What does GSM8K measure?
GSM8K (Grade School Math 8K) is a set of 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve. So far, 32 AI models have been tested on it, with scores ranging from 4.4 to 92.0 out of 100.
Which model leads on GSM8K?
GPT-4 (older v0314) from OpenAI leads GSM8K with a score of 92.0. The median score across 32 tested models is 57.7.
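The summary figures can be re-derived from the 32 scores in the rankings table. The sketch below uses Python's standard `statistics` module; the variable names are illustrative. With an even number of entries the two middle values are 57.7 and 57.8, so the reported median of 57.7 is consistent with taking the lower middle value. That reading is an assumption about how the site computes it, not something the page states.

```python
import statistics

# The 32 scores from the rankings table, in ranked order.
scores = [
    92.0, 91.3, 91.3, 91.1, 90.0, 86.7, 86.7, 84.9,
    84.2, 82.4, 82.4, 74.4, 69.6, 65.8, 61.3, 57.8,
    57.7, 54.4, 54.4, 53.8, 46.0, 44.9, 38.6, 36.9,
    35.4, 34.4, 24.6, 21.3, 20.6, 17.7, 9.2, 4.4,
]

top = max(scores)
bottom = min(scores)

# For an even count there are two middle values.
median_low = statistics.median_low(scores)    # lower of the two middle values
median_high = statistics.median_high(scores)  # higher of the two middle values
median_mean = statistics.median(scores)       # average of the two middle values

print(f"models: {len(scores)}, range: {bottom} to {top}")
print(f"median (low / high): {median_low} / {median_high}")
print(f"median (mean of middle pair): {median_mean:.2f}")
```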
Is GSM8K saturated?
No · the top score is 92.0 out of 100 (92%). There is still meaningful room for improvement on GSM8K.
Does GSM8K predict performance on other benchmarks?
Yes · GSM8K scores have a Pearson correlation of 0.98 with Chatbot Arena Elo · Overall across the 7 models scored on both. Models that do well on GSM8K tend to do well on Chatbot Arena Elo · Overall.
How often is GSM8K data refreshed?
BenchGecko pulls updates daily. New model scores on GSM8K appear as soon as they are published by Epoch AI or the model provider.
- Category: Math
- Max score: 100
- Models: 32
- Updated: 2025-04-15
Top on GSM8K
- GPT-4 (older v0314) · 92.0
- GPT-4o-mini · 91.3
- GPT-4o-mini (2024-07-18) · 91.3
- Qwen2.5 Coder 32B Instruct · 91.1
- GPT-4 Turbo · 90.0

More math benchmarks
Same category · related evaluations