Which model leads on GSM8K?

GPT-4 (older v0314) from OpenAI leads GSM8K with a score of 92.0. The median score across 32 tested models is 57.7.

GSM8K is not yet maxed out · the top score is 92.0 out of 100 (92%), so there is still room for improvement.

Does GSM8K predict performance on other benchmarks?

Yes · GSM8K scores correlate 0.98 with Chatbot Arena Elo · Overall across 7 shared models. Models that do well on GSM8K tend to do well on Chatbot Arena Elo · Overall.
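The page does not say which correlation statistic produces the 0.98 figure. A minimal sketch, assuming it is a plain Pearson correlation over the models scored on both benchmarks, could look like the following. The function name `pearson_r` and its inputs are illustrative, and the numbers in the usage line are toy values, not real leaderboard data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: a perfectly linear relationship gives a correlation of 1.
toy_r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

To reproduce the page's figure, the two lists would hold the GSM8K score and the Chatbot Arena Elo of each shared model, in the same model order.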

How often is GSM8K data refreshed?

BenchGecko pulls updates daily, so new model scores on GSM8K appear within a day of being published by Epoch AI or the model provider.

Benchmark · Math · Settled

GSM8K

Name: GSM8K Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.

Updated 2025-04-15

Models tested: 32
Top score: 92.0 · GPT-4 (older v0314)
Median: 57.7 · min 4.4
Top-5 spread: σ 0.7 · Settled
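The summary figures can be recomputed from the 32 scores in the full rankings. The sketch below assumes the top-5 spread is the sample standard deviation of the five highest scores, which is not stated on the page but rounds to the reported σ 0.7. The conventional median of 32 values averages the two middle scores (57.8 and 57.7); how the page rounds that average for display is not stated.

```python
import statistics

# GSM8K scores copied from the rankings table, highest first.
scores = [
    92.0, 91.3, 91.3, 91.1, 90.0, 86.7, 86.7, 84.9,
    84.2, 82.4, 82.4, 74.4, 69.6, 65.8, 61.3, 57.8,
    57.7, 54.4, 54.4, 53.8, 46.0, 44.9, 38.6, 36.9,
    35.4, 34.4, 24.6, 21.3, 20.6, 17.7, 9.2, 4.4,
]

top5 = sorted(scores, reverse=True)[:5]

summary = {
    "models_tested": len(scores),
    "top_score": max(scores),
    "median": statistics.median(scores),  # mean of the two middle scores
    "min": min(scores),
    # Assumption: spread = sample standard deviation of the top five scores.
    "top5_spread": statistics.stdev(top5),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```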

The Frontier

Best score over time · one chart, every benchmark


Only 8 models have been tested on GSM8K · not enough history to compute a frontier yet.

Pink dots = frontier records · 0 total
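One plausible reading of a "frontier record" is a result that beats every earlier score on the benchmark. The sketch below follows that reading. It is an assumption about how such a chart could be built, not BenchGecko's actual method, and since the page lists no result dates for GSM8K, the usage example uses made-up placeholder entries.

```python
from datetime import date


def frontier_records(results):
    """Return the (date, model, score) entries that set a new best score.

    `results` is an iterable of (date, model_name, score) tuples.
    """
    best = float("-inf")
    records = []
    # Walk the results in chronological order, keeping a running maximum.
    for released, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            records.append((released, model, score))
    return records


# Placeholder entries for illustration only, not real benchmark results.
toy_results = [
    (date(2023, 1, 1), "model-a", 40.0),
    (date(2023, 6, 1), "model-b", 35.0),
    (date(2024, 1, 1), "model-c", 55.0),
]
toy_frontier = frontier_records(toy_results)
```

Here `model-b` is skipped because it scores below the earlier `model-a`, so only `model-a` and `model-c` count as records.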

Full rankings

32 models tested · sorted by score

#	Model · Provider	Score	Price
1	GPT-4 (older v0314) · OpenAI	92.0	$30.00
2	GPT-4o-mini · OpenAI	91.3	$0.15
3	GPT-4o-mini (2024-07-18) · OpenAI	91.3	$0.15
4	Qwen2.5 Coder 32B Instruct · Alibaba Qwen	91.1	$0.66
5	GPT-4 Turbo · OpenAI	90.0	$10.00
6	Claude Instant · Anthropic	86.7	—
7	Qwen2.5 Coder 7B Instruct · Alibaba Qwen	86.7	$0.03
8	Gemma 2 9B · Google DeepMind	84.9	$0.03
9	Mistral Nemo · Mistral AI	84.2	$0.02
10	Gemini 1.5 Flash (May 2024) · Google DeepMind	82.4	—
11	Llama 3.1 8B Instruct · Meta	82.4	$0.02
12	Mixtral 8x7B Instruct · Mistral AI	74.4	$0.54
13	Stable Beluga 2 · Unknown	69.6	—
14	Qwen2.5 Coder 1.5B Instruct · Alibaba	65.8	—
15	Qwen-14B · Alibaba Qwen	61.3	—
16	GPT-3.5 Turbo (older v0613) · OpenAI	57.8	$1.00
17	StarCoder 2 15B · Unknown	57.7	—
18	Falcon-180B · TII	54.4	—
19	Mistral 7B V0.1 · Mistral AI	54.4	—
20	Falcon 2 11B · TII	53.8	—
21	Nemotron-4 15B · Unknown	46.0	—
22	Yi 6B · Unknown	44.9	—
23	INTELLECT-1 · Unknown	38.6	—
24	Llama 2-13B · Meta	36.9	—
25	DeepSeek Coder 33B · DeepSeek	35.4	—
26	MPT-30B · Unknown	34.4	—
27	Baichuan 2-7B · Unknown	24.6	—
28	DeepSeek Coder 6.7B · DeepSeek	21.3	—
29	LLaMA-13B · Meta	20.6	—
30	Gemma 2B · Google DeepMind	17.7	—
31	Baichuan1-7B · Unknown	9.2	—
32	DeepSeek Coder 1.3B · DeepSeek	4.4	—
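To work with the rankings programmatically, each row can be split into fields. The sketch below assumes the columns are tab-separated, that "·" divides model from provider, and that "—" marks a missing price. The `parse_row` helper is illustrative, and the price is kept as a bare dollar amount because the page does not state its unit.

```python
def parse_row(line):
    """Parse one tab-separated ranking row into a dict."""
    rank, model_cell, score, price = line.rstrip("\n").split("\t")
    # The model cell looks like "Model name · Provider".
    model, _, provider = model_cell.partition("·")
    price = price.strip()
    return {
        "rank": int(rank),
        "model": model.strip(),
        "provider": provider.strip(),
        "score": float(score),
        # "—" means no price is listed for this model.
        "price_usd": None if price == "—" else float(price.lstrip("$")),
    }


# Two rows taken from the table above.
first = parse_row("1\tGPT-4 (older v0314)· OpenAI\t92.0\t$30.00")
unpriced = parse_row("6\tClaude Instant· Anthropic\t86.7\t—")
```

Rows parsed this way can then be filtered, for example to keep only models that have a listed price.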