Benchmark · Math · Settled

GSM8K

Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.

Updated 2025-04-15
Models tested: 32
Top score: 92.0 (GPT-4, older v0314)
Median: 57.7 (min 4.4)
Top-5 spread: σ 0.7 (Settled)

Best score over time · one chart, every benchmark

[Chart: GSM8K · 8 models · frontier running max. Y-axis: score (0 to 100). X-axis: release date (Jun 24 to Apr 25). Source: benchgecko.ai/benchmark/gsm8k · frontier]
Only 8 models have been tested on GSM8K · not enough history to compute a frontier yet.
Pink dots = frontier records · 0 total
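
A "frontier running max" of this kind can be sketched as follows. This is a minimal illustration, assuming a frontier record is any model whose score strictly exceeds every earlier-released model's score. The function name `running_max_frontier`, the tuple layout, and the sample scores are hypothetical and are not taken from the leaderboard.

```python
from datetime import date


def running_max_frontier(results):
    """Return the entries that set a new best score, in release order.

    `results` is an iterable of (release_date, model_name, score) tuples.
    Assumption: a record requires a strictly higher score than all
    earlier releases; a tie does not count as a new record.
    """
    frontier = []
    best = float("-inf")
    # sorted() is stable, so models sharing a release date keep input order.
    for released, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            frontier.append((released, model, score))
    return frontier


# Hypothetical scores, for illustration only (not leaderboard data).
sample = [
    (date(2024, 6, 1), "model-a", 40.0),
    (date(2024, 9, 1), "model-b", 35.0),
    (date(2024, 11, 1), "model-c", 62.5),
    (date(2025, 2, 1), "model-d", 62.5),
    (date(2025, 4, 1), "model-e", 80.0),
]

records = running_max_frontier(sample)
for released, model, score in records:
    print(released.isoformat(), model, score)
```

In this sketch, "model-b" and "model-d" are skipped because neither beats the best score seen before it, which is why a chart of this kind can show fewer frontier dots than tested models.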

Same category · related evaluations