Winogrande
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
The Frontier
Best score over time · one chart, every benchmark
Full rankings
38 models tested · sorted by score
| # | Model | Score |
|---|---|---|
| 1 | Llama 3.1 405B | 78.4 |
| 2 | Claude 3 Opus | 77.0 |
| 3 | GPT-4 (older v0314) | 75.0 |
| 4 | GPT-4 Turbo | 75.0 |
| 5 | Falcon-180B | 74.2 |
| 6 | | 72.6 |
| 7 | | 70.4 |
| 8 | | 67.0 |
| 9 | | 64.6 |
| 10 | | 63.2 |
| 11 | | 63.0 |
| 12 | | 63.0 |
| 13 | | 61.6 |
| 14 | | 56.6 |
| 15 | Nemotron-4 15B | 56.0 |
| 16 | | 54.4 |
| 17 | | 51.4 |
| 18 | | 50.6 |
| 19 | | 50.2 |
| 20 | | 48.4 |
| 21 | | 46.8 |
| 22 | | 46.0 |
| 23 | | 45.8 |
| 24 | | 45.6 |
| 25 | Yi 6B | 42.6 |
| 26 | MPT-30B | 42.0 |
| 27 | | 41.6 |
| 28 | INTELLECT-1 | 31.6 |
| 29 | | 30.8 |
| 30 | XGen-7B | 29.8 |
| 31 | StarCoder 2 15B | 28.6 |
| 32 | | 24.0 |
| 33 | Dolly 2.0-12b | 23.6 |
| 34 | | 21.6 |
| 35 | | 21.4 |
| 36 | | 15.2 |
| 37 | | 9.4 |
| 38 | | 6.6 |
Score distribution
Where models cluster
Correlated benchmarks
Pearson r · original research
Benchmarks that track with Winogrande
Pearson correlation across models scored on both benchmarks. The closer the value is to 1, the more strongly a model's score on one benchmark predicts its score on the other.
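A minimal sketch of how a correlation like this could be computed, assuming each benchmark's scores are held as a mapping from model name to score. The function name `pearson_on_shared` and the model names and scores in the usage example are hypothetical placeholders, not data from this page; only models present in both mappings contribute to the result.

```python
from math import sqrt


def pearson_on_shared(a: dict[str, float], b: dict[str, float]) -> tuple[float, int]:
    """Return (Pearson r, number of shared models) for two score tables."""
    # Keep only models that were scored on both benchmarks.
    shared = sorted(a.keys() & b.keys())
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")

    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of centered products and squares; the 1/n factors cancel in r.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")

    return cov / sqrt(var_x * var_y), n


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    benchmark_a = {"model-a": 70.0, "model-b": 55.0, "model-c": 30.0, "model-d": 12.0}
    benchmark_b = {"model-a": 80.0, "model-b": 61.0, "model-c": 42.0, "model-e": 90.0}
    r, n = pearson_on_shared(benchmark_a, benchmark_b)
    print(f"r = {r:.2f} across {n} shared models")
```

Reporting the shared-model count next to r matters because a correlation over a small overlap is much less stable than one over the full leaderboard.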
Frequently asked
About Winogrande
What does Winogrande measure?
WinoGrande is a large-scale commonsense reasoning benchmark in which models must resolve ambiguous pronouns in carefully constructed sentence pairs. 38 AI models have been tested on it, with scores ranging from 6.6 to 78.4 out of 100.
Which model leads on Winogrande?
Llama 3.1 405B from Meta leads Winogrande with a score of 78.4. The median score across 38 tested models is 49.3.
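The median quoted above can be checked against the 38 scores in the rankings table. The sketch below assumes the scores are copied by hand into a list (the variable name `scores` is illustrative); with an even count, the median is the mean of the 19th and 20th values, 50.2 and 48.4.

```python
from statistics import median

# The 38 scores from the rankings table, in rank order.
scores = [
    78.4, 77.0, 75.0, 75.0, 74.2, 72.6, 70.4, 67.0, 64.6, 63.2,
    63.0, 63.0, 61.6, 56.6, 56.0, 54.4, 51.4, 50.6, 50.2, 48.4,
    46.8, 46.0, 45.8, 45.6, 42.6, 42.0, 41.6, 31.6, 30.8, 29.8,
    28.6, 24.0, 23.6, 21.6, 21.4, 15.2, 9.4, 6.6,
]

# Rounded to one decimal place to match the precision used on the page.
print(len(scores), max(scores), min(scores), round(median(scores), 1))
# prints: 38 78.4 6.6 49.3
```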
Is Winogrande saturated?
No · the top score is 78.4 out of 100 (78%). There is still meaningful room for improvement on Winogrande.
Does Winogrande predict performance on other benchmarks?
Yes · Winogrande scores have a Pearson correlation of 0.94 with PIQA scores across the 15 models tested on both. Models that do well on Winogrande tend to do well on PIQA.
How often is Winogrande data refreshed?
BenchGecko pulls updates daily. New model scores on Winogrande appear as soon as they are published by Epoch AI or the model provider.
- Category: Reasoning
- Max score: 100
- Models: 38
- Updated: 2025-04-15
Top on Winogrande
- Llama 3.1 405B · 78.4
- Claude 3 Opus · 77.0
- GPT-4 (older v0314) · 75.0
- GPT-4 Turbo · 75.0
- Falcon-180B · 74.2
More reasoning benchmarks
Same category · related evaluations