Which model leads on Winogrande?

Llama 3.1 405B from Meta leads Winogrande with a score of 78.4. The median score across 38 tested models is 49.3.

Is Winogrande saturated?

No · the top score is 78.4 out of 100 (78%). There is still meaningful room for improvement on Winogrande.

Winogrande scores correlate 0.94 with PIQA across 15 shared models, so models that do well on Winogrande tend to do well on PIQA.

BenchGecko pulls updates daily. New model scores on Winogrande appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Reasoning · Settled

Name: Winogrande Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.

Updated 2025-04-15

Models tested · 38

Top score · 78.4 (Llama 3.1 405B)

Median · 49.3 (min 6.6)

Top-5 spread · σ 1.5 (Settled)

Best score over time

Only 6 models have been tested on Winogrande · not enough history to compute a frontier yet. Frontier records · 1 total.

38 models tested · sorted by score

#	Model	Score	Price
1	Llama 3.1 405B· Meta	78.4	—
2	Claude 3 Opus· Anthropic	77.0	—
3	GPT-4 (older v0314)· OpenAI	75.0	$30.00
4	GPT-4 Turbo· OpenAI	75.0	$10.00
5	Falcon-180B· TII	74.2	—
6	DeepSeek-V2 (MoE-236B, May 2024)· DeepSeek	72.6	—
7	DeepSeek V3· DeepSeek	70.4	$0.32
8	Llama 3 70B Instruct· Meta	67.0	$0.51
9	Qwen2.5 72B Instruct· Alibaba Qwen	64.6	$0.36
10	GPT-3.5 Turbo (older v0613)· OpenAI	63.2	$1.00
11	phi-3-medium 14B· Microsoft	63.0	—
12	phi-3-small 7.4B· Microsoft	63.0	—
13	Qwen2.5 Coder 32B Instruct· Alibaba Qwen	61.6	$0.66
14	Falcon 2 11B· TII	56.6	—
15	Nemotron-4 15B· Unknown	56.0	—
16	Mixtral 8x7B Instruct· Mistral AI	54.4	$0.54
17	Llama 3 8B Instruct· Meta	51.4	$0.03
18	Mistral 7B V0.1· Mistral AI	50.6	—
19	Claude 3 Sonnet· Anthropic	50.2	—
20	Claude 3 Haiku· Anthropic	48.4	$0.25
21	Phi-1.5· Microsoft	46.8	—
22	LLaMA-13B· Meta	46.0	—
23	Qwen2.5 Coder 7B Instruct· Alibaba Qwen	45.8	$0.03
24	Llama 2-13B· Meta	45.6	—
25	Yi 6B· Unknown	42.6	—
26	MPT-30B· Unknown	42.0	—
27	phi-3-mini 3.8B· Microsoft	41.6	—
28	INTELLECT-1· Unknown	31.6	—
29	Gemma 2B· Google DeepMind	30.8	—
30	XGen-7B· Unknown	29.8	—
31	StarCoder 2 15B· Unknown	28.6	—
32	DeepSeek Coder 33B· DeepSeek	24.0	—
33	Dolly 2.0-12b· Unknown	23.6	—
34	Cerebras-GPT-13B· Cerebras	21.6	—
35	Qwen2.5 Coder 1.5B Instruct· Alibaba Qwen	21.4	—
36	DeepSeek Coder 6.7B· DeepSeek	15.2	—
37	Phi 2· Microsoft	9.4	—
38	DeepSeek Coder 1.3B· DeepSeek	6.6	—
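
The summary figures quoted on this page (top score, median, min, top-5 spread) can be re-derived from the table above. The following is a minimal sketch, not BenchGecko's own code: the scores are copied from the table, and it assumes the "σ" in the top-5 spread is the population standard deviation of the five highest scores. That assumption gives roughly 1.55, in line with the displayed 1.5, whereas the sample standard deviation would give roughly 1.73.

```python
from statistics import median, pstdev

# Scores copied from the leaderboard table above, in rank order (38 models).
scores = [
    78.4, 77.0, 75.0, 75.0, 74.2, 72.6, 70.4, 67.0, 64.6, 63.2,
    63.0, 63.0, 61.6, 56.6, 56.0, 54.4, 51.4, 50.6, 50.2, 48.4,
    46.8, 46.0, 45.8, 45.6, 42.6, 42.0, 41.6, 31.6, 30.8, 29.8,
    28.6, 24.0, 23.6, 21.6, 21.4, 15.2, 9.4, 6.6,
]

models_tested = len(scores)
top_score = max(scores)
min_score = min(scores)

# With an even number of models, the median is the mean of ranks 19 and 20.
median_score = median(scores)

# Assumption: "top-5 spread" is the population standard deviation
# of the five highest scores (the page does not define it).
top5 = sorted(scores, reverse=True)[:5]
top5_spread = pstdev(top5)

print(f"models tested: {models_tested}")
print(f"top score:     {top_score}")
print(f"median:        {median_score:.1f}")
print(f"min:           {min_score}")
print(f"top-5 spread:  {top5_spread:.2f}")
```

A small spread among the top five is what the "Settled" label appears to reflect: the leading models sit within a few points of each other.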