Benchmark · Reasoning · Settled

Winogrande

WinoGrande · a large-scale commonsense reasoning benchmark in which models must resolve ambiguous pronouns in carefully constructed sentence pairs.

Updated: 2025-04-15
Models tested: 38
Top score: 78.4 (Llama 3.1 405B)
Median: 49.3 (min 6.6)
Top-5 spread: σ 1.5 (Settled)

Best score over time · one chart, every benchmark

[Chart: Winogrande · 6 models · frontier (running max). Y-axis: score, 0 to 100, higher is better. X-axis: release date, Jul 24 to Apr 25. Source: benchgecko.ai/benchmark/winogrande]
Only 6 models have been tested on Winogrande · not enough history to compute a frontier yet.
Frontier records: 1 total
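
The "frontier running max" in the chart can be read as the sequence of models that each set a new best score when results are ordered by release date. The page does not document how it computes this, so the following is only a minimal sketch under that assumption. The function name `running_max_frontier` and the sample entries are hypothetical placeholders, not actual leaderboard data.

```python
from datetime import date


def running_max_frontier(results):
    """Return the entries that set a new best score, in release-date order.

    `results` is an iterable of (release_date, model_name, score) tuples.
    Assumption: a "frontier record" is any score strictly above all earlier ones.
    """
    frontier = []
    best = float("-inf")
    for release_date, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            frontier.append((release_date, model, score))
    return frontier


# Hypothetical placeholder data, NOT real Winogrande results.
sample = [
    (date(2024, 7, 1), "model-a", 60.0),
    (date(2024, 9, 1), "model-b", 55.0),
    (date(2024, 11, 1), "model-c", 72.5),
]
records = running_max_frontier(sample)
```

With this definition, a model that scores below the current best (here the placeholder `model-b`) is plotted but does not count as a frontier record.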
Details
Category: Reasoning
Max score: 100
Models: 38
Updated: 2025-04-15
