Benchmark · Reasoning · Competitive

HellaSwag

HellaSwag tests commonsense reasoning by asking models to pick the most plausible continuation of an everyday scenario.
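As a rough illustration of how a score on this kind of task could be computed, the sketch below treats each item as a context plus several candidate endings, picks the ending a scorer rates highest, and reports accuracy on a 0 to 100 scale. The item layout, the function names, and the word-overlap `toy_score` are assumptions made for this example; a real evaluation would typically replace the scorer with a model's likelihood for each ending.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def pick_ending(context: str, endings: Sequence[str], score: Scorer) -> int:
    """Return the index of the ending the scorer rates most plausible."""
    scores = [score(context, ending) for ending in endings]
    # On ties, max() keeps the first (lowest) index.
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy_percent(items: Sequence[dict], score: Scorer) -> float:
    """Percentage of items where the top-rated ending matches the label."""
    correct = sum(
        pick_ending(item["context"], item["endings"], score) == item["label"]
        for item in items
    )
    return 100.0 * correct / len(items)


def toy_score(context: str, ending: str) -> float:
    """Hypothetical stand-in for a model: counts words shared with the context."""
    return float(len(set(context.lower().split()) & set(ending.lower().split())))


# Two made-up items in an assumed format (not taken from the real dataset).
ITEMS = [
    {
        "context": "She cracks an egg into the pan and",
        "endings": [
            "stirs the egg as it cooks",
            "throws a ball to the dog",
            "paints the fence white",
            "rides away on a bicycle",
        ],
        "label": 0,
    },
    {
        "context": "A man ties his running shoes and",
        "endings": [
            "opens an umbrella indoors",
            "starts running down the street",
            "pours milk on the table",
            "reads the menu",
        ],
        "label": 1,
    },
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy_percent(ITEMS, toy_score):.1f}")
```

Passing the scorer in as a function keeps the accuracy logic independent of any particular model, so the same loop works for a toy heuristic or a real likelihood-based scorer.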

Updated: 2025-04-15
Models tested: 29
Top score: 93.7 (GPT-4 Turbo)
Median: 72.3 (min 30.1)
Top-5 spread: σ 3.7 (Competitive)

Best score over time · one chart, every benchmark

[Chart: HellaSwag · 6 models · frontier running max. Y-axis: score, 0 to 100. X-axis: release date, Jul 24 to Apr 25. Source: benchgecko.ai/benchmark/hellaswag]
Only 6 models have been tested on HellaSwag, which is not enough history to compute a frontier yet.
Frontier records: 0 total
Details
Category: Reasoning
Max score: 100
Models: 29
Updated: 2025-04-15
