Which model leads on HellaSwag?

GPT-4 Turbo from OpenAI leads HellaSwag with a score of 93.7. The median score across 29 tested models is 72.3.

Is HellaSwag saturated?

No · the top score is 93.7 out of 100 (94%). There is still meaningful room for improvement on HellaSwag.

HellaSwag scores correlate 0.93 with PIQA across 17 shared models, so models that do well on HellaSwag tend to do well on PIQA.
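A correlation figure like the one above can be checked with a few lines of code. The sketch below assumes the reported value is a Pearson coefficient, which the page does not state. The function name `pearson` is hypothetical, and the toy inputs are illustrative only, not real benchmark scores.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Toy values, NOT real scores: a perfectly linear relationship.
toy_a = [1.0, 2.0, 3.0, 4.0]
toy_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(toy_a, toy_b)
```

To reproduce the published figure, the two lists would hold the HellaSwag and PIQA scores of the same 17 models, in the same order.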

BenchGecko pulls updates daily. New model scores on HellaSwag appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Reasoning · Competitive

Name: HellaSwag Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

HellaSwag tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.

Updated 2025-04-15

Models tested: 29
Top score: 93.7 (GPT-4 Turbo)
Median: 72.3 (min 30.1)
Top-5 spread: σ 3.7 (Competitive)
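These summary figures can be reproduced from the 29 scores in the leaderboard table on this page. The sketch below assumes the median is the plain middle value and that the top-5 spread is the population standard deviation of the five highest scores. The page does not say which formula it uses, but this choice is consistent with the displayed σ 3.7.

```python
import statistics

# Scores copied from the leaderboard table on this page, highest first.
scores = [
    93.7, 85.6, 85.3, 85.2, 82.8, 82.3, 79.7, 78.8, 77.3, 77.2,
    76.5, 76.5, 74.7, 74.3, 72.3, 69.3, 69.1, 68.9, 68.5, 65.9,
    65.6, 61.9, 61.9, 61.1, 57.3, 49.1, 45.9, 38.1, 30.1,
]

top_score = max(scores)
median_score = statistics.median(scores)
min_score = min(scores)

# Assumption: "top-5 spread" is the population standard deviation
# (divide by n, not n - 1) of the five best scores.
top5 = sorted(scores, reverse=True)[:5]
top5_spread = statistics.pstdev(top5)

print(len(scores), top_score, median_score, min_score, round(top5_spread, 1))
# prints: 29 93.7 72.3 30.1 3.7
```

Using the sample standard deviation (`statistics.stdev`) instead would give roughly 4.2, which does not match the page.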

Best score over time

Only 6 models have been tested on HellaSwag · not enough history to compute a frontier yet. Frontier records: 0 total.

29 models tested · sorted by score

#	Model	Score	Price
1	GPT-4 Turbo · OpenAI	93.7	$10.00
2	Llama 3.1 405B · Meta	85.6	—
3	Falcon-180B · TII	85.3	—
4	DeepSeek V3 · DeepSeek	85.2	$0.32
5	DeepSeek-V2 (MoE-236B, May 2024) · DeepSeek	82.8	—
6	Mixtral 8x7B Instruct · Mistral AI	82.3	$0.54
7	Qwen2.5 72B Instruct · Alibaba Qwen	79.7	$0.36
8	Stable Beluga 2 · Unknown	78.8	—
9	Qwen2.5 Coder 32B Instruct · Alibaba Qwen	77.3	$0.66
10	Falcon 2 11B · TII	77.2	—
11	Nemotron-4 15B · Unknown	76.5	—
12	phi-3-medium 14B · Microsoft	76.5	—
13	Mistral 7B V0.1 · Mistral AI	74.7	—
14	Llama 2-13B · Meta	74.3	—
15	LLaMA-13B · Meta	72.3	—
16	phi-3-small 7.4B · Microsoft	69.3	—
17	Qwen2.5 Coder 7B Instruct · Alibaba Qwen	69.1	$0.03
18	phi-3-mini 3.8B · Microsoft	68.9	—
19	MPT-30B · Unknown	68.5	—
20	Yi 6B · Unknown	65.9	—
21	XGen-7B · Unknown	65.6	—
22	INTELLECT-1 · Unknown	61.9	—
23	Gemma 2B · Google DeepMind	61.9	—
24	Dolly 2.0-12b · Unknown	61.1	—
25	Baichuan 2-7B · Unknown	57.3	—
26	Qwen2.5 Coder 1.5B Instruct · Alibaba Qwen	49.1	—
27	Cerebras-GPT-13B · Cerebras	45.9	—
28	Phi 2 · Microsoft	38.1	—
29	Phi-1.5 · Microsoft	30.1	—
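For anyone processing this table programmatically, the rows are tab-separated with the model and provider joined by a "·". The following is a minimal parsing sketch, not an official BenchGecko client: `Row` and `parse_row` are hypothetical names, whitespace around the separator is stripped, and the price is kept as a bare number because the page does not state its unit.

```python
from typing import NamedTuple, Optional

class Row(NamedTuple):
    rank: int
    model: str
    provider: str
    score: float
    price: Optional[float]  # None when the table lists no price

def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row."""
    rank, name, score, price = line.split("\t")
    # Split on the last "·" so the provider is whatever follows it.
    model, _, provider = name.rpartition("·")
    return Row(
        rank=int(rank),
        model=model.strip(),
        provider=provider.strip(),
        score=float(score),
        price=float(price.lstrip("$")) if price.startswith("$") else None,
    )

# Three rows copied from the table above.
sample = [
    "1\tGPT-4 Turbo· OpenAI\t93.7\t$10.00",
    "2\tLlama 3.1 405B· Meta\t85.6\t—",
    "4\tDeepSeek V3· DeepSeek\t85.2\t$0.32",
]
rows = [parse_row(line) for line in sample]

# Keep only models with a listed price, then pick the best scorer.
priced = [r for r in rows if r.price is not None]
best_priced = max(priced, key=lambda r: r.score)
```

Treating a missing price as `None` rather than `0.0` keeps unpriced models from being mistaken for free ones when filtering or sorting.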